System Architecture

Overview

From a user’s perspective, the PATRIC systems are accessed primarily through the PATRIC website using a standard modern web browser. The PATRIC website server software is designed to be hosted by industry-standard application containers and can be deployed in a number of different configurations. Both the server software and the client interface (the browser application) rely on a number of additional systems to build, deploy, and provide querying and analysis services to these applications.

Direct support for the website application is provided by a number of different databases and services. These services typically support the interactive capabilities of the website and its users. For example, Solr database instances provide all of the scientific querying capabilities against the PATRIC data. The PATRIC website and application aggregate the data and capabilities of the PATRIC services and present them interactively to the user.

Several other key components are critical to the PATRIC project but do not directly support the PATRIC website itself in production. These include the data analysis services, software, and databases used to collect, analyze, and annotate PATRIC data prior to release and deployment to the production PATRIC website. Software and scripts are also required to manage the data and services and to load and extract database content.

Software Architecture

The Software Architecture section of this document describes the general use and interaction of the components that make up the PATRIC website and its direct and indirect components. Some components of the architecture are third-party components and their architectures and deployment will not be detailed here except where relevant to the understanding of the overall architecture described by the PATRIC Systems Documentation.

PATRIC Website

The Browser Application

The user’s web browser is the host of the entire PATRIC website, which is more accurately described as a Web Application. Logically, the set of pages that the PATRIC Web Servers provide makes up the entirety of the PATRIC Web Application. A user’s state is maintained across all of these pages at any one time, and from the user’s perspective they are navigating through the interactive space of PATRIC. Each page can be considered to host one or more individual applications that communicate with the web server to provide an interactive experience.

The browser application is written in ECMAScript (JavaScript, JS) with DojoJS. It communicates with the server via HTTP requests (AJAX). The browser application is part of the PATRIC Server application and is intermingled with other content on pages generated by the PATRIC Server. However, the browser application runs in the user’s web browser, on a different network endpoint from the server; it may be restarted at any time (whenever a page reloads) and may include “mashup” data from external sites. It therefore requires consideration independent of the server side of the PATRIC website.

The browser application fully supports common modern web browsers, with specific UI functionality degrading when the underlying browser does not support it. Rather than requiring all users to use a specific set of browsers, we prefer to provide the best support possible for modern browsers and to support older browsers through fallback mechanisms or reduced functionality. Browsers currently known to work are Chrome, Firefox, Safari, and IE7+. Some applications (pages) may require Flash for full functionality.

Source Code: https://github.com/PATRIC3/p3_web

Web Application Server

This component serves the web content to client browsers. It currently consists of an ExpressJS application running in a NodeJS web server. It serves HTML, CSS, JavaScript, and images to client browsers. The bulk of the user interface is implemented in the Browser Application, which itself is built upon the Dojo JavaScript library.

Source Code: https://github.com/PATRIC3/p3_web

Static Content

Static content refers to electronic documents, including website use cases and tutorials, command-line interface use cases and tutorials, user guides, and PATRIC news. These documents are served independently of the main web server software and are publicly accessible. The static content site provides an RSS feed, which the main website application consumes and displays on its front page. Files are converted to HTML using the Python-based Sphinx documentation generator. The source files are stored in the PATRIC GitHub repository.
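As an illustration of how a client such as the main website might consume that feed, the sketch below parses an RSS 2.0 document with Python’s standard library. The sample feed, its titles, and the example.org links are invented for the example; the location and fields of the real feed are not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; the content of the real news feed will differ.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>News (sample)</title>
    <item>
      <title>Example release announcement</title>
      <link>https://example.org/news/1</link>
    </item>
    <item>
      <title>Example tutorial added</title>
      <link>https://example.org/news/2</link>
    </item>
  </channel>
</rss>"""


def latest_items(feed_xml, limit=5):
    """Return up to `limit` (title, link) pairs in feed order."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.findall("./channel/item")[:limit]:
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items
```

A front page would typically call latest_items() on the downloaded feed text and render the returned pairs as a short list of links.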

Source Code: https://github.com/PATRIC3/p3_docs

Workspace

The Workspace is an online document-based data store where data is organized into user-owned directories, analogous to Dropbox or Google Drive. Any top-level directory may be shared with multiple users to enable collaborative work on uploaded data.

Source Code: https://github.com/PATRIC3/Workspace

Workspace API:

The Workspace is connected to the rest of the PATRIC tools and website via a programmatic JSON RPC API.

The API has 11 commands:

  • create: allows for the creation of a directory or a data object itself

  • get: allows for retrieval of an object from the workspace

  • ls: list the objects present in a particular directory of the workspace

  • copy: copy an object from one location to another

  • delete: delete an object

  • set_permissions: set permissions on a top-level directory to share with another user

  • list_permissions: list permissions currently set for a top-level directory

  • get_download_url: allows for retrieval of a RESTful URL to download an object

  • get_archive_url: allows for retrieval of a RESTful URL to download an archive of multiple objects

  • update_metadata: allows for the manipulation of metadata associated with an object or directory in the Workspace

  • update_auto_meta: an internal function enabling the update of automated-metadata for an object

The associated resource is: https://p3.theseed.org/services/Workspace
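To make the shape of a JSON RPC call concrete, the following sketch builds a request body for the ls command. Only the command names and the service URL come from this document. The "Workspace.<command>" method naming, the "version" member, the one-element params list, and the example path are assumptions made for illustration and should be checked against the service specification.

```python
import itertools
import json

# Service URL as given in this document.
WORKSPACE_URL = "https://p3.theseed.org/services/Workspace"

# Each request carries its own id so that a response can be matched to it.
_request_ids = itertools.count(1)


def build_rpc_request(command, params):
    """Build a JSON RPC request body for one Workspace command.

    ASSUMPTION: the method naming, the "version" member, and the
    one-element params list are illustrative, not taken from the
    service specification.
    """
    return json.dumps({
        "id": next(_request_ids),
        "method": "Workspace." + command,
        "params": [params],
        "version": "1.1",
    })


# Hypothetical arguments for the ls command.
body = build_rpc_request("ls", {"paths": ["/some_user/home"]})
```

The resulting body would be sent as an HTTP POST to WORKSPACE_URL; the sketch stops short of the network call so that it runs without access to the service.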

Data formats:

Objects of any type may be stored in the workspace, but most typically objects are simple text files, often stored in JSON format. Additionally, all objects are assigned a type (e.g., Genome, Model, FeatureSet), and this type indicates how the object is treated when viewed on the PATRIC website, as well as the handling of the object by automated processing scripts built into the workspace. The types accepted by the workspace are configurable and completely extensible.

Database structure:

The workspace uses MongoDB to store the directory structure, directory permissions, object lists, and object metadata. The objects themselves are stored either in Shock (typically for very large objects) or in a simple file-system. Because of its connection to Shock, the workspace supports federated data storage, which enables the handling of big data.
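The split between metadata and object storage can be sketched as follows. This is a minimal illustration, not the Workspace implementation: the size cutoff, the record fields, and the function name are hypothetical, and the text above says only that Shock is typically used for very large objects.

```python
from dataclasses import dataclass

# ASSUMPTION: a hypothetical size cutoff chosen for the example.
LARGE_OBJECT_BYTES = 100 * 1024 * 1024


@dataclass
class ObjectRecord:
    """Metadata of the kind that would be kept in MongoDB."""
    path: str
    type: str
    size: int
    backend: str  # "shock" or "filesystem"


def plan_storage(path, obj_type, size):
    """Decide where the object body goes and describe it in a record."""
    backend = "shock" if size >= LARGE_OBJECT_BYTES else "filesystem"
    return ObjectRecord(path=path, type=obj_type, size=size, backend=backend)
```

Keeping the backend name in the metadata record lets a reader of the record locate the object body without knowing the rule that placed it.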

Object processing:

When an object is saved to the workspace, it always undergoes a processing step, the specific actions of which depend on the type of the object. This step computes automated metadata for the object to facilitate object query and summary, but it can also handle other tasks as needed (e.g., indexing in Solr).
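One common way to implement a type-dependent processing step is a registry of hooks keyed by object type. The sketch below follows that pattern under stated assumptions: the registry, the hook names, and the JSON layout of a FeatureSet (an "ids" list) are hypothetical and are not taken from the Workspace source code.

```python
import json

# Registry mapping an object type to its processing hook.
PROCESSORS = {}


def processor(obj_type):
    """Decorator that registers a hook for one object type."""
    def register(func):
        PROCESSORS[obj_type] = func
        return func
    return register


@processor("FeatureSet")
def summarize_feature_set(data):
    # ASSUMPTION: a FeatureSet is JSON with an "ids" list (hypothetical).
    return {"feature_count": len(json.loads(data)["ids"])}


def default_processor(data):
    """Types without a registered hook get no extra metadata."""
    return {}


def save_object(path, obj_type, data):
    """Run the type-specific hook, then return the record to store."""
    auto_meta = {"size": len(data.encode("utf-8"))}
    hook = PROCESSORS.get(obj_type, default_processor)
    auto_meta.update(hook(data))
    return {"path": path, "type": obj_type, "auto_meta": auto_meta}
```

Because new types only need to register a hook, this arrangement is compatible with the configurable and extensible set of types described earlier.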

Download service:

To support transparent and efficient downloading of data files from the workspace, the Download Service allows the PATRIC website to provide URL-based access to private files in the workspace. Access to these URLs does not require a password; to ensure privacy, the URLs contain un-guessable hashes and are valid only for a short time.
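The idea of an un-guessable, short-lived download URL can be sketched with a table of random tokens that expire. This illustrates the general technique only: it uses random tokens rather than hashes, and the lifetime, class name, and in-memory table are assumptions, not details of the Download Service.

```python
import secrets
import time

TOKEN_LIFETIME_SECONDS = 300  # ASSUMPTION: hypothetical validity window


class DownloadTokens:
    """In-memory table of short-lived, unguessable download tokens."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._tokens = {}  # token -> (workspace path, expiry timestamp)

    def issue(self, path):
        """Create a token for `path` that expires after the lifetime."""
        token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe
        expires_at = self._clock() + TOKEN_LIFETIME_SECONDS
        self._tokens[token] = (path, expires_at)
        return token

    def resolve(self, token):
        """Return the path for a valid token, or None if unknown or expired."""
        entry = self._tokens.get(token)
        if entry is None:
            return None
        path, expires_at = entry
        if self._clock() >= expires_at:
            del self._tokens[token]
            return None
        return path
```

The clock is injectable so that expiry can be exercised without waiting; a download URL would embed the token in its path or query string.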

Data API

The data API provides querying, retrieval, and indexing of public PATRIC data and of private annotated data. The API provides a REST interface to the rich data PATRIC provides. Data can be retrieved directly by ID, or it can be queried using either Resource Query Language (RQL) syntax or Solr syntax. As queries are submitted to the API, they are modified and then submitted to the backend data sources (Solr) so that only data visible to the user is retrieved. Users can view public data, any data they own, and any data that another user has shared with them.
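The query modification described above can be pictured as adding a visibility filter to each Solr request. The sketch below is one possible form under stated assumptions: the field names public, owner, and user_read, and the example query, are hypothetical stand-ins rather than the names used in the PATRIC indexes.

```python
def _quote(value):
    """Quote a value as a Solr phrase, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def visibility_filter(user=None):
    """Build a Solr filter for what `user` may see (None = anonymous).

    ASSUMPTION: the field names public, owner, and user_read are
    hypothetical.
    """
    clauses = ["public:true"]
    if user is not None:
        clauses.append("owner:" + _quote(user))
        clauses.append("user_read:" + _quote(user))
    return " OR ".join(clauses)


def restrict_query(params, user=None):
    """Return a copy of the Solr params with the visibility filter added."""
    restricted = dict(params)
    filters = list(restricted.get("fq", []))
    filters.append(visibility_filter(user))
    restricted["fq"] = filters
    return restricted
```

Placing the restriction in a filter query (fq) rather than in q leaves the user’s own query untouched, since Solr intersects every fq with the main query.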

Source Code: https://github.com/PATRIC3/p3_api

Data API:

The data API has two functions for each data type:

  • get()

  • query()

The associated resources are, respectively:

In addition to the API for querying and retrieving data, there is also an API endpoint for submitting new data to the system to be indexed in the database.

Command-line Interface (CLI)

PATRIC is an integration of different types of data and software tools that support research on bacterial pathogens. The typical biologist seeking access to the PATRIC data and tools will usually explore the web-based user interface. However, there are many instances in which programmatic or command-line interfaces are more suitable, especially for querying data or submitting jobs in batch mode. For users who want command-line access to PATRIC, we provide the tools described in this document. We call these tools the P3-scripts. They are intended to run on your machine and access the services provided by PATRIC over the network.

Source Code and Client Application: https://github.com/PATRIC3/PATRIC-distribution/

Currently, the following commands are available to the community:

p3-abstract-clusters

p3-get-feature-sequence

p3-put-genome-group

p3-aggregate-sss

p3-get-features-by-sequence

p3-rast

p3-aggregates-to-html

p3-get-genome-contigs

p3-related-by-clusters

p3-all-drugs

p3-get-genome-data

p3-rep-prots

p3-all-genomes

p3-get-genome-drugs

p3-rm

p3-blast

p3-get-genome-expression

p3-rmdir

p3-build-kmer-db

p3-get-genome-features

p3-role-matrix

p3-closest-seqs

p3-get-genome-group

p3-sequence-profile

p3-co-occur

p3-gto

p3-set-to-relation

p3-collate

p3-gto-dna

p3-signature-clusters

p3-config

p3-gto-fasta

p3-signature-families

p3-count

p3-gto-scan

p3-signature-peginfo

p3-count-families

p3-head

p3-similar-proteins-by-blast

p3-cp

p3-identical-dna

p3-similar-proteins-by-family

p3-drug-amr-data

p3-identical-proteins

p3-sort

p3-echo

p3-identify-clusters

p3-stats

p3-extract

p3-inAandB

p3-submit-genome-annotation

p3-extract-gto

p3-inAnotB

p3-submit-genome-assembly

p3-feature-gap

p3-inAorB

p3-tbl-to-fasta

p3-feature-upstream

p3-job-status

p3-tbl-to-html

p3-file-filter

p3-join

p3-tests

p3-find-couples

p3-kmer-compare

p3-whoami

p3-find-features

p3-list-feature-groups

p3-find-in-clusters

p3-list-genome-groups

p3-format-results

p3-login

p3-function-to-role

p3-logout

p3-generate-close-roles

p3-ls

p3-generate-clusters

p3-mass-cluster-run

p3-genome-amr-data

p3-match

p3-genome-fasta

p3-merge

p3-genus-species

p3-mkdir

p3-get-contig-data

p3-pick

p3-get-drug-genomes

p3-project-subsystems

p3-get-family-data

p3-put-feature-group

p3-get-family-features

p3-get-feature-data

p3-get-feature-group

Databases

PATRIC data is stored in Solr and indexed in its entirety (all fields) as PATRIC releases data. Solr then provides read-only searching services to both the server side and the browser side of PATRIC via HTTP requests. A standard Solr 6 installation can host the PATRIC data, but Solr can be deployed in a number of different ways, and the choice can have a dramatic impact on performance for many PATRIC activities. The performance of the Solr service is heavily memory dependent. It is important, at a minimum, to be able to fit the entire set of data indexes into memory. Caches and other tunable parameters can require additional memory. In any deployment, this physical limit on available resources is likely to be one of the key defining factors for Solr configuration and performance.
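The memory requirement amounts to a simple budget: the combined index size plus whatever is reserved for caches and other tunables must fit within physical RAM. A minimal sketch of that arithmetic follows; the index sizes and the reserved amount are invented for illustration and do not describe the PATRIC deployment.

```python
def memory_headroom_gb(index_sizes_gb, ram_gb, reserved_gb=0.0):
    """Return the RAM left after holding every index plus a reserve.

    A negative result means the indexes cannot all stay memory-resident.
    """
    return ram_gb - reserved_gb - sum(index_sizes_gb)


# Hypothetical numbers, for illustration only.
headroom = memory_headroom_gb([120.0, 80.0, 45.5], ram_gb=256.0, reserved_gb=32.0)
```

With these example numbers the result is negative, which signals that the host would need more memory or the indexes would need to be spread across more servers.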

Source Code: https://github.com/PATRIC3/patric_solr_cloud

User Service

The user service provides user profile management and authentication for the PATRIC system. The user system provides a REST interface to read and modify a user’s profile. It also provides authentication services for the PATRIC web application and related components. The backend services consume authentication tokens that are generated by the user service.

Source Code: https://github.com/PATRIC3/p3_user

Web/Proxy Server

All PATRIC websites and web applications run behind a web server that hosts static files, proxies requests to the underlying application servers, and in some cases balances load among web server instances. This component is not strictly required to deploy the PATRIC infrastructure in basic form, but it greatly simplifies deployment and is the current method used for load balancing. NGINX is deployed on hosts with websites on the standard HTTP and HTTPS ports (80, 443), while the underlying applications are deployed on unused ports. NGINX is then configured to proxy requests to these local applications using name-based virtual hosting.
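Name-based virtual hosting means that the proxy chooses an upstream application from the Host header of each request. The routing decision can be sketched in a few lines; the hostnames and ports below are hypothetical, and a real deployment expresses this mapping in the proxy’s own configuration rather than in application code.

```python
# ASSUMPTION: hypothetical hostnames and local ports, for illustration only.
UPSTREAMS = {
    "www.example.org": ("127.0.0.1", 3000),
    "api.example.org": ("127.0.0.1", 3001),
    "user.example.org": ("127.0.0.1", 3002),
}


def route(host_header):
    """Pick the local upstream for a request from its Host header.

    Returns None when no virtual host matches. IPv6 literals are not
    handled in this sketch.
    """
    # Host names are case-insensitive; drop any explicit ":port" suffix.
    name = host_header.strip().lower().split(":", 1)[0]
    return UPSTREAMS.get(name)
```

For example, a request arriving on port 443 with the Host header "api.example.org" would be forwarded to the application listening on local port 3001.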

App Service

The PATRIC resource supports a number of computational services (e.g., genome assembly and annotation, model production). These services are hosted on an extensible set of computational resources at Argonne. The interface between the user’s interaction with the PATRIC website and the computational resources is called the App Service. The App Service presents a unified view of all supported services, allowing the user to submit requests, monitor progress, and view results within a common framework on the PATRIC website. For developers, the App Service enables the development of new applications without the need to handle the details of process execution and management.

Source Code: https://github.com/PATRIC3/app_service

App Service API:

The App Service is connected to the rest of the PATRIC tools and website via a programmatic JSON RPC API. The API has 6 commands:

  • enumerate_apps

  • start_app

  • query_tasks

  • query_task_summary

  • query_task_details

  • enumerate_tasks

The associated resource is: https://p3.theseed.org/services/app_service
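A client of a JSON RPC service also has to unpack each response and distinguish a result from an error. The following sketch shows one way to do so. The two response bodies are invented examples, and the exact members returned by the App Service are an assumption to be checked against the service itself.

```python
import json


class RpcError(Exception):
    """Raised when a JSON RPC response carries an error member."""


def unwrap_response(body):
    """Return the result of a JSON RPC response, or raise RpcError."""
    message = json.loads(body)
    error = message.get("error")
    if error is not None:
        # Error objects usually carry a "message"; fall back to the raw value.
        detail = error.get("message", error) if isinstance(error, dict) else error
        raise RpcError(str(detail))
    return message.get("result")


# Hypothetical response bodies; the real payload shapes are not shown here.
ok_body = '{"id": 1, "result": [{"id": "ExampleApp"}]}'
err_body = '{"id": 2, "error": {"code": -32601, "message": "Method not found"}}'
```

Calling unwrap_response(ok_body) returns the decoded result list, while unwrap_response(err_body) raises RpcError so that callers cannot mistake a failed call for an empty result.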

Hardware Deployment

The hardware hosted at Argonne National Laboratory on behalf of the University of Chicago’s bioinformatics computing core to support the PATRIC services is as follows:

  • Production support services

    • 24 x E5-2620 CPUs

    • 256 GB RAM

  • Production support services

    • 40 x E5-2640 CPUs

    • 768 GB RAM

  • User Data Management and Compute Scheduling

    • 12 x E5-2620 CPUs

    • 256 GB RAM

  • Solr Cloud servers (x3)

    • 32 x Xeon Gold 6134 CPUs

    • 760 GB RAM

    • 5.3 TB SSD storage

  • ARAST Server and Primary Compute

    • 12 x E5-2620 CPUs

    • 256 GB RAM

  • Compute server

    • 12 x E5-2620 CPUs

    • 256 GB RAM

  • Compute server (3)

    • 32 x Xeon Gold 6134 CPUs

    • 786 GB RAM

  • Loadbalanced / Failover Proxy Server

    • 2 systems, each 4 CPUs, 64GB RAM, 10Gb network

Storage is provided to the above systems through a Fibre Channel SAN. The Solr portion of PATRIC and the FTP site currently consume approximately 10 TB of storage.