Data¶

Below are the data types supported by PATRIC. The description of each data type includes a source designation, which is one of the following:

Primary - retrieved and imported from an external source, with source location/contact information
Secondary - curated, derived, or generated by PATRIC, with access/download information

Genomes¶

Source: Primary

Description: Genomes are the central data type in PATRIC. Most of the data and information within PATRIC is linked back to sequenced, assembled, and annotated genomes stored in the PATRIC database. Genomes are incorporated from RefSeq, GenBank, and other sources, and are annotated using a standard annotation protocol, RASTtk, to enable comparative analyses and linking of data across the website. In addition, PATRIC searches the literature for large published AMR studies and assembles the corresponding genomes using the reads available in the SRA database. As of June 2019, PATRIC contains 227,577 bacterial, 3,021 archaeal, 4,719 bacteriophage, and 10 eukaryotic host genomes.

User Guide: https://docs.patricbrc.org/user_guides/data/data_types/genomes.html

Source Code:

Retrieve new microbial genomes from GenBank/RefSeq: https://github.com/PATRIC3/p3_data/blob/master/getGenomesGenbank.pl
Annotate genomes using PATRIC’s annotation service

Genome Metadata¶

Source: Primary

Description: Genome metadata in PATRIC consists of more than 70 different metadata fields, called attributes, which are organized into the following seven broad categories: Organism Info, Isolate Info, Host Info, Sequence Info, Phenotype Info, Project Info, and Others. The attributes in each category are listed below.

Organism Info - Genome Name, Genome ID, NCBI Taxon ID, Genome Status, Organism Name, Strain, Serovar, Biovar, Pathovar, MLST, Other Typing, Culture Collection, Type Strain, Antimicrobial Resistance, and Antimicrobial Resistance Evidence
Isolate Info - Isolation Site, Isolation Source, Isolation Comments, Collection Date, Isolation Country, Geographic Location, Latitude, Longitude, Altitude, Depth, and Other Environmental
Host Info - Host Name, Host Gender, Host Age, Host Health, Body Sample Site, Body Sample Subsite, and Other Clinical
Sequence Info - Sequencing Status, Sequencing Platform, Sequencing Depth, Assembly Method, Chromosomes, Plasmids, Contigs, Sequences, Genome Length, GC Content, PATRIC CDS, and RefSeq CDS
Phenotype Info - Gram Stain, Cell Shape, Motility, Sporulation, Temperature Range, Optimal Temperature, Salinity, Oxygen Requirement, Habitat, and Disease
Project Info - Sequencing Center, Completion Date, Publication, SRA Accession, BioProject Accession, BioSample Accession, Assembly Accession, GenBank Accessions, and RefSeq Accessions
Others – Comments and Additional Metadata
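
The category and attribute organization above can be pictured as a simple mapping from category to attribute names. The sketch below is illustrative only: the names `METADATA_CATEGORIES` and `category_of` are hypothetical, only a few attributes per category are shown, and it does not reflect how PATRIC stores metadata internally.

```python
# Illustrative subset of the categories and attributes listed above.
# The dict and helper names are hypothetical, not part of any PATRIC API.
METADATA_CATEGORIES = {
    "Organism Info": ["Genome Name", "Genome ID", "NCBI Taxon ID", "Genome Status"],
    "Isolate Info": ["Isolation Site", "Isolation Source", "Collection Date"],
    "Host Info": ["Host Name", "Host Gender", "Host Age", "Host Health"],
    "Sequence Info": ["Sequencing Status", "Sequencing Platform", "Genome Length"],
    "Phenotype Info": ["Gram Stain", "Cell Shape", "Motility"],
    "Project Info": ["Sequencing Center", "Completion Date", "Publication"],
    "Others": ["Comments", "Additional Metadata"],
}


def category_of(attribute):
    """Return the category that contains the attribute, or None if not listed."""
    for category, attributes in METADATA_CATEGORIES.items():
        if attribute in attributes:
            return category
    return None


print(category_of("Host Age"))  # prints: Host Info
```

A lookup of this kind is handy when grouping columns of a downloaded metadata table by category.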

PATRIC metadata is collected from multiple sources, such as GenBank records, the BioProject and BioSample databases, published literature, other NIAID sequencing centers, and other PATRIC collaborators. Automated metadata collection is augmented with manual curation to ensure consistency and accuracy.

User Guide: https://docs.patricbrc.org/user_guides/organisms_taxon/genome_metadata.html

Source Code:

Process curated genome metadata spreadsheet: https://github.com/PATRIC3/p3_data/blob/master/parseMetadataFile.pl
Process curated AMR metadata / antibiogram spreadsheet: https://github.com/PATRIC3/p3_data/blob/master/parseAMRMetadata.pl
Process genome metadata and antibiogram data from BioSample records: https://github.com/PATRIC3/p3_data/blob/master/parseBiosampleAMR.pl

Antimicrobial Resistance Data and Metadata¶

Source: Primary

Description: Genome-level antimicrobial resistance (AMR) phenotype data is collected from NCBI BioSample and antibiogram records as part of the genome ingestion process. In addition, curated AMR metadata is received from the NIAID-funded Genomic Centers for Infectious Diseases and from publications. These data often include panel data for antibiotics and chemicals. As of October 2017, PATRIC has AMR metadata for 19,299 genomes. AMR metadata include the following:

Antimicrobial Resistance metadata field – Shows genomes that have been specifically tested against certain antibiotics and the resulting phenotype from that test. Allowable values are ‘Resistant’, ‘Susceptible’, and ‘Intermediate’. Multiple values may appear for the same genome depending on its resistance or susceptibility to different antibiotics.

Antimicrobial Resistance Evidence metadata field – Indicates the information source behind the AMR designation. Allowable values include ‘Phenotype’, ‘AMR Panel’, and ‘Comment’.

AMR Panel Data – Where available, PATRIC includes the source AMR panel (antibiogram) data as an FTP download. As of October 2018, PATRIC has 27,633 genomes with AMR panel data.

Antibiotics – PATRIC provides basic information about commonly used antibiotics, including their chemical and physical properties, pharmacology, and mechanism of action. In addition, each antibiotic is linked to other relevant data available in PATRIC, such as AMR phenotypes for genomes, AMR genes, and AMR regions. As of October 2018, PATRIC has information on 97 antibiotics.

AMR Phenotypes – PATRIC collects AMR phenotype data generated using antimicrobial susceptibility testing (AST) methods from published studies and collaborators. In addition, PATRIC provides AMR phenotypes predicted using machine learning classifiers. As of October 2018, PATRIC has 284,295 antibiotic-specific records derived from AMR panel data.

AMR Genes – AMR genes refer to the genes implicated in or associated with the resistance to one or more antibiotics. The resistance may result from the presence or absence of a gene or specific mutations acquired spontaneously or through evolution over time. Known antibiotic resistance genes are integrated and mapped from the following sources: ARDB, CARD, NDARO, and PATRIC AMR-related curation.

AMR Regions – AMR regions refer to the small genomic regions implicated in or associated with the resistance to one or more antibiotics. The AMR regions are computationally predicted by the same machine learning classifiers that are used to predict AMR phenotypes. They may map to existing genes or intergenic regions and may help identify new AMR genes or understand AMR mechanisms.
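
As a small illustration of the Antimicrobial Resistance metadata field described above, the following sketch groups the antibiotics tested for one genome by phenotype and rejects values outside the three allowable ones. The record layout (pairs of antibiotic and phenotype), the function name, and the example data are assumptions made for this example, not PATRIC's schema.

```python
from collections import defaultdict

# Allowable values for the Antimicrobial Resistance field, as listed above.
ALLOWED_PHENOTYPES = {"Resistant", "Susceptible", "Intermediate"}


def group_by_phenotype(records):
    """Group antibiotics by phenotype for a single genome.

    `records` is a hypothetical list of (antibiotic, phenotype) pairs.
    Raises ValueError if a phenotype is not one of the allowable values.
    """
    grouped = defaultdict(list)
    for antibiotic, phenotype in records:
        if phenotype not in ALLOWED_PHENOTYPES:
            raise ValueError(f"unexpected phenotype: {phenotype!r}")
        grouped[phenotype].append(antibiotic)
    return dict(grouped)


# Made-up example data for one genome.
example = [
    ("ampicillin", "Resistant"),
    ("ciprofloxacin", "Susceptible"),
    ("tetracycline", "Resistant"),
]
print(group_by_phenotype(example))
```

This mirrors the note that multiple values may appear for the same genome, one per antibiotic tested.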

User Guide: https://docs.patricbrc.org/user_guides/data/data_types/antimicrobial_resistance.html

Source Code:

Process curated AMR metadata / antibiogram spreadsheet: https://github.com/PATRIC3/p3_data/blob/master/parseAMRMetadata.pl
Process genome metadata and antibiogram data from BioSample records: https://github.com/PATRIC3/p3_data/blob/master/parseBiosampleAMR.pl

Other Clinical Metadata¶

Source: Primary

Description: Additional clinical metadata is available for a subset of the genomes available at PATRIC. These data are stored as searchable key-value pairs. Currently, there are 24,724 genomes in PATRIC with clinical metadata. Example values include “hospital location: ICU”, “comorbidity: HIV negative”, and “host_health_state:Carriage”.
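
To make the key-value layout concrete, here is a minimal sketch that splits entries like the examples above on the first colon. The helper name `parse_key_value` is hypothetical, and the sketch assumes each entry contains a single `key: value` pair as in the quoted examples.

```python
def parse_key_value(entry):
    """Split a 'key: value' string on the first colon and trim whitespace."""
    key, sep, value = entry.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in {entry!r}")
    return key.strip(), value.strip()


# The example values quoted in the description above.
entries = [
    "hospital location: ICU",
    "comorbidity: HIV negative",
    "host_health_state:Carriage",
]
metadata = dict(parse_key_value(e) for e in entries)
print(metadata["host_health_state"])  # prints: Carriage
```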

Annotated Genome Features¶

Source: Secondary

Description: PATRIC genome features are the annotations produced by the RASTtk system, and include coding sequences (CDS / gene calls), rRNAs, tRNAs, CRISPR elements, other miscellaneous genomic features, specialty gene designations, and AMR phenotypes where classifiers exist for the organism. PATRIC also retains the original GenBank/RefSeq annotations.

User Guide: https://docs.patricbrc.org/user_guides/data/data_types/genomic_features.html

Source Code:

Processing genomic features and related information from GenBank files and RAST genome objects: https://github.com/PATRIC3/p3_data/blob/master/rast2solr.pl

Specialty Genes¶

Source: Secondary

Description: Specialty Genes refers to the special classes of genes that are of particular interest to infectious disease researchers, such as antibiotic resistance genes, virulence factors, drug targets, and human homologs. As part of genome annotation, PATRIC maps reference genes to their homologs based on high sequence similarity using BLASTP, thus providing consistent annotation of specialty genes across all bacterial genomes. The classes and sources of PATRIC specialty genes are provided below.

Antibiotic Resistance Genes
- ARDB – Antibiotic Resistance Genes Database
- CARD – The Comprehensive Antibiotic Resistance Database
- NDARO – National Database of Antibiotic Resistant Organisms
- PATRIC AMR genes – Manually curated from literature
Drug Targets
- DrugBank
- TTD – Therapeutic Targets Database
Human Homologs
- Proteins from the Reference Human Genome at NCBI RefSeq
Virulence Factors
- VFDB – Virulence Factor Database
- Victors – Virulence Factor Database
- PATRIC_VF - a manually curated virulence factor database developed by the PATRIC team
Transporters
- TCDB: Transporter Classification Database
Essential Genes
- PATRIC Essential Genes: Predicted using metabolic modeling and flux balance analysis (FBA)

User Guide: https://docs.patricbrc.org/user_guides/data/data_types/specialty_genes.html

Source Code:

Specialty gene search using BLAST or BLAT: https://github.com/PATRIC3/p3_data/blob/master/specialtyGenes.pl
Process and upload specialty genes: https://github.com/PATRIC3/p3_data/blob/master/uploadSpecialtyGenes.pl

Other Annotations¶

Source: Primary

Description: Some PATRIC features have additional annotations beyond those generated through RASTtk. These annotations include experimental and literature-based evidence, typically generated by collaborations or other external sources, such as the TBCAP Tuberculosis Annotation Project, NIAID-funded Functional Genomics Centers, and literature references from The SEED. These annotations are incorporated into PATRIC as key-value pairs that appear as comments associated with genome features. As of October 2017, PATRIC has 8,731,247 such annotations.

Protein Families¶

Source: Secondary

Description: PATRIC provides multiple sets of protein families to enable comparative genomic analysis at various levels. FIGfams are a set of iso-functional homologs, each containing proteins that have the same function and sequences that are similar along their full length. In addition, PATRIC includes genus-specific protein families (PLfams) and cross-genera protein families (PGfams) for all the public genomes in PATRIC. These protein families cover almost all of the proteins in the current public genomes (~100% protein coverage) to support more comprehensive comparative analysis.

User Guide: https://docs.patricbrc.org/user_guides/data/data_types/protein_families.html

Pathways¶

Source: Secondary

Description: Pathways in PATRIC are represented using KEGG (Kyoto Encyclopedia of Genes and Genomes) maps. As of October 2018, there are 147 unique pathways covering 2,820 unique EC numbers in PATRIC. These pathways are projected onto all public and private bacterial genomes as part of the genome annotation process.

User Guide: https://docs.patricbrc.org/user_guides/data/data_types/pathways.html

Phylogenetic Trees¶

Source: Secondary

Description: PATRIC provides interactive phylogenetic trees computed at the order level for PATRIC genomes; the trees are also available as downloadable Newick files. For some orders with very large numbers of genomes, sub-trees are computed at the family level instead. Trees have been computed for the following 14 orders:

- Actinomycetales
- Bacillales
- Burkholderiales
- Campylobacterales
- Chlamydiales
- Clostridiales
- Enterobacteriales
- Lactobacillales
- Legionellales
- Rhizobiales
- Rickettsiales
- Spirochaetales
- Thiotrichales
- Vibrionales
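
For readers unfamiliar with the Newick format mentioned above, the following sketch pulls the leaf (tip) names out of a small Newick string using a regular expression. It is a simplified illustration, not a full parser: the tree string and the function name are made up, and quoted labels, comments, and single-leaf trees are not handled.

```python
import re


def newick_leaf_names(newick):
    """Return leaf names from a simple Newick string.

    A leaf name is taken to be a label that directly follows '(' or ','.
    Labels that follow ')' belong to internal nodes and are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Made-up tree with three leaves and branch lengths.
tree = "((genome_A:0.1,genome_B:0.2):0.05,genome_C:0.3);"
print(newick_leaf_names(tree))  # prints: ['genome_A', 'genome_B', 'genome_C']
```

For real analyses, an established phylogenetics library is a safer choice than a regular expression.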

User Guide: https://docs.patricbrc.org/user_guides/organisms_taxon/phylogeny.html

Source Code:

Build phylogenetic trees with progressive refinement: https://github.com/PATRIC3/pepr
Build phylogenetic tree using codon tree service: https://github.com/PATRIC3/codon_trees
Display precomputed trees on the website: https://github.com/PATRIC3/p3_trees

Differential Expression Data¶

Source: Primary

Description: In the PATRIC context, Differential Expression Data includes quantitative gene expression data generated by high-throughput technologies, such as microarrays or RNA-Seq, and can also include protein expression data. PATRIC has integrated a large number of published gene expression datasets related to bacterial pathogens from NCBI’s GEO database. Our manual curation process reviews the experiment description and the related publication to understand the experimental design, combines data from replicates, creates pair-wise comparisons or contrasts as described in the publication to identify differential gene expression, and applies data normalization and log-transformation. As of October 2018, PATRIC includes 829 curated differential expression experiments with 5,743 comparisons. PATRIC also includes 21 curated host-response gene expression datasets for mouse and human, curated from Expression Atlas and other sources.
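
The idea of combining replicates and forming a log-transformed pair-wise comparison can be sketched as follows. This is a generic illustration under stated assumptions: averaging replicates and taking a log2 ratio is one common approach, the function name and expression values are made up, and the sketch is not a description of PATRIC's exact curation procedure.

```python
import math
from statistics import mean


def log2_ratio(test_replicates, control_replicates):
    """Average the replicates of each condition, then return log2(test / control)."""
    return math.log2(mean(test_replicates) / mean(control_replicates))


# Made-up expression values for a single gene in two conditions.
value = log2_ratio([200.0, 220.0, 180.0], [50.0, 55.0, 45.0])
print(round(value, 2))  # prints: 2.0
```

A positive value indicates higher expression in the test condition, and a negative value indicates lower expression.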

User Guide: https://docs.patricbrc.org/user_guides/data/data_types/transcriptomics.html

Source Code:

Process new RNA-seq datasets using RNA-seq service: https://github.com/PATRIC3/app_service/blob/master/scripts/App-RNASeq.pl
Process new differential expression datasets: https://github.com/PATRIC3/app_service/blob/master/scripts/App-DifferentialExpression.pl
Process and upload differential expression data into Solr: https://github.com/PATRIC3/p3_data/blob/master/expression2solr.pl

Protein-Protein Interactions¶

Source: Primary

Description: PATRIC incorporates non-redundant, experimentally characterized protein-protein interactions (PPIs) from numerous public repositories, including IntAct, BIND, DIP, UniProt, MINT, MPIdb, Spike, Reactome, MatrixDB, and InnateDB. Interaction data are initially retrieved by querying repositories in the PSICQUIC public registry for PPIs that have experimental support according to the PSI-MI interaction type and detection method ontologies, plus the presence of at least one literature reference. Verified data are subsequently matched to their corresponding taxa, stripped of redundant interactions, and categorized as either intraspecific (interactions between proteins in the same species) or interspecific (interactions between proteins in different species, including host-pathogen (HP) PPIs). In addition to the experimentally characterized PPIs, PATRIC provides computationally predicted PPIs and genetic interactions from the STRING database. As of October 2018, PATRIC has 55,600,858 protein-protein interactions.
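
The redundancy removal and intraspecific/interspecific categorization described above can be sketched as below. The tuple layout, the function name, and the example records are assumptions for illustration, and taxon IDs stand in for species; the actual processing is done by the upload script linked under Source Code.

```python
def classify_interactions(interactions):
    """Drop redundant pairs and label each one intraspecific or interspecific.

    `interactions` is a hypothetical list of tuples:
    (protein_a, taxon_a, protein_b, taxon_b).
    A-B and B-A are treated as the same interaction.
    """
    seen = set()
    result = []
    for prot_a, tax_a, prot_b, tax_b in interactions:
        # An unordered key makes the pair direction-independent.
        key = frozenset([(prot_a, tax_a), (prot_b, tax_b)])
        if key in seen:
            continue
        seen.add(key)
        kind = "intraspecific" if tax_a == tax_b else "interspecific"
        result.append((prot_a, prot_b, kind))
    return result


# Made-up records: the second is the reverse of the first.
records = [
    ("P1", 83332, "P2", 83332),
    ("P2", 83332, "P1", 83332),
    ("P1", 83332, "H1", 9606),
]
print(classify_interactions(records))
```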

User Guide: https://docs.patricbrc.org/user_guides/organisms_taxon/interactions.html

Source Code:

Process and upload PPI data: https://github.com/PATRIC3/p3_data/blob/master/uploadPPI.pl

Protein Structures¶

Source: Primary

Description: PATRIC retrieves protein structures directly from PDB in real time using the PDB data APIs. The protein structures are shown to users but are not stored locally in the PATRIC database.

Other Special Data Sets¶

Source: Primary

Description: PATRIC contains collections of data of particular interest to researchers and NIAID programs. These data sets are typically the result of experiments conducted in research projects funded by NIAID programs, such as the Functional Genomics Centers and Systems Biology Centers, as well as through collaborations in which PATRIC plays a part. These data sets are available as “Specialty Data Collections” from the PATRIC main Data menu and include summary project information and links to associated publications and additional data in other public repositories.