================================================= Comparing Your New Genome to Existing Neighbors ================================================= In this tutorial, we discuss a common task when working with a new private genome run through the :doc:`comprehensive-genome-analysis`-- comparing the genome to its close neighbors. In this example, we will look at the results of a comprehensive genome analysis run for the NCBI SRA sample SRR8179889. The sample was assembled into contigs and then annotated as a *Streptococcus pyogenes* (from which the sample was cultured). The output of the pipeline looks like this .. image:: images/cga_result.png The Full Genome Report contains a lengthy text analysis (with tables and diagrams) of the genome and its annotation. Among the diagrams will be a phylogenetic tree showing the new genome's relationship to its neighbors. The list of neighbors will be in the text file *tree_ingroup.txt*. Checking the Annotation Quality ------------------------------- If you double-click on **annotation**, it will take you to the annotation job for the new genome. .. image:: images/annotation_folder.png The quality analysis of the genome can be found in *GenomeReport.html*. The summary at the top is shown below. .. image:: images/quality_results.png The contamination score in this case is 11.1%, and anything more than 10% is considered a red flag. Below is a fragment of the problematic roles report from the quality analysis. .. image:: images/quality_notes.png Here we have two roles that should only occur once but actually appear twice, and one role which should not occur but was found once. Both of the roles that are duplicated have a version that matches the corresponding role in the reference genome and one that does not. So, *fig|1314.841.peg.1239* is likely correct, because it matches a feature with that role in the reference genome for this species, but *fig|1314.841.peg.2100* is suspect because it does not. To find out what it does match, we click on the link for it, which takes us to the Compare Regions display. .. image:: images/contamination_protein.png The Compare Regions view uses protein familes to link features together. The gene we selected-- *fig|1314.841.peg.2100*-- is shown in red on the first line. The closest proteins in the same family are shown in red on the lower lines. Each red protein is surrounded by its neighborhood on the contig, and all proteins are color-coded and numbered by family. So, all the dark blue proteins with the number 4 belong to the same family, and are generally found near our protein of interest. What we see clearly from this display is that our protein of interest in our new genome is more like something found in Helicobacter than in Streptococcus pyogenes. A similar search using the other two suspect features returns the same result. We can conclude the extra DNA in our original sample is from Helicobacter. A Visual Comparison Using the Protein Family Sorter --------------------------------------------------- We would now like to compare the genome directly to its close neighbors using the :doc:`/user_guides/services/protein_family_service`. The Protein Family Sorter requires you to enter genomes one at a time unless they are already in a genome group. First we need to find the closest neighbors in PATRIC. The Comprehensive Genome Analysis found a set suitable for generating a phylogenetic tree, but these are not necessarily the closest. Using the :doc:`/user_guides/services/similar_genome_finder_service`, we can get a list of the 50 or so closest genomes. We will select the first 10 and put them into a genome group using the **Group** icon. .. image:: images/selected_similar_genomes.png We are now ready to invoke the Protein Family Sorter. In the screen shot below, we entered the new genome as an individual, and *new.strep.neighbors* is the name of the genome group of close neighbors we just created. .. image:: images/sorter_data_entry.png The Protein Family Sorter produces a tabular analysis and a heatmap. The heatmap is shown below. .. image:: images/heatmap_warning.png At this resolution, every protein family is a vertical line. Black indicates there is no protein in the family. Yellow indicates there is one and only one protein from that family in the genome, and when there is more than one a shade of orange is shown. Each row represents a genome, and our new genome stands out dramatically. You can scale the display so that the columns are wide enough to see the protein family names. You can also click on an individual protein. We would like to see one of the proteins that is in the new genome but does not have family members in the neighboring genomes. Every part of the diagram is clickable, and you can even drag-select a whole region. Clicking on part of the yellow band inside the big black vertical bar produced the pop-up below. .. image:: images/heatmap_popup.png You can click on the feature link to see it in Feature View. Selecting the Compare Regions tab shows us that this particular protein belongs in Helicobacter pylori. .. image:: images/compare_region_invader.png This confirms what we learned from the quality report: the sample contains some Helicobacter DNA.