      Is Open Access

      A pan-genome method to determine core regions of the Bacillus subtilis and Escherichia coli genomes

      research-article


          Abstract

          Background: Synthetic engineering of bacteria to produce industrial products is a burgeoning field of research and application. To optimize genome design, designers need to understand which genes are essential, which are optimal for growth, and which locations in the genome will tolerate the insertion of engineered cassettes.

          Methods: We present a pan-genome based method for the identification of core regions in a genome that are strongly conserved at the species level.

          Results: We show that the core regions determined by our method contain all or almost all essential genes. This demonstrates the accuracy of our method as essential genes should be core genes. We show that we outperform previous methods by this measure. We also explain why there are exceptions to this rule for our method.
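The evaluation idea in the Results (essential genes should fall in the core) can be illustrated with a minimal gene-level sketch. It computes the penetrance of each gene (the fraction of genomes that carry it), calls genes core above a threshold, and reports the fraction of essential genes that land in the core. The presence/absence data, the gene and genome names, the function names, and the 0.95 threshold are all hypothetical and not taken from the article, whose method works on core regions of a pan-genome graph rather than on this toy table.

```python
from typing import Dict, Set


def penetrance(presence: Dict[str, Set[str]], n_genomes: int) -> Dict[str, float]:
    """Fraction of genomes in which each gene is present."""
    return {gene: len(genomes) / n_genomes for gene, genomes in presence.items()}


def core_genes(presence: Dict[str, Set[str]], n_genomes: int,
               threshold: float = 0.95) -> Set[str]:
    """Genes whose penetrance meets the (assumed) core threshold."""
    return {g for g, p in penetrance(presence, n_genomes).items() if p >= threshold}


def essential_recall(core: Set[str], essential: Set[str]) -> float:
    """Fraction of essential genes that are also called core."""
    if not essential:
        raise ValueError("essential gene set is empty")
    return len(core & essential) / len(essential)


# Hypothetical presence/absence table: gene -> genomes that carry it.
all_genomes = {"g1", "g2", "g3", "g4"}
presence = {
    "dnaA": set(all_genomes),
    "rpoB": set(all_genomes),
    "ftsZ": {"g1", "g2", "g3"},   # missing from one genome in this toy data
    "yqcG": {"g1"},               # low-penetrance accessory gene
}
essential = {"dnaA", "rpoB", "ftsZ"}  # hypothetical essential-gene list

core = core_genes(presence, n_genomes=len(all_genomes))
recall = essential_recall(core, essential)
print(sorted(core), round(recall, 3))
```

In this toy data one essential gene is missing from a single genome and so falls outside the core, the kind of exception that an annotation or assembly gap could produce in real data.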

          Conclusions: We assert that synthetic engineers should avoid deleting or inserting into these core regions unless they understand and are manipulating the function of the genes in that region. Similarly, if the designer wishes to streamline the genome, non-core regions, and in particular low-penetrance genes, would be good targets for deletion. Care should be taken to remove entire cassettes whose genes have similar penetrance, as they may harbor toxin/antitoxin genes that need to be removed in tandem. The bioinformatic approach introduced here saves considerable time and effort relative to knockout studies on single isolates of a given species and captures a broad understanding of the conservation of genes that are core to a species.
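The advice to treat runs of non-core genes as whole units can be sketched as follows. The example walks genes in chromosomal order and groups consecutive genes into core and non-core runs, so that a non-core run can be inspected as a candidate deletion unit. The gene names, penetrance values, function name, and 0.95 threshold are hypothetical; the sketch ignores chromosome circularity and is not the article's pan-genome graph procedure.

```python
from itertools import groupby
from typing import List, Tuple


def split_regions(ordered_genes: List[Tuple[str, float]],
                  threshold: float = 0.95) -> List[Tuple[str, List[str]]]:
    """Group consecutive genes into runs labeled 'core' or 'non-core'.

    ordered_genes holds (gene name, penetrance) pairs in genome order.
    """
    regions = []
    for is_core, run in groupby(ordered_genes, key=lambda gp: gp[1] >= threshold):
        label = "core" if is_core else "non-core"
        regions.append((label, [name for name, _ in run]))
    return regions


# Hypothetical genes in chromosomal order with made-up penetrance values.
genome = [
    ("geneA", 1.00),
    ("geneB", 0.98),
    ("geneC", 0.10),  # geneC and geneD have similar low penetrance,
    ("geneD", 0.12),  # so they are kept together as one non-core run
    ("geneE", 1.00),
]

regions = split_regions(genome)
for label, genes in regions:
    print(label, genes)
```

Keeping adjacent genes of similar penetrance in one run reflects the caution about toxin/antitoxin pairs: deleting only one member of such a run could leave its partner behind.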


                Author and article information

                Contributors
                Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Project Administration, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing
                Roles: Data Curation, Formal Analysis, Validation, Writing – Original Draft Preparation, Writing – Review & Editing
                Roles: Validation, Writing – Original Draft Preparation, Writing – Review & Editing
                Roles: Data Curation, Formal Analysis, Investigation, Validation, Writing – Original Draft Preparation, Writing – Review & Editing
                Roles: Conceptualization, Funding Acquisition, Project Administration, Writing – Review & Editing
                Roles: Data Curation, Investigation, Validation, Writing – Review & Editing
                Roles: Conceptualization, Data Curation, Funding Acquisition, Methodology, Project Administration, Validation, Writing – Original Draft Preparation, Writing – Review & Editing
                Journal
                F1000Research (F1000Res)
                F1000 Research Limited (London, UK)
                ISSN: 2046-1402
                Published: 13 April 2021
                Volume 10, article 286
                Affiliations
                [1] J. Craig Venter Institute, Rockville, Maryland, 20850, USA
                [2] Natural Selection, Inc., San Diego, CA, 92121, USA
                [3] Noblis, Inc., Reston, VA, 20191, USA
                [1] Department of Computer Science, Aristotle University of Thessalonica, Thessalonica, Greece
                [2] Centre for Research & Technology Hellas, Thessalonica, Greece
                [1] Programming Associate, DBMI, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
                [2] Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
                Author notes

                No competing interests were disclosed.


                Author information
                https://orcid.org/0000-0001-7498-8048
                https://orcid.org/0000-0001-6272-2875
                Article
                DOI: 10.12688/f1000research.51873.1
                PMCID: 8156514
                PMID: 34113437
                ScienceOpen record ID: 0a1fefb7-d783-4842-ae3d-e7127da3d3d0
                Copyright: © 2021 Sutton G et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 31 March 2021
                Funding
                Funded by: IARPA
                Award ID: N6600118C-4506
                This research is based upon work supported [in part] by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) under Finding Engineering Linked Indicators (FELIX) program contract #N6600118C-4506. The principal investigator for the award is Sterling Thomas. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
                The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Articles

                pan-genome, pan-genome graph, core genes, essential genes
