23
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      MLgsc: A Maximum-Likelihood General Sequence Classifier

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          We present software package for classifying protein or nucleotide sequences to user-specified sets of reference sequences. The software trains a model using a multiple sequence alignment and a phylogenetic tree, both supplied by the user. The latter is used to guide model construction and as a decision tree to speed up the classification process. The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. On this dataset, the software was shown to achieve an error rate of around 1% at genus level. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes is also presented. The programs in the package have a simple, straightforward command-line interface for the Unix shell, and are free and open-source. The package has minimal dependencies and thus can be easily integrated in command-line based classification pipelines.

          Related collections

          Most cited references11

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          GenBank

          GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 300 000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank® staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads under Maximum Likelihood

            We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              FunGene: the functional gene pipeline and repository

              Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                PLoS ONE
                plos
                plosone
                PLoS ONE
                Public Library of Science (San Francisco, CA USA )
                1932-6203
                6 July 2015
                2015
                : 10
                : 7
                : e0129384
                Affiliations
                [1 ]Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland
                [2 ]Vital-IT Group, Swiss Institute of Bioinformatics, Lausanne, Vaud, Switzerland
                [3 ]Laboratory of Biogeosciences, Institute of Earth Sciences, University of Lausanne, Lausanne, Vaud, Switzerland
                Georgia Institute of Technology, UNITED STATES
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: TJ VH TW PJ. Performed the experiments: TJ VH TW. Analyzed the data: TJ VH TW PJ. Contributed reagents/materials/analysis tools: TJ VH TW. Wrote the paper: TJ VH TW PJ. Designed and wrote software: TJ.

                Article
                PONE-D-14-14150
                10.1371/journal.pone.0129384
                4492669
                26148002
                978f221c-0f20-4d82-8a76-6806d543434c
                Copyright @ 2015

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

                History
                : 30 March 2014
                : 7 May 2015
                Page count
                Figures: 2, Tables: 3, Pages: 12
                Funding
                The authors acknowledge the elemo project (elemo.ch) for the sampling campaigns using the MIR submersibles. The authors are grateful for the help of the MIR team and colleagues from the elemo project. This work was supported by Swiss National Science Foundation grant No. 31003A-132358/1.
                Categories
                Research Article
                Custom metadata
                The source code and data used in the paper (evaluation and Spo0A example) are freely available on GitHub ( https://github.com/tjunier/mlgsc). The spo0A datasets are available at the Sequence Read Archive SRX364015 and SRX364014.

                Uncategorized
                Uncategorized

                Comments

                Comment on this article