MLgsc: A Maximum-Likelihood General Sequence Classifier

Abstract

We present a software package for classifying protein or nucleotide sequences to user-specified sets of reference sequences. The software trains a model using a multiple sequence alignment and a phylogenetic tree, both supplied by the user. The tree is used both to guide model construction and as a decision tree to speed up the classification process. The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. On this dataset, the software achieved an error rate of around 1% at genus level. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes are also presented. The programs in the package have a simple, straightforward command-line interface for the Unix shell, and are free and open-source. The package has minimal dependencies and thus can be easily integrated into command-line based classification pipelines.
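The idea of using the tree both to build the model and as a decision tree can be illustrated with a minimal sketch. This is not MLgsc's implementation: the abstract does not describe the model, so the sketch assumes a simple per-column residue frequency profile with pseudocounts at every tree node, a tree given as nested tuples of leaf labels, and a query that is already aligned to the reference alignment. All function names here are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"  # assumption: nucleotide alphabet plus gap


def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-probabilities of each residue, smoothed with pseudocounts."""
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs)
        profile.append(
            {r: math.log((counts[r] + pseudocount) / total) for r in ALPHABET}
        )
    return profile


def score(profile, query):
    """Log-likelihood of an aligned query under a column-independent profile."""
    return sum(column[residue] for column, residue in zip(profile, query))


def leaves(tree):
    """Leaf labels below a node; a leaf is a string, an inner node a tuple."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def train(tree, alignment):
    """Attach a profile to every node, built from the sequences of its leaves."""
    members = [seq for name in leaves(tree) for seq in alignment[name]]
    children = [] if isinstance(tree, str) else [train(c, alignment) for c in tree]
    label = tree if isinstance(tree, str) else None
    return {"label": label, "profile": build_profile(members), "children": children}


def classify(node, query):
    """Descend from the root, following the best-scoring child at each node."""
    while node["children"]:
        node = max(node["children"], key=lambda c: score(c["profile"], query))
    return node["label"]
```

With a toy alignment such as `{"GenusA": ["ACGTAC", "ACGTAA"], "GenusB": ["ACCTTC", "ACCTTC"], "GenusC": ["TTGGAC", "TTGGAC"]}` and the tree `(("GenusA", "GenusB"), "GenusC")`, `classify(train(tree, alignment), "ACGTAC")` follows the inner node and then the `GenusA` leaf. Because only the children of the current node are scored, the number of profile comparisons grows with tree depth rather than with the number of reference taxa, which is the speed-up the abstract attributes to the decision tree.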

Most cited references

GenBank

Dennis A Benson, Ilene Karsch-Mizrachi, David Lipman … (2009)

GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 300 000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank® staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.

Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads under Maximum Likelihood

Simon A Berger, Denis Krompass, Alexandros Stamatakis (2011)

We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.

FunGene: the functional gene pipeline and repository

Jordan Fish, Benli Chai, Qiong Wang … (2013)

Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes.

Author and article information

Contributors

I. King Jordan: Role: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (iso-abbrev): PLoS ONE

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Electronic): 1932-6203

Publication date (Electronic): 6 July 2015

Publication date Collection: 2015

Volume: 10

Issue: 7

Electronic Location Identifier: e0129384

Affiliations

[1] Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland

[2] Vital-IT Group, Swiss Institute of Bioinformatics, Lausanne, Vaud, Switzerland

[3] Laboratory of Biogeosciences, Institute of Earth Sciences, University of Lausanne, Lausanne, Vaud, Switzerland

Georgia Institute of Technology, UNITED STATES

Author notes

Competing Interests: The authors have declared that no competing interests exist.

Conceived and designed the experiments: TJ VH TW PJ. Performed the experiments: TJ VH TW. Analyzed the data: TJ VH TW PJ. Contributed reagents/materials/analysis tools: TJ VH TW. Wrote the paper: TJ VH TW PJ. Designed and wrote software: TJ.

* E-mail: thomas.junier@unine.ch

Article

Publisher ID: PONE-D-14-14150

DOI: 10.1371/journal.pone.0129384

PMC ID: 4492669

PubMed ID: 26148002

SO-VID: 978f221c-0f20-4d82-8a76-6806d543434c

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 30 March 2014

Date accepted: 7 May 2015

Page count

Figures: 2, Tables: 3, Pages: 12

Funding

The authors acknowledge the elemo project (elemo.ch) for the sampling campaigns using the MIR submersibles. The authors are grateful for the help of the MIR team and colleagues from the elemo project. This work was supported by Swiss National Science Foundation grant No. 31003A-132358/1.

Custom metadata

Data Availability: The source code and data used in the paper (evaluation and Spo0A example) are freely available on GitHub (https://github.com/tjunier/mlgsc). The spo0A datasets are available at the Sequence Read Archive under accessions SRX364015 and SRX364014.
