VSEARCH: a versatile open source tool for metagenomics


Abstract

Background

VSEARCH is a free, open source, multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010), for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use.

Methods

When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences to quickly identify similar sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads.
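The two-stage search described above can be illustrated with a minimal Python sketch. It is not VSEARCH's implementation: the word length `k`, the scoring values, the linear gap penalty, and the function names `kmers`, `rank_targets` and `global_align_score` are all assumptions chosen for illustration, and the sketch is neither vectorised nor multithreaded. It shows only the general idea of screening targets by shared words and then scoring candidates with full dynamic programming.

```python
def kmers(seq, k=8):
    """Return the set of distinct words of length k in seq (k is an assumed value)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_targets(query, targets, k=8, top=2):
    """Screen targets by the number of words shared with the query.

    `targets` maps a name to a sequence. Targets sharing no word with the
    query are dropped; the rest are returned by decreasing shared-word count.
    """
    query_words = kmers(query, k)
    scored = [(len(query_words & kmers(seq, k)), name)
              for name, seq in targets.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for count, name in scored[:top] if count > 0]


def global_align_score(a, b, match=2, mismatch=-4, gap=-2):
    """Optimal global alignment score by full dynamic programming.

    Needleman-Wunsch with a linear gap penalty (an illustrative
    simplification); only two rows of the matrix are kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

In this sketch, a query would first be passed to `rank_targets` to obtain a short candidate list, and each candidate would then be scored with `global_align_score`. Because every cell of the alignment matrix is filled, the score is optimal under the chosen scoring scheme, which is the property that distinguishes full dynamic programming from a seed-and-extend heuristic.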

Results

VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is shown here to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-end read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end read merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0.
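Two of the simpler operations listed above, reverse complementation and full-length dereplication, can be sketched in a few lines of Python. This is an illustration of the concepts only, not VSEARCH's code: the function names are hypothetical, and the case-insensitive comparison and the ordering by decreasing abundance are assumptions made for the example.

```python
from collections import Counter

# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dereplicate_full_length(seqs):
    """Merge sequences that are identical over their full length.

    Comparison is case-insensitive (an assumption of this sketch). Returns
    (sequence, abundance) pairs sorted by decreasing abundance, with ties
    broken alphabetically so the result is deterministic.
    """
    counts = Counter(s.upper() for s in seqs)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For example, `dereplicate_full_length(["ACGT", "acgt", "TTTT", "ACGT"])` collapses the three copies of `ACGT` into one entry with abundance 3, followed by `TTTT` with abundance 1. Rereplication is the inverse operation: each unique sequence is written out once per unit of abundance.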

Discussion

VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.


Author and article information

Contributors

Torbjørn Rognes

Journal

Journal ID (nlm-ta): PeerJ

Journal ID (iso-abbrev): PeerJ

Journal ID (publisher-id): peerj

Journal ID (pmc): peerj

Title: PeerJ

Publisher: PeerJ Inc. (San Francisco, USA)

ISSN (Electronic): 2167-8359

Publication date (Electronic): 18 October 2016

Publication date Collection: 2016

Volume: 4

Electronic Location Identifier: e2584

Affiliations

[1] Department of Informatics, University of Oslo, Oslo, Norway

[2] Department of Microbiology, Oslo University Hospital, Oslo, Norway

[3] Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

[4] Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany

[5] School of Engineering, University of Glasgow, Glasgow, United Kingdom

[6] Warwick Medical School, University of Warwick, Coventry, United Kingdom

[7] Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany

[8] UMR LSTM, CIRAD, Montpellier, France

Article

Publisher ID: 2584

DOI: 10.7717/peerj.2584

PMC ID: 5075697

PubMed ID: 27781170

SO-VID: 40b8ea39-5d63-4ee2-ba18-71fd2cde5923

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

History

Date received: 5 September 2016

Date accepted: 17 September 2016

Funding

Funded by: UNINETT Sigma2

Award ID: NN9383K

Funded by: Unilever

Funded by: MRC Cloud Infrastructure for Microbial Bioinformatics (CLIMB)

Award ID: MR/L015080/1

Award ID: MR/M50161X/1

Funded by: Deutsche Forschungsgemeinschaft

Award ID: #DU1319/1-1

This research was supported in part with computational resources at the University of Oslo provided by UNINETT Sigma2 project NN9383K and funded by the Research Council of Norway. BN was funded by a BBSRC CASE studentship supported by Unilever. CQ was funded through the MRC Cloud Infrastructure for Microbial Bioinformatics (CLIMB) project (MR/L015080/1) through a fellowship (MR/M50161X/1). FM was supported by the Deutsche Forschungsgemeinschaft (grant #DU1319/1-1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
