OTU picking on large datasets: comparing methods on a diversity of situations

Abstract

De novo OTU picking from large metabarcoding read datasets is both a topical and a complex task, and several methods coexist to perform it. We present here the outcome of a collective project developed within the Working Group on "Data Analysis and Storage" of DNAqua.net. Our aim was to organize a thorough comparison of the OTU composition produced by a selection of methods, each called through a wrapper, in a diversity of situations. To do so, we assembled a set of datasets and a set of methods, applied each method to each dataset, and compared the results. We deliberately chose to work with cleaned datasets only, and not to include cleaning in the process.

We worked with a set of about 60 datasets, some environmental and some from mock communities, produced by six teams in different countries (D, F, I, T, UK), each with specific markers for different organisms. Every dataset was cleaned beforehand by the team that contributed it. We installed four tools for building OTUs by unsupervised clustering: Swarm (Mahé et al. 2015); Vsearch (Rognes et al. 2016), with the same recipe for all datasets; Usearch (Edgar 2010), with a single command used for all datasets; and yapotu, which computes pairwise Smith-Waterman distances between all reads of a given dataset and then clusters them with graph-based techniques. The yapotu approach is expected to be the most accurate, as its calculations involve no heuristics.

We harmonized the input and output formats of the four methods so that their results could be compared. The indicators selected for the comparison are summarized below.

We first computed basic indicators per sample and method: the number of OTUs, the number of singletons, the number of OTUs with ten reads or more (after dereplication), and the fraction of reads allocated to an OTU. The four methods produced widely varying counts, with the highest numbers of OTUs and singletons for Swarm, nearly equivalent figures (but fewer singletons) for yapotu, and significantly smaller counts for Vsearch and Usearch. However, the numbers of OTUs with ten reads or more converge much more closely across the four methods.

We then compared rank-size curves, computed for every sample-method pair. Here again, the yapotu and Swarm results are very similar, whereas Vsearch and Usearch are sometimes close to that pattern and sometimes very different.

We then computed 10 diversity indices, such as OTU richness, Shannon, Chao, and evenness. Here again, the results provided by Swarm and yapotu are very similar, with very strong correlations between indices over all samples, whereas the correlations with Vsearch and Usearch are very poor.

Finally, we computed contingency tables (in a sparse format) between all pairs of methods (hence, 6 pairs) for all samples; these describe precisely whether OTU compositions are similar or dissimilar between methods. We observed that Swarm OTUs are systematically nested within yapotu OTUs, and that most often there is a one-to-one correspondence between a Swarm OTU and a yapotu OTU.

In conclusion, we show that:
- Swarm and yapotu yield very similar results, including in fine details, the only difference being a larger number of singletons provided by Swarm;
- Swarm OTUs are therefore very close to OTUs built by single-linkage clustering on Smith-Waterman pairwise distances, which lends support to both approaches;
- very often, but not always, Vsearch and Usearch diverge from those convergent results, and it is not easy to understand when and why. Further investigation is therefore needed.

All datasets will be made publicly available for further benchmarking of a wider set of methods and datasets.
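
As an illustration only, the basic indicators, the rank-size curve, and the contingency-table comparison described in the abstract can be sketched in Python. This is not the project's code: the function names, the read-to-OTU dictionaries, and the toy labels are all hypothetical, and an OTU's size is assumed here to be its number of reads.

```python
from collections import Counter, defaultdict

def otu_sizes(otu_of_read):
    """Count reads per OTU; reads mapped to None are treated as unassigned."""
    return Counter(label for label in otu_of_read.values() if label is not None)

def basic_indicators(otu_of_read, min_size=10):
    """Basic per-sample indicators for one method (hypothetical helper)."""
    sizes = otu_sizes(otu_of_read)
    n_reads = len(otu_of_read)
    return {
        "n_otus": len(sizes),
        "n_singletons": sum(1 for s in sizes.values() if s == 1),
        "n_otus_ge_min": sum(1 for s in sizes.values() if s >= min_size),
        "fraction_assigned": sum(sizes.values()) / n_reads if n_reads else 0.0,
    }

def rank_size(otu_of_read):
    """OTU sizes sorted from largest to smallest (the rank-size curve)."""
    return sorted(otu_sizes(otu_of_read).values(), reverse=True)

def contingency(part_a, part_b):
    """Sparse contingency table: (OTU in A, OTU in B) -> number of shared reads."""
    table = Counter()
    for read, a in part_a.items():
        b = part_b.get(read)
        if a is not None and b is not None:
            table[(a, b)] += 1
    return table

def is_nested(part_a, part_b):
    """True if every OTU of A has all its shared reads in a single OTU of B."""
    partners = defaultdict(set)
    for a, b in contingency(part_a, part_b):
        partners[a].add(b)
    return all(len(bs) == 1 for bs in partners.values())

# Toy example with made-up labels: two fine-grained OTUs (s1, s2)
# fall inside one coarser OTU (y1).
swarm = {"r1": "s1", "r2": "s1", "r3": "s2", "r4": "s3", "r5": "s3"}
yapotu = {"r1": "y1", "r2": "y1", "r3": "y1", "r4": "y2", "r5": "y2"}

print(basic_indicators(swarm, min_size=2))
print(rank_size(swarm))
print(is_nested(swarm, yapotu), is_nested(yapotu, swarm))
```

In this toy case the first partition is nested within the second but not the reverse, which is the kind of relationship the contingency tables are meant to reveal. Storing only the non-zero cells keeps the table sparse, which matters when each sample holds thousands of OTUs.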

Most cited references 3

Search and clustering orders of magnitude faster than BLAST.

Robert Edgar (2010)

Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch.

VSEARCH: a versatile open source tool for metagenomics

Torbjørn Rognes, Tomáš Flouri, Ben Nichols … (2016)

Background

VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use.

Methods

When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences, a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads.

Results

VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-ends read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end reads merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0.

Discussion

VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.

Swarm v2: highly-scalable and high-resolution amplicon clustering

Frédéric Mahé, Torbjørn Rognes, Christopher Quince … (2015)

Previously we presented Swarm v1, a novel and open source amplicon clustering program that produced fine-scale molecular operational taxonomic units (OTUs), free of arbitrary global clustering thresholds and input-order dependency. Swarm v1 worked with an initial phase that used iterative single-linkage with a local clustering threshold (d), followed by a phase that used the internal abundance structures of clusters to break chained OTUs. Here we present Swarm v2, which has two important novel features: (1) a new algorithm for d = 1 that allows the computation time of the program to scale linearly with increasing amounts of data; and (2) the new fastidious option that reduces under-grouping by grafting low abundant OTUs (e.g., singletons and doubletons) onto larger ones. Swarm v2 also directly integrates the clustering and breaking phases, dereplicates sequencing reads with d = 0, outputs OTU representatives in fasta format, and plots individual OTUs as two-dimensional networks.