BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems

Abstract

DNA metabarcoding workflows produce hundreds to tens of thousands of Operational Taxonomic Units (OTUs) or Exact Sequence Variants (ESVs) per analysis. In most workflows, these sequences must then be assigned to taxa, typically by comparison against publicly available databases. Especially, yet not exclusively, for Eumetazoan metabarcoding, the Barcode of Life Data system (BOLD) is the most comprehensive and curated reference barcode database and, therefore, typically the first choice for taxonomic assignment. While an application programming interface (API) exists to query data in large batches, it returns no information on the large and important body of unpublished data. The alternative, the BOLD identification engine on the website, provides full access but is restricted to 100 sequences per request. We developed BOLDigger, a small, platform-independent software package with a graphical user interface (GUI), which aims to solve this problem by automating the process of sending successive requests of up to 100 sequences without exceeding the capacities of BOLD. BOLDigger can be used to download the results of the identification engine, as well as metadata for the obtained hits. Three different methods are implemented for selecting the best-fitting hit. A new method that combines a threshold-based approach with the metadata was implemented to make use of this additional information.
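The batching step described in the abstract can be illustrated with a minimal sketch. This is not BOLDigger's actual code: the function names `parse_fasta` and `batches` are hypothetical, and only the limit of 100 sequences per request comes from the abstract. Submitting each batch to the identification engine and pausing between requests are deliberately left out.

```python
from typing import Iterator, List, Sequence, Tuple

# Per-request limit of the BOLD identification engine, as stated in the abstract.
BATCH_SIZE = 100

# A record is a (header, sequence) pair; this layout is an assumption of the sketch.
Record = Tuple[str, str]


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records: Sequence[Record], size: int = BATCH_SIZE) -> Iterator[List[Record]]:
    """Yield successive slices of at most `size` records, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])
```

Each yielded batch would then be sent as one request, with a delay between requests chosen so that the server is not overloaded.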

Most cited references 5

Swarm v2: highly-scalable and high-resolution amplicon clustering

Frédéric Mahé, Torbjørn Rognes, Christopher Quince … (2015)

Previously we presented Swarm v1, a novel and open source amplicon clustering program that produced fine-scale molecular operational taxonomic units (OTUs), free of arbitrary global clustering thresholds and input-order dependency. Swarm v1 worked with an initial phase that used iterative single-linkage with a local clustering threshold (d), followed by a phase that used the internal abundance structures of clusters to break chained OTUs. Here we present Swarm v2, which has two important novel features: (1) a new algorithm for d = 1 that allows the computation time of the program to scale linearly with increasing amounts of data; and (2) the new fastidious option that reduces under-grouping by grafting low abundant OTUs (e.g., singletons and doubletons) onto larger ones. Swarm v2 also directly integrates the clustering and breaking phases, dereplicates sequencing reads with d = 0, outputs OTU representatives in fasta format, and plots individual OTUs as two-dimensional networks.

Multiple Multilocus DNA Barcodes from the Plastid Genome Discriminate Plant Species Equally Well

Aron Fazekas, Kevin Burgess, Prasad R Kesanakurti … (2008)

A universal barcode system for land plants would be a valuable resource, with potential utility in fields as diverse as ecology, floristics, law enforcement and industry. However, the application of plant barcoding has been constrained by a lack of consensus regarding the most variable and technically practical DNA region(s). We compared eight candidate plant barcoding regions from the plastome and one from the mitochondrial genome for how well they discriminated the monophyly of 92 species in 32 diverse genera of land plants (N = 251 samples). The plastid markers comprise portions of five coding (rpoB, rpoC1, rbcL, matK and 23S rDNA) and three non-coding (trnH-psbA, atpF–atpH, and psbK–psbI) loci. Our survey included several taxonomically complex groups, and in all cases we examined multiple populations and species. The regions differed in their ability to discriminate species, and in ease of retrieval, in terms of amplification and sequencing success. Single locus resolution ranged from 7% (23S rDNA) to 59% (trnH-psbA) of species with well-supported monophyly. Sequence recovery rates were related primarily to amplification success (85–100% for plastid loci), with matK requiring the greatest effort to achieve reasonable recovery (88% using 10 primer pairs). Several loci (matK, psbK–psbI, trnH-psbA) were problematic for generating fully bidirectional sequences. Setting aside technical issues related to amplification and sequencing, combining the more variable plastid markers provided clear benefits for resolving species, although with diminishing returns, as all combinations assessed using four to seven regions had only marginally different success rates (69–71%; values that were approached by several two- and three-region combinations). This performance plateau may indicate fundamental upper limits on the precision of species discrimination that is possible with DNA barcoding systems that include moderate numbers of plastid markers. 
Resolution to the contentious debate on plant barcoding should therefore involve increased attention to practical issues related to the ease of sequence recovery, global alignability, and marker redundancy in multilocus plant DNA barcoding systems.