AlignerBoost: A Generalized Software Toolkit for Boosting Next-Gen Sequencing Mapping Accuracy Using a Bayesian-Based Mapping Quality Framework

Abstract

Accurate mapping of next-generation sequencing (NGS) reads to reference genomes is crucial for almost all NGS applications and downstream analyses. Various repetitive elements in human and other higher eukaryotic genomes contribute in large part to ambiguously (non-uniquely) mapped reads. Most available NGS aligners attempt to address this by either removing all non-uniquely mapping reads, or reporting one random or "best" hit based on simple heuristics. Accurate estimation of the mapping quality of NGS reads is therefore critical albeit completely lacking at present. Here we developed a generalized software toolkit "AlignerBoost", which utilizes a Bayesian-based framework to accurately estimate mapping quality of ambiguously mapped NGS reads. We tested AlignerBoost with both simulated and real DNA-seq and RNA-seq datasets at various thresholds. In most cases, but especially for reads falling within repetitive regions, AlignerBoost dramatically increases the mapping precision of modern NGS aligners without significantly compromising the sensitivity even without mapping quality filters. When using higher mapping quality cutoffs, AlignerBoost achieves a much lower false mapping rate while exhibiting comparable or higher sensitivity compared to the aligner default modes, therefore significantly boosting the detection power of NGS aligners even using extreme thresholds. AlignerBoost is also SNP-aware, and higher quality alignments can be achieved if provided with known SNPs. AlignerBoost’s algorithm is computationally efficient, and can process one million alignments within 30 seconds on a typical desktop computer. AlignerBoost is implemented as a uniform Java application and is freely available at https://github.com/Grice-Lab/AlignerBoost.
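
The Bayesian idea behind a mapping quality score can be illustrated with a small sketch. This is not AlignerBoost's actual implementation (which is a Java application); it assumes a uniform prior over a read's candidate hits and takes each hit's log10-likelihood as given. The function names, the example values, and the cap of 60 are hypothetical choices made for illustration only.

```python
import math

def hit_posteriors(log10_likelihoods):
    """Posterior probability of each candidate hit under a uniform prior (assumption)."""
    if not log10_likelihoods:
        raise ValueError("need at least one candidate hit")
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    peak = max(log10_likelihoods)
    weights = [10.0 ** (ll - peak) for ll in log10_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

def phred_mapq(posterior, cap=60):
    """Phred-scaled mapping quality: -10 * log10(P(hit is wrong)), capped (hypothetical cap)."""
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(error))))

# Hypothetical read with one strong hit and two weaker hits in a repeat.
candidates = [-2.0, -5.0, -5.0]
posteriors = hit_posteriors(candidates)
mapqs = [phred_mapq(p) for p in posteriors]
print(mapqs)
```

Under these assumptions, a read with several equally likely hits receives a low score for each of them, whereas a read with a single plausible hit receives the capped maximum, so a quality cutoff can filter ambiguous placements without discarding every non-uniquely mapped read.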

Most cited references 16

featureCounts: An efficient general-purpose program for assigning sequence reads to genomic features

(2013)

Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.
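
Read summarization as described here amounts to counting, for each genomic feature, the reads whose alignments overlap it. The sketch below is a naive illustration of that idea, not featureCounts' algorithm: it does not implement chromosome hashing or feature blocking. It assumes half-open coordinates and counts a read once for every feature it overlaps. All names and coordinates are hypothetical.

```python
from collections import Counter

def summarize_reads(reads, features):
    """Count reads overlapping each feature (naive scan, for illustration).

    reads: iterable of (chrom, start, end) with half-open coordinates.
    features: iterable of (name, chrom, start, end) with half-open coordinates.
    """
    features = list(features)
    # Group features by chromosome so each read is only compared
    # against features on its own chromosome.
    by_chrom = {}
    for name, chrom, start, end in features:
        by_chrom.setdefault(chrom, []).append((name, start, end))
    # Start every feature at zero so features without reads are still reported.
    counts = Counter({name: 0 for name, _, _, _ in features})
    for chrom, r_start, r_end in reads:
        for name, f_start, f_end in by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each starts before the other ends.
            if r_start < f_end and f_start < r_end:
                counts[name] += 1
    return counts

# Hypothetical features and aligned reads.
features = [("geneA", "chr1", 100, 200),
            ("geneB", "chr1", 300, 400),
            ("geneC", "chr2", 100, 200)]
reads = [("chr1", 150, 180), ("chr1", 190, 210), ("chr1", 250, 280),
         ("chr1", 390, 420), ("chr2", 0, 100)]
counts = summarize_reads(reads, features)
print(dict(counts))
```

The per-read scan over all features on a chromosome is what makes this version slow on real data; the abstract credits featureCounts' speed to the hashing and blocking techniques that this sketch omits.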

ART: a next-generation sequencing read simulator.

Weichun Huang, Leping Li, Jason R. Myers … (2012)

ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art.

Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads.

Gerton Lunter, Martin Goodson (2011)

High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly so for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher useable sequence yield and improved accuracy compared to that of existing software.

Author and article information

Contributors

Timothée Poisot: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput. Biol

Journal ID (publisher-id): plos

Journal ID (pmc): ploscomp

Title: PLoS Computational Biology

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Electronic): 5 October 2016

Publication date Collection: October 2016

Volume: 12

Issue: 10

Electronic Location Identifier: e1005096

Affiliations

Department of Dermatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

Université de Montréal, Canada

Author notes

The authors have declared that no competing interests exist.

Conceived and designed the experiments: QZ.
Performed the experiments: QZ.
Analyzed the data: QZ.
Contributed reagents/materials/analysis tools: EAG.
Wrote the paper: QZ EAG.

* E-mail: zhengqi@mail.med.upenn.edu (QZ); egrice@mail.med.upenn.edu (EAG)

Article

Publisher ID: PCOMPBIOL-D-16-00658

DOI: 10.1371/journal.pcbi.1005096

PMC ID: 5051939

PubMed ID: 27706155

SO-VID: 9da19d06-1efe-4db5-a5bb-2b56a1cf0bd0

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received : 21 April 2016

Date accepted : 2 August 2016

Page count

Figures: 8, Tables: 0, Pages: 20

Funding

Funded by: funder-id http://dx.doi.org/10.13039/100000002, National Institutes of Health;

Funded by: funder-id http://dx.doi.org/10.13039/100000069, National Institute of Arthritis and Musculoskeletal and Skin Diseases;

Award ID: R01-AR066663

Award Recipient : Elizabeth A. Grice

Funded by: funder-id http://dx.doi.org/10.13039/100000056, National Institute of Nursing Research;

Award ID: R01-NR015639

Award Recipient : Elizabeth A. Grice

Funded by: funder-id http://dx.doi.org/10.13039/100006920, University of Pennsylvania;

This work was supported by grants from the National Institutes of Health: the National Institute of Arthritis and Musculoskeletal and Skin Diseases (Grant R01 AR066663 to EAG) and the National Institute of Nursing Research (Grant R01 NR015639 to EAG), and by the Department of Dermatology at the University of Pennsylvania. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

Data Availability: All relevant data are within the paper and its Supporting Information files.
