MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping

Abstract

MOSAIK is a stable, sensitive and open-source program for mapping second- and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific Biosciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well suited to capturing mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs), as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network-based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK-assigned and actual mapping qualities greater than 0.98. So that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).
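
The general idea of pairing hash-based seeding with Smith-Waterman alignment can be illustrated with a minimal Python sketch. This is not MOSAIK's implementation: the function names (`build_kmer_index`, `candidate_diagonals`, `smith_waterman_score`, `map_read`), the k-mer size, the window padding and the scoring values are all assumptions chosen for illustration, and the simple diagonal vote stands in for whatever clustering the program actually performs. The sketch hashes reference k-mers, picks the diagonal with the most k-mer hits as the candidate region, and scores the read against that region with a local alignment.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_diagonals(read, index, k):
    """Count k-mer hits per diagonal (reference position minus read offset).

    Hits on the same diagonal support the same placement of the read,
    so the best-supported diagonal marks a candidate region.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            votes[ref_pos - offset] += 1
    return votes


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def map_read(read, reference, k=4, pad=2):
    """Return (window start, local alignment score), or None if no k-mer hits."""
    index = build_kmer_index(reference, k)
    votes = candidate_diagonals(read, index, k)
    if not votes:
        return None
    diagonal = max(votes, key=votes.get)
    # Align only against a padded window around the best-supported diagonal.
    start = max(0, diagonal - pad)
    end = min(len(reference), diagonal + len(read) + pad)
    return start, smith_waterman_score(read, reference[start:end])


reference = "TTGACCGTAGGCTAACGTTAGCCATGA"
print(map_read("GGCTAACGTT", reference))  # exact substring of the reference
print(map_read("GGCTATCGTT", reference))  # same read with one mismatch
```

Restricting the dynamic programming to a window around the seeded region is what keeps this kind of approach tractable: the quadratic-time alignment runs on a few dozen bases instead of the whole reference.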

Most cited references 42

Identification of common molecular subsequences.

T.F. Smith, M.S. Waterman (1981)

Cited 1696 times

An integrated semiconductor device enabling non-optical genome sequencing.

Jonathan Rothberg, Wolfgang Hinz, Todd M. Rearick … (2011)

The seminal importance of DNA sequencing to the life sciences, biotechnology and medicine has driven the search for more scalable and lower-cost solutions. Here we describe a DNA sequencing technology in which scalable, low-cost semiconductor manufacturing techniques are used to make an integrated circuit able to directly perform non-optical DNA sequencing of genomes. Sequence data are obtained by directly sensing the ions produced by template-directed DNA polymerase synthesis using all-natural nucleotides on this massively parallel semiconductor-sensing device or ion chip. The ion chip contains ion-sensitive, field-effect transistor-based sensors in perfect register with 1.2 million wells, which provide confinement and allow parallel, simultaneous detection of independent sequencing reactions. Use of the most widely used technology for constructing integrated circuits, the complementary metal-oxide semiconductor (CMOS) process, allows for low-cost, large-scale production and scaling of the device to higher densities and larger array sizes. We show the performance of the system by sequencing three bacterial genomes, its robustness and scalability by producing ion chips with up to 10 times as many sensors and sequencing a human genome.

Cited 549 times

Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads.

Gerton Lunter, Martin Goodson (2011)

High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read is derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly so for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher useable sequence yield and improved accuracy compared to that of existing software.

Cited 523 times

Author and article information

Contributors

Chuhsing Kate Hsiao: Role: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (iso-abbrev): PLoS ONE

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Electronic): 1932-6203

Publication date Collection: 2014

Publication date (Electronic): 5 March 2014

Volume: 9

Issue: 3

Electronic Location Identifier: e90581

Affiliations

[1] Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America

[2] Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America

National Taiwan University, Taiwan

Author notes

* E-mail: wanping.lee@bc.edu

Competing Interests: The authors have declared that no competing interests exist.

Conceived and designed the experiments: WL AW EG GM. Performed the experiments: WL. Analyzed the data: WL AW. Contributed reagents/materials/analysis tools: MS CS EG. Wrote the paper: WL AW. Started the project: MS.

Article

Publisher ID: PONE-D-13-47613

DOI: 10.1371/journal.pone.0090581

PMC ID: 3944147

PubMed ID: 24599324

SO-VID: 5fa02417-0693-4b70-b038-2c97406a3a9e

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 12 November 2013

Date accepted: 31 January 2014

Page count

Pages: 11

Funding

NIH: 5R01HG004719-04; NIH: 3U01HG006513-02S1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
