Identification of rare alleles and their carriers using compressed se(que)nsing

Abstract

Identification of rare variants by resequencing is important both for detecting novel variations and for screening individuals for known disease alleles. New technologies enable low-cost resequencing of target regions, although it is still prohibitively expensive to test more than a few individuals. We propose a novel pooling design that enables the recovery of novel or known rare alleles and their carriers in groups of individuals. The method is based on a Compressed Sensing (CS) approach, which is general, simple and efficient. CS allows the use of generic algorithmic tools for simultaneous identification of multiple variants and their carriers. We model the experimental procedure and show via computer simulations that it enables the recovery of rare alleles and their carriers in larger groups than was previously possible. Our approach can also be combined with barcoding techniques to provide a feasible solution based on current resequencing costs. For example, when targeting a small enough genomic region (∼100 bp) and using only ∼10 sequencing lanes and ∼10 distinct barcodes per lane, one recovers the identity of 4 rare allele carriers out of a population of over 4000 individuals. We demonstrate the performance of our approach over several publicly available experimental data sets.
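
The pooling idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the population size, the carrier indices and the noise-free "number of carriers per pool" readout are all assumptions made for illustration, and an exhaustive search over small carrier sets stands in for the generic sparse-recovery solvers that a CS approach would use.

```python
import itertools
import random


def make_pooling_design(n_individuals, n_pools, rng):
    """Hypothetical design: each individual joins each pool with probability 1/2."""
    return [[rng.randint(0, 1) for _ in range(n_individuals)]
            for _ in range(n_pools)]


def measure(design, carriers):
    """Assumed noise-free readout: number of carriers present in each pool."""
    return [sum(row[i] for i in carriers) for row in design]


def consistent_carrier_sets(design, readout, max_carriers):
    """Return every carrier set of size <= max_carriers that reproduces the readout.

    This is a brute-force sparse search, used here only because it is easy to
    verify; it is not a compressed sensing solver.
    """
    n_individuals = len(design[0])
    matches = []
    for size in range(max_carriers + 1):
        for candidate in itertools.combinations(range(n_individuals), size):
            if measure(design, candidate) == readout:
                matches.append(set(candidate))
    return matches


rng = random.Random(0)        # fixed seed so the sketch is repeatable
design = make_pooling_design(n_individuals=30, n_pools=10, rng=rng)
true_carriers = {4, 17}       # hypothetical carriers, chosen for illustration
readout = measure(design, true_carriers)
candidates = consistent_carrier_sets(design, readout, max_carriers=2)
print(candidates)             # always includes the true carrier set
```

Here 10 pooled readouts are used to screen 30 individuals, and the true carrier set is always among the candidates; whether it is the only candidate depends on the random design. The exhaustive search grows combinatorially with the population and the number of carriers, which is why a practical scheme would rely on efficient sparse-recovery algorithms instead.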

Author and article information

Journal

Journal ID (nlm-ta): Nucleic Acids Res

Journal ID (publisher-id): nar

Journal ID (hwp): nar

Title: Nucleic Acids Research

Publisher: Oxford University Press

ISSN (Print): 0305-1048

ISSN (Electronic): 1362-4962

Publication date Collection: October 2010

Publication date (Print): October 2010

Publication date (Electronic): 10 August 2010

Publication date PMC-release: 10 August 2010

Volume: 38

Issue: 19

Page: e179

Affiliations

¹Department of Computer Science, The Open University of Israel, Raanana 43107, ²Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel and ³Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA

Author notes

*To whom correspondence should be addressed. Tel: +972-9-7781252; Fax: +972-9-7780605; Email: shental@openu.ac.il

Article

Publisher ID: gkq675

DOI: 10.1093/nar/gkq675

PMC ID: 2965256

PubMed ID: 20699269

SO-VID: ee62e3b4-2b0f-47bd-ae0c-1aacbe3bb88c

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 17 January 2010

Date revision received: 20 June 2010

Date accepted: 19 July 2010
