
The variant call format and VCFtools


      Abstract

      Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in compressed form and can be indexed for fast retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging and comparing, and also provides a general Perl API.

      Availability: http://vcftools.sourceforge.net

      Contact: rd@sanger.ac.uk
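
To make the record layout described in the summary concrete, the following is a minimal Python sketch, not part of VCFtools and not its Perl API. It assumes the eight fixed tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and uses a made-up two-record example. The helper names `parse_info`, `read_records` and `in_region` are hypothetical, and the region filter is a plain linear scan over uncompressed text, not the indexed retrieval the abstract refers to.

```python
import io

# Hypothetical two-record VCF text, made up purely for illustration.
EXAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14
20\t1110696\t.\tA\tG,T\t67\tPASS\tNS=2;DP=10
"""

# The eight fixed columns of a VCF data line.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(field):
    """Split an INFO string such as 'NS=3;DP=14' into a dict.

    Keys without '=' are flags and map to True; values stay as strings.
    """
    info = {}
    if field == ".":
        return info
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def read_records(handle):
    """Yield one dict per data line, skipping meta and header lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        values = line.split("\t")
        record = dict(zip(FIXED_COLUMNS, values[:8]))
        record["POS"] = int(record["POS"])          # 1-based position
        record["ALT"] = record["ALT"].split(",")    # one or more alternate alleles
        record["INFO"] = parse_info(record["INFO"])
        yield record


def in_region(records, chrom, start, end):
    """Linear-scan filter: records on chrom with start <= POS <= end."""
    return [r for r in records if r["CHROM"] == chrom and start <= r["POS"] <= end]


records = list(read_records(io.StringIO(EXAMPLE_VCF)))
hits = in_region(records, "20", 1, 100000)
print([(r["POS"], r["REF"], r["ALT"]) for r in hits])
```

For real data sets, a compressed and indexed file queried through a dedicated tool would replace the linear scan; the sketch only shows how a single record decomposes into its fields.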


            Author and article information

            Affiliations
            1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, UK
            2 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
            3 Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
            4 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
            5 Department of Biology, Boston College, MA 02467, USA
            6 National Institutes of Health National Center for Biotechnology Information, MD 20894, USA
            7 Department of Statistics, University of Oxford, Oxford OX1 3TG, UK
            Author notes
            * To whom correspondence should be addressed.

            † The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

            Associate Editor: John Quackenbush

            Journal
            Title: Bioinformatics
            Publisher: Oxford University Press
            ISSN: 1367-4803, 1367-4811
            Issue date: 1 August 2011
            Published online: 7 June 2011
            Volume: 27
            Issue: 15
            Pages: 2156-2158
            PMC ID: 3137218
            PubMed ID: 21653522
            DOI: 10.1093/bioinformatics/btr330
            Publisher ID: btr330
            © The Author(s) 2011. Published by Oxford University Press.

            This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

            Counts
            Pages: 3
            Categories
            Applications Note
            Sequence Analysis

            Bioinformatics & Computational biology
