      Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data

      research-article
      BMC Genomics
      BioMed Central
      The International Conference on Intelligent Biology and Medicine (ICIBM)
      22-24 April 2012


          Abstract

          Background

          Accurate calling of SNPs and genotypes from next-generation sequencing data is an essential prerequisite for most human genetics studies. Several computational steps are required or recommended when translating raw sequencing data into final calls. However, it remains unclear whether each step actually contributes to variant-calling performance and how it affects accuracy, which makes it difficult to select and order appropriate steps for deriving high-quality variants from different sequencing data. In this study, we systematically assessed the relative contribution of each step to the accuracy of variant calling from Illumina DNA sequencing data.

          Results

          We found that the read preprocessing step did not improve the accuracy of variant calling, contrary to general expectation. Although trimming off low-quality tails helped align more reads, it introduced many false positives. The ability of duplicate marking, local realignment, and recalibration to help eliminate false-positive variants depended on the sequencing depth. Rearranging these steps did not affect the results. The relative performance of three popular multi-sample SNP callers, SAMtools, GATK, and GlfMultiples, also varied with the sequencing depth.
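
          To make the trimming step concrete, here is a minimal sketch of removing a low-quality 3' tail from a single read. It is an illustration only, not the procedure used in the study: the function names, the Phred+33 encoding, and the quality cutoff of 20 are assumptions chosen for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 (Phred+33 encoding) is an assumption; older
    Illumina pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_quality=20, offset=33):
    """Drop trailing bases whose Phred score is below min_quality.

    Hypothetical helper: min_quality=20 is an illustrative cutoff,
    not a value taken from the article.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string must have equal length")

    scores = phred_scores(quality_string, offset)

    # Walk back from the 3' end until a base meets the cutoff.
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1

    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    # 'I' decodes to Phred 40 and '#' to Phred 2 under Phred+33.
    read, quals = trim_low_quality_tail("ACGTACGT", "IIIII###")
    print(read, quals)
```

          A shorter read can align where the untrimmed read would not, which is consistent with the observation that trimming helped align more reads. The sketch says nothing about why the extra alignments produced false positives.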

          Conclusions

          Our findings clarify the necessity and effectiveness of computational steps for improving the accuracy of SNP and genotype calls from Illumina sequencing data and can serve as a general guideline for choosing SNP calling strategies for data with different coverage.


                Author and article information

                BMC Genomics
                BioMed Central
                ISSN: 1471-2164
                17 December 2012
                13(Suppl 8): S8
                Affiliations
                [1] Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
                [2] Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
                [3] Vanderbilt Epidemiology Center, Vanderbilt University, Nashville, TN 37232, USA
                [4] Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
                [5] Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
                Article
                1471-2164-13-S8-S8
                10.1186/1471-2164-13-S8-S8
                3535703
                23281772
                Copyright ©2012 Liu et al.; licensee BioMed Central Ltd.

                This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                The International Conference on Intelligent Biology and Medicine (ICIBM)
                Nashville, TN, USA
                22-24 April 2012
                Categories
                Research

                Genetics