
      Information theoretic alignment free variant calling

      research-article


          Abstract

          While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment- and assembly-free variant calling method based on information theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffix of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence.
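
          The context model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation or their exact statistic: the function names, the pseudocount of 0.5, and the use of a symmetrised Kullback–Leibler divergence between two samples are all assumptions made for illustration. The sketch only shows the inter-sample half of the idea (how differently two samples fill the same context) and does not model intra-sample stability.

```python
from collections import Counter, defaultdict
from math import log

NUCLEOTIDES = "ACGT"


def context_counts(reads, k):
    """Count the centre nucleotide seen in each (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        # Only positions with k bases on both sides have a full context.
        for i in range(k, len(read) - k):
            base = read[i]
            if base in NUCLEOTIDES:
                prefix = read[i - k:i]
                suffix = read[i + 1:i + 1 + k]
                counts[(prefix, suffix)][base] += 1
    return counts


def smoothed_distribution(counter, pseudocount=0.5):
    """Multinomial estimate over A, C, G, T with an assumed additive pseudocount."""
    total = sum(counter[b] for b in NUCLEOTIDES) + pseudocount * len(NUCLEOTIDES)
    return [(counter[b] + pseudocount) / total for b in NUCLEOTIDES]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl_by_context(reads_a, reads_b, k):
    """Score each context shared by two samples by a symmetrised KL divergence."""
    counts_a = context_counts(reads_a, k)
    counts_b = context_counts(reads_b, k)
    scores = {}
    for ctx in counts_a.keys() & counts_b.keys():
        p = smoothed_distribution(counts_a[ctx])
        q = smoothed_distribution(counts_b[ctx])
        scores[ctx] = kl_divergence(p, q) + kl_divergence(q, p)
    return scores


if __name__ == "__main__":
    # Toy samples: the two differ only in the base between "AC" and "TA".
    sample_a = ["ACGTA"] * 10
    sample_b = ["ACTTA"] * 10
    for ctx, score in symmetric_kl_by_context(sample_a, sample_b, k=2).items():
        print(ctx, round(score, 3))
```

          In this toy input the single shared context ("AC", "TA") receives a large score because one sample always places G in it and the other always places T; two identical samples would score zero. A real pipeline would need to handle reverse complements, sequencing error, and coverage differences, none of which are addressed here.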

          The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to standard reference-based calls and another reference-free method (DiscoSNP++). Comparisons against reference-based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.

          Most cited references (11)


          Population genomics of post-vaccine changes in pneumococcal epidemiology

          Whole genome sequencing of 616 asymptomatically carried pneumococci was used to study the impact of the 7-valent pneumococcal conjugate vaccine. Comparison of closely related isolates revealed the role of transformation in facilitating capsule switching to non-vaccine serotypes and the emergence of drug resistance. However, such recombination was found to occur at significantly different rates across the species, and the evolution of the population was primarily driven by changes in the frequency of distinct genotypes extant pre-vaccine. These alterations resulted in little overall effect on accessory genome composition at the population level, contrasting with the fall in pneumococcal disease rates after the vaccine’s introduction.

            An Information Measure for Classification


              Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

              Heng Li (2012)
              Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information of the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method on 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. In the methodological aspects, we proposed FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: hengli@broadinstitute.org

                Author and article information

                Contributors
                Journal
                peerj-cs
                PeerJ Computer Science
                PeerJ Comput. Sci.
                PeerJ Inc. (San Francisco, USA)
                ISSN: 2376-5992
                Published: 25 July 2016
                Volume: 2
                Article: e71
                Affiliations
                [1] IBM Research – Australia, Carlton, VIC, Australia
                [2] Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
                [3] Centre For Epidemiology and Biostatistics, The University of Melbourne, Parkville, VIC, Australia
                [4] School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC, Australia
                Article
                cs-71
                10.7717/peerj-cs.71
                4f70b95f-52d4-4fff-8017-087f9d430c4d
                ©2016 Bedo et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.


                History
                Received: 30 December 2015
                Accepted: 25 May 2016
                Funding
                The authors received no funding for this work.
                Categories
                Bioinformatics
                Computational Biology

                Computer science
                Alignment free, Variant, Assembly free, Genome, Bacteria, Feature extraction
