Information theoretic alignment free variant calling


Abstract

While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment and assembly free variant calling method based on information theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffix of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence.

The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks with performance similar to standard reference based calls and another reference free method (DiscoSNP++). Comparisons against reference based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
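The context model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pseudocount smoothing, and the use of a plain Kullback–Leibler divergence between two samples are assumptions made for illustration, since the abstract only says the statistic is based on that divergence. For every position in a read, the sketch counts the middle nucleotide given its length-k prefix and suffix, turns the counts into a multinomial estimate, and compares two samples on a shared context.

```python
from collections import Counter, defaultdict
from math import log

NUCLEOTIDES = "ACGT"


def context_counts(reads, k):
    """Count the middle nucleotide for each (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        # Only positions with k bases available on both sides are used.
        for i in range(k, len(read) - k):
            middle = read[i]
            if middle not in NUCLEOTIDES:
                continue  # skip ambiguous bases such as N
            context = (read[i - k:i], read[i + 1:i + 1 + k])
            counts[context][middle] += 1
    return counts


def multinomial_estimate(counter, pseudocount=1.0):
    """Smoothed multinomial over A, C, G, T (the pseudocount is an assumption)."""
    total = sum(counter[n] for n in NUCLEOTIDES) + pseudocount * len(NUCLEOTIDES)
    return {n: (counter[n] + pseudocount) / total for n in NUCLEOTIDES}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(p[n] * log(p[n] / q[n]) for n in NUCLEOTIDES)


# Hypothetical toy samples that differ at a single position (T versus A).
sample_a = ["ACGTACGT"] * 10
sample_b = ["ACGAACGT"] * 10
k = 2

counts_a = context_counts(sample_a, k)
counts_b = context_counts(sample_b, k)

# The context shared by both samples around the differing base.
context = ("CG", "AC")
p = multinomial_estimate(counts_a[context])
q = multinomial_estimate(counts_b[context])
divergence = kl_divergence(p, q)
print(context, round(divergence, 3))
```

A large divergence on a context that both samples share suggests the middle nucleotide differs between them, while a value near zero suggests the context is stable. The published method additionally assesses stability within each sample, which this sketch does not attempt.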


Author and article information

Contributors

Justin Bedo

Journal

Journal ID (publisher-id): peerj-cs

Journal ID (pmc): peerj-cs

Journal ID (nlm-ta): PeerJ Comput. Sci.

Title: PeerJ Computer Science

Abbreviated Title: PeerJ Comput. Sci.

Publisher: PeerJ Inc. (San Francisco, USA)

ISSN (Electronic): 2376-5992

Publication date (Electronic): 25 July 2016

Volume: 2

Electronic Location Identifier: e71

Affiliations

[1] IBM Research – Australia, Carlton, VIC, Australia

[2] Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia

[3] Centre For Epidemiology and Biostatistics, The University of Melbourne, Parkville, VIC, Australia

[4] School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC, Australia

Article

Publisher ID: cs-71

DOI: 10.7717/peerj-cs.71

SO-VID: 4f70b95f-52d4-4fff-8017-087f9d430c4d

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

History

Date received: 30 December 2015

Date accepted: 25 May 2016

Funding

The authors received no funding for this work.
