7
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local k-mer spectra (frequency distribution of k-mers) was developed and used. The results were analyzed by including positional information based on visualizations as genomic maps and by applying basic vector correlation methods. This analysis was concentrated on small word lengths (1 ≤ k ≤ 4) on relatively small viral genomes of Papillomaviridae and Herpesviridae, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in Papillomaviridae and Herpesviridae formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the k-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown k-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. Based on these results, we suggest k-mer analysis with positional resolution as a method for closing a gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high pace of standard k-mer analysis.

          Related collections

          Most cited references27

          • Record: found
          • Abstract: found
          • Article: not found

          GenBank

          GenBank® is a comprehensive database that contains publicly available DNA sequences for more than 165 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps to ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.
            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Local rates of recombination are positively correlated with GC content in the human genome.

              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Next-generation phylogenomics

              Abstract Thanks to advances in next-generation technologies, genome sequences are now being generated at breadth (e.g. across environments) and depth (thousands of closely related strains, individuals or samples) unimaginable only a few years ago. Phylogenomics – the study of evolutionary relationships based on comparative analysis of genome-scale data – has so far been developed as industrial-scale molecular phylogenetics, proceeding in the two classical steps: multiple alignment of homologous sequences, followed by inference of a tree (or multiple trees). However, the algorithms typically employed for these steps scale poorly with number of sequences, such that for an increasing number of problems, high-quality phylogenomic analysis is (or soon will be) computationally infeasible. Moreover, next-generation data are often incomplete and error-prone, and analysis may be further complicated by genome rearrangement, gene fusion and deletion, lateral genetic transfer, and transcript variation. Here we argue that next-generation data require next-generation phylogenomics, including so-called alignment-free approaches. Reviewers Reviewed by Mr Alexander Panchin (nominated by Dr Mikhail Gelfand), Dr Eugene Koonin and Prof Peter Gogarten. For the full reviews, please go to the Reviewers’ comments section.
                Bookmark

                Author and article information

                Contributors
                Role: Academic Editor
                Journal
                Genes (Basel)
                Genes (Basel)
                genes
                Genes
                MDPI
                2073-4425
                19 April 2017
                April 2017
                : 8
                : 4
                : 122
                Affiliations
                [1 ]Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany; Sievers_Aaron@ 123456web.de (A.S.); KatharinaBosiek@ 123456gmx.de (K.B.); MarcBisch@ 123456gmx.de (M.B.); chrisdreessen@ 123456yahoo.de (C.D.); jaschelite@ 123456googlemail.com (J.R.); Fross@ 123456stud.uni-heidelberg.de (P.F.); hausmann@ 123456kip.uni-heidelberg.de (M.H.)
                [2 ]Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany
                Author notes
                [* ]Correspondence: hilden@ 123456kip.uni-heidelberg.de ; Tel.: +49-151-559-63919
                Article
                genes-08-00122
                10.3390/genes8040122
                5406869
                28422050
                d3cdce34-f421-49fc-af97-4554145208f9
                © 2017 by the authors.

                Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( http://creativecommons.org/licenses/by/4.0/).

                History
                : 10 February 2017
                : 04 April 2017
                Categories
                Article

                k-mer,k-mer analysis,sequence analysis,alignment-free,positional features

                Comments

                Comment on this article