      GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists

      research-article
      GigaScience
      BioMed Central
      K-mers, Sequence analysis, Next-generation sequencing

          Abstract

          Background

          K-mer-based methods of genome analysis have attracted great interest because they do not require genome assembly and can be performed directly on sequencing reads. Many analysis tasks require one to compare k-mer lists from different sequences to find words that are either unique to a specific sequence or common to many sequences. However, no stand-alone k-mer analysis tool currently allows one to perform these algebraic set operations.
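As a rough illustration of what a "k-mer list" built directly from reads might look like, the sketch below counts every length-k word in a set of reads without any assembly step. The function name `kmer_counts`, the in-memory `Counter` representation, and the choice to skip windows containing non-ACGT characters are assumptions made for this example; they do not describe the toolkit's own file format or behaviour.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(reads, k):
    """Count every k-mer in the given reads (hypothetical helper).

    Windows containing characters other than A, C, G, T are skipped;
    this is an assumption of the sketch, not a documented rule.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of width k across the read, one base at a time.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG"]
    for kmer, n in sorted(kmer_counts(reads, 3).items()):
        print(kmer, n)
```

Sorting the counted words, as in the usage lines, gives a simple list that can be compared against the list from another sample.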

          Findings

          We have developed the GenomeTester4 toolkit, which contains a novel tool GListCompare for performing union, intersection and complement (difference) set operations on k-mer lists. We provide examples of how these general operations can be combined to solve a variety of biological analysis tasks.
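The three set operations named above, and the way they combine into tasks such as "words unique to one sequence" or "words common to many sequences", can be sketched with plain Python sets. This is only a conceptual illustration: `kmer_set`, `unique_to` and `common_to_all` are hypothetical helpers, the toy sequences are invented, and nothing here reflects GListCompare's command-line interface, file format, or internal algorithm.

```python
from functools import reduce


def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence (hypothetical helper)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_to(target, others):
    """K-mers in `target` absent from every set in `others`:
    the complement (difference) of target against the union of the rest."""
    return target - set().union(*others)


def common_to_all(kmer_sets):
    """K-mers shared by every set: a repeated intersection."""
    return reduce(set.intersection, kmer_sets)


if __name__ == "__main__":
    # Toy sequences invented for the example.
    a = kmer_set("ACGTACGA", 4)
    b = kmer_set("TTACGTAA", 4)

    print("union:       ", sorted(a | b))
    print("intersection:", sorted(a & b))
    print("difference:  ", sorted(a - b))
    print("unique to a: ", sorted(unique_to(a, [b])))
    print("common:      ", sorted(common_to_all([a, b])))
```

Sets suit small examples; for genome-scale lists a sorted, on-disk representation would be a more realistic choice, but that design is outside the scope of this sketch.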

          Conclusions

          GenomeTester4 can be used to simplify k-mer list manipulation for many biological analysis tasks.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s13742-015-0097-y) contains supplementary material, which is available to authorized users.

                Author and article information

                Contributors
                lauris.kaplinski@ut.ee
                maarja.lepamets@ut.ee
                maido.remm@ut.ee
                Journal
                GigaScience
                BioMed Central (London)
                ISSN: 2047-217X
                Published: 3 December 2015
                Volume: 4
                Article number: 58
                Affiliations
                Department of Bioinformatics, University of Tartu, Riia 23, Tartu, 51010 Estonia
                Estonian Biocentre, Riia 23B, Tartu, 51010 Estonia
                Article
                97
                DOI: 10.1186/s13742-015-0097-y
                PMCID: PMC4669650
                PMID: 26640690
                dab4c078-c0e7-46c5-a38d-da50fa5e7685
                © Kaplinski et al. 2015

                Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 17 April 2015
                Accepted: 11 November 2015
                Categories
                Technical Note
