      Is Open Access

      On the optimal trimming of high-throughput mRNA sequence data

      research-article
      Matthew D. MacManes 1, 2
      Frontiers in Genetics
      Frontiers Media S.A.
      quality trimming, quality control, illumina, RNAseq, assembly error


          Abstract

          The widespread and rapid adoption of high-throughput sequencing technologies has afforded researchers the opportunity to gain a deep understanding of genome level processes that underlie evolutionary change, and perhaps more importantly, the links between genotype and phenotype. In particular, researchers interested in functional biology and adaptation have used these technologies to sequence mRNA transcriptomes of specific tissues, which in turn are often compared to other tissues, or other individuals with different phenotypes. While these techniques are extremely powerful, careful attention to data quality is required. In particular, because high-throughput sequencing is more error-prone than traditional Sanger sequencing, quality trimming of sequence reads should be an important step in all data processing pipelines. While several software packages for quality trimming exist, no general guidelines for the specifics of trimming have been developed. Here, using empirically derived sequence data, I provide general recommendations regarding the optimal strength of trimming, specifically in mRNA-Seq studies. Although very aggressive quality trimming is common, this study suggests that a more gentle trimming, specifically of those nucleotides whose Phred score <2 or <5, is optimal for most studies across a wide variety of metrics.
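
          To make the thresholds in the abstract concrete, the following is a minimal sketch, not the article's own pipeline. It assumes the standard Phred definition Q = -10 log10(P), Phred+33 quality encoding, and a simple rule that removes bases from both ends of a read while their score is below the threshold. The function names are hypothetical and chosen for illustration only.

```python
def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred score, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding by default; older Illumina data may use
    an offset of 64.
    """
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_ends(seq: str, qual: str, threshold: int = 5, offset: int = 33):
    """Strip bases from both ends while their Phred score is below threshold.

    This is an illustrative end-trimming rule only; real trimmers offer
    additional strategies such as sliding windows.
    """
    scores = decode_qualities(qual, offset)
    start, end = 0, len(scores)
    while start < end and scores[start] < threshold:
        start += 1
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    # A Phred score of 2 or 5 still implies a high error probability,
    # which is why trimming at these thresholds counts as gentle.
    for q in (2, 5, 20):
        print(q, round(phred_to_error_prob(q), 3))

    # Toy read: '#' encodes Q=2, 'I' encodes Q=40, '!' encodes Q=0.
    print(trim_low_quality_ends("ACGTACGT", "##IIII#!", threshold=5))
    print(trim_low_quality_ends("ACGTACGT", "##IIII#!", threshold=2))
```

          With a threshold of 5 the toy read loses four bases, while a threshold of 2 removes only the final base. This illustrates how small changes in trimming strength alter the amount of retained sequence.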


                Author and article information

                Journal
                Front. Genet. (Frontiers in Genetics)
                Frontiers Media S.A.
                ISSN: 1664-8021
                Published: 31 January 2014
                Volume: 5
                Article: 13
                Affiliations
                [1] Department of Molecular, Cellular and Biomedical Sciences, University of New Hampshire, Durham, NH, USA
                [2] Hubbard Center for Genome Studies, Durham, NH, USA
                Author notes

                Edited by: Mick Watson, The Roslin Institute, UK

                Reviewed by: C. Titus Brown, Michigan State University, USA; Christian Cole, University of Dundee, UK

                *Correspondence: Matthew D. MacManes, Department of Molecular, Cellular and Biomedical Sciences, University of New Hampshire, Rudman Hall #189, 46 College Road, Durham, NH 03824, USA; e-mail: macmanes@gmail.com; Twitter: @PeroMHC

                This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics.

                Article
                DOI: 10.3389/fgene.2014.00013
                PMCID: PMC3908319
                PMID: 24567737
                Copyright © 2014 MacManes.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                Received: 14 November 2013
                Accepted: 14 January 2014
                Page count
                Figures: 4, Tables: 1, Equations: 0, References: 36, Pages: 7, Words: 5232
                Categories
                Genetics
                Original Research Article

