11
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      A comparative evaluation of hybrid error correction methods for error-prone long reads

      research-article
      1 , 1 , 1 , 2 , 3 ,
      Genome Biology
      BioMed Central

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          Third-generation sequencing technologies have advanced the progress of the biological research by generating reads that are substantially longer than second-generation sequencing technologies. However, their notorious high error rate impedes straightforward data analysis and limits their application. A handful of error correction methods for these error-prone long reads have been developed to date. The output data quality is very important for downstream analysis, whereas computing resources could limit the utility of some computing-intense tools. There is a lack of standardized assessments for these long-read error-correction methods.

          Results

          Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences.

          Conclusions

          Taking into account all of these metrics, we provide a suggestive guideline for method choice based on available data size, computing resources, and individual research goals.

          Electronic supplementary material

          The online version of this article (10.1186/s13059-018-1605-z) contains supplementary material, which is available to authorized users.

          Related collections

          Most cited references33

          • Record: found
          • Abstract: found
          • Article: not found

          Assembly algorithms for next-generation sequencing data.

          The emergence of next-generation sequencing platforms led to resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly. Copyright 2010 Elsevier Inc. All rights reserved.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis

            Background: Given the demonstrated utility of Third Generation Sequencing [Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT)] long reads in many studies, a comprehensive analysis and comparison of their data quality and applications is in high demand. Methods: Based on the transcriptome sequencing data from human embryonic stem cells, we analyzed multiple data features of PacBio and ONT, including error pattern, length, mappability and technical improvements over previous platforms. We also evaluated their application to transcriptome analyses, such as isoform identification and quantification and characterization of transcriptome complexity, by comparing the performance of size-selected PacBio, non-size-selected ONT and their corresponding Hybrid-Seq strategies (PacBio+Illumina and ONT+Illumina). Results: PacBio shows overall better data quality, while ONT provides a higher yield. As with data quality, PacBio performs marginally better than ONT in most aspects for both long reads only and Hybrid-Seq strategies in transcriptome analysis. In addition, Hybrid-Seq shows superior performance over long reads only in most transcriptome analyses. Conclusions: Both PacBio and ONT sequencing are suitable for full-length single-molecule transcriptome analysis. As this first use of ONT reads in a Hybrid-Seq analysis has shown, both PacBio and ONT can benefit from a combined Illumina strategy. The tools and analytical methods developed here provide a resource for future applications and evaluations of these rapidly-changing technologies.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Improving PacBio Long Read Accuracy by Short Read Alignment

              The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without scarifying alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than 3 folds. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq study. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
                Bookmark

                Author and article information

                Contributors
                kinfai.au@osumc.edu
                Journal
                Genome Biol
                Genome Biol
                Genome Biology
                BioMed Central (London )
                1474-7596
                1474-760X
                4 February 2019
                4 February 2019
                2019
                : 20
                : 26
                Affiliations
                [1 ]ISNI 0000 0004 1936 8294, GRID grid.214572.7, Department of Internal Medicine, , University of Iowa, ; Iowa City, IA 52242 USA
                [2 ]ISNI 0000 0004 1936 8294, GRID grid.214572.7, Department of Biostatistics, , University of Iowa, ; Iowa City, IA 52242 USA
                [3 ]ISNI 0000 0001 2285 7943, GRID grid.261331.4, Department of Biomedical Informatics, , The Ohio State University, ; Columbus, OH 43210 USA
                Author information
                http://orcid.org/0000-0002-9222-4241
                Article
                1605
                10.1186/s13059-018-1605-z
                6362602
                30717772
                4c5d6cb2-0606-4ae7-b187-64b3f7190e44
                © The Author(s). 2019

                Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                : 20 March 2018
                : 5 December 2018
                Funding
                Funded by: University of Iowa
                Funded by: FundRef http://dx.doi.org/10.13039/100000051, National Human Genome Research Institute;
                Award ID: R01HG008759
                Award Recipient :
                Funded by: FundRef http://dx.doi.org/10.13039/100001797, Pharmaceutical Research and Manufacturers of America Foundation;
                Categories
                Research
                Custom metadata
                © The Author(s) 2019

                Genetics
                Genetics

                Comments

                Comment on this article