      Is Open Access

      PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores

      research-article
      Bioinformatics
      Oxford University Press


          Abstract

          Motivation

          Recent high-throughput long-read sequencers, such as those from PacBio and Oxford Nanopore, produce longer reads with more errors than short-read sequencers. Beyond the high error rates, the non-uniformity of errors complicates various downstream analyses that use long reads. Many useful simulators that characterize and reproduce long-read error patterns have been developed. However, there is still room for improvement in simulating the non-uniformity of errors.

          Results

          To capture the characteristics of errors in reads from long-read sequencers, we introduce a generative model for quality scores that uses a hidden Markov model together with a recent model selection method called factorized information criteria. We evaluated the resulting simulator from several perspectives; the results indicate that it successfully simulates reads that are consistent with real reads.
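
          As a rough illustration of what a hidden Markov model for quality scores can look like, the sketch below samples a per-base Phred quality string from a three-state chain. It is not PBSIM2's implementation: the state names, the transition and emission probabilities, and the helper functions are all hypothetical values chosen for readability, and no model selection step is shown. The only standard facts used are the Phred relation (error probability = 10^(-Q/10)) and the Phred+33 FASTQ encoding.

```python
import random

# Hypothetical toy parameters; NOT taken from PBSIM2 or fitted to real reads.
STATES = ["low", "mid", "high"]
INITIAL = [0.2, 0.5, 0.3]  # probability of starting in each state
TRANSITION = {             # rows: current state, columns: next state (STATES order)
    "low":  [0.90, 0.08, 0.02],
    "mid":  [0.05, 0.90, 0.05],
    "high": [0.02, 0.08, 0.90],
}
EMISSION = {               # per state: (possible Phred scores, their probabilities)
    "low":  ([2, 4, 6, 8],     [0.4, 0.3, 0.2, 0.1]),
    "mid":  ([8, 10, 12, 14],  [0.2, 0.3, 0.3, 0.2]),
    "high": [14, 18, 22, 26],
}
EMISSION["high"] = (EMISSION["high"], [0.1, 0.3, 0.4, 0.2])


def sample_quality_scores(length, rng):
    """Sample `length` Phred scores by walking the toy hidden Markov chain."""
    if length <= 0:
        return []
    state = rng.choices(STATES, weights=INITIAL)[0]
    scores = []
    for _ in range(length):
        values, weights = EMISSION[state]
        scores.append(rng.choices(values, weights=weights)[0])
        # Sticky transitions make neighbouring bases share a quality regime,
        # which is one way to obtain non-uniform error along a read.
        state = rng.choices(STATES, weights=TRANSITION[state])[0]
    return scores


def phred_to_error_prob(q):
    """Standard Phred relation: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def to_fastq_quality(scores):
    """Encode Phred scores as a Phred+33 FASTQ quality string."""
    return "".join(chr(q + 33) for q in scores)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same read
    scores = sample_quality_scores(60, rng)
    print(to_fastq_quality(scores))
    mean_error = sum(phred_to_error_prob(q) for q in scores) / len(scores)
    print(f"mean per-base error probability: {mean_error:.3f}")
```

          Because each hidden state tends to persist, the sampled scores form runs of low and high quality instead of varying independently per base; a simulator can then draw errors at each position with the probability implied by its score.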

          Availability and implementation

          The source code of PBSIM2 is freely available at https://github.com/yukiteruono/pbsim2.

          Supplementary information

          Supplementary data are available at Bioinformatics online.


                Author and article information

                Contributors
                Role: Associate Editor
                Journal
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                01 March 2021
                25 September 2020
                25 September 2020
                Volume: 37
                Issue: 5
                Pages: 589-595
                Affiliations
                [1] Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa 277-8561, Japan
                [2] Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
                [3] Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, Tokyo 169-8555, Japan
                [4] Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
                [5] Institute for Medical-oriented Structural Biology, Waseda University, Tokyo 162-8480, Japan
                [6] Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
                Author notes
                To whom correspondence should be addressed. mhamada@waseda.jp
                Author information
                https://orcid.org/0000-0001-9466-1034
                Article
                Publisher ID: btaa835
                DOI: 10.1093/bioinformatics/btaa835
                PMC ID: 8097687
                PubMed ID: 32976553
                29fcc42d-b789-41a3-b727-0cb46036894c
                © The Author(s) 2020. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                : 22 June 2020
                : 20 August 2020
                : 08 September 2020
                : 11 September 2020
                Page count
                Pages: 7
                Funding
                Funded by: MEXT KAKENHI;
                Award ID: JP24680031
                Award ID: JP16H05879
                Award ID: JP20H00624
                Award ID: JP16H06279
                Award ID: JP25240044
                Categories
                Original Papers
                Genome Analysis
                AcademicSubjects/SCI01060

                Bioinformatics & Computational biology
