2passtools: two-pass alignment using machine-learning-filtered splice junctions increases the accuracy of intron detection in long-read RNA sequencing

Abstract

Transcription of eukaryotic genomes involves complex alternative processing of RNAs. Sequencing of full-length RNAs using long reads reveals the true complexity of processing. However, the relatively high error rates of long-read sequencing technologies can reduce the accuracy of intron identification. Here we apply alignment metrics and machine-learning-derived sequence information to filter spurious splice junctions from long-read alignments and use the remaining junctions to guide realignment in a two-pass approach. This method, available in the software package 2passtools (https://github.com/bartongroup/2passtools), improves the accuracy of spliced alignment and transcriptome assembly for species both with and without existing high-quality annotations.
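The filtering idea in the abstract can be illustrated with a toy sketch. This is not the 2passtools implementation: the `Junction` record, the motif set, and the `min_reads_noncanonical` threshold are hypothetical stand-ins for the alignment metrics and machine-learning model the authors describe. It only shows the general shape of the first-pass step: collect candidate junctions, discard those that look spurious, and write the survivors out as hints for a second alignment pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    """A candidate intron observed in first-pass alignments (hypothetical record)."""
    chrom: str
    start: int       # intron start, 0-based
    end: int         # intron end, exclusive
    strand: str
    read_count: int  # number of first-pass reads supporting this junction
    motif: str       # donor + acceptor dinucleotides in transcript orientation


# Commonly observed splice-site dinucleotide pairs; used here as a crude
# sequence-based signal in place of a trained model (an assumption).
CANONICAL_MOTIFS = {"GTAG", "GCAG", "ATAC"}


def keep_junction(j, min_reads_noncanonical=5):
    """Keep canonical junctions; require more read support for the rest."""
    if j.motif in CANONICAL_MOTIFS:
        return j.read_count >= 1
    return j.read_count >= min_reads_noncanonical


def filter_junctions(junctions, min_reads_noncanonical=5):
    """Return the junctions that pass the toy filter, preserving input order."""
    return [j for j in junctions if keep_junction(j, min_reads_noncanonical)]


def to_bed_lines(junctions):
    """Format junctions as tab-separated BED-like lines for a second pass."""
    return [
        f"{j.chrom}\t{j.start}\t{j.end}\t.\t{j.read_count}\t{j.strand}"
        for j in junctions
    ]


if __name__ == "__main__":
    candidates = [
        Junction("chr1", 100, 200, "+", 1, "GTAG"),  # canonical: kept
        Junction("chr1", 100, 205, "+", 1, "CTAC"),  # weakly supported: dropped
        Junction("chr1", 300, 400, "-", 8, "TTCC"),  # well supported: kept
    ]
    for line in to_bed_lines(filter_junctions(candidates)):
        print(line)
```

In a real two-pass workflow the filtered junctions would be passed to an aligner that accepts junction hints, for example through minimap2's `--junc-bed` option; the column layout that option expects should be checked against the aligner's documentation rather than taken from this sketch.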

Supplementary Information

The online version contains supplementary material available at 10.1186/s13059-021-02296-0.

Author and article information

Contributors

Matthew T. Parker: m.t.parker@dundee.ac.uk

Katarzyna Knop: k.knop@dundee.ac.uk

Geoffrey J. Barton: g.j.barton@dundee.ac.uk

Gordon G. Simpson: g.g.simpson@dundee.ac.uk (ORCID: http://orcid.org/0000-0001-6744-5889)

Journal

Journal ID (nlm-ta): Genome Biol

Journal ID (iso-abbrev): Genome Biol

Title: Genome Biology

Publisher: BioMed Central (London)

ISSN (Print): 1474-7596

ISSN (Electronic): 1474-760X

Publication date (Electronic): 1 March 2021

Publication date PMC-release: 1 March 2021

Publication date Collection: 2021

Volume: 22

Electronic Location Identifier: 72

Affiliations

[1] GRID grid.8241.f, ISNI 0000 0004 0397 2876, School of Life Sciences, University of Dundee, Dow Street, Dundee, DD1 5EH, UK

[2] GRID grid.43641.34, ISNI 0000 0001 1014 6626, James Hutton Institute, Invergowrie, DD2 5DA, UK

Article

Publisher ID: 2296

DOI: 10.1186/s13059-021-02296-0

PMC ID: 7919322

PubMed ID: 33648554

SO-VID: 64dc8499-e033-4ed8-ba5d-1c30ad00c987

License:

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

History

Date received: 27 May 2020

Date accepted: 10 February 2021

Funding

Funded by: University of Dundee Global Challenges Research Fund

Funded by: FundRef http://dx.doi.org/10.13039/501100000268, Biotechnology and Biological Sciences Research Council

Award ID: BB/M01066/1

Award ID: BB/J00247X/1

Award ID: BB/M004155/1

Funded by: FundRef http://dx.doi.org/10.13039/100010665, H2020 Marie Skłodowska-Curie Actions

Award ID: 799300

Award Recipient: Katarzyna Knop

Custom metadata

ScienceOpen disciplines: Genetics

Keywords: splicing, long-read sequencing, spliced alignment, rna-seq, gene expression, transcriptome assembly, machine learning, nanopore sequencing
