Magic-BLAST, an accurate RNA-seq aligner for long and short reads

Abstract

Background

Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline.

Results

Magic-BLAST uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate how accurately Magic-BLAST maps short or long sequences and how well it discovers introns, using real RNA-seq data sets from PacBio, Roche and Illumina runs and six benchmarks, and we compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome.

Conclusions

We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.
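The input combinations named above (a BLAST database or a FASTA file as the reference, a FASTQ file or an SRA accession as the reads) can be illustrated with a minimal sketch that only assembles a command line. The helper `build_magicblast_command` and the file and database names are hypothetical. The flags `-db`, `-subject`, `-query`, `-infmt` and `-sra` are written as we understand them from the Magic-BLAST documentation and should be checked against `magicblast -help` for the installed version.

```python
from typing import List, Optional


def build_magicblast_command(
    *,
    db: Optional[str] = None,
    subject_fasta: Optional[str] = None,
    fastq: Optional[str] = None,
    sra_accession: Optional[str] = None,
    executable: str = "magicblast",
) -> List[str]:
    """Assemble an argument list for one Magic-BLAST run.

    Hypothetical helper: it builds the command but does not execute it.
    Flag names are assumptions to verify with `magicblast -help`.
    """
    # The reference is either a BLAST database or a FASTA file, not both.
    if (db is None) == (subject_fasta is None):
        raise ValueError("give exactly one of db or subject_fasta")
    # The reads come either from a FASTQ file or from an SRA accession.
    if (fastq is None) == (sra_accession is None):
        raise ValueError("give exactly one of fastq or sra_accession")

    cmd = [executable]
    if sra_accession is not None:
        cmd += ["-sra", sra_accession]
    else:
        cmd += ["-query", fastq, "-infmt", "fastq"]
    if db is not None:
        cmd += ["-db", db]
    else:
        cmd += ["-subject", subject_fasta]
    return cmd


if __name__ == "__main__":
    # Placeholder accession and database name, for illustration only.
    print(" ".join(build_magicblast_command(db="my_reference",
                                            sra_accession="SRR1237994")))
```

The returned list can be passed to `subprocess.run` once the flags have been confirmed. Keeping the arguments as a list rather than a single string avoids shell quoting problems with file paths.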

Electronic supplementary material

The online version of this article (10.1186/s12859-019-2996-x) contains supplementary material, which is available to authorized users.

Author and article information

Contributors

Grzegorz M. Boratyn: boratyng@ncbi.nlm.nih.gov

Jean Thierry-Mieg: mieg@ncbi.nlm.nih.gov

Danielle Thierry-Mieg: Danielle.Thierry-Mieg@nih.gov

Ben Busby: busbybr@ncbi.nlm.nih.gov

Thomas L. Madden: madden@ncbi.nlm.nih.gov (ORCID: http://orcid.org/0000-0003-1641-7321)

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 25 July 2019

Publication date PMC-release: 25 July 2019

Publication date Collection: 2019

Volume: 20

Electronic Location Identifier: 405

Affiliations

ISNI 0000 0001 2297 5165, GRID grid.94365.3d, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA

Article

Publisher ID: 2996

DOI: 10.1186/s12859-019-2996-x

PMC ID: 6659269

PubMed ID: 31345161

SO-VID: c56e0fbe-7011-4060-932a-9e8485f711a4

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.