RNA-Seq Assembly – Are We There Yet?

Abstract

Transcriptomic sequence resources represent invaluable assets for research, in particular for non-model species without a sequenced genome. To date, the Next Generation Sequencing technologies 454/Roche and Illumina have been used to generate transcriptome sequence databases by mRNA-Seq for more than fifty different plant species. While some of these databases have been used successfully for downstream applications such as proteomics, the assembly parameters indicate that the assemblies do not yet accurately reflect the actual plant transcriptomes. Two different assembly strategies have been used: overlap-consensus-based assemblers for long reads and Eulerian path/de Bruijn graph assemblers for short reads. In this review, we discuss the challenges of, and solutions to, the transcriptome assembly problem. A list of quality control parameters and the scripts needed to produce them are provided.
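The abstract does not name the quality control parameters themselves. As an illustration only, the sketch below computes a few contig-length statistics that are commonly reported for assemblies (contig count, total bases, longest contig, mean length, N50). The choice of these statistics and the function names `read_fasta_lengths`, `n50`, and `assembly_summary` are assumptions made for this example and are not taken from the article or its scripts.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def assembly_summary(lengths):
    """Collect simple length-based statistics (hypothetical selection)."""
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bases": total,
        "longest": max(lengths, default=0),
        "mean_length": total / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    example = [">contig1", "ACGTAC", ">contig2", "GG"]
    print(assembly_summary(read_fasta_lengths(example)))
```

Length statistics of this kind describe contiguity only. They do not show whether the contigs correspond to real transcripts, so they would normally be read alongside other evidence.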

Author and article information

Journal

Journal ID (nlm-ta): Front Plant Sci

Journal ID (iso-abbrev): Front Plant Sci

Journal ID (publisher-id): Front. Plant Sci.

Title: Frontiers in Plant Science

Publisher: Frontiers Research Foundation

ISSN (Electronic): 1664-462X

Publication date (Electronic): 25 September 2012

Publication date Collection: 2012

Volume: 3

Electronic Location Identifier: 220

Affiliations

[1] Center of Excellence on Plant Sciences (CEPLAS), Institute for Plant Biochemistry, Heinrich Heine University Düsseldorf, Germany

[2] Center of Excellence on Plant Sciences (CEPLAS), Institute for Plant Developmental and Molecular Biology, Heinrich Heine University Düsseldorf, Germany

Author notes

Edited by: Bjoern Usadel, Rheinisch-Westfaelische Technische Hochschule Aachen University, Germany

Reviewed by: Jose M. Jimenez-Gomez, Max Planck Institute for Plant Breeding, Germany; Marc Lohse, Max Planck Institute of Molecular Plant Physiology, Germany

*Correspondence: Andrea Bräutigam, Institute for Plant Biochemistry, 26.03.01.Room 32, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany. e-mail: andrea.braeutigam@uni-duesseldorf.de

This article was submitted to Frontiers in Plant Systems Biology, a specialty of Frontiers in Plant Science.

Article

DOI: 10.3389/fpls.2012.00220

PMC ID: 3457010

PubMed ID: 23056003

SO-VID: 041f8707-d958-4341-8fec-f0166cec6870

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.

History

Date received: 06 August 2012

Date accepted: 05 September 2012

Page count

Figures: 2, Tables: 2, Equations: 0, References: 74, Pages: 12, Words: 10797