A de novo assembly of the newt transcriptome combined with proteomic validation identifies new protein families expressed during tissue regeneration

Abstract

Background

Notophthalmus viridescens, a urodelian amphibian, represents an excellent model organism to study regenerative processes, but mechanistic insights into the molecular processes driving regeneration have been hindered by the paucity and poor annotation of coding nucleotide sequences. The enormous genome size and the lack of a closely related reference genome have so far prevented assembly of the urodelian genome.

Results

We describe the de novo assembly of the transcriptome of the newt Notophthalmus viridescens and its experimental validation. RNA pools covering embryonic and larval development, different stages of heart, appendage and lens regeneration, as well as a collection of different undamaged tissues were used to generate sequencing datasets on Sanger, Illumina and 454 platforms. Through a sequential de novo assembly strategy, hybrid datasets were converged into one comprehensive transcriptome comprising 120,922 non-redundant transcripts with an N50 of 975. From this, 38,384 putative transcripts were annotated and around 15,000 transcripts were experimentally validated as protein coding by mass spectrometry-based proteomics. Bioinformatic analysis of coding transcripts identified 826 proteins specific for urodeles. Several newly identified proteins establish novel protein families based on the presence of new sequence motifs without counterparts in public databases, while others containing known protein domains extend already existing families and also constitute new ones.
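For readers unfamiliar with the N50 statistic quoted in the Results, the sketch below shows how it is conventionally computed from a list of transcript lengths: the lengths are sorted from longest to shortest, and N50 is the length at which the running total first reaches half of the summed length. This is an illustration only and not the authors' pipeline; the function name `n50` and the toy lengths are assumptions made for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Illustrative helper (hypothetical name, not from the article).
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    # Walk from the longest sequence down, accumulating total length.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which half the total is covered is the N50.
        if running >= half_total:
            return length


# Toy data, not taken from the newt assembly.
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

Because longer sequences are counted first, N50 is weighted toward long transcripts and is typically larger than the median length of the same set.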

Conclusions

We demonstrate that our multistep assembly approach allows de novo assembly of the newt transcriptome with an annotation grade comparable to that of well-characterized organisms. Our data provide the groundwork for mechanistic experiments to answer the question of whether urodeles use proprietary sets of genes for tissue regeneration.

Author and article information

Contributors

Mario Looso

Jens Preussner

Konstantinos Sousounis

Marc Bruckskotten

Christian S Michel

Ettore Lignelli

Richard Reinhardt

Sabrina Höffner

Marcus Krüger

Panagiotis A Tsonis

Thilo Borchardt

Thomas Braun

Journal

Journal ID (nlm-ta): Genome Biol

Journal ID (iso-abbrev): Genome Biol

Title: Genome Biology

Publisher: BioMed Central

ISSN (Print): 1465-6906

ISSN (Electronic): 1465-6914

Publication date (Print): 2013

Publication date (Electronic): 20 February 2013

Volume: 14

Issue: 2

Page: R16

Affiliations

[1] Max-Planck-Institute for Heart and Lung Research, Ludwigstrasse 43, 61231 Bad Nauheim, Germany

[2] Max-Planck Genome Centre Cologne, Carl-von-Linné-Weg 10, 50829 Köln, Germany

[3] Department of Biology and Center for Tissue Regeneration and Engineering at Dayton, University of Dayton, OH 45469-2320, USA

Article

Publisher ID: gb-2013-14-2-r16

DOI: 10.1186/gb-2013-14-2-r16

PMC ID: 4054090

PubMed ID: 23425577

SO-VID: be0937fc-4054-4f92-b5ce-d11d05873cb1

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 27 September 2012

Date revision received: 30 January 2013

Date accepted: 20 February 2013
