ReAS: Recovery of Ancestral Sequences for Transposable Elements from the Unassembled Reads of a Whole Genome Shotgun

Abstract

We describe an algorithm, ReAS, to recover ancestral sequences for transposable elements (TEs) from the unassembled reads of a whole genome shotgun. The main assumptions are that these TEs must exist at high copy numbers across the genome and must not be so old that they are no longer recognizable in comparison to their ancestral sequences. Tested on the japonica rice genome, ReAS was able to reconstruct all of the high copy sequences in the Repbase repository of known TEs, and increase the effectiveness of RepeatMasker in identifying TEs from genome sequences.
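The high-copy-number assumption can be illustrated with a small sketch. It is not the ReAS procedure itself, which the abstract does not spell out. It assumes, for illustration only, that short substrings (k-mers) occurring unusually often across the raw reads are a usable proxy for high-copy sequence. The function name `high_copy_kmers`, the parameters `k` and `min_count`, and the toy reads are all hypothetical, and reverse-complement strands are ignored for brevity.

```python
from collections import Counter


def high_copy_kmers(reads, k, min_count):
    """Return the k-mers that occur at least min_count times across all reads.

    Hypothetical helper: a crude proxy for "high copy number", not ReAS itself.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # Slide a window of width k along the read and tally each k-mer.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy reads (made up): the 6-mer "ACGTAC" is shared by the first three reads.
reads = ["TTACGTACGG", "CCACGTACTT", "GGACGTACAA", "TTTTGGGGCC"]
repeats = high_copy_kmers(reads, k=6, min_count=3)
print(repeats)
```

On real shotgun data the threshold would have to be set relative to sequencing coverage, since every unique region is also sampled several times.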

Synopsis

Transposable elements (TEs) are a major component of the genomes of multicellular organisms. They are parasitic creatures that invade the genome, insert multiple copies of themselves, and then die. All we see now are the decayed remnants of their ancestral sequences. Reconstruction of these ancestral sequences can bring dead TEs back to life. Algorithms for detecting TEs compare present-day sequences to a library of ancestral sequences. Unknown to many, the pervasive use of whole genome shotgun (WGS) methods in large-scale sequencing has made TE reconstruction increasingly problematic. To minimize assembly errors, WGS methods must reject the highly repetitive sequences that characterize most TEs, especially the most recent TEs, which are the least diverged from their ancestral sequences (and the most informative for reconstruction). This is acceptable to many, because the most important parts of the genes are not repetitive, but for TE aficionados it is a problem. ReAS is a novel algorithm that performs TE reconstruction using only the unassembled reads of a WGS. Tested against the WGS for japonica rice, it is shown to produce a library that is superior to the manually curated Repbase database of known ancestral TEs.
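The idea that an ancestral sequence can be recovered from its decayed copies can be sketched with a column-wise majority vote. This is a deliberately simplified illustration, not the method used by ReAS: it assumes the copies have already been aligned to equal length with no gaps, and that each copy has mutated independently. The function name `consensus` and the toy sequences are hypothetical.

```python
from collections import Counter


def consensus(aligned_copies):
    """Majority-vote consensus over equal-length, pre-aligned sequences.

    Hypothetical helper: assumes alignment is already done and has no gaps.
    """
    if not aligned_copies:
        return ""
    length = len(aligned_copies[0])
    if any(len(seq) != length for seq in aligned_copies):
        raise ValueError("all copies must have the same aligned length")
    # zip(*...) yields one tuple of bases per alignment column.
    columns = zip(*aligned_copies)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


# Toy copies (made up): each differs from the others at one position at most.
copies = [
    "ACGTTGCA",
    "ACGATGCA",
    "ACGTTGGA",
    "TCGTTGCA",
    "ACGTTGCA",
]
ancestor = consensus(copies)
print(ancestor)
```

Because independent mutations rarely hit the same column in most copies, the vote tends to cancel them out. The more diverged the copies, the weaker that guarantee, which matches the stated requirement that the TEs not be too old.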

Author and article information

Contributors

: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (publisher-id): pcbi

Title: PLoS Computational Biology

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Print): September 2005

Publication date (Electronic): 23 September 2005

Volume: 1

Issue: 4

Electronic Location Identifier: e43

Affiliations

[1] James D. Watson Institute of Genome Sciences of Zhejiang University, Hangzhou, China

[2] Beijing Institute of Genomics of Chinese Academy of Sciences, Beijing Genomics Institute, Beijing, China

[3] College of Life Sciences, Peking University, Beijing, China

[4] UW Genome Center, Department of Medicine, University of Washington, Seattle, Washington, United States of America

[5] The Institute of Human Genetics, University of Aarhus, Aarhus, Denmark

[6] Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark

The National Center for Genome Resources, United States of America

Author notes

* To whom correspondence should be addressed. E-mail: gksw@genomics.org.cn (GKW), wangj@genomics.org.cn (JW)

Article

Publisher ID: 05-PLCB-RA-0052R1

Serial Item and Contribution ID: plcb-01-04-04

DOI: 10.1371/journal.pcbi.0010043

PMC ID: 1232128

PubMed ID: 16184192

SO-VID: e9926cf3-e449-4bf6-9720-c070834422c1

Copyright: © 2005 Li et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 11 March 2005

Date accepted: 23 August 2005

Custom metadata

Citation: Li R, Ye J, Li S, Wang J, Han Y, et al. (2005) ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS Comput Biol 1(4): e43.
