Splign: algorithms for computing spliced alignments with identification of paralogs

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Background

The computation of accurate alignments of cDNA sequences against a genome is at the foundation of modern genome annotation pipelines. Several factors such as presence of paralogs, small exons, non-consensus splice signals, sequencing errors and polymorphic sites pose recognized difficulties to existing spliced alignment algorithms.

Results

We describe a set of algorithms behind a tool called Splign for computing cDNA-to-Genome alignments. The algorithms include a high-performance preliminary alignment, a compartment identification based on a formally defined model of adjacent duplicated regions, and a refined sequence alignment. In a series of tests, Splign has produced more accurate results than other tools commonly used to compute spliced alignments, in a reasonable amount of time.

Conclusion

Splign's ability to deal with various issues complicating the spliced alignment problem makes it a helpful tool in eukaryotic genome annotation processes and alternative splicing studies. Its performance is enough to align the largest currently available pools of cDNA data such as the human EST set on a moderate-sized computing cluster in a matter of hours. The duplications identification (compartmentization) algorithm can be used independently in other areas such as the study of pseudogenes.

Reviewers

This article was reviewed by: Steven Salzberg, Arcady Mushegian and Andrey Mironov (nominated by Mikhail Gelfand).

Related collections

Most cited references 10

Record: found
Abstract: found
Article: not found

SSAHA: a fast search method for large DNA databases.

Z. Ning, A Cox, J Mullikin (2001)

We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.

0 comments Cited 373 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

A computer program for aligning a cDNA sequence with a genomic DNA sequence.

L Florea, G Hartzell, Z. Zhang … (1998)

We address the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors. A freely available computer program, described herein, solves the problem for a 100-kb genomic sequence in a few seconds on a workstation.

0 comments Cited 192 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

WindowMasker: window-based masker for sequenced genomes.

Richa Agarwala, E Michael Gertz, Aleksandr Morgulis … (2006)

Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf

0 comments Cited 143 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-ta): Biol Direct

Title: Biology Direct

Publisher: BioMed Central

ISSN (Electronic): 1745-6150

Publication date Collection: 2008

Publication date (Electronic): 21 May 2008

Volume: 3

Page: 20

Affiliations

[1 ]National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, USA

Article

Publisher ID: 1745-6150-3-20

DOI: 10.1186/1745-6150-3-20

PMC ID: 2440734

PubMed ID: 18495041

SO-VID: 8206427c-9857-4430-9c24-9b270c8890db

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Splign: algorithms for computing spliced alignments with identification of paralogs

Read this article at

Abstract

Background

Results

Conclusion

Reviewers

Related collections

Resource Identification

Most cited references 10

SSAHA: a fast search method for large DNA databases.

A computer program for aligning a cDNA sequence with a genomic DNA sequence.

WindowMasker: window-based masker for sequenced genomes.

Author and article information

Journal

Affiliations

Article

History

Categories

Comments

Comment on this article

Similar content 142

Cited by 148

Most referenced authors 672