Next-generation genome annotation: we still struggle to get it right

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that we have used for the past two decades. The sheer number of genomes necessitates the use of fully automated procedures for annotation, but errors in annotation are just as prevalent as they were in the past, if not more so. How are we to solve this growing problem?

Related collections

Most cited references 6

Record: found
Abstract: found
Article: not found

No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini.

Georgios Koutsovoulos, Sujai Kumar, Dominik Laetsch … (2016)

Tardigrades are meiofaunal ecdysozoans that are key to understanding the origins of Arthropoda. Many species of Tardigrada can survive extreme conditions through cryptobiosis. In a recent paper [Boothby TC, et al. (2015) Proc Natl Acad Sci USA 112(52):15976-15981], the authors concluded that the tardigrade Hypsibius dujardini had an unprecedented proportion (17%) of genes originating through functional horizontal gene transfer (fHGT) and speculated that fHGT was likely formative in the evolution of cryptobiosis. We independently sequenced the genome of H. dujardini As expected from whole-organism DNA sampling, our raw data contained reads from nontarget genomes. Filtering using metagenomics approaches generated a draft H. dujardini genome assembly of 135 Mb with superior assembly metrics to the previously published assembly. Additional microbial contamination likely remains. We found no support for extensive fHGT. Among 23,021 gene predictions we identified 0.2% strong candidates for fHGT from bacteria and 0.2% strong candidates for fHGT from nonmetazoan eukaryotes. Cross-comparison of assemblies showed that the overwhelming majority of HGT candidates in the Boothby et al. genome derived from contaminants. We conclude that fHGT into H. dujardini accounts for at most 1-2% of genes and that the proposal that one-sixth of tardigrade genes originate from functional HGT events is an artifact of undetected contamination.

0 comments Cited 118 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

The completion of the Mammalian Gene Collection (MGC).

Gary F. Temple, Daniela Gerhard, Rebekah Rasooly … (2009)

Since its start, the Mammalian Gene Collection (MGC) has sought to provide at least one full-protein-coding sequence cDNA clone for every human and mouse gene with a RefSeq transcript, and at least 6200 rat genes. The MGC cloning effort initially relied on random expressed sequence tag screening of cDNA libraries. Here, we summarize our recent progress using directed RT-PCR cloning and DNA synthesis. The MGC now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications, in addition to clones for more than 6300 rat genes. These high-quality MGC clones and their sequences are accessible without restriction to researchers worldwide.

0 comments Cited 61 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Removing contaminants from databases of draft genomes

Jennifer Lu, Steven L. Salzberg (2018)

Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.

0 comments Cited 53 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Steven L. Salzberg:

ORCID: http://orcid.org/0000-0002-8859-7432

salzberg@jhu.edu

Journal

Journal ID (nlm-ta): Genome Biol

Journal ID (iso-abbrev): Genome Biol

Title: Genome Biology

Publisher: BioMed Central (London )

ISSN (Print): 1474-7596

ISSN (Electronic): 1474-760X

Publication date (Electronic): 16 May 2019

Publication date PMC-release: 16 May 2019

Publication date Collection: 2019

Volume: 20

Electronic Location Identifier: 92

Affiliations

ISNI 0000 0001 2171 9311, GRID grid.21107.35, Departments of Biomedical Engineering, Computer Science, and Biostatistics, , Johns Hopkins University, ; Baltimore, MD 21205 USA

Author information

Steven L. Salzberg http://orcid.org/0000-0002-8859-7432

Article

Publisher ID: 1715

DOI: 10.1186/s13059-019-1715-2

PMC ID: 6521345

PubMed ID: 31097009

SO-VID: 2a96b1a5-b330-40c3-baf5-c64a1d5f7dee

License:

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Funding

Funded by: National Institutes of Health

Award ID: R35-GM130151

Funded by: National Science Foundation

Award ID: IOS-1744309

Custom metadata

ScienceOpen disciplines: Genetics

Data availability:

ScienceOpen disciplines: Genetics

Comments

Comment on this article

scite_

Cited by 124

See all cited by

- Version 1

Next-generation genome annotation: we still struggle to get it right

Read this article at

Abstract

Related collections

Genome Integrity

Most cited references 6

No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini.

The completion of the Mammalian Gene Collection (MGC).

Removing contaminants from databases of draft genomes

Author and article information

Contributors

Journal

Affiliations

Author information

Article

History

Funding

Categories

Custom metadata

Comments

Comment on this article

Similar content 462

Cited by 124