RefSeq: an update on prokaryotic genome annotation and curation

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule—BlastRules. Continual curation of supporting evidence, and propagation of improved names onto RefSeq proteins ensures that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Related collections

Most cited references 30

Record: found
Abstract: found
Article: not found

NCBI prokaryotic genome annotation pipeline

Tatiana Tatusova, Michael DiCuccio, Azat Badretdin … (2016)

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

0 comments Cited 1889 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

The Pfam protein families database: towards a more sustainable future

Robert D. Finn, Penelope Coggill, Ruth Y. Eberhardt … (2015)

In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically-generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.

0 comments Cited 787 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database

Baofeng Jia, Amogelang Raphenya, Brian Alcock … (2016)

The Comprehensive Antibiotic Resistance Database (CARD; http://arpcard.mcmaster.ca) is a manually curated resource containing high quality reference data on the molecular basis of antimicrobial resistance (AMR), with an emphasis on the genes, proteins and mutations involved in AMR. CARD is ontologically structured, model centric, and spans the breadth of AMR drug classes and resistance mechanisms, including intrinsic, mutation-driven and acquired resistance. It is built upon the Antibiotic Resistance Ontology (ARO), a custom built, interconnected and hierarchical controlled vocabulary allowing advanced data sharing and organization. Its design allows the development of novel genome analysis tools, such as the Resistance Gene Identifier (RGI) for resistome prediction from raw genome sequence. Recent improvements include extensive curation of additional reference sequences and mutations, development of a unique Model Ontology and accompanying AMR detection models to power sequence analysis, new visualization tools, and expansion of the RGI for detection of emergent AMR threats. CARD curation is updated monthly based on an interplay of manual literature curation, computational text mining, and genome analysis.

0 comments Cited 592 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-ta): Nucleic Acids Res

Journal ID (iso-abbrev): Nucleic Acids Res

Journal ID (publisher-id): nar

Title: Nucleic Acids Research

Publisher: Oxford University Press

ISSN (Print): 0305-1048

ISSN (Electronic): 1362-4962

Publication date (Print): 04 January 2018

Publication date (Electronic): 03 November 2017

Publication date PMC-release: 03 November 2017

Volume: 46

Issue: Database issue , Database issue

Pages: D851-D860

Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892-6511, USA

Author notes

To whom correspondence should be addressed. Tel: +1 301 594 7689; Email: haftdh@ 123456ncbi.nlm.nih.gov

Author information

Daniel H Haft http://orcid.org/0000-0001-8101-4938

Article

Publisher ID: gkx1068

DOI: 10.1093/nar/gkx1068

PMC ID: 5753331

PubMed ID: 29112715

SO-VID: 9156446f-12f3-4979-a8f6-c9dbffaf9da3

License:

This work is written by (a) US Government employee(s) and is in the public domain in the US.

History

Date accepted : 25 October 2017

Date revision received : 17 October 2017

Date received : 15 September 2017

Page count

Pages: 10

Comments

Comment on this article

scite_

Cited by 451

See all cited by

Most referenced authors 656

See all reference authors

- Version 1
- Version 1

RefSeq: an update on prokaryotic genome annotation and curation

Read this article at

Abstract

Related collections

Genome Integrity

Most cited references 30

NCBI prokaryotic genome annotation pipeline

The Pfam protein families database: towards a more sustainable future

CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database

Author and article information

Journal

Affiliations

Author notes

Author information

Article

History

Page count

Categories

Comments

Comment on this article

Similar content 401

Cited by 451

Most referenced authors 656