Published and Perished? The Influence of the Searched Protein Database on the Long-Term Storage of Proteomics Data*

Abstract

In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time.

To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data, we analyzed the changes in reported protein identifiers across all public experiments in the Proteomics Identifications (PRIDE) database as of November 2010. Two distinct approaches were used to map each submitted protein identifier to a currently active entry. The first used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second, called the logical mapping algorithm, accessed the source databases and retrieved the current status of the reported identifier.

Our analysis showed differences in identifier stability between the main protein databases: the International Protein Index (IPI), the UniProt Knowledgebase (UniProtKB), the National Center for Biotechnology Information nr database (NCBI nr), and Ensembl. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries as deleted than the logical mapping algorithm did. We found several cases where experiments already contained more than 10% deleted identifiers at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB, using two releases per year from 2005 onward. This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
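
The difference between the two mapping approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not call PICR or any real database: the identifiers, sequences, status table, and function names are all hypothetical toy values chosen only to show why a status-based lookup and a 100% sequence-identity lookup can disagree.

```python
# Hypothetical toy data; none of it comes from PRIDE, PICR, or a real database.

# Current status of each identifier as its source database would report it:
# (status, successor identifier if the entry was replaced).
SOURCE_STATUS = {
    "ID_A": ("active", None),
    "ID_B": ("replaced", "ID_A"),
    "ID_C": ("deleted", None),
}

# Sequence stored for each identifier at submission time.
SUBMITTED_SEQUENCES = {
    "ID_A": "MKTAYIAK",
    "ID_B": "MKTAYIAR",
    "ID_C": "MSSQLLNE",
}

# Sequences of the entries that are active today.
ACTIVE_SEQUENCES = {
    "ID_A": "MKTAYIAK",
    "ID_D": "MSSQLLNE",
}


def logical_mapping(identifier):
    """Follow the source database's status records to an active entry."""
    seen = set()
    while identifier not in seen:
        seen.add(identifier)
        status, successor = SOURCE_STATUS.get(identifier, ("deleted", None))
        if status == "active":
            return identifier
        if status == "deleted":
            return None
        identifier = successor  # replaced: continue with the successor
    return None  # cyclic replacement chain, treated as unmappable


def sequence_identity_mapping(identifier):
    """Map to an active entry whose sequence is 100% identical."""
    sequence = SUBMITTED_SEQUENCES.get(identifier)
    if sequence is None:
        return None
    for active_id, active_sequence in ACTIVE_SEQUENCES.items():
        if active_sequence == sequence:
            return active_id
    return None


def deleted_fraction(identifiers, mapper):
    """Fraction of identifiers that the given mapper cannot resolve."""
    results = [mapper(identifier) for identifier in identifiers]
    return sum(result is None for result in results) / len(results)


if __name__ == "__main__":
    for identifier in ("ID_A", "ID_B", "ID_C"):
        print(identifier,
              logical_mapping(identifier),
              sequence_identity_mapping(identifier))
```

In this toy data the replaced entry ID_B resolves under the status-based lookup but not under sequence identity, because its sequence changed, while the deleted entry ID_C resolves only under sequence identity, because an identical sequence survives under another identifier. A disagreement of this kind is one way the two approaches could report different numbers of deleted entries.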

Most cited references 26

Ensembl 2011

Paul Flicek, M. Amode, Daniel Barrell … (2010)

The Ensembl project (http://www.ensembl.org) seeks to enable genomic science by providing high quality, integrated annotation on chordate and selected eukaryotic genomes within a consistent and accessible infrastructure. All supported species include comprehensive, evidence-based gene annotations and a selected set of genomes includes additional data focused on variation, comparative, evolutionary, functional and regulatory annotation. The most advanced resources are provided for key species including human, mouse, rat and zebrafish reflecting the popularity and importance of these species in biomedical research. As of Ensembl release 59 (August 2010), 56 species are supported of which 5 have been added in the past year. Since our previous report, we have substantially improved the presentation and integration of both data of disease relevance and the regulatory state of different cell types.

NCBI Reference Sequences: current status, policy and new initiatives

Kim D. Pruitt, Tatiana Tatusova, William Klimke … (2009)

NCBI's Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) is a curated non-redundant collection of sequences representing genomes, transcripts and proteins. RefSeq records integrate information from multiple sources and represent a current description of the sequence, the gene and sequence features. The database includes over 5300 organisms spanning prokaryotes, eukaryotes and viruses, with records for more than 5.5 × 10⁶ proteins (RefSeq release 30). Feature annotation is applied by a combination of curation, collaboration, propagation from other sources and computation. We report here on the recent growth of the database, recent changes to feature annotations and record types for eukaryotic (primarily vertebrate) species and policies regarding species inclusion and genome annotation. In addition, we introduce RefSeqGene, a new initiative to support reporting variation data on a stable genomic coordinate system.

Ongoing and future developments at the Universal Protein Resource

Emmanuel Boutet, Claire O'Donovan, Amos Bairoch (2010)

The primary mission of Universal Protein Resource (UniProt) is to support biological research by maintaining a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org.

Author and article information

Journal

Journal ID (nlm-ta): Mol Cell Proteomics

Journal ID (hwp): mcprot

Journal ID (pmc): mcprot

Journal ID (publisher-id): MCP

Title: Molecular & Cellular Proteomics : MCP

Publisher: The American Society for Biochemistry and Molecular Biology

ISSN (Print): 1535-9476

ISSN (Electronic): 1535-9484

Publication date (Print): September 2011

Publication date (Electronic): 23 June 2011

Publication date PMC-release: 23 June 2011

Volume: 10

Issue: 9

Electronic Location Identifier: M111.008490

Affiliations

[1]From the ‡EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK;

[2]§Department of Medicine I, Medical University of Vienna, Borschkegasse 8a, 1090 Vienna, Austria

Author notes

¶ To whom correspondence should be addressed: EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. E-mail: juan@ebi.ac.uk.

Article

Publisher ID: M111.008490

DOI: 10.1074/mcp.M111.008490

PMC ID: 3186200

PubMed ID: 21700957

SO-VID: 36db3d89-e02e-40ce-974e-7b021acfffbd

License:

Creative Commons Attribution Non-Commercial License applies to Author Choice Articles
