+1 Recommend
0 collections
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

      1 , 2 , 3 , 2 , 4 , 5 , *

      PLoS Computational Biology

      Public Library of Science

      Read this article at

          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.


          Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

          Author Summary

          One of the core elements of modern biological scientific investigation is the universal availability of millions of protein sequences from thousands of different organisms, allowing for exciting new investigations into biological questions. These sequences, found in large primary sequence databases such as GenBank NR or UniProt/TrEMBL, in secondary databases such as the valuable pathways database KEGG, or in highly curated databases such as UniProt/Swiss-Prot, are often annotated by computationally predicted protein functions. The scale of the available predicted function information is enormous but the accuracy of these predictions is essentially unknown. We investigate the critical question of the accuracy of functional predictions in these four public databases. We used 37 well-characterized enzyme families as a gold standard for comparing the accuracy of functional annotations in these databases. We find that function prediction error (i.e., misannotation) is a serious problem in all but the manually curated database Swiss-Prot. We discuss several approaches for mitigating the consequences of these high levels of misannotation.

          Related collections

          Most cited references 70

          • Record: found
          • Abstract: not found
          • Article: not found

          Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

           S Altschul (1997)
          The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
            • Record: found
            • Abstract: found
            • Article: not found

            MUSCLE: multiple sequence alignment with high accuracy and high throughput.

             Robert Edgar (2004)
            We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5. com/muscle.
              • Record: found
              • Abstract: not found
              • Article: not found

              Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.


                Author and article information

                Role: Editor
                PLoS Comput Biol
                PLoS Computational Biology
                Public Library of Science (San Francisco, USA )
                December 2009
                December 2009
                11 December 2009
                : 5
                : 12
                [1 ]Graduate Group in Biophysics, University of California San Francisco, San Francisco, California, United States of America
                [2 ]Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
                [3 ]Department of Biochemistry, University of Zürich, Zürich, Switzerland
                [4 ]Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, United States of America
                [5 ]California Institute for Quantitative Biosciences, University of California San Francisco, San Francisco, California, United States of America
                Spanish National Cancer Research Centre (CNIO), Spain
                Author notes

                Conceived and designed the experiments: AMS ID PCB. Performed the experiments: AMS. Analyzed the data: AMS. Contributed reagents/materials/analysis tools: AMS SDB PCB. Wrote the paper: AMS SDB ID PCB.

                Schnoes et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                Page count
                Pages: 13
                Research Article
                Computational Biology
                Genetics and Genomics/Bioinformatics
                Genetics and Genomics/Gene Function

                Quantitative & Systems biology


                Comment on this article