      InterPro in 2019: improving coverage, classification and access to protein sequence annotations

      Nucleic Acids Research

      Oxford University Press

          The InterPro database classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms of new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.

                Author and article information

                Nucleic Acids Res
                08 January 2019
                06 November 2018
                Volume: 47
                Issue: Database issue
                Page range: D351-D360
                [1] European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
                [2] School of Computer Science, The University of Manchester, Manchester M13 9PL, UK
                [3] Department of Bioengineering & Therapeutic Sciences, University of California, San Francisco, CA 94158, USA
                [4] European Molecular Biology Laboratory, Structural and Computational Biology Unit, Meyerhofstr. 1, 69117 Heidelberg, Germany
                [5] Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, CMU, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
                [6] Medical Research Council Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK
                [7] J. Craig Venter Institute (JCVI), 9605 Medical Center Drive, Suite 150, Rockville, MD 20850, USA
                [8] Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
                [9] Biobyte Solutions GmbH, Bothestr 142, 69126 Heidelberg, Germany
                [10] National Center for Biotechnology Information, National Library of Medicine, NIH Bldg, 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
                [11] Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA
                [12] Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
                [13] Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35131 Padua, Italy
                [14] Department of Agricultural Sciences, University of Udine, via Palladio 8, 33100 Udine, Italy
                [15] Fondazione Edmund Mach, Via E. Mach 1, 38010 S. Michele all’Adige, Italy
                [16] Structural and Molecular Biology, University College London, Darwin Building, London WC1E 6BT, UK
                Author notes
                To whom correspondence should be addressed. Tel: +44 1223 492679; Fax: +44 1223 494468; Email: rdf@
                © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                Pages: 10
                Funded by: Wellcome Trust 10.13039/100004440
                Award ID: 108433/Z/15/Z
                Funded by: Biotechnology and Biological Sciences Research Council 10.13039/501100000268
                Award ID: BB/N00521X/1
                Award ID: BB/N019172/1
                Award ID: BB/L024136/1
                Funded by: National Science Foundation, Division of Biological Infrastructure 10.13039/100006445
                Award ID: 1458808
                Database Issue


