
      The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

      research-article
      PLoS Biology
      Public Library of Science


          Abstract

          Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
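The two ideas in the abstract, grouping proteins by sequence similarity and tracking how quickly new families appear as sequences are added, can be illustrated with a small sketch. This is not the authors' pipeline: it assumes pairwise similarity hits have already been computed by some alignment tool, uses plain single-linkage clustering via union-find as a stand-in for the clustering step, and runs on toy data. The function names `cluster_by_similarity` and `discovery_curve` are hypothetical.

```python
from collections import defaultdict


def cluster_by_similarity(n_sequences, similar_pairs):
    """Group sequence indices into clusters by single linkage.

    similar_pairs is an iterable of (i, j) index pairs that some upstream
    similarity search (assumed, not shown) judged to be similar.
    """
    parent = list(range(n_sequences))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in range(n_sequences):
        clusters[find(i)].append(i)
    # Largest clusters first; ties broken by smallest member index.
    return sorted(clusters.values(), key=lambda c: (-len(c), c[0]))


def discovery_curve(family_labels):
    """Cumulative count of distinct families after each added sequence."""
    seen = set()
    curve = []
    for label in family_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


# Toy example: six sequences, three similarity hits.
clusters = cluster_by_similarity(6, [(0, 1), (1, 2), (3, 4)])

# Label each sequence with the index of its cluster, then walk the
# sequences in order to see when each new family is first encountered.
label_of = {seq: idx for idx, members in enumerate(clusters) for seq in members}
curve = discovery_curve([label_of[i] for i in range(6)])
```

A curve that keeps rising roughly in step with the number of sequences added, rather than flattening, corresponds to the abstract's observation that family discovery has not yet saturated.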

          Author Summary

          The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique—shotgun sequencing—allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data—nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.

          Abstract

          The GOS data identified 6.12 million predicted proteins covering nearly all known prokaryotic protein families, and several new families. This almost doubles the number of known proteins and shows that we are far from identifying all the proteins in nature.


                Author and article information

                Contributors
                Role: Academic Editor
                Journal
                PLoS Biol
                pbio
                PLoS Biology
                Public Library of Science (San Francisco, USA )
                1544-9173
                1545-7885
                March 2007
                13 March 2007
                Volume: 5
                Issue: 3
                Article: e16
                Affiliations
                [1 ] J. Craig Venter Institute, Rockville, Maryland, United States of America
                [2 ] University of California, Davis, California, United States of America
                [3 ] Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America
                [4 ] Burnham Institute for Medical Research, La Jolla, California, United States of America
                [5 ] University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America
                [6 ] University of California Berkeley, Berkeley, California, United States of America
                [7 ] Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
                [8 ] University of California San Diego, San Diego, California, United States of America
                [9 ] Brown University, Providence, Rhode Island, United States of America
                Washington University St. Louis, United States of America
                Author notes
                * To whom correspondence should be addressed. E-mail: Shibu.Yooseph@venterinstitute.org
                Article
                06-PLBI-RA-0500R3 plbi-05-03-23
                DOI: 10.1371/journal.pbio.0050016
                PMC ID: 1821046
                PubMed ID: 17355171
                5eeb45ec-fc6f-40d4-b2c2-b0926061e918
                Copyright: © 2007 Yooseph et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                Received: 24 March 2006
                Accepted: 15 August 2006
                Page count
                Pages: 35
                Categories
                Research Article
                Computational Biology
                Evolutionary Biology
                Genetics and Genomics
                Molecular Biology
                Eubacteria
                Viruses
                Oceanic Metagenomics
                Custom metadata
                Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5(3): e16. doi: 10.1371/journal.pbio.0050016