Introduction Despite many efforts to classify and organize proteins [1–6] from both structural and functional perspectives, we are far from a clear understanding of the size and diversity of the protein universe [7–9]. Environmental shotgun sequencing projects, in which genetic sequences are sampled from communities of microorganisms [10–14], are poised to make a dramatic impact on our understanding of proteins and protein families. These studies are not limited to culturable organisms, and there are no selection biases for protein classes or organisms. These studies typically provide a gene-centric (as opposed to an organism-centric) view of the environment and allow the examination of questions related to protein family evolution and diversity. The protein predictions from some of these studies are characterized both by their sheer number and diversity. For instance, the recent Sargasso Sea study  resulted in 1.2 million protein predictions and identified new subfamilies for several known protein families. Protein exploration starts by clustering proteins into groups or families of evolutionarily related sequences. The notion of a protein family, while biologically very relevant, is hard to realize precisely in mathematical terms, thereby making the large-scale computational clustering and classification problem nontrivial. Techniques for these problems typically rely on sequence similarity to group sequences. Proteins can be grouped into families based on the highly conserved structural units, called domains, that they contain [15,16]. Alternatively, proteins are grouped into families based on their full sequence [17,18]. Many of these classifications, together with various expert-curated databases  such as Swiss-Prot , Pfam [15,21], and TIGRFAM [22,23], or integrated efforts such as Uniprot  and InterPro , provide rich resources for protein annotation. However, a vast number of protein predictions remain unclassified both in terms of structure and function. Given varying rates of evolution, there is unlikely to be a single similarity threshold or even a small set of thresholds that can be used to define every protein family in nature. Consequently, estimates of the number of families that exist in nature vary considerably based on the different thresholds used and assumptions made in the classification process [26–29]. In this study, we explored proteins using a comprehensive dataset of publicly available sequences together with environmental sequence data generated by the Sorcerer II Global Ocean Sampling (GOS) expedition . We used a novel clustering technique based on full-length sequence similarity both to predict proteins and to group related sequences. The goals were to understand the rate of discovery of protein families with the increasing number of protein predictions, explore novel families, and assess the impact of the environmental sequences from the expedition on known proteins and protein families. We used hidden Markov model (HMM) profiling to examine the relative biases in protein domain distributions in the GOS data and existing protein databases. This profiling was also used to assess the impact of the GOS data on target selection for protein structure characterization efforts. We carried out in-depth analyses on several protein families to validate our clustering approach and to understand the diversity and evolutionary information that the GOS data added; the families included ultraviolet (UV) irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO. Results/Discussion Data Generation, Sequence Clustering, and HMM Profiling We used the following publicly available datasets in this study (Table 1)—the National Center for Biotechnology Information (NCBI)'s nonredundant protein database (NCBI-nr) [31,32], NCBI Prokaryotic Genomes (PG) [31,33], TIGR Gene Indices (TGI-EST) , and Ensembl (ENS) [35,36]. The rationale for including these datasets is discussed in Materials and Methods. All datasets were downloaded on February 10, 2005. None of the above-mentioned databases contained sequences from the Sargasso Sea study , the largest environmental survey to date, and so we pooled reads from the Sargasso Sea study with the reads from the Sorcerer II GOS expedition , creating a combined set that we call the GOS dataset. The GOS dataset was assembled using the Celera Assembler  as described in  (see Materials and Methods). The GOS dataset was primarily generated from the 0.1 μm to 0.8 μm size filters and thus is expected to be mostly microbial . The data also included a small set of sequences from a viral size ( 200). While the 17,067 medium- and large-sized clusters constitute only 6% of the total number of clusters, they account for 85% of all the sequences that are clustered (Table 3). Many of the largest clusters correspond to families that have functionally diversified and expanded (Table 4). While some large families, such as the HIV envelope glycoprotein family and the immunoglobulins, also reflect biases in sequence databases, many more, including ABC transporters, kinases, and short-chain dehydrogenases, reflect their expected abundance in nature. Rate of Discovery of Protein Families We examined the rate of discovery of protein families using our clustering method to determine whether our sampling of the protein universe is reaching saturation. We find that for the present number of sequences there is an approximately linear trend in the rate of discovery of clusters with the addition of new (i.e., nonredundant) sequences (Figure 2). Moreover, the observed distribution of cluster sizes is well approximated by a power law [42,43], and this observed power law can be used to predict the rate of growth of the number of clusters of a given size (see Materials and Methods). This rate is dependent on the value of the power law exponent and decreases with increasing cluster sizes. We find good agreement between the observed and predicted growth rates for different cluster sizes. The approximately linear relationship between the number of clusters and the number of protein sequences indicates that there are likely many more protein families (either novel or subfamilies distantly related to known families) remaining to be discovered. GOS versus Known Prokaryotic versus Known Nonprokaryotic We also examined the GOS coverage of known proteins and protein families. Based on the cell-size filtering performed while collecting the GOS samples, we expected that the sample would predominantly be a size-limited subset of prokaryotic organisms . We studied the content of the 17,067 medium- and large-sized clusters across three groupings: (1) GOS, (2) known prokaryotic (PG together with bacterial and archaeal portions of NCBI-nr), and (3) known nonprokaryotic (TGI-EST and ENS together with viral and eukaryotic portions of NCBI-nr). The Venn diagram in Figure 3 shows the breakdown of these clusters by content (see Materials and Methods). The largest section contains GOS-only clusters (23.40%) emphasizing the significant novelty provided by the GOS data. The next section consists of clusters containing sequences from only the known nonprokaryotic grouping (20.78%), followed closely by the section containing clusters with sequences from all three groupings (20.23%). The large known nonprokaryotic–only grouping shows that our current GOS sampling methodology will not cover all protein families, and perhaps misses some protein families that are exclusive to higher eukaryotes. The large section of clusters that include all three groupings indicates a large core of well-conserved protein families across all domains of life. In contrast, the known prokaryotic protein families are almost entirely covered by the GOS data. Novelty Added by GOS Data There are 3,995 medium and large clusters that contain only sequences from the GOS dataset. Some are divergent members of known families that failed to be merged by the clustering parameters used, or are too divergent to be detected by any current homology detection methods. The remaining clusters are completely novel families. In exploring the 3,995 GOS-only clusters, 44.9% of them contain sequences that have HMM matches, or BLAST matches to sequences in a more recent snapshot of NCBI-nr (downloaded in August 2005) than was used in this study. The recent NCBI-nr matches include phage sequences from cyanophages (P-SSM2 and P-SSM4)  and sequences from the SAR-11 genome (Candidatus pelagibacter ubique HTCC1062) . We used profile–profile searches  to show that an additional 12.5% of the GOS-only clusters can be linked to profiles built from Protein Data Bank (PDB), COG, or Pfam. The 2,295 clusters with detected homology are referred to as Group I clusters. The remaining 1,700 (42.6%) GOS-only clusters with no detectable homology to known families are labeled as Group II clusters. We applied a guilt-by-association operon method to annotate the GOS-only clusters with a strategy that did not rely on direct sequence homology to known families. Function was inferred for the GOS-only clusters by examining their same-strand neighbors on the assembly (see Materials and Methods). Similar strategies have been successfully used to infer protein function in finished microbial genomes [46–48]. Despite minimal assembly of GOS reads, many scaffolds and mini-scaffolds contain at least partial fragments of more than one predicted ORF, thereby making this approach feasible. For 90 (5.3%) of the Group II clusters, and for 214 (9.3%) of the Group I clusters, at least one Gene Ontology (GO)  biological process term at p-value ≤0.05 can be inferred. The inferred functions and neighbors of some of these GOS-only clusters are highlighted in Table 5. We observed that for Group I clusters, the neighbor-inferred function is often bolstered by some information from weak homology to known sequences. While neighboring clusters as a whole are of diverse function, a number of GOS-only clusters seem to be next to clusters implicated in photosynthesis or electron transport. These GOS-only clusters could be of viral origin, as cyanophage genomes contain and express some photosynthetic genes that appear to be derived from their hosts [44,50,51]. In support of these observations, we identified five photosynthesis-related clusters containing hundreds to thousands of viral sequences, including psbA, psbD, petE, SpeD, and hli in the GOS data; furthermore, our nearest-neighbor analysis of these sequences reveals the presence of multiple viral proteins (unpublished data). Although the majority of GOS-only sequences are bacterial, a higher than expected proportion of the GOS-only clusters are predicted to be of viral origin, implying that viral sequences and families are poorly explored relative to other microbes. To assign a kingdom to the GOS-only clusters, we first inferred the kingdom of neighboring sequences based on the taxonomy of the top four BLAST matches to the NCBI-nr database (see Materials and Methods). A possible kingdom was assigned to the GOS-only cluster if more than 50% of assignable neighboring sequences belong to the same kingdom. Viewed in this way, 11.8% of Group I clusters and 17.3% of Group II clusters with at least one kingdom-assigned neighbor have more than 50% viral neighbors (Figure 4). Only 3.3% and 3.4% of random samples of clusters with size distributions matching that of Group I and Group II clusters have more than 50% viral neighbors, while 7.7% of all clusters pass this criterion. A total of 547 GOS-only clusters contain sequences collected from the viral size fraction included in the GOS dataset. For these clusters, 38.9% of the Group I subset and 27.5% of the Group II subset with one or more kingdom-assigned neighbors would be inferred as viral, based on the conservative criteria of having more than 50% viral assignable neighbors. Several alternative kingdom assignment methods were tried (see Materials and Methods) and provide for a similar conclusion. The GOS-only clusters also tend to be more AT-rich than sequences from a random size-matched sample of clusters (35.9% ± 8% GC content for Group II clusters versus 49.5% ± 11% GC content for sample). Phage genomes with a Prochlorococcus host  are also AT rich (37% average GC content). Our analysis of the graph constructed based on inferred operon linkages between all clusters indicates that the GOS-only clusters may constitute large sets of cotranscribed genes (see Materials and Methods). The high proportion of potentially viral novel clusters observed here is reasonable, as 60%–80% of the ORFs in most finished marine phage genomes are not homologous to known protein sequences . Viral metagenomics projects have reported an equally high fraction of novel ORFs , and a recent marine metagenomics project estimated that up to 21% of photic zone sequences could be of viral origin . It has also been reported that 40% of ORFans (sequences that lack similarity to known proteins and predicted proteins) exist in close spatial proximity to each other in bacterial genomes, and this combined with proximity to integration signals has been used to suggest a viral horizontally transferred origin for many bacterial ORFans . Others have noted a clustering of ORFans in genome islands and suggested they derive from a phage-related gene pool . A recent analysis of genome islands from related Prochlorococcus found that phage-like genes and novel genes cohabit these dynamic areas of the genome . In our GOS-only clusters, 37 of the 1,700 clusters with no detectable similarity (2.2%) have at least ten bacterial-classified and ten viral-classified neighboring ORFs. This is 6.2-fold higher than the rate seen for the size-matched sample of all clusters (six clusters, 0.35%). This would seem to add more support to a phage origin for at least some ORFans found in bacterial genomes. If a sizable portion of the novel families in the GOS data are in fact of viral origin, it suggests that we are far from fully exploring the molecular diversity of viruses, a conclusion echoed in previous studies of viral metagenomes [53,57,58]. In studies of bacterial genomes, discovery of new ORFans shows no sign of reaching saturation . Coverage of many phage families in the GOS data may be low, given that there are inherent differences in the abundance of their presumed bacterial hosts. These GOS-only clusters were operationally defined as having at least 20 nonredundant sequences. Reducing this threshold to ten nonredundant sequences adds 7,241 additional clusters. Whether this vast diversity represents new families or is a reflection of the inability to detect distant homology will require structural and biochemical studies, as well as continued development of computational methods to identify remotely related sequences. Comparison of Domain Profiles in GOS and PG Datasets We used HMM profiling to address the question of which biochemical and biological functions are expanded or contracted in GOS compared to the largely terrestrial genomes in PG. Significant differences are seen in 68% of domains (4,722 out of the 6,975 domains that match either GOS or PG; p-value 20 different strains sequenced, 168 ORFans are identified in strain CFT073, and 67 of them have GOS matches. The only eukaryotic organism in this list is Candida albicans SC5314, a fungal human pathogen, which has 49 ORFans with GOS matches. We examined a small but interesting subset of the ORFans that have 3-D structures deposited in PDB. Out of 65 PDB ORFans, GOS matches for eight of them are found (see Supporting Information for their PDB identifiers and names). They include four restriction endonucleases, three hypothetical proteins, and a glucosyltransferase. GOS sequences can play an important role in identifying the functions of existing ORFans or in confirming protein predictions. For example, we found that the hypothetical protein AF1548, which is a PDB ORFan, has matches to 16 GOS sequences. A PSI-BLAST search with AF1548 as the query against a combined set of GOS and NCBI-nr identified several significant restriction endonucleases after three iterations. With the support of 3-D structure and multiple sequence alignment of AF1548 and its GOS matches, we predict that AF1548 along with its GOS homologs are restriction endonucleases (Figure 10). When combined with an established consensus of active sites of the related endonucleases families , we predicted three catalytic residues. Genome Sequencing Projects and Protein Exploration With respect to protein exploration and novel family discovery, microbial sequencing offers more promise compared to sequencing more mammalian genomes. This is illustrated by Figure 11 , where the number of clusters that protein predictions from various finished mammalian genomes fall into was compared to the number of clusters that similar-sized random subsets of microbial sequences fall into (see Materials and Methods). As the figure shows, the rate of protein family discovery is higher for microbes than for mammals. Indeed, the rate of new family discovery is plateauing for mammalian sequences. This is not surprising, as mammalian divergence from a common ancestor is much more recent than microbial divergence from a common ancestor, which suggests that mammals will share a larger core set of less-diverged proteins. Microbial sequencing is also more cost effective than mammalian sequencing for acquiring protein sequences because microbial protein density is typically 80%–90% versus 1%–2% for mammals. This could be addressed with mammalian mRNA sequencing, but issues with acquiring rarely expressed mRNAs would need to be considered. There are, of course, other reasons to sequence mammalian genomes, such as understanding mammalian evolution and mammalian gene regulation. Conclusions The rate of protein family discovery is approximately linear in the (current) number of protein sequences. Additional sequencing, especially of microbial environments, is expected to reveal many more protein families and subfamilies. The potential for discovering new protein families is also supported by the GOS diversity seen at the nucleotide level across the different sampling sites . Averaged over the sites, 14% of the GOS sequence reads from a site are unique (at 70% nucleotide identity) to that site . The GOS data provides almost complete coverage of known prokaryotic protein families. In addition, it adds a great deal of diversity to many known families and offers new insights into the evolution of these families. This is illustrated using several protein families, including UV damage–repair enzymes, phosphatases, proteases, glutamine synthetase, RuBisCO, RecA (unpublished data), and kinases . Only a handful of protein families have been examined thus far, and many thousands more remain to be explored. The protein analysis presented indicates that we are far from exploring the diversity of viruses. This is reflected in several of the analyses. The GOS-only clusters show an overrepresentation of sequences of viral origin. In addition, our domain analysis using HMM profiling shows a lower Pfam coverage of the GOS sequences in the viral kingdom compared to the other kingdoms. At least two of the protein families we explored in detail (UV repair enzymes and glutamine synthetase) contain abundant new viral additions. The extraordinary diversity of viruses in a variety of environmental settings is only now beginning to be understood [57,119–121]. A separate analysis of GOS microbial and viral sequences (unpublished data) shows that multiple viral protein clusters contain significant numbers of host-derived proteins, suggesting that viral acquisition of host genes is quite widespread in the oceans. Data generated by this GOS study and similar environmental shotgun sequencing studies present their own analysis challenges. Methods for various analyses (e.g., sequence alignment, profile construction, phylogeny inference, etc.) are generally designed and optimized to work with full sequences. They have to be tailored to analyze the mostly fragmentary sequences that are generated by these projects. Nevertheless, these data are a valuable source of new discoveries. These data have the potential to refine old hypotheses and make new observations about proteins and their evolution. Our preliminary exploration of the GOS data identified novel protein families and also showed that many ORFan sequences from current databases have homologs in these data. The diversity added by GOS data to protein families also allows for the building of better profile models and thereby improves remote homology detection. The discovery of kingdom-crossing protein families that were previously thought to be kingdom-specific presents evidence that the GOS project has excavated proteins of more ancient lineage than that previously known, or that have undergone lateral gene transfer. This is another example of how metagenomics studies are changing our understanding of protein sequences, their evolution, and their distribution across the various forms of life and environments. Biases in the currently published databases due to oversampling of some proteins or organisms are illuminated by environmental surveys that lack such biases. Such knowledge can help us make better predictions of the real distribution patterns of proteins in the natural world and indicate where increased sampling would be likely to uncover new families or family members of tremendous diversity (such as in the viral kingdom). These data have other significant implications for the fields of protein evolution and protein structure prediction. Having several hundreds or even tens of thousands of diverse proteins from a family or examples of a specific protein fold should provide new approaches for developing protein structure prediction models. Development of algorithms that consider the alignments of all these family members/protein folds and analyze how amino acid sequence can vary without significantly altering the tertiary structure or function may provide insights that can be used to develop new ab inito methods for predicting protein structures. These same datasets could also be used to begin to understand how a protein evolves a new function. Finally, this large database of amino acid sequence data could help to better understand and predict the molecular interactions between proteins. For example, they may be used to predict the protein–protein interactions so critical for the formation of specific functional complexes within cells. The GOS data also have implications for nearly all computational methods relying on sequence data. The increase in the number of known protein sequences presents challenges to many algorithms due to the increased volume of sequences. In most cases this increase in sequence data can be compensated for with additional CPU cycles, but it is also a foreshadowing of times to come as the pace of large-scale sequence-collecting accelerates. A related challenge is the increase in the diversity of protein families, with many new divergent clades present. With more protein similarity relationships falling into the twilight zone overlapping with random sequence similarity, the number of false positives for homology detection methods increases, making the true relationships more difficult to identify. Nevertheless, a deeper knowledge of protein sequence and family diversity introduces unprecedented opportunities to mine similarity relationships for clues on molecular function and molecular interactions as well as providing much expanded data for all methods utilizing homologous sequence information data. The GOS dataset has demonstrated the usefulness of large-scale environmental shotgun sequencing projects in exploring proteins. These projects offer an unbiased view of proteins and protein families in an environmental sample. However, it should be noted that the GOS data reported here are limited to mostly ocean surface microbes. Even with this targeted sampling a tremendous amount of diversity is added to known families, and there is evidence for a large number of novel families. Additional data from larger filter sizes (that will sample more eukaryotes) coupled with metagenomic studies of different environments like soil, air, deep sea, etc. will help to achieve the ultimate goal of a whole-earth catalog for proteins. Materials and Methods Data description. NCBI-nr [31,32] is the single largest publicly available protein resource and includes protein sequences submitted to SWISS-PROT (curated protein database) , PDB (a database of amino acid sequences with solved structures) , PIR (Protein Information Resource) , and PRF (Protein Research Foundation). In addition, NCBI-nr also contains protein predictions from DNA sequences from both finished and unfinished genomes in GenBank , EMBL , and DNA Databank of Japan (DDBJ) . The nonredundancy in NCBI-nr is only to the level of distinct sequences, and any two sequences of the same length and content are merged into a single entry. NCBI-nr contains partial protein sequences and is not a fully curated database. Therefore it also contains contaminants in the form of sequences that are falsely predicted to be proteins. Expressed sequence tag (EST) databases also provide the potential to add a great deal of information to protein exploration and contain information that is not well represented in NCBI-nr. To this end, assemblies of EST sequences from the TIGR Gene Indices , an EST database, were included in this study. To minimize redundancy, only EST assemblies from those organisms for which the full genome is not yet known, were included. The protein predictions on metazoan genomes that are fully sequenced and annotated were obtained by including the Ensembl database [35,36] in this study. Both finished and unfinished sequences from prokaryotic genome projects submitted to NCBI were included. The protein predictions from the individual sequencing projects are submitted to NCBI-nr. Nevertheless, these genomes were included in this dataset both for the purpose of evaluating our approach and also for the purpose of identifying any proteins that were missed by the annotation process used in these projects. Thus, for this study the following publicly available datasets, all downloaded on February 10, 2005—NCBI-nr, PG, TGI-EST, and ENS—were used. The organisms in the PG set and the TGI-EST set are listed in Protocol S1. Assembly of the GOS dataset. Initial assembly (construction of “unitigs”) was performed so that only overlaps of at least 98% DNA sequence identity and no conflicts with other overlaps were accepted. False assemblies at this phase of the assembler are extremely rare, even in the presence of complex datasets [37,128]. Paired-end (also known as mate-pair) data were then used to order, orient, and merge unitigs into the final assemblies, but only when two mate pairs or a single mate pair and an overlap between unitigs implied the same layout. In one respect, mate pair data was used more aggressively than is typical in assembly of a single genome in that depth-of-coverage information was largely ignored . This potentially allows chimeric assemblies through a repeat within a genome or through an ortholog between genomes. Thus, a conclusion that relies on the correctness of a single assembly involving multiple unitigs should be considered tentative until the assembly can be confirmed in some way. Assemblies involved in key results in this paper were subjected to expert manual review based on thickness of overlaps, presence of well-placed mate pairs across thin overlaps or across gaps between contigs, and consistency of depth of coverage. Data release and availability. All the GOS protein predictions will be submitted to GenBank. In addition, all the data supporting this paper, including the clustering and the various analyses, will be made publicly available via the CAMERA project (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; http://camera.calit2.net), which is funded by the Gordon and Betty Moore Foundation. All-against-all BLASTP search. We used two sets of computer resources. At the J. Craig Venter Institute, 125 dual 3.06-GHz Xeon processor systems with 2 Gb of memory per system were used. Each system had 80 GB local storage and was connected by GBit ethernet with storage area network (SAN) I/O of ~24 GBit/sec and network attached storage (NAS) I/O of ~16 GBit/sec. A total of 466,366 CPU hours was used on this system. In addition, access to the National Energy Research Scientific Computing Center (NERSC) Seaborg computer cluster was available, including 380 nodes each with sixteen 375-MHz Power3 processors. The systems had between 16 GB and 64 GB of memory. Only 128 nodes were used at a time. A total of 588,298 CPU hours was used on this system. The dataset of 28.6 million sequences was searched against itself in a half-matrix using NCBI BLAST  with the following parameters: -F “m L” -U T -p blastp -e 1 × 10−10 -z 3 × 109 -b 8000 -v 10. In this paper, similarity of an alignment is defined to be the fraction of aligned residues with a positive score according to the BLOSUM62 substitution matrix  used in the BLAST searches. Identification of nonredundant sequences. Given a set of sequences S and a threshold T, a nonredundant subset S′ of S was identified by first partitioning S (using the threshold T) and then picking a representative from each partition. The set of representatives constitutes the nonredundant set S′. The process was implemented using the following graph-theoretic approach. A directed graph G = (V, E) is constructed with vertex set V and edge set E. Each vertex in V represents a sequence from S. A directed edge (u,v) ∈ E if sequence u is longer than sequence v and their sequence comparison satisfies the threshold T; for sequences of identical length, the sequence with the lexicographically larger id is considered the longer of the two. Note that G does not have any cycles. Source vertices (i.e., vertices with no in-degree) are sorted in decreasing order of their out-degrees and (from largest out-degree to smallest) processed in this order. A source vertex u is processed as follows: mark all vertices that have not been seen before and are reachable from vertex u as being redundant and mark vertex u as their representative. We used two thresholds in this paper, 98% similarity and 100% identity. The former was used in the first stage of the clustering and the later was used in the HMM profile analysis. For the 98% similarity threshold, two sequences satisfy the threshold if the following three criteria are met: (1) similarity of the match is at least 98%; (2) at least 95% of the shorter sequence is covered by the match; and (3) (match score)/(self score of shorter sequence) ≥ 95%. For the 100% identity threshold, two sequences satisfy the threshold if their match identity is 100%. Description of the clustering algorithm. The starting point for the clustering was the set of pairwise sequence similarities identified using the all-against-all BLASTP compute. Because of both the volume and nature of the data, the clustering was carried out in four steps: redundancy removal, core set identification, core set merging, and final recruitment. A set of nonredundant sequences (at 98% similarity) was identified using the procedure given in Materials and Methods (Identification of nonredundant sequences). Only the nonredundant sequences were considered in further steps of the clustering process. The aim of the core set identification step was to identify core sets of highly related sequences. In graph-theoretic terms, this involves looking for dense subgraphs in a graph where the vertices correspond to sequences and an edge exists between two sequences if their sequence match satisfies some reasonable threshold (for instance, 40% similarity match over 80% of at least one sequence and are clearly homologous based on the BLAST threshold). Dense subgraphs were identified by using a heuristic. This approach utilizes long edges. These are edges where the match threshold is computed relative to the longer sequence. This was done to prevent, as much as possible, unrelated proteins from being put into the same core set. If all the sequences were full length, using long edges would have offered a good solution to keeping unrelated proteins apart. However, the situation here is complicated by the presence of a large amount of fragmentary sequence data of varying lengths. This was dealt with somewhat by working with rather stringent match thresholds and a two-stage process to identify the core sets. We used the concept of strict long edges and weak long edges. A strict long edge exists between two vertices (sequences) if their match has the following properties: (1) 90% of the longer sequence is involved in the match; (2) the match has 70% similarity; and (3) the score of the match is at least 60% of the self-score of the longer sequence. A weak long edge exists between two vertices (sequences) if their match has the following properties: (1) 80% of the longer sequence is involved in the match; (2) the match has 40% similarity; and (3) the score of the match is at least 30% of the self-score of the longer sequence. Core set identification had two substages: large core initialization and core extension. The large core initialization step identified sets of sequences where these sets were of a reasonable size and the sequences in them were very similar to each other. Furthermore, these sets could be extended in the core extension step by adding related sequences. In the large core initialization step, a directed graph G was constructed on the sequences using strict long edges, with each long edge being directed from the longer to the shorter sequence. For each vertex v in G, let S(v) denote the friends set of v consisting of v and all neighbors that v has an out-going edge to. Initially all the vertices in G are unmarked. Consider the set of all friends sets in the decreasing order of their size. For S(v) that is currently being considered, do the following: (1) initialize seed set A = S(v); (2) while there exists some v′ such that |S(v) ∩ S(v′)| ≥ k, set A = A ∪ S(v′). (Note: k = 10 is chosen); (3) output set A and mark all vertices in A; and (4) update all friends sets to contain only unmarked vertices. In the core extension step, we constructed a graph G using weak long edges. All vertices in seed sets (computed from the large core initialization step) were marked and the rest of the vertices unmarked. Each seed set was then greedily extended to be a core set by adding a currently unmarked vertex that has at least k neighbors (k = 10 is chosen) in the set; the added vertex was marked. After this process, a clique-finding heuristic was used to identify smaller cliques (of size at most k − 1) consisting of currently unmarked vertices; these were also extended to become core sets. A final step involved merging the computed core sets on the basis of weak edges connecting them. In the core set merging step, we constructed an FFAS (Fold and Function Assignment System) profile  for each core set using the longest sequence in the core set as query. FFAS was then used to carry out profile–profile comparisons in order to merge the core sets into larger sets of related sequences. Due to computational constraints imposed by the number of core sets, profiles were built on only core sets containing at least 20 sequences. Final recruitment involved constructing a PSI-BLAST profile  on core sets of size 20 or more (using the longest sequence in the core set as query) and then using PSI-BLAST (–z 1 × 109, –e 10) to recruit as yet unclustered sequences or small-sized clusters (size less than 20) to the larger core sets. For a sequence to be recruited, the sequence–profile match had to cover at least 60% of the length of the sequence with an E-value ≤ 1 × 10−7. In a final step, unclustered sequences were recruited to the clusters using their BLAST search results. A length-based threshold was used to determine if the sequence is to be recruited. Identification of clusters containing shadow ORFs. A well-known problem in predicting coding intervals for DNA sequences is shadow ORFs. The key requirement that coding intervals not contain in-frame stop codons requires that coding intervals be subintervals of ORFs. Long ORFs are therefore obvious candidates to be coding intervals. Unfortunately, the constraints on the coding interval to be an ORF often cause subintervals and overlapping intervals of the coding interval to also be ORFS in one of the five other reading frames (two on the same strand and three on the opposite strand). These coincidental ORFs are called shadow ORFs since they are found in the shadow of the coding ORF. In rare cases (and more frequently in certain viruses) coding intervals in different reading frames can overlap but usually only slightly. Overwhelmingly distinct coding intervals do not overlap. However, this constraint is not as strict for ORFs that contain a coding interval, as the exact extent of the coding interval is not known. Prokaryotes predominate in these data and are the focus of the ORF predictions. Their 3′ end of an ORF is very likely to be part of the coding interval because a stop codon is a clear signal for the termination of both the ORF and the coding interval (this signal could be obscured by frameshift errors in sequencing). The 5′ end is more problematic because the true start codon is not so easily identified and so the longest ORF with a reasonable start codon is chosen and this may extend the ORF beyond the true coding interval. For this reason different criteria were set for when ORFs have a significant overlap depending on the orientation (or the 5′ or 3′ ends) of the ORFs involved. Two ORFs on the same strand are considered overlapping if their intervals overlap by at least 100 bp. Two ORFs that are on the opposite strands are considered overlapping either if their intervals overlap by at least 50 bp and their 3′ ends are within each others intervals, or if their intervals overlap by at least 150 bp and the 5′ end of one is in the interval of the other. ORFs for coding intervals are clustered based on sequence similarity. In most cases this sequence similarity is due to the ORFs evolving from a common ancestral sequence. Due to functional constraints on the protein being coded for by the ORF, some sequence similarity is retained. There are no known explicit constraints on the shadow ORFs to constrain drift from the ancestral sequences. However, the shadow ORFs still tend to cluster together for some obvious reasons. The drift has not yet obliterated the similarity. There are implicit constraints due to the functional constraints on the overlapping coding ORF. There are also other possible unknown functional constraints beyond the coding ORF. At first it was surmised that within shadow ORF clusters the diversity should be higher than for the coding ORF, but this did not prove to be a reliable signal. The apparent problem is that the shadow ORFs tend to be fractured into more clusters due to the introduction of stop codons that are not constrained because the shadow ORFs are noncoding. What rapidly became apparent is that the most reliable signal that a cluster was made up of shadow ORFs is that the cluster was smaller than the coding cluster containing the ORFs overlapping the shadow ORFs. The basic rule for labeling a cluster as a shadow ORF cluster is that the size of the shadow ORF cluster is less than the size of another cluster that contained a significant proportion of the overlapping ORFs for the shadow ORF cluster. A specific set of rules was used to label shadow ORF clusters based on comparison to other clusters that contained ORFs overlapping ORFS in the shadow ORF cluster (called the overlapping cluster for this discussion). First, the overlapping cluster cannot be the same cluster as the shadow ORF cluster (there are sometimes overlapping ORFs within the same cluster due to frameshifts). Second, both the redundant and nonredundant sizes of the shadow ORF cluster must be smaller than the corresponding sizes of the overlapping cluster. Third, at least one-third of the shadow ORFs must have overlapping ORFs in the overlapping cluster. Fourth, less than one-half of the shadow ORFs are allowed to contain their overlapping ORFs (this test is rarely needed but did eliminate the vast majority of the very few obvious false positives that were found using these rules). Finally, the majority of the shadow ORFs that overlapped must overlap by more than half their length. When using this rule, 1,274,919 clusters were labeled as shadow ORF clusters, and 6,570,824 singletons were labeled as shadow ORFs. The rules need to be somewhat conservative so as not to eliminate coding clusters. To test these rules, clusters containing at least two NCBI-nr sequences were examined. Two sequences were used instead of one because occasional spurious shadow ORFs have been submitted to NCBI-nr. There were 989 shadow ORF clusters containing at least two NCBI-nr sequences and with more than one-tenth as many NCBI-nr sequences as the overlapping cluster. This was 0.86% of all clusters (114,331 in total) with at least two NCBI-nr sequences. Of these 989, a few were obvious mistakes, and the others involved very few NCBI-nr sequences of dubious curation, such as “hypothetical.” Just to be conservative, all of these 989 clusters were rescued and not labeled as shadow ORF clusters. Ka/Ks test to determine if sequences in a cluster are under selective pressure. For a cluster containing conserved but noncoding sequences, it is expected that there is no selection at the codon level. We checked this by computing the ratio of nonsynonymous to synonymous substitutions (Ka/Ks test) [130,131] on the DNA sequences from which the ORFs in the cluster were derived. For most proteins, Ka/Ks ≪ 1, and for proteins that are under strong positive selection, Ka/Ks ≫ 1. A Ka/Ks value close to 1 is an indication that sequences are under no selective pressure and hence are unlikely to encode proteins [134,135]. Weakly selected but legitimate coding sequences can have a Ka/Ks value close to 1. These were identified to some extent by using a model in which different partitions of the codons experience different levels of selective pressure. A cluster was rejected only if no partition was found to be under purifying selection at the amino acid level. The Ka/Ks test [130,131] was run only on those clusters (remaining after the shadow ORF filtering step) that did not contain sequences with HMM matches or have NCBI-nr sequences in them. Only the nonredundant sequences in a cluster were considered. Sequences in each of the clusters were aligned with MUSCLE . For each cluster, a strongly aligning subset of sequences was selected for the Ka/Ks analysis. The codeml program from PAML [135,136] was run using model M0 to calculate an overall (i.e., branch- and position-independent) Ka/Ks value for the cluster. Clusters with Ka/Ks ≤ 0.5, indicating purifying selection and therefore very likely coding, were considered as passing the Ka/Ks filter. In addition, the remaining clusters were examined by running codeml with model M3. This partitioned the positions of the alignment into three classes that may be evolving differently (typically, a few positions may be under positive selection while the remainder of the sequence is conserved). A likelihood ratio test was applied to select clusters for which M3 explained the data significantly better than M0 . If a cluster was thus selected, and if one of the resulting partitions had a Ka/Ks ≤ 0.5 and comprised at least 10% of the sequence, then that cluster was also considered as passing the Ka/Ks filter. All other clusters were marked as containing spurious ORFs. Statistics for the various stages of the clustering process The number of sequences that remain after redundancy removal (at 98% similarity) for each dataset is given in Table 12. Recall that the size of a cluster is the number of nonredundant sequences in it. Number of core sets of size two or more totals 1,586,454; number of nonredundant sequences in core sets of size two or more totals 8,337,256; and total number of sequences in core sets of size two or more is 12,797,641. Total number of clusters after profile merging and (PSI-BLAST and BLAST) recruitment is 1,871,434; number of clusters of size two or more totals 1,388,287; number of nonredundant sequences in clusters of size two or more totals 11,494,078; total number of sequences in clusters of size two or more is 16,565,015. The final clustering statistics (after shadow ORF detection and Ka/Ks tests) are as follows: number of clusters of size two or more totals 297,254; number of nonredundant sequences in clusters of size two or more totals 6,212,610; total number of sequences in clusters of size two or more is 9,978,637. In the final BLAST recruitment step, a pattern was seen involving highly compositionally biased sequences that recruited unrelated sequences to clusters. This was reflected in the pre- and post-BLAST recruitment numbers, where the postrecruitment sizes were more than three to four times the size of the prerecruitment numbers. There were 75 such clusters, and these were removed. Searching sequences using profile HMMs. The full set of 7,868 Pfam release 17 models was used, along with additional nonredundant profiles from TIGRFAM (1,720 of 2,443 profiles; version 4.1). HMM profiling was carried out using a TimeLogic DeCypher system (Active Motif, Inc., http://www.activemotif.com) and took 327 hours in total (on an eight-card machine). A sequence was considered as matching a Pfam (fragment model) if its sequence score was above the TC score for that Pfam and had an E-value ≤ 1 × 10−3. It was considered as matching a TIGRFAM if the match had an E-value ≤ 1 × 10−7. Evaluation of protein prediction via clustering. Our evaluation of protein prediction via the clustering shows a very favorable comparison to currently used protein prediction methods for prokaryotic genomes. We used the PG dataset for this evaluation (Table 2). Of the 3,049,695 PG ORFs, 575,729 sequences (19%) were clustered (the clustered set). Of the 614,100 predictions made by the genome projects, 600,911 sequences could be mapped to the PG ORF set (the submitted set); 93% of the unmapped sequences were 50% of their occurences is in one cluster. For the second evaluation, we selected all sequences with Pfam matches, and each sequence was assigned to the Pfam that matches it with the highest score. With this assignment, the Pfams induce a partition on the sequences. The distribution of the number of sequences in clusters induced by the Pfams was compared to those of clusters from the clustering method. Figure 12A shows comparison as a log–log plot of the number of sequences versus the number of clusters with at least that many sequences for the two cases. The plot shows that cluster size distributions are quite similar, with both the methods having an inflection point around 2,500. The difference between the two curves is that there are more big clusters (and also fewer small clusters) induced by the Pfams as compared to the clustering method. This can be explained by noting that two sequences that are in the same Pfam cluster can nevertheless be put into different clusters by the clustering method if they differ in their remaining portions. Our clustering also shows a good correspondence with HMM profiling on the phylogenetic markers that we looked at. The clustering identifies 7,423, 12,553, and 13,657 sequences, respectively, for RecA (cluster ID 1146), Hsp70 (cluster ID 197), and RpoB (cluster ID 1187). HMM profiling identifies 5,292, 12,298, and 12,165 sequences, respectively, for these families. For each of these families, there are at least 94% of sequences (relative to the smaller set) in common between clustering and HMM profiling. Difference in ratio of predicted proteins to total ORFs for the PG set and the GOS set. The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (0.3471) compared to the PG ORFs (0.1888). This can be explained by the fragmentary nature of the GOS data. For the large majority of the GOS data, the average sequence length is 920 bp compared to full-length genomes for the PG data. For the PG data, clustered ORFs have a mean length of 325 aa and a median length of 280 aa. Unclustered ORFs have a mean length of 119 aa and a median length of 87 aa. Assuming that the genomic GOS data has a similar underlying ORF structure to PG data, the effect that GOS fragmentation had on ORF lengths is estimated. Each reading frame will have a mixture of clustered and unclustered ORFs, but on average there will be 2 ORFs per reading frame per 920-bp GOS fragment, and both ORFs will be truncated. Assuming the truncation point for the ORF is uniformly distributed across the ORF, the truncated ORF will drop below the 60-aa threshold to be considered as an ORF with a probability of 60/(length of the ORF). Using the median length, the percentage of clustered ORFs dropping below the threshold due to truncation is 21%; for unclustered ORFs, it is 69%. Accounting for this truncation, the expected ratio of clustered ORFs to total ORFs for the GOS ORFs based on the PG ORFs would be 0.3708, which is very close to the observed value. Kingdom assignment strategy and its evaluation. We used several approaches to assign kingdoms for GOS sequences. They are all fundamentally based upon a strategy that takes into account top BLAST matches of a GOS sequence to sequences in NCBI-nr, and then voting on a majority. We evaluated a simple strict-majority voting scheme (of the top four BLAST matches) using the NCBI-nr set. First, the redundancy in NCBI-nr was removed using a two-staged process. A nonredundant set of NCBI-nr sequences was computed involving matches with 98% similarity over 95% of the length of the shorter sequence (using the procedure discussed in Materials and Methods [Identification of nonredundant sequences]). This set was made further nonredundant by considering matches involving 90% similarity over 95% of the length of the shorter sequence. The nonredundant sequences that remained after this step constituted the evaluation dataset S. For each sequence in S, its top four BLAST matches to other sequences in S (ignoring self-matches) were used to assign a kingdom for it (based on a strict majority rule). This predicted kingdom assignment for the sequence was compared to its actual kingdom. A correct classification is obtained for 93% of the sequences. The correct classification rate per kingdom is given in Table 13. While this evaluation shows that the BLAST-based voting scheme provides a reasonable handle on the kingdom assignment problem, there are caveats associated with it. The kingdom assignment for a set of query sequences is greatly influenced by the taxonomic groups from each kingdom that are represented in the reference dataset against which these queries are being compared. If certain taxa are only sparsely represented in the reference set, then, depending on their position in the tree of life, queries from these taxa can be misclassified (using a nearest-neighbor type approach based on BLAST matches). This explains why the archaeal classification rate is quite low compared to the others. Thus, the true classification rate for the GOS dataset based on this approach will also depend on the differences in taxonomic biases in the GOS dataset (query) and the NCBI-nr set (reference). The kingdom proportion for the GOS dataset reported in Figure 1 is based on a kingdom assignment of scaffolds. Those GOS ORFs with BLAST matches to NCBI-nr were considered, and the top-four majority rule was used to assign a kingdom to each of them. Using the ORF coordinates on the scaffold, the fraction (of bp) of a scafffold assigned to each kingdom was computed. The scaffold was labeled as belonging to a kingdom if the fraction of the scaffold assigned to that kingdom was >50%. All ORFs on this scaffold were then assigned to the same kingdom. Cluster size distribution, the power law, and the rate of protein family discovery. Earlier studies of protein family sizes in single organisms [137–139] have suggested that P(d), the frequency of protein families of size d, satisfies a power law: that is, P(d) ≈ d − β with exponent β reported between 2.68 and 4.02. Power laws have been used to model various biological systems, including protein–protein interaction networks and gene regulatory networks [42,43]. Figure 12B illustrates the distribution of the cluster sizes from our data on a log–log scale, a scale for which a power law distribution gives a line. In contrast to family size distributions reported in single organisms, the cluster sizes from our data are not well described by a single power law. Rather, there appear to be different power laws: one governs the size distribution of very large clusters, and another describes the rest. This behavior is observed both in the distribution of the core set sizes and also in the distribution of the final cluster sizes. We identified an inflection point for both the core set distribution and the final clusters at around size 2,500, and estimated the power law exponent β via linear regression separately in each size regime. For the core set distribution, the exponent β = 1.99 (R 2 = 0.994) for clusters of size ≤ 2,500, and β = 3.34 (R 2 = 0.996) for clusters of size > 2,500. For the final cluster sizes, the exponent β = 1.72 (R 2 = 0.995) for clusters of size ≤ 2,500, and β = 2.72 (R 2 = 0.995) for clusters of size > 2,500. The estimates for β are different for the core clusters compared to the final clusters, reflecting a larger number of medium and large clusters in the final clustering as a result of the cluster-merging and additional recruitment steps. A similar dichotomy between the size distributions of large and small protein families was observed in a study  of protein families contained in the ProDom, Protomap, and COG databases, where the exponent β reported was in the range of 1.83 to 1.98 for the 50 smallest clusters and 2.54 to 3.27 for the 500 largest clusters in these databases. Our clustering method was run separately on the following seven datasets: set 1 consisted of only NCBI-nr sequences; set 2 consisted of all sequences in NCBI-nr, ENS, TGI-EST, and PG; sets 3 through 6 consisted of set 2 in combination with a random subset of 20%, 40%, 60%, and 80% of the GOS sequences, respectively; set 7 consisted of set 2 in combination with all the GOS sequences. On each of the seven datasets, the redundancy removal (using the 98% similarity filter) was run, followed by the core set detection steps. Figure 2 shows the number of core sets of varying sizes (≥3, ≥5, ≥10, and ≥20) as a function of the number of nonredundant sequences for each dataset. The observed linear growth in number of families with increase in sample size n is related to the power law distribution in the following way. We model protein families as a graph where each vertex corresponds to a protein sequence and an edge between two vertices indicates sequence similarity between the corresponding proteins. Consider a clustering (partitioning) of the vertices of a graph with n vertices such that the cluster sizes obey a power law distribution. Let C d (n) [respectively, C≥d (n)] denote the number of clusters of size d (respectively, ≥d). Since the distribution of cluster sizes follows a power law, there exist constants α, β such that for all x ≤ n, Cx (n) = αx− β. As every vertex of the graph is a member of exactly one cluster, The number of clusters of size at least d is Combining the two equations, we obtain values (up to a multiplicative constant) for C ≥d (n) as shown in Table 14. In all cases with β > 1, the number of clusters C ≥d (n) increases as n increases, and as d decreases. Specifically, for β > 2, the growth is linear in n for all d, with slope decreasing as d increases. For 1 50% of their neighboring ORFs were assigned the same kingdom. The general trends observed are the same for each method, though the coverage decreases slightly for the more stringent methods. Characteristics and kingdom distribution of known protein domains. For these analyses we used the predicted proteins from the public (NCBI-nr, PG, TGI-EST, and ENS) and GOS datasets. The public dataset contains multiple identical copies of some sequences due to overlaps between the source datasets. For example, many sequences in PG are also found in NCBI-nr. We filtered the public set at 100% identity to avoid overcounting these sequences. Because this filtering was necessary for the public dataset, the GOS dataset was also filtered at 100% identity. If two or more sequences were 100% identical at the residue level, but were of different lengths, only the longest sequence was kept. The resulting datasets of nonredundant proteins are referred to as public-100 and GOS-100. We assigned each protein in public-100 to a kingdom based on the species annotations provided in the source datasets (NCBI-nr, Ensembl, TIGR, and PG). The NCBI taxonomy tree was used to determine the kingdom of each species. Of 3,167,979 protein sequences in public-100, 3,158,907 can be annotated by kingdom. The remaining 9,072 sequences are largely synthetic. Determining the kingdom of origin of an environmental sequence can be difficult; while an unambiguous assignment can be made for some sequences, others can be assigned only tentatively or not at all. Therefore, we took a probabilistic approach (kingdom-weighting method), calculating “weights” or probabilities that each protein sequence originated from a given kingdom. The top four BLAST matches (E-value 90% similarities were grouped. NCBI-nr90 was then clustered to NCBI-nr80/70/60/50 and finally NCBI-nr30. After each clustering stage, the total number of clusters of NCBI-nr was decreased and non-ORFans were identified. A one-step clustering from NCBI-nr directly to NCBI-nr30 can be performed. However, the multistep clustering is computationally more efficient. At the 30% similarity level, all the NCBI-nr proteins were grouped into 391,833 clusters, including 259,571 singleton clusters. The proteins in nonsingleton clusters are by definition non-ORFans. However, proteins that remain as singletons are not necessarily ORFans, because their similarity to other proteins may not be reported for two reasons: (1) significant sequence similarity can be <30%; and (2) in order to prevent a cluster from being too diverse, CD-HIT, like all other clustering algorithms, may not add a sequence to that cluster even if the similarity between this sequence and a sequence in that cluster meet the similarity threshold. The 259,571 singletons were compared to NCBI-nr with BLASTP  to identify real ORFans from them. The default low-complexity filter was enabled in the BLAST comparisons, and similarity threshold in the form of an E-value was set to 1 × 10−6. In the end, 84,911 proteins with at least 100 aa are identified as ORFans. About 100,000 short ORFans less than 100 aa were removed from this study, because they may not be real proteins. Genome sequencing projects and rate of discovery. We used Ensembl sequences for Homo sapiens, Mus musculus, Rattus norvegicus, Canis familiaris, and Pan troglodytes. Their clustering information is shown in Table 15. When we considered the datasets in the order HS, HS + MM, HS + MM + RN, HS + MM + RN + CF, and HS + MM + RN + CF + PT, the numbers of distinct clusters were 10,536, 12,731, 13,605, 14,606, and 14,993, respectively. These numbers were compared against a random subset of NCBI-nr bacterial sequences (of a similar size) and also against a random subset of GOS sequences. We also randomized the order of the mammalian sequences to produce a dataset that was independent of the genome order being considered. Supporting Information Protocol S1 Supplementary Information (25 KB DOC) Click here for additional data file. Accession Numbers All NCBI-nr sequences from February 10, 2005 were used in our analysis. Protocol S1 lists the GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession numbers of (1) the genomic sequences used in the PG set, (2) the sequences used in building GS profiles, and (3) the NCBI-nr sequences used in building the IDO phylogeny. The other GenBank sequences discussed in this paper are Bacillus sp. NRRL B-14911 (89089741), Janibacter sp. HTCC2649 (84385106), Erythrobacter litoralis (84785911), and Nitrosococcus oceani (76881875). The Pfam (http://pfam.cgb.ki.se) structures discussed in this paper are envelope glycoprotein GP120 (PF00516), reverse transcriptase (PF00078), retroviral aspartyl protease (PF00077), bacteriophage T4-like capsid assembly protein (Gp20) (PF07230), major capsid protein Gp23 (PF07068), phage tail sheath protein (PF04984), IDO (PF01231), poxvirus A22 protein family (PF04848), and PP2C (PF00481). The glutamine synthetase TIGRFAM (http://www.tigr.org/TIGRFAMs) used in the paper is GlnA: glutamine synthetase, type I (TIGR00653). The PDB (http://www.rcsb.org/pdb) identifiers and the names of the eight PDB ORFans with GOS matches are: restriction endonuclease MunI (1D02), restriction endonuclease BglI (1DMU), restriction endonuclease BstYI (1SDO), restriction endonuclease HincII (1TX3); alpha-glucosyltransferase (1Y8Z), hypothetical protein PA1492 (1T1J), putative protein (1T6T), and hypothetical protein AF1548 (1Y88).