23
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      A distinct lineage of giant viruses brings a rhodopsin photosystem to unicellular marine predators

      research-article
      a , b , c , a , d , a , d , a , d , e , a , a , d , a , 3 , f , b , f , c , g , h , i , i , f , a , g , i , e , j , c , b , k , 4 , a , d , 4
      Proceedings of the National Academy of Sciences of the United States of America
      National Academy of Sciences
      giant viruses, viral evolution, marine carbon cycle, single-cell genomics, host–virus interactions

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Significance

          Although viruses are well-characterized regulators of eukaryotic algae, little is known about those infecting unicellular predators in oceans. We report the largest marine virus genome yet discovered, found in a wild predatory choanoflagellate sorted away from other Pacific microbes and pursued using integration of cultivation-independent and laboratory methods. The giant virus encodes nearly 900 proteins, many unlike known proteins, others related to cellular metabolism and organic matter degradation, and 3 type-1 rhodopsins. The viral rhodopsin that is most abundant in ocean metagenomes, and also present in an algal virus, pumps protons when illuminated, akin to cellular rhodopsins that generate a proton-motive force. Giant viruses likely provision multiple host species with photoheterotrophic capacities, including predatory unicellular relatives of animals.

          Abstract

          Giant viruses are remarkable for their large genomes, often rivaling those of small bacteria, and for having genes thought exclusive to cellular life. Most isolated to date infect nonmarine protists, leaving their strategies and prevalence in marine environments largely unknown. Using eukaryotic single-cell metagenomics in the Pacific, we discovered a Mimiviridae lineage of giant viruses, which infects choanoflagellates, widespread protistan predators related to metazoans. The ChoanoVirus genomes are the largest yet from pelagic ecosystems, with 442 of 862 predicted proteins lacking known homologs. They are enriched in enzymes for modifying organic compounds, including degradation of chitin, an abundant polysaccharide in oceans, and they encode 3 divergent type-1 rhodopsins (VirR) with distinct evolutionary histories from those that capture sunlight in cellular organisms. One (VirR DTS) is similar to the only other putative rhodopsin from a virus (PgV) with a known host (a marine alga). Unlike the algal virus, ChoanoViruses encode the entire pigment biosynthesis pathway and cleavage enzyme for producing the required chromophore, retinal. We demonstrate that the rhodopsin shared by ChoanoViruses and PgV binds retinal and pumps protons. Moreover, our 1.65-Å resolved VirR DTS crystal structure and mutational analyses exposed differences from previously characterized type-1 rhodopsins, all of which come from cellular organisms. Multiple VirR types are present in metagenomes from across surface oceans, where they are correlated with and nearly as abundant as a canonical marker gene from Mimiviridae. Our findings indicate that light-dependent energy transfer systems are likely common components of giant viruses of photosynthetic and phagotrophic unicellular marine eukaryotes.

          Related collections

          Most cited references67

          • Record: found
          • Abstract: found
          • Article: not found

          The 1.2-megabase genome sequence of Mimivirus.

          We recently reported the discovery and preliminary characterization of Mimivirus, the largest known virus, with a 400-nanometer particle size comparable to mycoplasma. Mimivirus is a double-stranded DNA virus growing in amoebae. We now present its 1,181,404-base pair genome sequence, consisting of 1262 putative open reading frames, 10% of which exhibit a similarity to proteins of known functions. In addition to exceptional genome size, Mimivirus exhibits many features that distinguish it from other nucleocytoplasmic large DNA viruses. The most unexpected is the presence of numerous genes encoding central protein-translation components, including four amino-acyl transfer RNA synthetases, peptide release factor 1, translation elongation factor EF-TU, and translation initiation factor 1. The genome also exhibits six tRNAs. Other notable features include the presence of both type I and type II topoisomerases, components of all DNA repair pathways, many polysaccharide synthesis enzymes, and one intein-containing gene. The size and complexity of the Mimivirus genome challenge the established frontier between viruses and parasitic cellular organisms. This new sequence data might help shed a new light on the origin of DNA viruses and their role in the early evolution of eukaryotes.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Microbial and Animal Rhodopsins: Structures, Functions, and Molecular Mechanisms

            1 Introduction Organisms of all domains of life use photoreceptor proteins to sense and respond to light. The light-sensitivity of photoreceptor proteins arises from bound chromophores such as retinal in retinylidene proteins, bilin in biliproteins, and flavin in flavoproteins. Rhodopsins found in Eukaryotes, Bacteria, and Archaea consist of opsin apoproteins and a covalently linked retinal which is employed to absorb photons for energy conversion or the initiation of intra- or intercellular signaling. 1 Both functions are important for organisms to survive and to adapt to the environment. While lower organisms utilize the family of microbial rhodopsins for both purposes, animals solely use a different family of rhodopsins, a specialized subset of G-protein-coupled receptors (GPCRs). 1,2 Animal rhodopsins, for example, are employed in visual and nonvisual phototransduction, in the maintenance of the circadian clock and as photoisomerases. 3,4 While sharing practically no sequence similarity, microbial and animal rhodopsins, also termed type-I and type-II rhodopsins, respectively, share a common architecture of seven transmembrane α-helices (TM) with the N- and C-terminus facing out- and inside of the cell, respectively (Figure 1). 1,5 Retinal is attached by a Schiff base linkage to the ε-amino group of a lysine side chain in the middle of TM7 (Figures 1 and 2). The retinal Schiff base (RSB) is protonated (RSBH+) in most cases, and changes in protonation state are integral to the signaling or transport activity of rhodopsins. Figure 1 Topology of the retinal proteins. (A) These membrane proteins contain seven α-helices (typically denoted helix A to G in microbial opsins and TM1 to 7 in the animal opsins) spanning the lipid bilayer. The N-terminus faces the outside of the cell and the C-terminus the inside. Retinal is covalently attached to a lysine side chain on helix G or TM7, respectively. (B) Cartoon representation of the helical arrangement of a microbial rhodopsin with attached all-trans-retinal (bacteriorhodopsin, PDB ID: 1C3W). Figure 2 Genesis of the chromophore of microbial and animal rhodopsins. Cleavage of β-carotene is the source of the chromophore. The ground state of microbial and animal rhodopsins possesses all-trans- and 11-cis-retinal as its chromophore, respectively, bound to a Lys residue via a Schiff base, which is normally protonated and exists in the 15-anti configuration. It should be noted that microbial rhodopsins depend exclusively on all-trans-retinal, while some animal rhodopsins possess vitamin A2 (C3=C4 double bond for fish visual pigments) and hydroxyl (C3—OH for insect visual pigments) forms of 11-cis-retinal. Usually, photoactivation isomerizes microbial rhodopsin selectively at the C13=C14 double bond and animal rhodopsin at the C11=C12 double bond. Retinal, the aldehyde of vitamin A, is derived from β-carotene and is utilized in the all-trans/13-cis configurations in microbial rhodopsins and the 11-cis/all-trans configurations in animal rhodopsins (Figure 2). 1,6 For optimal light to energy or light to signal conversion, defined chromophore–protein interactions in rhodopsins direct the unique photophysical and photochemical processes, which start with specific retinal isomerization and culminate with distinct protein conformational changes. The protein environment is typically optimized for light-induced retinal isomerization from all-trans → 13-cis in microbial rhodopsins and for 11-cis → all-trans in animal rhodopsins. Variations in this isomerization pattern are discussed in sections 3 and 4. The 7TM protein scaffold of microbial rhodopsins is designed for light-driven ion pumps, light-gated ion channels, and light sensors which couple to transducer proteins (Figure 3). 7 Microbial rhodopsins were first found in the Archaea (Halobacterium salinarum, historically referred to as Halobacterium halobium) 8 and were therefore initially termed archaeal rhodopsins. H. salinarum contains bacteriorhodopsin (BR) 8 and halorhodopsin (HR) 9 that act as a light-driven outward proton pump and inward chloride pump, respectively. As ion pumps, they contribute to the formation of a membrane potential and thus have their function in light–energy conversion. The other two H. salinarum rhodopsins are sensory rhodopsin I and II (SRI and SRII), 10 which act as positive and negative phototaxis sensors, respectively. Since the original discovery of BR in H. salinarum, similar rhodopsins have been found in Eubacteria and lower Eukaryota, leading to the name microbial rhodopsins. For example, Anabaena sensory rhodopsin (ASR), the first sensory rhodopsin observed in the Eubacteria, 11 is a sensor that activates a soluble transducer (Figure 3). Figure 3 Microbial rhodopsins can function as pumps, channels, and light-sensors. Arrows indicate the direction of transport or flow of signal: (A) light-driven inward chloride pump (halorhodopsin (HR), PDB ID: 1E12), (B) light-driven outward proton pump (bacteriorhodopsin (BR), PDB ID: 1C3W), (C) light-gated cation channel (channelrhodopsin (ChR), PDB ID: 3UG9), (D) light-sensor activating transmembrane transducer protein (sensory rhodopsin II (SRII), PDB ID: 1JGJ), (E) light-sensor activating soluble transducer protein (Anabaena sensory rhodopsin (ASR), PDB ID: 1XIO). Channelrhodopsins (ChRs), another group of microbial rhodopsins, were discovered in green algae where they function as light-gated cation channels within the algal eye to depolarize the plasma membrane upon light absorption. 12,13 The primary depolarization of the eyespot membrane is transferred to the flagellar membrane and results in a reorientation of the alga toward a light source (photophobic responses and phototaxis). Thus, ChRs naturally function as signaling photoreceptors as well. Discovery of ChR led to an emergence of a new field, optogenetics, 14 in which light-gated ion channels and light-driven ion pumps are used to depolarize and hyperpolarize selected cells of neuronal networks, e.g., for therapeutic reasons 15 or in order to understand the circuitry of the brain. 16,17 Thus, studies on microbial rhodopsins are beneficial not only for our basic understanding of retinal proteins, but also for providing a toolset to study neuronal signaling through optogenetics. Animal rhodopsins belong to the superfamily of GPCRs which detect extracellular signals, typically by binding small molecule ligands like hormones and neurotransmitters. 18,19 By a ligand-induced conformational change, GPCRs become activated and capable of transducing the activation signal by catalyzing GDP/GTP exchange on membrane-bound heterotrimeric G proteins within the cell, thus initiating G protein-mediated signaling cascades (Figure 4). 19 After activation-dependent phosphorylation by a G-protein-coupled receptor kinase, active GPCRs can also interact with arrestin to effect G protein-independent signaling and attenuation of the ligand-mediated activation signal. 20 Animal rhodopsins are typically specialized GPCRs, capable of detecting single photons as a physical stimulus. 21 Because the 11-cis-retinal ligand is covalently bound within the retinal-binding pocket of the receptor, photon absorption and the ensuing retinal cis → trans isomerization convert an inactivating ligand (the inverse agonist 11-cis-retinal) into an activating ligand (the agonist all-trans-retinal) in situ. Vertebrate rhodopsin was discovered more than 130 years ago and has long been used as a prototypical GPCR. 22 Due to the relative ease of purification from native material, it has been studied extensively. 2 Figure 4 Animal rhodopsins are specialized G-protein-coupled receptors (GPCRs). (A) Binding of extracellular ligands stabilizes certain GPCR conformations which enable the GPCR to catalyze GDP/GTP exchange in heterotrimeric G proteins (Gαβγ) and/or to induce G-protein-independent, arrestin-mediated signaling. (B) Typical GPCR fold shown in cartoon representation for bovine rhodopsin (PDB ID: 1U19). Structures of animal and microbial rhodopsins differ largely (cf. Figure 1B) and are drawn in opposite orientations with respect to the membrane. As model for the GPCR family, animal rhodopsin is shown in the orientation commonly used for GPCRs. In a large number of publications, animal rhodopsins are shown for historical reasons in the orientation of microbial rhodopsins (C-terminus up). The goal of this review is to provide mechanistic insights from biophysical and structural studies into the function of microbial and animal rhodopsins, with the latter as representatives of GPCRs. After a general description of retinal photoisomerization in section 2, the functional mechanisms of various basic types of microbial rhodopsins will be discussed in section 3. In section 4, animal rhodopsins will be reviewed with a focus on bovine visual rhodopsin (abbreviated as Rho in some cases), the photoreceptor in retinal rod cells, followed by some extension to color visual rhodopsins and invertebrate rhodopsins. 2 Light Absorption and Photoisomerization 2.1 Color Tuning Light absorption initiates functions of both microbial and animal rhodopsins, 23,24 and the wavelength dependence of the absorption efficiency determines the colors of the proteins (Figure 5). The length of the π-conjugated polyene chain in the retinal chromophore as well as the protonation of the RSB linkage determine the energy gap of the π–π* transition, 25 so that the absorption of most rhodopsins is within the visible region (400–700 nm). Humans have a single photoreceptor for dim light vision (rhodopsin, λmax ∼500 nm) and three receptors for color vision (blue, λmax ∼425 nm; green, λmax ∼530 nm; red, λmax ∼560 nm), 26,27 whereas some shrimp species contain up to 16 rhodopsins covering the spectral range from 300 to 700 nm. 28 While the chromophore molecule is usually the same in all pigments (retinal bound via a (protonated) Schiff base), the absorption maxima differ significantly, implying an active protein control of the energy gap between the ground and excited states of the retinal chromophore. The mechanism of color tuning has fascinated researchers for a long time, and several factors have been determined to be responsible for it. Figure 5 Microbial rhodopsins exhibit a wide range of absorption maxima. Colors of microbial rhodopsins (A) and their absorption spectra (B). The following rhodopsins are shown: (1) a blue-proteorhodopsin (LC1-200, pH 7), (2) Q105L mutant of LC1-200 (pH 7), (3) a green-proteorhodopsin (EBAC31A08, pH 7), (4) A178R mutant of green-proteorhodopsin (pH 7), (5) bacteriorhodopsin (pH 7), (6) H. salinarum sensory rhodopsin I (pH 4). The protonation state of the chromophore plays a crucial role in color tuning; the unprotonated RSB absorbs in the UV region (λmax ∼360–380 nm), and this absorption is quite insensitive to the environment in contrast to the RSBH+, which exhibits a large variation in absorption covering the entire visible light spectrum. Other factors defining the spectral tuning of individual rhodopsins are given by chromophore–protein interactions such as electrostatic interactions with charged and polar amino acids, termed electrostatic tuning and extensively studied, first using retinal analogues, 29−32 and, later, site-directed mutagenesis. 33−35 Electrostatic tuning was elegantly demonstrated in a model system based on cellular retinol-binding protein II. This system was engineered to covalently bind all-trans-retinal and its absorption maximum was changed from 425 to 644 nm via mutations that changed the electrostatic potential within the retinal-binding pocket. 36 Interactions of retinal with charged, polar, and aromatic amino acids play a role in changing the electronic energy levels, as do hydrogen-bonding interactions and steric contact effects. Strong hydrogen bonds can lead to charge transfer, and steric contacts can lead to a twist of retinal. All these tuning processes in concert shape the absorbance maxima of retinal in microbial and animal rhodopsins. One of the most prominent factors in color tuning is the interaction of retinal with the counterion(s) (Figure 6). In the ground state, the retinal chromophore is positively charged due to RSBH+ (C=NH+). The excited state has strong charge transfer character where the positive charge is displaced toward the β-ionone ring, leading to a neutralization of the RSBH+ (Figure 6B). 37,38 Interaction of the RSBH+ with the negatively charged counterion(s) in microbial and animal rhodopsins leads to an electrostatic stabilization in the electronic ground state of retinal accompanied by an increase of the RSB pKa (Figure 6A). The resulting larger energy gap between ground and excited states causes a blue-shift of the absorption (Figure 6D, compare cases [A] and [B]). 39,40 If a negative charge is located near the β-ionone ring, the excited state is energetically stabilized compared to the ground state (Figure 6C,D), which leads to a smaller energy gap and therefore to a red-shift of the wavelength of electronic excitation. As the absorption maximum of isolated all-trans RSBH+ in gas phase is 610 nm (Figure 6B,D, case [B]), 40 in principle, absorption in the deep red range (λmax > 600 nm) should be possible for case [C] in Figure 6, while λmax 180 kJ/mol, thermal isomerization of a rhodopsin molecule once every 1010 years would be expected, implying that thermal and light-dependent processes follow different pathways. A more recent theoretical study proposed a pathway of thermal isomerization in rhodopsin with a transition state displaying the same charge-transfer character as the electronically excited state of Rho. 472 From a quantitative relation between rhodopsin’s photoactivation energy and its peak absorption, λmax, it was proposed that dark noise arises from thermal retinal isomerization which needs to overcome the same energy barrier as in the photoisomerization process. 399 On the basis of the slow hydrogen/deuterium exchange of Thr118 in bovine Rho, it was suggested that local protein structural fluctuations transiently widen the retinal binding pocket for thermal retinal isomerization. 473 It is interesting to note that Drosophila rhodopsin also has a light-independent role in temperature discrimination in larvae which may be related to thermal retinal isomerization. 474 To effectively release the structural constraints that stabilize the inactive rhodopsin state, photon energy is absorbed and used for retinal isomerization (cf. section 2), driving subsequent protein conformational changes. About 150 kJ/mol of the initially absorbed photon energy are stored in the “distorted” all-trans-retinal of the Batho photointermediate and gradually dissipated via a transient blue-shifted intermediate (BSI) 475 and the Lumi intermediate, concomitantly with a release of strain in retinal (Figures 21–23), eventually yielding Meta I after a few microseconds. The “early” photointermediates Batho and Lumi can be trapped by low temperature and have been studied structurally, 120,146 as well as spectroscopically. 476−478 The distinct absorption maxima of each intermediate reflect the gradual changes in chromophore–protein interaction. It is not until the formation of Meta I that significant backbone structural changes occur. 479,480 In Batho and to a lesser extent in Lumi intermediates, protein conformational adjustments to retinal relaxation are limited to a few amino acid side chains within the retinal binding pocket. 120,146 When the Meta I state is attained, an equilibrium between Meta I and Meta II states develops (Figure 21), which is dependent upon pH and temperature, with lower pH and higher temperature favoring Meta II. 481,482 Deprotonation of the RSBH+ upon Meta II formation results in the characteristic 100 nm blue-shift of the absorption maximum to the near UV region (λmax = 380 nm). 481,483 Meta II, as defined by its 380 nm absorption, comprises both the isochromic Meta IIa and Meta IIb substates which develop sequentially from Meta I. 484 In Meta IIb, proton uptake occurs to Glu134 of the (D/E)RY motif in TM3, 485,486 explaining why low pH favors Meta II despite the loss of a proton from the RSBH+. 487 Time-resolved EPR studies with bovine Rho in dodecylmaltoside detergent revealed that the large TM6 movement occurs during transition from Meta IIa to Meta IIb and led to the reaction scheme for the Meta states shown in Figure 21. 488 FTIR studies confirmed this reaction scheme for bovine Rho in its native membrane environment. 489,490 Electron crystallography on bovine Rho 2-D crystals that were illuminated and trapped in the Meta I photointermediate by the crystal lattice demonstrated that up until Meta I the protein backbone remains in a conformation similar to that of the rhodopsin dark state. 491 Illumination of 3-D bovine Rho crystals yielded the spectral shift characteristic to Meta II, but only revealed a small TM6 movement and some rearrangement of the cytoplasmic surface. 492,493 On the basis of absorption maximum and the extent of TM6 movement, the structure of this photoactivated Rho most likely represents the Meta IIa state. The Meta IIb state with fully opened cytoplasmic domain is represented in the bovine Meta II structures obtained by reconstitution of Ops* crystals with all-trans-retinal or by illumination of the constitutively active M257Y rhodopsin mutant before crystallization. 426,433 Examination of agonist-bound GPCR structures reveals that GPCRs exist in conformations with differing extents of TM6 movement, with no or little TM6 movement for inverse agonists and larger TM6 movement for partial agonists and full agonists. 418 However, stabilizing mutations, truncations of loops, and in many cases insertion of T4 lysozyme or apocytochrome b562 fusion partners into cytoplasmic loop 3 (connecting TMs 5 and 6) or at the N-terminus have been necessary for the crystallization of all nonrhodopsin GPCRs to have their structures determined to date. 419,494,495 This affects affinity for agonist or antagonist binding and likely exerts some influence upon the degree of movement of TM6. 496−498 NMR and hydrogen/deuterium exchange studies on β2AR lacking fusion partners provide evidence for substantial conformational heterogeneity of agonist- and inverse agonist-bound β2-AR preparations. 499−502 The heterogeneity supports the view of conformational equilibria of GPCRs that can be shifted to either side depending on the type of ligand and are comparable to the Meta I/Meta II equilibrium of rhodopsin (Figures 21 and 22). 503,504 A difference between diffusible ligand-activated GPCRs and the activation of rhodopsin by light is reflected in the mode by which the ligand acts in the activation process. It was proposed that diffusible ligands might select the suitable conformation from the equilibrium of inactive and active GPCR conformations, whereas in an induced-fit scenario the ligand would bind an inactive conformation and induce a conformational change toward the active conformation. 503 Rhodopsin with its activation by retinal photoisomerization is likely to correspond to an induced-fit scenario for initial events, whereas both scenarios are conceivable for later activation phases as well as for GPCRs activated by diffusible ligands. 4.2.2 Rhodopsin Activation The steps involved in rhodopsin activation are illustrated in Figures 24–26. In the rhodopsin dark state, 11-cis-retinal is tightly bound as an inverse agonist in its binding pocket and covalently fixed by the RSBH+ to Lys296 on TM7. The positively charged RSBH+ is stabilized by a complex counterion comprising negatively charged residues Glu113 on TM3 and Glu181 on extracellular loop 2, 505,506 with the former functioning as the primary counterion (Figure 25A). The β-ionone ring at the other end of the retinal is ensconced in a hydrophobic pocket formed by aromatic side chains, and the conjugated double-bond system linking the two is tightly engaged by a constriction in the retinal binding site which forces a negative 6-s-cis twist of the β-ionone ring about the C6—C7 single bond and a twist about the C11=C12 double bond. 46,420,421 A twist about the C12—C13 single bond results from a steric interaction between a proton at C10 and the methyl group at C20. In addition to the pretwist of the C11=C12 double bond, the proximity of the negatively charged Glu181, which was predicted from modeling, 507 reduces the bond order further to enable selective isomerization around the C11=C12 double bond in the direction shown in Figure 23. Following the gradual release of the potential energy stored in the distorted retinal–protein complex via Batho and the transient BSI, it is not until Lumi that a displacement of the β-ionone ring is observed as a result of the elongation of the retinal. The local perturbations of amino acid side chains in Lumi increase slightly when compared with Batho, but the protein structural changes are still limited to the residues making up the retinal binding site and do not propagate to the cytoplasmic surface. 146 FTIR studies using site-directed infrared labels suggest that the first global movement of the protein backbone is a small rotation of TM5 and TM6 which occurs upon Meta I formation. 479 Solid-state NMR experiments on Meta I and changes in electron density in TM6 on the side facing retinal in the Meta I electron crystallography structure are consistent with a slight motion of TM6 491 which can be described as rotation, but not outward movement. 480 Also consistent with this, another rhodopsin-specific constraint, the TM3/TM5 hydrogen bonding network including Glu122 and Trp126 on TM3 and His211 on TM5, changes upon formation of Meta I as concluded from FTIR data. 508 Straightening of the retinal due to isomerization is thought to move the β-ionone ring toward the region of Met207 to Phe212 on TM5 and thus driving TM rearrangement. 480,509,510 Figure 26 Structural and functional changes in the activation pathway of bovine Rho based on structural and complementary biophysical data discussed in the text and ref (490) and described in Figures 23 and 24. Until formation of Meta I, the RSBH+ remains protonated, but structural changes in the RSB region occur. In the rhodopsin dark state, RSBH+ interacts with the negatively charged Glu113 counterion (Figure 25A) 511−513 from which a hydrogen bonding network extends to Glu181 on extracellular loop 2. 46 On the basis of both UV–vis and Raman spectroscopy it was hypothesized that in Meta I Glu181 transfers a proton to Glu113 and the RSBH+ switches counterions from Glu113 to Glu181. 514 This view was later modified on the basis of FTIR, 505,515 NMR, 516 and molecular dynamics simulations 517 whose data argue for a complex-counterion made up by Glu113 and Glu181, with both residues being deprotonated and giving the retinal binding pocket a net negative charge. 505,516 In the complex-counterion switch model, Glu113 functions as the primary counterion in the photointermediates up to Lumi; in Meta I, the conformational changes in retinal lead to a shift of the counterion from Glu113 to Glu181. 505,515,517 With the deprotonation of the RSB, an equilibrium of Meta II states is reached, in which the RSB nitrogen is deprotonated prior to the occurrence of TM movements. 488,490 As a result of retinal isomerization, the RSBH+ reorients (Figure 23), and its high pK a (determined experimentally to be above 16 in the dark state 518 and suggested to have contributions from the negatively charged Glu181 505,506 ) drops so that the proton dissociates for the formation of the first Meta II state, Meta IIa. Concomitantly, Glu113 becomes protonated, and a direct internal proton transfer from the RSBH+ is likely to be the source of this proton. 519 According to FTIR studies, 490 RSBH+ deprotonation upon transition to Meta IIa leads to a rearrangement of the TM1/2/7 network as seen by changes of infrared bands assigned to Asp83 on TM2. In this water-mediated hydrogen-bonding network, Asp83 (on TM2) links Asn55 (on TM1) with the NPxxY(x)5,6F motif (TM7), and extends to Trp265 (on TM6) adjacent to retinal’s β-ionone ring. Activating conformational changes in the TM3/TM5 network occur in the Meta IIa → Meta IIb transition, as seen by changes of infrared bands assigned to Glu122. 490 This hydrogen bonding network links Glu122 and Trp126 (both on TM3) with His211 (on TM5) which is in contact with the β-ionone ring of retinal. Retinal movement toward TM5 exerts its full effect upon transition to Meta IIb, reflected in a weakening of the hydrogen bond to Glu122. 508,520 In the Meta IIa → Meta IIb transition, changes in infrared amide I marker bands indicate structural changes in the protein backbone. 490 According to Meta II structures, these structural alterations include an elongation of TM5 by 1.5 to 2.5 helix turns depending on the dark state reference structure (PDB ID: 1GZM/3C9L or 1U19, respectively) and a rotational tilt of the kinked TM6 where the TM rotation results in the 6–7 Å outward movement of the cytoplasmic end of TM6 (Figure 23). 426,429,433,435 Time-resolved EPR studies with a spin label sensor at position 227 on TM5, designed to detect TM6 movement, probed the Meta IIa → Meta IIb transition for this event, although the sensor might have detected movements of both TM5 and TM6. 488 TM6 rotation is facilitated by the breakage of the (D/E)RY ionic lock between TM3 and TM6 consisting of Glu134 and Arg135 (on TM3) and Glu247 and Thr251 (on TM6). The cytoplasmic end of TM5 contains two residues, Tyr223 and Lys231, which function as microswitches. 521 These residues face the lipid environment in the dark state, but stabilize the Meta IIb state when the Arg135-Glu247 ionic lock has been disrupted and TM5 and TM6 rearrangements have occurred (Figure 25C,D). By swiveling inward, Tyr223 can hydrogen bond to Arg135 of the ionic lock, thereby linking TM3 and TM5. Glu247 (on TM6) is freed in Meta IIb from Arg135 and can form a salt bridge to Lys231 on TM5, thus stabilizing TM6 in the outward position. Also in the transition to Meta II when TM6 moves, the electrostatic interaction between Tyr306 and Phe313 becomes disrupted allowing Tyr306 to swivel into the vacated space below Arg135 where it becomes part of an extended water-mediated hydrogen bonding network reaching from the retinal binding site via Ser298, Asp83, Asn302, Met257, Tyr306, and Tyr223 to Arg135. 435 Studies with mutants of the NPxxY(x)5,6F motif suggested dual roles for the NP and the Y(x)5,6F submotifs. 522 Whereas the first has a structural role related to the hydrogen bonding network, the latter is involved in the interaction with G protein. The crystal structure revealed how Tyr306 stabilizes Arg135 and thus the cytoplasmic cleft for binding of the Gtα C-terminus. In the structure of the Meta II·GαCT2 peptide complex the backbone carbonyls of Cys347 and Lys345 (on GαCT2) form hydrogen bonds to Arg135 and Gln312, respectively. The presence of the GαCT2 peptide in the cytoplasmic G protein binding cleft moves the cytoplasmic end of TM7 slightly toward the center, leading to a somewhat narrower binding cleft. The proposed water-mediated hydrogen bonding network provides an answer regarding how the retinal binding site and the G protein binding site 30 Å apart are connected. 426,435,464,465 4.2.3 Retinal Channeling in Rhodopsin The structure of bovine Ops* revealed that the retinal binding pocket has two openings toward the lipid bilayer, one opening between TM1 and 7, the other between TM5 and 6, which form the two halves of the channel leading to the retinal binding pocket (Figure 27). Using molecular docking of retinal, a 3–12 Å wide continuous channel through opsin with the retinal binding pocket as the central part was found. 523,524 The openings to these channels are lined with aromatic residues while the central part is more polar. A major constriction of the channel is around Lys296 which enforces a 90° elbow-like kink in the channel. Passage of that restriction would be easier for the kinked 11-cis-retinal, whereas the more elongated and rigid all-trans-retinal would require global conformational changes. A study with rhodopsins containing mutations throughout the channel failed to correlate specific opening(s) with retinal entry and exit. 525 However, the study suggested that the ease of retinal passage through constrictions in the channel is not rate-limiting for rhodopsin reconstitution or Meta II decay, but for RSB formation and hydrolysis, respectively. Upon rhodopsin activation, when the helix bundle opens up, bulk solvent molecules obtain access to the RSB, 526−528 most likely from the cytoplasmic side of rhodopsin for RSB hydrolysis and retinal release. 526 This hypothetical solvent channel appears to be quite narrow because only small nucleophiles such as water and hydroxylamine have access to the RSB for retinal hydrolysis. 526,529 Figure 27 Retinal channel in the Ops*/Meta II conformation. (A) Meta II structure (PDB ID: 3PXO) with the putative retinal channel indicated, rotated to face opening one (red arrow and half channel indicated in red mesh) which is located between TM1 and TM7, and (B) rotated to face opening two (green arrow and half channel indicated in green mesh) which is located between TM5 and TM6. Channels were determined using MOLE 577 on the Ops* structure (PDB ID: 3CAP). The Meta II structure was used for the figure so that the retinal could be shown as it is absent in the Ops* structure. The size of the opening of the retinal channel varies in different Ops*, Meta II, and Meta II-like structures due to variations in side chain rotamers. In the rhodopsin dark state with its compact helix bundle, and presumably in the conformation of inactive opsin, which is similar to dark state rhodopsin, 530 no opening is observed and the 11-cis-retinal appears to reside in a hermetically sealed retinal binding pocket (Figure 28A). 46,420,421 Opsin crystals in the Ops* conformation with an open channel therefore allowed reconstitution of Meta II in soaking experiments with all-trans-retinal. 426 All-trans-retinal can bind tightly in its binding pocket in Meta II with only slight adjustment of the surrounding amino acid side chains. The binding pocket, however, is flexible enough to allow retinal rotation along its long axis upon rhodopsin activation as concluded from the Meta II crystal structures. 426,433 A limited data set of intramolecular distances obtained by solid-state NMR experiments on Meta II is mostly consistent with the crystal structure. A different degree of retinal rotation in the NMR experiment, however, cannot be ruled out, but could be explained by the equilibrium of Meta II substates. 174,426,509,510,531 Figure 28 Retinal binding site of bovine rhodopsin. (A) Crystal structure of inactive dark state (PDB ID: 1U19) where 11-cis-retinal is tightly bound deep in the protein with no openings of the retinal binding site toward the lipidic environment. (B) In the active Ops* (or Meta II) conformation the retinal channel allows all-trans-retinal access and egress, and some detergents like β-d-octylglucoside, mimicking the all-trans-retinal chromophore, to enter the retinal binding site. Shown is an overlay of the two ligands in the retinal binding pocket as observed in the crystal structures of Meta II (all-trans-retinal depicted in yellow; PDB ID: 3PXO) and Ops* in complex with β-d-octylglucoside (depicted in green/red; two rotamers of Lys296 are shown in red; PDB ID: 4J4Q). Note that the ring moieties of chromophore and detergent are oriented in opposite directions. Whereas all-trans-retinal is covalently linked by the RSB to Lys296, β-d-octylglucoside is fixed in the ligand binding site by hydrogen bonding of its hydroxyl groups to the opsin environment. After uptake of 11-cis- or 9-cis-retinal by opsin (likely by the Ops* conformation when openings to the retinal channel are present) and RSB formation, a conformational change yields inactive rhodopsin or isorhodopsin, respectively, both of which generate the same photoproducts after photon absorption. 46,420,421,532 For the formation of artifical rhodopsin pigments, the retinal channel in opsin is flexible enough to take up a wide variety of bulkier retinal analogues, e.g., featuring larger or additional alkyl groups or alkyl rings which prevent isomerization around the C11=C12 double bond. 533 Recently, crystallographic evidence was provided that the detergent, β-d-octylglucoside, can bind within rhodopsin’s retinal binding pocket thus stabilizing the Ops* conformation (Figure 28B). 432 A study on rhodopsin reconstitution from bovine opsin and 11-cis-retinal in the presence of various glucose- or maltose-containing detergents suggested that even some maltoside detergents can enter the retinal channel, while the affinity of various detergents for opsin was dependent on the alkyl chain length. 432 The varied hydrogen-bonding possibilities of the glucose hydroxyl groups to opsin within the retinal binding pocket is reminiscent of the proposed dynamic binding of odorants within olfactory receptors. 534 With its lateral ligand entry and its active GPCR conformation, Ops* was suggested to be ideal for homology modeling of olfactory receptors binding olfactant agonists. 432 4.3 Mechanistic Variations in Other Rhodopsins 4.3.1 Color Pigments Rod and cone cells are responsible for scotopic and photopic vision, i.e., vision under low light and daylight conditions, respectively. Rods are characterized by high sensitivity, slow response, slow dark adaptation, and a single type of rhodopsin, whereas cones have low sensitivity, fast response, fast dark adaptation, and several types of cone rhodopsins. 3,535 Some of the features of cone cells can be attributed to the cone visual pigments, which absorb at different wavelengths [red (λmax = 560 nm), green (λmax = 530 nm), and blue (λmax = 420 nm) rhodopsins in humans] necessary to discriminate colors for color vision. The vertebrate cone opsins and rod rhodopsins (λmax = 500 nm) form a single family of homologous proteins, 26 where rod rhodopsins have evolved from cone visual pigments, being closer to the short-wavelength cone opsins. 536,537 The extinction coefficient and quantum yield and thus photosensitivity of bovine Rho and chicken green pigment are comparable. 538 Cone rhodopsins are also thought to undergo a photoactivation process similar to rod rhodopsin, with some variation for the long wavelength pigments. 539,540 A difference, however, is a shorter lifetime of the Meta II state, with parallel formation of Meta III (a late nonproductive photointermediate involved in the decay of the photoactivated state to the apoprotein opsin and retinal) from Meta I (which is in equilibrium with Meta II, Figure 22). 538,541,542 Additionally, regeneration of cone opsins with 11-cis-retinal is faster than that of rod rhodopsins. 538,543 The faster kinetics of cone opsins (by 1–2 orders of magnitude) of the active state decay and rhodopsin regeneration are optimal and necessary for the high light levels present during the daytime. Site-directed mutagenesis studies identified amino acids at position 122 and 189 (Glu122 and Ile189 in bovine Rho, replaced by Gln/Ile and Pro in cone opsins) to be responsible for this functional difference. 544,545 In bovine Rho, Glu122 is a part of a hydrogen bonding network with His211 (TM3/5 network). 546 Glu122 and Ile189, present in rhodopsin but not in cone pigments, also potentiate a more efficient G protein activation compared with cone pigments. 542 The lower Gt activation capacity of cone pigments and faster Meta II decay contribute to the lower photosensitivity of cones compared with rods. Formation and decay of Meta II are related to TM6 movements as outlined above, and EPR experiments could potentially give insights into structural dynamics and conformational changes of color pigments. Unfortunately, because of the difficulties in the sample isolation and preparation of cone pigments, structural studies on these pigments have lagged significantly behind those of bovine Rho. From resonance Raman spectroscopy it is known that the chromophore structure is similar between human green and red sensitive pigments, with both being similar to that of rhodopsin. 547 From FTIR spectroscopy it was concluded that the protein structure of monkey green and red pigments is clearly different from that of rhodopsin, and that hydrogen-bonding networks differ between green and red pigments. 145 It should be noted that in human red and green color vision pigments Glu181 (in bovine Rho) is replaced by a His residue that functions as a chloride binding site, 548,549 suggesting that participation of an anion is a prerequisite for seeing red light. Future structural and modeling studies will be needed to further elucidate the molecular mechanism(s) of color tuning and structural dynamics in cone pigments. 550 4.3.2 Bistable Rhodopsins and Photoisomerases The rhodopsins we have discussed so far are stable in the dark, but once retinal isomerization has occurred and metarhodopsin states are reached, the eventual decay into opsin and all-trans-retinal takes place. These rhodopsins are also called monostable rhodopsins, and their opsins must be regenerated with new 11-cis-retinal produced in the adjacent retinal pigment epithelium cells. 551 Vertebrates and invertebrates possess in addition to these monostable rhodopsins, a second type of rhodopsins, the bistable rhodopsins, 552 which mostly couple to the G protein, Gq, to initiate phosphoinositol signaling cascades. While this Gq signaling is typical for the majority of bistable rhodopsins including those of squid and octopus, other bistable rhodopsins have been shown to couple to other G proteins such as Go. 3,552 A characteristic of bistable rhodopsins is that the dark state and the active states are both thermally stable; i.e., RSB hydrolysis does not occur. 553 Furthermore, a secondary absorption of a photon is utilized to photoregenerate the dark state. 554 These opsins lack the conserved RSBH+ stabilizing counterion found in monostable rhodopsins (Glu113 on TM3 in bovine Rho). Instead, in squid rhodopsin the RSBH+ forms a hydrogen bond to Asn87 or Tyr111 side chains (the equivalent positions in bovine Rho are Gly89 on TM2 and Glu113 on TM3). 422 In other bistable opsins, position 113 is occupied by neutral amino acid residues such as Tyr, Phe, or Met, which can explain why in contrast to bovine Rho, RSBH+ deprotonation of the active state is not required. 555 The conserved Glu181 (bovine Rho numbering) may serve as the sole negatively charged residue near the RSBH+, which is, however, not close enough for direct interaction with the RSBH+. 422 Molecular evolutionary analysis implies that the counterion has been switched from Glu181 to Glu113 during the evolution of vertebrate opsins and that Glu181 serves as a counterion in bistable pigments. 3,552,556,557 Another distinctive feature of bistable pigments is that their opsins can bind all-trans-retinal to form the pigment. 3,552 Rhodopsin from the Japanese flying squid, T. pacificus is the only invertebrate opsin (and incidentally the only other GPCR purified from native source apart from bovine rhodopsin) to have its structure determined. This structural information has been instrumental in understanding its chromophore–protein interactions in the dark and bathorhodopsin states as well as in the artificial isorhodopsin pigment (containing 9-cis-retinal). 422,423,558 The crystal structures of dark state bovine Rho and squid rhodopsin exhibit similar features including the presence of the disulfide bridge at the extracellular side of the retinal pocket, a hydrophobic aromatic cage surrounding the retinal and the presence of the (D/E)RY (ionic lock) and NPxxY(x)5,6F motifs. 422,423 As in bovine Rho, the retinal is attached through a RSBH+ linkage to the conserved Lys residue on TM7, but retinal itself takes a more relaxed configuration in squid rhodopsin as opposed to the distortions observed in bovine Rho. 422 The structure for squid bathorhodopsin indicates that just as in the bovine Rho early intermediate states, there is little structural change except for a change in the twist of the retinal. 558 It has been postulated from protein structure and sequence analysis that the extended TM5 and TM6 observed in the squid rhodopsin structure may explain the selectivity of coupling to Gq proteins. 422,559 Squid rhodopsin contains an additional C-terminal domain that might be involved in G protein binding, but it was not structurally characterized as it was necessary to proteolytically remove this domain comprising the last 90 amino acids for crystallization. 422,423 Another reason for the functional difference may be the extent of TM6 movement upon photoactivation of bistable rhodopsins as proposed from site-directed fluorescence labeling measurements. A comparison of bovine Rho and parapinopsin, a bistable nonvisual Gt-coupled rhodopsin, revealed much smaller light-induced TM5 and TM6 movement for parapinopsin which correlates with its reduced capability to activate G protein. 560 Of further interest among Gq-coupled bistable rhodopsins is melanopsin, which has been identified in various vertebrates. 552,561 In mammals, melanopsin is localized in intrinsically photosensitive retinal ganglion cells and is involved in nonvisual functions, including photoentrainment of the circadian clock, pupillary light reflex, and sleep. 4 Another, more divergent grouping of opsins are the photoisomerases, which function to bind all-trans retinoids and photoisomerize them to their 11-cis-forms. 410 Structural information remains elusive for these opsins. The best characterized of these photoisomerases are retinochrome from molluscan species and the mammalian retinal G-protein-coupled receptor (RGR). 97,562,563 In mollusks, retinochrome functions to provide the 11-cis-retinal to newly synthesized apoprotein opsin, whereas in mammals, RGR appears to play more of a regulatory role in the mobilization of all-trans-retinyl esters into the retinoid cycle. 562,564 These photoisomerases lack the NPxxY(x)5,6Y motif and thus may be deficient in G protein coupling. 3 A close relative is peropsin from mammalian retinal pigment epithelium (RPE) cells. Peropsin shows photoisomerase activity, but contains the (D/E)RY and NPxxY(x)5,6Y motifs and therefore may couple to G protein and be involved in activation of signaling cascades. 565 5 Conclusions and Perspective In the past, the impact of research on both microbial and animal rhodopsins propagated far beyond the boundaries of the retinal-binding protein field. It will suffice to give just a few of the most striking examples. Structural work on BR and Rho has greatly contributed to our understanding of the structural principles of membrane proteins in general. The detailed understanding of the proton transport mechanism by BR has been extremely useful to the bioenergetics community, who extended these principles to important systems such as cytochrome oxidases and ATPases. The great structural and mechanistic advances in the understanding of visual signal transduction significantly enriched the GPCR field, and for many years bovine Rho served (and continues to serve) as a model GPCR. More recently, the discovery of proteorhodopsins gave a strong push to the field of metagenomics, while the discovery of channelrhodopsins gave birth to optogenetics. Finally, rhodopsins have been serving as testing grounds for many cutting-edge biophysical techniques, aiding, for example, in the development of time-resolved and low-temperature FTIR spectroscopy, ultrafast spectroscopy, advanced Raman techniques, new methods for 2-D and 3-D crystallization of membrane proteins, protein solid-state NMR, and high-field EPR, including site-directed spin-labeling techniques. But what are the next exciting steps in rhodopsin research? Expanding on these recent trends, we can predict many more interesting developments without being too speculative. Judging from the large number of new rhodopsin variants unearthed by genomic and metagenomic sequencing, new interesting functions of retinal proteins will continue to be discovered. This applies to both microbial and animal rhodopsins, and will lead to new breakthroughs in understanding microbial, invertebrate, and vertebrate physiology and evolution. These new functional variants of rhodopsins may also find use in optogenetics, enriching its arsenal of tools. New advanced techniques of structural biology and biophysics will be applied to these new rhodopsins leading to new insights, capitalizing on such emerging methods as, for example, high-speed AFM, ultrafast time-resolved crystallography, structural mass spectroscopy, and dynamic nuclear polarization NMR. From the point of view of structural biology, the next challenging frontier in the field will be to understand the mechanisms of rhodopsin–protein interactions. While structural methods have enjoyed great success in determining structures of isolated rhodopsins and their binding partners, structures of inactive and activated protein complexes, especially those of membrane and soluble proteins, remain elusive. This is especially important for visual rhodopsins with their multiple interacting partners, and bears on the entirety of the GPCR field, contributing to a better understanding of GPCR signaling cascade, activation, and regulation.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

              Introduction Despite many efforts to classify and organize proteins [1–6] from both structural and functional perspectives, we are far from a clear understanding of the size and diversity of the protein universe [7–9]. Environmental shotgun sequencing projects, in which genetic sequences are sampled from communities of microorganisms [10–14], are poised to make a dramatic impact on our understanding of proteins and protein families. These studies are not limited to culturable organisms, and there are no selection biases for protein classes or organisms. These studies typically provide a gene-centric (as opposed to an organism-centric) view of the environment and allow the examination of questions related to protein family evolution and diversity. The protein predictions from some of these studies are characterized both by their sheer number and diversity. For instance, the recent Sargasso Sea study [10] resulted in 1.2 million protein predictions and identified new subfamilies for several known protein families. Protein exploration starts by clustering proteins into groups or families of evolutionarily related sequences. The notion of a protein family, while biologically very relevant, is hard to realize precisely in mathematical terms, thereby making the large-scale computational clustering and classification problem nontrivial. Techniques for these problems typically rely on sequence similarity to group sequences. Proteins can be grouped into families based on the highly conserved structural units, called domains, that they contain [15,16]. Alternatively, proteins are grouped into families based on their full sequence [17,18]. Many of these classifications, together with various expert-curated databases [19] such as Swiss-Prot [20], Pfam [15,21], and TIGRFAM [22,23], or integrated efforts such as Uniprot [24] and InterPro [25], provide rich resources for protein annotation. However, a vast number of protein predictions remain unclassified both in terms of structure and function. Given varying rates of evolution, there is unlikely to be a single similarity threshold or even a small set of thresholds that can be used to define every protein family in nature. Consequently, estimates of the number of families that exist in nature vary considerably based on the different thresholds used and assumptions made in the classification process [26–29]. In this study, we explored proteins using a comprehensive dataset of publicly available sequences together with environmental sequence data generated by the Sorcerer II Global Ocean Sampling (GOS) expedition [30]. We used a novel clustering technique based on full-length sequence similarity both to predict proteins and to group related sequences. The goals were to understand the rate of discovery of protein families with the increasing number of protein predictions, explore novel families, and assess the impact of the environmental sequences from the expedition on known proteins and protein families. We used hidden Markov model (HMM) profiling to examine the relative biases in protein domain distributions in the GOS data and existing protein databases. This profiling was also used to assess the impact of the GOS data on target selection for protein structure characterization efforts. We carried out in-depth analyses on several protein families to validate our clustering approach and to understand the diversity and evolutionary information that the GOS data added; the families included ultraviolet (UV) irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO. Results/Discussion Data Generation, Sequence Clustering, and HMM Profiling We used the following publicly available datasets in this study (Table 1)—the National Center for Biotechnology Information (NCBI)'s nonredundant protein database (NCBI-nr) [31,32], NCBI Prokaryotic Genomes (PG) [31,33], TIGR Gene Indices (TGI-EST) [34], and Ensembl (ENS) [35,36]. The rationale for including these datasets is discussed in Materials and Methods. All datasets were downloaded on February 10, 2005. None of the above-mentioned databases contained sequences from the Sargasso Sea study [10], the largest environmental survey to date, and so we pooled reads from the Sargasso Sea study with the reads from the Sorcerer II GOS expedition [30], creating a combined set that we call the GOS dataset. The GOS dataset was assembled using the Celera Assembler [37] as described in [30] (see Materials and Methods). The GOS dataset was primarily generated from the 0.1 μm to 0.8 μm size filters and thus is expected to be mostly microbial [30]. The data also included a small set of sequences from a viral size ( 200). While the 17,067 medium- and large-sized clusters constitute only 6% of the total number of clusters, they account for 85% of all the sequences that are clustered (Table 3). Many of the largest clusters correspond to families that have functionally diversified and expanded (Table 4). While some large families, such as the HIV envelope glycoprotein family and the immunoglobulins, also reflect biases in sequence databases, many more, including ABC transporters, kinases, and short-chain dehydrogenases, reflect their expected abundance in nature. Rate of Discovery of Protein Families We examined the rate of discovery of protein families using our clustering method to determine whether our sampling of the protein universe is reaching saturation. We find that for the present number of sequences there is an approximately linear trend in the rate of discovery of clusters with the addition of new (i.e., nonredundant) sequences (Figure 2). Moreover, the observed distribution of cluster sizes is well approximated by a power law [42,43], and this observed power law can be used to predict the rate of growth of the number of clusters of a given size (see Materials and Methods). This rate is dependent on the value of the power law exponent and decreases with increasing cluster sizes. We find good agreement between the observed and predicted growth rates for different cluster sizes. The approximately linear relationship between the number of clusters and the number of protein sequences indicates that there are likely many more protein families (either novel or subfamilies distantly related to known families) remaining to be discovered. GOS versus Known Prokaryotic versus Known Nonprokaryotic We also examined the GOS coverage of known proteins and protein families. Based on the cell-size filtering performed while collecting the GOS samples, we expected that the sample would predominantly be a size-limited subset of prokaryotic organisms [30]. We studied the content of the 17,067 medium- and large-sized clusters across three groupings: (1) GOS, (2) known prokaryotic (PG together with bacterial and archaeal portions of NCBI-nr), and (3) known nonprokaryotic (TGI-EST and ENS together with viral and eukaryotic portions of NCBI-nr). The Venn diagram in Figure 3 shows the breakdown of these clusters by content (see Materials and Methods). The largest section contains GOS-only clusters (23.40%) emphasizing the significant novelty provided by the GOS data. The next section consists of clusters containing sequences from only the known nonprokaryotic grouping (20.78%), followed closely by the section containing clusters with sequences from all three groupings (20.23%). The large known nonprokaryotic–only grouping shows that our current GOS sampling methodology will not cover all protein families, and perhaps misses some protein families that are exclusive to higher eukaryotes. The large section of clusters that include all three groupings indicates a large core of well-conserved protein families across all domains of life. In contrast, the known prokaryotic protein families are almost entirely covered by the GOS data. Novelty Added by GOS Data There are 3,995 medium and large clusters that contain only sequences from the GOS dataset. Some are divergent members of known families that failed to be merged by the clustering parameters used, or are too divergent to be detected by any current homology detection methods. The remaining clusters are completely novel families. In exploring the 3,995 GOS-only clusters, 44.9% of them contain sequences that have HMM matches, or BLAST matches to sequences in a more recent snapshot of NCBI-nr (downloaded in August 2005) than was used in this study. The recent NCBI-nr matches include phage sequences from cyanophages (P-SSM2 and P-SSM4) [44] and sequences from the SAR-11 genome (Candidatus pelagibacter ubique HTCC1062) [45]. We used profile–profile searches [39] to show that an additional 12.5% of the GOS-only clusters can be linked to profiles built from Protein Data Bank (PDB), COG, or Pfam. The 2,295 clusters with detected homology are referred to as Group I clusters. The remaining 1,700 (42.6%) GOS-only clusters with no detectable homology to known families are labeled as Group II clusters. We applied a guilt-by-association operon method to annotate the GOS-only clusters with a strategy that did not rely on direct sequence homology to known families. Function was inferred for the GOS-only clusters by examining their same-strand neighbors on the assembly (see Materials and Methods). Similar strategies have been successfully used to infer protein function in finished microbial genomes [46–48]. Despite minimal assembly of GOS reads, many scaffolds and mini-scaffolds contain at least partial fragments of more than one predicted ORF, thereby making this approach feasible. For 90 (5.3%) of the Group II clusters, and for 214 (9.3%) of the Group I clusters, at least one Gene Ontology (GO) [49] biological process term at p-value ≤0.05 can be inferred. The inferred functions and neighbors of some of these GOS-only clusters are highlighted in Table 5. We observed that for Group I clusters, the neighbor-inferred function is often bolstered by some information from weak homology to known sequences. While neighboring clusters as a whole are of diverse function, a number of GOS-only clusters seem to be next to clusters implicated in photosynthesis or electron transport. These GOS-only clusters could be of viral origin, as cyanophage genomes contain and express some photosynthetic genes that appear to be derived from their hosts [44,50,51]. In support of these observations, we identified five photosynthesis-related clusters containing hundreds to thousands of viral sequences, including psbA, psbD, petE, SpeD, and hli in the GOS data; furthermore, our nearest-neighbor analysis of these sequences reveals the presence of multiple viral proteins (unpublished data). Although the majority of GOS-only sequences are bacterial, a higher than expected proportion of the GOS-only clusters are predicted to be of viral origin, implying that viral sequences and families are poorly explored relative to other microbes. To assign a kingdom to the GOS-only clusters, we first inferred the kingdom of neighboring sequences based on the taxonomy of the top four BLAST matches to the NCBI-nr database (see Materials and Methods). A possible kingdom was assigned to the GOS-only cluster if more than 50% of assignable neighboring sequences belong to the same kingdom. Viewed in this way, 11.8% of Group I clusters and 17.3% of Group II clusters with at least one kingdom-assigned neighbor have more than 50% viral neighbors (Figure 4). Only 3.3% and 3.4% of random samples of clusters with size distributions matching that of Group I and Group II clusters have more than 50% viral neighbors, while 7.7% of all clusters pass this criterion. A total of 547 GOS-only clusters contain sequences collected from the viral size fraction included in the GOS dataset. For these clusters, 38.9% of the Group I subset and 27.5% of the Group II subset with one or more kingdom-assigned neighbors would be inferred as viral, based on the conservative criteria of having more than 50% viral assignable neighbors. Several alternative kingdom assignment methods were tried (see Materials and Methods) and provide for a similar conclusion. The GOS-only clusters also tend to be more AT-rich than sequences from a random size-matched sample of clusters (35.9% ± 8% GC content for Group II clusters versus 49.5% ± 11% GC content for sample). Phage genomes with a Prochlorococcus host [44] are also AT rich (37% average GC content). Our analysis of the graph constructed based on inferred operon linkages between all clusters indicates that the GOS-only clusters may constitute large sets of cotranscribed genes (see Materials and Methods). The high proportion of potentially viral novel clusters observed here is reasonable, as 60%–80% of the ORFs in most finished marine phage genomes are not homologous to known protein sequences [52]. Viral metagenomics projects have reported an equally high fraction of novel ORFs [53], and a recent marine metagenomics project estimated that up to 21% of photic zone sequences could be of viral origin [51]. It has also been reported that 40% of ORFans (sequences that lack similarity to known proteins and predicted proteins) exist in close spatial proximity to each other in bacterial genomes, and this combined with proximity to integration signals has been used to suggest a viral horizontally transferred origin for many bacterial ORFans [54]. Others have noted a clustering of ORFans in genome islands and suggested they derive from a phage-related gene pool [55]. A recent analysis of genome islands from related Prochlorococcus found that phage-like genes and novel genes cohabit these dynamic areas of the genome [56]. In our GOS-only clusters, 37 of the 1,700 clusters with no detectable similarity (2.2%) have at least ten bacterial-classified and ten viral-classified neighboring ORFs. This is 6.2-fold higher than the rate seen for the size-matched sample of all clusters (six clusters, 0.35%). This would seem to add more support to a phage origin for at least some ORFans found in bacterial genomes. If a sizable portion of the novel families in the GOS data are in fact of viral origin, it suggests that we are far from fully exploring the molecular diversity of viruses, a conclusion echoed in previous studies of viral metagenomes [53,57,58]. In studies of bacterial genomes, discovery of new ORFans shows no sign of reaching saturation [59]. Coverage of many phage families in the GOS data may be low, given that there are inherent differences in the abundance of their presumed bacterial hosts. These GOS-only clusters were operationally defined as having at least 20 nonredundant sequences. Reducing this threshold to ten nonredundant sequences adds 7,241 additional clusters. Whether this vast diversity represents new families or is a reflection of the inability to detect distant homology will require structural and biochemical studies, as well as continued development of computational methods to identify remotely related sequences. Comparison of Domain Profiles in GOS and PG Datasets We used HMM profiling to address the question of which biochemical and biological functions are expanded or contracted in GOS compared to the largely terrestrial genomes in PG. Significant differences are seen in 68% of domains (4,722 out of the 6,975 domains that match either GOS or PG; p-value 20 different strains sequenced, 168 ORFans are identified in strain CFT073, and 67 of them have GOS matches. The only eukaryotic organism in this list is Candida albicans SC5314, a fungal human pathogen, which has 49 ORFans with GOS matches. We examined a small but interesting subset of the ORFans that have 3-D structures deposited in PDB. Out of 65 PDB ORFans, GOS matches for eight of them are found (see Supporting Information for their PDB identifiers and names). They include four restriction endonucleases, three hypothetical proteins, and a glucosyltransferase. GOS sequences can play an important role in identifying the functions of existing ORFans or in confirming protein predictions. For example, we found that the hypothetical protein AF1548, which is a PDB ORFan, has matches to 16 GOS sequences. A PSI-BLAST search with AF1548 as the query against a combined set of GOS and NCBI-nr identified several significant restriction endonucleases after three iterations. With the support of 3-D structure and multiple sequence alignment of AF1548 and its GOS matches, we predict that AF1548 along with its GOS homologs are restriction endonucleases (Figure 10). When combined with an established consensus of active sites of the related endonucleases families [118], we predicted three catalytic residues. Genome Sequencing Projects and Protein Exploration With respect to protein exploration and novel family discovery, microbial sequencing offers more promise compared to sequencing more mammalian genomes. This is illustrated by Figure 11 , where the number of clusters that protein predictions from various finished mammalian genomes fall into was compared to the number of clusters that similar-sized random subsets of microbial sequences fall into (see Materials and Methods). As the figure shows, the rate of protein family discovery is higher for microbes than for mammals. Indeed, the rate of new family discovery is plateauing for mammalian sequences. This is not surprising, as mammalian divergence from a common ancestor is much more recent than microbial divergence from a common ancestor, which suggests that mammals will share a larger core set of less-diverged proteins. Microbial sequencing is also more cost effective than mammalian sequencing for acquiring protein sequences because microbial protein density is typically 80%–90% versus 1%–2% for mammals. This could be addressed with mammalian mRNA sequencing, but issues with acquiring rarely expressed mRNAs would need to be considered. There are, of course, other reasons to sequence mammalian genomes, such as understanding mammalian evolution and mammalian gene regulation. Conclusions The rate of protein family discovery is approximately linear in the (current) number of protein sequences. Additional sequencing, especially of microbial environments, is expected to reveal many more protein families and subfamilies. The potential for discovering new protein families is also supported by the GOS diversity seen at the nucleotide level across the different sampling sites [30]. Averaged over the sites, 14% of the GOS sequence reads from a site are unique (at 70% nucleotide identity) to that site [30]. The GOS data provides almost complete coverage of known prokaryotic protein families. In addition, it adds a great deal of diversity to many known families and offers new insights into the evolution of these families. This is illustrated using several protein families, including UV damage–repair enzymes, phosphatases, proteases, glutamine synthetase, RuBisCO, RecA (unpublished data), and kinases [77]. Only a handful of protein families have been examined thus far, and many thousands more remain to be explored. The protein analysis presented indicates that we are far from exploring the diversity of viruses. This is reflected in several of the analyses. The GOS-only clusters show an overrepresentation of sequences of viral origin. In addition, our domain analysis using HMM profiling shows a lower Pfam coverage of the GOS sequences in the viral kingdom compared to the other kingdoms. At least two of the protein families we explored in detail (UV repair enzymes and glutamine synthetase) contain abundant new viral additions. The extraordinary diversity of viruses in a variety of environmental settings is only now beginning to be understood [57,119–121]. A separate analysis of GOS microbial and viral sequences (unpublished data) shows that multiple viral protein clusters contain significant numbers of host-derived proteins, suggesting that viral acquisition of host genes is quite widespread in the oceans. Data generated by this GOS study and similar environmental shotgun sequencing studies present their own analysis challenges. Methods for various analyses (e.g., sequence alignment, profile construction, phylogeny inference, etc.) are generally designed and optimized to work with full sequences. They have to be tailored to analyze the mostly fragmentary sequences that are generated by these projects. Nevertheless, these data are a valuable source of new discoveries. These data have the potential to refine old hypotheses and make new observations about proteins and their evolution. Our preliminary exploration of the GOS data identified novel protein families and also showed that many ORFan sequences from current databases have homologs in these data. The diversity added by GOS data to protein families also allows for the building of better profile models and thereby improves remote homology detection. The discovery of kingdom-crossing protein families that were previously thought to be kingdom-specific presents evidence that the GOS project has excavated proteins of more ancient lineage than that previously known, or that have undergone lateral gene transfer. This is another example of how metagenomics studies are changing our understanding of protein sequences, their evolution, and their distribution across the various forms of life and environments. Biases in the currently published databases due to oversampling of some proteins or organisms are illuminated by environmental surveys that lack such biases. Such knowledge can help us make better predictions of the real distribution patterns of proteins in the natural world and indicate where increased sampling would be likely to uncover new families or family members of tremendous diversity (such as in the viral kingdom). These data have other significant implications for the fields of protein evolution and protein structure prediction. Having several hundreds or even tens of thousands of diverse proteins from a family or examples of a specific protein fold should provide new approaches for developing protein structure prediction models. Development of algorithms that consider the alignments of all these family members/protein folds and analyze how amino acid sequence can vary without significantly altering the tertiary structure or function may provide insights that can be used to develop new ab inito methods for predicting protein structures. These same datasets could also be used to begin to understand how a protein evolves a new function. Finally, this large database of amino acid sequence data could help to better understand and predict the molecular interactions between proteins. For example, they may be used to predict the protein–protein interactions so critical for the formation of specific functional complexes within cells. The GOS data also have implications for nearly all computational methods relying on sequence data. The increase in the number of known protein sequences presents challenges to many algorithms due to the increased volume of sequences. In most cases this increase in sequence data can be compensated for with additional CPU cycles, but it is also a foreshadowing of times to come as the pace of large-scale sequence-collecting accelerates. A related challenge is the increase in the diversity of protein families, with many new divergent clades present. With more protein similarity relationships falling into the twilight zone overlapping with random sequence similarity, the number of false positives for homology detection methods increases, making the true relationships more difficult to identify. Nevertheless, a deeper knowledge of protein sequence and family diversity introduces unprecedented opportunities to mine similarity relationships for clues on molecular function and molecular interactions as well as providing much expanded data for all methods utilizing homologous sequence information data. The GOS dataset has demonstrated the usefulness of large-scale environmental shotgun sequencing projects in exploring proteins. These projects offer an unbiased view of proteins and protein families in an environmental sample. However, it should be noted that the GOS data reported here are limited to mostly ocean surface microbes. Even with this targeted sampling a tremendous amount of diversity is added to known families, and there is evidence for a large number of novel families. Additional data from larger filter sizes (that will sample more eukaryotes) coupled with metagenomic studies of different environments like soil, air, deep sea, etc. will help to achieve the ultimate goal of a whole-earth catalog for proteins. Materials and Methods Data description. NCBI-nr [31,32] is the single largest publicly available protein resource and includes protein sequences submitted to SWISS-PROT (curated protein database) [122], PDB (a database of amino acid sequences with solved structures) [123], PIR (Protein Information Resource) [124], and PRF (Protein Research Foundation). In addition, NCBI-nr also contains protein predictions from DNA sequences from both finished and unfinished genomes in GenBank [125], EMBL [126], and DNA Databank of Japan (DDBJ) [127]. The nonredundancy in NCBI-nr is only to the level of distinct sequences, and any two sequences of the same length and content are merged into a single entry. NCBI-nr contains partial protein sequences and is not a fully curated database. Therefore it also contains contaminants in the form of sequences that are falsely predicted to be proteins. Expressed sequence tag (EST) databases also provide the potential to add a great deal of information to protein exploration and contain information that is not well represented in NCBI-nr. To this end, assemblies of EST sequences from the TIGR Gene Indices [34], an EST database, were included in this study. To minimize redundancy, only EST assemblies from those organisms for which the full genome is not yet known, were included. The protein predictions on metazoan genomes that are fully sequenced and annotated were obtained by including the Ensembl database [35,36] in this study. Both finished and unfinished sequences from prokaryotic genome projects submitted to NCBI were included. The protein predictions from the individual sequencing projects are submitted to NCBI-nr. Nevertheless, these genomes were included in this dataset both for the purpose of evaluating our approach and also for the purpose of identifying any proteins that were missed by the annotation process used in these projects. Thus, for this study the following publicly available datasets, all downloaded on February 10, 2005—NCBI-nr, PG, TGI-EST, and ENS—were used. The organisms in the PG set and the TGI-EST set are listed in Protocol S1. Assembly of the GOS dataset. Initial assembly (construction of “unitigs”) was performed so that only overlaps of at least 98% DNA sequence identity and no conflicts with other overlaps were accepted. False assemblies at this phase of the assembler are extremely rare, even in the presence of complex datasets [37,128]. Paired-end (also known as mate-pair) data were then used to order, orient, and merge unitigs into the final assemblies, but only when two mate pairs or a single mate pair and an overlap between unitigs implied the same layout. In one respect, mate pair data was used more aggressively than is typical in assembly of a single genome in that depth-of-coverage information was largely ignored [10]. This potentially allows chimeric assemblies through a repeat within a genome or through an ortholog between genomes. Thus, a conclusion that relies on the correctness of a single assembly involving multiple unitigs should be considered tentative until the assembly can be confirmed in some way. Assemblies involved in key results in this paper were subjected to expert manual review based on thickness of overlaps, presence of well-placed mate pairs across thin overlaps or across gaps between contigs, and consistency of depth of coverage. Data release and availability. All the GOS protein predictions will be submitted to GenBank. In addition, all the data supporting this paper, including the clustering and the various analyses, will be made publicly available via the CAMERA project (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; http://camera.calit2.net), which is funded by the Gordon and Betty Moore Foundation. All-against-all BLASTP search. We used two sets of computer resources. At the J. Craig Venter Institute, 125 dual 3.06-GHz Xeon processor systems with 2 Gb of memory per system were used. Each system had 80 GB local storage and was connected by GBit ethernet with storage area network (SAN) I/O of ~24 GBit/sec and network attached storage (NAS) I/O of ~16 GBit/sec. A total of 466,366 CPU hours was used on this system. In addition, access to the National Energy Research Scientific Computing Center (NERSC) Seaborg computer cluster was available, including 380 nodes each with sixteen 375-MHz Power3 processors. The systems had between 16 GB and 64 GB of memory. Only 128 nodes were used at a time. A total of 588,298 CPU hours was used on this system. The dataset of 28.6 million sequences was searched against itself in a half-matrix using NCBI BLAST [38] with the following parameters: -F “m L” -U T -p blastp -e 1 × 10−10 -z 3 × 109 -b 8000 -v 10. In this paper, similarity of an alignment is defined to be the fraction of aligned residues with a positive score according to the BLOSUM62 substitution matrix [129] used in the BLAST searches. Identification of nonredundant sequences. Given a set of sequences S and a threshold T, a nonredundant subset S′ of S was identified by first partitioning S (using the threshold T) and then picking a representative from each partition. The set of representatives constitutes the nonredundant set S′. The process was implemented using the following graph-theoretic approach. A directed graph G = (V, E) is constructed with vertex set V and edge set E. Each vertex in V represents a sequence from S. A directed edge (u,v) ∈ E if sequence u is longer than sequence v and their sequence comparison satisfies the threshold T; for sequences of identical length, the sequence with the lexicographically larger id is considered the longer of the two. Note that G does not have any cycles. Source vertices (i.e., vertices with no in-degree) are sorted in decreasing order of their out-degrees and (from largest out-degree to smallest) processed in this order. A source vertex u is processed as follows: mark all vertices that have not been seen before and are reachable from vertex u as being redundant and mark vertex u as their representative. We used two thresholds in this paper, 98% similarity and 100% identity. The former was used in the first stage of the clustering and the later was used in the HMM profile analysis. For the 98% similarity threshold, two sequences satisfy the threshold if the following three criteria are met: (1) similarity of the match is at least 98%; (2) at least 95% of the shorter sequence is covered by the match; and (3) (match score)/(self score of shorter sequence) ≥ 95%. For the 100% identity threshold, two sequences satisfy the threshold if their match identity is 100%. Description of the clustering algorithm. The starting point for the clustering was the set of pairwise sequence similarities identified using the all-against-all BLASTP compute. Because of both the volume and nature of the data, the clustering was carried out in four steps: redundancy removal, core set identification, core set merging, and final recruitment. A set of nonredundant sequences (at 98% similarity) was identified using the procedure given in Materials and Methods (Identification of nonredundant sequences). Only the nonredundant sequences were considered in further steps of the clustering process. The aim of the core set identification step was to identify core sets of highly related sequences. In graph-theoretic terms, this involves looking for dense subgraphs in a graph where the vertices correspond to sequences and an edge exists between two sequences if their sequence match satisfies some reasonable threshold (for instance, 40% similarity match over 80% of at least one sequence and are clearly homologous based on the BLAST threshold). Dense subgraphs were identified by using a heuristic. This approach utilizes long edges. These are edges where the match threshold is computed relative to the longer sequence. This was done to prevent, as much as possible, unrelated proteins from being put into the same core set. If all the sequences were full length, using long edges would have offered a good solution to keeping unrelated proteins apart. However, the situation here is complicated by the presence of a large amount of fragmentary sequence data of varying lengths. This was dealt with somewhat by working with rather stringent match thresholds and a two-stage process to identify the core sets. We used the concept of strict long edges and weak long edges. A strict long edge exists between two vertices (sequences) if their match has the following properties: (1) 90% of the longer sequence is involved in the match; (2) the match has 70% similarity; and (3) the score of the match is at least 60% of the self-score of the longer sequence. A weak long edge exists between two vertices (sequences) if their match has the following properties: (1) 80% of the longer sequence is involved in the match; (2) the match has 40% similarity; and (3) the score of the match is at least 30% of the self-score of the longer sequence. Core set identification had two substages: large core initialization and core extension. The large core initialization step identified sets of sequences where these sets were of a reasonable size and the sequences in them were very similar to each other. Furthermore, these sets could be extended in the core extension step by adding related sequences. In the large core initialization step, a directed graph G was constructed on the sequences using strict long edges, with each long edge being directed from the longer to the shorter sequence. For each vertex v in G, let S(v) denote the friends set of v consisting of v and all neighbors that v has an out-going edge to. Initially all the vertices in G are unmarked. Consider the set of all friends sets in the decreasing order of their size. For S(v) that is currently being considered, do the following: (1) initialize seed set A = S(v); (2) while there exists some v′ such that |S(v) ∩ S(v′)| ≥ k, set A = A ∪ S(v′). (Note: k = 10 is chosen); (3) output set A and mark all vertices in A; and (4) update all friends sets to contain only unmarked vertices. In the core extension step, we constructed a graph G using weak long edges. All vertices in seed sets (computed from the large core initialization step) were marked and the rest of the vertices unmarked. Each seed set was then greedily extended to be a core set by adding a currently unmarked vertex that has at least k neighbors (k = 10 is chosen) in the set; the added vertex was marked. After this process, a clique-finding heuristic was used to identify smaller cliques (of size at most k − 1) consisting of currently unmarked vertices; these were also extended to become core sets. A final step involved merging the computed core sets on the basis of weak edges connecting them. In the core set merging step, we constructed an FFAS (Fold and Function Assignment System) profile [39] for each core set using the longest sequence in the core set as query. FFAS was then used to carry out profile–profile comparisons in order to merge the core sets into larger sets of related sequences. Due to computational constraints imposed by the number of core sets, profiles were built on only core sets containing at least 20 sequences. Final recruitment involved constructing a PSI-BLAST profile [40] on core sets of size 20 or more (using the longest sequence in the core set as query) and then using PSI-BLAST (–z 1 × 109, –e 10) to recruit as yet unclustered sequences or small-sized clusters (size less than 20) to the larger core sets. For a sequence to be recruited, the sequence–profile match had to cover at least 60% of the length of the sequence with an E-value ≤ 1 × 10−7. In a final step, unclustered sequences were recruited to the clusters using their BLAST search results. A length-based threshold was used to determine if the sequence is to be recruited. Identification of clusters containing shadow ORFs. A well-known problem in predicting coding intervals for DNA sequences is shadow ORFs. The key requirement that coding intervals not contain in-frame stop codons requires that coding intervals be subintervals of ORFs. Long ORFs are therefore obvious candidates to be coding intervals. Unfortunately, the constraints on the coding interval to be an ORF often cause subintervals and overlapping intervals of the coding interval to also be ORFS in one of the five other reading frames (two on the same strand and three on the opposite strand). These coincidental ORFs are called shadow ORFs since they are found in the shadow of the coding ORF. In rare cases (and more frequently in certain viruses) coding intervals in different reading frames can overlap but usually only slightly. Overwhelmingly distinct coding intervals do not overlap. However, this constraint is not as strict for ORFs that contain a coding interval, as the exact extent of the coding interval is not known. Prokaryotes predominate in these data and are the focus of the ORF predictions. Their 3′ end of an ORF is very likely to be part of the coding interval because a stop codon is a clear signal for the termination of both the ORF and the coding interval (this signal could be obscured by frameshift errors in sequencing). The 5′ end is more problematic because the true start codon is not so easily identified and so the longest ORF with a reasonable start codon is chosen and this may extend the ORF beyond the true coding interval. For this reason different criteria were set for when ORFs have a significant overlap depending on the orientation (or the 5′ or 3′ ends) of the ORFs involved. Two ORFs on the same strand are considered overlapping if their intervals overlap by at least 100 bp. Two ORFs that are on the opposite strands are considered overlapping either if their intervals overlap by at least 50 bp and their 3′ ends are within each others intervals, or if their intervals overlap by at least 150 bp and the 5′ end of one is in the interval of the other. ORFs for coding intervals are clustered based on sequence similarity. In most cases this sequence similarity is due to the ORFs evolving from a common ancestral sequence. Due to functional constraints on the protein being coded for by the ORF, some sequence similarity is retained. There are no known explicit constraints on the shadow ORFs to constrain drift from the ancestral sequences. However, the shadow ORFs still tend to cluster together for some obvious reasons. The drift has not yet obliterated the similarity. There are implicit constraints due to the functional constraints on the overlapping coding ORF. There are also other possible unknown functional constraints beyond the coding ORF. At first it was surmised that within shadow ORF clusters the diversity should be higher than for the coding ORF, but this did not prove to be a reliable signal. The apparent problem is that the shadow ORFs tend to be fractured into more clusters due to the introduction of stop codons that are not constrained because the shadow ORFs are noncoding. What rapidly became apparent is that the most reliable signal that a cluster was made up of shadow ORFs is that the cluster was smaller than the coding cluster containing the ORFs overlapping the shadow ORFs. The basic rule for labeling a cluster as a shadow ORF cluster is that the size of the shadow ORF cluster is less than the size of another cluster that contained a significant proportion of the overlapping ORFs for the shadow ORF cluster. A specific set of rules was used to label shadow ORF clusters based on comparison to other clusters that contained ORFs overlapping ORFS in the shadow ORF cluster (called the overlapping cluster for this discussion). First, the overlapping cluster cannot be the same cluster as the shadow ORF cluster (there are sometimes overlapping ORFs within the same cluster due to frameshifts). Second, both the redundant and nonredundant sizes of the shadow ORF cluster must be smaller than the corresponding sizes of the overlapping cluster. Third, at least one-third of the shadow ORFs must have overlapping ORFs in the overlapping cluster. Fourth, less than one-half of the shadow ORFs are allowed to contain their overlapping ORFs (this test is rarely needed but did eliminate the vast majority of the very few obvious false positives that were found using these rules). Finally, the majority of the shadow ORFs that overlapped must overlap by more than half their length. When using this rule, 1,274,919 clusters were labeled as shadow ORF clusters, and 6,570,824 singletons were labeled as shadow ORFs. The rules need to be somewhat conservative so as not to eliminate coding clusters. To test these rules, clusters containing at least two NCBI-nr sequences were examined. Two sequences were used instead of one because occasional spurious shadow ORFs have been submitted to NCBI-nr. There were 989 shadow ORF clusters containing at least two NCBI-nr sequences and with more than one-tenth as many NCBI-nr sequences as the overlapping cluster. This was 0.86% of all clusters (114,331 in total) with at least two NCBI-nr sequences. Of these 989, a few were obvious mistakes, and the others involved very few NCBI-nr sequences of dubious curation, such as “hypothetical.” Just to be conservative, all of these 989 clusters were rescued and not labeled as shadow ORF clusters. Ka/Ks test to determine if sequences in a cluster are under selective pressure. For a cluster containing conserved but noncoding sequences, it is expected that there is no selection at the codon level. We checked this by computing the ratio of nonsynonymous to synonymous substitutions (Ka/Ks test) [130,131] on the DNA sequences from which the ORFs in the cluster were derived. For most proteins, Ka/Ks ≪ 1, and for proteins that are under strong positive selection, Ka/Ks ≫ 1. A Ka/Ks value close to 1 is an indication that sequences are under no selective pressure and hence are unlikely to encode proteins [134,135]. Weakly selected but legitimate coding sequences can have a Ka/Ks value close to 1. These were identified to some extent by using a model in which different partitions of the codons experience different levels of selective pressure. A cluster was rejected only if no partition was found to be under purifying selection at the amino acid level. The Ka/Ks test [130,131] was run only on those clusters (remaining after the shadow ORF filtering step) that did not contain sequences with HMM matches or have NCBI-nr sequences in them. Only the nonredundant sequences in a cluster were considered. Sequences in each of the clusters were aligned with MUSCLE [134]. For each cluster, a strongly aligning subset of sequences was selected for the Ka/Ks analysis. The codeml program from PAML [135,136] was run using model M0 to calculate an overall (i.e., branch- and position-independent) Ka/Ks value for the cluster. Clusters with Ka/Ks ≤ 0.5, indicating purifying selection and therefore very likely coding, were considered as passing the Ka/Ks filter. In addition, the remaining clusters were examined by running codeml with model M3. This partitioned the positions of the alignment into three classes that may be evolving differently (typically, a few positions may be under positive selection while the remainder of the sequence is conserved). A likelihood ratio test was applied to select clusters for which M3 explained the data significantly better than M0 [136]. If a cluster was thus selected, and if one of the resulting partitions had a Ka/Ks ≤ 0.5 and comprised at least 10% of the sequence, then that cluster was also considered as passing the Ka/Ks filter. All other clusters were marked as containing spurious ORFs. Statistics for the various stages of the clustering process The number of sequences that remain after redundancy removal (at 98% similarity) for each dataset is given in Table 12. Recall that the size of a cluster is the number of nonredundant sequences in it. Number of core sets of size two or more totals 1,586,454; number of nonredundant sequences in core sets of size two or more totals 8,337,256; and total number of sequences in core sets of size two or more is 12,797,641. Total number of clusters after profile merging and (PSI-BLAST and BLAST) recruitment is 1,871,434; number of clusters of size two or more totals 1,388,287; number of nonredundant sequences in clusters of size two or more totals 11,494,078; total number of sequences in clusters of size two or more is 16,565,015. The final clustering statistics (after shadow ORF detection and Ka/Ks tests) are as follows: number of clusters of size two or more totals 297,254; number of nonredundant sequences in clusters of size two or more totals 6,212,610; total number of sequences in clusters of size two or more is 9,978,637. In the final BLAST recruitment step, a pattern was seen involving highly compositionally biased sequences that recruited unrelated sequences to clusters. This was reflected in the pre- and post-BLAST recruitment numbers, where the postrecruitment sizes were more than three to four times the size of the prerecruitment numbers. There were 75 such clusters, and these were removed. Searching sequences using profile HMMs. The full set of 7,868 Pfam release 17 models was used, along with additional nonredundant profiles from TIGRFAM (1,720 of 2,443 profiles; version 4.1). HMM profiling was carried out using a TimeLogic DeCypher system (Active Motif, Inc., http://www.activemotif.com) and took 327 hours in total (on an eight-card machine). A sequence was considered as matching a Pfam (fragment model) if its sequence score was above the TC score for that Pfam and had an E-value ≤ 1 × 10−3. It was considered as matching a TIGRFAM if the match had an E-value ≤ 1 × 10−7. Evaluation of protein prediction via clustering. Our evaluation of protein prediction via the clustering shows a very favorable comparison to currently used protein prediction methods for prokaryotic genomes. We used the PG dataset for this evaluation (Table 2). Of the 3,049,695 PG ORFs, 575,729 sequences (19%) were clustered (the clustered set). Of the 614,100 predictions made by the genome projects, 600,911 sequences could be mapped to the PG ORF set (the submitted set); 93% of the unmapped sequences were 50% of their occurences is in one cluster. For the second evaluation, we selected all sequences with Pfam matches, and each sequence was assigned to the Pfam that matches it with the highest score. With this assignment, the Pfams induce a partition on the sequences. The distribution of the number of sequences in clusters induced by the Pfams was compared to those of clusters from the clustering method. Figure 12A shows comparison as a log–log plot of the number of sequences versus the number of clusters with at least that many sequences for the two cases. The plot shows that cluster size distributions are quite similar, with both the methods having an inflection point around 2,500. The difference between the two curves is that there are more big clusters (and also fewer small clusters) induced by the Pfams as compared to the clustering method. This can be explained by noting that two sequences that are in the same Pfam cluster can nevertheless be put into different clusters by the clustering method if they differ in their remaining portions. Our clustering also shows a good correspondence with HMM profiling on the phylogenetic markers that we looked at. The clustering identifies 7,423, 12,553, and 13,657 sequences, respectively, for RecA (cluster ID 1146), Hsp70 (cluster ID 197), and RpoB (cluster ID 1187). HMM profiling identifies 5,292, 12,298, and 12,165 sequences, respectively, for these families. For each of these families, there are at least 94% of sequences (relative to the smaller set) in common between clustering and HMM profiling. Difference in ratio of predicted proteins to total ORFs for the PG set and the GOS set. The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (0.3471) compared to the PG ORFs (0.1888). This can be explained by the fragmentary nature of the GOS data. For the large majority of the GOS data, the average sequence length is 920 bp compared to full-length genomes for the PG data. For the PG data, clustered ORFs have a mean length of 325 aa and a median length of 280 aa. Unclustered ORFs have a mean length of 119 aa and a median length of 87 aa. Assuming that the genomic GOS data has a similar underlying ORF structure to PG data, the effect that GOS fragmentation had on ORF lengths is estimated. Each reading frame will have a mixture of clustered and unclustered ORFs, but on average there will be 2 ORFs per reading frame per 920-bp GOS fragment, and both ORFs will be truncated. Assuming the truncation point for the ORF is uniformly distributed across the ORF, the truncated ORF will drop below the 60-aa threshold to be considered as an ORF with a probability of 60/(length of the ORF). Using the median length, the percentage of clustered ORFs dropping below the threshold due to truncation is 21%; for unclustered ORFs, it is 69%. Accounting for this truncation, the expected ratio of clustered ORFs to total ORFs for the GOS ORFs based on the PG ORFs would be 0.3708, which is very close to the observed value. Kingdom assignment strategy and its evaluation. We used several approaches to assign kingdoms for GOS sequences. They are all fundamentally based upon a strategy that takes into account top BLAST matches of a GOS sequence to sequences in NCBI-nr, and then voting on a majority. We evaluated a simple strict-majority voting scheme (of the top four BLAST matches) using the NCBI-nr set. First, the redundancy in NCBI-nr was removed using a two-staged process. A nonredundant set of NCBI-nr sequences was computed involving matches with 98% similarity over 95% of the length of the shorter sequence (using the procedure discussed in Materials and Methods [Identification of nonredundant sequences]). This set was made further nonredundant by considering matches involving 90% similarity over 95% of the length of the shorter sequence. The nonredundant sequences that remained after this step constituted the evaluation dataset S. For each sequence in S, its top four BLAST matches to other sequences in S (ignoring self-matches) were used to assign a kingdom for it (based on a strict majority rule). This predicted kingdom assignment for the sequence was compared to its actual kingdom. A correct classification is obtained for 93% of the sequences. The correct classification rate per kingdom is given in Table 13. While this evaluation shows that the BLAST-based voting scheme provides a reasonable handle on the kingdom assignment problem, there are caveats associated with it. The kingdom assignment for a set of query sequences is greatly influenced by the taxonomic groups from each kingdom that are represented in the reference dataset against which these queries are being compared. If certain taxa are only sparsely represented in the reference set, then, depending on their position in the tree of life, queries from these taxa can be misclassified (using a nearest-neighbor type approach based on BLAST matches). This explains why the archaeal classification rate is quite low compared to the others. Thus, the true classification rate for the GOS dataset based on this approach will also depend on the differences in taxonomic biases in the GOS dataset (query) and the NCBI-nr set (reference). The kingdom proportion for the GOS dataset reported in Figure 1 is based on a kingdom assignment of scaffolds. Those GOS ORFs with BLAST matches to NCBI-nr were considered, and the top-four majority rule was used to assign a kingdom to each of them. Using the ORF coordinates on the scaffold, the fraction (of bp) of a scafffold assigned to each kingdom was computed. The scaffold was labeled as belonging to a kingdom if the fraction of the scaffold assigned to that kingdom was >50%. All ORFs on this scaffold were then assigned to the same kingdom. Cluster size distribution, the power law, and the rate of protein family discovery. Earlier studies of protein family sizes in single organisms [137–139] have suggested that P(d), the frequency of protein families of size d, satisfies a power law: that is, P(d) ≈ d − β with exponent β reported between 2.68 and 4.02. Power laws have been used to model various biological systems, including protein–protein interaction networks and gene regulatory networks [42,43]. Figure 12B illustrates the distribution of the cluster sizes from our data on a log–log scale, a scale for which a power law distribution gives a line. In contrast to family size distributions reported in single organisms, the cluster sizes from our data are not well described by a single power law. Rather, there appear to be different power laws: one governs the size distribution of very large clusters, and another describes the rest. This behavior is observed both in the distribution of the core set sizes and also in the distribution of the final cluster sizes. We identified an inflection point for both the core set distribution and the final clusters at around size 2,500, and estimated the power law exponent β via linear regression separately in each size regime. For the core set distribution, the exponent β = 1.99 (R 2 = 0.994) for clusters of size ≤ 2,500, and β = 3.34 (R 2 = 0.996) for clusters of size > 2,500. For the final cluster sizes, the exponent β = 1.72 (R 2 = 0.995) for clusters of size ≤ 2,500, and β = 2.72 (R 2 = 0.995) for clusters of size > 2,500. The estimates for β are different for the core clusters compared to the final clusters, reflecting a larger number of medium and large clusters in the final clustering as a result of the cluster-merging and additional recruitment steps. A similar dichotomy between the size distributions of large and small protein families was observed in a study [140] of protein families contained in the ProDom, Protomap, and COG databases, where the exponent β reported was in the range of 1.83 to 1.98 for the 50 smallest clusters and 2.54 to 3.27 for the 500 largest clusters in these databases. Our clustering method was run separately on the following seven datasets: set 1 consisted of only NCBI-nr sequences; set 2 consisted of all sequences in NCBI-nr, ENS, TGI-EST, and PG; sets 3 through 6 consisted of set 2 in combination with a random subset of 20%, 40%, 60%, and 80% of the GOS sequences, respectively; set 7 consisted of set 2 in combination with all the GOS sequences. On each of the seven datasets, the redundancy removal (using the 98% similarity filter) was run, followed by the core set detection steps. Figure 2 shows the number of core sets of varying sizes (≥3, ≥5, ≥10, and ≥20) as a function of the number of nonredundant sequences for each dataset. The observed linear growth in number of families with increase in sample size n is related to the power law distribution in the following way. We model protein families as a graph where each vertex corresponds to a protein sequence and an edge between two vertices indicates sequence similarity between the corresponding proteins. Consider a clustering (partitioning) of the vertices of a graph with n vertices such that the cluster sizes obey a power law distribution. Let C d (n) [respectively, C≥d (n)] denote the number of clusters of size d (respectively, ≥d). Since the distribution of cluster sizes follows a power law, there exist constants α, β such that for all x ≤ n, Cx (n) = αx− β. As every vertex of the graph is a member of exactly one cluster, The number of clusters of size at least d is Combining the two equations, we obtain values (up to a multiplicative constant) for C ≥d (n) as shown in Table 14. In all cases with β > 1, the number of clusters C ≥d (n) increases as n increases, and as d decreases. Specifically, for β > 2, the growth is linear in n for all d, with slope decreasing as d increases. For 1 50% of their neighboring ORFs were assigned the same kingdom. The general trends observed are the same for each method, though the coverage decreases slightly for the more stringent methods. Characteristics and kingdom distribution of known protein domains. For these analyses we used the predicted proteins from the public (NCBI-nr, PG, TGI-EST, and ENS) and GOS datasets. The public dataset contains multiple identical copies of some sequences due to overlaps between the source datasets. For example, many sequences in PG are also found in NCBI-nr. We filtered the public set at 100% identity to avoid overcounting these sequences. Because this filtering was necessary for the public dataset, the GOS dataset was also filtered at 100% identity. If two or more sequences were 100% identical at the residue level, but were of different lengths, only the longest sequence was kept. The resulting datasets of nonredundant proteins are referred to as public-100 and GOS-100. We assigned each protein in public-100 to a kingdom based on the species annotations provided in the source datasets (NCBI-nr, Ensembl, TIGR, and PG). The NCBI taxonomy tree was used to determine the kingdom of each species. Of 3,167,979 protein sequences in public-100, 3,158,907 can be annotated by kingdom. The remaining 9,072 sequences are largely synthetic. Determining the kingdom of origin of an environmental sequence can be difficult; while an unambiguous assignment can be made for some sequences, others can be assigned only tentatively or not at all. Therefore, we took a probabilistic approach (kingdom-weighting method), calculating “weights” or probabilities that each protein sequence originated from a given kingdom. The top four BLAST matches (E-value 90% similarities were grouped. NCBI-nr90 was then clustered to NCBI-nr80/70/60/50 and finally NCBI-nr30. After each clustering stage, the total number of clusters of NCBI-nr was decreased and non-ORFans were identified. A one-step clustering from NCBI-nr directly to NCBI-nr30 can be performed. However, the multistep clustering is computationally more efficient. At the 30% similarity level, all the NCBI-nr proteins were grouped into 391,833 clusters, including 259,571 singleton clusters. The proteins in nonsingleton clusters are by definition non-ORFans. However, proteins that remain as singletons are not necessarily ORFans, because their similarity to other proteins may not be reported for two reasons: (1) significant sequence similarity can be <30%; and (2) in order to prevent a cluster from being too diverse, CD-HIT, like all other clustering algorithms, may not add a sequence to that cluster even if the similarity between this sequence and a sequence in that cluster meet the similarity threshold. The 259,571 singletons were compared to NCBI-nr with BLASTP [38] to identify real ORFans from them. The default low-complexity filter was enabled in the BLAST comparisons, and similarity threshold in the form of an E-value was set to 1 × 10−6. In the end, 84,911 proteins with at least 100 aa are identified as ORFans. About 100,000 short ORFans less than 100 aa were removed from this study, because they may not be real proteins. Genome sequencing projects and rate of discovery. We used Ensembl sequences for Homo sapiens, Mus musculus, Rattus norvegicus, Canis familiaris, and Pan troglodytes. Their clustering information is shown in Table 15. When we considered the datasets in the order HS, HS + MM, HS + MM + RN, HS + MM + RN + CF, and HS + MM + RN + CF + PT, the numbers of distinct clusters were 10,536, 12,731, 13,605, 14,606, and 14,993, respectively. These numbers were compared against a random subset of NCBI-nr bacterial sequences (of a similar size) and also against a random subset of GOS sequences. We also randomized the order of the mammalian sequences to produce a dataset that was independent of the genome order being considered. Supporting Information Protocol S1 Supplementary Information (25 KB DOC) Click here for additional data file. Accession Numbers All NCBI-nr sequences from February 10, 2005 were used in our analysis. Protocol S1 lists the GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession numbers of (1) the genomic sequences used in the PG set, (2) the sequences used in building GS profiles, and (3) the NCBI-nr sequences used in building the IDO phylogeny. The other GenBank sequences discussed in this paper are Bacillus sp. NRRL B-14911 (89089741), Janibacter sp. HTCC2649 (84385106), Erythrobacter litoralis (84785911), and Nitrosococcus oceani (76881875). The Pfam (http://pfam.cgb.ki.se) structures discussed in this paper are envelope glycoprotein GP120 (PF00516), reverse transcriptase (PF00078), retroviral aspartyl protease (PF00077), bacteriophage T4-like capsid assembly protein (Gp20) (PF07230), major capsid protein Gp23 (PF07068), phage tail sheath protein (PF04984), IDO (PF01231), poxvirus A22 protein family (PF04848), and PP2C (PF00481). The glutamine synthetase TIGRFAM (http://www.tigr.org/TIGRFAMs) used in the paper is GlnA: glutamine synthetase, type I (TIGR00653). The PDB (http://www.rcsb.org/pdb) identifiers and the names of the eight PDB ORFans with GOS matches are: restriction endonuclease MunI (1D02), restriction endonuclease BglI (1DMU), restriction endonuclease BstYI (1SDO), restriction endonuclease HincII (1TX3); alpha-glucosyltransferase (1Y8Z), hypothetical protein PA1492 (1T1J), putative protein (1T6T), and hypothetical protein AF1548 (1Y88).
                Bookmark

                Author and article information

                Journal
                Proc Natl Acad Sci U S A
                Proc. Natl. Acad. Sci. U.S.A
                pnas
                pnas
                PNAS
                Proceedings of the National Academy of Sciences of the United States of America
                National Academy of Sciences
                0027-8424
                1091-6490
                8 October 2019
                23 September 2019
                23 September 2019
                : 116
                : 41
                : 20574-20583
                Affiliations
                [1] aMonterey Bay Aquarium Research Institute , Moss Landing, CA 95039;
                [2] bAtmosphere & Ocean Research Institute, University of Tokyo , Chiba 277-8564, Japan;
                [3] cLaboratory for Protein Functional & Structural Biology, RIKEN Center for Biosystems Dynamics Research , Yokohama, Kanagawa 230-0045, Japan;
                [4] dOcean EcoSystems Biology Unit, GEOMAR Helmholtz Centre for Ocean Research , 24105 Kiel, Germany;
                [5] eDepartment of Botany, University of British Columbia , Vancouver, BC V6T 1Z4, Canada;
                [6] fGraduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama University , Okayama 700-8530, Japan;
                [7] gLiving Systems Institute, School of Biosciences, College of Life and Environmental Sciences, University of Exeter , Exeter EX4 4SB, United Kingdom;
                [8] hDepartment of Energy Joint Genome Institute , Walnut Creek, CA 94598;
                [9] iDaniel K. Inouye Center for Microbial Oceanography, University of Hawaii, Manoa , Honolulu, HI 96822;
                [10] jDepartment of Ecology, Evolution and Marine Biology, University of California, Santa Barbara , CA 93106;
                [11] kDepartment of Biological Sciences, Graduate School of Science, University of Tokyo , Tokyo 113-0032, Japan
                Author notes
                4To whom correspondence may be addressed. Email: iwasaki@ 123456bs.s.u-tokyo.ac.jp or azworden@ 123456geomar.de .

                Edited by W. Ford Doolittle, Dalhousie University, Halifax, Canada, and approved August 8, 2019 (received for review May 27, 2019)

                Author contributions: D.M.N., S.Y., E.F.D., M.S., W.I., and A.Z.W. designed research; D.M.N., S.Y., T.H., C.P., S.W., R.K., Y.N., K.K., T.K.-S., R.R.M., D.R.M., D.K.O., Y.S., S.S., T.A.R., E.F.D., P.J.K., A.E.S., W.I., and A.Z.W. performed research; D.M.N., S.Y., T.H., C.P., C.J.C., E.H., N.A.T.I., C.-M.Y., C.B., G.L., T.A.R., E.F.D., P.J.K., A.E.S., M.S., and A.Z.W. analyzed data; and D.M.N., S.Y., W.I., and A.Z.W. wrote the paper.

                1D.M.N., S.Y., and T.H. contributed equally to this work.

                2Present address: Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam 1090 GE, The Netherlands.

                Author information
                http://orcid.org/0000-0001-9667-6010
                http://orcid.org/0000-0002-7652-8114
                http://orcid.org/0000-0001-7810-1336
                http://orcid.org/0000-0003-2232-1801
                http://orcid.org/0000-0001-8155-9356
                http://orcid.org/0000-0002-3088-4965
                http://orcid.org/0000-0003-2503-8219
                http://orcid.org/0000-0002-9169-9245
                http://orcid.org/0000-0002-9888-9324
                Article
                201907517
                10.1073/pnas.1907517116
                6789865
                31548428
                28e7d203-5afc-4573-9515-3b76d905ea20
                Copyright © 2019 the Author(s). Published by PNAS.

                This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

                History
                Page count
                Pages: 10
                Funding
                Funded by: Gordon and Betty Moore Foundation (Gordon E. and Betty I. Moore Foundation) 100000936
                Award ID: 3307
                Award Recipient : Thomas A Richards Award Recipient : Patrick J. Keeling Award Recipient : Alyson E Santoro Award Recipient : Alexandra Z. Worden
                Funded by: Gordon and Betty Moore Foundation (Gordon E. and Betty I. Moore Foundation) 100000936
                Award ID: 3788
                Award Recipient : Thomas A Richards Award Recipient : Patrick J. Keeling Award Recipient : Alyson E Santoro Award Recipient : Alexandra Z. Worden
                Categories
                PNAS Plus
                Biological Sciences
                Environmental Sciences
                PNAS Plus

                giant viruses,viral evolution,marine carbon cycle,single-cell genomics,host–virus interactions

                Comments

                Comment on this article