1 Introduction

1.1 Uncharacterized Protein Segments Are a Source of Functional Novelty

Over the past decade, we have observed a massive increase in the amount of information describing protein sequences from a variety of organisms. 1,2 While this may reflect the diversity in sequence space, and possibly also in function space, 3 a large proportion of the sequences lacks any useful function annotation. 4,5 Often these sequences are annotated as putative or hypothetical proteins, and for the majority their functions still remain unknown. 6,7 Suggestions about potential protein function, primarily molecular function, often come from computational analysis of their sequences. For instance, homology detection allows for the transfer of information from well-characterized protein segments to those with similar sequences that lack annotation of molecular function. 8−10 Other aspects of function, such as the biological processes proteins participate in, may come from genetic- and disease-association studies, expression and interaction network data, and comparative genomics approaches that investigate genomic context. 11−17 Characterization of unannotated and uncharacterized protein segments is expected to lead to the discovery of novel functions as well as provide important insights into existing biological processes. In addition, it is likely to shed new light on molecular mechanisms of diseases that are not yet fully understood. Thus, uncharacterized protein segments are likely to be a large source of functional novelty relevant for discovering new biology.

1.2 Structure–Function Paradigm Enhances Function Prediction

Traditionally, protein function has been viewed as critically dependent on the well-defined and folded three-dimensional structure of the polypeptide chain.
This classical structure–function paradigm (Figure 1; left panel) has mainly been based on concepts explaining the specificity of enzymes, and on structures of folded proteins that have been determined primarily using X-ray diffraction on protein crystals. The classical concept implies that protein sequence defines structure, which in turn determines function; that is, function can be inferred from the sequence and its structure. Even when protein sequences diverge during evolution, for example, after gene duplication, the overall fold of their structures remains roughly the same. Therefore, structural similarity between proteins can reveal distant evolutionary relationships that are not easily detectable using sequence-based methods. 18,19 Structural genomics efforts such as the Protein Structure Initiative (PSI) have been set up to enlarge the space of known protein folds and their functions, thereby complementing sequence-based methods in an attempt to fill the gap of sequences for which there is no function annotation. 20,21 Specifically, phase two of the PSI aimed to structurally characterize proteins and protein domains of unknown function, often providing the first hypothesis about their function and serving as a starting point for their further characterization.

1.3 Classification Further Facilitates Function Prediction

Classification schemes provide a guideline for systematic function assignment to proteins. Generally, proteins are made up of one or more domains that can have distinct molecular functions. These domains, which are referred to as structured domains, often fold independently, make precise tertiary contacts, and adopt a specific three-dimensional structure to carry out their function. The sequences that compose structured domains can be organized into families of homologous sequences, whose members are likely to share a common evolutionary relationship and molecular function.
The Pfam database classifies known protein sequences and contains almost 15 000 such families, for most of which there is some understanding about the function. 22 Nevertheless, Pfam also contains more than 3000 families annotated as domains of unknown function, or DUFs. 23 These families are largely made up of hypothetical proteins and await function annotation. Another powerful example of a protein classification scheme is the Structural Classification of Proteins (SCOP), which provides a means of grouping proteins with known structure together, based on their structural and evolutionary relationships. 24,25 SCOP utilizes a hierarchical classification consisting of four levels, (i) family, (ii) superfamily, (iii) fold, and (iv) class, with each level corresponding to different degrees of structural similarity and evolutionary relatedness between members. Using this scheme, function of newly solved structures or sequences can be inferred from their similarity with existing protein classes through structure or sequence comparisons, for instance, as available via the SUPERFAMILY database. 10 In this direction, another major initiative is Genome3D, which is a collaborative project to annotate genomic sequences with predicted 3D structures based on CATH 26 (Class, Architecture, Topology, Homology) and SCOP 24,25 domains to infer protein function. 27

1.4 Intrinsically Disordered Regions and Proteins

While many proteins need to adopt a well-defined structure to carry out their function, a large fraction of the proteome of any organism consists of polypeptide segments that are not likely to form a defined three-dimensional structure, but are nevertheless functional. 28−42 These protein segments are referred to as intrinsically disordered regions (IDRs; Figure 1; right panel).
43 Because IDRs generally lack bulky hydrophobic amino acids, they are unable to form the well-organized hydrophobic core that makes up a structured domain 31,44 and hence their functionality arises in a different manner as compared to the classical structure–function view of globular, structured proteins. In this framework, protein sequences in a genome can be viewed as modular because they are made up of combinations of structured and disordered regions (Figure 1; bottom panel). Proteins without IDRs are called structured proteins, and proteins with entirely disordered sequences that do not adopt any tertiary structure are referred to as intrinsically disordered proteins (IDPs). The majority of eukaryotic proteins are made up of both structured and disordered regions, and both are important for the repertoire of functions that a protein can have in a variety of cellular contexts. 43 Traditionally, IDRs were considered to be passive segments in protein sequences that “linked” structured domains. However, it is now well established that IDRs actively participate in diverse functions mediated by proteins. For instance, disordered regions are frequently subjected to post-translational modifications (PTMs) that increase the functional states in which a protein can exist in the cell. 45,46 In addition, they expose short linear peptide motifs of about 3–10 amino acids that permit interaction with structured domains in other proteins. 47,48 These two features in isolation or in combination permit the interaction and recruitment of diverse proteins in space and time, thereby facilitating regulation of virtually all cellular processes. 47 The prevalence of IDRs in any genome (see, for example, the D2P2 database, 49 Box 1) in combination with their unique characteristics means that these regions extend the classical view of the structure–function paradigm and hence that of protein function. 
Thus, functional regions in proteins can either be structured or disordered, and these need to be considered as two fundamental classes of functional building blocks of proteins. 50

Figure 1 Structured domains and intrinsically disordered regions (IDRs) are two fundamental classes of functional building blocks of proteins. The synergy between disordered regions and structured domains increases the functional versatility of proteins. Adapted with permission from ref (50). Copyright 2012 American Association for the Advancement of Science.

1.5 The Need for Classification of Intrinsically Disordered Regions and Proteins

IDRs and IDPs are prevalent in eukaryotic genomes. For instance, 44% of human protein-coding genes contain disordered segments of >30 amino acids in length 49 (similar data shown in Figure 2A). In the human genome, 6.4% of all protein-coding genes do not have any function annotation in their description in Ensembl 1 (Figure 2B). Further investigation using the D2P2 database of disorder in genomes 49 revealed that most of these genes with no function annotation encode at least some disorder (Figure 2B) and that genes with no annotation contain proportionally more IDRs (Figure 2C). Given the absence of structural constraints, IDRs tend to evolve more rapidly than protein domains that adopt defined structures. 51−56 As a result, identifying homologous regions is harder for IDRs and IDPs than it is for structured domains. This complicates the transfer of information about function between homologues and thus the prediction of function of IDRs and IDPs. Furthermore, much of protein annotation is based on information on sequence families and structured domains. However, less than one-half of all residues in the human proteome fall within such domains (Figure 3). Not only do most residues of human proteins fall outside domains, a large fraction of these residues are also disordered (Figure 3A and B, right bars).
Moreover, although it is expected that SUPERFAMILY domains based on known protein structures have very little disorder (Figure 3A, left bar), Pfam domains based on sequence clustering do not contain much more (Figure 3B, left bar). These observations suggest that there is a large pool of protein segments that are not considered by conventional protein annotation methods, because the sequences of disordered regions are difficult to align, or because the methods do not explicitly consider disordered and nondomain regions of the protein sequence. Taken together, these considerations raise the need to devise a classification scheme specifically for disordered regions in proteins that may enhance the function prediction and annotation for this important class of protein segments.

Figure 2 The number of protein-coding genes in the human genome with various amounts of disorder. Histograms of the numbers of human genes with annotation (A) and without annotation (B), grouped by the percentage of disordered residues. (C) A comparison of the fraction of annotated and unannotated human genes with different amounts of disorder. Residues in each protein are defined as disordered when there is a consensus between >75% of the predictors in the D2P2 database 49 at that position. The set of human genes was taken from Ensembl release 63, 1 and the representative protein coded for by the longest transcript was used in each case. The annotation was taken from the description field with “open reading frame”, “hypothetical”, “uncharacterized”, and “putative protein” treated as no annotation.

Figure 3 The fraction of disordered residues located in domains in human protein-coding genes: (A) residues inside (left) and outside (right) of SCOP domains, 24 and (B) residues inside (left) and outside (right) of Pfam domains (only curated Pfam domains were considered, i.e., Pfam-A). 22 The SCOP domains in human proteins are defined by the SUPERFAMILY database.
10 Disordered residues were taken from the D2P2 database 49 (when there is a consensus between >75% of the disorder predictors). The set of human genes was taken from Ensembl release 63. 1

In this Review, we synthesize and provide an overview of the various classifications of intrinsically disordered regions and proteins that have been put forward in the literature since the start of systematic studies into their function some 15 years ago. We discuss approaches based on function, functional elements, structure, sequence, protein interactions, evolution, regulation, and biophysical properties (Table 1). Finally, we discuss resources that are currently available for gaining insight into IDR function (Table 2), we suggest areas where increased efforts are likely to advance our understanding of the functions of protein disorder, and we speculate how combinations of multiple existing classification schemes could achieve high quality function prediction for IDRs, which should ultimately lead to improved function coverage and a deeper understanding of protein function.
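The two rules stated in the captions of Figures 2 and 3 (a residue counts as disordered when more than 75% of the predictors agree, and a gene counts as unannotated when its description contains one of four terms) can be illustrated with a minimal Python sketch. The function names and the input representation (one list of per-residue boolean calls per predictor) are assumptions made for illustration; they do not reproduce the actual D2P2 or Ensembl data formats or the pipeline used for the figures.

```python
from typing import List, Sequence

# Terms treated as "no annotation" in the Figure 2 caption.
NO_ANNOTATION_TERMS = (
    "open reading frame",
    "hypothetical",
    "uncharacterized",
    "putative protein",
)


def consensus_disorder(
    predictions: Sequence[Sequence[bool]], threshold: float = 0.75
) -> List[bool]:
    """Call a residue disordered when MORE than `threshold` of predictors agree.

    `predictions` holds one per-residue boolean list per predictor
    (hypothetical input format, assumed for this sketch).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictors must cover the same residues")
    n_predictors = len(predictions)
    return [
        sum(bool(p[i]) for p in predictions) / n_predictors > threshold
        for i in range(length)
    ]


def percent_disordered(mask: Sequence[bool]) -> float:
    """Percentage of residues flagged as disordered (0.0 for an empty mask)."""
    return 100.0 * sum(mask) / len(mask) if mask else 0.0


def is_unannotated(description: str) -> bool:
    """True when the description contains one of the 'no annotation' terms."""
    text = description.lower()
    return any(term in text for term in NO_ANNOTATION_TERMS)
```

With four predictors, a residue called disordered by exactly three of them (75%) is not counted, because the captions specify a consensus of more than 75%.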
Table 1 Classifications of Intrinsically Disordered Regions and Proteins

basis for classification | classes | description | examples

function (33,39,57,58) | • entropic chains | IDRs carrying out functions that benefit directly from their conformational disorder, e.g., flexible linkers and spacers | MAP2 projection domain, titin PEVK domain, RPA70, MDA5
 | • display sites | flexibility of IDRs facilitates exposure of motifs and easy access for proteins that introduce and read PTMs | p53, histone tails, p27, CREB kinase-inducible domain
 | • chaperones | their binding properties (many different partners, rapid association/disassociation, and folding upon binding) make IDPs suitable for chaperone functions | hnRNP A1, GroEL, α-crystallin, Hsp33
 | • effectors | folding upon binding mechanics allow effectors to modify the activity of their partner proteins | p21, p27, calpastatin, WASP GTPase-binding domain
 | • assemblers | assembling IDRs have large binding interfaces that scaffold multiple binding partners and promote the formation of higher-order protein complexes | ribosomal proteins L5, L7, L12, L20, Tcf 3/4, CREB transactivator domain, Axin
 | • scavengers | disordered scavengers store and neutralize small ligands | chromogranin A, Pro-rich glycoproteins, caseins and other SCPPs

functional features
linear motifs 47,125 | • structural modification | sites of conformational alteration of a peptide backbone | peptidylprolyl cis–trans isomerase Pin1 sites
 | • proteolytic cleavage | sites of post-translational processing events or proteolytic cleavage scission sites | Caspase-3/-7, separase, taspase1 scission sites
 | • PTM removal/addition | specific binding sequences that recruit enzymes catalyzing PTM moiety addition or removal | cyclin-dependent kinase phosphorylation site, SUMOylation site, N-glycosylation site
 | • complex promoting | motifs that mediate protein–protein interactions important for complex formation; often associated with signal transduction | proline-rich SH3-binding motif, cyclin box, pY SH2-binding motif, PDZ-binding motif, TRAF-binding motifs in MAVS
 | • docking | motifs that increase the specificity and efficiency of modification events by providing an additional binding surface | KEN box degron, MAPK docking sites
 | • targeting or trafficking | signal sites that localize proteins within particular subcellular organelles or act to traffic proteins | nuclear localization signal, clathrin box motif, endocytosis adaptor trafficking motifs
molecular recognition features (MoRFs) 121 | • alpha | disordered motifs that form α-helices upon target binding | p53 ∼ Mdm2, p53 ∼ RPA70, p53 ∼ S100B(ββ), RNase E ∼ enolase, inhibitor IA3 ∼ proteinase A
 | • beta | disordered motifs that form β-strands upon target binding | RNase E ∼ polynucleotide phosphorylase, Grim ∼ DIAP1, pVIc ∼ adenovirus 2 proteinase
 | • iota | disordered motifs that form irregular secondary structure upon target binding | p53 ∼ Cdk2-cyclin A, amphiphysin ∼ α-adaptin C
 | • complex | disordered motifs that contain combinations of different types of secondary structure upon target binding | amyloid β A4 ∼ X11, WASP ∼ Cdc42
intrinsically disordered domains (IDDs) 158,159 | | some protein domains identified using sequence-based approaches are fully or largely disordered | WH2, RPEL, BH3, KID domains
co-occurrence of protein domains with disordered regions 161,162 | | particular disordered regions frequently co-occur in the same sequence with specific protein domains |

structure
structural continuum 37 | | proteins function within a continuum of differently disordered conformations, extending from fully structured to completely disordered, with everything in between and no strict boundaries between the states |
protein quartet 32,34,166 | • intrinsic coil | flexible regions of extended conformation with hardly any secondary structure; high net charge differentiates these from disordered globules | ribosomal proteins L22, L27, 30S, S19, prothymosin α
 | • pre-molten globule | disordered protein regions with residual secondary structure, often poised for folding upon binding events; lower net charge makes them more compact than coils | Max, ribosomal proteins S12, S18, L23, L32, calsequestrin
 | • molten globule | globally collapsed conformation with regions of fluctuating secondary structure | nuclear coactivator binding domain of CREB binding protein
 | • folded | structured proteins with a defined three-dimensional structure | most enzymes, transmembrane domains, hemoglobin, actin

sequence
sequence–structural ensemble relationships 166,204 | • polar tracts | sequence stretches enriched in polar amino acids often form globules that are generally devoid of significant secondary structure preferences | Asn- and Gly-rich sequences, Gln-rich linkers in transcription factors and RNA-binding proteins
 | • polyelectrolytes | amino acid compositions biased toward charged residues of one type; strong polyelectrolytes (high net charge) form expanded coils | Arg-rich protamines, Glu/Asp-rich prothymosin α
 | • polyampholytes | sequences with roughly equal numbers of positive and negative charges; conformations of polyampholytes are governed by the linear distribution of oppositely charged residues, with segregation of opposite charges leading to globules, while well-mixed charged sequences adopt random-coil or globular conformations, depending on the total charge | RNA chaperones, splicing factors, titin PEVK domain, yeast prion Sup35
prediction flavors 205 | • V | predicted best by the VL-2V predictor, for which the hydrophobic amino acids are the most influential attributes | E. coli ribosomal proteins
 | • C | VL-2C is the best predictor for flavor C, which has more histidine, methionine, and alanine residues than the other flavors | poly- and oligosaccharide binding domains
 | • S | flavor with less histidine than the others, best predicted by predictor VL-2S, which has a measure of sequence complexity as the most important attribute | proteins that facilitate binding and interaction
disorder–sequence complexity 206 | | IDPs from different functional classes show distinct disorder–sequence complexity distributions | proteins with disordered linkers between structured domains populate compact and disordered DC regions
overall degree of disorder 35,51,68,161,208,209 | • fraction | categorization of proteins based on the fraction of residues predicted to be disordered | 0–10/10–30/30–100% disorder
 | • overall score | overall disorder scores for the whole protein | minimum average disorder score depending on the predictor
 | • continuous stretches | presence or absence of continuous stretches of disordered residues | typically >30 residues
length of disordered regions 211 | • >500 residues | proteins that contain disordered regions of different lengths are enriched for different types of functions | transcription
 | • 300–500 residues | | kinase and phosphatase functions
 | •

30 amino acids in length 49 (similar data shown in Figure 2A). Short IDRs may function as linkers and contain individual linear motifs or MoRFs, whereas longer disordered regions might be entropic chains or contain combinations of motifs or domains functioning in recognition. Very long disordered regions (more than 500 residues) are typically over-represented in transcription-related functions, 211 whereas proteins containing IDRs of 300–500 residues in length are enriched for kinase and phosphatase functions. Shorter IDRs (less than 50 residues) tend to be linked to metal ion binding, ion channels, and GTPase regulatory functions.
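As a rough illustration of how the disorder-based categories above could be applied to a single protein, the following Python sketch takes a per-residue disorder mask, extracts continuous disordered stretches of more than 30 residues, assigns the 0–10/10–30/30–100% fraction class, and maps a region length to the functional enrichment reported above. All function names are hypothetical, the handling of values that fall exactly on a bin boundary is an assumption, and lengths between 50 and 299 residues return no label because no enrichment is stated for them in this section.

```python
from typing import List, Optional, Sequence, Tuple


def disordered_stretches(
    mask: Sequence[bool], min_length: int = 31
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) indices of disordered runs.

    Only runs of at least `min_length` residues are kept; the default of 31
    corresponds to "continuous stretches of >30 residues".
    """
    stretches = []
    start = None
    # A trailing False acts as a sentinel that closes a run ending at the C-terminus.
    for i, flag in enumerate(list(mask) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                stretches.append((start, i))
            start = None
    return stretches


def disorder_fraction_class(mask: Sequence[bool]) -> str:
    """Assign the 0-10 / 10-30 / 30-100% class (boundary handling is assumed)."""
    if not mask:
        raise ValueError("empty sequence")
    pct = 100.0 * sum(mask) / len(mask)
    if pct < 10:
        return "0-10%"
    if pct < 30:
        return "10-30%"
    return "30-100%"


def length_enrichment(length: int) -> Optional[str]:
    """Map a disordered-region length to the enrichment named in the text."""
    if length > 500:
        return "transcription-related functions"
    if 300 <= length <= 500:
        return "kinase and phosphatase functions"
    if length < 50:
        return "metal ion binding, ion channels, GTPase regulatory functions"
    return None  # 50-299 residues: no enrichment stated in this section
```

These labels describe statistical enrichments across many proteins, so a result for one protein is a hint for further investigation rather than a function assignment.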
Thus, the length of a disordered region can also provide a useful indication about the functional nature of the protein containing it.

5.6 Position of Disordered Regions

Almost all human proteins have some disordered residues within their terminal regions. 59 For example, 97% of proteins have predicted disorder in the first or last five residues. 161 Disordered N-terminal tails are common in DNA-binding proteins, and have been shown to contribute to efficient DNA scanning. 212 Furthermore, proteins that are relatively rich in disordered residues at the C-terminus are often associated with transcription factor repressor and activator activities as compared to proteins rich in internal or N-terminal disorder. 211 Membrane proteins, depending on their topology of insertion, also contain disordered regions in the N- or C-terminus, but their sequence composition is different as compared to disordered regions in cytosolic proteins. 213 Ion channel proteins are enriched for disordered residues at the N-terminus, and the same is true to a lesser extent for C-terminal disorder. 211 These terminal disordered regions are often functionally relevant, as illustrated by their role in the inactivation of voltage-gated potassium channels. 214 Similarly, many G-protein-coupled receptors (GPCRs) have large disordered regions in their C-terminus, and often in the intracellular loops. 215 Several of them harbor peptide motifs that link ligand binding in the transmembrane region of the receptor to intracellular effectors, or contain PTM sites or linear motifs that govern their stability. 216 Finally, proteins that are relatively rich in internal disordered regions are weakly enriched for transcription regulator and DNA binding activity. 211 Thus, the relative position of a disordered region in a sequence provides clues about the function of the protein containing it.

5.7 Tandem Repeats

Short tandem repeats are common in IDRs and IDPs.
61,217−220 For instance, as much as 96% of polyglutamate and polyserine stretches lie within disordered regions. 219 Similarly, large fractions were found for proline, glycine, glutamine, lysine, aspartate, arginine, histidine, and threonine repeats. In contrast, polyleucine stretches occur predominantly within structured regions. These observations agree with the compositional bias of disordered regions (see section 5.1); the most common tandem repeats in IDRs are made up of disorder-promoting residues 44,194 and of sequence patterns that are typically associated with disorder. 195 Moreover, a distinction between perfect and imperfect tandem repeats suggests that as the repeat perfection increases, so does the disorder content. 219 Repeats of different composition have been linked to specific functions. 218,221 Consequently, the presence of particular types of repeats is likely to contribute to IDR functioning. Descriptions and examples of different classes of disordered tandem repeats and their structural characteristics have been reviewed previously. 218 For instance, polyproline and polyglutamine stretches are associated with protein and nucleic acid binding and transcription factor activity. 222,223 Protein segments enriched for glutamine and asparagine often occur in disordered regions 224 and are abundant in eukaryotic proteomes, 225 despite their propensity to aggregate or form coiled-coil structures. 226 The aggregation propensity of the Q/N-enriched segments is exploited in the formation of physiologically relevant assemblies such as P-bodies (e.g., Ccr4 and Pop2), stress granules, and processing bodies. 227 However, expanded polyglutamine repeats are also associated with neurodegenerative disorders, the most well-known being Huntington’s disease. 
228 Moreover, several prion-like yeast proteins (e.g., Sup35p and Ure2p) contain intrinsically disordered Q/N-rich protein segments that have been implicated in the switch between a soluble and an insoluble, aggregated form. 225,229 Another example of functional disordered repeats occurs in the SR protein family of splicing factors (e.g., ASF/SF2 and SRp75). 230,231 SR proteins mediate the assembly of spliceosome components. They consist of an N-terminal RNA-recognition motif and a disordered C-terminus with tandem repeats of arginine and serine residues (RS domain). Phosphorylation switches the RS domain of the serine/arginine-rich splicing factor 1 (SRSF1) from a fully disordered state to a more rigid structure. 232 Other disordered repeats associated with a specific function include sequences enriched in lysine, alanine, and proline in the histone H1 C-terminal domain, which are involved in the formation of 30 nm chromatin fiber by binding linker DNA between the nucleosomes. 233,234 A final example is dentin sialophosphoprotein (DSPP), which contains extensively phosphorylated repeats of aspartic acid and serine involved in calcium phosphate binding (see section 9.3). 235 Some repeat-containing regions are also prone to undergo phase transitions from a soluble monomeric state to an insoluble large assembly form, as demonstrated for regions rich in proline, threonine, and serine residues in mucins (see section 9.2). 236

6 Protein Interactions

Disordered region-mediated molecular interactions have been proposed to work using a combination of conformational selection and induced folding. 92,146,237 These mechanisms of binding are two extreme possibilities and are not mutually exclusive. Both play a role in the interaction between two proteins, the dominant mechanism depending, for example, on the concentrations of the individual proteins 238 and the association rate constants.
84 In conformational selection, addition of binding partners can result in a population shift in the conformational ensemble of a disordered protein (see section 4.2) toward the conformation that is most favorable for binding. 119,145,173,175 This mechanism has been observed in both protein–protein and protein–nucleic acid interactions. 173 Evidence for the role of conformational selection in IDP binding comes, for example, from the interaction between PDEγ and the α-subunit of transducin, 239 which is important in phototransduction. The dynamic ensemble of unbound PDEγ includes a loosely folded state that resembles its structure when bound to transducin. In induced folding, a protein undergoes a disorder-to-order transition upon association with its binding partner. 92,146,240 Evidence for this mechanism in IDP binding comes, for example, from a study investigating the disordered pKID region of CREB and the KIX domain of CREB-binding protein. Upon binding of pKID to the KIX domain, an ensemble of transient encounter complexes forms, which appear to be stabilized primarily by hydrophobic contacts and evolve to form the fully bound state via an intermediate state without disassociation of the two domains. 91,241

6.1 Fuzzy Complexes

Although disordered protein regions frequently fold upon interacting with other proteins, complexes with IDPs often retain significant conformational freedom and can only be described as structural ensembles. 242 The conformations that disordered proteins adopt in the bound state cover a continuum, similar to the structural spectrum of free, unbound IDPs, 243 and range from static to dynamic, and from full to segmental disorder. 242 In static disordered complexes, disordered regions can adopt multiple well-defined conformations in the complex, whereas in dynamic disorder they fluctuate between various states of an ensemble in the bound state.
Disorder in the bound state can be classified into four molecular modes of action, each of which is associated with specific molecular functions (Figure 11A–D). 176,242 (i) The polymorphic model is a form of static disorder, with alternative bound conformations serving distinct functions by having different effects on the binding partner. Examples are the Tcf4 β-catenin binding domain 244 and the WH2 binding domains of thymosin β4 or ciboulot, 245 which have been shown to adopt several distinct conformations upon β-catenin and actin binding, respectively. Different actin–WH2 domain complexes have alternative interaction interfaces and result in actin polymers with different topologies. 245 The (ii) clamp and (iii) flanking models represent forms of dynamic disorder in which complex formation either involves folding upon binding of two disordered segments that are connected by a linker that remains disordered, or the reverse situation, respectively. The cyclin-dependent kinase (Cdk) inhibitor p21, for example, acts as a clamp. It contains a dynamic helical subdomain that serves as an adaptable linker that connects two binding domains and enables these to specifically bind distinct cyclin and Cdk complex combinations. 246 In both the clamp and the flanking models, disordered regions near the interacting protein segments (often short peptide motifs) contribute to binding by influencing affinity and specificity. 242,247 This phenomenon relates to the importance of the sequence context in modulating disordered binding elements (see section 3). Finally, (iv) the random model is an extreme version of dynamic disorder in protein complexes, which occurs when the IDR remains largely disordered even in the bound state. In this case, interaction is achieved via linear motifs that do not get fixed upon binding. 
An example is the self-assembly of elastin, where solid-state NMR has provided evidence for dynamic disorder within elastin fibers, which exhibit random-coil like chemical shift values. 248 Another case is the complex between the Cdk inhibitor Sic1 and the SCF ubiquitin ligase subunit Cdc4, which is formed in a phosphorylation-dependent manner. 249 At any given time, only one out of nine Sic1 phosphorylation sites interacts with the core Cdc4 binding site, while the others contribute to the binding energy via a secondary binding site or via long-range electrostatic interactions (Figure 12N). Hence, binding interchanges dynamically within the Sic1–Cdc4 complex to provide ultrafine tuning of the affinity. 249,250

Figure 11 Classification of fuzzy complexes by topology (upper panel) and by mechanism (lower panel). Blue arrows indicate interactions between fuzzy disordered regions and structured molecules. Protein Data Bank 147 identifiers for the structures are given in parentheses. Topological categories: (A) Polymorphic. The WH2 domain of ciboulot interacts with actin in alternative locations: via an 18-residue segment (3u9z) or via only three residues (2ff3). The flanking regions remain dynamically disordered. (B) Clamp. The Oct-1 transcription factor has a bipartite DNA recognition motif. The two globular binding domains are connected by a 23 residue long disordered linker (1hf0), shortening of which reduces binding affinity. (C) Flanking. The p27Kip1 cell-cycle kinase inhibitor binds to the cyclin–Cdk2 complex (1jsu). The kinase binding site is flanked by a ∼100 residue long disordered linker, which enables T187 at the C-terminus to be phosphorylated. (D) Random. UmuD2 is a dimer that is produced from UmuD by RecA-facilitated self-cleavage (1i4v). The resulting proteins exhibit a random coil signal in circular dichroism experiments at physiologically relevant concentrations. Mechanistic categories: (E) Conformational selection. The fuzzy N-terminal acidic tail of the Max transcription factor (1nkp) facilitates formation of the DNA binding helix (dark red) of the leucine zipper basic helix–loop–helix (bHLH) motif. (F) Flexibility modulation. The disordered serine/arginine-rich region of the Ets-1 transcription factor (1mdm) changes DNA binding affinity by 100–1000-fold by modulating the flexibility of the binding segment via transient interactions. (G) Competitive binding. The acidic fuzzy C-terminal tail of high-mobility group protein B1 (2gzk) competes with DNA for the positively charged binding surfaces. (H) Tethering. The binding of the virion protein 16 activation domain to the human transcriptional coactivator positive cofactor 4 (2phe) is facilitated by acidic disordered regions, which anchor the binding segments.

Bound disordered regions can impact the interaction affinity and specificity of the complex and tune interactions of folded regions 176 with proteins or DNA. 251 Four different mechanisms have been proposed for the formation of fuzzy complexes (Figure 11E–H). (i) The first is conformational selection, when the disordered region shifts the conformational equilibrium of the binding interface toward the bound form. The fuzzy N-terminal tail of the Max transcription factor, for example, reduces electrostatic repulsion in the basic helix–loop–helix (bHLH) domain and thereby facilitates formation of the DNA recognition helices, which increases binding affinity by 10–100-fold. 252 (ii) In the second mechanism, the disordered region(s) modulate flexibility of the binding interface. The serine- and arginine-rich region of the Ets-1 transcription factor exemplifies this mechanism: it reduces DNA binding affinity by 100–1000-fold. 253 (iii) The third mechanism is competitive binding of the disordered region. Here, the IDR acts as a competitive inhibitor of other regions in the same protein for binding to a partner.
The acidic fuzzy C-terminal tail of high-mobility group protein B1 (HMGB1) negatively regulates interaction of the HMG DNA binding domains by occluding the basic DNA-binding surfaces. 254 (iv) In the fourth mechanism, the disordered region serves to tether a weak-affinity binding region to increase its local concentration. For example, a fuzzy N-terminal domain anchors the human positive cofactor 4 (PC4) to several transactivation domains including the herpes simplex virion protein 16 (VP16). 255 All mechanisms of disordered complex formation affect binding to different degrees and can be further tuned by post-translational modifications. 176,251 PTMs in the disordered region may act as affinity tuners by modulating the charge available for biomolecular interactions. 256 6.2 Binding Plasticity Structural analysis of a large number of intrinsic disorder-based protein complexes resulted in another categorization of IDRs based on their binding plasticity (Figure 12). 257 Examples of relatively static IDR-based complexes are (i) mono- and polyvalent complexes, which typically consist of interactions between disordered segments and one or multiple spatially distant binding sites on their binding partners, respectively, (ii) chameleons, such as p53, that have different structures when binding to different proteins, (iii) penetrators that bury significant parts of the protein inside their binding partners, and (iv) huggers, which function in protein oligomerization, for example, by coupled folding and binding of disordered monomers. In addition to these relatively static complexes involving IDRs, one can identify coiled-coil-based complexes. Regions that make up coiled coils are typically highly disordered in monomeric state and gain helical structure upon coiled-coil formation, giving rise to several distinguishable types of complexes, such as intertwined strings, connectors, armatures, and tentacles. Figure 12 A portrait gallery of disorder-based complexes. 
Illustrative examples of various interaction modes of intrinsically disordered proteins are shown. Protein Data Bank 147 identifiers for the structures are given in parentheses. (A) MoRFs. Aa, α-MoRF, a complex between the botulinum neurotoxin (red helix) and its receptor (a blue cloud) (2NM1); Ab, ι-MoRF, a complex between an 18-mer cognate peptide derived from the α1 subunit of the nicotinic acetylcholine receptor from Torpedo californica (red helix) and α-cobratoxin (a blue cloud) (1LXH). (B) Wrappers. Ba, rat PP1 (blue cloud) complexed with mouse inhibitor-2 (red helices) (2O8A); Bb, a complex between the paired domain from the Drosophila paired (prd) protein and DNA (1PDN). (C) Penetrator. Ribosomal protein s12 embedded into the rRNA (1N34). (D) Huggers. Da, E. coli trp repressor dimer (1ZT9); Db, tetramerization domain of p53 (1PES); Dc, tetramerization domain of p73 (2WQI). (E) Intertwined strings. Ea, dimeric coiled coil, a basic coiled-coil protein from Eubacterium eligens ATCC 27750 (3HNW); Eb, trimeric coiled coil, salmonella trimeric autotransporter adhesin, SadA (2WPQ); Ec, tetrameric coiled coil, the virion-associated protein P3 from Caulimovirus (2O1J). (F) Long cylindrical containers. Fa, pentameric coiled coil, side and top views of the assembly domain of cartilage oligomeric matrix protein (1FBM); Fb, side and top views of the seven-helix coiled coil, engineered version of the GCN4 leucine zipper (2HY6). (G) Connectors. Ga, human heat shock factor binding protein 1 (3CI9); Gb, the bacterial cell division protein ZapA from Pseudomonas aeruginosa (1W2E). (H) Armature. Ha, side and top views of the envelope glycoprotein GP2 from Ebola virus (2EBO); Hb, side and top views of a complex between the N- and C-terminal peptides derived from the membrane fusion protein of the Visna (1JEK). (I) Tweezers or forceps. A complex between c-Jun, c-Fos, and DNA. Proteins are shown as red helices, whereas DNA is shown as a blue cloud (1FOS). (J) Grabbers. 
Structure of the complex between βPIX coiled coil (red helices) and Shank PDZ (blue cloud) (3L4F). (K) Tentacles. Structure of the hexameric molecular chaperone prefoldin from the archaeum Methanobacterium thermoautotrophicum (1FXK). (L) Pullers. Structure of the ClpB chaperone from Thermus thermophilus (1QVR). (M) Chameleons. The C-terminal fragment of p53 gains different types of secondary structure in complexes with four different binding partners, cyclin A (1H26), sirtuin (1MA3), CBP bromo domain (1JSP), and s100ββ (1DT7). Panels A–M reprinted with permission from ref (257). Copyright 2011 The Royal Society of Chemistry. (N) Dynamic complexes. Schematic representation of the polyelectrostatic model of the Sic1–Cdc4 interaction. An IDP (ribbon) interacts with a folded receptor (gray shape) through several distinct binding motifs and an ensemble of conformations (indicated by four representations of the interaction). The intrinsically disordered protein possesses positive and negative charges (depicted as blue and red circles, respectively) giving rise to a net charge ql , while the binding site in the receptor (light blue) has a charge qr . The effective distance ⟨r⟩ is between the binding site and the center of mass of the intrinsically disordered protein. Panel N was reprinted with permission from ref (243). Copyright 2010 John Wiley & Sons, Inc. 7 Evolution Disordered regions typically evolve faster than structured domains. 51−56,107 This behavior largely stems from a lack of constraints on maintaining packing interactions, which drives purifying selection in structured sequences. 258 However, disordered residues do display a wide range of evolutionary rates (Box 2). The following section discusses the evolutionary classifications of disordered protein regions. IDRs with similar functions and properties tend to have similar evolutionary characteristics. 
7.1 Sequence Conservation While the amino acid sequence of disordered regions evolves at different rates, the property of disorder is usually conserved for functional sequences. 54,159 Sequence conservation of IDRs varies according to their specific functions and provides another means for their classification. 54,259,260 Three biologically distinct classes of IDRs with specific function were identified using a combination of disorder prediction and multiple sequence alignment of orthologous groups across 23 species in the yeast clade (Figure 13): (i) flexible disorder describes regions where disorder is conserved but that have quickly evolving amino acid sequences (i.e., there is a requirement to be disordered, regardless of the exact sequence), (ii) constrained disorder describes regions of conserved disorder with also highly conserved amino acid sequences, and (iii) nonconserved disorder, where not even the property of being disordered is conserved in closely related species. For flexible disorder, low sequence conservation is expected if the property of disorder itself, as opposed to disorder in combination with specific sequence, is the only requirement for function. Examples of functions that mainly require the biophysical flexibility of disordered regions are entropic springs, spacers, and flexible linkers between well-folded protein domains. 37,39,57,58 The linker in RPA70 is an example where the dynamic behavior is conserved even when the sequence conservation is low. 60 Flexible disorder is the most common of the three evolutionary classes with just over one-half of disordered residues in yeast. 
It appears to account not just for the “flexibility” functions mentioned above, but also for many of the characteristics traditionally associated with disordered regions, such as strong association with signaling and regulation processes, 35,50,104,190,261,262 rapid sequence evolution, 51−56,107 the presence of short linear motifs (which are themselves conserved, see below), 47,72 and tight regulation (see section 8). 68,263 By contrast, constrained disorder (about a third of disordered residues in yeast) is associated with different properties and functions, such as chaperone activity and RNA-binding ribosomal proteins. 54 Many proteins that contain the evolutionarily constrained type of disorder can adopt a fixed conformation, suggesting that these regions might undergo folding upon binding to their targets. This structural transition might impose a high degree of local structural constraints, which results in constraints on the protein sequence alongside requirements to be flexible. 54 Constrained disordered residues also occur more often in annotated protein sequence families (domains) than flexible disorder, but both types are strongly depleted in domains compared to structured regions. In human, both flexible and constrained disorder are enriched in proteins functioning in differentiation and development, 264 which reflects the importance of IDPs in these processes. Finally, nonconserved disorder accounts for around 17% of disordered residues in yeast and appears to be largely nonfunctional. Figure 13 Classification of disordered regions according to their evolutionary conservation (constrained, flexible, and nonconserved disorder). (A) Schematic of computing disorder conservation and amino acid sequence conservation. The alignments are used to calculate the percentage of sequences in which a residue is disordered and the percentage of sequences in which the amino acid itself is conserved. 
A residue is considered to be conserved disordered if the property of disorder is conserved in at least one-half of the species. Similarly, the amino acid type of a residue is considered conserved if it is present in at least one-half of the species. Disordered residues in which both sequence and disorder are conserved are referred to as constrained disorder. Disordered residues in which disorder is conserved but not the amino acid sequence are referred to as flexible disorder. Residues that are disordered in S. cerevisiae but not cases of conserved disorder are referred to as nonconserved disorder. (B) Disorder splits into three distinct phenomena. Functional enrichment maps of proteins enriched in flexible disorder versus constrained disorder. The area of each rectangle is proportional to the occurrence of that type of disorder in the alignments. Related gene ontology terms are grouped based on gene overlap. Reprinted with permission from ref (54). Copyright 2011 Springer Science + Business Media. Short linear motifs (see section 3.1) 48,125 constitute a special case. Even though SLiMs almost exclusively lie within disordered regions, their own amino acid sequence tends to be conserved. 48 These properties, together with the difficulty of aligning rapidly evolving disordered sequences, cause the motifs to appear to move around when their positions are compared across different sequences. In fact, not only do motifs move around (due to insertions and deletions of amino acids around the motif in the sequence 67,265 ), they can also permute their positions with respect to other structural and functional modules. For example, SUMO modification sites in p53 are seen after and before the oligomerization domain in human and fly, respectively. 266 Such behavior could emerge by convergent evolution and loss of the motif in the original site, as only a few amino acids need to mutate to make a new motif elsewhere in the sequence. 
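The conservation rule stated in the Figure 13 caption (disorder conserved in at least one-half of the species, amino acid type conserved in at least one-half of the species) can be illustrated with a minimal Python sketch. The function name `classify_residue`, the input layout (one entry per aligned species, reference included), and the handling of alignment gaps are assumptions made for illustration only; they are not taken from ref (54), whose exact procedure may differ.

```python
from typing import Optional, Sequence

GAP = "-"  # assumed gap symbol in the alignment column


def classify_residue(
    ref_aa: str,
    ref_disordered: bool,
    column_aas: Sequence[str],
    column_disordered: Sequence[bool],
    threshold: float = 0.5,
) -> Optional[str]:
    """Classify one alignment column for a reference (e.g. S. cerevisiae) residue.

    column_aas and column_disordered hold one entry per species in the
    alignment, reference included. Gapped positions count as neither
    disordered nor sequence-conserved (an assumption of this sketch).
    Returns None when the reference residue itself is not disordered.
    """
    if not column_aas or len(column_aas) != len(column_disordered):
        raise ValueError("column inputs must be non-empty and of equal length")
    if not ref_disordered:
        return None  # structured in the reference: outside the three classes

    n = len(column_aas)
    # Fraction of species in which this position is predicted disordered.
    frac_disordered = sum(
        d and aa != GAP for aa, d in zip(column_aas, column_disordered)
    ) / n
    # Fraction of species carrying the same amino acid as the reference.
    frac_same_aa = sum(aa == ref_aa for aa in column_aas) / n

    if frac_disordered < threshold:
        return "nonconserved"
    return "constrained" if frac_same_aa >= threshold else "flexible"


if __name__ == "__main__":
    # Hypothetical six-species column: disorder everywhere, sequence divergent.
    print(classify_residue("S", True, list("SATGNQ"), [True] * 6))
```

Applying such a function to every disordered residue of the reference proteome and counting the three labels would give class proportions of the kind quoted in the text, although the real values depend on the disorder predictor and alignments used.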
As long as the position of the motif with respect to the other modules does not affect function, such permutations will not affect fitness and hence may emerge relatively easily during evolution. These are indeed confounding issues when aligning disordered regions among orthologous proteins to identify functional motifs. In many ways, the disordered regions that contain SLiMs constitute flexible disorder according to the above classification, as their main role is to provide flexibility to enable access to the linear motif for proteins that will bind them as ligands 267 or introduce post-translational modifications. 47,48 Phosphorylation sites are closely related to short linear motifs that function in binding, but are often too short and weakly conserved to recognize via computational means. 268 More than 90% of sites phosphorylated by the yeast Cdk1 are in predicted disordered regions, 67 consistent with previous studies highlighting the importance of IDRs as display sites for phosphorylation and other PTMs (see sections 2.2 and 3.1). 45,46 Comparison of the phosphorylation sites in orthologues of the Cdk1 substrates revealed that the precise position of most phosphorylation sites is not conserved. Instead, clusters of sites move around in the alignment of rapidly evolving disordered regions. 69,250,269 Another example of the role of flexible disorder in signaling and regulation is the yeast serine-arginine protein kinase Sky1, which regulates proteins involved in mRNA metabolism and cation homeostasis. The Sky1 C-terminal loop is intrinsically disordered and contains phosphosites that are important for regulating its kinase activity. 270 Conservation analysis has shown that the loop is conserved for disorder but not for sequence. 54 The combination of sequence conservation of IDRs and conservation of their amino acid composition between human and seven other eukaryotes (chimp, dog, rat, mouse, fly, worm, and yeast) also identifies functional preferences. 
260 IDRs with high residue conservation (HR) are enriched in proteins involved in transcription regulation and DNA binding. Low residue conservation in combination with high conservation of the amino acid type composition (LRHT) of the IDR (i.e., high similarity of overall amino acid composition between the human IDR and its orthologs) is often associated with ATPase and nuclease activities. Finally, IDRs that show neither conservation of sequence nor conservation of amino acid composition (LRLT) are abundant in (metal) ion binding proteins. 7.2 Lineage and Species Specificity Increasingly complex organisms have higher abundances of disorder in their proteomes. 35,271 An average of 2% of archaeal, 4% of bacterial, and 33% of eukaryotic proteins have been predicted to contain regions of disorder over 30 residues in length, 35 although there is much variation within kingdoms. 272,273 In human, 31% of proteins are more than 35% unstructured, 68 and 44% contain stretches of disorder longer than 30 residues 49,161,208 (similar data shown in Figure 2A). Human IDPs are spread relatively uniformly across the chromosomes, with percentages ranging from 38% (for genes encoding IDPs on chromosome 21) to 50% on chromosomes 12 and X. 161 A computational analysis of disorder in prokaryotes has corroborated the higher abundance of disorder in Bacteria as compared to Archaea. 274 Moreover, in agreement with the low abundance of disorder in prokaryotes, none of the 13 mitochondrial-encoded proteins are disordered. 161 Systematic analysis of IDP occurrence in 53 archaeal species showed that disorder content is highly species-dependent. 275 For example, Thermoproteales and Halobacteria proteomes have 14% and 34% disordered residues, respectively. Harsh environmental conditions seem to favor higher disorder contents, suggesting that some of the archaeal IDPs evolved to help accommodate hostile habitats. 276 Structural disorder is more common in viruses than in prokaryotes. 
277 The characteristics of IDRs seem well suited for especially small RNA viruses with extremely compact genomes. 278,279 For example, disordered regions could buffer the deleterious effects of mutations introduced by low-fidelity virus polymerases better than would structured domains. 277 The flexibility of IDRs to interact with many different proteins, such as proteins of the host immune system, is another useful feature for compact viruses because it maximizes the amount of functionality they encode while minimizing the required genetic information. 280 At the same time, several human innate immunity proteins have predicted disordered regions that could be important for their pathogen defense function. 281 For example, the RIG-I-like receptors (RLRs) RIG-I and MDA5 recognize different types of viral double-stranded RNA (dsRNA). 282 This functional divergence is partly achieved by differential flexibility of a loop that is rigid in RIG-I, but disordered in MDA5, resulting in different RNA binding preferences. 283 Furthermore, the disordered linker between the RNA-binding domains and the two N-terminal CARD (caspase activation and recruitment) domains of MDA5 helps facilitate oligomerization of the CARD domains, which initiates downstream signaling. 283 Activated RIG-I and MDA5 promote the formation of prion-like aggregates of the CARD domains of MAVS (mitochondrial antiviral-signaling). 284 MAVS has a highly disordered central region that contains multiple phosphorylation sites and interacts with several proteins, such as TRAF2 and TRAF6 through their respective consensus binding motifs (PxQx[TS] and PxExx[FYWHDE], respectively). 285 These interactions are part of a signaling pathway that activates the transcription factors IRF3/7 and NF-κB, leading to the expression of proinflammatory cytokines such as IFN-α/β and various proteins with direct antiviral activity. 
282 For example, to counteract viral infection, protein kinase R (PKR) phosphorylates the translation initiation factor eIF2α in the presence of dsRNA, which reduces global protein synthesis in the cell. 286 PKR contains a long disordered interdomain region that may become ordered upon RNA binding and could affect PKR dimerization. 287,288 Interestingly, viruses counteract PKR action by mimicking eIF2α and competing for PKR binding, as has been shown in the case of the poxvirus protein K3L. 289 PKR is under intense positive selection to keep recognizing eIF2α while minimizing interaction with viral antagonists. 289 Many of the changing sites in PKR are in a dynamic loop near the interaction interface with both eIF2α and K3L. 290 Similarly, recognition of retrovirus capsids by the restriction factor TRIM5α is mediated by disordered regions in the SPRY domain, which bear many positively selected residues that are essential for the antiviral activity. 291 The SPRY domain exists as an ensemble of disordered conformations that determine the specificity and affinity of the interaction between TRIM5α and the viral capsid. 292−294 In this way, the evolutionary flexibility of disordered regions (see section 7.1) provides opportunities for proteins of the host immune system to compete with rapidly changing pathogens while maintaining their functionality. In addition to the variation in prevalence of disordered regions between species, different kingdoms of life seem to use conserved IDRs for different functions: eukaryotic and viral proteins use disorder mainly for mediating transient protein–protein interactions in signaling and regulation, while prokaryotes use disorder mainly for longer lasting interactions involved in complex formation. 159 Thus, knowledge of the lineage, species, and origin of a disordered region could help in predicting its likely function. 
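As a small illustration of how consensus binding motifs such as the TRAF2 and TRAF6 patterns quoted above for MAVS (PxQx[TS] and PxExx[FYWHDE]) could be located in a sequence, the following Python sketch translates them into regular expressions, reading "x" as any of the 20 standard amino acids. The function name `find_traf_motifs` and the example sequence are hypothetical; the sequence is invented for demonstration and is not MAVS. Pattern matches of this kind are only candidates, since short motifs also occur by chance and their function depends on context such as disorder and conservation.

```python
import re
from typing import List, Tuple

ANY = "[ACDEFGHIKLMNPQRSTVWY]"  # "x" in the consensus notation

# Lookahead groups allow overlapping candidate sites to be reported.
TRAF_PATTERNS = {
    "TRAF2": re.compile(rf"(?=(P{ANY}Q{ANY}[TS]))"),             # PxQx[TS]
    "TRAF6": re.compile(rf"(?=(P{ANY}E{ANY}{ANY}[FYWHDE]))"),    # PxExx[FYWHDE]
}


def find_traf_motifs(sequence: str) -> List[Tuple[str, int, str]]:
    """Return (motif name, 1-based start position, matched segment) tuples,
    sorted by position, for every candidate site in the sequence."""
    seq = sequence.upper()
    hits = []
    for name, pattern in TRAF_PATTERNS.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    toy_sequence = "MAAPVQESGGSPGEAAFKK"  # made-up sequence, for illustration only
    for name, start, segment in find_traf_motifs(toy_sequence):
        print(name, start, segment)
```

In practice, candidate sites would typically be filtered further, for example by restricting matches to predicted disordered regions, in line with the observation that SLiMs lie almost exclusively within IDRs.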
7.3 Evolutionary History and Mechanism of Repeat Expansion Tandem repeats are enriched for intrinsic disorder (see section 5.7), and IDRs are increasingly abundant in increasingly complex organisms (see section 7.2). The genetic instability of repetitive genomic regions in combination with the structurally permissive nature of IDRs might have driven the increase in the amount of disorder during evolution. Disordered repeat regions have been shown to fall into three categories, based on their evolutionary history and acquired functional properties (Figure 14): 61 type I regions have not undergone functional diversification after repeat expansion (e.g., the titin PEVK domain), type II repeats have acquired diverse functions due to mutation or differential location within the sequence (e.g., the C-terminal domain of eukaryotic RNA polymerase II), and type III regions have gained new functions as a consequence of their expansion per se (e.g., the prion protein octarepeat region). Figure 14 Repeat expansion creates IDRs. IDRs are abundant in repeating sequence elements, which suggests that repeat expansion is an important mechanism by which genetic material encoding for structural disorder is generated. The expanding repeats may fall into three classes (types) in terms of their functional diversification following expansion. Individual repeats may remain functionally equivalent (type I), or diversify (type II), or collectively acquire a completely new function (type III). Dark-tone red indicates structural disorder of the repeat, which may undergo full (dark-tone blue) or partial (green) induced folding upon binding to a partner. Adapted with permission from ref (61). Copyright 2003 John Wiley & Sons, Inc. 8 Regulation Altered availability of IDPs is associated with diseases such as cancer and neurodegeneration. 190,263,295−299 Indeed, genes that are harmful when overexpressed (i.e., dosage-sensitive genes) often encode proteins with disordered segments. 
300 Multiple mechanisms at different stages during gene expression (from transcript synthesis to protein degradation) control the availability of IDPs. 68 Their tight regulation ensures that IDPs are available in appropriate levels and for the right amount of time, thereby minimizing the likelihood of ectopic interactions. Disease-causing altered availability of IDPs may result in imbalances in signaling pathways by sequestering proteins through nonfunctional interactions involving disordered segments (i.e., molecular titration 263 ). The following section discusses possible functional roles of proteins with IDRs based on their cellular regulatory properties such as transcript abundance, alternative splicing, degradation kinetics, and post-translational processing. 8.1 Expression Patterns Five different expression patterns were identified for transcripts encoding highly disordered proteins by investigating the mRNA levels from over 70 different human tissues and comparing the number of tissues in which IDP transcripts are expressed against the level of expression (Figure 15). 208 The expression classes are associated with specific functions. (i) The first subgroup (Figure 15, light blue markers) shows constitutive high expression in all tissues and consists exclusively of large ribosomal subunit proteins, which are almost entirely disordered. (ii) The second group (blue-green) represents transcripts that show high expression levels in the majority of tissues. These often function as protease inhibitors, splicing factors, and complex assemblers. (iii) Moderately expressed transcripts (green) typically encode disordered proteins involved in DNA binding and transcription regulation. (iv) IDPs that are expressed in a tissue-specific manner (yellow) are enriched for cell organization regulators, transcription cofactors, and factors that promote complex disassembly. 
Finally, (v) the remaining transcripts form a group (gray) not detected to be abundant in any of the tissues studied. This low and transient expression group contains more than one-half of the IDP transcripts analyzed and has a variety of functions. Figure 15 A summary of expression–function trends for human transcripts encoding highly disordered proteins. The x-axis represents the log10 number of tissues in which the transcript is expressed; the y-axis represents the log10 average magnitude of expression within the tissues. From the data, five distinct functional classes of highly disordered human proteins become apparent. Adapted with permission from ref (208). Copyright 2009 Springer Science + Business Media. 8.2 Alternative Splicing Trends in transcriptional regulation (alternative promoter and polyadenylation site usage) and post-transcriptional regulation (alternative splicing by inclusion or exclusion of exons) can also be informative of the role that specific disordered protein regions play in the cell (Figure 16). Alternatively spliced exons are overall more likely to encode intrinsically disordered rather than structured protein segments. 161,301−303 This tendency is even more pronounced in alternative exons whose inclusion or exclusion is regulated in a tissue-specific manner. 304 IDRs that are encoded by these tissue-specific alternative exons frequently influence the choice of protein interaction partners and can be instrumental in protein regulation 304,305 by embedding binding motifs and residues that can be post-translationally modified. 304 However, simple alteration of the length of a disordered region 306 can also modulate the overall protein function (Figure 16). Changes in IDR length can be an effective mechanism for modifying the affinity of interactions that a protein makes, particularly in instances where a disordered region is responsible for the positioning of protein binding motifs or domains. 
307,308 Among the alternative exons, those that exhibit conserved splicing patterns across different species are particularly likely to have important regulatory roles. For example, tissue-specific exons, which are alternatively spliced in multiple different mammals, remarkably often contain IDRs with embedded phosphosites. 309 Disordered regions encoded by these exons are hence likely to act as modulators of protein function depending on the tissue where they are expressed. 309 While tissue-specific exons that are alternatively spliced in a conserved fashion often code for phosphosites, the emergence of novel exons in a gene, although at first likely detrimental, 310 is a possible template for the evolution of short interaction motifs. 311 Furthermore, changes in exon regulation can also be important for the emergence of novel adaptive functions. Accordingly, protein segments encoded by exons, which are alternatively spliced either in a single species or in a whole evolutionary lineage, are enriched in short binding motifs, and alternative inclusion of disordered regions encoded by these exons is conceivably a source of evolutionary novelty. 312 Figure 16 Transcriptional and post-transcriptional gene regulation can be informative of IDR function. How inclusion of exons that code for IDRs is regulated during gene transcription and alternative splicing can give insights into the functional roles of the encoded disordered regions. For example, tissue- or developmental-specific regulation of alternative splicing or alternative promoter and polyadenylation site usage can be associated with important roles of the encoded IDRs in protein regulation and cellular interactions through, for example, the presence of binding motifs and phosphosites. Additionally, information on the conservation of patterns of exon inclusion (i.e., events shared among different evolutionary lineages versus species-specific events) can aid in better characterization of the encoded IDRs. 
The figure illustrates a hypothetical example where an exon (largest red box) that is included in a tissue-specific manner both in human and in mouse encodes an IDR that embeds a phosphosite (P) and is involved in protein regulation. The human gene depicted in the figure has an additional exon (smallest red box), which encodes an IDR with a short interaction motif and which is also included in a tissue-specific manner in humans. Gene structures, mature mRNAs, and corresponding protein isoforms are shown for human and mouse brain and heart tissues. On the right, possible functional roles of the IDRs encoded by the brain isoforms are illustrated. The examples illustrate how protein functional space can increase due to alternative splicing of exons that encode IDRs. Adapted with permission from ref (304). Copyright 2012 Elsevier. In addition to the tendency of cassette alternative exons to frequently encode IDRs, exons adjacent to the alternatively spliced ones are also likely to code for disordered regions around the insertion point for the alternatively spliced segment. 264,302 These disordered regions not only provide the structural flexibility that tolerates both presence and absence of the alternatively spliced segment, but they can also contain interaction motifs themselves. 264 Furthermore, on the transcriptional level, diversity in protein isoforms can be created through both alternative splicing and usage of alternative promoters and polyadenylation sites. Protein segments that are encoded by the two latter mechanisms can contain disordered regions with motifs that define protein localization and stability. 313 Taken together, these examples illustrate how better understanding of gene regulation and knowledge of evolutionarily conserved and novel isoforms can provide insights into possible functional roles of whole proteins and specific protein regions. 
8.3 Degradation Kinetics Another emerging functionality of disordered regions is their role in protein degradation. 314−321 Protein half-life generally correlates with the fraction of disordered residues, 68,317 and proteins that get ubiquitinated specifically upon heat shock stress are typically disordered. 322 Although ubiquitination by E3 ligases has a dominant role in recruiting proteins to the proteasome for degradation, 323,324 some IDRs of sufficient length allow for efficient initiation of degradation by the proteasome independent of the ubiquitination status. This idea is supported by in vitro experiments showing that degradation of tightly folded proteins is accelerated when a disordered region is attached to model substrates. 315,321 Efficient degradation only occurs when the disordered terminal region is of a certain minimal length, 321 and degradation may be initiated by IDRs either at the protein terminus or internally. 314−321 Proteins that contain IDRs of sufficient length may therefore have increased turnover, although the exact length requirements will depend on the substrate. At the same time, not all IDRs influence protein half-life. For example, disordered polypeptides with specific amino acid compositions such as glycine-alanine and polyglutamine repeats can attenuate rather than accelerate degradation by the proteasome. 325−327 The formation of protein complexes or transient interactions with other proteins may also protect IDPs from degradation. Thus, we can distinguish a novel functional class of IDRs: those that influence protein degradation (degradation accelerators) versus those that do not. These properties might be associated with specific protein function. For example, proteins that contain IDRs of a given length are probably more susceptible to degradation, possibly linking them to functions of IDPs with low expression. 
Some highly disordered proteins (e.g., p53, p73, IκBα, BimEL) can, at least in vitro, be degraded by the 20S proteasome independent of ubiquitination. 328−333 Specialized proteins termed “nannies” have been shown to bind to and protect IDPs from ubiquitin-independent 20S proteasomal degradation. 334 A free IDP, such as newly synthesized p53, might be degraded by the 20S proteasome, which leads to fast degradation kinetics. After a nanny binds the IDP (Hdmx in the case of p53), slower, ubiquitin-dependent degradation by the 26S proteasome takes place. This biphasic decay has been proposed as a way to distinguish structured proteins from IDPs and the proteins that protect them from degradation. 334 8.4 Post-translational Processing and Secretion The majority of secretory proteins are targeted to the endoplasmic reticulum (ER) via an N-terminal signal peptide, which helps to initiate translocation of nascent chains into the ER. 335,336 Bioinformatic analysis of proteins containing N-terminal ER signal peptides has identified only 10% of these proteins as IDPs (>70% disordered), suggesting that IDPs are under-represented in the secretome. 337 The fact that secreted proteins are rarely IDPs might be partially explained by the requirement for largely disordered proteins to contain an α-helical prodomain for correct import into the ER lumen, 338 as demonstrated for intrinsically disordered prohormones. 337 IDPs lacking this structured, α-helical domain were subjected to ER-associated degradation (ERAD) despite the presence of a signal peptide. 338 Despite the relative depletion of IDPs in the secretome, a number of important IDPs are processed within the ER, including many prohormones, 337,339 components of the extracellular matrix, 340 and proteins involved in biomineralization (see section 9.3). 
117,341,342 Pre-pro-opiomelanocortin (pre-POMC) is a disordered 285 amino acid protein whose signal peptide is removed during translation to create the 241-residue pro-opiomelanocortin (POMC). This prohormone has at least eight putative basic-rich cleavage sites and is able to yield as many as 10 biologically active peptides including adrenocorticotropic hormone (ACTH) and β-endorphin. The processing of POMC is tissue-specific and depends on the type of convertase enzyme expressed. 343 Other prominent examples of disordered extracellular proteins are elastin and other components of elastic fibers, 344 small integrin-binding ligand N-linked glycoproteins (SIBLINGs) (see section 9.3), 340−342,345 and mucins (see section 9.2). 236 Thus, although secreted proteins are not particularly enriched for structural disorder overall, some IDPs are essential for biomineralization, tissue organization, and hormonal signaling. In line with the features of intracellular IDPs, extracellular disordered regions are heavily post-translationally modified and are involved in extensive interactions that organize large molecular assemblies while binding multiple interaction partners. 117,341,342 9 Biophysical Properties A large range of biophysical work has been carried out on structural disorder in proteins using a variety of experimental techniques (Box 2). 346 Previous sections have touched on several aspects. Disordered regions rapidly shift within a continuum of variably extended or globular conformations and are best described as dynamic ensembles (see section 4). The amino acid sequence of a disordered region determines which conformations it can sample, depending for example on the charge properties (see section 5.1). Disordered proteins frequently fold upon binding, and their binding thermodynamics allow for fast, transient, but highly specific interactions (see sections 2, 3, and 6). 
The following section discusses three other physical properties that are essential for the biology of some IDRs and IDPs: solubility, the ability to undergo phase transitions, and the role in biomineralization. 9.1 Solubility The solubility of a protein depends upon the favorability of its interactions with water. Globular proteins bury hydrophobic amino acids within their solvent-excluded cores, while their surfaces are generally enriched in polar and charged amino acids that interact favorably with water, leading to aqueous solubility. 347,348 The presence of hydrophobic surface residues, for example, binding sites for other proteins, and the denaturation of otherwise folded proteins lead to the exposure of hydrophobic residues to water and reduce solubility, sometimes leading to aggregation and precipitation. Disordered proteins do not spontaneously fold into globular structures because their sequences are depleted in hydrophobic amino acids that, in globular proteins, drive folding (see section 5). 31,44 The accompanying enrichment in polar and charged amino acids, as a general rule, causes disordered proteins to be soluble in aqueous solutions. In addition, IDPs are generally resistant to heat-induced aggregation and precipitation, because disordered proteins, in isolation, lack extensive secondary and tertiary structure that in folded, globular proteins is subject to thermal denaturation. Heat-stability was observed for some of the earliest examples of IDPs. For example, the highly disordered cyclin-dependent kinase (Cdk) inhibitor p21 remains soluble and structurally unaltered from 5 to 90 °C. 28 In fact, the related Cdk inhibitor p27 was purified by boiling, although at that time it was not known to be a disordered protein. 
349 In that study, boiling was used as a means to release p27 from its highly stable complexes with Cdks and cyclins, which, because they are folded proteins, underwent thermal denaturation and precipitated while heat-stable p27 remained soluble. This heat-treated preparation of p27 was subsequently demonstrated to potently inhibit Cdk2-cyclin A. 349 Sequence analysis algorithms have predicted a high prevalence of IDRs and IDPs in sequenced genomes (see section 7.2). 35,271 To experimentally address the issue of the disordered protein content of a proteome, Galea and co-workers 209 treated the soluble extract of mouse embryo fibroblast cells with heat to precipitate folded proteins and then used large-scale liquid chromatography and mass spectrometry methods to identify ∼1300 proteins that remained soluble. Disorder predictions showed that more than two-thirds of these thermostable proteins are substantially disordered. This demonstrates that disordered proteins, as a structural class, are more heat stable and soluble than their folded counterparts, consistent with their sequence features and the principles of amino acid solubility. However, disordered proteins exhibit varying degrees of compaction, which is influenced by the presence and patterning of charged residues within the polypeptide chain (see section 5.1). 166−168,196 While the influence of compaction on disordered protein solubility has not been addressed, it is reasonable to expect that the extent of compaction will influence the exposure of solubility-promoting amino acids for interactions with water and therefore aqueous protein solubility. It is possible that solubility has influenced the evolution of disordered protein sequences, with low abundance disordered proteins involved in signaling and regulation being less dependent on high solubility than other disordered proteins that are highly abundant in certain cell types (e.g., titin in muscle cells). 
Several extracellular IDPs use their solubility to great effect in the sequestration of inorganic molecules in the extracellular environment (see section 9.3). Apart from evolutionary considerations, there are practical applications of the high solubility associated with some disordered protein sequences. For example, proteins with higher degrees of disorder have an increased success rate of expression in a cell-free protein synthesis system. 350 Furthermore, Dunker and co-workers demonstrated that fusion of a variety of disordered polypeptide tags containing repetitive, highly negatively charged sequences (termed “entropic bristles”) enhanced the aqueous solubility of many proteins previously shown to be poorly soluble upon expression in E. coli. 351 Whether the solubilizing effect of these disordered tags is simply due to an increase in the fraction of solubility-promoting amino acids or to other effects, such as a potential molecular chaperone function, has not been determined. Clearly, however, disordered regions within multidomain proteins that also contain folded domains are likely to influence overall protein solubility. 9.2 Phase Transition The involvement of IDRs in phase transitions provides another biophysical angle to the characterization of proteins that harbor disordered regions. 99 Li and co-workers 137 observed that interactions between recombinant proteins that contain multiple copies of an SH3 domain and IDRs with multiple instances of the proline-rich SH3 interaction motif (see section 3.1) produced sharp liquid–liquid-demixing (phase separations) that resulted in micrometer-sized liquid protein-based droplets (Figure 17A). The concentrations needed for the phase transition depend on the valency (i.e., number of repeating units) of the interacting elements. 
Importantly, experiments with the natural NCK–nephrin–N-WASP (neuronal Wiskott–Aldrich syndrome protein) complex, which contains multiple copies of the same SH3 interaction partners, showed the formation of similar dynamic droplets, which lead to a significant increase in the activity of the actin nucleation factor Arp2/3. 137 The formation of the droplets is controlled by the degree of phosphorylation of one of the interaction partners, which potentially explains how the phase transitions may be regulated in the cell. Figure 17 Involvement of IDRs in phase transitions. (A) Interactions between proteins that contain multiple copies of a specific domain (an SH3 domain in the figure) and IDRs with multiple instances of its interaction motif (proline-rich SH3 motif here) can, at appropriate concentrations, produce sharp liquid–liquid-demixing phase separations. This phase transition is likely to increase local “active” protein concentrations exploitable for signaling switches. (B) High concentrations of low-complexity IDRs found in certain RNA binding domains lead to a reversible phase transition with the formation of highly dynamic hydrogels. These RNA granule-like assemblies consist of heteromeric protein aggregates and allow localization and storage of functionally related but nonidentical RNA molecules. Adapted from ref (100). Copyright 2013 the Biochemical Society. A related phenomenon occurs with RNA-binding proteins that contain IDRs of low sequence complexity. Such regions have been associated with the regulated formation of cellular RNA granules. 352 Various types of RNA granules are used to modulate the fate of specific mRNAs, but their assembly mechanism has remained unclear. Kato and co-workers 353 reconstituted granule-like RNA assemblies in vitro by exploiting low complexity IDRs. 
They demonstrated that the low-complexity IDRs of certain RNA-binding proteins were necessary for the formation of granule-like assemblies and that high concentrations of these regions lead to a reversible phase transition with a highly dynamic hydrogel state (Figure 17B). Interestingly, hydrogels formed by the low-complexity IDR of one purified member of the granules are capable of binding IDRs of other members and thereby enable the assembly of heterogeneous macromolecular structures. 353 Many IDRs that can form such functional aggregates have been shown to be under tight regulation to modulate their availability in the cell. 224 Regulation of IDR abundance can shift the equilibrium between the monomeric and oligomeric/aggregate form, thereby preventing formation of undesirable aggregates and keeping functional assemblies under control. 224 Together, these findings indicate that the biophysical properties of certain IDRs (such as those that contain specific low-complexity regions or linear motifs) enable phase transitions that are likely to be exploited in various macromolecular assemblies and could function to bridge the length scale of proteins with that of organelles. 354 Disorder-mediated phase transitions also occur extracellularly, as exemplified by the mucin family of proteins. These proteins rely on structural disorder for the formation of gel-like networks of mucus, which function in the protection of epithelial surfaces such as those in the airway and the gut. 355,356 Extensive glycosylation of very large disordered regions that are rich in proline, threonine, and serine residues contributes to the formation of these structures. 357 Mucin-1 can contain up to 120 such repeats, depending on the genetic variant an individual carries. 358 Regulated order-to-disorder transitions of Mucin-2 are important in the formation of colon mucus aggregates. 
88,236,359 Mucin-2 trimers are compact structures under the conditions of the secretory pathway, where the pH is low and calcium is present, but these structures partially unfold and greatly expand in more basic environments, such as in the colon, triggering a phase transition into a mucus polymer gel. 88,236,359 9.3 Biomineralization Most animals are able to produce hard tissues for various physiological purposes by mineralization of the extracellular matrix. 360,361 Bone and teeth, for example, consist of collagen and other proteins in conjunction with inorganic calcium phosphate in the form of hydroxyapatite (HA). 360,362 Proteins involved in hard tissue mineralization are predicted to have very high levels of disorder, 340−342 and disordered proteins are important in mineral homeostasis in general, 117 indicating an important role for IDRs in these processes. For example, unfolded phosphoproteins sequester calcium phosphate by forming stable complexes in which the phosphorylated side-chains of the proteins occupy the phosphate positions on the surfaces of calcium phosphate nanoclusters. 117 The disordered nature of these proteins allows them to readily adjust their shapes to surround and solubilize clusters of calcium phosphate. In this manner, proteins such as the milk caseins achieve high concentrations of calcium and phosphate while preventing the precipitation of the corresponding salts (i.e., calcification). 117 Caseins belong to the highly disordered secretory calcium-binding phosphoprotein (SCPP) gene family, 341 which includes bone, tooth, milk, and salivary proteins. 363 Humans encode five small integrin-binding ligand N-linked glycoproteins (SIBLINGs), which are a subset of SCPPs involved specifically in regulating bone and teeth formation by bringing together hydroxyapatite, cell-surface integrins, and collagens. 
345,360 These are osteopontin (OPN, or bone sialoprotein 1), bone sialoprotein 2 (IBSP), dentin matrix acidic phosphoprotein 1 (DMP1), matrix extracellular phosphoglycoprotein (MEPE), and dentin sialophosphoprotein (DSPP). 235 SIBLINGs are highly disordered 340−342,345 and undergo extensive phosphorylation in the Golgi before they are secreted, as demonstrated in the case of DSPP, which has approximately 200 phosphoserines. 235 DSPP has a particularly extreme serine and aspartic acid content, and its maturation product dentin phosphoprotein (DPP, or phosphophoryn) is likely to be one of the most acidic natural proteins known. 10 Discussion It is likely that many of the functionally uncharacterized proteins will be similar to already characterized ones. 8−10 This notion forms the basis for computational methods that aim to improve annotation coverage by predicting the function of novel and undefined proteins based on information from better-studied proteins. Databases such as Pfam 22 and SCOP 24 attest to the success of these approaches. However, existing methods are focused primarily on sequences that give rise to well-folded protein structures and domains. As a result, it is much harder to gain insight into the function of intrinsically disordered regions (IDRs) and proteins (IDPs), despite the increasing evidence of their prevalence and importance for protein functionality (Figure 1). 50 Many important disease proteins such as p53, Myc, α-synuclein, and BRCA1 are highly disordered, underscoring the importance of disordered regions for understanding the molecular basis of human diseases. 263,295,299 In this Review, we have assembled an overview of the major approaches used to classify and categorize IDRs and IDPs (Table 1). These classification schemes help us understand how disordered protein functionality is defined and could be used to enhance function prediction for disordered protein regions in general. 
In these final sections, we discuss the resources that are currently available for gaining insight into IDR function (Table 2), we address potential areas for improvement of the current approaches, and we propose that combinations of multiple existing classification schemes could achieve higher-quality function prediction for IDRs. Finally, we suggest areas where increased efforts are likely to advance our understanding of the functions of structural disorder in proteins. 10.1 Current Methods for Function Prediction of IDRs and IDPs Which methods and resources can a researcher use to gain insight into the functions of the disordered regions in a protein? Current approaches (Table 2) are mainly based on the presence of functional features such as short linear motifs (SLiMs), post-translational modification (PTM) sites, molecular recognition features (MoRFs), and intrinsically disordered domains (IDDs) (see section 3). These aspects have the potential to shed light on which interaction partners an IDR may have and how many, as well as the mode of binding. 10.1.1 Linear Motif-Based Approaches Mapping of well-characterized linear motifs onto other protein sequences holds particular promise for discovering novel functionality. For example, proteomic characterization of the motif (RxxPDG) that recruits Tankyrase ADP-ribose polymerases has led to the identification of novel Tankyrase substrates and explains the basis for mutations causing cherubism disease. 364 Similarly, proteome-wide searches for the SxIP motif have resulted in the identification of previously uncharacterized microtubule plus-end tracking proteins. 365 However, these types of individual studies require considerable resources. MiniMotif 126 and ELM 125 are two major efforts aimed at the annotation of known instances of linear motifs, which are primarily found in IDRs, and their binding partners. 
The MiniMotif and ELM databases aim to categorize linear motifs of all functions based on in-depth manual annotation of experimentally validated instances from the literature. Similar approaches have also been taken specifically for PTM site motifs (see section 10.1.2). Although these resources are excellent repositories of the functional sites that occur in IDRs, they do have certain shortcomings. For example, the annotations from MiniMotif are not publicly available. Although the ELM database is the most comprehensive database of functional features within IDRs, at present it does not have the resources to annotate all motifs in the literature; ELM contains ∼200 classes of linear motifs with over 2400 instances, but more than 250 classes await annotation, with this number constantly increasing. 125 This has meant that ELM is limited to annotating (a fraction of) the shorter motif classes and does not explicitly consider the longer binding modules in disordered regions. Complementary to the annotation efforts, the linear motif resources employ prediction methods that map functionality onto regions of proteins with unknown function (i.e., unannotated regions). For example, MiniMotif and ELM use regular expressions derived from experimentally validated and curated motif instances to search protein sequences. These searches bring up functional descriptions of sequence instances that match the regular expressions. A major problem in the computational detection of short motifs in particular is the high false positive rate, which means that it is very difficult for users to identify the instances that are most likely to be functional from the large total of mostly nonfunctional motif instances that result from these searches. 
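The regular-expression search described above can be illustrated with a minimal Python sketch. The two patterns are simple transcriptions of motifs named earlier in this section (RxxPDG and SxIP, with "x" read as any residue); they are assumptions made for illustration and are not the curated expressions used by ELM or MiniMotif. The function name `scan_motifs` and the example sequence are hypothetical.

```python
import re

# Illustrative patterns only: "x" in the motif names is written as "." (any
# residue). These are NOT the curated ELM or MiniMotif expressions.
MOTIFS = {
    "RxxPDG": r"R..PDG",
    "SxIP": r"S.IP",
}


def scan_motifs(sequence, motifs=MOTIFS):
    """Return (name, start, end, matched_text) for every motif match.

    Positions are 1-based and inclusive. Each pattern is wrapped in a
    zero-width lookahead so that overlapping matches are also reported.
    """
    seq = sequence.upper()
    hits = []
    for name, pattern in motifs.items():
        lookahead = re.compile(f"(?=({pattern}))")
        for m in lookahead.finditer(seq):
            # m.start(1) is 0-based; m.end(1) is exclusive, which equals
            # the 1-based inclusive end position.
            hits.append((name, m.start(1) + 1, m.end(1), m.group(1)))
    return sorted(hits, key=lambda h: (h[1], h[0]))


if __name__ == "__main__":
    # Made-up sequence, used only to exercise the function.
    for hit in scan_motifs("MARSAPDGKLSKIPTT"):
        print(hit)
```

Because such patterns are only a few residues long, they are expected to match many sequences by chance, which is the false-positive problem noted above.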
To overcome this issue, both databases have developed additional methods to improve prediction accuracy that rely on the use of additional context information, such as accessibility (using structural models 366 and predictions of intrinsic disorder 72 ), evolutionary conservation, 367,368 cell compartment (based on annotation), 126,369 and protein–protein interactions. 128,370,371 These efforts will need to be combined in the future with a clearer user interface so researchers can more easily identify the most relevant instances. De novo predictors make up the final category of motif resources. These predictors computationally identify putative uncharacterized motifs in protein sequences. There are two broad types: predictors that identify clusters of amino acids that are more conserved than surrounding residues (e.g., SLiMPrints 372 and phylo-HMM 373 ) or those that find short peptide patterns that are over-represented in a set of sequences (e.g., DiliMot 374 and SLiMFinder 375 ). Although both approaches have been combined with the gene ontology terms of the identified proteins, further development is required to define potential functionality. 10.1.2 PTM Site-Based Approaches In terms of PTM sites within disordered regions, resources such as Phospho.ELM, 268 PhosphoSite, 376 and PHOSIDA 377 curate experimentally verified phosphorylation sites and sometimes other types of modifications from the literature and genome-scale studies. Integration of such information with data on SNPs that are seen in natural populations or in cancer genomes can provide important insights into the functionality of a PTM site. 378,379 Important progress has been made in identifying and cataloging peptide motifs that direct post-translational modifications. ScanSite primarily identifies linear motifs that are likely to be phosphorylated and play key roles in signaling, such as the SH2 and 14–3–3 motifs. 
380 Annotation of these sequence motifs is based on results from binding experiments with peptide libraries and phage display experiments. 380 NetPhorest contains consensus sequence motifs of 179 kinases and 104 phosphorylation-dependent binding domains. 381 In addition, approaches such as NetworKIN 370 systematically integrate experimentally derived PTM sites with evolutionary information, and define motifs around the PTM sites that may be recognized by the kinase. In this manner, site-specific interactions between 123 kinases and specific PTM sites (often in disordered regions) in 5515 phosphoproteins are predicted. 382 Another resource, PhosphoNET, provides predictions of potential kinases for over 650 000 putative phosphosites. 383 Extending these approaches to other post-translational modifications is an area of intense research, and a number of such PTM site prediction programs currently exist, 384 although linking the PTM sites to the modifying enzymes remains to be addressed for the other types of modifications. 10.1.3 Molecular Recognition Feature-Based Approaches Two important methods exist for identifying novel binding modules in IDRs based on the concept of molecular recognition features (MoRFs). MoRFpred predicts sequences that undergo disorder-to-order transitions of all types of MoRFs (α, β, coil, and complex) using a combination of sequence alignment and machine learning predictions based on amino acid properties, predicted disorder, B-factors, and solvent accessibility. 385 ANCHOR also predicts parts of disordered regions that are likely to fold upon binding with their interactors, but does so by identifying segments that cannot form enough favorable intrachain interactions to fold on their own and are likely to gain stabilizing energy by interacting with a globular partner protein. 
386,387 An important shortcoming of the MoRF predictions is the difficulty in identifying which of the binding sites are relevant and what their functionality might be. This is primarily because the results are not linked to known MoRF instances with annotated functions, as is the case for linear motifs, and no clues are provided regarding the potential role of a binding site or its interacting partners. The IDEAL database 388 collects verified elements in disordered regions that undergo coupled folding and binding upon interaction (Box 1). The careful annotation of well-described MoRFs in terms of their sequence propensities or interaction interfaces as well as their known binding partners, and integration of these annotations with MoRF predictions, would likely improve the use of these predictions for gaining insight into IDR functionality. 10.1.4 Intrinsically Disordered Domain-Based Approaches Few attempts have been made to systematically annotate protein domains that are largely made up of intrinsic disorder. Pfam 22 models are able to predict several intrinsically disordered domains (e.g., KID, WH2, RPEL, and BH3 domains). However, this seems to be a simple consequence of the fact that these disordered domains can be described and detected by sequence profiles, rather than an effort directed at annotating long IDRs. ELM 125 has also annotated a small number of long disordered domains, such as the WH2 motif; however, the main focus of the database remains on short motifs. Finally, some of the IDRs that are present in annotated domains are in fact MoRFs or linear motifs, and linear motifs also frequently fold upon binding like MoRFs, underscoring the underlying connections between linear motifs, MoRFs, and IDDs as functional elements (see section 3.4). 10.1.5 Other Approaches Only a few IDR classifications that are not based on linear motifs, MoRFs, or IDDs have so far been exploited for function prediction. 
FFPred is a correlation-based approach that uses the length and position of IDRs along a sequence (see sections 5.5 and 5.6), among other general protein features, to predict the function of the protein in terms of gene ontology categories (molecular activities and biological processes). 211,389−391 The DisProt database of protein disorder 203 (Box 1) lists functions of individual disordered regions, when known from experiments; its major limitation is the small number of regions for which the exact function has been characterized. The Database of Disordered Protein Prediction (D2P2) 49 (Box 1) stores predictions of IDRs in whole genomes, which together with information on MoRFs, PTM sites, and domains can be used to obtain insight into the possible function of the IDR and the protein containing it. 10.2 Requirement for Annotation Future effort in the classification of IDRs and IDPs must be directed at annotation. Substantiating classes with more examples will lead to refinement of their function descriptions and will likely reveal inaccuracies in existing classification schemes. For example, there are only a limited number of well-characterized examples of proteins that contain the evolutionarily flexible (e.g., RPA70 and Sky1) or constrained types of disorder (Rpl5 and Hsp90). The same is true for the different classes of dynamic disorder in protein complexes, although efforts are ongoing there. 176 In terms of the functional features of IDRs, there is a need for annotating MoRFs and longer disordered binding regions as described in the previous section. Efforts directed at short linear motifs have been very successful, but only a small fraction of the potentially thousands of motifs 392 have been annotated. Pfam contains almost 15 000 curated protein families, 22 while ELM contains fewer than 200 motif classes, 125 suggesting that significant numbers of functional features are still to be identified and further annotation is required. 
High-quality resources that collect all of the experimentally validated functional regions of intrinsically disordered regions will provide a strong basis to map functional features onto novel proteins of unknown function. 10.3 Integration of Methods for Finding IDR and IDP Function The current methods for finding and classifying IDR and IDP function have been successful in the area of their focus. However, not all functional characteristics of disordered regions have been fully exploited, and neither is there a resource that brings all of these aspects together. The combination of multiple categorizations and features of IDRs is likely to provide a better understanding of the functionalities encoded in these regions. A comprehensive IDR function resource should have several aspects. It starts with a reliable consensus disorder prediction for the protein sequence of interest (Box 3), such as available in the D2P2 database (Box 1). 49 Functional features, such as SLiMs (see section 3.1), MoRFs (see section 3.2), and disordered domains (see section 3.3), can then be mapped on every disordered part of the protein. The disorder profile allows for the identification of individual IDRs in the protein, as well as the calculation of disorder properties of the whole protein, such as which disorder predictors support which IDRs (see section 5.2), the overall degree of disorder (see section 5.4), the length of the individual disordered regions (see section 5.5), or the amount of disorder at the termini (see section 5.6). These can be used to assign general function to the proteins, such as gene ontology terms that correlate with these properties. Patterns in amino acid sequence could reveal additional function. For example, the presence of tandem repeats or enrichment in certain amino acids (see sections 5.7 and 7.3) may point toward involvement in certain processes. 
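The disorder-profile properties listed above (individual IDRs, overall degree of disorder, region length, and disorder at the termini) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a list of per-residue disorder scores between 0 and 1 from some predictor, and the 0.5 cutoff, the 30-residue terminal window, and the function names `find_idrs` and `disorder_summary` are hypothetical choices rather than values or interfaces taken from any specific resource.

```python
def find_idrs(scores, threshold=0.5, min_length=1):
    """Return 1-based inclusive (start, end) runs with score >= threshold.

    The threshold and min_length defaults are illustrative assumptions.
    """
    regions = []
    start = None
    for i, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = i  # a new disordered run begins here
        elif start is not None:
            if i - start >= min_length:  # run covers positions start..i-1
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the C-terminus.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


def disorder_summary(scores, threshold=0.5, min_length=1, terminal_window=30):
    """Summarize a disorder profile: regions, overall and terminal disorder."""
    n = len(scores)
    flags = [s >= threshold for s in scores]
    regions = find_idrs(scores, threshold, min_length)
    window = min(terminal_window, n)
    return {
        "fraction_disordered": sum(flags) / n if n else 0.0,
        "regions": regions,
        "longest_region": max((e - s + 1 for s, e in regions), default=0),
        "n_terminal_fraction": sum(flags[:window]) / window if window else 0.0,
        "c_terminal_fraction": sum(flags[n - window:]) / window if window else 0.0,
    }
```

In practice the scores would come from a consensus of several predictors, and any correlation between these summary values and function would still have to be established separately.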
The overall sequence composition and the distribution of charges (see section 5.1) could indicate the solubility of a polypeptide chain (see section 9.1) and conformational properties such as the degree of compaction (see section 4). The combination of sequence complexity and disorder propensity could suggest function as well (see section 5.3). Integration of other types of information will determine what classifications can additionally be used. Addition of domain information, such as Pfam, can provide insight into the role of disordered segments that are commonly associated with specific structured domains (see section 3.3). Protein–protein interactions and structures of protein complexes could indicate interacting partners of IDR binding elements and the mode of interaction (see section 6). Information about sequence conservation (see section 7.1) is another important aspect and could provide clues about evolutionarily constrained or flexible types of disorder, which are implicated in different types of functions. Knowledge of the origin of a disordered region in evolution or the species containing the protein sequence of interest suggests possible functions as well (see section 7.2). Furthermore, data describing regulatory properties such as gene expression levels (see section 8.1), alternative splicing (see section 8.2), and degradation kinetics (see section 8.3) could implicate IDRs in regulating protein availability and may suggest or reject roles as interaction hubs, for example. Finally, biophysical properties of the protein, such as the potential of multivalent elements to undergo phase transitions (see section 9.2) and occurrence inside or outside the cell (see sections 8.4 and 9.3), may suggest involvement in the spatiotemporal organization of (extra)cellular assemblies. 
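As one concrete illustration of the sequence-derived features mentioned above, the charge composition of a region can be computed directly from its amino acid sequence. The Python sketch below is a simplification under stated assumptions: Lys and Arg are counted as positive, Asp and Glu as negative, His is ignored, and the function name `charge_features` is hypothetical. It reports composition only and says nothing about how the charges are patterned along the chain, which the text identifies as a separate influence on compaction.

```python
# Assumed charge assignment at near-neutral pH; His is deliberately ignored.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_features(sequence):
    """Return simple charge-composition fractions for a protein sequence.

    f_plus / f_minus: fraction of positive / negative residues
    fcr:  fraction of charged residues (f_plus + f_minus)
    ncpr: net charge per residue (f_plus - f_minus)
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    pos = sum(aa in POSITIVE for aa in seq)
    neg = sum(aa in NEGATIVE for aa in seq)
    return {
        "f_plus": pos / n,
        "f_minus": neg / n,
        "fcr": (pos + neg) / n,
        "ncpr": (pos - neg) / n,
    }
```

Two sequences with identical values here can still differ in how their charges are segregated or mixed, so these numbers are at most one input among the several features discussed in this section.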
The hypothetical resource might be able to suggest function for some of the following examples, although it is clear that in other cases the biology will be too complicated and the outlook of function prediction as described here will be unrealistic. Therefore, the following examples should at this point be considered as speculative. A long (more than 30 residues) IDR that shows signs of evolutionarily flexible disorder and contains no short motifs or other predicted binding regions could be a flexible linker between domains or an entropic chain. A region containing a PxxPx[KR] motif flanked by evolutionarily flexible disorder that is likely to retain an open conformation in the unbound form (based on the primary structure) probably binds a class II SH3 domain, and might be involved in transcription processes if the IDR constitutes the C-terminus of a protein with an otherwise small degree of disorder. Long IDRs that are encoded by alternatively spliced exons and have several nonoverlapping functional motifs and MoRFs might be part of signaling hubs or assemble multiprotein complexes, the type of which might be inferred from the combination of binding sites present. A constitutively expressed, largely disordered IDP with an amino acid composition promoting intrinsic coil conformations and conservation of both primary and disorder sequence is likely to be a ribosomal protein or part of another rigid multisubunit complex. It is clear that some classifications will provide more useful and direct information about function than others. Some classifications have been proposed to contrast IDPs with structured proteins, which does not necessarily make them useful for a detailed description of disorder function per se. Others have limited use for prediction because they are conceptual only, or because of overlap in the properties they describe with other schemes. Moreover, not all approaches can realistically be incorporated in a tool. 
Binding functionality and sequence-based predictions will generally be possible, but predictions based on other types of data may be harder. For example, assignment of evolutionarily constrained or flexible disorder requires automatic alignment of amino acid and disorder sequences, while gene expression subtypes can be derived from the wealth of microarray and RNA sequencing data. Various types of information are already brought together in the D2P2 database, 49 which contains information on disordered regions, MoRFs, PTM sites, and structured domains, and in ELM, 125 which shows information on linear motifs, disorder, phosphorylation, domains, protein–protein interactions, and secondary structure. Further extension of resources like these, with information on both structured and disordered regions, holds great promise toward creating a comprehensive overview of the functional elements and properties of a protein. 10.4 Future Directions A major area of improvement in the description of disordered protein regions pertains to their dynamic behavior. 172,178 IDRs fluctuate rapidly over an ensemble of heterogeneous conformations (see section 4.2), the relative free energies and propensities of which are determined by the amino acid sequence (see section 5.1). The relationship between sequence and structural ensemble is important because it describes what part of the time the chain is in a compact state, and what part of the time it is more accessible. Knowledge about these structural subtypes and about how sequence contexts and chemical modifications of the chain (e.g., by PTMs) modulate the structural ensemble is vital for the correct description of IDR behavior and has direct implications for the functional roles such regions can have in the cell. 157 Classical methods are not optimally designed to take structural dynamics into account. 
For example, current disorder prediction technology is successful at distinguishing sequence stretches that are likely to be disordered versus those that are likely to be part of autonomously folded domains, resulting in a binary verdict (disordered versus structured) within a certain confidence limit (Box 3). Although predicted disordered regions correlate well with experimentally determined backbone dynamics, 393 detailed prediction of conformational subtypes requires a more sophisticated description of disorder. A recent method for the prediction of protein backbone dynamics, trained based on order parameters estimated from experimental chemical shifts, is not only capable of distinguishing different structural organizations with varying degrees of flexibility, such as folded domains, disordered linkers, molten globules, and MoRFs, but regions that are predicted to be dynamic also correspond well with conventional predictions of IDRs. 394 Furthermore, high-throughput atomistic simulations of sequence ensembles can provide information about the degree of conformational heterogeneity, 395 which can be quantified by various parameters, such as an information theory measure 396 or an order parameter-like measure. 397 One could imagine a multiple-component scheme describing structural and dynamic characteristics that would assign, for example, residues in a random coil small values for the fractional population of secondary structure, a large value for spatial fluctuations, a fast interconversion rate, and large values for structural heterogeneity. Conversely, molten globule residues would be assigned a relatively large value for the fractional population of secondary structure, a smaller value for spatial fluctuations and structural heterogeneity, and a slower interconversion rate. Progress in the objective description of conformational ensembles will likely require development of novel structural classifications. 
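The multiple-component scheme imagined above could be sketched as a per-residue record. Everything in this sketch is an assumption made for illustration: the class name `ResidueDescriptor`, its field names, and the numerical values are hypothetical. Normalized Shannon entropy over state populations is used as a simple stand-in for a heterogeneity measure. It is not claimed to be the measure used in the cited work.

```python
import math
from dataclasses import dataclass

@dataclass
class ResidueDescriptor:
    """Hypothetical per-residue record for a multiple-component scheme."""
    secondary_structure_fraction: float  # fractional population of secondary structure, 0..1
    spatial_fluctuation: float           # spatial fluctuation, arbitrary units
    interconversion_rate: float          # exchanges between states per unit time
    heterogeneity: float                 # 0 (single state) .. 1 (all states equally populated)

def normalized_entropy(populations):
    """Shannon entropy of state populations, scaled to the range [0, 1]."""
    total = sum(populations)
    if total <= 0 or len(populations) < 2:
        return 0.0
    probs = [p / total for p in populations if p > 0]
    entropy = sum(-p * math.log(p) for p in probs)
    return entropy / math.log(len(populations))

# Invented values that only mirror the qualitative contrast described in the text.
random_coil = ResidueDescriptor(0.05, 8.0, 1e9, normalized_entropy([25, 25, 25, 25]))
molten_globule = ResidueDescriptor(0.60, 3.0, 1e6, normalized_entropy([70, 20, 5, 5]))
```

With these toy numbers the random coil residue has the lower secondary structure fraction and the higher fluctuation, interconversion rate, and heterogeneity, matching the verbal description. Real values would have to come from experiments or simulations.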
Such efforts will be greatly encouraged by the new pE-DB database of structural ensembles (Box 1). 398 There is considerable room for growth at the interface between atomistic simulations, physical theories, machine learning methods, and experiments, to enable the unmasking of the connection between disorder dynamics and molecular and system level functions of IDRs and IDPs. Full understanding of the cellular functions of IDPs will also require knowledge of their abundance, their interactions, and their physical state in the physiological context. Are IDPs always bound to target proteins, are they chaperoned, or are there pools of unbound IDPs? Answers to these questions will vary among different IDPs and will depend on the exact context in the cell. However, the discovery of features that can help classify and categorize IDRs in terms of their cellular status will lead to more insights into their function. For example, entropic chains may mostly be disordered even in the cell, whereas effectors and assemblers may mostly be associated with other proteins in folded conformations and exchange binding partners by competition rather than by dissociation to the free, disordered state. Scavengers likely populate both disordered and ordered states, depending on whether or not their ligand is bound. Thus, investigations of the in-cell status of IDPs 399 will be crucial toward understanding their biological roles.

11 Conclusion

Finally, we would like to stress that it is not all about intrinsic disorder. This Review has focused on classifications for intrinsically disordered regions and proteins, because function annotation for these regions is lagging behind annotation of structured regions. However, proteins are modular, and their functional regions can be structured or disordered, or somewhere in between. The synergy between these fundamental building blocks of proteins leads to combinatorial diversity of function.
Therefore, understanding how structure and disorder work together will be crucial for uncovering the full extent of protein function.

Box 1 Databases of Intrinsically Disordered Regions and Proteins

Several resources exist that collect experimental or computational information on disordered regions in proteins.

The Database of Protein Disorder (DisProt, http://www.disprot.org/) was developed to facilitate research on protein disorder by organizing the rapidly increasing knowledge about the experimental characterization and the functionalities of IDRs and IDPs. 203,400 The database includes the location of the experimentally determined disordered region(s) in a protein and the methods used for disorder characterization. Additionally, where known, entries list the biological function of an IDR and how it performs this function. As of the latest release (6.02, May 24, 2013), DisProt contained 694 IDP entries and 1 539 IDRs.

The IDEAL database (http://www.ideal.force.cs.is.nagoya-u.ac.jp/IDEAL/) also collects annotations of experimentally verified IDPs. 388 This database focuses on regions that undergo coupled folding and binding upon interaction with other proteins (regions for which there is evidence for both a disordered isolated state and an ordered bound state), such as MoRFs and certain linear motifs (see section 3). It also suggests putative sequences for which there is only evidence of an ordered bound state, but that are thought to undergo induced folding based on, for example, the presence of a verified folding-upon-binding element in a homologue. The latest version (30 August 2013) contained 340 proteins with annotated IDRs of which 148 contain verified or putative elements that undergo folding upon binding.

MobiDB (http://mobidb.bio.unipd.it/) collects experimental data on IDRs from DisProt, 203 IDEAL, 388 and the Protein Data Bank 147 (missing residues in crystal structures and structurally mobile regions in NMR ensembles).
401 It also stores disorder prediction data from three methods. The total of disorder information is summarized in a weighted consensus. The latest version (1.2.1, August 28, 2012) contained 26 933 proteins for which there is experimental data on the presence or absence of disorder and disorder predictions for 4 662 776 proteins from 297 proteomes.

pE-DB (http://pedb.vib.be/) is the first database for the deposition of structural ensembles (see section 4.2) of intrinsically disordered proteins. 398 Entries contain the primary experimental data (mainly NMR and SAXS, Box 2), the algorithms used in their calculation, and the coordinates of the structural ensembles, which are provided as a set of models in Protein Data Bank 147 format. Development of pE-DB is intended to support the evolution of new methodologies for the structural descriptions of the disordered state. pE-DB stored 45 ensembles in 10 entries as of 17 January 2014.

Finally, the Database of Disordered Protein Prediction (D2P2, http://d2p2.pro/) stores disorder predictions (Box 3) made by nine different predictors for proteins from completely sequenced genomes. 49 Alongside the disorder predictions, it contains information on MoRFs (ANCHOR 386 ), PTM sites (PhosphoSitePlus 402 ), and domains (SCOP 24 and Pfam 22 ). As of January 2014, D2P2 contained disorder predictions for 10 429 761 sequences in 1 765 genomes from 1 256 distinct species.

Box 2 Experimental Characterization of Intrinsically Disordered Regions and Proteins

IDPs and IDRs have been studied using a variety of experimental techniques, including NMR, SAXS, and smFRET. Nuclear magnetic resonance (NMR) spectroscopy is the key method to characterize protein disorder, due to its ability to provide residue-level information on protein structure and dynamics in solution. 403 Many aspects of structural disorder can be detected directly using NMR, including local disorder, folding upon binding, and disorder in complex.
In contrast to NMR methods, detection of disorder using X-ray crystallography techniques is mainly indirect as it relies on missing electron density. 32 Another powerful method for detecting and characterizing IDPs is small-angle X-ray scattering (SAXS), which assesses protein dimensions and shape by measuring the scattered X-ray intensity caused by a sample. SAXS can be used to determine hydrodynamic parameters and the degree of globularity of a protein, which are good indicators to determine whether a protein is compact or unfolded. 183,404 Single-molecule methods are also emerging for the study of structural disorder. 179−182 These techniques minimize averaging over the heterogeneous ensembles of conformations in which disordered proteins naturally exist and thus are able to measure dynamics of individual molecules. For example, single-molecule fluorescence resonance energy transfer (smFRET) can measure dynamics and individual conformations of the unbound ensemble, intermediates during induced folding, and internal friction in the folding process. 180−182 Atomic force microscopy (AFM) is also useful for the characterization of the conformational heterogeneity of single proteins. 182 High-throughput proteomic approaches are mainly used to identify IDPs. These techniques enrich cellular extracts for disordered proteins, and then separate structured from disordered proteins, followed by identification (e.g., by mass spectrometry). For example, heat treatment enriches cell extracts for IDPs and depletes for proteins containing folded domains (see section 9.1). 209 IDPs can also be identified on the basis of their susceptibility to degradation by the 20S proteasome under conditions in which structured proteins are resistant (see section 8.3). 332 The degradation assays can be used to identify binding partners of IDPs that provide protection against degradation. 
Finally, computational techniques such as molecular dynamics (MD) simulations complement experimental approaches and provide important insights into IDP behavior. 196,405 The DisProt, IDEAL, MobiDB, and pE-DB databases collect experimentally verified disordered regions and proteins (Box 1).

Box 3 Prediction of Intrinsically Disordered Regions and Proteins

Predicting disordered regions from amino acid sequence allows the analysis of disordered proteins at a genome-wide scale and provides initial hypotheses about the presence of structural disorder in individual proteins. 38,406 A large number of prediction methods have been developed and are regularly benchmarked as part of the Critical Assessment of Techniques for Protein Structure Prediction (CASP). 407,408 Excellent overviews of disorder prediction methods are given elsewhere, 406,409,410 and nonexhaustive lists of publicly available prediction software and webservers can be found at http://en.wikipedia.org/wiki/List_of_disorder_prediction_software and http://www.disprot.org/predictors.php. Three general prediction strategies currently exist:

• Disorder prediction based directly on sequence properties. For instance, IUPred is a physicochemical sequence-based method that estimates residue interaction energies. 411 Sequences with lower predicted pairwise interaction energies are considered more likely to be disordered due to a lack of stabilizing contacts. Similarly, FoldIndex considers weakly hydrophobic regions of high net charge. Such regions are likely to be disordered due to their low energy benefit when adopting a compact conformation. 31,412
• Machine learning is used in the majority of predictors, for example, by using unresolved residues in X-ray structures as a training set. 410 For example, DISOPRED2 uses linear support vector machines (SVMs) trained on PSI-BLAST sequence profiles surrounding unresolved residues. 35 Similarly, PONDR XL1 employs a feed-forward neural network trained on sequence attributes found associated with unresolved residues. 271
• Meta-predictors that combine several individually successful disorder prediction methods have been developed more recently, resulting in increases in prediction accuracy. 407 For instance, metaPrDOS 413 and MFDp 414 both apply SVM-based machine learning to the results of a number of individual prediction methods to arrive at a final score. Similarly, the MobiDB 401 and D2P2 databases 49 (Box 1) provide a consensus overview of several independent prediction methods.

Curated databases containing experimentally determined disordered regions, such as DisProt 203 and IDEAL 388 (Box 1), provide a gold standard for assessing disorder prediction methods. Overall, the quality of the predictions appears to have reached a reasonable plateau of accuracy, with modest recent progress. 407,408 Additional data on biologically relevant long disordered regions may lead to future improvements in predicting IDRs and IDPs. 408

Box 4 Evolution of Intrinsically Disordered Regions and Proteins

IDRs generally evolve faster than their structured counterparts. 51−56,107 However, comparison of the rates of evolution of structured and disordered regions in 26 protein families has shown that this is not always the case. 51 To get more insight into the evolution of disordered regions, we predicted disorder in the human proteome using MULTICOM-REFINE. 415 We integrated the disorder status of the protein residues with their evolutionary rates across multiple sequence alignments of homologous proteins from 53 (mostly vertebrate) species in Ensembl Compara, 1 calculated using the Rate4Site program. 416 As observed previously, 417 protein residues that are predicted to be disordered generally evolve more quickly (i.e., have much higher evolutionary rates) than those in structured regions (Figure Box 4, P value [...]).

Figure Box 4: [...] 1.5 times the interquartile range from the median. Outliers are not shown for visual clarity.
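The kind of comparison described in Box 4 can be sketched as follows, assuming per-residue evolutionary rates and a parallel list of predicted disorder labels are already available. The function names and the toy values are hypothetical and are not data from the analysis above. The pairwise fraction computed here is a simple descriptive effect size, not the statistical test behind the reported P value.

```python
from statistics import quantiles

def summarize(rates):
    """Quartiles and interquartile range of per-residue evolutionary rates."""
    q1, q2, q3 = quantiles(rates, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}

def split_by_disorder(rates, disordered):
    """Partition rates using a parallel list of booleans (True = predicted disordered)."""
    dis = [r for r, d in zip(rates, disordered) if d]
    struct = [r for r, d in zip(rates, disordered) if not d]
    return dis, struct

def prob_disordered_faster(dis, struct):
    """Fraction of (disordered, structured) residue pairs in which the disordered
    residue has the higher rate; ties count as one half."""
    wins = sum((d > s) + 0.5 * (d == s) for d in dis for s in struct)
    return wins / (len(dis) * len(struct))

# Toy values invented for illustration only.
rates = [0.2, 0.4, 0.5, 0.9, 1.1, 1.6, 2.0, 2.4]
disordered = [False, False, False, True, False, True, True, True]
dis, struct = split_by_disorder(rates, disordered)
effect = prob_disordered_faster(dis, struct)
```

A value of `effect` near 0.5 would indicate no difference between the two groups, while values approaching 1 indicate that disordered residues tend to have the higher rates.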