Introduction Integral membrane proteins are encoded by ∼30% of the genes in most genomes and perform numerous biological processes from signaling to transport (Almén et al., 2009; Stevens and Arkin, 2000). There are many indications that the activity of such proteins can be affected by physical properties of the lipid bilayer such as lipid order and hydrophobic thickness (Andersen and Koeppe, 2007; Bondar et al., 2009; Nyholm et al., 2007; Phillips et al., 2009). There is also considerable interest in the possibility that local differences in the physical properties of membranes could contribute to the lateral segregation of proteins during sorting or signaling (Bretscher and Munro, 1993; Dukhovny et al., 2009; Patterson et al., 2008; Ronchi et al., 2008; Simons and Ikonen, 1997). Determining the biological significance of such processes in eukaryotes is contingent on understanding the properties of the different bilayers of the cell. Organelle membranes vary in both their protein and lipid content, and even within one membrane the lipid composition of the two leaflets of the bilayer can be very different (van Meer et al., 2008). For instance, sterols and sphingolipids are scarce in the ER but abundant and asymmetrically distributed in the plasma membrane. These lipids differ from typical phospholipids in that sphingolipids are characterized by saturated acyl chains, and sterols by an inflexible core formed by four fused rings. In artificial liposomes the degree of acyl chain saturation and the levels of sterols affect such physical properties of the bilayer as thickness, order and viscosity (Brown and London, 1998). However, what effect they have at physiological levels in heterogeneous, protein-containing biological membranes is unclear. Most integral membrane proteins contain α-helical transmembrane domains (TMDs) that span the hydrophobic core of the lipid bilayer (Killian and von Heijne, 2000; White and Wimley, 1999). 
The primary constraint on all TMDs that enter the secretory pathway is that they must partition out of the Sec61 translocon into the membrane of the ER during synthesis. TMDs are greatly enriched in aliphatic hydrophobic residues, and these residues promote partitioning out of the translocon (Hessa et al., 2005, 2007; Killian and von Heijne, 2000). However, the physical properties of the bilayer in which a protein will eventually reside should also impose constraints upon the sequence of its TMD. Previous studies comparing the TMDs of Golgi and plasma membrane proteins have suggested a difference in TMD length and hence bilayer thickness (Bretscher and Munro, 1993; Levine et al., 2000). However, the full significance of this finding for cellular organization is unclear as the analysis was based on only a small number of proteins and did not include other organelles. Indeed the conclusions have been called into question by attempts to measure bilayer thickness of different compartments (Mitra et al., 2004). To obtain a clear picture of organelle-specific constraints on TMDs, we have made use of the recent increase in available genome sequences to perform a comprehensive comparison of a large number of membrane proteins with a single TMD from the major secretory organelles from both fungi and vertebrates. Our findings validate the previous suggestions of a difference in TMD length between Golgi and plasma membrane and extend this to reveal an apparent step-change in bilayer thickness that occurs in the secretory pathway at the trans side of the Golgi. We also find that the TMDs of proteins from post-ER organelles show striking variations in amino acid composition across the bilayer. This results in an asymmetry in residue composition that is linked to residue volume and correlates with changes in lipid asymmetry. 
Thus, eukaryotic TMDs are not a single type of entity but vary in a manner that implies that there are clear differences in the physical properties of the bilayers of the secretory pathway. Results Computational Analysis of Fungal and Vertebrate Transmembrane Sequences from Distinct Subcellular Locations To reliably compare TMDs that span different membranes, we curated a dataset of proteins with an experimentally determined topology and location and a single TMD (bitopic proteins, Figure 1A). Bitopic proteins represent ∼40% of all membrane proteins in eukaryotic genomes, and their TMDs are those likely to have the most residues exposed to the lipid bilayer (Almén et al., 2009; Krogh et al., 2001). We assembled datasets of all single TMD proteins from what are probably the best characterized eukaryotic genomes, Saccharomyces cerevisiae and Homo sapiens. We then used literature searches and cross-referencing between databases to identify those proteins with a known organelle of residence and topology (Table S1 and Table S2). For the Golgi apparatus we pooled all the proteins from the various cisternae of the Golgi stack into a single “Golgi” set, with a separate set for those proteins that cycle between the trans-Golgi network (TGN) and endosomes. Only a few mammalian Golgi proteins have been accurately located within the Golgi stack, but for yeast, where this is more easily done, we found that the proteins of the early part of the stack were strikingly similar in TMD properties to those from the later part of the stack (see below), indicating that this pooling probably does not mask significant complexity. Selecting only those proteins with a known location and topology inevitably reduced the size of the datasets, and so to expand the number of sequences available for analysis, we used BLAST searches to collect the orthologous proteins from all other complete fungal and vertebrate genomes. 
The topology and subcellular location of orthologs were assumed to be the same as for the reference protein. Many of their functions are highly organelle specific, and a global comparison of protein localization in the distantly related yeasts S. cerevisiae and Schizosaccharomyces pombe found the subcellular distributions of orthologs to be very similar (Matsuyama et al., 2006). The inclusion of orthologs significantly expanded our datasets, but this would be of little value if the proteins were very similar to the reference sequence. Thus the proteins from each organelle set were redundancy reduced by using BLASTClust to cluster them based on sequence similarity in their TMD and flanking sequences, and then we removed any with greater than 30% identity over this region (Altschul et al., 1997). Figure 1B summarizes the strategy used, and the numbers of proteins used for the analysis are provided in Figure 1C. Alignment of TMDs Based on Their Cytosolic Ends To compare the TMDs from different organelles, their sequences were aligned using the cytosolic ends of their hydrophobic cores. Initially, TMDs were located in the reference proteins using the TMHMM prediction algorithm (Krogh et al., 2001), and the orthologs were then aligned with the reference protein in order to assign their TMD positions. There is no established computational method for defining the ends of the part of a protein that spans the bilayer. Thus we implemented a scanning algorithm, which uses a sliding window and a threshold based on hydrophobicity. For this and subsequent analyses we used the hydrophobicity scale of Goldman, Engelman, and Steitz (GES) as it is designed for single-pass transmembrane helices and out-performs other scales in TMD prediction (Engelman et al., 1986; Koehler et al., 2009). 
However, to ensure that our findings were not dependent on this choice we also performed parallel analyses with the Wimley-White scale and the recently reported Biological scale from Hessa and coworkers, which is based on a completely different method (Hessa et al., 2007; Wimley and White, 1996). There is of course some flexibility in how charged residues are positioned at a bilayer interface, but by applying the same objective method to all organelles we should avoid bias in how TMD ends are assigned for the different datasets. The scanning algorithm enabled us to align proteins from an organelle set at the position where a sharp change in hydropathy occurred, and the cytosolic end of the hydrophobic region was defined as position one. For all our analyses the hydrophobic spans were aligned with respect to their bilayer orientation, i.e. from the cytosolic side to the exoplasmic side (Figure 1A), rather than from N terminus to C terminus. We wanted to determine if residue preferences were influenced by position in the bilayer, which would be missed if all proteins (type I/III and type II) were simply analyzed from N to C terminus. In addition, the “positive-inside rule” indicates that the cytosolic flanking regions of TMDs are generally enriched in positively charged residues, thus allowing a clear definition of the cytosolic edges of hydrophobic spans (Nilsson et al., 2005). TMDs from Different Organelles Exhibit Compositional Differences Using the aligned sets of proteins, the frequency of each amino acid at each position through the hydrophobic region was calculated and plotted as matrices for fungi and vertebrates (Figures 2A and 2B, numerical values in Table S3 and Table S4). The residue preferences typically show a cluster of basic residues on the cytosolic side, followed as expected by the run of mostly aliphatic hydrophobic residues that spans the hydrophobic core of the bilayer. 
However, the matrices also reveal striking compositional differences between, and along, the TMDs. For both fungi and vertebrates, the regions enriched in hydrophobic residues are shorter for the ER and Golgi proteins than for plasma membrane proteins, indicating a difference in TMD length. In addition, the different hydrophobic residues were not uniformly distributed through the hydrophobic TMD core. For example valine shows a clear enrichment in the exoplasmic side of the plasma membrane set in both vertebrates and fungi (Figure 2). To determine the extent and significance of such trends, we analyzed in more detail the changes in residue property and type through the bilayer. Hydrophobic Lengths of TMDs Differ along the Secretory Pathway in Fungi and Vertebrates To quantify trends in hydropathy, the mean hydrophobicity over all the sequences in each dataset was plotted relative to residue position. As noted above, the hydropathy plots for the fungal proteins from the early Golgi and late Golgi were found to be very similar, and so the datasets were combined to form a “Golgi” set (Figure S1 available online). For both fungi and vertebrates, the plasma membrane TMDs were on average hydrophobic for a greater length than those of the ER and Golgi (Figures 3A and 3B). For fungi the hydrophobicity values of the Golgi and plasma membrane TMDs were highly significantly different between positions 16 and 24 (p 70%), which may reflect the multitude of factors involved in the recycling and localization of SNAREs, and the TMDs potentially having a role in SNARE function (Stein et al., 2009). However, when the sequences of the SNAREs were reversed, and hence the orientation of their TMDs with respect to the bilayer, there was no particular trend or accuracy in the prediction (22% correct, Figure 6E). 
Thus, despite the SNAREs all sharing a common general function, there are constraints imposed on the sequences of SNARE TMDs that are shared with the TMDs of unrelated proteins from the same organelle, and the asymmetry of these constraints is a major feature detected by the neural network. Discussion The analysis described here is, to the best of our knowledge, the first report of a comprehensive comparison of TMDs from all the major compartments of the eukaryotic secretory pathway. We find overwhelming evidence that there is not a “generic” type of TMD shared by eukaryotic membrane proteins. There are, of course, protein-specific constraints on TMD sequences imposed by the interactions and function of a particular protein. However, it appears that TMDs also vary depending on their organelle of residence in both length and composition. The structural consequences of these compositional differences can be illustrated by modeling the “consensus” TMDs for the fungal Golgi and plasma membrane (Figure 7A). These organelle-specific trends have obvious implications for improving the prediction of TMD presence and topology, as TMD features recognized by prediction algorithms will, in part, reflect the localization of the membrane proteins used to train the algorithm. Our observations also have implications for how and why the different bilayers of the cell vary in their physical properties. The TMDs from the plasma membrane proteins of both fungi and vertebrates are longer than those from the proteins of internal membranes, even though the two sets of plasma membrane proteins are otherwise unrelated by sequence or function. This length difference was suggested by previous analyses of much smaller datasets from the plasma membrane and the Golgi (Bretscher and Munro, 1993; Levine et al., 2000) but is unequivocally validated by these much larger datasets. 
In addition, the analysis has now been extended to all of the secretory pathways of both vertebrates and fungi, revealing that TMD lengths are similarly short in both the ER and Golgi and then increase in compartments beyond the Golgi stack. This difference could reflect a shared tendency for post-Golgi TMDs to tilt in the bilayer of their organelle of residence, but this seems highly implausible, especially as the increased levels of order-inducing lipids in post-Golgi membranes would be expected to discourage tilting (see below). Thus the simplest explanation of the difference in TMD length is that for both fungi and vertebrates the plasma membrane is thicker than the membranes of the ER and Golgi. The length of an α helix increases by 1.5 Å per residue, and so these differences in TMD length would equate to an increase in bilayer thickness of ∼12 Å (42%) from Golgi to plasma membrane in fungi and ∼6 Å for vertebrates. Although the trend for longer TMDs in post-Golgi compartments is broadly similar in fungi and vertebrates, there also appear to be some differences. The TMD lengths imply that the fungal plasma membrane is even thicker than that of vertebrates, and also the membranes of the TGN/endosomal system are similar in thickness to the plasma membrane in vertebrates, but in fungi their thickness is intermediate between those of the Golgi and plasma membrane. The TGN/endosomal route is followed by proteins taken in from the plasma membrane or traveling from the Golgi to the vacuole or lysosome (Bonifacino and Traub, 2003; Bowers and Stevens, 2005). We did not include these lytic compartments in the analysis above because only a few bitopic proteins are known for each. However, when the methods used above are applied to these small datasets, the vertebrate lysosomal proteins appear similar to plasma membrane proteins, with longer TMDs and a preference for smaller residues in the exoplasmic half of the bilayer (Figure S4). 
In contrast, the fungal vacuolar proteins have shorter TMD lengths and an increased abundance of bulky aromatic residues compared to lysosomal TMDs (Figure S4). These differences cannot be viewed as definitive given the small numbers of reference proteins, but they are at least consistent with all post-Golgi membranes in vertebrates being equally thickened compared to the Golgi and ER, whereas in fungi the plasma membrane is particularly thick and the other post-Golgi membranes are intermediate in thickness compared to the Golgi. The thickness of a fluid lipid bilayer has been shown to depend on acyl chain length and the presence of lipids such as sterols or sphingolipids (Brown and London, 1998; Lewis and Engelman, 1983). Sterols are rigid and sphingolipids have saturated acyl chains, and so both increase acyl chain order and thus thicken the bilayer and reduce permeability to solutes. The plasma membranes of fungi and mammals are enriched in sterols and sphingolipids compared to the ER and Golgi, which would be consistent with an increase in bilayer thickness (Holthuis et al., 2001). Sphingolipids are synthesized in the exoplasmic leaflet of the trans-Golgi from where they move, via mechanisms that are not understood, up a concentration gradient into post-Golgi compartments (Holthuis and Levine, 2005; Klemm et al., 2009; Tafesse et al., 2006; van Meer, 1989). The vacuole and endosomes of S. cerevisiae have relatively low levels of sterols and sphingolipids compared to the fungal plasma membrane or vertebrate lysosomes, which would be consistent with the apparent differences in the bilayer thickness between these organelles (Klemm et al., 2009; Schneiter et al., 1999). In contrast, when we compared the TMDs of proteins that reside in the apical or basolateral domains of epithelial cells, we did not find a clear difference in hydrophobic length or trends in residue volume (Figure 3 and data not shown). 
There have been suggestions that TMD:lipid interactions could contribute to sorting of proteins to the apical surface (Simons and van Meer, 1988), but we are not aware of any previous report of a comparison of the TMDs from the two sets of proteins. The lack of apparent difference in TMD length may reflect the relatively small number of reference proteins, and indeed Mitra and coworkers have used low-angle X-ray scattering to measure the thickness of membranes of polarized hepatocytes and reported that the apical membrane was 3–5 Å thicker than the Golgi and ER, but the basolateral membrane was, if anything, thinner (Mitra et al., 2004). However, it should be noted that although X-ray scattering is an interesting approach, the method requires that organelles are isolated from cells, treated with carbonate to rupture them, and then treated for several hours with protease. This could perturb aspects of the bilayers and so may not have provided a definitive measure of in vivo properties. Moreover, the protocol used to isolate basolateral membranes removes apical membranes but not all others, with inner mitochondrial membranes alone constituting 22% of the basolateral fraction (Meier et al., 1984). It should also be noted that whereas glycolipids are ∼2-fold more concentrated on the apical surface of many epithelia, the other order-inducing lipids cholesterol and sphingomyelin can be equally distributed, and sphingomyelin even concentrated at the basolateral surface in some cell types (Brasitus and Schachter, 1980; Simons and van Meer, 1988; van IJzendoorn et al., 1997). Further work is clearly needed to understand the different properties of the apical and basolateral surfaces, but at present it seems possible that the major difference in bilayer thickness in epithelial cells could occur between pre- and post-Golgi compartments rather than between apical and basolateral domains. 
In addition to variations in TMD length, we also found an asymmetry in the distribution of residue volume in the plasma membrane TMDs. Extrapolating from studies of bilayer permeability, small and more compact side chains would be expected to fit better into a bilayer that has well-ordered lipid acyl chains (Mathai et al., 2008; Mitragotri et al., 1999). This implies that there is an asymmetry in the state of lipid order in the plasma membrane. Such an asymmetry is more easily accounted for by lipids such as sterols and sphingolipids, which are enriched in one leaflet, than by proteins that span both leaflets. This suggests that lipids contribute, at least in part, to differences in bilayer order between organelles or subdomains. Indeed TMD asymmetry may explain why plasma membrane proteins show a surprising exclusion from “plasma membrane-like” lipid domains in liposomes (Bacia et al., 2004), as liposomes are symmetric and so the residues of the TMD adapted to the cytosolic leaflet would be exposed to a lipid organization that is only experienced in vivo by the outer leaflet residues. The results of our analysis strongly imply that the different bilayers of eukaryotic cells have different physical properties, and these differences seem likely to be, at least in part, imposed by differences in lipid composition. Changes in membrane properties would provide an indication of location that could be used to control the activity of proteins such as channels and transporters as they move through the secretory pathway. However, a striking aspect of the data is how pervasive the differences between TMDs are in the large datasets that we have examined, implying that the TMDs of many of the proteins in a particular compartment share organelle-specific properties. This is perhaps clearest for TMD length in the case of fungi (Figure 3C), but even for vertebrates 92% of the plasma membrane TMDs are longer than the mean length for the Golgi and ER. 
Previous theoretical and experimental work has suggested that integral membrane proteins can influence the organization of the lipids that surround them (Andersen and Koeppe, 2007; Mitra et al., 2004; Mouritsen and Bloom, 1993). In addition, a quantitative analysis of the composition of synaptic vesicles revealed that TMDs account for ∼20% of the area of the membrane, indicating that most lipids are close to proteins, and this very high protein density is unlikely to be unique to this particular membrane (Frick et al., 2007; Takamori et al., 2006). If many of the proteins in the same compartment or forming vesicle share TMD shapes then they could contribute to bilayer properties, and in particular to thickness, if they are at a high enough concentration. Protein clustering in forming vesicles could thus cause local changes in bilayer physical properties, which could result in lipid sorting, especially at the late Golgi where sphingolipids are synthesized and a major transition in bilayer thickness seems to occur (Figure 7B). This means that the answer to the long-standing question of how cells sort lipids to different destinations could be that it is an emergent property of the traffic of membrane proteins that are at a high density and share organelle-specific TMD properties. This need not exclude the resulting protein/lipid microdomains attracting further cargo or excluding residents based on physical properties alone. Determining the relative contributions of proteins and lipids to each other's sorting is likely to be a key issue for future studies of the biogenesis of eukaryotic membranes. Further work will be required to investigate these issues in detail, but irrespective of the outcome of such studies, our analysis clearly shows that eukaryotic TMDs are not a generic entity that is varied solely for protein-specific functions. 
Rather, TMD sequences are optimized for insertion, function, and also the variable and asymmetric physical properties of their bilayers of residence. Experimental Procedures Full methods and associated references are in the Extended Experimental Procedures online. In summary, proteins with a single TMD from S. cerevisiae and H. sapiens were collated from databases. Those with a known location and topology were identified from the literature (Table S1 and Table S2), and their TMDs located with the prediction program TMHMM (Krogh et al., 2001). Orthologs from a further 36 fungi or 12 vertebrates were identified by BLAST searching of RefSeq genomes, and the TMDs in the orthologs were identified by aligning them to the reference sequences. The cytosolic and exoplasmic edges of the TMDs were defined as the point at which the residue hydropathy in a small window sliding out from the middle of the TMD fell below a fixed threshold. For analysis of residue properties all the TMDs were aligned at their cytosolic edges. For type II proteins, residues were thus analyzed starting from the N-terminal end of their TMDs, and for type I and III from the C-terminal end. The resulting datasets were analyzed using custom software with a graphical user interface, and plots of residue properties or abundance can be generated at http://www.tmdsonline.org.
Extended Experimental Procedures Sequence Collation The proteome sequences of the fungi S. cerevisiae, A. capsulatus, A. clavatus, A. fumigatus, A. nidulans, A. niger, A. oryzae, A. terreus, B. fuckeliana, C. albicans, C. cinerea, C. glabrata, C. globosum, C. immitis, C. neoformans, D. hansenii, E. gossypii, G. zeae, K. lactis, K. waltii, L. bicolor, L. elongisporus, M. grisea, M. globosa, N. fischeri, N. crassa, P. anserina, P. guilliermondii, P. nodorum, P. stipitis, S. japonicus, S. kluyveri, S. pombe, S. sclerotiorum, U. maydis, V. polyspora, Y. lipolytica, and the vertebrates H. sapiens, B. taurus, C. familiaris, D. rerio, E. caballus, G. gallus, M. domestica, M. mulatta, M. musculus, O. anatinus, R. norvegicus, S. scrofa, and T. guttata were downloaded from RefSeq (Pruitt et al., 2007). Single-pass proteins from S. cerevisiae and H. sapiens with experimentally determined topologies and locations were identified from literature and database searches (Saccharomyces Genome, TopDB, and LOCATE databases; Sprenger et al., 2008; Tusnády et al., 2008), and grouped by subcellular location (Table S1 and Table S2). Ortholog Identification Orthologs of each of the single-pass proteins in Table S1 and Table S2 were identified using a BLAST (Basic Local Alignment Search Tool) based algorithm (Altschul et al., 1990). For S. cerevisiae proteins the searches were performed against the 36 fungal genomes above. For H. sapiens proteins the searches were performed against the 12 vertebrate genomes listed above. The cut-off stringency for BLAST was E = 10⁻¹⁰. For each protein the best hit from each species was collected if present. Relative TMD positions were obtained by aligning the orthologs to the reference proteins using MUSCLE (Edgar, 2004). Orthologs were filtered for deviation in expected protein length (±100 residues) and TMD hydrophobicity (window size 10, average threshold 0.95 kcal/mol), and duplicated proteins were removed. Redundancy Reduction To ensure that the analysis was not biased by the presence of closely related sequences, the BLASTClust option of the BLAST distribution was used to cluster sequences at 30% identity. The clustering was performed on sequences corresponding to the hydrophobic core of the TMD and 10 residues of flanking sequence from either side. For each organelle, one protein from each cluster was selected at random for the analysis, ensuring that no two proteins had greater than 30% identity in their TMD regions.
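The random choice of one protein per cluster can be illustrated with a minimal Python sketch. It assumes the clustering output is available as text lines, each holding the whitespace-separated identifiers of one cluster; that input layout, the function name `pick_representatives`, and the fixed seed are assumptions made for illustration and are not taken from the original software.

```python
import random


def pick_representatives(cluster_lines, seed=0):
    """Pick one sequence identifier at random from each cluster.

    cluster_lines: iterable of strings, one cluster per line, with
    identifiers separated by whitespace (assumed layout).
    seed: fixed seed so that the selection is reproducible (assumption).
    """
    rng = random.Random(seed)
    representatives = []
    for line in cluster_lines:
        ids = line.split()
        if ids:  # skip empty lines
            representatives.append(rng.choice(ids))
    return representatives


clusters = ["a b c", "d", "", "e f"]
reps = pick_representatives(clusters, seed=1)
```

Each organelle set would be processed separately, so that the retained proteins come from distinct clusters within that set.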
Transmembrane Domain End Definition A hydrophobicity scanning algorithm was implemented to identify the point where a sharp change in hydropathy occurs in sequences known to have a TMD. The approximate TMD edges from TMHMM were used as guides (Emanuelsson et al., 2007). The edges were indented by 4 amino acids at one end or the other and then a window of five residues centered on the measured residue was scanned back toward the TMD end. The Goldman-Engelman-Steitz (GES) hydrophobicity scale was used unless stated otherwise (Engelman et al., 1986). Ends were defined by an average hydropathy across the window of more than −0.94 kcal/mol or by an individual residue with a hydropathy of more than 8.0 kcal/mol (D, E, K, or R). For comparison the scanning was also performed using the Biological and Wimley-White scales (Hessa et al., 2007; Wimley and White, 1996). The window threshold was the median value of each scale (0.20 kcal/mol for Biological or −0.50 kcal/mol for Wimley-White), and the individual residue threshold corresponded to the second most hydrophilic value (2.70 kcal/mol for Biological (K, D), or 3.60 kcal/mol for Wimley-White (D, E)). Altering these parameters such that the threshold was set to zero, or the individual residue criteria removed, did not substantially affect the plot profiles or conclusions for any scale (data not shown). For analysis with ZPRED, a stand-alone version of Zpred2 was obtained from Arne Elofsson (Stockholm University) (Papaloukas et al., 2008). To obtain TMD lengths, the number of residues predicted to be within 15 Å of the membrane center was calculated. The analysis could not be performed on whole protein sequences, as the software cannot distinguish signal peptides from TMDs. Instead, FASTA files were created of the TMDs (as calculated above), with 10 flanking residues on either side to eliminate the chance of error from use of the GES scale for TMD end definitions.
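The edge-scanning procedure can be sketched in Python as follows. This is one possible reading of the description above, not the original implementation: the function names, the handling of windows that run off the sequence, and the decision to report the last residue that passed both criteria are assumptions. The GES values are those commonly tabulated for the scale and should be checked against Engelman et al. (1986) before use.

```python
from statistics import mean

# GES hydrophobicity values in kcal/mol as commonly tabulated (ASSUMPTION:
# verify against Engelman et al., 1986, before relying on them).
GES = {
    "F": -3.7, "M": -3.4, "I": -3.1, "L": -2.8, "V": -2.6,
    "C": -2.0, "W": -1.9, "A": -1.6, "T": -1.2, "G": -1.0,
    "S": -0.6, "P": 0.2, "Y": 0.7, "H": 3.0, "Q": 4.1,
    "N": 4.8, "E": 8.2, "K": 8.8, "D": 9.2, "R": 12.3,
}


def scan_edge(seq, start, step, scale=GES, window=5,
              window_threshold=-0.94, residue_threshold=8.0):
    """Scan outward from `start` (step = -1 or +1) and return the index of
    the last residue before the window mean or the residue itself exceeds
    its threshold. If `start` already fails, `start` is returned."""
    half = window // 2
    last_inside = start
    i = start
    while 0 <= i < len(seq):
        # Window centered on residue i, truncated at the sequence ends.
        values = [scale[aa] for aa in seq[max(0, i - half): i + half + 1]]
        if mean(values) > window_threshold or scale[seq[i]] > residue_threshold:
            break
        last_inside = i
        i += step
    return last_inside


def hydrophobic_span(seq, approx_start, approx_end, indent=4):
    """Refine approximate TMD edges (for example from a predictor) by
    indenting each edge into the core and scanning back outward."""
    first = scan_edge(seq, approx_start + indent, -1)
    last = scan_edge(seq, approx_end - indent, +1)
    return first, last


toy = "RRRRR" + "L" * 15 + "DDDDD"   # hypothetical test sequence
span = hydrophobic_span(toy, approx_start=5, approx_end=19)
```

Applying the same function and thresholds to every organelle set is what keeps the edge assignment comparable between datasets; swapping in another scale only requires a different dictionary and thresholds.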
Analysis of Amino Acid Composition, Hydrophobicity, and Residue Volume For each position relative to the aligned cytosolic edge in a protein set, the frequency of each amino acid was calculated and normalized to one. The mean hydrophobicity (kcal/mol; GES scale (Engelman et al., 1986), Biological scale (Hessa et al., 2007), or Wimley-White scale (Wimley and White, 1996)) and amino acid volumes (Å³; Pontius et al., 1996) for each organelle set were calculated for positions along the TMD. These data were then plotted as matrices, bar charts or line graphs within a custom-written graphical user interface in Python. The interface was built using Python graphical libraries developed as part of the CCPN project which are released under the GPL license (Vranken et al., 2005). The t test for two independent samples was used to assess the significance of differences between mean values. To obtain a measure of TMD hydrophobic lengths, the exoplasmic edge was defined as described above for the cytosolic edge, and a frequency distribution of the resulting TMD lengths was determined. TMD Asymmetry Analysis For each protein in an organelle set the hydrophobic length was defined as above. The “inner leaflet” was defined as the cytosolic edge to the midpoint and the “outer leaflet” the midpoint to the extra-cytosolic edge. The abundance of all residues was normalized for each “leaflet.” For each residue type the abundance in the inner leaflet was then subtracted from the abundance in the outer leaflet, and divided by the total abundance. Size Moment To test for the presence of flattened interaction faces on TMD helices a “size moment” for the residues along the TMDs was calculated. This is analogous to the hydrophobic moment described by Eisenberg (Eisenberg et al., 1984), and is designed to measure the circular asymmetry of side chain volume around the helix. Residues in a typical α helix are offset by 100°.
Thus size moments were calculated by defining each residue as a vector with its volume as its length and its angle as n × 100° (where n = 0 for the first residue, 1 for the second etc). These vectors were then summed over a window of seven residues, i.e., 700° or almost two complete turns of the α helix. This window was then scanned along the TMDs, and moments were plotted with respect to the position of the central residue of the window. Artificial Neural Network The inputs for the neural network were derived from the residue composition of the sequences in our data sets. However, the data sets each had different numbers of sequences which could bias the network toward the largest input group. Thus sequences were removed at random from all but the smallest data set so that each set consisted of 99 proteins. The amino acid compositions of short stretches of sequence adjacent to and within the TMDs of the proteins in the equilibrated data sets were encoded into numerical vector inputs. The relative abundance of each of the 20 amino acids in a given sequence region corresponded to an input node. The six sequence regions were (−3–0), (1–4), (9–11), (12–15), (16–17), and (18–24) with the cytosolic TMD edge at position zero. This gave 120 input nodes: 6 regions × 20 amino acids. The neural network was of the feed-forward type with one hidden layer (Me, 2009). Error back-propagation was used to train the neural network. The learning rate was set at 0.01, and there were 100 training cycles. Use of 6 input regions (120 input nodes) and 7 hidden nodes was found to be optimal. For fivefold cross-validation the data sets were randomly partitioned into five subsets, and for each round of testing four subsets were used for training and one was used for testing. Predictive performance was measured using the Matthews correlation coefficient (MCC; Matthews, 1975):

$$\mathrm{MCC} = \frac{T_p T_n - F_p F_n}{\sqrt{(T_p + F_n)(T_p + F_p)(T_n + F_p)(T_n + F_n)}}$$
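As a worked illustration of the Matthews correlation coefficient, a short Python sketch is given here. The function name `mcc` and the convention of returning zero when the denominator vanishes are assumptions made for this example; the counts in the usage lines are invented and are not results from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, which is a common
    convention adopted here as an assumption.
    """
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = mcc(tp=50, tn=50, fp=0, fn=0)      # every prediction correct
chance = mcc(tp=25, tn=25, fp=25, fn=25)     # no better than random
example = mcc(tp=90, tn=80, fp=20, fn=10)    # invented counts
```

Because all four counts enter the formula, the coefficient stays informative when the positive and negative classes differ in size.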
A threshold was set to enable us to classify predictions into true positives ($T_p$), true negatives ($T_n$), false negatives ($F_n$) and false positives ($F_p$). During training this threshold was 0.67. The MCC was used to identify the best neuronal weights to be used in prediction. The MCC is one for a perfect prediction and zero for a random assignment. The MCC, sensitivity and specificity were calculated for all thresholds between zero and one. The mean accuracy of the cross-validation was calculated using the threshold (0.68) with the highest MCC (0.84). The sensitivity was calculated as:

$$\mathrm{Sensitivity} = \frac{T_p}{T_p + F_n}$$

The specificity was calculated as:

$$\mathrm{Specificity} = \frac{T_p}{T_p + F_p}$$

Subcellular Location Prediction The best network from cross-validation testing was trained with the entire size-normalized datasets, and the optimized weights then used to predict the subcellular location of SNARE family proteins, based on their TMD regions. Fungal orthologs of the S. cerevisiae SNAREs were obtained by BLAST searching as described above or using the SNARE database (Kloepper et al., 2007). To test the topology dependence of the neural network, the SNARE TMD sequences were reversed and treated as type III proteins. In both cases, the highest output score was taken to be the prediction. To compare the neural network to existing localization predictors, the S. cerevisiae proteins from the organelle-specific data sets were tested for predicted localization using the available large-scale predictions and web servers for SherLoc (Shatkay et al., 2007), WoLF PSORT (Horton et al., 2007), or Euk-mPLoc (Chou and Shen, 2007), with the predictors being set to search for fungal localizations where possible. For WoLF PSORT and SherLoc the top prediction was counted, while for Euk-mPLoc if the correct localization was present in the prediction it was counted. “Membrane” was assumed to mean plasma membrane.
The Euk-mPLoc server only accepts proteins over 50 amino acids long and so some proteins could not be predicted and were not counted. Prediction accuracy was calculated as the percentage of correct predictions out of the whole dataset used for testing. The accuracy of the neural network was calculated by averaging the performances in the rounds of leave-one-out cross-validation. Consensus TMDs for Structural Representations To represent the differences in the structure of TMDs, “consensus” sequences for fungal organelle sets were generated using the most abundant amino acid at each position in the alignment: Golgi: RRRRRLLLAALLLLLLLLLSSSSS Plasma membrane: KKRRRLFFFLILLLLLLVVVVGVVAAIGGSSGS. Sequences were modeled on an α helix using PyMOL in the surface display mode (DeLano Scientific).
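The consensus rule described above, the most abundant amino acid at each aligned position, can be sketched in Python. The function name `consensus`, the handling of sequences of unequal length, and the tie-breaking behavior (the residue seen first wins) are assumptions for illustration; the three short input sequences are invented and are not taken from the datasets.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Return the most abundant residue at each position of sequences
    aligned at their first residue (for example the cytosolic edge).

    Shorter sequences simply do not contribute beyond their own length;
    ties are resolved in favor of the residue encountered first.
    """
    longest = max(len(s) for s in aligned_seqs)
    result = []
    for pos in range(longest):
        counts = Counter(s[pos] for s in aligned_seqs if len(s) > pos)
        result.append(counts.most_common(1)[0][0])
    return "".join(result)


toy_alignment = ["KLLV", "RLLA", "RLIV"]  # invented example sequences
toy_consensus = consensus(toy_alignment)
```

A consensus built this way summarizes positional preferences only; it is not expected to correspond to any single natural protein.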