Frequency-rank Distributions in Proteomics

This paper analyzes the protein abundances in 8 organisms to determine if they fit any of a number of commonly-seen distributions in frequency-rank analyses, with the intention of drawing analogies between biochemistry and linguistics. The organisms were chosen so as to be representative and come from a wide range of body complexities. Our analysis suggests that while individual organisms fit certain distributions quite well, there is no overarching thread that unifies the protein distributions found across the living world, at least on the scale of individual proteins.


Introduction
Proteins are the basic building blocks of any living organism. This family of macromolecules takes diverse forms to perform a variety of tasks essential for the growth and maintenance of biological beings [Stryer, 1981, Chap. 3]. Various kinds of proteins, all with different functions, are found in varying amounts in an organism. Here we are interested in the distribution of these amounts: how much more common are the abundant proteins than the rare ones?
The way the various proteins perform their functions and give the body structure and meaning is, in a poetic sense, akin to how a book is comprised of letters and words that come together to form something beyond the sum of their parts. Each protein means something to the organism, the way each word means something to a piece of literature. It may be illuminating to take this analogy further, to explore possible similarities between artificial and natural structures by analyzing their statistical properties.
A statistical characterization of natural language was first carried out by Zipf, whose famous law is an inverse power law between the frequency and rank of words used in English literature that was found to fit almost perfectly (Powers [1998]). Since then, the model has been found in many different situations: city populations, internet traffic, and company sizes (Li [2002]). Several explanations have been offered: first-order rank expansions of various natural distributions (Belevitch [1959]), the principle of least effort (i Cancho and Solé [2003]), and preferential attachment (Lin et al. [2015]). However, it may be prudent to explore other, similar distributions in each of these situations rather than settling on a power law; often one may find a better fit (Clauset et al. [2009]).
This literary analogy has been explored (Mantegna et al. [1995], Som et al. [2001]) for the case of codons, the nucleotide sequences that code for amino acids (the building blocks of protein). In that case the analogy with language is somewhat more direct (see Table 1) (Crick [1970]). To our knowledge, however, this analysis has not been carried out with relative abundances of expressed proteins. A modest review of possible statistical models and distributions that can be fit to a proteome, as it is called, is undertaken here.

Method
Datasets for the relative abundance of expressed proteins in various organisms were obtained from https://www.proteomaps.net/ (Liebermeister et al. [2014]). Distribution fitting was carried out with the Python module lmfit (Newville et al. [2021]), which provides an interface for fitting a dataset to a user-defined curve using non-linear least-squares minimization. It was chosen over another popular curve-fitting routine, scipy.optimize.curve_fit, because it conveniently reports various goodness-of-fit statistics.
The dataset under analysis includes, for several organisms, the abundance of each protein that the organism can express. Abundance is reported in four forms: Abundance (Original), Abundance (ppm), Size-weighted Abundance (Original), and Size-weighted Abundance (ppm). Abundance (Original) counts the number of expressed protein molecules of each type available in a typical cell of the organism. Abundance (ppm) divides this by the total number of protein molecules in the cell. Size-weighted Abundance (Original) measures the 'weight' of a particular protein in a cell, with 'weight' referring not to mass but to the protein's length (i.e. how many amino acids it is composed of). Size-weighted Abundance (ppm) is calculated by taking this length-weighted abundance value for a particular protein and dividing it by the sum of the length-weighted abundances of all proteins in the cell.
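The four forms described above can be related with a short numpy sketch. This is only an illustration: the function name, the toy counts and lengths, and the explicit factor of 10^6 used to express a fraction in ppm are our assumptions, not part of the Proteomaps data format.

```python
import numpy as np

def abundance_columns(counts, lengths):
    """Derive the ppm and size-weighted forms from raw counts and lengths.

    counts  : molecules of each protein per cell (Abundance, Original)
    lengths : number of amino acids in each protein
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Abundance (ppm): share of all protein molecules, scaled to parts per million
    abundance_ppm = 1e6 * counts / counts.sum()

    # Size-weighted Abundance (Original): count weighted by protein length
    size_weighted = counts * lengths

    # Size-weighted Abundance (ppm): share of the total length-weighted abundance
    size_weighted_ppm = 1e6 * size_weighted / size_weighted.sum()

    return abundance_ppm, size_weighted, size_weighted_ppm

# Hypothetical toy cell with three protein types
counts = [500, 300, 200]
lengths = [100, 400, 250]
ppm, sw, sw_ppm = abundance_columns(counts, lengths)
```

In this toy case the most numerous protein is not the one with the largest size-weighted share, which is why the two ppm forms are analyzed separately.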
The second and fourth of these are analyzed in this report. Moreover, a variety of organisms are available and were utilized, from simple bacteria to a modern mammal, as summarized in Table 2. This allows for cross-comparison of the proteome distribution across various levels in the evolutionary hierarchy, and helps identify any universal trends.
The datasets were loaded into a Python script, and the relevant columns (Abundance ppm and SizeWeightedAbundance ppm) were extracted and sorted in descending order, with the intention of analyzing them separately. An integer array of matching length was generated to serve as the rank. These operations were handled with the Python module numpy (Harris et al. [2020]). A number of functions and distributions were then defined, taking inspiration from similar use cases in earlier literature; these are summarized in Table 3. Many of them are discussed by Clauset et al. [2009].
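The sorting and ranking step can be sketched as follows. The function name is hypothetical, and starting the ranks at 1 rather than 0 is our assumption (it keeps the rank usable on a logarithmic axis and in power-law forms).

```python
import numpy as np

def rank_frequency(values):
    """Return (rank, frequency) with frequencies sorted in descending order."""
    values = np.asarray(values, dtype=float)
    freq = np.sort(values)[::-1]          # largest abundance first
    rank = np.arange(1, freq.size + 1)    # integer ranks 1, 2, ..., N
    return rank, freq

# Hypothetical abundance column
rank, freq = rank_frequency([3.0, 10.0, 1.0])
```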
Then, lmfit was called to fit the abundance/frequency data to each distribution, as a function of rank. This module outputs several goodness-of-fit statistics for each fit that it performs: chi-squared, reduced chi-squared, AIC, and BIC. It also gives parameter estimates (with optional confidence intervals) and correlations between parameters. Note that many distributions are defined only for a restricted range of parameter values, so lmfit's options for constraining the parameters must sometimes be used. Also, numpy will often return NaN or overflow when evaluating very large numbers or encountering unsupported calculations, so the initial parameter guesses must be chosen judiciously.
Finally, fitted models were plotted against the real data with the module matplotlib.pyplot (Hunter [2007]). The goodness-of-fit results were also analyzed to choose the best-fitting model distribution in each case.

Results
Distributions given by the functional forms in Table 3 were fitted to each organism (Table 2), for both protein Abundance and protein Size-weighted Abundance. Log-log plots of the fitted models superimposed on the real data scatterplot are depicted, and the curve-fitting parameters are summarized.
In the interest of brevity, the results of only three of the organisms are depicted here (M. pneumoniae, S. cerevisiae, and P. troglodytes). Results for the other organisms are available from the authors on request.

Discussion
We now wish to select the best model for a variety of cases. This selection may be done by information-theoretic criteria (AIC and BIC), or by frequentist methods (χ² and reduced χ²).
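For reference, the four statistics can be written as below. These are the definitions as we understand them from the lmfit documentation; the symbols are our notation, with r_i the (optionally weighted) residual at data point i, N the number of data points, and N_varys the number of varied parameters.

```latex
\begin{align}
\chi^2 &= \sum_{i=1}^{N} r_i^2, &
\chi^2_\nu &= \frac{\chi^2}{N - N_{\mathrm{varys}}}, \\
\mathrm{AIC} &= N \ln\!\left(\frac{\chi^2}{N}\right) + 2\,N_{\mathrm{varys}}, &
\mathrm{BIC} &= N \ln\!\left(\frac{\chi^2}{N}\right) + \ln(N)\,N_{\mathrm{varys}}.
\end{align}
```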
In the information-theoretic approach, the 'best' model is the one with the smallest AIC or BIC value, and the difference between the AIC/BIC value of a particular model and the minimum value determines how seriously we can consider that model as an alternative. The selection rules for these criteria are rather subjective (Shi et al. [2012]), but it is generally agreed that a difference of more than 10 all but rules out the alternative models. This is indeed the case in 15 of the 16 datasets analyzed, indicating that in most cases the best-fitting model is the only one worth considering. The sole exception is Size-weighted Abundance in E. coli, where the second-best model has 'substantial support' by AIC, but there is 'some positive evidence against' it by BIC. In any case, this may be considered an outlier in this regard.
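The selection rule described above can be sketched in a few lines. The function names and the AIC values are hypothetical; only the threshold of 10 is taken from the text.

```python
def delta_ic(scores):
    """Map each model name to its AIC (or BIC) difference from the best model."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}

def plausible_models(scores, threshold=10.0):
    """Models whose difference from the minimum does not exceed the threshold."""
    return [name for name, d in delta_ic(scores).items() if d <= threshold]

# Hypothetical AIC values for three candidate models on one dataset
aic = {
    "cutoff_power_law": -1520.3,
    "stretched_exponential": -1498.7,
    "parabolic_fractal": -1517.1,
}

deltas = delta_ic(aic)
survivors = plausible_models(aic)
```

Here the stretched exponential lies more than 10 above the minimum and is dropped, while the parabolic fractal remains a candidate alongside the best model.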
As for the frequentist methods, there are some problems. First, the analysis of χ² and related statistics is strictly valid only when the models are linear in their parameters, which is not the case here, so any interpretation must be made with great caution (Andrae et al. [2010]). For a good fit, we would expect reduced χ² to be close to 1, but in all cases here it is much smaller, close to 0. This may indicate that the errors or uncertainties in the data are grossly overestimated, so interpreting these values in absolute terms is dubious. Since these problems are more or less the same across the datasets, we can make the (rather weak) claim that models with smaller reduced χ² are better fits, but no more than that. Going by this rule of thumb, we find the frequentist model selection to agree with the information-theoretic approach in all cases.
The null hypothesis for this analysis is that there will be no clear trends across the data, and that there is no one distribution that models all protein frequencies well.
The alternative hypotheses of this analysis are posed as questions:
• Is there a single distribution that best models all organisms?
• Are there any trends in terms of the distribution as we move up in the evolutionary hierarchy?
• Are Abundance and Size-weighted Abundance distributed the same way, for the same organism?
We can see that the most common well-fitting distributions are Parabolic Fractal, Stretched Exponential, and Cut-off Power-law, but these are more or less strewn randomly across the organisms. It may also be noted that in Size-weighted Abundance, the Parabolic Fractal is slightly more common than others.
Apart from this, no clear trends can be identified with respect to any of our hypotheses. Thus we cannot reject our null hypothesis, and we conclude that no method to the madness of proteomics can be identified yet.
This report analyzes frequency-rank distributions at the level of expressed proteins found in cells. Future avenues of research could involve moving up or down a level of organization. At a lower level, we can analyze the distributions of codon frequency (as in Kim et al. [2005]) or amino acid frequency. At a higher level, we can cluster proteins of similar function at various functional levels (as can be seen in the Proteomaps of Liebermeister et al. [2014]), e.g. proteins that are involved in metabolism; more specifically, energy metabolism; even more specifically, oxidative phosphorylation, etc.
We would also like to thank
• Dr. Shantanu Desai, for mentoring us on statistical analysis and supervising this project
• Razia Shaikh, for suggesting the topic of the paper, providing an accessible introduction to relevant topics in biochemistry, and pointing out useful datasets in proteomics
• Alexandra Elbakyan, for indirect but invaluable assistance during the course of our research
The authors declare no conflict of interest.