2017-02-01
Average rating: | Rated 4 of 5. |
Level of importance: | Rated 5 of 5. |
Level of validity: | Rated 3 of 5. |
Level of completeness: | Rated 4 of 5. |
Level of comprehensibility: | Rated 4 of 5. |
Competing interests: | None |
This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.
Keywords: | radiation biology, statistical forensics, cell biology, data fabrication, tissue culture, terminal digit analysis, triplicate colony counts |
The application of statistical tests for the identification of fraudulent data has become more appealing in recent years due to a slew of high-profile scandals in which researchers have inappropriately manipulated or outright invented data. In response to the increasingly apparent need for robust statistical tools to test the validity of datasets, Joel H. Pill and Helene Z. Hill out of Renaissance Associates and NJ Medical School respectively, published a study which applied three statistical tests to a series of datasets, produced by different researchers running radiobiological experiments. The authors posit that the data coming from one particular researcher (known as RTS in the paper), is incongruous with data coming from legitimate samples. They run a series of statistical tests on the RTS data and conclude that the data is indeed suspect due to the high occurrence of the mean value of each triplet among the three values in the triplet; the high occurrence of the same number in the final two digits of 3+ digit data; and the non-uniform distribution of the final digit of the data. We aim to replicate the results of this analysis, report inconsistencies, and provide a review of the methodology and hypotheses tested. In our work, we attempted to replicate the methodology of the authors as closely as possible, but chose to work primarily in Python, rather than R, to determine whether these results were replicable using other software. We note below how our analysis differs from that of the authors.
This replication work is being done as part of a graduate-level course through the University of California at Berkeley’s Statistics Department. The course is STAT 215A, Statistical Models: Theory and Application, Fall 2016. The project was guided by the professor and teaching assistant for the class, Professor Philip Stark and Yuansi Chen. However, the work was conducted entirely by the authors, and the opinions expressed in this review are those of the authors. The authors of this replication paper are graduate students in math (Chen), industrial engineering (Li), and civil and environmental engineering (Maurer and Mohanty) and, as such, have backgrounds in statistical analysis and computer programming. However, none of the authors has expertise in biology or biological research methods.
We thank the authors of the paper for making the raw data available to us for replication and review.
The authors ran their analysis for two types of data, one from a Coulter machine for cell counts and one from manual counts of colonies grown from the cells. They included a total of seven datasets: RTS’s data (Coulter and colony), aggregated data from nine other investigators in the same laboratory (Coulter and colony), and three datasets (two Coulter and one colony) from outside laboratories that used similar research methodologies. In their paper and in this write-up, the source of the data is designated by “RTS”, “Others”, “Outside lab 1”, “Outside lab 2”, and “Outside lab 3” respectively.
The summary statistics were calculated by processing the raw data files, following the authors’ methodology. This included calculation of total number of samples (triplets) in each experiment, and the total number of complete triples (defined as triples where maximum value in the triplet was at least 2 more than the minimum value in the triplet). Next, the number of samples where the mean lies in the average value of the triplet was calculated. This was done was comparing the integer part of the average value in each triplet to each component of the triplet. Though not examined by the authors, we believed it would also be useful to study the impact of each investigator in the “Others” sample separately, so we calculated the summary statistics for each investigator.
Pitt and Hill postulate that the final digits of each of the Coulter and colony counts should be uniformly distributed, since the processes for selecting cells is not precise enough to impact the final digit in any systematic way. In order to defend this hypothesis, they state that they “ran simulations generating data sets of triples of independent identical Poisson random variables with comparable means,” the results of which were “consistent with the hypothesis of uniformity” (Pitt and Hill, 7). Despite the vagueness of this statement (how many simulations were performed; what constitutes a “comparable” mean; and how “consistency with uniformity” was determined were not specified), we attempted to replicate the test. For each dataset, we set the means of each triplicate equal to lambda, a choice consistent with the methodology of the authors in their mid-ratio test. Three IID Poisson random variables were drawn for each lambda, mimicking the triplicate data. A chi-square goodness of fit test, described below, was performed on these generated datasets to confirm the hypothesis of uniformity.
Next, following Pitt and Hill, a chi-square goodness-of-fit test was used to analyze the final digits of the colony and Coulter counts, which, as discussed, were expected to be uniformly distributed. We performed the chi-square test twice, once using Scipy’s built-in chi-square test and once using a “manual” step-by-step calculation of the chi-square test statistic. The additional calculation was done to control for any possible bugs in Scipy’s function; the results from the two calculations were identical to at least 14 significant figures.
In the paper, Pit and Hill calculate the percentage of equal pairs of rightmost digit pairs among the 3+ digit Coulter count data, which was expected to be 10%. They then calculate the probability of the observed percentage being greater than or equal to this level in the binomial distribution. This analysis was performed by the authors for the RTS data and again for the combination of Other investigators data and the two Outside lab Coulter datasets.
We implemented a Python function to filter out the data with less than three digits, count the number of filtered data, and count the number of data with equal rightmost pairs. The percentage of the equal-rightmost-pair data is calculated using built-in division in Python, and the probability of the percentage being greater than or equal to the calculated result is calculated by Scipy’s built-in binomial CDF calculator. An additional probability calculator was urn in Matlab to confirm the result of the Scipy’s function. In addition to replicating the test for the same two datasets as the authors, we applied the test to each individual researcher.
Based on a basic analysis of the RTS data, the authors suspect that a disproportionately high number of RTS’s triplicate samples contain their own rounded mean. They speculate that since means of samples are important values in radiobiological experiments, including the desired mean in the sample and constructing the other two values to match that mean would be an easy way to fabricate data with a desired mean. They proceed to construct a model to test for the presence of the mean in the triplicate samples.
A Poisson distribution was used to model both the colony and Coulter samples. As explained by the authors, the three Poisson random variables in each triple share a common parameter lambda and the lambda will vary from triple to triple. Using the formula in Appendix, we wrote a Python function to compute the probability that a triple of independent Poisson random variables with a common parameter lambda includes its rounded mean as one of the three elements. We generated a probability table for lambda in the range 1 to 2,000 and also extended the table by adding the values for lambda that were multiples of 100 between 2,100 and 10,000. Due to the high computing times required to further extend the table, we did not follow the authors to include values larger than 10,000. This required discarding triples with means greater than 10,000 from this analysis; however, the proportion of the discard data was less than 5%. Given the table of lambda values, we computed how likely it is that the researcher collected a that large or even larger portion of the data which contains its mean (i.e p-value).
First, Scipy’s binomial survival function was used to provide an overestimate of this p-value. Then, we wrote a Python function to provide a more accurate p-value when we treat the samples as independent but not identical Bernoulli random variables (Poisson Binomial r.v.). However, we found that there’s no available package for us to deal with Poisson Binomial distribution in Python. An algorithm for computing such probabilities is formulated by expanding the probability generating function and collecting the appropriate coefficients via a recursive scheme.^{1} To avoid the precision loss caused by the the Python function, we also used the ’poibin’ R package to recalculate this p-value. Finally, we found approximate p-values using the normal approximation.
In addition to replicating the authors’ analysis of each of the seven datasets, we applied the same method to compute the p-value for each of the individual investigators whose pooled data comprises “Other Investigators.” We did not attempt to replicate the authors’ probability model for mid-ratios (see Conclusion for further discussion).
The results of the summary statistics from the original paper (white cells) and our analysis (grey cells) are described in Table I. We found some discrepancies, both in the total and complete triplet counts as well as in the number of samples where mean is included in the triplet:
Colony count for RTS (no. mean-containing triplets = 690 from study and 647 from our analysis)
Coulter count for RTS (no. complete triplets = 1,716 from study and 1,726 from our analysis, no. mean-containing triplets = 173 from study and 174 from our analysis)
Colony count for other investigators (no. complete triplets = 572 from study and 578 from our analysis, no. mean-containing triplets = 109 from study and 99 from our analysis)
Colony count for outside lab 1 (no. mean-containing triplets = 0 from study and 1 from our analysis)
Coulter count for outside lab 2 (no. mean-containing triplets = 1 from study and 4 from our analysis)
Coulter count for outside lab 3 (no. mean-containing triplets = 3 from study and 6 from our analysis)
Since there is no way to figure out which entries are excluded, there is no way to exactly replicate each result in each test. We do however find values which are reasonably close to those mentioned in the paper.
None of these differences alters the conclusions since the proportion of samples where the mean is present in the triplet is still much higher in the case of RTS. However, when we analyze each investigator in the “Others” sample independently, we find that the colony counts for investigator C also have a high proportion of mean-containing triplets (20 out of 85). Even though the sample size is small, this would show a discrepancy if we evaluated each investigator in a “One vs All” fashion.
Our main concern is that it is likely that every investigator can be made to look unusual compared to the rest, if one can pick the test after looking at the data. We would be more comfortable with the conclusion if the hypothesis had been formulated using only some of the data, reserving the remaining (unexamined) data to test the hypothesis.
Table I. Summary Statistics Comparison
Histograms of the IID Poisson triplicates generated to test for uniformity can be found in the Appendix. The chi-square statistics and associated p-values for each distribution are listed in Table II. As can be seen, the results for all simulations confirm the hypothesis of uniformity, as the p-values are well above a level where the null hypothesis would be rejected.
The chi square test statistics for each of the data sets tested by the author (two from RTS, two from other researchers in the lab, and three from outside labs) were similar but not equal to those reported by Pitt and Hill (only the observed frequencies for Outside Lab 1 matched exactly). Their values are reported alongside ours in Table III (the authors’ values are in the white cells; ours are highlighted in grey). Given that our summary statistics were slightly different from the authors’, it is to be expected that the terminal digit counts are also slightly different. All p-values in our analysis were effectively the same as those reported by the authors with the exception of the other investigator Coulter counts. While the authors reported a p-value higher than 0.05, we found that the results were lower, albeit barely (see Table III). As the numbers reported are not hugely different (and consistent with the other discrepancies), this finding is of interest primarily because 0.05 is often used by researchers as the threshold to reject the null hypothesis. Our findings would thus require the null hypothesis to be rejected not only in the case of RTS data, but also from the researchers. Of course, 0.05 is a somewhat arbitrary cut-off, and it should be noted that the other investigators’ p-value is still orders of magnitude higher than the RTS values, which are effectively zero.
Table II. Uniformity Test Results
The authors’ calculated rates of equal terminal digits appearing in the two datasets (RTS Coulter and The combination of Other investigators Coulter and two outside lab Coulter datasets) were similar to, but not identical to, the values we find. Table IV compares Pitt and Hill’s results (white cells) with ours (grey cells); we also applied this test to the Coulter counts from the outside labs and other investigators individually. Since the statistics reported in the paper and counted by us were slightly different, the fact that the “Total” and “Equal” the differences in Table IV are to be expected. Despite the differences in observed rates, the percentage of data with equal digits (Column 5) and the probability of that percentage being greater than or equal to the calculated result (Column 6) were effectively the same as the results reported by Pitt and Hill.
Table III. Terminal digit analysis comparison
Table IV. Equal digits analysis comparison
The paper proposes two tests of the uniformity of terminal digits in the data. The first one is the percentage of data within a given dataset with equal rightmost pairs. The hypothesis is that if the data are “honest”, there’s a 10% chance that the final two digits will be equal. This hypothesis is tested using a binomial test. We verify the reasonableness of the assumption after the simulation.
We applied the same tests to the counts we extracted from the data, and reached essentially the same conclusions: the rates observed in the RTS data were improbably high. However, when we applied the same test to other investigators and outside lab Coulter datasets individually, we found that besides the RTS Coulter dataset, another Coulter dataset also gets abnormal results. There are only thirty pairs of equal rightmost digits – smaller than we would expect – among all the 360 3+ digit data from Outside lab 2. The authors looked for anomalously high rates of equal rightmost digits. One might also look for anomalously low rates. Outside Lab 2 had a rate that was somewhat low, producing an occurring probability of 16.7%. Therefore, the probability of this occurring is not impossibly low, but low enough that were we to examine this investigator’s data in isolation, we might have been lead to the conclusion that it is suspect.
Given that the sample size for Outside Lab 2 is much lower than for RTS (360 in total versus 5,177), there is less power to detect a departure from the null hypothesis for the lab. Nevertheless, these results do leave the door open to the possibility that the RTS data is not uniquely abnormal or that the expected rates of equal digit occurrence is wrong.
Our results and the comparison to authors’ results are summarized in the Table IV (again, the authors’ results are in white cells and ours in grey). Column 3 is the number of mean-containing triples. Column 4 is the number of samples expected to contain their means. Columns 5 and 6 list the standard deviation, and the resulting z-value. Column 7 is an approximate p-value based on the normal approximation to the Poisson binomial tail probability, and Column 8 is the p-value based directly on the Poisson binomial distribution.
We applied the same procedure and methods to compute the p-values for each dataset, and same conclusions were drawn based on the results. Nonetheless, our results do not quite match the authors’ results. As we can see in Table V, most of our p-values are quite different from those reported by the author. The p-values depend on the inferred values of lambda for each triple. Since we found slightly different counts than the authors, our calculations give different numerical values, but they do lead to the same conclusions.
Despite results which strongly single out RTS’s data, we have some reservations about the design and use of this test. Comparing the data collected by an individual to pooled data for the rest might skew the findings, especially if the RTS was singled out after inspecting these data. There may be patterns in the other individual researcher’s data that cause those to stand out when compared to RTS’s data; presenting the results in an aggregated manner can obscure such trends. We would recommend analyzing the data for each individual and calculating the p-value for each of those investigators, which we did as described in the Replication Methodology section. The results presented in Table VI show that these p-values confirm finding that the RTS data is anomalous. Ultimately, we reach the same conclusion the authors did, recognizing that there are relatively few data for researchers other than RTS, so the power to find anomalies is limited.
Additional detail would have made it easier to replicate the calculations regarding Coulter counts. Coulter count values are in a much higher range than Colony count values, and thus, the Poisson random variables that give rise to them tend to have much higher lambdas. Following the authors, we assumed for the purposes of the analysis that the means of the triples are equal to the lambda values for that Poisson distribution.
Table V. Appearance of mean in triplicate data
We created an extended probability table to accommodate higher values of lambda by adding multiples of 100 between 2,100 and 10,000. However, the authors do not explain how they selected a lambda value if the mean for the triple falls is greater than 2,100 and falls somewhere between consecutive multiples of 100 (for example, say lambda = 2,254: is it rounded to 2,200 or 2,300?). We chose to round it to the closest multiple of 100; this might be the cause of some of the difference between our results and the authors’. Some of the difference presumably is attributable to the slight differences between our parsing of the data and theirs. Despite the fact that it does not affect the conclusion drawn from the data, it would be helpful for the author to provide more details in the paper or in the supplementary materials for the purposes of replication and verification of the results.
Table VI. p-values for each investigator
In addition to replicating the methodology of Pitt and Hill’s study, we performed a further permutation test based on the aggregated data sets from all researchers. Permutation tests provide a way to evaluate the evidence that the RTS’s data is different from that collected by other researchers, without positing a generative model. The hypothesis tested is, essentially, whether data from the RTS “are like” a random sample from the pooled data, according to some test statistic. If the value of the test statistic is in the tails of the distribution that would be obtained by randomly re-labeling the data, that is evidence that the RTS’s data is not like the rest.
We used permutation test to check whether the conclusions are sensitive to the particular generative model the authors used: if a permutation test leads to the same conclusions, that gives comfort that even if the generative model is incorrect, the conclusions still hold. We combined all the data from all samples in each of the two count categories (Coulter and colony) and then randomly assigned data to different investigating groups. The size of the data for each investigator was maintained. Each of the three statistical tests (terminal digit, equal digit, and appearance of the mean) were then repeated for this randomized sample.
We present the results from one of the permutation runs (seed for numpy.random = 4210163759) in Table VII. The results of the permutation test show that randomizing the data does lead to every sample containing roughly equal proportion of means in the triplet and roughly equal probability of containing equal pairs of terminating digits. This further corroborates the claim that the corresponding values obtained for RTS dataset were suspiciously high. Nevertheless, we still find that the p-values are high enough that we cannot outright dismiss the possibility that the RTS data is consistent with those from any other investigator. Noticeably, the p-values that do not cause the null hypothesis to be rejected tend to reflect smaller data samples (highlighted in green). This observation leads us to speculate that the large sample size of the RTS dataset relative to the other datasets decreases the power of the tests performed. Though the permutation test offered nothing conclusive, we have included it as a demonstration of an analysis that does not rely on a model for how the data were generated.
Table VII. Permutation test results
Overall, our analysis confirms the conclusions of the authors. In nearly every case, the same inferences would be drawn from our results as the authors’.
We are grateful to the authors for making their data available. It would help to have the scripts they used to process the raw data into the data used for subsequent tests. As a side note, a small editing error appears in Tables 2 and 3 of the original paper: the number of the outside labs is inconsistent between the tables as well as when compared to the raw data filenames.
We were unable to replicate the exact values in the summary statistics in the paper (see Table I). However, the deviations in values is not significant and does not affect any of the test statistics.
We find that the sample size of the data in the experiments labeled “Outside Lab” is low; this potentially makes comparisons to outside lab data difficult due to the disparity in sample sizes between RTS data and outside lab data. It is thus hard to assess if the data characteristics (such as whether the mean is present in the triplet and terminal digit analysis) can be compared across multiple experiments. This issue arose in the equal digit analysis; the appearance of the mean analysis; and the permutation test.
We also observed that the statistical tests performed focused specifically on ways in which the RTS data was salient. The authors did not elaborate on how they came to run statistical tests on this dataset in particular, so absent any more detailed explanation, we may suppose that RTS’s data was suspect before the tests were performed and that the tests were selected specifically to address his/her data. In general, we are somewhat skeptical of this practice of designing and verifying a test based on the same set of data, as it may create the tendency (conscious or not) to test for only the anomalies that are already suspected. Thus, the tests become a self-fulfilling prophecy, identifying one dataset as anomalous even though it is possible that every investigator could be made to look unusual compared to the rest if one can picked the test after looking at the data. We suggest that the methods presented by the authors could be further validated if applied to datasets that were not previously suspect.
Overall, we can corroborate the findings of Pitt and Hill, and find their paper and data to be a valuable contribution to the literature.
J. H. Pitt and H. Z. Hill, “Statistical analysis of numerical preclinical radiobiological data," pp. 1–11, 2016.
Histogram of IID Poisson R.V. generated based on RTS Colony triplicate means
Histogram of IID Poisson R.V. generated based on RTS Coulter triplicate means
Histogram of IID Poisson R.V. generated based on Others Colony triplicate means
Histogram of IID Poisson R.V. generated based on Others Coulter triplicate means
Histogram of IID Poisson R.V. generated based on Outside Lab 1 Coulter triplicate means
Histogram of IID Poisson R.V. generated based on Outside Lab 2 Coulter triplicate means
Histogram of IID Poisson R.V. generated based on Outside Lab 3 Colony triplicate means
Marlin A. Thomas & Audrey E. Taub (1982) Calculating binomial probabilities when the trial probabilities are unequal, Journal of Statistical Computation and Simulation, 14:2, 125-131, DOI: 10.1080/00949658208810534 (The link to this article: http://dx.doi.org/10.1080/00949658208810534)↩