Average rating: | Rated 4 of 5. |
Level of importance: | Rated 4 of 5. |
Level of validity: | Rated 4 of 5. |
Level of completeness: | Rated 4 of 5. |
Level of comprehensibility: | Rated 4 of 5. |
Competing interests: | None |
This peer review of Hill and Pitt (2016) was co-authored with Antonio Iannopollo and Jiancong Chen at the University of California, Berkeley. We review the paper in the spirit of promoting reproducibility of research and attempt to replicate the authors’ work. We also discuss other methods to identify anomalies, and present results based on our analysis using permutation tests. Permutation tests are consistent with the aim of the paper (providing simple tools for detecting anomalies) and validate the results in the paper, leading to the same conclusions.
Before diving into technical details, we make a minor observation: the organization of the paper was not properly introduced. The use of distinct sections for (1) the discussion on data and experiments; (2) their model and related calculations; (3) the application of common tests from the literature; and (4) conclusions, would have been helpful. The review is organized as follows. In section 2 we replicate the authors’ work and results and discuss weaknesses of their approach. In section 3, we propose and implement additional tests to consolidate the results. Finally, we draw our conclusions in section 4.
The paper begins by voicing a growing concern about “Scientific fraud and Plagiarism” in the scientific community and is successful in conveying a strong message. The authors present some statistical figures and point out both the existence of easy statistical tools to detect fabricated data and the widespread ignorance of such tools.
The authors examine datasets from radiobiological experiments. They find that data reported by one of 10 researchers, the “RTS”, is suspicious. They perform three different tests to validate their suspicion and also validate their tests and assumptions by looking at the data obtained from three other sources. Each researcher made two types of triple measurements - colony counts and Coulter counts. The authors suspect that the RTS fabricated data triples to get the mean s/he desired in each triple by setting one observation equal to the desired mean and the other two roughly equidistant above and below that value. This would result in triples that contain the (rounded) mean as one of their values.
The methodological contribution of the paper is “bounds and estimates for the probability that a given set of n such triples contains k or more triples which contain their own mean” when the three entries of each triple are independent and identically distributed (i.i.d.) Poisson and the triples are independent of each other. (Different triples may have different Poisson means.) For this Poisson model, the chance that the RTS’s data would contain so many triples that include their rounded mean is astronomically low. They also apply more common tests for anomalous data, based on statistics such as the frequency of the terminal digit and the frequency with which the last two digits are equal. However, the paper leaves some questions insufficiently addressed; we discuss them below:
The authors write, “Having observed what appeared to us to be an unusual frequency of triples in RTS’s data containing a value close to their mean, we used R to calculate the mid-ratios for all of the colony data triples that were available to us.” This suggests that the same data–and the same feature of the data–that raised their suspicions about the RTS was the data used to test whether the RTS’s data were anomalous on the basis of that feature. If so, then the nominal p-values are likely to be misleadingly small.
Most of the tests assume a model for the observations and compare the RTS’s data to that model. The authors validate the assumptions of the model by comparing it with the data pooled for the other researchers. Pooling the data in this way may hide anomalies in the other researchers’ data. Permutation tests allow the data from each researcher to be compared to the data from the other researchers without positing a generative model for the data. On the other hand, the bulk of the data available is from the RTS. Rejecting the hypothesis that another researcher’s data look like a random sample from the pooled data, when the pool includes the RTS’s data, does not imply that the researcher is suspicious. Instead, it shows that his/her data are unlike those of the RTS. See section 3 of this review for more discussion.
This section discusses our efforts to replicate the analyses in the paper. After fine tuning, we were able to replicate most of their results, obtaining similar results in the other cases. Our work is available on github.com/ianno/science_review. The original datasets used for the paper and also used in this review can be found at https://osf.io/mdyw2/files/. We first discuss specifics about the replication and then comment about the tests and methods involved.
The authors first consider the mid-ratio, which is defined for a triple (a, b, c), a < b < c as \(\frac{b-a}{c-a}\), and show that the histogram of the RTS’s data concentrates abnormally in the 0.4 to 0.6 range, compared to the data taken by all the other lab members. After adjusting the default settings of the histogram function in numpy, we were able to obtain plots similar to the ones reported in Figure (1) of the paper. There were two noticeable differences: (1) we obtain a 44% chance of seeing a mid-ratio in the interval (0.4, 0.5] for the RTS, compared to the 50% chance reported in the paper; and (2) we used 1360/1361 and 595/595 triples to compute the histograms for the RTS and the rest respectively, whereas the authors used 1343/1361 and 572/595 triples. We believe the authors did not provide enough information about the methods used to filter data for this section. However, these differences are minor and we did not investigate them further.
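To make the mid-ratio computation concrete, here is a minimal sketch in plain Python. The triples are made-up values for illustration, not data from the paper, and both the handling of triples with zero gap and the left-closed binning are assumptions of this sketch. The paper reports intervals such as (0.4, 0.5], so a different binning convention is one place where small discrepancies could arise.

```python
from collections import Counter

def mid_ratio(triple):
    """Return (middle - smallest) / (largest - smallest), or None if the gap is zero."""
    low, mid, high = sorted(triple)
    if high == low:
        return None  # the ratio is undefined when all three values are equal
    return (mid - low) / (high - low)

def histogram(ratios, n_bins=10):
    """Relative frequencies over n_bins equal-width, left-closed bins on [0, 1]."""
    # A ratio of exactly 1.0 is placed in the last bin.
    counts = Counter(min(int(r * n_bins), n_bins - 1) for r in ratios)
    total = len(ratios)
    return [counts.get(b, 0) / total for b in range(n_bins)]

# Hypothetical triples for illustration only; these are not the paper's data.
triples = [(10, 15, 20), (3, 4, 9), (7, 7, 7), (12, 18, 20)]
ratios = [r for r in map(mid_ratio, triples) if r is not None]
print(ratios)
print(histogram(ratios))
```

Dropping triples with zero gap changes the number of triples that enter the histogram, which is the kind of filtering decision that has to be documented for counts to be reproducible.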
The authors develop a model to bound the probability that k or more out of n triples contain their own mean. Each entry in a triple is assumed to be an independent sample from a Poisson distribution with mean λ. (Different triples may have different means.) The event of observing the rounded mean in such a triple is a Bernoulli random variable (BRV) whose success probability depends on λ. The authors derive analytical expressions for these success probabilities in Appendix A. Numerical values of these probabilities, for λ ∈ {1, …, 25}, are presented in Table 1. We could replicate this table exactly. For large λ (>2000), for which the authors provide only a few representative probability values, our implementation suffered from numerical issues.
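The success probability of such a Bernoulli variable can also be checked by brute-force enumeration. The sketch below is not the authors' Appendix A derivation; it simply sums the Poisson probabilities of all triples that contain their rounded mean. The truncation point `k_max` and the `min_gap` argument are assumptions of this sketch. Triples whose largest and smallest values differ by 0 or 1 always contain their rounded mean, so an analysis may wish to exclude them, and this review does not restate which convention Table 1 uses.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability mass at k, computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def prob_triple_contains_mean(lam, min_gap=0, k_max=None):
    """P(gap >= min_gap and the triple contains its rounded mean), by enumeration.

    Entries are i.i.d. Poisson(lam). The sum is truncated at k_max; the default
    cutoff below is an assumption of this sketch, not a value from the paper.
    """
    if k_max is None:
        k_max = int(lam + 12 * math.sqrt(lam) + 20)
    pmf = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    total = 0.0
    for a in range(k_max + 1):
        for b in range(k_max + 1):
            p_ab = pmf[a] * pmf[b]
            for c in range(k_max + 1):
                if max(a, b, c) - min(a, b, c) < min_gap:
                    continue
                # The mean of three integers has fractional part 0, 1/3 or 2/3,
                # so rounding to the nearest integer is never ambiguous.
                rounded_mean = (a + b + c + 1) // 3
                if rounded_mean in (a, b, c):
                    total += p_ab * pmf[c]
    return total

for lam in (1, 5, 10):
    print(lam, prob_triple_contains_mean(lam), prob_triple_contains_mean(lam, min_gap=2))
```

The cubic enumeration is only practical for small λ. For large λ the cost and the tail truncation both become problematic, so a different numerical strategy would be needed there.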
Using Table 1, the authors determine the success probability for the BRV in two different ways and use it to compute the chance of observing the data. For hypothesis test I (non-parametric) they used the maximum value from Table 1 as an upper bound for all triples, essentially treating all BRVs as i.i.d. Bernoulli(0.42). Replicating this was straightforward. For hypothesis tests II and III, the authors use the maximum likelihood estimate of λ for each triple to compute the corresponding success probability, essentially treating each BRV as having a different distribution. The authors refer to the sum of these BRVs as a “Poisson Binomial Random Variable”. Additionally, for hypothesis test III, the authors use the normal approximation for the Poisson binomial random variable. We could replicate the probability values, up to minor errors, for the colony data. Limitations of our implementation gave inaccurate results for the Coulter data. As a sanity check, we used linearly interpolated estimates from the paper (for intermediate λ) and obtained values similar to those in the paper for these tests. Figure 1 is our approximate replication of Table 2 from the paper.
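The three tail probabilities described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The binomial tail uses a single success probability (such as the 0.42 bound), the Poisson binomial tail is computed exactly by dynamic programming over per-triple probabilities, and the normal approximation uses a continuity correction, which is an assumption of this sketch. The numbers in the usage lines are hypothetical.

```python
import math

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for j in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * math.log(p) + (n - j) * math.log(1 - p))
        total += math.exp(log_term)
    return total

def poisson_binomial_tail(probs, k):
    """P(sum of independent Bernoulli(p_i) >= k), by dynamic programming."""
    dist = [1.0]  # dist[j] = probability of j successes among the triples seen so far
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * (1 - p)
            new[j + 1] += mass * p
        dist = new
    return sum(dist[k:])

def normal_tail(probs, k):
    """Normal approximation to the same tail (continuity correction assumed)."""
    mu = sum(probs)
    sigma = math.sqrt(sum(p * (1 - p) for p in probs))
    z = (k - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical numbers for illustration; these are not the paper's counts.
n, k = 100, 60
print(binomial_tail(n, k, 0.42))
probs = [0.30 + 0.001 * i for i in range(n)]
print(poisson_binomial_tail(probs, k), normal_tail(probs, k))
```

Because every per-triple probability is at most the upper bound, the binomial tail with the bound is at least as large as the Poisson binomial tail, which is why the first test is the most conservative of the three.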
The authors also perform some common tests for fraud detection - terminal digit analysis and pair of equal terminal digits analysis. These tests are based on the assumption that, in general, insignificant digits of a random sample are uniformly distributed.
The first test, terminal digit analysis, assumes that the last digits of large counts (>100) should be approximately uniformly distributed. Previous work has also shown that fabricated data often fail to exhibit this property. The authors use the chi-square test for goodness of fit and get low p-values for the RTS’s data, and good fits for the data of other researchers. Our results are very similar to theirs, although not identical.
The second test, on pairs of equal terminal digits, assumes that, for large numbers, the empirical frequency with which the last two digits are equal should be close to 1/10. The authors did not mention which test they used for this analysis. We assume they performed chi-square tests for goodness of fit, for which we obtain similar results.
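Both digit tests can be sketched as chi-square goodness-of-fit tests. The version below is a self-contained illustration that uses only the standard library. In practice a routine such as `scipy.stats.chisquare` would normally be used, and the hand-written tail function `chi2_sf` here is only a stand-in for it. The input counts are randomly generated placeholders, not data from the paper.

```python
import math
import random
from collections import Counter

def chi2_sf(x, df):
    """Upper tail P(Chi2_df > x), via Q(a + 1, z) = Q(a, z) + z^a e^{-z} / Gamma(a + 1)."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z)
    while a < df / 2.0 - 1e-9:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q

def terminal_digit_test(values):
    """Chi-square goodness of fit of the last digits against the uniform distribution."""
    counts = Counter(v % 10 for v in values)
    expected = len(values) / 10
    stat = sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))
    return stat, chi2_sf(stat, df=9)

def equal_digits_test(values):
    """Chi-square test that the last two digits are equal with probability 1/10."""
    n = len(values)
    equal = sum(1 for v in values if v % 10 == (v // 10) % 10)
    exp_eq, exp_ne = n / 10, 9 * n / 10
    stat = (equal - exp_eq) ** 2 / exp_eq + (n - equal - exp_ne) ** 2 / exp_ne
    return stat, chi2_sf(stat, df=1)

# Placeholder counts (all above 100) drawn at random for illustration only.
rng = random.Random(0)
values = [rng.randrange(100, 10000) for _ in range(2000)]
print(terminal_digit_test(values))
print(equal_digits_test(values))
```

The terminal digit test has 9 degrees of freedom (ten digit categories), while the equal-digits test has 1 (equal versus unequal).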
Here are a few general comments on the methodology adopted by the authors:
The authors did not justify the assumption of Poisson distribution for the underlying radiobiological data. We think a more thorough explanation would have been helpful for readers with different backgrounds.
The authors suspected RTS’s data but used his/her data to fit a model and quantify their suspicion. While sometimes this may raise concerns, here we agree with the authors that doing so increases the odds in favor of the RTS, hence giving us desirable conservative results.
The authors do not discuss why considering only numbers larger than 100 justifies the assumption of insignificance for the two terminal digits.
The authors include additional data from three external sources (two for Coulter counts and one for colony counts). All of them, however, had a relatively small amount of data. Despite the authors’ attempts to account for this, we believe that in the current setting these additional samples do not provide more compelling evidence. Instead, they might be misleading (Are the procedures used the same? Is the equipment calibrated in the same way? etc.).
We reiterate that pooling the data may hide anomalies in the other researchers’ data.
As a preliminary test for identifying suspicious datasets, we (1) plot histograms of mid-ratios for the colony data provided by individual researchers, and (2) contrast the histogram of each investigator with the histogram of the pooled data of the other investigators. In Figure 2, we present the plots for (1).
Two important observations can be made:
The histograms for researchers with labels B, C, E, F, G, H, I do not appear to follow a uniform distribution.
The RTS heavily influences the histogram when his/her data are included in the pool and, therefore, patterns from the other researchers look anomalous when compared to it.
These points illustrate the limitations of the uniformity assumption for mid-ratios and the visual comparison between the histograms of RTS and the pool to motivate suspicion.
“The problem of determining whether a treatment has an effect is widespread in various real world problems. To evaluate whether a treatment has an effect, it is crucial to compare the outcome when treatment is applied (the outcome for the treatment group) with the outcome when treatment is withheld (the outcome for the control group), in situations that are as alike as possible but for the treatment. This is called the method of comparison.” We describe this method for the specific setup relevant to this review.
Suppose that we are given two sets of observations: one labeled ‘treatment’, of size T, and the other labeled ‘control’, of size C. We assume that the first has received a treatment and we wish to test the hypothesis that this treatment has no effect on the group. In a two-sample permutation test, the data are pooled together to form a population of size N = T + C. To compare the two groups, we need to decide on a test statistic that can capture the effect of the treatment (if any) on the population. As an example, we can consider the absolute difference between the sample means of the two datasets. Under the null hypothesis that the treatment has no effect, one can in principle derive the exact distribution of this test statistic. However, it is often easier to approximate this distribution empirically than to compute it exactly. To do so, one repeatedly partitions the pooled data at random into groups of size T and C and computes the test statistic contrasting the two groups. We use the empirical histogram obtained from these repetitions as a proxy for the true distribution of the test statistic. As in standard hypothesis testing, we then determine the chance (p-value) of observing a test statistic at least as extreme as the one computed from the original labeling.
When the p-value is below a preset significance level, we infer that the treatment has an effect at that level of significance. It is unlikely that the two sets were obtained by a random partition of the pooled data.
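The procedure just described can be sketched in a few lines. This is a generic illustration, not the code in our repository. The function names, the seed, and the example data are hypothetical, and the statistic is the absolute difference in sample means used as the example above.

```python
import random

def abs_mean_diff(x, y):
    """Absolute difference between the sample means of two groups."""
    return abs(sum(x) / len(x) - sum(y) / len(y))

def permutation_test(treatment, control, statistic, n_reps=1000, seed=0):
    """Two-sample permutation test; returns (observed statistic, estimated p-value)."""
    rng = random.Random(seed)
    observed = statistic(treatment, control)
    pooled = list(treatment) + list(control)
    t = len(treatment)
    hits = 0
    for _ in range(n_reps):
        rng.shuffle(pooled)  # random partition into groups of size T and C
        if statistic(pooled[:t], pooled[t:]) >= observed:
            hits += 1
    # Fraction of random partitions at least as extreme as the observed labeling.
    return observed, hits / n_reps

# Hypothetical data for illustration only.
rng = random.Random(1)
treatment = [rng.gauss(0.5, 1.0) for _ in range(40)]
control = [rng.gauss(0.0, 1.0) for _ in range(60)]
print(permutation_test(treatment, control, abs_mean_diff))
```

Passing the statistic as a function keeps the resampling loop unchanged when a different statistic is substituted.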
We set the test statistic to be the difference in the standard deviation of the mid-ratios for the two datasets. We choose the standard deviation, instead of the mean, because our null and alternative hypotheses for the mid-ratio (uniform distribution versus concentration around 0.5) have the same mean (0.5). We expect the standard deviation to capture the reduction in spread that intentional adjustments would unintentionally introduce into the data.
We consider each researcher’s data equivalent to a treatment group and the rest of them as the control group. We use 1000 repetitions to obtain the empirical distribution and then compute the p-values:
0.00, for investigators A, B, D, and RTS;
<0.01, for C, H, I;
>0.01, for E,F,G.
The p-values indicate that almost all datasets are surprising with respect to this test statistic. We emphasize that a reported p-value of 0.00 in fact denotes a p-value <0.001, because 1000 repetitions cannot resolve anything smaller. We also note that the RTS is still the most surprising if one looks at how far into the tail of the empirical distribution the observed test statistic falls.
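A minimal sketch of the one-versus-rest comparison, using the standard deviation statistic, is shown below. It rests on several assumptions: the mid-ratios are simulated, the labels are hypothetical, and the statistic is taken as an absolute difference because the sign convention is not fixed above. The sketch also reports (hits + 1) / (repetitions + 1) rather than hits / repetitions. This is a commonly used alternative estimate that can never be exactly zero, not the estimate used for the p-values listed above.

```python
import random
import statistics

def std_diff(x, y):
    """Absolute difference of sample standard deviations (sign convention assumed)."""
    return abs(statistics.stdev(x) - statistics.stdev(y))

def one_vs_rest_pvalues(groups, n_reps=1000, seed=0):
    """Permutation p-value of each group's data against everyone else's pooled data."""
    rng = random.Random(seed)
    pvalues = {}
    for label, treatment in groups.items():
        control = [v for other, vals in groups.items() if other != label for v in vals]
        observed = std_diff(treatment, control)
        pooled = list(treatment) + control
        t = len(treatment)
        hits = 0
        for _ in range(n_reps):
            rng.shuffle(pooled)
            if std_diff(pooled[:t], pooled[t:]) >= observed:
                hits += 1
        # Adding 1 to both counts keeps the estimate away from exactly zero.
        pvalues[label] = (hits + 1) / (n_reps + 1)
    return pvalues

# Simulated mid-ratios for illustration: two uniform groups and one large
# group concentrated near 0.5. These are not the paper's data.
rng = random.Random(2)
groups = {
    "P": [rng.random() for _ in range(60)],
    "Q": [rng.random() for _ in range(60)],
    "X": [min(max(rng.gauss(0.5, 0.05), 0.0), 1.0) for _ in range(200)],
}
print(one_vs_rest_pvalues(groups))
```

In this simulated setup the concentrated group dominates the pool, so the uniform groups can also receive small p-values. This is the same effect discussed for the pooled lab data.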
We also use the ℓ_{1}-distance between the densities^{1} and the ℓ_{1}-distance between the cumulative distribution functions (CDFs) as test statistics. Again, we reject several researchers of the lab at a significance level of 1%. We present all the p-values in Figure 2. The top row denotes the test statistics used and the first column denotes the researcher for whose data we perform the permutation test. The column labeled 'No.' denotes the number of data points associated with the researcher.
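The two ℓ_{1} statistics can be sketched as follows, assuming that both the density and the CDF are estimated from a histogram with equal-width bins on [0, 1]. The number of bins and the binning convention are assumptions of this sketch, not settings taken from our analysis.

```python
def histogram_density(values, n_bins=10):
    """Relative frequencies on n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return [c / len(values) for c in counts]

def l1_density_distance(x, y, n_bins=10):
    """Sum of absolute differences between the two binned densities."""
    dx, dy = histogram_density(x, n_bins), histogram_density(y, n_bins)
    return sum(abs(p - q) for p, q in zip(dx, dy))

def l1_cdf_distance(x, y, n_bins=10):
    """Sum of absolute differences between the two binned empirical CDFs."""
    total, cx, cy = 0.0, 0.0, 0.0
    for p, q in zip(histogram_density(x, n_bins), histogram_density(y, n_bins)):
        cx += p  # running CDF of x at the right edge of the current bin
        cy += q  # running CDF of y at the right edge of the current bin
        total += abs(cx - cy)
    return total

# Hypothetical mid-ratios for illustration only.
concentrated = [0.45, 0.5, 0.5, 0.55, 0.48, 0.52]
spread = [0.05, 0.2, 0.35, 0.6, 0.8, 0.95]
print(l1_density_distance(concentrated, spread))
print(l1_cdf_distance(concentrated, spread))
```

Either function can serve as the statistic in a two-sample permutation test, since each takes two samples and returns a single non-negative number.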
Remark: When the RTS is included in the control group, his/her data constitute the bulk of the group. As a result, rejecting the null hypothesis for a researcher is almost equivalent to rejecting the hypothesis that the data of that researcher resemble the RTS’s data. If we already believe or have discovered that the RTS’s data are suspicious, then we cannot flag other researchers’ data as suspicious on this basis. Therefore, we perform another set of permutation tests after excluding the RTS’s data. We did not find strong evidence to reject the null hypothesis, hence we conclude that none of the other researchers is suspicious at a significance level of 1%. However, this set of tests suffers from a bias because we manually discarded 2/3 of the data points.
Putting together all the pieces, we conclude that there is statistical evidence to claim that RTS’s data is not genuine.
For the terminal digit and equal digits analyses, we extended the tests done by the authors to individual members of the lab and performed (1) chi-square test for goodness of fit for terminal digit, (2) chi-square test for goodness of fit for equal digits and (3) permutation tests for terminal digit. For permutation tests, we used the test statistics listed in the previous section. Results are tabulated in Figures 5-7.
Figure 7, once again, confirms that the RTS’s data are suspicious. As before, the huge fraction of data contributed by the RTS leads to low p-values for some of the other researchers. In permutation tests after excluding the RTS, none of the researchers look suspicious. For the sake of brevity, we do not list the p-values here.
Data fraud is an extremely critical issue in science, engineering, and many other fields. Methods to detect manipulated data are needed to identify fraudulent research behaviors. Detecting fraud, however, is a delicate matter. Challenging the credibility of a researcher or of a scientific work can have heavy consequences for all the parties involved in the process. Methodologies and techniques used in this kind of work need to be clear and widely accepted. They need to produce results which leave minimal (ideally no) room for ambiguity. Independently, reproducibility of results is fundamental to ruling out any doubts that could arise at any time. In our review, we carefully analyzed the authors’ work by reproducing the results in the paper and using additional tests which we believe to be more general. Having been able to reproduce most of their results, we found that the authors’ conclusions are supported. Moreover, we encourage the use of tools that require fewer assumptions, such as permutation tests, which we found to be effective in the context of the paper. Such tests help to focus the analysis not on the assumptions, but on the actual anomalies present in the data.
At the end of our review, we believe that there is significant evidence that the RTS’s data are suspicious. However, we recommend that the authors collect additional information, since some of our tests suggest that other investigators’ data have anomalies as well if we do not discount the huge fraction of data contributed by the RTS.
We would like to thank the authors H. Pitt and H. Hill for publishing in an open journal, and making the data available for everyone. We would also like to thank Prof. Philip Stark for his valuable and critical guidance and timely feedback, and Yuansi Chen for valuable tips on Python. As a final note, we take complete responsibility for all the opinions expressed in this review.