# ScienceOpen: research and publishing network


## Statistical analysis of numerical preclinical radiobiological data

Joel H. Pitt1, Helene Z. Hill*,2

ScienceOpen Research – Section: SOR-STAT

Review statistics
Level of importance: Rated 4 of 5. Level of validity: Rated 3.5 of 5. Level of completeness: Rated 3.5 of 5. Level of comprehensibility: Rated 3.5 of 5.

### Reviews

Tessa Maurer evaluated the article as: Rated 4 of 5.
An important demonstration of the use of statistical methods for detecting fraud
Publication date: 01 February 2017 DOI: 10.14293/S2199-1006.1.SOR-STAT.AFHTWC.v1.RYVDVY Level of importance: Rated 5 of 5. Level of validity: Rated 3 of 5. Level of completeness: Rated 4 of 5. Level of comprehensibility: Rated 4 of 5. Competing interests: None

# Introduction

The application of statistical tests for the identification of fraudulent data has become more appealing in recent years due to a slew of high-profile scandals in which researchers have inappropriately manipulated or outright invented data. In response to the increasingly apparent need for robust statistical tools to test the validity of datasets, Joel H. Pitt and Helene Z. Hill, of Renaissance Associates and NJ Medical School respectively, published a study that applied three statistical tests to a series of datasets produced by different researchers running radiobiological experiments. The authors posit that the data from one particular researcher (known as RTS in the paper) are incongruous with data from legitimate samples. They run a series of statistical tests on the RTS data and conclude that the data are indeed suspect for three reasons: the high occurrence of the mean value of each triplet among the three values in the triplet; the high occurrence of equal digits in the final two positions of 3+ digit data; and the non-uniform distribution of the final digit of the data. We aim to replicate the results of this analysis, report inconsistencies, and provide a review of the methodology and hypotheses tested. In our work, we attempted to replicate the methodology of the authors as closely as possible, but chose to work primarily in Python, rather than R, to determine whether these results were replicable using other software. We note below how our analysis differs from that of the authors.

## Context of Replication Work

This replication work is being done as part of a graduate-level course through the University of California at Berkeley’s Statistics Department. The course is STAT 215A, Statistical Models: Theory and Application, Fall 2016. The project was guided by the professor and teaching assistant for the class, Professor Philip Stark and Yuansi Chen. However, the work was conducted entirely by the authors, and the opinions expressed in this review are those of the authors. The authors of this replication paper are graduate students in math (Chen), industrial engineering (Li), and civil and environmental engineering (Maurer and Mohanty) and, as such, have backgrounds in statistical analysis and computer programming. However, none of the authors has expertise in biology or biological research methods.

We thank the authors of the paper for making the raw data available to us for replication and review.

# Replication Methodology

## Summary statistics

The authors ran their analysis for two types of data, one from a Coulter machine for cell counts and one from manual counts of colonies grown from the cells. They included a total of seven datasets: RTS’s data (Coulter and colony), aggregated data from nine other investigators in the same laboratory (Coulter and colony), and three datasets (two Coulter and one colony) from outside laboratories that used similar research methodologies. In their paper and in this write-up, the source of the data is designated by “RTS”, “Others”, “Outside lab 1”, “Outside lab 2”, and “Outside lab 3” respectively.

The summary statistics were calculated by processing the raw data files, following the authors’ methodology. This included calculating the total number of samples (triplets) in each experiment and the total number of complete triples (defined as triples in which the maximum value was at least 2 more than the minimum value). Next, we counted the samples in which the mean of the triplet appears as one of its values. This was done by comparing the integer part of the average value in each triplet to each component of the triplet. Though not examined by the authors, we believed it would also be useful to study the impact of each investigator in the “Others” sample separately, so we calculated the summary statistics for each investigator.
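The counting rules described above can be sketched as follows. This is a minimal illustration, not the code used for the review: the function name `summarize_triples` is hypothetical, missing values are assumed to be encoded as `None`, and mean-containing triples are assumed to be counted only among complete triples, which the text does not state explicitly.

```python
def summarize_triples(triples, min_gap=2):
    """Count total, complete, and mean-containing triples.

    triples: iterable of 3-tuples of non-negative integer counts.
    Assumption: a triple with a missing value (None) is skipped entirely.
    Assumption: the mean check is applied to complete triples only.
    """
    total = complete = mean_containing = 0
    for triple in triples:
        if len(triple) != 3 or any(value is None for value in triple):
            continue
        total += 1
        # "Complete" means the maximum exceeds the minimum by at least min_gap.
        if max(triple) - min(triple) < min_gap:
            continue
        complete += 1
        # Integer part of the mean, compared with each component of the triple.
        if sum(triple) // 3 in triple:
            mean_containing += 1
    return {"total": total, "complete": complete, "mean_containing": mean_containing}
```

Called on a list of per-investigator triples, the same function gives the per-investigator breakdown mentioned in the text.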

## Terminal Digit Analysis

Pitt and Hill postulate that the final digits of each of the Coulter and colony counts should be uniformly distributed, since the process for selecting cells is not precise enough to affect the final digit in any systematic way. To defend this hypothesis, they state that they “ran simulations generating data sets of triples of independent identical Poisson random variables with comparable means,” the results of which were “consistent with the hypothesis of uniformity” (Pitt and Hill, 7). Despite the vagueness of this statement (it does not specify how many simulations were performed, what constitutes a “comparable” mean, or how “consistency with uniformity” was determined), we attempted to replicate the test. For each dataset, we set lambda equal to the mean of each triplicate, a choice consistent with the methodology of the authors in their mid-ratio test. Three IID Poisson random variables were drawn for each lambda, mimicking the triplicate data. A chi-square goodness-of-fit test, described below, was performed on these generated datasets to confirm the hypothesis of uniformity.
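A simulation of this kind might look like the sketch below. It assumes NumPy, treats each observed triplicate mean as the Poisson parameter, and uses a hypothetical function name and seed; the number of repetitions used in the review is not stated, so a single draw per triple is shown.

```python
import numpy as np

def simulate_terminal_digits(triple_means, seed=0):
    """Draw three IID Poisson counts per triple mean and tally final digits.

    triple_means: sequence of positive triplicate means, used as lambda.
    Returns an array of length 10 with the frequency of each terminal digit.
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(triple_means, dtype=float)
    # One row per triple, three independent draws sharing that row's lambda.
    draws = rng.poisson(lam=lam[:, None], size=(lam.size, 3))
    return np.bincount(draws.ravel() % 10, minlength=10)
```

The resulting digit frequencies can then be passed to a goodness-of-fit test against the uniform distribution.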

Next, following Pitt and Hill, a chi-square goodness-of-fit test was used to analyze the final digits of the colony and Coulter counts, which, as discussed, were expected to be uniformly distributed. We performed the chi-square test twice, once using Scipy’s built-in chi-square test and once using a “manual” step-by-step calculation of the chi-square test statistic. The additional calculation was done to control for any possible bugs in Scipy’s function; the results from the two calculations were identical to at least 14 significant figures.
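The two-way check described here can be illustrated as below: a sketch that computes the statistic step by step and compares it with `scipy.stats.chisquare`. The function name is hypothetical, and the input is assumed to be a flat sequence of non-negative integer counts.

```python
import numpy as np
from scipy import stats

def terminal_digit_chisquare(counts):
    """Chi-square goodness-of-fit test of terminal digits against uniformity.

    Returns (manual statistic, manual p-value, SciPy statistic, SciPy p-value).
    """
    values = np.asarray(counts, dtype=np.int64)
    observed = np.bincount(values % 10, minlength=10)
    expected = np.full(10, observed.sum() / 10)

    # Manual calculation: sum of (O - E)^2 / E with 10 - 1 = 9 degrees of freedom.
    stat_manual = float(((observed - expected) ** 2 / expected).sum())
    p_manual = float(stats.chi2.sf(stat_manual, df=9))

    # SciPy's built-in test; equal expected frequencies are its default.
    stat_scipy, p_scipy = stats.chisquare(observed)
    return stat_manual, p_manual, float(stat_scipy), float(p_scipy)
```

Agreement between the two pairs of values is the kind of cross-check the paragraph above reports.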

## Equal Digit Analysis

In the paper, Pitt and Hill calculate the percentage of equal rightmost digit pairs among the 3+ digit Coulter count data, which was expected to be 10%. They then calculate the probability, under the binomial distribution, of the observed percentage being greater than or equal to this level. This analysis was performed by the authors for the RTS data and again for the combination of the Other investigators’ data and the two Outside lab Coulter datasets.

We implemented a Python function to filter out values with fewer than three digits, count the remaining values, and count those with equal rightmost pairs. The percentage of values with equal rightmost pairs is calculated using built-in division in Python, and the probability of the percentage being greater than or equal to the calculated result is calculated with Scipy’s built-in binomial CDF calculator. An additional probability calculation was run in Matlab to confirm the result of the Scipy function. In addition to replicating the test for the same two datasets as the authors, we applied the test to each individual researcher.
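A sketch of such a function is given below. The name `equal_digit_test` is hypothetical, and the upper-tail probability is obtained from `scipy.stats.binom.sf`, which is equivalent to one minus the binomial CDF evaluated one count lower.

```python
from scipy import stats

def equal_digit_test(counts, p_equal=0.1):
    """Binomial test for an excess of equal final-digit pairs.

    Keeps only counts with at least three digits, then compares the last
    two digits of each. Returns (n_total, n_equal, rate, upper-tail p-value).
    """
    eligible = [int(c) for c in counts if int(c) >= 100]
    n_total = len(eligible)
    # The last digit equals the tens digit, e.g. 211 or 1,455.
    n_equal = sum(1 for c in eligible if c % 10 == (c // 10) % 10)
    rate = n_equal / n_total
    # P(X >= n_equal) for X ~ Binomial(n_total, p_equal).
    p_value = float(stats.binom.sf(n_equal - 1, n_total, p_equal))
    return n_total, n_equal, rate, p_value
```

Applying the function to each investigator's counts separately gives the per-researcher version of the test.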

## Appearance of the mean in triplicate samples

Based on a basic analysis of the RTS data, the authors suspect that a disproportionately high number of RTS’s triplicate samples contain their own rounded mean. They speculate that since means of samples are important values in radiobiological experiments, including the desired mean in the sample and constructing the other two values to match that mean would be an easy way to fabricate data with a desired mean. They proceed to construct a model to test for the presence of the mean in the triplicate samples.

A Poisson distribution was used to model both the colony and Coulter samples. As explained by the authors, the three Poisson random variables in each triple share a common parameter lambda, and lambda varies from triple to triple. Using the formula in the Appendix of the paper, we wrote a Python function to compute the probability that a triple of independent Poisson random variables with a common parameter lambda includes its rounded mean as one of the three elements. We generated a probability table for lambda in the range 1 to 2,000 and extended the table by adding the values for lambda that were multiples of 100 between 2,100 and 10,000. Due to the high computing times required to extend the table further, we did not follow the authors in including values larger than 10,000. This required discarding triples with means greater than 10,000 from this analysis; however, the discarded data made up less than 5% of the total. Given the table of probabilities for each lambda, we computed how likely it is that a researcher would collect a proportion of mean-containing triples at least as large as the one observed (i.e., the p-value).
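The authors' closed-form expression is not reproduced in this review, so the sketch below is only an illustration of the quantity being tabulated: it enumerates all triples up to a truncation point and sums the probability of those that contain their rounded mean. This brute-force approach is an assumption of ours, is cubic in the truncation point, and is practical only for small lambda. It computes the unconditional probability, with no restriction to complete triples.

```python
import math
import numpy as np

def poisson_pmf(lam, upper):
    """Poisson probabilities for 0..upper, computed in log space."""
    log_pmf = [k * math.log(lam) - lam - math.lgamma(k + 1) for k in range(upper + 1)]
    return np.exp(np.array(log_pmf))

def prob_triple_contains_mean(lam):
    """P(an IID Poisson(lam) triple contains its rounded mean), by enumeration."""
    # Truncate far into the upper tail; the neglected mass is negligible.
    upper = int(lam + 10 * math.sqrt(lam) + 20)
    pmf = poisson_pmf(lam, upper)
    k = np.arange(upper + 1)
    x, y, z = k[:, None, None], k[None, :, None], k[None, None, :]
    # The mean of three integers has fractional part 0, 1/3, or 2/3, so
    # rounding to the nearest integer is unambiguous and equals (sum + 1) // 3.
    rounded_mean = (x + y + z + 1) // 3
    hit = (rounded_mean == x) | (rounded_mean == y) | (rounded_mean == z)
    joint = pmf[:, None, None] * pmf[None, :, None] * pmf[None, None, :]
    return float((joint * hit).sum())
```

The cubic cost of this enumeration gives a sense of why extending a table to very large lambda is expensive without a more efficient formula.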

First, Scipy’s binomial survival function was used to provide an overestimate of this p-value. Then, we wrote a Python function to provide a more accurate p-value by treating the samples as independent but not identically distributed Bernoulli random variables, whose sum follows a Poisson binomial distribution. We found no available Python package for the Poisson binomial distribution. An algorithm for computing such probabilities is formulated by expanding the probability generating function and collecting the appropriate coefficients via a recursive scheme (see note 1). To guard against precision loss in our Python function, we also used the ’poibin’ R package to recalculate this p-value. Finally, we found approximate p-values using the normal approximation.
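The recursive scheme can be sketched as a repeated multiplication of the factors (1 - p_i + p_i t) of the probability generating function, keeping the coefficients at each step. The code below is an illustrative version with hypothetical function names, not the function used in the review; the normal approximation is shown without a continuity correction, which is an assumption.

```python
import math

def poisson_binomial_pmf(probs):
    """Coefficients of prod_i (1 - p_i + p_i * t), i.e. P(X = k) for k = 0..n."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # trial i fails
            nxt[k + 1] += mass * p       # trial i succeeds
        pmf = nxt
    return pmf

def upper_tail(probs, observed):
    """Exact P(X >= observed), summed directly over the upper tail."""
    return sum(poisson_binomial_pmf(probs)[observed:])

def normal_approx_upper_tail(probs, observed):
    """Normal approximation to P(X >= observed), no continuity correction."""
    mu = sum(probs)
    sigma = math.sqrt(sum(p * (1.0 - p) for p in probs))
    z = (observed - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Summing the tail terms directly, rather than subtracting a CDF from one, avoids cancellation when the p-value is very small; the recursion costs on the order of n squared operations for n triples.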

In addition to replicating the authors’ analysis of each of the seven datasets, we applied the same method to compute the p-value for each of the individual investigators whose pooled data comprises “Other Investigators.” We did not attempt to replicate the authors’ probability model for mid-ratios (see Conclusion for further discussion).

# Results and Discussion

## Summary Statistics

The summary statistics from the original paper (white cells) and from our analysis (grey cells) are presented in Table I. We found some discrepancies, both in the total and complete triplet counts and in the number of samples in which the mean is included in the triplet:

• Colony count for RTS (no. mean-containing triplets = 690 from study and 647 from our analysis)

• Coulter count for RTS (no. complete triplets = 1,716 from study and 1,726 from our analysis, no. mean-containing triplets = 173 from study and 174 from our analysis)

• Colony count for other investigators (no. complete triplets = 572 from study and 578 from our analysis, no. mean-containing triplets = 109 from study and 99 from our analysis)

• Colony count for outside lab 1 (no. mean-containing triplets = 0 from study and 1 from our analysis)

• Coulter count for outside lab 2 (no. mean-containing triplets = 1 from study and 4 from our analysis)

• Coulter count for outside lab 3 (no. mean-containing triplets = 3 from study and 6 from our analysis)

Since there is no way to figure out which entries are excluded, there is no way to exactly replicate each result in each test. We do however find values which are reasonably close to those mentioned in the paper.

None of these differences alters the conclusions since the proportion of samples where the mean is present in the triplet is still much higher in the case of RTS. However, when we analyze each investigator in the “Others” sample independently, we find that the colony counts for investigator C also have a high proportion of mean-containing triplets (20 out of 85). Even though the sample size is small, this would show a discrepancy if we evaluated each investigator in a “One vs All” fashion.

Our main concern is that it is likely that every investigator can be made to look unusual compared to the rest, if one can pick the test after looking at the data. We would be more comfortable with the conclusion if the hypothesis had been formulated using only some of the data, reserving the remaining (unexamined) data to test the hypothesis.

Table I. Summary Statistics Comparison

## Terminal Digit Analysis

Histograms of the IID Poisson triplicates generated to test for uniformity can be found in the Appendix. The chi-square statistics and associated p-values for each distribution are listed in Table II. As can be seen, the results for all simulations confirm the hypothesis of uniformity, as the p-values are well above a level where the null hypothesis would be rejected.

The chi-square test statistics for each of the datasets tested by the authors (two from RTS, two from other researchers in the lab, and three from outside labs) were similar but not equal to those reported by Pitt and Hill (only the observed frequencies for Outside Lab 1 matched exactly). Their values are reported alongside ours in Table III (the authors’ values are in the white cells; ours are highlighted in grey). Given that our summary statistics were slightly different from the authors’, it is to be expected that the terminal digit counts are also slightly different. All p-values in our analysis were effectively the same as those reported by the authors, with the exception of the other investigators’ Coulter counts. While the authors reported a p-value higher than 0.05, we found a value that was lower, albeit barely (see Table III). As the numbers reported are not hugely different (and are consistent with the other discrepancies), this finding is of interest primarily because 0.05 is often used by researchers as the threshold to reject the null hypothesis. Our findings would thus require the null hypothesis to be rejected not only for the RTS data, but also for the other researchers’ Coulter data. Of course, 0.05 is a somewhat arbitrary cut-off, and it should be noted that the other investigators’ p-value is still orders of magnitude higher than the RTS values, which are effectively zero.

Table II. Uniformity Test Results

## Equal Digit Analysis

The authors’ calculated rates of equal terminal digits in the two datasets (RTS Coulter, and the combination of the Other investigators Coulter and two outside lab Coulter datasets) were similar, but not identical, to the values we find. Table IV compares Pitt and Hill’s results (white cells) with ours (grey cells); we also applied this test to the Coulter counts from the outside labs and other investigators individually. Since the summary statistics reported in the paper and those we counted were slightly different, the differences in the “Total” and “Equal” columns of Table IV are to be expected. Despite the differences in observed counts, the percentage of data with equal digits (Column 5) and the probability of that percentage being greater than or equal to the calculated result (Column 6) were effectively the same as the results reported by Pitt and Hill.

Table III. Terminal digit analysis comparison

Table IV. Equal digits analysis comparison

The paper proposes two tests of the uniformity of terminal digits in the data. The first one is the percentage of data within a given dataset with equal rightmost pairs. The hypothesis is that if the data are “honest”, there’s a 10% chance that the final two digits will be equal. This hypothesis is tested using a binomial test. We verify the reasonableness of the assumption after the simulation.

We applied the same tests to the counts we extracted from the data, and reached essentially the same conclusions: the rates observed in the RTS data were improbably high. However, when we applied the same test to the other investigators and outside lab Coulter datasets individually, we found that, besides the RTS Coulter dataset, another Coulter dataset also gives abnormal results. There are only thirty pairs of equal rightmost digits, fewer than we would expect, among the 360 values with 3+ digits from Outside Lab 2. The authors looked for anomalously high rates of equal rightmost digits. One might also look for anomalously low rates. Outside Lab 2 had a rate that was somewhat low, with a probability of occurrence of 16.7%. This probability is not impossibly low, but it is low enough that, were we to examine this investigator’s data in isolation, we might have been led to the conclusion that it is suspect.
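The figures quoted for Outside Lab 2 can be checked with a short calculation. The sketch below assumes that the 16.7% refers to the lower-tail probability P(X ≤ 30) for X ~ Binomial(360, 0.1); the review does not say so explicitly, and the helper name is hypothetical.

```python
from math import comb

def binom_lower_tail(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))

n_total, n_equal = 360, 30
observed_rate = n_equal / n_total          # 30 / 360, about 8.3%
expected_count = n_total * 0.1             # 36 equal pairs expected at 10%
lower_tail = binom_lower_tail(n_equal, n_total, 0.1)
print(f"rate = {observed_rate:.3f}, expected = {expected_count:.0f}, "
      f"P(X <= {n_equal}) = {lower_tail:.3f}")
```

If the assumption holds, the printed tail probability should land close to the 16.7% reported above.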

Given that the sample size for Outside Lab 2 is much lower than for RTS (360 in total versus 5,177), there is less power to detect a departure from the null hypothesis for that lab. Nevertheless, these results do leave the door open to the possibility that the RTS data are not uniquely abnormal or that the expected rate of equal digit occurrence is wrong.

## Analysis of appearance of the mean in triplicate samples

Our results and the comparison to the authors’ results are summarized in Table V (again, the authors’ results are in white cells and ours in grey). Column 3 is the number of mean-containing triples. Column 4 is the number of samples expected to contain their means. Columns 5 and 6 list the standard deviation and the resulting z-value. Column 7 is an approximate p-value based on the normal approximation to the Poisson binomial tail probability, and Column 8 is the p-value based directly on the Poisson binomial distribution.

We applied the same procedure and methods to compute the p-values for each dataset, and drew the same conclusions from the results. Nonetheless, our results do not quite match the authors’ results. As we can see in Table V, most of our p-values are quite different from those reported by the authors. The p-values depend on the inferred values of lambda for each triple. Since we found slightly different counts than the authors, our calculations give different numerical values, but they do lead to the same conclusions.

Despite results that strongly single out RTS’s data, we have some reservations about the design and use of this test. Comparing the data collected by an individual to pooled data for the rest might skew the findings, especially if RTS was singled out after inspecting these data. There may be patterns in the other individual researchers’ data that cause those to stand out when compared to RTS’s data; presenting the results in an aggregated manner can obscure such trends. We would recommend analyzing the data for each individual and calculating the p-value for each of those investigators, which we did as described in the Replication Methodology section. The results presented in Table VI show that these p-values confirm the finding that the RTS data are anomalous. Ultimately, we reach the same conclusion the authors did, recognizing that there are relatively few data for researchers other than RTS, so the power to find anomalies is limited.

Additional detail would have made it easier to replicate the calculations regarding Coulter counts. Coulter count values are in a much higher range than Colony count values, and thus, the Poisson random variables that give rise to them tend to have much higher lambdas. Following the authors, we assumed for the purposes of the analysis that the means of the triples are equal to the lambda values for that Poisson distribution.

Table V. Appearance of mean in triplicate data

We created an extended probability table to accommodate higher values of lambda by adding multiples of 100 between 2,100 and 10,000. However, the authors do not explain how they selected a lambda value if the mean for the triple is greater than 2,100 and falls between consecutive multiples of 100 (for example, say lambda = 2,254: is it rounded to 2,200 or 2,300?). We chose to round it to the closest multiple of 100; this might be the cause of some of the difference between our results and the authors’. Some of the difference is presumably attributable to the slight differences between our parsing of the data and theirs. Although it does not affect the conclusion drawn from the data, it would be helpful for the authors to provide more details in the paper or in the supplementary materials for the purposes of replication and verification of the results.
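The lookup rule we describe can be written down as a small helper. The sketch below is illustrative: the function name is hypothetical, and the handling of means below 1 and of means between 2,000 and 2,100 is an assumption, since the text only specifies rounding to the closest multiple of 100 above the dense part of the table.

```python
import math

def table_lambda(mean, dense_max=2000, step=100, table_max=10000):
    """Map a triple mean to the lambda value available in the probability table.

    Returns None for means above table_max, which are discarded.
    """
    if mean > table_max:
        return None
    if mean <= dense_max:
        # Dense part of the table: every integer lambda from 1 to dense_max.
        # Assumption: means that round to 0 are mapped to lambda = 1.
        return max(1, math.floor(mean + 0.5))
    # Sparse part: nearest multiple of `step`, with halves rounded up.
    return step * math.floor(mean / step + 0.5)

print(table_lambda(2254))  # → 2300
```

Explicit half-up rounding is used instead of Python's built-in `round`, which rounds halves to the nearest even integer and would send 2,250 to 2,200.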

Table VI. p-values for each investigator

# Further Analysis

## Permutation Test: Relevance and Methodology

In addition to replicating the methodology of Pitt and Hill’s study, we performed a further permutation test based on the aggregated datasets from all researchers. Permutation tests provide a way to evaluate the evidence that RTS’s data are different from those collected by other researchers, without positing a generative model. The hypothesis tested is, essentially, whether data from RTS “are like” a random sample from the pooled data, according to some test statistic. If the value of the test statistic is in the tails of the distribution that would be obtained by randomly re-labeling the data, that is evidence that RTS’s data are not like the rest.

We used a permutation test to check whether the conclusions are sensitive to the particular generative model the authors used: if a permutation test leads to the same conclusions, that gives comfort that even if the generative model is incorrect, the conclusions still hold. We combined all the data from all samples in each of the two count categories (Coulter and colony) and then randomly assigned data to different investigating groups. The size of the data for each investigator was maintained. Each of the three statistical tests (terminal digit, equal digit, and appearance of the mean) was then repeated for this randomized sample.
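A conventional form of such a test is sketched below, assuming NumPy. It shuffles investigator labels across the pooled data, which preserves each group's size, and compares the observed statistic for one investigator against its re-labeled distribution. This is an illustration of the general technique rather than the exact procedure of the review, which reports the three tests for a single randomized assignment; the function names, the example statistic, and the number of permutations are all assumptions.

```python
import numpy as np

def equal_last_two_rate(values):
    """Example statistic: share of values whose last two digits are equal."""
    values = np.asarray(values)
    return float(np.mean(values % 10 == (values // 10) % 10))

def permutation_pvalue(values, labels, target, statistic, n_perm=1000, seed=0):
    """One-sided permutation p-value for `statistic` on the `target` group."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    labels = np.asarray(labels)
    observed = statistic(values[labels == target])
    hits = 0
    for _ in range(n_perm):
        # Shuffling the labels keeps every group's size unchanged.
        shuffled = rng.permutation(labels)
        if statistic(values[shuffled == target]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Because only labels are shuffled, no assumption about how the counts were generated is needed, which is the property the paragraph above relies on.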

## Results

We present the results from one of the permutation runs (seed for numpy.random = 4210163759) in Table VII. The results of the permutation test show that randomizing the data does lead to every sample containing a roughly equal proportion of mean-containing triplets and a roughly equal probability of containing equal pairs of terminating digits. This further corroborates the claim that the corresponding values obtained for the RTS dataset were suspiciously high. Nevertheless, we still find that the p-values are high enough that we cannot outright dismiss the possibility that the RTS data are consistent with those from any other investigator. Noticeably, the p-values that do not cause the null hypothesis to be rejected tend to reflect smaller data samples (highlighted in green). This observation leads us to speculate that the large sample size of the RTS dataset relative to the other datasets decreases the power of the tests performed. Though the permutation test offered nothing conclusive, we have included it as a demonstration of an analysis that does not rely on a model for how the data were generated.

Table VII. Permutation test results

# Conclusion and General Thoughts

Overall, our analysis confirms the conclusions of the authors. In nearly every case, the same inferences would be drawn from our results as the authors’.

We are grateful to the authors for making their data available. It would help to have the scripts they used to process the raw data into the data used for subsequent tests. As a side note, a small editing error appears in Tables 2 and 3 of the original paper: the number of the outside labs is inconsistent between the tables as well as when compared to the raw data filenames.

We were unable to replicate the exact values in the summary statistics in the paper (see Table I). However, the deviations in values are not significant and do not affect any of the test statistics.

We find that the sample size of the data in the experiments labeled “Outside Lab” is low; this potentially makes comparisons to outside lab data difficult due to the disparity in sample sizes between RTS data and outside lab data. It is thus hard to assess if the data characteristics (such as whether the mean is present in the triplet and terminal digit analysis) can be compared across multiple experiments. This issue arose in the equal digit analysis; the appearance of the mean analysis; and the permutation test.

We also observed that the statistical tests performed focused specifically on ways in which the RTS data were salient. The authors did not elaborate on how they came to run statistical tests on this dataset in particular, so absent any more detailed explanation, we may suppose that RTS’s data were suspect before the tests were performed and that the tests were selected specifically to address his/her data. In general, we are somewhat skeptical of this practice of designing and verifying a test based on the same set of data, as it may create the tendency (conscious or not) to test for only the anomalies that are already suspected. Thus, the tests become a self-fulfilling prophecy, identifying one dataset as anomalous even though it is possible that every investigator could be made to look unusual compared to the rest if one can pick the test after looking at the data. We suggest that the methods presented by the authors could be further validated if applied to datasets that were not previously suspect.

Overall, we can corroborate the findings of Pitt and Hill, and find their paper and data to be a valuable contribution to the literature.

J. H. Pitt and H. Z. Hill, “Statistical analysis of numerical preclinical radiobiological data,” pp. 1–11, 2016.

# Appendix

Histogram of IID Poisson R.V. generated based on RTS Colony triplicate means

Histogram of IID Poisson R.V. generated based on RTS Coulter triplicate means

Histogram of IID Poisson R.V. generated based on Others Colony triplicate means

Histogram of IID Poisson R.V. generated based on Others Coulter triplicate means

Histogram of IID Poisson R.V. generated based on Outside Lab 1 Coulter triplicate means

Histogram of IID Poisson R.V. generated based on Outside Lab 2 Coulter triplicate means

Histogram of IID Poisson R.V. generated based on Outside Lab 3 Colony triplicate means

1. Marlin A. Thomas & Audrey E. Taub (1982) Calculating binomial probabilities when the trial probabilities are unequal, Journal of Statistical Computation and Simulation, 14:2, 125-131, DOI: 10.1080/00949658208810534 (The link to this article: http://dx.doi.org/10.1080/00949658208810534)

Nima Hejazi evaluated the article as: Rated 3.5 of 5.
Publication date: 26 January 2017 DOI: 10.14293/S2199-1006.1.SOR-STAT.AFHTWC.v1.RXIDZS Level of importance: Rated 3 of 5. Level of validity: Rated 4 of 5. Level of completeness: Rated 2 of 5. Level of comprehensibility: Rated 4 of 5. Competing interests: None

# Introduction

This review reports the results of attempts to reproduce the findings presented by Joel H. Pitt and Helene Z. Hill in the paper “Statistical analysis of numerical preclinical radiobiological data,” published in ScienceOpen Research in January 2016. We thank the authors for making their data public and for publishing in an open journal, both of which made the work we present in this review possible. We comment on the strengths and potential areas of improvement of the aforementioned paper, based on our experience in replicating the reported results. In addition, we offer suggestions for improving the statistical analyses and computational reproducibility of said paper, including questions and potential issues that arose in our re-analysis of the data. We note that this review was vetted by Philip B. Stark – and, accordingly, are thankful for the commentary he provided; still, we emphasize that the work and opinions found in the present review are those of the authors alone. Further, we would like to add that the authors – Nima Hejazi, Courtney Schiffman, and Xiaonan Zhou – all contributed equally to the work that led to the production of this review. Finally, in keeping with the goal of reproducible research, all of our work, largely in the form of Jupyter notebooks, is publicly available on GitHub at https://github.com/nhejazi/stat215a-journal-review; we welcome independent review and commentary.

In the paper, the analyses reported by Pitt and Hill appear to be motivated by the belief that a single investigator (RTS) in one of the labs studied reported false, fabricated colony and Coulter counts in radiobiological data from a preclinical study. How the authors came to harbor such a suspicion is, to our understanding, never made entirely clear (e.g., were they exploring the data, or did they hear a tip from a colleague?). To test their suspicion in a manner more rigorous than simply plotting and comparing distributions of counts and mid-ratios, the authors carried out several hypothesis tests against null hypotheses that would hold if the data were not fabricated. For example, the authors assert that, if the data were produced honestly, the distribution of terminal digits in the Coulter counts would be uniform and that the number of triples containing their own means should be limited. Such null hypotheses were based on the authors’ speculations as to how data would be fabricated (e.g., start by choosing a mean value, then subtract and add some value from the mean) and how counts would be generated if they were the result of true biological processes (e.g., independently with equally likely digits).

To carry out tests examining the means of the observed count triples, the authors assumed that each independent triple was composed of a set of three I.I.D. Poisson random variables. In settings where the number of successes found in independent trials was being tested (as in the case of the number of equal terminal digits), the authors used the binomial distribution; to test for a supposed multinomial distribution, they used the well-known chi-square goodness of fit test. In employing these well-known statistical tests, the authors were able to present a probabilistic perspective as to whether the data were fabricated. All tests performed by the authors suggest that the investigator of interest (RTS) fabricated at least some data.

# Replication of Results

In keeping with the ideals of computational reproducibility, accuracy, and honesty in academic research (motivations of the original authors themselves), we sought to reproduce many of the data summaries and analyses reported in the paper by Pitt & Hill. The vast majority of our analyses were performed using the Python programming language (v.3.5). We were able to reproduce Table 1 using the formula for the probability that a triple of I.I.D. Poisson random variables contains its own mean, found in the appendix of the original paper. We did not check the proof of the formula, but we were able to very nearly replicate computationally the results reported from its use. Next, we attempted to replicate Table 2; however, some of the values we found for complete, total, and mean-containing triples differ from those presented in the paper (see Table [table1] of this review and Table 2 of the original paper). For example, the authors of the paper report 1,716 complete triples, 1,717 total triples, and 173 mean-containing triples for the RTS Coulter dataset. This is at odds with the 1,726 complete triples, 1,727 total triples, and 176 mean-containing triples we found in the RTS Coulter dataset. Despite this, we believe our summaries of the RTS Coulter counts to be consistent with the dataset. The Excel spreadsheet for this dataset contains 1,729 observations, and from finding all NA values in the dataset, we know there are only 2 NA values, in two different triples. This would leave 1,727 total triples, which was the number obtained by two of our team members independently when running the analysis on the data in Python. As in the paper, we find only one triple to have a gap less than two, which leaves 1,726 complete triples, 10 more than the number reported in the paper. This difference of 10, we believe, is also responsible for the discrepancy in the number of mean-containing triples. We hypothesize that the 10 triples missing from their reported values are due to an additional, unreported filtering step or a computational error.

For the analysis of triplicate count data, the authors proposed that a researcher wishing to guide the results of their experiments would likely arrange the data values in the triples so that their means are consistent with the desired result. The easiest way to construct such triples is to choose the desired mean as one of the three count values and to use two roughly equal constants to calculate the other two counts as the mean value plus or minus the constants. Data constructed in this manner would be expected to have (1) a high concentration of mid-ratios (the middle value minus the smallest value, divided by the gap between the largest and smallest values) close to 0.5, and (2) a large number of triples that include the mean as one of their values.
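The two quantities described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the example triple are hypothetical, and the mean is assumed to be rounded to the nearest integer.

```python
def mid_ratio(triple):
    """Return (middle - smallest) / (largest - smallest), or None if the gap is zero."""
    low, mid, high = sorted(triple)
    gap = high - low
    if gap == 0:
        return None  # the mid-ratio is undefined when all three counts are equal
    return (mid - low) / gap


def contains_rounded_mean(triple):
    """True if the mean, rounded to the nearest integer, is one of the three values."""
    # A sum of three integers divided by 3 never ends in exactly .5,
    # so the rounding direction is unambiguous.
    return round(sum(triple) / 3) in triple


# Hypothetical fabricated triple: desired mean 100, offsets of 15 and 14.
fabricated = (85, 100, 114)
print(mid_ratio(fabricated), contains_rounded_mean(fabricated))
```

A triple built around a chosen mean with two nearly equal offsets yields a mid-ratio close to 0.5 and contains its rounded mean, which is the pattern both tests look for.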

To test this hypothesis about the mid-ratio, we reproduced the histograms of mid-ratios for RTS and the other investigators, as shown in Figure [fig:midratio]. The histogram of mid-ratios for RTS shows a high percentage in the range 0.4 to 0.6, approximately 59%, while the histogram of mid-ratios recorded by the other members is fairly uniform across the 10 intervals.

To evaluate the significance of the high percentage of triples having mid-ratios close to 0.5, we attempted to write a Python function to calculate the probability that the mid-ratio of a triple with a given parameter λ falls within the interval [0.40, 0.60]. The paper did not explicitly show how the authors calculated the probability that a triple of I.I.D. Poisson variables has a mid-ratio in the interval [0.4, 0.6], so we attempted to reproduce their results as best we could. We tried to calculate the probability of a mid-ratio falling within the interval [r1, r2] (0 ≤ r1 ≤ r2 ≤ 1) in a manner similar to the probability calculation for Poisson triples containing their rounded mean found in the Appendix of Pitt & Hill:

The probability that a triple randomly generated by three independent Poisson random variables with a given λ contains a mid-ratio in [r1, r2] is the probability of the union of an infinite number of mutually exclusive events:

$$A_{j} = \text{the event that the gap is equal to } j \text{ and the mid-ratio is within } [r1, r2], \quad j = 2, 3, 4, \dots$$

For each j, A_j is the union of an infinite number of mutually exclusive events:

$$A_{j,k} = \text{the event that the largest value is } k \text{, the gap is } j \text{ and the mid-ratio is within } [r1, r2], \quad k = j, j + 1, \dots$$

$$P(A) = \sum_{j=2}^{\infty} P(A_j) = \sum_{j=2}^{\infty}\sum_{k=j}^{\infty}P(A_{jk})$$

To calculate P(A_{jk}), note that the smallest of the three elements must be k − j and the largest must be k. The mid-ratio lies in [r1, r2] exactly when r1 ≤ (midvalue − (k − j))/j ≤ r2, which can be rewritten as (r1 ⋅ j + k − j) ≤ midvalue ≤ (r2 ⋅ j + k − j). Let a = r1 ⋅ j + k − j and b = r2 ⋅ j + k − j; then we need to find P(a ≤ midvalue ≤ b) = CDF(b) − CDF(a).

Since the elements of our triples are assumed to be independently generated Poisson random variables with common parameter λ, and accounting for the 3! permutations of the three values, we have:

$$P(A_{jk}) = 6 \cdot \frac{e^{-\lambda} \lambda^{k - j}}{(k-j)!} \cdot \frac{e^{-\lambda} \lambda^{k}}{k!} \cdot \left[ \left(1 - e^{-\lambda(r2 \cdot j + k - j)}\right) - \left(1 - e^{-\lambda(r1 \cdot j + k - j)}\right) \right]$$

Thus, the formula for obtaining P(A) is $$P(A) = 6 \cdot \sum_{j=2}^{N} \sum_{k=j}^{N} \frac{e^{-\lambda} \lambda^{k - j}}{(k-j)!} \cdot \frac{e^{-\lambda} \lambda^{k}}{k!} \cdot \left[ \left(1 - e^{-\lambda(r2 \cdot j + k - j)}\right) - \left(1 - e^{-\lambda(r1 \cdot j + k - j)}\right) \right]$$

where we choose N such that $$\sum_{j=0}^{N} \frac{e^{-\lambda} \lambda^{j}}{j!} \geq 1-10^{-9}$$.

Our calculated probabilities for the mid-ratio falling within [0.4, 0.6] for different values of λ are shown in Figure [fig:poisson-mid-ratio]. We were not able to perfectly replicate the probabilities the authors report, which they say reach a maximum of roughly 0.26, but our results came close. Our calculated probabilities, like those in the paper, are substantially lower than the proportion of mid-ratios in [0.4, 0.6] that we saw in the RTS data, and so provide some additional evidence of data fabrication.
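As an independent cross-check on a calculation of this kind, the same joint probability (gap of at least 2 and mid-ratio in [r1, r2]) can be obtained by brute-force enumeration over truncated Poisson triples. The sketch below is an assumption-laden alternative, not the calculation used in this review or in the paper: it uses the Poisson probability mass function directly, the function names are hypothetical, and it reuses the 1 − 10^−9 truncation rule stated above. Its output need not match either set of reported values.

```python
import math
from fractions import Fraction


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def prob_mid_ratio_in(lam, r1=Fraction(2, 5), r2=Fraction(3, 5), tail=1e-9):
    """Joint probability that an i.i.d. Poisson(lam) triple has gap >= 2
    and a mid-ratio in [r1, r2], by enumeration over a truncated support."""
    # Truncate at the smallest n_max whose cumulative mass reaches 1 - tail.
    n_max, cum = 0, poisson_pmf(0, lam)
    while cum < 1 - tail:
        n_max += 1
        cum += poisson_pmf(n_max, lam)
    pmf = [poisson_pmf(k, lam) for k in range(n_max + 1)]

    total = 0.0
    for low in range(n_max + 1):
        for high in range(low + 2, n_max + 1):        # gap of at least 2
            gap = high - low
            for mid in range(low, high + 1):
                # Exact rational comparison avoids rounding at the interval ends.
                if r1 * gap <= mid - low <= r2 * gap:
                    # 6 orderings if all values differ, 3 if mid ties an end.
                    orderings = 3 if mid in (low, high) else 6
                    total += orderings * pmf[low] * pmf[mid] * pmf[high]
    return total


for lam in (1.0, 5.0, 25.0):
    print(lam, prob_mid_ratio_in(lam))
```

Passing the interval ends as `Fraction` objects keeps the comparison exact, which matters because a mid-ratio of exactly 0.4 or 0.6 is common for small gaps.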

We then attempted to reproduce Table 3. Of course, we expected our Table 3 results to vary slightly from those in the Pitt & Hill paper wherever our numbers of complete and total triples in Table 2 differed from theirs to begin with. This is the case for the RTS Coulter and Others colony counts. However, we find their total number of terminal digits to be puzzling in other cases as well. For example, for the RTS Colony counts, we agree with the paper that there are 1,361 total triples. There would be 1,362, but one of the triples contains a missing value. This would mean that there are 1,362 ⋅ 3 − 1 = 4,085 total terminal digits. Indeed, this is the number of terminal digits we found in the RTS Colony data after running our program to extract terminal digits from non-missing values. However, the paper reports that for the RTS Colony data there are only 3,501 total terminal digits. Where this total comes from, given that the authors also report 1,361 total triples, is mysterious to us. For the “Equal Digit Analysis,” we replicated their results for the other investigators’ Coulter counts. Because we had different numbers of terminal digits (and therefore also of last two digits) for the RTS Coulter counts, we had slightly different results for the equal terminal digit test for this data set, but the null hypothesis is still rejected (5,185 pairs of terminal digits, 644 pairs with equal digits, p-value = 1.043 × 10^−8).
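The arithmetic and the equal-digit test above can be illustrated as follows. This is a sketch under a stated assumption: it uses a one-sided exact binomial test against a 10% chance of equal final digits, which may not be the test that produced the p-value quoted above, so the printed value is not expected to reproduce it exactly. The function name is hypothetical.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summing terms computed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return total


# 1,362 RTS colony triples with a single missing value.
n_digits = 1362 * 3 - 1

# RTS Coulter counts quoted in the text: 644 equal pairs out of 5,185.
n_pairs, n_equal = 5185, 644
p_value = binom_upper_tail(n_equal, n_pairs, 0.1)   # assumed one-sided test
print(n_digits, round(n_equal / n_pairs, 4), p_value)
```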

We successfully reproduced their results for Hypothesis Test 1, assuming their upper bound of 0.42 on the probability that a triple of I.I.D. Poisson random variables contains its mean is correct. For Hypothesis Test 2, we were able to nearly replicate the p-value for the Poisson Binomial test of the RTS colony data, with a p-value of 3.553 × 10^−15, but could not replicate the p-value exactly. This is most likely due to small differences in the code for calculating the probability that each triplicate contains its mean. For the same reason, we were unable to perfectly replicate the authors’ results for Hypothesis Test 3 of the RTS colony data. We found an expected value of 214.924 with a standard deviation of 13.28, whereas the authors found an expected value of 220.31 with a standard deviation of 13.42. Our numbers are close, however, and still result in rejecting the null hypothesis at a highly significant level. It would be beneficial to have the authors’ code in order to identify the source of the variation.

Number of complete, total and mean-containing triples from our analysis. Results for RTS Coulter and Others Colony counts differ slightly from those reported in the original paper.

| Data set | Complete | Total | Contain Mean |
|---|---|---|---|
| RTS Colony | 1343 | 1361 | 690 |
| RTS Coulter | 1726 | 1727 | 176 |
| Others Colony | 578 | 597 | 109 |
| Others Coulter | 929 | 929 | 36 |
| Lab 1 | 97 | 97 | 0 |
| Lab 2 | 120 | 120 | 1 |
| Lab 3 | 49 | 50 | 3 |

Number of terminal digits from 0, 1, …, 9. Results for RTS Coulter, RTS Colony, Others Colony and Lab 1 counts differ slightly from those reported in the original paper.

| Data set | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total | P-Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RTS Colony | 564 | 324 | 463 | 313 | 290 | 478 | 336 | 408 | 383 | 526 | 4085 | 2.33378e-38 |
| RTS Coulter | 475 | 613 | 736 | 416 | 335 | 732 | 363 | 425 | 372 | 718 | 5185 | 7.06227e-95 |
| Others Colony | 191 | 181 | 195 | 179 | 184 | 175 | 178 | 185 | 185 | 181 | 1834 | 0.9943625 |
| Others Coulter | 261 | 311 | 295 | 259 | 318 | 290 | 298 | 283 | 331 | 296 | 2942 | 0.066995 |
| Lab 1 | 28 | 34 | 29 | 25 | 27 | 36 | 44 | 33 | 26 | 33 | 315 | 0.394527 |
| Lab 2 | 34 | 38 | 45 | 35 | 32 | 42 | 31 | 35 | 35 | 33 | 360 | 0.839124 |
| Lab 3 | 21 | 9 | 15 | 16 | 19 | 19 | 9 | 19 | 11 | 12 | 150 | 0.20589 |
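
The p-values in the terminal digit table are consistent with a chi-square goodness-of-fit test against a uniform distribution over the ten digits. The sketch below is one way to carry out that test using only the standard library. The function names are hypothetical, and the upper-tail formula is the closed form that holds for 9 degrees of freedom, so it applies only to ten-category tests.

```python
import math


def chi2_sf_df9(x):
    """Upper-tail probability of a chi-square variable with 9 degrees of freedom
    (closed form available for odd degrees of freedom)."""
    series = sum(x ** (i + 0.5) / d for i, d in enumerate((1, 3, 15, 105)))
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * series


def uniform_digit_test(counts):
    """Chi-square statistic and p-value for ten digit counts against uniformity."""
    if len(counts) != 10:
        raise ValueError("expected one count for each digit 0-9")
    expected = sum(counts) / 10
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi2_sf_df9(stat)


# Rows taken from the terminal digit table above.
others_colony = [191, 181, 195, 179, 184, 175, 178, 185, 185, 181]
rts_colony = [564, 324, 463, 313, 290, 478, 336, 408, 383, 526]
print(uniform_digit_test(others_colony))
print(uniform_digit_test(rts_colony))
```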

# What Was Done Well

The authors should be commended on their efforts to develop tests which can help to address the issue of dishonesty in research. Working to improve the reliability and truth of published data and results is important and valuable. Developing methods to test for data fabrication is crucial given the large volume of published reports claiming significant findings. The paper does a good job of identifying and discussing this need and providing methods to test data fabrication that go beyond simple exploratory analysis of the data. We find the hypothesis tests concerning mid-ratios to be particularly convincing, as these tests rely on a reasonable guess at how data is fabricated which researchers could make prior to examining the data of the investigator.

While parametric assumptions are not ideal, assuming a Poisson distribution for count data is a common and widely accepted practice, and seems to be an appropriate choice of parametric models for the nature of this data. We found their probability calculations involving the Poisson distribution in the Appendix to be insightful and thorough, and accurate from what we could tell. The authors’ inclusion of a closed form solution for the probability that a triple of I.I.D. Poisson random variables contains their rounded mean is noteworthy and appreciated. Furthermore, we found their inclusion of three variants on the test for a high number of mean containing triples in the RTS data to be very thorough. The inclusion and analysis of the data from fellow investigators and other labs is also a great strength of the paper. What is more, they appropriately applied all statistical tests given a set of assumed distributions and independence relations for the data. If they were motivated to investigate the honesty of the data prior to examining it, then their methods of carrying out this examination (i.e., the tests and summary statistics they chose to expose and prove the dishonesty) are intuitive and understandable. We find their results to be, overall, quite convincing.

# Suggested Improvements

We acknowledge that assuming parametric models for the data generating processes is a common statistical practice, as it allows for closed form solutions of p-values; however, how are we to ensure that the assumed parametric models are correct? It is unlikely that the assumed models are the true data generating distributions, and it has been well shown that under model misspecification the parameters being estimated are not the intended ones. The authors based Hypothesis Tests 1-3 on the assumption that the triplicates were independent sets of 3 I.I.D. Poisson random variables, and the reliability of the p-values resulting from these tests is bound to the truth of this parametric assumption. We suggest that instead of relying on model assumptions to carry out hypothesis tests, the authors use nonparametric permutation tests, which rely solely on the data to determine if certain patterns are surprising. One benefit of these permutation tests over assumed parametric models is that one avoids the complex probability calculations found in the paper, which are difficult to reproduce. The authors provide a formula for the probability that a triple of Poisson random variables contains its mean, but they do not provide the formula for the probability that the mid-ratio lies in [0.4, 0.6], as far as we can tell. These complex formulas make the task of reproducing the authors’ results challenging, whereas permutation tests can be easily carried out and do not require complex probability calculations.

Another suggestion for the paper is that the authors discuss why they believe terminal digits in the data should be distributed uniformly if the data is honest. The reasoning behind this assumption was not entirely clear, as it seems possible that in biological processes such as plated cell survival and replication, terminal digits may not necessarily be distributed equally across all 10 digits. Indeed, the p-value for the chi-square goodness of fit test for the other investigators’ Coulter counts is close to the arbitrary 0.05 p-value cutoff, with a p-value of 0.067, showing that the distribution of their Coulter counts is close to being rejected as a uniform (0, 1, …, 9) distribution. Therefore, there is evidence in the data from other investigators that a uniform distribution may not be a correct assumption. However, the authors do not discuss this.

Similarly, the authors should discuss further the logic behind using, as a test for fabrication, the heuristic that the final two digits would match less than 10% of the time. Their argument appears to be that people tend to think that matching terminal digits are rarer by chance than they really are, leading to the conclusion that fabricated data would have matching terminal digits less than 10% of the time. The reasoning behind testing mid-ratios and the frequency of triplicates which contain their means, by contrast, seems reasonable and intuitive: if a researcher is going to fabricate their data, they might do so by choosing favorable means for the triplicates and then adding and subtracting a given value. Though we provide our interpretation of the rationale the authors used in designing the equal terminal digit test, the authors do not make their line of reasoning clear in the paper. Therefore, we suggest that the authors focus more on the tests concerning mid-ratios of the triplicates, as there is an intuitive story behind why these tests are valid for catching potential fabrication in the data. Furthermore, the mid-ratio hypothesis tests are more comprehensive than the mean-containing tests, since the case of a triplicate containing its mean (mid-ratio equal to 0.5) is included in testing whether mid-ratios are in the interval [0.4, 0.6]. It seems likely that if data were to be fabricated, investigators would not add and subtract exactly the same number, but slightly different numbers, and would therefore produce many triplicates with mid-ratios in [0.4, 0.6].

Knowing the authors’ motivations behind testing the RTS data for anomalous patterns is important. Did they use the RTS data both to initiate their investigation and to prove the data’s fabrication? This is an important point that the authors should clarify, because if the same data were used to bring about the investigation and to draw conclusions, this would make the findings less significant. Anomalous patterns in the data can always be found if you are looking for them, and attaching a level of unexpectedness, such as a p-value, to these anomalies after they are found is misleading. For example, if we were looking at all data sets in the paper for surprising patterns, we would note that Lab 3 has a surprising number of equal counts in the terminal digits table (Table 3): it has exactly 19 counts for each of the terminal digits 4, 5 and 7. We could hypothesize that this seems really unlikely if their data were honest, and that this lab’s data appear too uniform, as if they were fabricating their counts to make them appear uniform across terminal digits. We could use this observation to come up with a test for the honesty of their data and find the probability that three or more terminal digits have the same number of counts if all digits are equally likely. We might find that this probability is very small, and would then reject the null hypothesis that Lab 3 has equally likely terminal digits and conclude that the lab is fabricating its data. This example shows that the authors should explain how they came to test the RTS investigator in order for readers to correctly interpret the p-values in the paper.

# Remaining Questions

As mentioned above, one of our main concerns is the uncertainty surrounding how the authors became aware of possible dishonesty in the RTS data. Did they use the same data both as grounds for the accusation of dishonesty and as proof of fabrication? If so, this changes the interpretation of the paper’s p-values. If the same data are used to discover and to test falsehood, the statistical inferences from the analyses will be misleading.

We have remaining questions regarding how the authors calculated some of their colony and Coulter counts, and why their terminal digit counts appear to be inconsistent with their total triplicate counts. How were they counting the number of terminal digits? Did they include additional filtering steps to produce occasionally smaller counts than we produced?

Finally, we wonder how the comparison with other labs and investigators would change if they had as many triplicate observations as the RTS investigator. The outside labs used for comparison had fewer Coulter and colony counts than the RTS investigator, and this changes the power of the tests and the distribution of mid-ratios. We would like to re-run a similar analysis but with data from other investigators and labs which contained a similar number of colony and Coulter counts.

# Appendix of Figures

Raaz Dwivedi evaluated the article as: Show full review    Rated 4 of 5.
Strong advocacy of detecting scientific fraud using reproducible statistical methods
 Publication date: 01 February 2017 DOI: 10.14293/S2199-1006.1.SOR-STAT.AFHTWC.v1.RLCGLI Level of importance:     Rated 4 of 5. Level of validity:     Rated 4 of 5. Level of completeness:     Rated 4 of 5. Level of comprehensibility:     Rated 4 of 5. Competing interests: None Recommend this review: +1

# 1 Introduction

This peer review of Hill and Pitt (2016) was co-authored with Antonio Iannopollo and Jiancong Chen at the University of California, Berkeley. We review the paper in the spirit of promoting reproducibility of research and attempt to replicate the authors’ work. We also discuss other methods to identify anomalies, and present results based on our analysis using Permutation Tests. Permutation tests are consistent with the aim of the paper–providing simple tools for detecting anomalies–and validate the results in the paper, leading to the same conclusions.

Before diving into technical details, we make a minor observation: the organization of the paper was not properly introduced. The use of distinct sections for (1) the discussion on data and experiments; (2) their model and related calculations; (3) the application of common tests from the literature; and (4) conclusions, would have been helpful. The review is organized as follows. In section 2 we replicate authors’ work and results and discuss weaknesses of their approach. In section 3, we propose and implement additional tests to consolidate the results. We finally draw our conclusions in section 4.

## 1.1 Problem Set Up

The paper begins by voicing a growing concern towards “Scientific fraud and Plagiarism” in the scientific community and is successful in conveying a strong message. The authors present some statistical figures and point out the existence of easy statistical tools to detect fabricated data and ignorance about such tools.

The authors examine datasets from radiobiological experiments. They find that data reported by one of 10 researchers, the “RTS”, is suspicious. They perform three different tests to validate their suspicion and also validate their tests and assumptions by looking at the data obtained from three other sources. Each researcher made two types of triple measurements - colony counts and Coulter counts. The authors suspect that the RTS fabricated data triples to get the mean s/he desired in each triple by setting one observation equal to the desired mean and the other two roughly equidistant above and below that value. This would result in triples that contain the (rounded) mean as one of their values.

The methodological contribution of the paper is “bounds and estimates for the probability that a given set of n such triples contains k or more triples which contain their own mean” when the entries of each of the n triples are independent and identically distributed (i.i.d.) Poisson and triples are independent of each other. (Different triples may have different Poisson means.) For this Poisson model, the chance that the RTS’s data would contain so many triples that include their rounded mean is astronomically low. They also apply more common tests for anomalous data, based on statistics such as the frequency of the terminal digit and the frequency with which the last two digits are equal. However, some questions that the paper leaves largely unaddressed are discussed below:

• The authors write, “Having observed what appeared to us to be an unusual frequency of triples in RTS’s data containing a value close to their mean, we used R to calculate the mid-ratios for all of the colony data triples that were available to us.” This suggests that the same data–and the same feature of the data–that raised their suspicions about the RTS was the data used to test whether the RTS’s data were anomalous on the basis of that feature. If so, then the nominal p-values are likely to be misleadingly small.

• Most of the tests assume a model for the observations and compare the RTS’s data to that model. The authors validate the assumptions of the model by comparing it with the data pooled for the other researchers. Pooling the data in this way may hide anomalies in the other researchers’ data. Permutation tests allow the data from each researcher to be compared to the data from the other researchers without positing a generative model for the data. On the other hand, the bulk of the data available is from the RTS. Rejecting the hypothesis that another researcher’s data look like a random sample from the pooled data, when the pool includes the RTS’s data, does not imply that s/he is suspicious. Instead, it shows that his/her data are not like those of the RTS. See section 3 of this review for more discussion.

# 2 Reproducibility of Results

This section discusses our efforts to replicate the analyses in the paper. After fine tuning, we were able to replicate most of their results, obtaining similar results in the other cases. Our work is available on github.com/ianno/science_review. The original datasets used for the paper and also used in this review can be found at https://osf.io/mdyw2/files/. We first discuss specifics about the replication and then comment about the tests and methods involved.

## 2.1 Mid-Ratio Analysis

The authors first consider the mid-ratio, which is defined for a triple (a, b, c), a < b < c as $$\frac{b-a}{c-a}$$, and show that the histogram of RTS’s data concentrates abnormally around the 0.4 − 0.6 range, compared to the data taken by all the other lab members. After adjusting the default histogram function in NumPy, we were able to obtain plots similar to the ones reported in Figure 1 of the paper. Two noticeable differences were: (1) we obtain a 44% chance of seeing a mid-ratio in the (0.4, 0.5] interval for RTS, compared to the 50% chance reported in the paper, and (2) we used 1360/1361 and 595/595 triples to compute the histograms for RTS and the rest respectively, compared to the 1343/1361 and 572/595 triples used by the authors. We believe the authors did not provide enough information about the methods used to filter data for this section. However, such minor differences did not warrant further investigation.
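Because a mid-ratio of exactly 0.5 is common in these data, the choice of which bin owns each edge affects the histogram. The sketch below shows one way to bin mid-ratios into right-closed intervals such as (0.4, 0.5] using exact rational arithmetic. It is an illustration only: the function names and example triples are hypothetical, and it does not reproduce the NumPy adjustment used for this review.

```python
import math
from fractions import Fraction


def mid_ratio_bin(triple, n_bins=10):
    """Index of the right-closed bin (i/n, (i+1)/n] holding the triple's mid-ratio.
    A ratio of exactly 0 goes in the first bin; returns None if the gap is zero."""
    low, mid, high = sorted(triple)
    if high == low:
        return None
    ratio = Fraction(mid - low, high - low)   # exact, so 1/2 sits on the edge reliably
    return max(math.ceil(ratio * n_bins) - 1, 0)


def mid_ratio_histogram(triples, n_bins=10):
    """Count how many triples fall in each right-closed mid-ratio bin."""
    counts = [0] * n_bins
    for triple in triples:
        index = mid_ratio_bin(triple, n_bins)
        if index is not None:
            counts[index] += 1
    return counts


# Hypothetical triples; the last one has zero gap and is skipped.
example = [(10, 15, 20), (0, 4, 10), (0, 5, 10), (0, 0, 10), (3, 3, 3)]
print(mid_ratio_histogram(example))
```

With left-closed bins, the two triples with mid-ratio 0.5 would instead be counted in the [0.5, 0.6) interval, which is one plausible source of small differences between histograms.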

## 2.2 Probability Model and Hypothesis Tests

The authors develop a model to bound the probability of observing that k out of n triples contain their mean. Each entry in a triple is assumed to be an independent sample from a Poisson distribution with mean λ. (Different triples may have different means.) The event of observing the rounded mean in such a triple is a Bernoulli random variable (BRV) whose success probability depends on λ. The authors derive analytical expressions for these success probabilities in Appendix A. Numerical values of these probabilities, for λ = {1, …, 25}, are presented in Table 1. We could replicate this table exactly. For large λ (>2000), for which the authors provide only a few representative probability values, our implementation suffered from numerical issues.

Using Table 1, the authors determine the success probability for the BRV in two different ways and use it to compute the chance of observing the data. For hypothesis test I (non-parametric), they used the maximum value from Table 1 as an upper bound for all triples, essentially treating all BRVs as i.i.d. Bernoulli(0.42). Replicating this was straightforward. For hypothesis tests II and III, the authors use the maximum likelihood estimate of λ for each triple to compute the corresponding success probability, essentially treating each BRV as having a different distribution. The authors refer to the sum of these BRVs as a “Poisson Binomial Random Variable”. Additionally, for hypothesis test III, the authors use the normal approximation for the Poisson binomial random variable. We could replicate the probability values, up to minor errors, for the colony data. Limitations of our implementation gave inaccurate results for the Coulter data. As a sanity check, we used linearly interpolated estimates from the paper (for intermediate λ) and obtained values similar to those in the paper for these tests. Figure 1 is our approximate replication of Table 2 from the paper.
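The difference between the exact Poisson binomial tail and its normal approximation can be sketched as follows. This is a minimal illustration rather than the implementation used for this review or by the authors: the function names are hypothetical, the per-triple success probabilities are made-up placeholders, and the exact tail is computed by a simple quadratic-time recursion.

```python
import math


def poisson_binomial_upper_tail(probs, k):
    """P(S >= k), where S is a sum of independent Bernoulli(p_i) variables."""
    dist = [1.0]                                # dist[s] = P(partial sum == s)
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1 - p)            # this trial fails
            new[s + 1] += mass * p              # this trial succeeds
        dist = new
    return sum(dist[k:])


def normal_upper_tail(k, probs):
    """Normal approximation to P(S >= k), without continuity correction."""
    mean = sum(probs)
    sd = math.sqrt(sum(p * (1 - p) for p in probs))
    z = (k - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical success probabilities for 300 triples, and a hypothetical count.
probs = [0.30, 0.35, 0.40] * 100
observed = 150
print(poisson_binomial_upper_tail(probs, observed), normal_upper_tail(observed, probs))
```

Setting every entry of `probs` to 0.42 reduces the exact calculation to the binomial bound of hypothesis test I.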

## 2.3 Digits Analysis

The authors also perform some common tests for fraud detection - terminal digit analysis and pair of equal terminal digits analysis. These tests are based on the assumption that, in general, insignificant digits of a random sample are uniformly distributed.

### 2.3.1 Terminal Digit Analysis

The first test assumes that the last digit in samples of large numbers (>100) should empirically follow a uniform distribution. Some previous works have also shown that fabricated data often fail to exhibit this property. The authors use the chi-square test for goodness of fit and get low p-values for the RTS’s data, and good fits for the data of other researchers. Our results are very similar to theirs, although not identical.

### 2.3.2 Equal Digits Analysis

This test assumes that, for large numbers, empirical frequencies of observations of a pair of equal terminal digits should be close to 1/10. The authors did not mention which tests were considered for this analysis. We assume they performed chi-square tests for goodness of fit, for which we obtain similar results.
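The inputs to both digit tests can be extracted with a few lines of Python. The sketch below is illustrative only: the function names and the example counts are hypothetical, and the cutoff of 100 follows the threshold for large numbers mentioned in this section.

```python
from collections import Counter


def terminal_digit(count):
    """Final digit of a non-negative integer count."""
    return count % 10


def last_two_equal(count):
    """True if the final two digits match; intended for counts of 100 or more."""
    return count % 10 == (count // 10) % 10


def digit_summary(counts, minimum=100):
    """Terminal digit frequencies and the fraction of counts with equal final digits."""
    large = [c for c in counts if c >= minimum]
    frequencies = Counter(terminal_digit(c) for c in large)
    equal_fraction = sum(last_two_equal(c) for c in large) / len(large)
    return frequencies, equal_fraction


# Hypothetical counts, not taken from the data sets discussed here.
example_counts = [1433, 1288, 907, 1500, 512, 42]
print(digit_summary(example_counts))
```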

## 2.4 Discussion

Here are a few general comments on the methodology adopted by the authors:

• The authors did not justify the assumption of Poisson distribution for the underlying radiobiological data. We think a more thorough explanation would have been helpful for readers with different backgrounds.

• The authors suspected RTS’s data but used his/her data to fit a model and quantify their suspicion. While sometimes this may raise concerns, here we agree with the authors that doing so increases the odds in favor of the RTS, hence giving us desirable conservative results.

• The authors do not discuss why considering only numbers larger than 100 justifies the assumption of insignificance for the two terminal digits.

• The authors include additional data from three external sources (two for Coulter counts and one for colony counts). All of them, however, had a relatively small amount of data. Despite the authors’ attempts to account for this, we believe that in the current setting these additional samples do not provide more compelling evidence. Instead, they might be misleading (Are the procedures used the same? Is the equipment calibrated in the same way? etc.).

• We reiterate that pooling the data may hide anomalies in the other researchers’ data.

# 3 Further Analysis

As a preliminary test for identifying suspicious datasets, we (1) plot histograms of mid-ratios for the colony data provided by individual researchers, and (2) contrast the histogram of each investigator with the histogram of the pooled data of the other investigators. In Figure 2, we present the plots for (1).

Two important observations can be made:

• The histograms for researchers with labels B, C, E, F, G, H, I do not appear to follow a uniform distribution.

• RTS heavily influences the histogram when his/her data is collected in the pool and, therefore, patterns from the other researchers look anomalous when compared to it.

These points illustrate the limitations of the uniformity assumption for mid-ratios and the visual comparison between the histograms of RTS and the pool to motivate suspicion.

## 3.1 Permutation Tests

“The problem of determining whether a treatment has an effect is widespread in various real world problems. To evaluate whether a treatment has an effect, it is crucial to compare the outcome when treatment is applied (the outcome for the treatment group) with the outcome when treatment is withheld (the outcome for the control group), in situations that are as alike as possible but for the treatment. This is called the method of comparison.”. We will describe this method for a specific set up relevant for this review.

Suppose that we are given two sets of observations: one labeled as ‘treatment’, of size T, and the other labeled as ‘control’, of size C. We assume that the first has received a treatment, and we wish to test the hypothesis that this treatment has no effect on the group. In a two-sample permutation test, the data are pooled together to form a population of size N = T + C. To compare the two groups, we need to decide on a test statistic that can capture the effect of the treatment (if any) on the population. As an example, we can consider the absolute difference between the sample means of the two datasets. Under the null hypothesis that the treatment has no effect, one can analytically derive the distribution of this test statistic. However, it is often easier to approximate this distribution empirically rather than compute it exactly. To do so, one repeatedly partitions the data at random into groups of size T and C and computes the test statistic contrasting the two groups. We use the empirical histogram obtained from these repetitions as a proxy for the true distribution of the test statistic. Just as in typical hypothesis testing, we then determine the chance (p-value) of observing a test statistic at least as extreme as the one computed from the original grouping.

When the p-value is below a preset significance level, we infer that the treatment has an effect at that level of significance. It is unlikely that the two sets were obtained by a random partition of the pooled data.
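The procedure described above can be sketched as follows. This is a generic illustration, not the code in the repository cited in this review: the function names, the seed, and the example data are hypothetical, and the p-value is reported as the raw fraction of repetitions, so a value of 0.0 means smaller than 1/repetitions.

```python
import random
import statistics


def abs_mean_difference(group_a, group_b):
    """Absolute difference between the sample means of two groups."""
    return abs(statistics.mean(group_a) - statistics.mean(group_b))


def permutation_p_value(treatment, control, statistic=abs_mean_difference,
                        repetitions=1000, seed=0):
    """Fraction of random re-partitions whose statistic is at least the observed one."""
    rng = random.Random(seed)                  # fixed seed for reproducibility
    observed = statistic(treatment, control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    hits = 0
    for _ in range(repetitions):
        rng.shuffle(pooled)                    # random partition into sizes T and C
        if statistic(pooled[:n_treat], pooled[n_treat:]) >= observed:
            hits += 1
    return hits / repetitions


# Hypothetical data: two clearly separated groups.
treatment = [10.0 + 0.1 * i for i in range(20)]
control = [0.1 * i for i in range(20)]
print(permutation_p_value(treatment, control))
```

Any other statistic that takes two groups and returns a number, for example an absolute difference in standard deviations, can be passed through the `statistic` argument.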

### 3.1.1 Results for Mid-Ratio

We set the test statistic to be the difference in standard deviation of the mid-ratios for the two datasets. We choose the standard deviation, instead of the mean, because our null and alternative hypothesis for mid-ratio (uniform distribution versus concentration around 0.5) have the same mean (0.5). We expect the standard deviation to capture the unintentional reduction in spread caused in data due to intentional adjustments.

We consider each researcher’s data equivalent to a treatment group and the rest of them as the control group. We use 1000 repetitions to obtain the empirical distribution and then compute the p-values:

• 0.00, for investigators A, B, D, and RTS;

• <0.01, for C, H, I;

• >0.01, for E,F,G.

The p-values indicate that almost all datasets are surprising with respect to this test-statistic. We would like to emphasize that here a p-value of 0.00, in fact, denotes a p-value <0.001, because of the finite resolution owing to 1000 tests. We would also like to mention that RTS is still the most surprising if one looks at the location of the test-statistic in the tails of the distribution.

We also use the ℓ1-distance between the densities and the ℓ1-distance between the cumulative distribution functions (CDFs) as test statistics. Again, we reject several researchers of the lab at a significance level of 1%. We present all the p-values in Figure 2. The top row denotes the test statistics used and the first column denotes the researcher for whose data we perform the permutation test. The column labeled 'No.' denotes the number of data points associated with the researcher.

Remark: When RTS is included in the control group, it constitutes the bulk of the group. As a result, rejecting the null hypothesis for a researcher is almost equivalent to rejecting the hypothesis that the data of that researcher are the same as RTS’s data. If we already believed or discovered that RTS’s data were suspicious, then we cannot flag other researchers’ data as suspicious on this basis. Therefore, we ran another set of permutation tests after excluding RTS’s data. We did not find strong evidence to reject the null hypothesis; hence we conclude that none of the other researchers is suspicious at a significance level of 1%. However, this set of tests suffers from a bias because we manually discarded about 2/3 of the data points.

Putting together all the pieces, we conclude that there is statistical evidence to claim that RTS’s data is not genuine.

## 3.2 Additional Tests for Digit Analysis

For the terminal digit and equal digits analyses, we extended the tests done by the authors to individual members of the lab and performed (1) chi-square test for goodness of fit for terminal digit, (2) chi-square test for goodness of fit for equal digits and (3) permutation tests for terminal digit. For permutation tests, we used the test statistics listed in the previous section. Results are tabulated in Figures 5-7.

Figure 7, once again, confirms that RTS’s data is suspicious. As before, the large fraction of data contributed by RTS drives the low p-values for some of the other researchers. In permutation tests after excluding RTS, none of the researchers look suspicious. For the sake of brevity, we omit the p-values here.

# 4 Conclusion

Data fraud is an extremely critical issue in science, engineering, and many other fields. Methods to detect manipulated data are needed to identify fraudulent research behavior. Detecting fraud, however, is a delicate matter. Challenging the credibility of a researcher or of a scientific work can have heavy consequences for all the parties involved. Methodologies and techniques used in this kind of work need to be clear and widely accepted. They need to produce results which leave minimal (ideally no) room for ambiguity. Independently, reproducibility of results is fundamental to ruling out any doubts that could arise at any time. In our review, we carefully analyzed the authors’ work by reproducing the results in the paper and using additional tests which we believe to be more general. Having been able to reproduce most of their results, we found that the authors’ conclusions are correct. Moreover, we encourage the use of more powerful tools, such as permutation tests, which we found to be effective in the context of the paper. Such tests help to focus the analysis not on the assumptions, but on the actual anomalies present in the data.

At the end of our review, we believe that there is significant evidence that RTS has suspicious data. However, we recommend that the authors collect additional information, since some of our tests suggest that other investigators’ data have anomalies as well if we do not discount the large fraction of data contributed by RTS.

# Acknowledgments

We would like to thank the authors J. Pitt and H. Hill for publishing in an open journal and making the data available for everyone. We would also like to thank Prof. Philip Stark for his valuable and critical guidelines and timely feedback, and Yuansi Chen for valuable tips on Python. As a final note, we claim complete responsibility for all the opinions expressed in this paper.

1. abuse of terminology, used in place of normalized histograms

Aaron Stern evaluated the article as: Show full review    Rated 4 of 5.
Clever techniques to address a pressing problem in science; selective analysis of their performance
 Publication date: 19 December 2016 DOI: 10.14293/S2199-1006.1.SOR-STAT.AFHTWC.v1.RWPXRI Level of importance:     Rated 5 of 5. Level of validity:     Rated 3 of 5. Level of completeness:     Rated 3 of 5. Level of comprehensibility:     Rated 4 of 5. Competing interests: None Recommend this review: +12 people recommend this

# Summary

Pitt and Hill have presented an exciting and lucid analysis that introduces several new methods for detecting fraud. More precisely, the authors analyze count data recorded by an individual referred to as “research teaching specialist" (RTS). They analyze the RTS data using a set of statistical tests designed to detect anomalous patterns in count data. The p-values for these tests are compared to those for data from groups of other investigators using the same protocols. For all the tests they performed, the authors reject the hypothesis that the RTS reported the data accurately. Conversely, they find no significant anomalies in the comparison groups.

# Reproducibility

We strove to reproduce the main results (i.e., hypothesis tests) presented in this paper. Code and figures for our review are available on GitHub at

$$\texttt{https://github.com/35ajstern/reproduce\_sor\_2016/}$$

in the $$\texttt{report/}$$ folder, written as a Jupyter notebook; main results of the reproduction are also presented at the end of the review as data tables with red marks that indicate our own results.

We were able to reproduce most of the authors’ results, with a couple of minor discrepancies that may have arisen from filtering that was unspecified in the paper. The Jupyter notebook also contains our own novel analysis of the paper’s data, which we discuss throughout this review.

# Study design and alternative analyses

## Mean-containing and mid-ratio tests

### Poisson assumption

A central contribution of this paper is its presentation of novel mid-ratio/mean-containing tests for count data. These methods assume that the count data $$\{X^i_j\}_{j=1}^{3}$$ within a triple satisfy $$X^i_j \overset{iid}{\sim} \text{Pois}(\lambda_i) \ \forall i \in \{1, \dots, N\}$$, where N is the number of triples in the population. Does this probability model sufficiently describe the dynamics of a cell population? The fate of a cell is likely dependent on the fates of its neighbors; this relationship is not captured in the Poisson model. We are concerned that the Poisson assumption might be unrealistic, and wish the authors had discussed in more detail what behavior could be expected from their tests when this assumption breaks down.

We independently examined the claim that counts within a triple are distributed Poisson by comparing the real data with simulated Poisson triples. Independently and identically distributed Poisson variables should on average have sample mean equal to the unbiased estimate of variance
$$\mathbb{E}[\bar X] = \mathbb{E}\big[\hat \sigma^2 (X)\big] = \mathbb{E}\bigg[ \frac{1}{n-1} \sum\limits_{i=1}^n (X_i - \bar X)^2 \bigg]$$
where n = 3 for triples.

To test whether the experimental data adhere to this canonical relationship, we performed linear regression on $$\bar X$$ and $$\hat \sigma^2$$ for all the triples from a putative control group (colony triples from other investigators in the RTS lab). If the Poisson distribution assumption were true, then the slope of this regression would be approximately equal to 1 (Fig. 1). However, the regression coefficient for the real data is 0.73, which means the sample variance of the real data is substantially smaller than expected under a Poisson distribution (under-dispersion). This suggests that the colony count data from other investigators do not follow a Poisson distribution. We performed the same test on Coulter machine-counted data from the group of other investigators, and found a regression coefficient that also seems implausible under the Poisson assumption (Fig. 2). In this case, however, the data are over-dispersed, with a regression coefficient of 1.37.
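As a rough illustration of the benchmark used in this check, the sketch below simulates Poisson triples and regresses the unbiased sample variance on the sample mean; for genuinely Poisson triples the slope should come out close to 1. The helper names, the hand-rolled Poisson sampler (Knuth's method, standard library only), and the range of rates are assumptions made for this example and are not taken from our notebook.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; fine for lam below ~700)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (intercept included)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def variance_on_mean_slope(rates, seed=0):
    """Simulate one Poisson triple per rate; regress sample variance on sample mean."""
    rng = random.Random(seed)
    means, variances = [], []
    for lam in rates:
        triple = [poisson(lam, rng) for _ in range(3)]
        m = sum(triple) / 3
        means.append(m)
        variances.append(sum((x - m) ** 2 for x in triple) / 2)  # divisor n - 1 = 2
    return ols_slope(means, variances)

# Hypothetical rates (10..99), chosen only for illustration
rates = [10 + (i % 90) for i in range(6000)]
print(variance_on_mean_slope(rates))  # expected to be near 1 under the Poisson model
```

A slope well below 1 on real triples would point to under-dispersion, and a slope well above 1 to over-dispersion.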

Figure 1 [see review.pdf on github]: Empirical compared to simulated distributions of $$\hat \sigma^2$$ vs. $$\bar X$$. The Poisson parameter of each randomly simulated Poisson triple is the sample mean of a corresponding real triple in the RTS data. The red line represents the expected sample variance.

Figure 2 [see review.pdf on github]: A simulated distribution of $$\hat \beta = \frac{\hat \sigma^2}{\bar X}$$ for $$N=1000$$ random Poisson triples. Each triple was parameterized uniformly at random by $$\hat \lambda_{ML}$$ of a triple, which is sampled randomly from the real data. The blue line on the right represents the actual value of $$\hat \beta$$. Outliers of real data ($$\hat \sigma^2 > 3\bar X$$) were excluded.

Figure 1 in the paper shows that the distribution of the mid-ratio from RTS colony triples is very different from that of the other investigators. While the authors just use this as supporting evidence rather than a concrete argument, it is worth pointing out that there is no reason these distributions should look similar. Assuming each triple $$X^i$$ comes from a Poisson distribution with rate parameter $$\lambda_i$$ for $$i = 1, \dots, N$$, where N is the number of triples, then the empirical distribution of mid-ratios will depend on the composition of the set $$\{\lambda_i\}_{i=1}^{N}$$, which certainly varies across experiments. Since colony counts from other investigators might have totally different rate parameters from those of RTS’s experiment, there is no reason to expect the corresponding mid-ratios to be similar. To illustrate this issue, consider the difference in the empirical distribution of simulated mid-ratios for triples with λ = 1 vs λ = 100.

Figure 3 [see review.pdf on github]: Left: mid-ratios for N = 1000 simulated Poisson triples with λ = 1. Right: mid-ratios for N = 1000 simulated Poisson triples with λ = 100.
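The contrast in Figure 3 can be reproduced in spirit with a short simulation. The sketch assumes the mid-ratio of a triple is (middle − smallest) / (largest − smallest) and is undefined when all three values coincide; that definition, the helper names, and the summary statistic (the share of mid-ratios sitting exactly at 0 or 1) are choices made for this example.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; fine for lam below ~700)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def mid_ratio(triple):
    """(mid - low) / (high - low); None when all three values are equal."""
    lo, mid, hi = sorted(triple)
    return None if hi == lo else (mid - lo) / (hi - lo)

def simulated_mid_ratios(lam, n_triples=1000, seed=0):
    rng = random.Random(seed)
    ratios = (mid_ratio([poisson(lam, rng) for _ in range(3)])
              for _ in range(n_triples))
    return [r for r in ratios if r is not None]

def fraction_at_endpoints(ratios):
    """Share of mid-ratios equal to exactly 0 or 1 (middle value ties an extreme)."""
    return sum(r in (0.0, 1.0) for r in ratios) / len(ratios)

for lam in (1, 100):
    print(lam, round(fraction_at_endpoints(simulated_mid_ratios(lam)), 2))
```

With a small rate most triples contain ties, so the mid-ratio piles up at 0 and 1; with a large rate ties are rare and the values spread across the unit interval.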

### Stratification of mean-containing/mid-ratio tests

In their study, the authors present Figure 1 to suggest that RTS data has an anomalous proportion of mean-containing triples compared to the agglomerated group of other investigators from his lab. We stratified the histogram for these 9 investigators to see if there were anomalous patterns lingering within the lumped group (Fig. 4). While sample size is too small to declare significance, there are numerous investigators within the lumped group who individually appear to record a high proportion of triples with mid-ratio concentrated about 1/2.

Figure 4 [see review.pdf on github]: Mid ratio histogram for other 9 investigators stratified by individual.

### Hypothesis testing

While the authors propose a plausible mechanism for how their novel mid-ratio/mean-containing tests might detect fraud, we wonder if they came to settle on performing this test only after they “peeked” at the data. Designing a statistical test in full knowledge of the data to be tested can often produce smaller p-values. We recommend that if data are used to guide test design, then some of the data should be apportioned into a disjoint testing set; this set remains unobserved until the hypothesis test is applied to it (and not to the previously observed set) to assess significance.

Furthermore, the authors do not divulge the hypothesis tests that were considered or performed before the ones which they present in the paper. Our concern here is that the disclosed hypothesis tests may have happened to reject the null, while many more undisclosed tests may have not. Providing this information is invaluable for managing the false discovery rate.

Hypothesis test I is based on a (numerically) conservative bound, while test II treats estimated values of λ as if they have no uncertainty, which might result in an unconservative test (the true p-value could be rather larger than the nominal p-value). For the data in the paper, the conservative test yields an extremely low p-value; we suppose the authors presented the other test because it might be useful in other situations. However, we prefer the cruder test I to test II: because $$\hat \lambda_{ML}$$ (the maximum likelihood estimate (MLE) of λ) is a crude estimate of the rate parameter of a triple, test I, which does not rely on this estimate, seems to be conservative. For example, take the case that a triple occurs far from its expectation, say (72, 102, 104), when its “true” λ = 70. In that way, test I could be robust to the under- or over-dispersion of count data that we pointed out previously. Conversely, we view this as a problem in hypothesis test II, where the sample mean is used to stratify the triples by their “true” λ. The authors mention that the sample mean is the MLE; however, they do not discuss its large mean squared error (MSE):

$$\text{MSE}_\lambda(\hat \lambda_{ML}) = \mathbb{E}_\lambda[(\bar X - \lambda)^2] = 1/9(\sum ^3_{i = 1} \text{Var}_\lambda(X_i)) = \lambda/3$$

To this end, we would have liked to see an exploration of the sensitivity of the true level of the tests to the uncertainty in the sample mean as an estimate of λ.
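A quick Monte Carlo check of the MSE expression above is sketched here: for Poisson triples the squared error of the sample mean should average about λ/3. The function name, sample size, and the rates tried are illustrative assumptions, and the Poisson sampler is a plain standard-library implementation of Knuth's method.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; fine for lam below ~700)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def mse_of_triple_mean(lam, n_triples=20000, seed=0):
    """Monte Carlo estimate of E[(mean of 3 Poisson(lam) draws - lam)^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_triples):
        xbar = sum(poisson(lam, rng) for _ in range(3)) / 3
        total += (xbar - lam) ** 2
    return total / n_triples

for lam in (3, 30):
    estimate = mse_of_triple_mean(lam)
    # Compare the simulated MSE with the theoretical value lam / 3.
    print(lam, round(estimate, 2), round(lam / 3, 2))
```

In relative terms the root-MSE is sqrt(λ/3)/λ, which is about 18% at λ = 10 and still about 6% at λ = 100, so stratifying triples by their sample mean can misplace many of them.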

Hypothesis test III applies the Lindeberg-Feller Central Limit Theorem (L-FCLT) to approximate the distribution of occurrences of mean-containing triples. We do agree with the authors that the Bernoulli events “triple $$X^i$$ is mean-containing” satisfy the Lindeberg Condition as the number of triples grows large. However, the authors use the mean of each triple (a highly unstable estimate, as we just discussed) and a comparably small sample size (i.e., the number of triples). Therefore, we are concerned that the authors take for granted that the L-FCLT would be suitable to approximate the number of mean-containing triples when the total number of triples is only on the order of $$10^3$$. Furthermore, the unbiasedness of the estimate $$\hat p_i = f(\hat \lambda_{ML})$$ is not guaranteed, and therefore it is not guaranteed that the L-FCLT holds for $$\{p_i\}_{i=1}^{N}$$.

## Tests of digit uniformity

We find the subsequent tests deployed by the authors to be more compelling than the aforementioned mean-containing/mid-ratio tests. That the authors cite usage of terminal digit analysis in previous studies of fraud suggests to us that these tests were more likely to have been selected agnostic of the data. As a result, we have fewer concerns about “peeking" at the data and ensuing selective inference on these two tests.

The authors’ chi-squared test on the occurrence of terminal digits banks on the assumption that the distribution of terminal digits is uniform when a single count is iid Pois(λ). We checked this claim and found that when we simulated Poisson random variables with λ < 30, it is not a reasonable assumption (Fig. 5). That being said, the majority of the colony count data in the study takes on values larger than 30, so that assumption works well if the observed rates indicate the underlying theoretical rates (we also suspect that terminal digits under an over-dispersed distribution converge even more rapidly to uniformity). However, the authors do not appear to have filtered out data with small empirical rates; in fact, our reanalysis suggests they did not discard single-digit numbers in the terminal digit analysis. Nonetheless, our reanalysis was largely concordant with the authors’, with slight differences that do not affect significance.

Figure 5 [see review.pdf on github]: The mean terminal digit of a Poisson random variable does not converge to 4.5 (necessary for uniformity) until $$\lambda > 30$$. We simulated $$N=10^4$$ variables for each value of $$\lambda$$.
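The simulation behind Figure 5 can be approximated as follows. This is a sketch under assumed settings (a fixed seed, a handful of rates, and a hand-rolled Poisson sampler), not the notebook code. Note that a mean terminal digit of 4.5 is necessary for uniformity but does not by itself establish it.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; fine for lam below ~700)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def mean_terminal_digit(lam, n=10_000, seed=0):
    """Average last digit of n simulated Poisson(lam) counts."""
    rng = random.Random(seed)
    return sum(poisson(lam, rng) % 10 for _ in range(n)) / n

for lam in (1, 5, 10, 30, 100):
    print(lam, round(mean_terminal_digit(lam), 2))
```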

That said, we have a serious concern about the usage of this test to compare the χ2 of an individual to the χ2 of a lumped group; for example, consider a group consisting of two individuals–one of whom only records even numbers and the other only odds. If their counts “cancel out" sufficiently, their group may have an insignificant χ2 value (perhaps even equal to 0). Separately, these two individuals would no doubt have significant χ2 statistics. This pathology arises from testing individuals against groups. While this example directly concerns terminal digit uniformity, it can also apply to the authors’ equal digit analysis and mean-containing/mid-ratio tests. In all of these tests, opposite biases can cancel each other out when lumped into a single group. In the following section, we examine how the authors’ results change when data is stratified individual-by-individual.
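The even/odd thought experiment can be made concrete with a few lines of arithmetic. The counts below are hypothetical and chosen so that the cancellation is exact; `chi_square_uniform` is a name introduced for this sketch.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Hypothetical terminal-digit counts for digits 0..9
evens_only = [100 if d % 2 == 0 else 0 for d in range(10)]
odds_only = [100 if d % 2 == 1 else 0 for d in range(10)]
pooled = [a + b for a, b in zip(evens_only, odds_only)]

print(chi_square_uniform(evens_only))  # 500.0
print(chi_square_uniform(odds_only))   # 500.0
print(chi_square_uniform(pooled))      # 0.0
```

With 9 degrees of freedom the 1% critical value is about 21.67, so each individual would be rejected overwhelmingly while the pooled group looks perfectly uniform.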

### Stratification of digit uniformity tests

To examine how lumping of individuals affects significance, we stratified data from other investigators in RTS’s lab individual-by-individual based on codes in the authors’ spreadsheets. We performed a terminal digit and equal digit analysis on these groups and found several individuals who produced unlikely data: Investigators D and F had statistically significant terminal (p < 0.01) and equal digit data (p < 0.05), respectively (Tab. 5,6).

### Arbitrary digit pairs

We were confused that the authors looked for an enrichment of equal digits in the data. People committing fraud may avoid fabricating equal digits on the grounds that, under uniformity, a matching pair is 9 times less likely than a non-matching one (this reasoning parallels the authors’ motivation for mid-ratio tests, which look for an enrichment of likely triples). We performed a test equivalent to the equal digit analysis on 10 non-equal digit pairs, {01, 12, ⋯, 90}, and looked at how anomalous individuals appeared under this test versus equal digits. We found that this choice of digit pairs produced a test that suggested 4 of the 9 other investigators (as well as RTS) had unlikely data (Table 7).
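To make the two digit-pair statistics concrete, the sketch below counts how many values end in an equal pair and how many end in a "successive" pair (01, 12, …, 90). It assumes that only values with at least three digits are eligible, as in the equal-digit analysis; the function names and the example values are hypothetical. Under uniform final digits, each kind of pair has probability 1/10.

```python
def last_two_digits(value):
    """Return (tens digit, units digit) of a non-negative integer."""
    return (value // 10) % 10, value % 10

def is_equal_pair(tens, units):
    return tens == units

def is_successive_pair(tens, units):
    return units == (tens + 1) % 10  # 01, 12, ..., 89, 90

def count_pairs(values, is_hit):
    """Return (hits, eligible) among values with at least three digits."""
    eligible = [v for v in values if v >= 100]
    hits = sum(1 for v in eligible if is_hit(*last_two_digits(v)))
    return hits, len(eligible)

# Hypothetical counts, for illustration only
values = [101, 212, 133, 345, 990, 478, 56, 1200]
print(count_pairs(values, is_equal_pair))       # (2, 7)
print(count_pairs(values, is_successive_pair))  # (5, 7)
```

The resulting (hits, eligible) pair can then be fed to the same one-sided binomial test used for equal digits.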

## Permutation testing: terminal and equal digits

Touching back on our criticism of how grouping affects calculation of χ2, we reiterate our concern that the individual RTS was tested against groups of other investigators. It is not clear why RTS was singled out; other researchers might also have fabricated data. To control for the effects of this way of testing the data individual-to-group, we implemented two non-parametric permutation tests.

To test the abnormality of RTS’s data, we took data from RTS and other investigators, combined them into one group, and repeatedly permuted their labels (“RTS” or “Other Investigators”). These new permuted populations were used to calculate the chi-squared and total variation distance between terminal digit frequencies of each pair of permuted groups (Fig. 6). Indeed, the p-values of the actual RTS data’s distances (both TVD and χ2) are extremely small. This result reinforces the claim that RTS’s data were unlikely to have occurred by chance, even if the data are not observations of Poisson variables. We would have liked to do pairwise permutation tests stratified individual-by-individual, but no individuals besides RTS contributed sufficient data to perform such tests. Please refer to our Jupyter notebook for additional permutation tests of equal digit pairs and triple mid-ratios.

Figure 6 [see github]: Left: Total variation distance (TVD) of terminal digit frequency in $$N=1000$$ permutations of RTS vs others (cyan); TVD of the actual RTS data vs others (dashed bar). Right: $$\chi^2$$ distance applied to the same permutation scheme.
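The TVD variant of this permutation scheme can be sketched as below. It is an illustrative reimplementation with assumed names and synthetic data (one group with terminal digits biased toward 1, 2, 5 and 9, the other roughly uniform); it is not the notebook code, and the real analysis would substitute the recorded counts.

```python
import random

def digit_freqs(values):
    """Relative frequency of each terminal digit 0..9."""
    counts = [0] * 10
    for v in values:
        counts[v % 10] += 1
    return [c / len(values) for c in counts]

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tvd_permutation_p(group_a, group_b, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = tvd(digit_freqs(group_a), digit_freqs(group_b))
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute the group labels
        if tvd(digit_freqs(pooled[:k]), digit_freqs(pooled[k:])) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic data, for illustration only
rng = random.Random(1)
others = [rng.randrange(100, 1000) for _ in range(300)]
suspect = [10 * rng.randrange(10, 100) + rng.choice([1, 2, 5, 5, 9, 9])
           for _ in range(300)]
observed, p = tvd_permutation_p(suspect, others)
print(round(observed, 3), p)
```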

# Concluding remarks

The authors offer an overall persuasive analysis of the data. Ultimately, we believe the authors’ tests indicate that some fraction of the RTS data is fabricated. However, we are concerned that their novel hypothesis tests may have been designed deliberately to detect anomalies they observed a priori in the RTS data. We showed evidence contrary to some of the authors’ main assumptions, including the Poisson distribution of triples. We also showed that the design of the test groups glosses over potentially suspicious individuals within the comparison groups. Lastly, we designed two new permutation tests for count data abnormality that do not rely on parametric assumptions. Tests for fraud should be careful to avoid selective inference, and we find that evidence of fraud that depends on parametric assumptions is less compelling than evidence based on nonparametric tests.

We would like to thank the authors of the paper we reviewed for contributing this very interesting study, for making their data public, and for choosing to publish in an open-access journal.

Also, we would like to acknowledge Philip B. Stark, who vetted this review. However, the work was conducted entirely by the authors, and the opinions expressed in this review are those of the authors.

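Table 1 in the paper

| λ | P | λ | P | λ | P | λ | P | λ | P |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.267 | 6 | 0.372 | 11 | 0.317 | 16 | 0.281 | 21 | 0.254 |
| 2 | 0.387 | 7 | 0.359 | 12 | 0.309 | 17 | 0.275 | 22 | 0.250 |
| 3 | 0.403 | 8 | 0.348 | 13 | 0.301 | 18 | 0.269 | 23 | 0.246 |
| 4 | 0.397 | 9 | 0.337 | 14 | 0.294 | 19 | 0.264 | 24 | 0.242 |
| 5 | 0.385 | 10 | 0.327 | 15 | 0.287 | 20 | 0.259 | 25 | 0.238 |
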

Table 2 in the paper

| Type | Inv. | # complete/tot. | # mean | # exp. | SD | Z | p ≥ k |
|---|---|---|---|---|---|---|---|
| Colony | RTS | 1,343/1,361 | 690 | 220.3 | 13.42 | 34.97 | 0 |
| Colony | Others | 572/591 (578/597) | 109 | 107.8 | 9.23 | 0.08 | 0.466 |
| Colony | Lab 1 | 49/50 | 3 | 7.9 | 2.58 | −2.11 | 0.991 |
| Coulter | RTS | 1,716/1,717 (1726/1727) | 173 (176) | 97.7 | 9.58 | 7.80 | $$6.26 \cdot 10^{-13}$$ |
| Coulter | Others | 929/929 | 36 | 39.9 | 6.11 | −0.71 | 0.758 |
| Coulter | Lab 2 | 97/97 | 0 | 4.4 | 2.03 | −2.42 | 1.00 |
| Coulter | Lab 3 | 120/120 | 1 | 3.75 | 1.90 | −1.71 | 0.990 |


Table 3 from paper

| Type | Investigator | χ2 | p |
|---|---|---|---|
| Colony | RTS | 200.7 | 0 |
| Colony | Same lab | 1.65 (1.79) | 0.994363 |
| Colony | Other lab | 12.1 | 0.205897 |
| Coulter | RTS | 456.4 (466.88) | 0 |
| Coulter | Same lab | 16.0 | 0.0669952 |
| Coulter | Other lab 1 | 9.9 (9.48) | 0.394527 |
| Coulter | Other lab 2 | 4.9 | 0.839124 |


Equal digit analysis (Coulter)

| Investigator | x | n | p |
|---|---|---|---|
| RTS | 636 (644) | 5155 (5187) | 8.57787e-09 |
| Same lab | 291 (286) | 2942 (3021) | 0.827748 |
| Other lab 1 | 32 | 327 | 0.504864 |
| Other lab 2 | 30 | 360 | 0.83282 |


Stratified terminal digit analysis (Coulter)

| Investigator | χ2 | p | n |
|---|---|---|---|
| A | 8.10232 | 0.523869 | 1401 |
| C | 14.5789 | 0.10317 | 105 |
| B | 5.88889 | 0.750985 | 180 |
| E | 9.12121 | 0.426161 | 165 |
| D\* | 21.8438 | 0.00938759\* | 645 |
| G | 5.33333 | 0.804337 | 60 |
| F | 6.96774 | 0.640478 | 312 |
| I | 9.4183 | 0.399591 | 153 |

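
Stratified equal digit analysis (Coulter)

| Investigator | x | n | p |
|---|---|---|---|
| A | 132 | 1401 | 0.748688 |
| C | 8 | 105 | 0.733914 |
| B | 16 | 180 | 0.634373 |
| E | 13 | 165 | 0.777841 |
| D | 62 | 645 | 0.597186 |
| G | 4 | 60 | 0.729042 |
| F\* | 40 | 312 | 0.0436366\* |
| I | 11 | 153 | 0.848016 |
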
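
Alternative digit pairs

| Investigator | x | n | p |
|---|---|---|---|
| RTS\* | 560 | 5187 | 0.027532\* |
| A\* | 156 | 1401 | 0.0738102\* |
| C | 11 | 105 | 0.35797 |
| B\* | 23 | 180 | 0.0896297\* |
| E | 10 | 165 | 0.947312 |
| D | 72 | 645 | 0.147142 |
| G | 6 | 60 | 0.393549 |
| F\* | 39 | 312 | 0.0624213\* |
| I | 16 | 153 | 0.361155 |
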
Stephanie DeGraaf evaluated the article as: Show full review    Rated 4 of 5.
Convincing and valuable analysis with methodology that could be refined
 Publication date: 14 February 2017 DOI: 10.14293/S2199-1006.1.SOR-STAT.AFHTWC.v1.RKMHJD Level of importance:     Rated 4 of 5. Level of validity:     Rated 4 of 5. Level of completeness:     Rated 4 of 5. Level of comprehensibility:     Rated 3 of 5. Competing interests: None Recommend this review: +1

Review: Statistical Analysis of Numerical Preclinical Radiobiological Data

Erik Bertelli, Stephanie DeGraaf, James Hicks

Introduction

This paper tackles the serious problem of detecting fraud by applying and developing multiple methods: terminal digit frequency analysis and tests using the mid-ratio, based on a probability model for the triplicate count data. In this review, we first attempt to replicate Pitt and Hill's results using the data they published, and second, offer several points of clarification and discussion about their methodological contribution.

We would like to thank the authors Pitt and Hill for publishing their results in an open journal and making their data available online, enabling us to attempt this replication. This review was vetted by Philip B. Stark. However, the work was conducted entirely by the authors, and the opinions expressed in this review are those of the authors.

Summary of Main Findings

The goal of Pitt and Hill’s work is to develop a test based on the mid-ratio and demonstrate its usefulness by applying it to a specific data set. For their data set, Pitt and Hill test the hypothesis that the data from a particular researcher (the RTS) were generated in a way so unusual as to suggest fabrication. Their first test, digit frequency analysis, is a common method of detecting fraud in many kinds of datasets. The analysis of midratios is much more specific to the radiobiological preclinical data, since the midratio involves triplicate measurements of counts. Pitt and Hill model "honest" triplicate colony count data and Coulter count data as independent triples of IID Poisson variables. That is, each triple consists of three IID Poisson random variables, and the set of all triples is independent (but not identically distributed: in general, different triples have different rates lambda). The bulk of the paper involves deriving or approximating the corresponding null distribution of the test statistics they consider, and applying those tests. They find that the RTS data are inconsistent with this null hypothesis.

After replicating Pitt and Hill’s major findings, we agree with the conclusion that the RTS data were not generated in the same fashion as the other investigators. While we offer small critiques of each method, the general strength of the evidence is such that we agree with the conclusions in the paper.

Reproduction of Results

First, we replicate the results of Pitt and Hill’s major analyses. Our code and results are available at https://github.com/sldegraaf/preclinical-data-review.

Terminal Digit Analysis

We independently replicated the terminal digit analysis for all of the data sets and found almost exactly the same results. The only major differences were for the Coulter-RTS and Colonies-Others data sets, in which our counts were a bit higher than those reported in the paper. We applied the exact same code to all datasets, and most of our results matched the results in the paper perfectly. The fact that our Coulter-RTS and Colonies-Others counts were slightly higher suggests some of the data may have been filtered out of the original authors’ analysis, but we were unable to infer the cause of the differences. Our counts are provided in the table below, where the bolded, underlined entries indicate the counts that differ from those in the paper. These differences were all very small and thus did not meaningfully change the Chi-squared test statistics or implication of the results.

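| Type | Investigator | Digit 0 | Digit 1 | Digit 2 | Digit 3 | Digit 4 | Digit 5 | Digit 6 | Digit 7 | Digit 8 | Digit 9 | Total | Χ2 | P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coulter | RTS | 475 | 613 | 736 | 416 | 335 | 732 | 363 | 425 | 372 | 718 | 5185 | 466.87 | 7.06E-95 |
| Coulter | Others | 261 | 311 | 295 | 259 | 318 | 290 | 298 | 283 | 331 | 296 | 2942 | 15.99 | 0.07 |
| Coulter | Outside 1 | 28 | 34 | 29 | 25 | 27 | 36 | 44 | 33 | 26 | 33 | 315 | 9.48 | 0.39 |
| Coulter | Outside 2 | 34 | 38 | 45 | 35 | 32 | 42 | 31 | 35 | 35 | 33 | 360 | 4.94 | 0.84 |
| Colonies | RTS | 564 | 324 | 463 | 313 | 290 | 478 | 336 | 408 | 383 | 526 | 4085 | 200.73 | 2.33E-38 |
| Colonies | Others | 191 | 181 | 195 | 179 | 184 | 175 | 178 | 185 | 185 | 181 | 1834 | 1.79 | 0.99 |
| Colonies | Outside 3 | 21 | 9 | 15 | 16 | 19 | 19 | 9 | 19 | 11 | 12 | 150 | 12.13 | 0.21 |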

We also replicated the analysis of the final two digits in each of the counts and again found that the RTS data were suspiciously non-uniform. In the paper, the authors test the occurrence of equal digits in the last two digits for the Coulter RTS and Other data sets, restricted to values with at least 3 digits. We replicated this analysis and found similar results, with RTS having 12.4% of values with equal last digits and Others having 9.9% equal. Under a binomial distribution with success probability 0.1 (i.e., assuming uniformly distributed digits), the one-sided probabilities of observing at least these proportions are 7.8E-09 and 0.55 respectively, which is strong evidence to be suspicious of the RTS data set.
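The one-sided binomial probability used here can be computed without any statistics library, as the sketch below shows. The function name is an assumption, and the example input (644 equal-digit values out of 5187) is borrowed from the reproduced equal-digit table in the preceding review purely as an illustration; it is not necessarily the exact count behind the figures quoted above.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return total

# Example input only: 644 equal-digit values out of 5187, chance rate 1/10
print(binom_upper_tail(644, 5187, 0.1))
```

The same function covers any test on this page that compares a count of "hits" with a fixed chance rate, such as the bound of 0.26 on the mid-ratio interval.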

Midratio Analysis

To replicate the results of mid-ratio analysis, we calculated the mid-ratios for Colony counts in the RTS data and Other Investigators’ data. As in Pitt and Hill's initial findings, our results indicate an overwhelming predominance of midratio values in the 0.4-0.6 range in the RTS data, compared to a uniform distribution in the Others' data.

To ensure that the uniform distribution is what we should expect from genuine data, we also looked at the distribution of the midratio on the Outside Lab's data, which Pitt and Hill did not include in their results. While there were fewer triples to consider, the distribution of the midratio still appeared uniform, supporting Pitt and Hill's conclusion that the RTS data is unusual. In addition, we checked the uniformity of the midratio distributions in the Coulter counts. Pitt and Hill also did not include this analysis, and when we analyzed the Coulter counts, the mid-ratios did not reveal anything unusual.

Pitt and Hill then constructed two tests using midratios based on the probability that a midratio falls in the interval [0.4,0.6]. Their first test relies on the assumption that for any value of lambda, the probability that the midratio is between 0.4 and 0.6 is less than 0.26. This assumption was validated numerically. If triples are independent and identically Poisson distributed and the sets of triplicate counts are independent, the number of triples for which the midratio is in [0.4, 0.6] is stochastically smaller than a Binomially distributed random variable with p = 0.26 and n equal to the number of triples, as Pitt and Hill claim implicitly. We used a binomial-based significance test, and found that out of 1361 triples, 824 of the midratio values fell into the range [0.4, 0.6], yielding a highly significant p-value near zero.

Triplicate Probability Models

To check the RTS findings against a theoretical distribution, the authors assert that each triple t can be modeled as a set of IID draws from a Poisson distribution with common parameter lambda_t. The comparator distribution permits us to answer the question: what is the probability that from n triples, k will include their means as one of their values? Pitt and Hill derive a probability model from the properties of the Poisson distribution, which we did not check in detail. Instead, we incorporated their assumptions into a simulation, in which we drew 10,000 triples for each value of lambda from 1 to 2,000, and then estimated the rate of mean-inclusion for each lambda. This yielded a set of simulated probabilities that corresponded closely to Pitt and Hill’s “MidProb” table.

Using these probabilities, Pitt and Hill conduct three tests of significance. The first “crude” test simply finds the maximum probability (which happens to be where lambda = 4) and uses it to calculate the binomial probability that, out of n triples, k contain their means. This essentially approximates an “upper bound” for the p-value. This is a straightforward calculation, and we were able to replicate the authors’ results.

The second, more refined, test relaxes the requirement of constant lambda. Instead, it treats the RTS data as a series of Bernoulli trials (where “success” indicates that a triple includes its mean). Each trial has a distinct probability of success derived from our “MidProb” table, with the lambda parameter set as each observed triple’s mean (lambda is the Poisson maximum likelihood estimator). This yields a Poisson binomial distribution (the distribution of the sum of independent Bernoulli trials with non-equal probability), whose density can be readily estimated. We did not precisely replicate the authors’ results here, because our simulated probabilities were slightly different to those in the paper, but we confirm their overall conclusion with respect to the RTS data. We are slightly less confident with respect to the data for other investigators, but again our conclusions are not materially different. As we report in the terminal digits analysis, our number of cases was slightly higher (k = 128; n= 597), leading to a p-value of 0.17 — some distance from their 0.58, but not a statistic which renders the data improbable.

Finally, the authors use the normal approximation of the Poisson binomial to calculate traditional z-scores. Here, our results closely mirrored the authors’; a z-score of around 34 is strong evidence against the null hypothesis.
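For readers unfamiliar with the Poisson binomial distribution, the sketch below computes its exact probability mass function by dynamic programming and the corresponding normal-approximation z-score. The per-triple probabilities are hypothetical placeholders (in the analysis they would come from the “MidProb” table), and the optional continuity correction of 0.5 is an assumption of this example rather than a statement about the authors' calculation.

```python
import math

def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli(p_i) trials."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # this trial fails
            nxt[k + 1] += mass * p     # this trial succeeds
        pmf = nxt
    return pmf

def upper_tail(probs, k):
    """Exact P(number of successes >= k)."""
    return sum(poisson_binomial_pmf(probs)[k:])

def z_score(probs, k, continuity=0.5):
    """Normal-approximation z-score for observing k successes."""
    mean = sum(probs)
    sd = math.sqrt(sum(p * (1 - p) for p in probs))
    return (k - continuity - mean) / sd

# Hypothetical success probabilities for 200 triples
probs = [0.25, 0.30, 0.35, 0.40] * 50
print(upper_tail(probs, 90), z_score(probs, 90))
```

The exact tail costs O(n²) operations, which is negligible for a few thousand triples, so the normal approximation is a convenience rather than a necessity.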

Criticisms of Analysis

In a general sense, it is important that hypotheses used to detect fraud be formulated and specified prior to conducting any exploration of the data. From a statistical point of view, it is always possible, genuine randomness notwithstanding, to find some unusual pattern in a sample of data if enough features are analyzed. However, these unusual patterns do not reflect any true departure from randomness and should not be used as evidence that data are not genuine. That is, in statistical terms, it is still possible to obtain significant p-values for unusual characteristics of the data, but these p-values no longer have the same interpretation if hypotheses were not pre-specified.

To demonstrate how easy it is to find “unusual” features present in data, consider the data for the Other Investigators' Colony counts, which Pitt and Hill deem to be genuine. In the triplicates, it appears that the lowest value of the triple appears in the first column of the data quite often: in 230 of the 597 cases the lowest value is in the first column, a proportion of 0.385. In a “truly random” scenario, we would expect 1/3 of the lowest values to be in column 1. Thus, if we treat each triple as an independent random trial with probability 1/3, we can conduct a binomial test of significance on the 597 trials. Our sample estimate returns a p-value of 0.00437, indicating that the probability of seeing the smallest value in the first column is significantly higher than it would be due to chance alone. This is clearly an incorrect interpretation of the p-value, however, since we did not hypothesize that low values would be in the first column until after observing this phenomenon.
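For concreteness, a binomial test of this kind can be sketched as below. The one-sided ("230 or more") tail is an assumption of this sketch, since the text above does not say whether the reported p-value was one- or two-sided, and `binom_upper_tail` is a hypothetical helper name.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    peak = max(log_terms)  # factor out the largest term to avoid underflow
    return math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)


# 230 of 597 triples have their lowest value in column 1; chance rate is 1/3.
p_value = binom_upper_tail(230, 597, 1 / 3)
```

Summing the tail directly, rather than computing one minus a cumulative probability, keeps the result meaningful even when the tail is far smaller than double-precision rounding error.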

This simple example serves to emphasize the importance of selecting metrics of fraud before exploring the data. The terminal digit analysis is a standard metric and it is logical to assume that this hypothesis would be specified before looking at any data. However, the midratio analysis seems to be a characteristic of the data observed after looking at the data. Pitt and Hill even acknowledge that their decision to test mid-ratios came after "[h]aving observed what appeared to us to be an unusual frequency of triples in RTS data containing a value close to their mean". This casts some doubt on the validity of the p-value; however, the significance of the midratio test was so strong that we feel confident that the conclusion is valid.

Terminal Digit Analysis

The terminal digit analysis can reveal some kinds of data fabrication, using fairly minimal assumptions about how the data were generated: the final digit amounts to “noise”. It also has the benefit of being a simple-to-understand, common approach that one would try early on when addressing this question. On the other hand, the equal digit analysis is somewhat more aggressive than the terminal digit analysis. Since the majority of these numbers have only three digits, it is much more likely that the final two digits will not be uniformly distributed in these data sets compared to just the terminal digit. In fact, we believe that is why the equal digit analysis was only applied to the Coulter data sets, as the Colony data sets did not have large enough average values to guarantee uniformity. For example, if you examine the distribution of all possible final two-digit pairings in the Colony-Other Investigators data set, you can see that the empirical distribution does not look very close to uniform:

As you can see, the distribution is quite a bit heavier for the smaller values of the last two terminal digits. For this reason assuming that the last 2 digits of each number should be equal 10% of the time would not be valid, and it is good that the authors did not apply the equal digit analysis to the Colony data sets.

It is also worth wondering why a more general test of the uniformity of the final two digits, such as a chi-squared test on all 100 possible two-digit combinations, was not used. We performed such a test and found that the p-value for the Coulter-Other data set was only 0.06. This result leads us to question the Coulter Equal Digit probabilities given in the paper, as the 10% assumption for the equal digits relies on the overall distribution of the last two digits being uniform. If the authors noticed that the RTS data set had a large number of data points with the last two digits being equal and only then applied the statistical test, that would reduce the persuasiveness of the result. In our opinion the Terminal Digit analysis already shows that the data is not distributed as expected, and the Equal Digit Analysis only complicates this analysis with less persuasive evidence.

Triplicate Probability Models

In general, we agree with the logical formulation of the model in this section, but we are curious about the choice of distribution to model the count data. The Poisson distribution is a natural choice for count data because it is the limiting case of the Binomial as the parameter n goes to infinity and p goes to zero with np held fixed. However, in this scenario, it is unclear whether the number of colonies formed by surviving cells satisfies the conditions for that Binomial limit.

In addition, we have concerns about the parameter estimation in each triple. The sample mean for each triple is indeed the maximum likelihood estimator of the Poisson lambda parameter, but with only three data points available to estimate this parameter, these sample means are unlikely to be reliable estimates of the true parameters. It might be better to incorporate shrinkage methods or use a Bayesian approach to share information across all colony triples to get more reliable estimates of each triple’s mean.

Since the strength of evidence in this scenario is so strong, we do not think these criticisms would change the conclusions, but they would be quite salient in scenarios where the evidence is not as strong.

Location of the Mean in the Triples

As an extension of this analysis, we noticed in the image of the RTS lab notebook provided that the mean value always appeared in the first column in the mean-containing triples. However, this is only a sample of 6 triples, so we applied a chi-squared test to the count of the means occurring in each column, on the assumption that those counts should be uniform. We excluded the outside lab results due to low counts and obtained the following results:

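| Type | Investigator | Column 1 | Column 2 | Column 3 | $$\chi^2$$ | p |
|---|---|---|---|---|---|---|
| Colonies | RTS | 220 | 373 | 97 | 166.25 | 7.92E-37 |
| Colonies | Others | 38 | 45 | 26 | 5.08 | 0.0788 |
| Coulter | RTS | 38 | 79 | 59 | 14.33 | 0.0008 |
| Coulter | Others | 13 | 13 | 10 | 0.5 | 0.7788 |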

Once again we find that the RTS datasets do not conform to the uniformity assumptions while the other investigators have data which is consistent with the assumption of means uniformly distributed across the columns. We believe that this information in conjunction with the other evidence presented further enhances the case that the RTS data is not genuine.
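The column test above can be checked with a few lines of code. This is a sketch rather than our original analysis code; it uses the fact that a chi-squared variable with 2 degrees of freedom has upper-tail probability exp(-x/2), and the function and variable names are chosen for illustration.

```python
import math


def uniform_chi2(counts):
    """Chi-squared statistic for observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


# Counts of mean values found in columns 1, 2 and 3, taken from the table above.
column_counts = {
    ("Colonies", "RTS"): (220, 373, 97),
    ("Colonies", "Others"): (38, 45, 26),
    ("Coulter", "RTS"): (38, 79, 59),
    ("Coulter", "Others"): (13, 13, 10),
}

results = {}
for key, counts in column_counts.items():
    stat = uniform_chi2(counts)
    results[key] = (stat, chi2_sf_2df(stat))
```

Three columns give 2 degrees of freedom, which is why the simple exponential form applies here.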

Conclusion

While we have raised several technical points of consideration about Pitt and Hill's methodology, the evidence in general is strong enough for us to arrive at the same conclusion: the RTS data is statistically different from the rest.

In general, we found Pitt and Hill’s work to be both convincing and valuable for future work in detecting fraud. Crucially, their paper illustrates the importance of developing statistical methods to detect fraud in data that would otherwise appear genuine, and contributes valuable tools for detecting fraud in similar preclinical radiobiological data sets. While the triplicate probability model was designed specifically for this type of data, Pitt and Hill’s ideas can be applied more generally; this model provides a general framework for future methods to be developed for other types of datasets.

1. Our simulated probabilities also incorporated Pitt and Hill’s restriction that the range of any triple had to be at least 2. This appears to follow from their assumption that fabrication involved first selecting a desired mean, and then choosing two other numbers such that they were approximately equal distances from that mean. In general, it is not clear this must be true: the triple (0,1,1) contains its rounded mean (1), even though it has range < 2. This has a material effect on the estimated (or calculated) probabilities for low values of lambda. Indeed, since the maximum probability without the range limitation is close to 0.8, the RTS data would actually “pass” Pitt and Hill’s first (crude) test.

Kenneth Hung evaluated the article as: Show full review    Rated 3.5 of 5.
A review of the statistical methodology adopted in the article
 Publication date: 26 January 2017 DOI: 10.14293/S2199-1006.1.SOR-STAT.AFHTWC.v1.ROMQSD Level of importance:     Rated 4 of 5. Level of validity:     Rated 3 of 5. Level of completeness:     Rated 4 of 5. Level of comprehensibility:     Rated 3 of 5. Competing interests: None Recommend this review: +1

[Co-authored with Madeleine Sheehan (m.sheehan@berkeley.edu), Yiyi Chen (yiyi.chen@berkeley.edu) and Yulin Liu (liuyulin101@berkeley.edu)]

# Introduction

In Pitt and Hill (2016), the authors examine the radiobiological data sets from 10 individuals in the same laboratory, along with data from three outside laboratories applying similar methods. As the data from one of the 10 individuals appears anomalous, the authors employ statistical techniques to determine if the anomaly could have happened at random.

In their analysis, the authors conduct hypothesis testing on four main metrics: triplicate mean count, mid-ratio count, terminal digit distribution, and equal last two digits distribution. For each metric, they conclude that the pattern seen in the one individual's data is too abnormal to have occurred by chance.

As part of the referee process, we carry out three main activities in reviewing the paper. We first examine and verify the assumptions the authors make on the distributions of the mid-ratio and terminal digits of triplicate counts (modeled as Poisson Binomial variables). We then attempt to replicate the tests conducted in the study, and compare our results with those in the paper. Finally, we apply an alternative statistical testing method to validate their approach.

Overall we agree with the authors’ analyses and conclusion that the anomaly in the data from that individual would be unlikely if the individual were reporting actual measurements. We see similar results from the replicated tests. The alternative method, a permutation test using the pooled data, gives a similar conclusion.

It should be noted, however, that there are a few minor discrepancies between our results and those in the paper. We have some questions about the assumptions made in the paper; addressing them would strengthen the conclusions.

# Poisson Assumption

As the authors explain, the data correspond to triplicate counts of cells. There are initially some cells in each dish — the authors report that the initial counts can be modeled as a Poisson distribution with an unknown parameter $$\lambda_0$$. Each of the three dishes is subject to the same treatment (radiation level) and the probability that a given cell survives to generate a colony is $$p$$. The resulting final counts make up the radiobiological data sets analyzed in the paper. The authors claim that the final counts of the cells in a triplicate follow a Poisson distribution with $$\lambda = \lambda_0 p$$, where $$\lambda_0$$ is constant within triplicates but may vary across triplicates.
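For readers who want the step behind $$\lambda = \lambda_0 p$$, the standard Poisson thinning argument can be written out as follows. It assumes that each cell survives independently of the others with the same probability $$p$$; the symbols $$N$$ (initial count), $$X$$ (surviving count) and the summation index $$m = n - k$$ are introduced here for illustration only. With $$N \sim \mathrm{Poisson}(\lambda_0)$$:

$$P(X = k) = \sum_{n \ge k} e^{-\lambda_0} \frac{\lambda_0^{n}}{n!} \binom{n}{k} p^{k} (1-p)^{n-k} = e^{-\lambda_0} \frac{(\lambda_0 p)^{k}}{k!} \sum_{m \ge 0} \frac{(\lambda_0 (1-p))^{m}}{m!} = e^{-\lambda_0 p} \frac{(\lambda_0 p)^{k}}{k!}$$

so that $$X \sim \mathrm{Poisson}(\lambda_0 p)$$ under these assumptions.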

As we do not have a radiobiology background, we do not have the domain expertise to validate the assumption that the triples actually come from a Poisson distribution with a common parameter $$\lambda$$. Three data points drawn from the distribution do not give us enough information to verify this assumption either.

This assumption also comes with implications. The common cell survival rate $$p$$ implies that the survival of an individual cell to start a colony is independent of the initial number of cells in the dish and of all other cells present.

Since we lack domain knowledge in radiobiology, we will refrain from commenting on the model. In the sections that follow, we will generally treat each set of triplicate counts as i.i.d. Poisson random variables, with the triples independent of each other. However, we will also provide alternative analysis through simulations and permutation tests, bypassing the assumption of a Poisson distribution and comparing the results from the RTS investigator against his / her peers. For example, we will perform a permutation test / hypergeometric test to see if the number of mean containing triplicates for the RTS investigator is significantly large in comparison to that of the other researchers.

# Triplicate Analysis

The authors first model the triplicate colony data. They assume that the three values are realizations of i.i.d. Poisson variables sharing a common parameter $$\lambda$$, which never exceeds 1000. The authors propose a method of calculating the probability that a triplicate generated by such a process includes its own rounded mean. The derivation of the formula (Appendix A) seems correct.

We implement the approach proposed by the authors, and regenerate Table 1 (we generate the probabilities for $$\lambda$$ ranging from 0 to 1999) in the paper. The results are consistent with the authors’ (except for $$\lambda = 13$$, which might be a typo). To further verify the results, we use a simulation-based approach to approximate Table 1, which gives very similar results.

## Mean Containing Triplicate Analysis

Replication. The authors claim that the RTS data contain a surprisingly large number of triples that include their own rounded mean. To determine if the high number of rounded mean containing triples may have occurred by chance, the authors first construct an upper bound on the p-value based on the Poisson model.

First, a note of clarification — the authors never explicitly define the term “complete triple.” Table 2 of the paper identifies that the RTS colony dataset has 1361 total triples, and 1343 complete triples. From our analysis, we recovered their definition of a complete triple to be one that has a gap ≥ 2 — there are 18 triples with gap < 2 in the data set. The authors base their first hypothesis test on the observation that, of the 1343 triples reported by the RTS, 690 contain their (rounded) mean. They are therefore omitting all gap < 2 triples from the test. The rationale behind this omission is not entirely clear. For the purpose of replicating the test, we exclude these triplicates in our calculation as well.

We successfully replicated the authors’ results for Hypothesis Test I. As pointed out in the paper, the method is intentionally conservative (it overestimates the p-value). The authors thus propose a heuristic method to get a less conservative, approximate p-value. They treat the event that a triplicate contains its own rounded mean as a Bernoulli trial with a known probability of success, so that the number $$k$$ of mean-containing triplicates among the $$n$$ triplicates in a data set follows a Poisson-binomial distribution. Since they do not know the parameter $$\lambda$$ for each triplicate, they use the mean of the triplicate as its $$\lambda$$. While this is the MLE of $$\lambda$$, treating $$\lambda$$ as known ignores uncertainty in the estimate.
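To make the Poisson-binomial step concrete, here is a minimal sketch that builds the distribution by adding one Bernoulli trial at a time. It is not the `poibin` implementation and not the authors' code; the function names are hypothetical, and the per-triplicate success probabilities would in practice come from the table of mean-inclusion probabilities.

```python
import math


def poisson_binomial_pmf(probs):
    """PMF of the number of successes in independent trials with probabilities probs."""
    pmf = [1.0]  # with zero trials, zero successes is certain
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # this trial fails
            nxt[k + 1] += mass * p     # this trial succeeds
        pmf = nxt
    return pmf


def upper_tail(probs, k):
    """P(number of successes >= k)."""
    return sum(poisson_binomial_pmf(probs)[k:])


def normal_z(probs, k):
    """z-score of k under the normal approximation to the Poisson binomial."""
    mean = sum(probs)
    var = sum(p * (1 - p) for p in probs)
    return (k - mean) / math.sqrt(var)
```

The update is quadratic in the number of trials, which is manageable for data sets of roughly a thousand triplicates.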

Nonetheless, we are able to use the poibin package in R to replicate the tests that produced Table 2 of the paper. In the body text of the paper, the authors suggest that they apply the test to the 1343 complete RTS samples. We believe that might be a typo in the text; our replication matches the authors’ results only if we include all 1361 samples. If we replicate Table 2 of Pitt and Hill (2016), but use only the 1343 complete (gap ≥ 2) samples, we get the results shown in Table 1, below. Discrepancies between this result and Table 2 of the paper are in bold. We will not replicate the Coulter count mean-containing triples analysis in this section.

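| Type | Investigator | #exps | #comp. / total | #mean | #expected | StDev | $$Z$$ | p ≥ k |
|---|---|---|---|---|---|---|---|---|
| Colonies | RTS | 128 | 1343 / 1361 | 690 | 214.9 | 13.28 | 35.73 | 3.66e-15 |
| Colonies | Others | 59 | 578 / 597 | 109 | 103.4 | 9.06 | 0.56 | 0.284 |
| Colonies | Outside lab | 1 | 49 / 50 | 3 | 7.8 | 2.55 | -2.07 | 0.989 |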

To sum up, both hypothesis testing methods are reasonable. However, Hypothesis Testing I gives us a conservative estimate of p-values, and it rejects the null hypothesis that the high number of rounded mean containing triples results from observing independent triples that are each i.i.d. Poisson. We think it might not be necessary to do another experiment using Hypothesis Testing II, especially when the method has a shaky assumption of parameters.

Permutation test (hypergeometric). As an alternative test, we circumvent the assumption that the samples are generated from a Poisson process, by conducting permutation tests on the same Coulter and colony data sets to verify the conclusion from the earlier section. For this, we first pool together all the sample data for Coulter and colony counts respectively. We then draw random samples from the pool, where the size of the sample is equal to that of the RTS investigator. We use two methods to assign p-values.

Method 1. We first run a simulation of 10,000 draws. In those 10,000 draws, we count the number (and proportion) of draws containing at least as many mean-containing triplicates as observed in the RTS investigator data set. After running the simulation a few times, we notice that this number (and proportion) is consistently 0. This preliminary test suggests that the RTS data are quite different from those of the other researchers.

Method 2. To get a more precise bound, we proceed to calculate the probability analytically using a hypergeometric distribution. In modeling the distribution, we define the drawing of a mean-containing triplicate a success event. The population $$N$$ is the total number of triplicates in the pooled samples (for Coulter and colony counts respectively), $$K$$ is equal to the total number of mean-containing triplicates, and $$n$$ is set to the sample size of the RTS investigator data set. We then calculate the probability of $$k$$ successes in $$n$$ draws, where $$k$$ is equal to the count of mean-containing triplicates in the RTS investigator’s data set.

| Type | Sample size (n) | Test statistic (k) | Probability |
|---|---|---|---|
| Coulter (all triplicates) | 1727 | 177 | 1.33e-13 |
| Coulter (excluding consecutive triplicates) | 1726 | 176 | 1.84e-13 |
| Colonies (all triplicates) | 1361 | 708 | 2.40e-43 |
| Colonies (excluding consecutive triplicates) | 1343 | 690 | 2.96e-48 |

The exact probabilities of obtaining the same number of mean-containing triplicates as in the RTS investigator’s dataset are exceedingly small ($$< 10^{-10}$$) for both Coulter and colony data sets, thereby supporting our observations in Method 1 and the earlier conclusion.
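The hypergeometric calculation in Method 2 can be sketched as follows. The function names are hypothetical, and because the pooled totals $$N$$ and $$K$$ are not listed in this review, the numbers in the example are placeholders chosen purely for illustration. The sketch returns both the point probability of exactly $$k$$ successes and the "$$k$$ or more" tail.

```python
import math


def hypergeom_pmf(N, K, n, k):
    """P(exactly k successes) when n items are drawn without replacement
    from a population of N items, K of which are successes."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)


def hypergeom_upper_tail(N, K, n, k):
    """P(k or more successes) under the same sampling scheme."""
    numerator = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return numerator / math.comb(N, n)


# Placeholder numbers, not taken from the data: 50 pooled triplicates,
# 10 of them mean-containing, and a sample of 5 with 3 mean-containing.
example_point = hypergeom_pmf(50, 10, 5, 3)
example_tail = hypergeom_upper_tail(50, 10, 5, 3)
```

Working with exact integers until the final division avoids overflow even when the binomial coefficients are very large.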

In their analysis, the authors exclude triplicates with adjacent counts, where the maximum and minimum of the triplicate count differ by at most one. Since the rationale behind this treatment is not entirely clear, we carry out the permutation test on both the full pooled samples and the samples excluding triplicates with adjacent counts. The results are only marginally different. We therefore maintain the original conclusion.

## Mid-Ratio Analysis

Similarly, the authors suggest that the RTS investigator observes a surprisingly high percentage of triples that contain a value close to their mean. A triple is said to contain a value close to its mean if the triple's mid-ratio falls in the interval [0.4, 0.6], where the mid-ratio is defined as the ratio of the difference between the middle and the smallest value in the triple to the difference between the largest and smallest value in the triple.

By simulation, we corroborate the authors’ finding that, for triples generated from a Poisson distribution with parameter $$\lambda$$ from 1 to 2000, the expected percentage of triples with mid-ratio in the interval [0.4, 0.6] never exceeds 0.26. The simulated results are shown in Figure 1.

As stated in the paper, for a collection of $$n$$ triples, the probability of observing $$k$$ or more triples with mid-ratios in the interval [0.4, 0.6] cannot be greater than the probability of $$k$$ or more successes in $$n$$ Bernoulli trials with probability of success $$p = 0.26$$.

Carrying out this test we find that 824 of 1362 colony count triples and 523 of 1729 Coulter count triples produced by the RTS investigator have a mid-ratio value in the interval [0.4, 0.6]. If we model each triple as a Bernoulli trial with probability of success $$p = 0.26$$, then the probability of observing 824 or more successes in 1362 colony count trials is 1.11e−16. The probability of observing 523 or more successes in 1729 Coulter count triplicates is 3.26e−5. These p-values are not reported in the paper. Both corroborate the finding that it is very unlikely that a Poisson process produced this many triplicates with mid-ratios in the interval [0.4, 0.6].
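The mid-ratio classification itself is simple to express in code. The sketch below is illustrative only: the function names are hypothetical, and skipping triples whose three values are all equal (where the ratio is undefined) is an assumption of this example rather than something stated in the paper. Exact fractions are used so that the interval endpoints 0.4 and 0.6 are not affected by floating-point rounding.

```python
from fractions import Fraction


def mid_ratio(triple):
    """(middle - smallest) / (largest - smallest), or None if all values are equal."""
    lo, mid, hi = sorted(triple)
    if hi == lo:
        return None
    return Fraction(mid - lo, hi - lo)


def is_near_mean(triple, lower=Fraction(2, 5), upper=Fraction(3, 5)):
    """True if the mid-ratio falls in the closed interval [0.4, 0.6]."""
    ratio = mid_ratio(triple)
    return ratio is not None and lower <= ratio <= upper


def near_mean_share(triples):
    """Fraction of triples with a defined mid-ratio that fall in [0.4, 0.6]."""
    defined = [t for t in triples if mid_ratio(t) is not None]
    return sum(is_near_mean(t) for t in defined) / len(defined)
```

For example, `mid_ratio((10, 15, 20))` is 1/2, so that triple counts as containing a value close to its mean.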

While these results seem far too unlikely to have happened by chance, we are concerned by the fact that Pitt and Hill decided to perform this test after observing “what appeared to be an unusual frequency of triples in RTS data containing a value close to their mean.” Other investigators besides RTS may well have trends that make their data look anomalous under these post hoc analyses. The argument would be stronger if the authors had decided what tests to perform a priori, before looking at the data and observing an “unusual” trend; or if they adjusted for their post hoc analysis explicitly. We believe such selection bias is not as problematic for terminal digit analysis, as it seems to be a more routine tool for investigation.

# Terminal Digit Analysis

Assumption Check. We start by validating the assumption that the terminal digits of a Poisson variable are approximately uniformly distributed. We approximate the probability distribution of the unit digit of a Poisson variable for $$\lambda$$ from 50 to 500. Each probability distribution is compared to a uniform distribution, and the total variation distance (a measure of how far apart two distributions are) is computed and shown in Figure 2. The total variation distance appears relatively small compared to 0.1, which supports the assumption.
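A check of this kind can be sketched as below. This is an illustration under stated assumptions rather than our original code: the Poisson probabilities are summed up to a cutoff about twelve standard deviations above $$\lambda$$ (the neglected tail is negligible at that point), and the function names are hypothetical.

```python
import math


def terminal_digit_distribution(lam):
    """Probability of each unit digit 0-9 for a Poisson(lam) variable."""
    upper = int(lam + 12 * math.sqrt(lam)) + 20  # truncation point for the sum
    probs = [0.0] * 10
    for k in range(upper + 1):
        log_pmf = k * math.log(lam) - lam - math.lgamma(k + 1)
        probs[k % 10] += math.exp(log_pmf)
    return probs


def tv_from_uniform(lam):
    """Total variation distance between the unit-digit distribution and uniform."""
    return 0.5 * sum(abs(p - 0.1) for p in terminal_digit_distribution(lam))
```

Evaluating `tv_from_uniform(lam)` over a grid of values gives the kind of curve described above; for very small $$\lambda$$ the distance is large, which is why terminal digit analysis is unsuitable for low counts.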

Replication. Given that the terminal digit of a Poisson distribution is approximately uniform, we replicate the chi-square goodness of fit test to assess the significance of non-uniformity in each of the data sets. Our conclusions match the findings of the paper: we reject the null hypothesis of uniformity for the RTS data sets, and fail to reject the null for all other data sets. While our conclusions are the same, we did find some reporting issues in Table 3 of the paper. Our reproduction of the paper's Table 3 is shown in our Table 3. The discrepancies between our terminal digit counts / goodness of fit calculations and the authors' are shown in bold.

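| Data set | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total | $$\chi^2$$ | p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RTS Coulter | 475 | 613 | 736 | 416 | 335 | 732 | 363 | 425 | 372 | 718 | 5185 | 466.9 | 7.06e-95 |
| Other Coulter | 261 | 311 | 295 | 259 | 318 | 290 | 298 | 283 | 331 | 296 | 2942 | 16.0 | 6.70e-02 |
| Outside Coulter 1 | 28 | 34 | 29 | 25 | 27 | 36 | 44 | 33 | 26 | 33 | 315 | 9.476 | 3.95e-01 |
| Outside Coulter 2 | 34 | 38 | 45 | 35 | 32 | 42 | 31 | 35 | 35 | 33 | 360 | 4.9 | 9.39e-01 |
| RTS Colony | 564 | 324 | 463 | 313 | 290 | 478 | 336 | 408 | 383 | 526 | 4085 | 200.7 | 2.33e-38 |
| Other Colony | 191 | 181 | 195 | 179 | 184 | 175 | 178 | 185 | 185 | 181 | 1834 | 1.79 | 9.94e-01 |
| Outside Colony | 21 | 9 | 15 | 16 | 19 | 19 | 9 | 19 | 11 | 12 | 150 | 12.1 | 2.06e-01 |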
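The goodness of fit calculation for a row of this table can be reproduced with a short sketch. It is not our original code: the function names are hypothetical, and the p-value uses the closed-form upper tail of the chi-squared distribution that holds for an odd number of degrees of freedom (here 9, from ten digit categories).

```python
import math


def chi2_uniform(counts):
    """Chi-squared statistic for observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def chi2_sf_odd_df(x, df):
    """Upper-tail probability of a chi-squared variable with odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only applies to odd df")
    total = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)  # first series term
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)  # next term: one more power of x over the next odd number
    return total


# Terminal digit counts (digits 0 to 9) from two rows of the table above.
other_colony = [191, 181, 195, 179, 184, 175, 178, 185, 185, 181]
rts_colony = [564, 324, 463, 313, 290, 478, 336, 408, 383, 526]

other_stat = chi2_uniform(other_colony)
other_p = chi2_sf_odd_df(other_stat, df=9)
rts_stat = chi2_uniform(rts_colony)
rts_p = chi2_sf_odd_df(rts_stat, df=9)
```

A statistics library's chi-squared survival function would serve equally well; the closed form is used here only to keep the sketch dependency-free.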

Permutation Test. To bypass the assumption that the triples can be modeled as Poisson random variables, we pool all the data from all investigators and ask “how unlikely is it that the RTS colony count terminal digits were as non-uniform as they were if they were drawn from the same distribution as all the investigators?” The procedure for performing a permutation test is outlined as follows:

1. Compute chi-squared goodness of fit test statistic for sample of interest

2. Pool all the data

3. Repeat the following sufficiently many times

• Randomly select a sample from the pooled data that is the same size as the test sample

• Compute the chi-squared goodness of fit test statistic for this new sample

4. Find the percentage of random samples where the random sample test statistic is larger than the original test statistic

The returned percentage approximates how likely it is that the RTS colony counts deviate more from uniformity if the counts are drawn from the same distribution as the pooled data from all the investigators. Due to limits in computing power, the number of random samples generated is limited and so is our accuracy in approximation. There were 4086 individual RTS colony counts and 5187 individual Coulter counts in the dataset. For 10,000 iterations of 4086 sample draws from the pooled investigator colony count data, the test consistently returned 0, indicating that the RTS colony counts are consistently farther from uniform than random samples from the pooled data. The same was true for 10,000 iterations of 5187 sample Coulter count data.
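The procedure outlined above can be sketched in a few lines. This is an illustration, not our original code: the function names and the default of 10,000 iterations are choices made for the example, and ties are counted as exceedances (a slightly conservative variant of step 4).

```python
import random


def chi2_uniform_digits(digits):
    """Chi-squared statistic of terminal digits (0-9) against a uniform distribution."""
    counts = [0] * 10
    for d in digits:
        counts[d] += 1
    expected = len(digits) / 10
    return sum((c - expected) ** 2 / expected for c in counts)


def permutation_p_value(sample_digits, pooled_digits, n_iter=10_000, seed=0):
    """Share of random same-size draws from the pool whose statistic is at least
    as large as that of the sample of interest."""
    rng = random.Random(seed)
    observed = chi2_uniform_digits(sample_digits)        # step 1
    size = len(sample_digits)
    exceed = 0
    for _ in range(n_iter):                              # step 3
        draw = rng.sample(pooled_digits, size)           # draw without replacement
        if chi2_uniform_digits(draw) >= observed:
            exceed += 1
    return exceed / n_iter                               # step 4
```

Here `pooled_digits` would be the list of terminal digits from all investigators (step 2) and `sample_digits` those of the investigator under scrutiny. A returned value of 0 only means that no draw reached the observed statistic in `n_iter` attempts, so the resolution is limited to about `1 / n_iter`.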

# Equal Digit Analysis

Assumption Check. We question the assumption that the last two digits of a three-plus digit Poisson variable are equal with probability 10%. To check, we performed a simulation; the results are plotted in Figure 3. When $$\lambda > 500$$, the probability is indeed about 10%, but for smaller $$\lambda$$, it is not.

We approximated the probability of the last two digits of a Poisson variable being equal, conditioned on the outcome bearing three digits. We plotted this in Figure 3. In the larger regime, where $$\lambda > 500$$, the probability does hover around 10%, but the same cannot be said of the smaller regime.

For the equal digit analysis proposed by Pitt and Hill (2016) to hold, one would at least hope for this probability to be significantly different from 12.3%. However, when $$\lambda$$ gets comparable to 100 or even smaller, this probability can rise as high as 12.5% (see Figure 3). Intuitively, if $$\lambda$$ is significantly smaller than 100, then getting 100 as an outcome is much likelier than any outcome greater than 100, driving the probability of having the last two digits being equal much higher. Since the true values of $$\lambda$$ are not known and quite a few counts are less than 100, it is plausible that honest reporting would result in equal terminal digits more than 10% of the time.
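The conditional probability discussed above can also be computed directly from the Poisson probabilities instead of by simulation. The sketch below is illustrative: the function name is hypothetical, the condition is taken to be "outcome of at least 100" (three or more digits), and the sum is truncated well above $$\lambda$$ where the remaining mass is negligible.

```python
import math


def equal_last_two_digits_prob(lam):
    """P(last two digits are equal | outcome >= 100) for a Poisson(lam) variable."""
    upper = int(lam + 12 * math.sqrt(lam)) + 200  # truncation point for the sums
    hit = total = 0.0
    for k in range(100, upper + 1):
        mass = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        total += mass
        if (k // 10) % 10 == k % 10:  # tens digit equals units digit
            hit += mass
    return hit / total
```

Evaluating this for $$\lambda$$ well below 100 shows how strongly the conditioning on three digits favors the outcome 100 itself, which has equal last two digits.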

# Conclusion

For the review process, we examine the underlying assumptions put forth by the authors. We attempt to replicate all four analyses (mid-ratio and mean counts of triplicates, percentage of terminal digits and equal last two digits) outlined in the paper. In addition, we apply an alternative testing method that does not make assumptions on the underlying distribution of the data set, in order to check the sensitivity of the conclusions to the assumption that count triples are i.i.d. Poisson.

Overall, our results agreed with those in the paper. Out of all the metrics tested, we feel that the test on percentage of terminal digits provides the strongest argument against the RTS investigator result. For the rest of the test metrics, the mid-ratio and mean counts rely on the strong assumption that each triplicate shares a common parameter $$\lambda$$, which we feel was inadequately justified. (The permutation test relaxes this assumption.) The expected percentage of equal last two digits of a number produced by a Poisson process, on the other hand, has a high level of inherent variability, especially if the Poisson parameter $$\lambda$$ that produced the data is small. Since the counts in the observed colony data are generally small (< 200), the percentage of equal last two digits fails to suggest any significance in the result with high level of confidence.

We feel the paper would be stronger if the authors explained in greater detail how they came to suspect RTS and how they decided to use the particular metrics they did.

Another area we did not investigate that will potentially provide additional insight is the patterns in the data of other individuals. It is possible that when examined individually, other investigators will also look anomalous for the same or other metrics. When the data are pooled together for the tests in the study, however, the individual anomalies might cancel each other out and get masked. It will therefore be useful for future studies / reviews to conduct further investigation in this area.

# Reproducibility

The source code for our analysis can be found below:

The original full datasets for the study can be found below:
https://osf.io/mdyw2/files/

# Acknowledgement and Declaration

We would like to thank the authors for making their data available and for publishing in an open journal.

This review was vetted by Philip B Stark. However, the work was conducted entirely by the authors, and the opinions expressed in this review are those of the authors.

Chris Hartgerink evaluated the article as: Show full review    Rated 3.5 of 5.
Interesting research, but in need of substantial rewriting.
 Publication date: 09 May 2016 DOI: 10.14293/S2199-1006.1.SOR-STAT.AFHTWC.v1.RBGXSM Level of importance:     Rated 4 of 5. Level of validity:     Rated 4 of 5. Level of completeness:     Rated 3 of 5. Level of comprehensibility:     Rated 2 of 5. Competing interests: None Recommend this review: +1

I read your paper with pleasure; only rarely are articles published in this field, so I am happy to review your paper. Despite this, the body of the paper often remained tangential to the topic of interest in the way it describes the material. This made it hard to follow sometimes and I had to actively remind myself of the idea that was being tested ("is the data anomalous?"). The abstract clearly states you were testing for anomalous data/potential data fabrication, but in the paper itself this is only mentioned in the introduction and discussion. Additionally, the structure of the paper is confusing due to the large number of subsections. The paper could benefit from substantial rewriting in order to ensure readability.

Also, the paper is mechanical and could benefit from a discussion of the ethics, or what happened in this specific case with these results.

Nonetheless, after ploughing through it I think the contents are interesting, and provide an addition to the field. Thank you for this interesting work.

I structure my review per section.

## Introduction

• You start by mentioning fabrication and falsification as fraud, but the scientific integrity literature mostly refrains from calling it fraud because it is a legal term that requires intent. I suggest retaining only the fabrication and falsification mentioned between parentheses.

• Please check the sentence "But statistical methods which can readily be used to identify potential data fabrication [4–10] are all but ignored by the ORI and the larger world.". It seems to want to state that they are ignored, but it reads as if they are not, which is rather confusing. If the meaning is the latter, then the statement is incorrect. Terminal Digit Analysis originated at the ORI, so it is an incorrect statement. Their recent funding also called explicit attention to statistical tools. If you find that not sufficient attention is paid to them, please adjust text to state so.

• At the end of paragraph 4, you mention the terminal digit analysis for count values. The terminal digit analysis only works on count variables when the mean value is sufficiently high, which you mention by stating that these have to be "inconsequential digits". However, it is unclear to readers unfamiliar that this is a severe limitation and when it applies (e.g., low mean count). Please make this more explicit though, in order to prevent misunderstanding of this method and potential misapplication.

• At the end of the final paragraph, you mention "Where that probability is less than some reasonable level" --- what would you think is a reasonable level for these kinds of analyses, considering the weighing of performance indicators such as positive predictive value and negative predictive value? This might be valuable to discuss at the end, even despite the overwhelmingly convincing values you found later in the paper.
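
To illustrate the terminal digit limitation raised above, here is a minimal sketch. It assumes the counts follow a Poisson distribution (an assumption made only for illustration), and the function names and the truncation cutoff are hypothetical. It computes the exact distribution of the last digit of a count and reports how far it departs from the uniform value of 0.1.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability P(X = k), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def terminal_digit_distribution(lam: float) -> list[float]:
    """Exact distribution of the last digit of a Poisson(lam) count."""
    # Truncate the sum far into the upper tail (illustrative cutoff, an assumption).
    upper = int(lam + 12 * math.sqrt(lam) + 20)
    probs = [0.0] * 10
    for k in range(upper + 1):
        probs[k % 10] += poisson_pmf(k, lam)
    # Renormalize to absorb the tiny truncated tail mass.
    total = sum(probs)
    return [p / total for p in probs]


def max_deviation_from_uniform(lam: float) -> float:
    """Largest absolute gap between any digit's probability and 0.1."""
    return max(abs(p - 0.1) for p in terminal_digit_distribution(lam))


if __name__ == "__main__":
    for lam in (3, 10, 30, 100, 300):
        dev = max_deviation_from_uniform(lam)
        print(f"mean = {lam:>3}: max deviation from uniform = {dev:.2e}")
```

Under this assumption the last digit is far from uniform at low means and approaches uniformity as the mean grows, so a uniformity test applied to low counts could flag genuine data.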

### The data studied

• The description of the data was somewhat confusing; it is not immediately clear that the RTS does not belong to the mentioned nine. Please add a sentence that makes this more explicit, maybe along the lines of "We inspected numerical data from the RTS, which was suspected of fabrication [see next point]. We compared this to other, presumably genuine, data. We had access to [...]"

• The RTS generated the data; do you mean fabricated the data?

• Is it correct that the PDFs are not included on the OSF and that there are private projects? (This serves as a check; the description states that the PDFs are also in the project.)

• I would recommend uploading the Excel spreadsheets in a more sustainable format, such as a CSV file or an ODS file.

• You mention the three methods used ("These patterns in RTS data included"). Please let the ordering of these mirror the structure of the paper, for consistency's sake.

### Data sets and probability model

• Only here did it become clear to me that you had the RTS plus the nine others from the lab, which were compared (see comment 1 for the section "The data studied").

• I suggest that the current text be incorporated into "The data studied", considering that it is mostly about the data itself and less about the probability model (which is explained more thoroughly in the following section anyway).

• I suggest making this the section where the probability models are actually explained (instead of in the analysis section below). That way, it is clear how the methods are set up prior to applying them.

### Analysis of triplicate data

This section and its subsections are structured in a confusing manner. There are too many subsections that try to explain too much. The colony and the Coulter counts are analyzed separately; however, the colony counts do not get their own subsection whereas the Coulter counts do. Additionally, the methods are described in this section as well, but they seem to differ between the Coulter and colony counts for the mid-ratio analyses (i.e., the colony counts are treated only superficially, the Coulter counts with full modeling). Improving the sectioning will help improve the structure (e.g., separate methods and results sections, with their respective subsections).

#### Initial mid-ratio review

• For the mean in triplicate values (both in colony and Coulter counts), you go to great lengths to model this. For the mid-ratio, you only model the data for the Coulter counts. Please adjust such that both are subjected to the same models.

#### The models for triplicate data

• This is a strong section when combined with the Appendix, but I miss an essential discussion of one assumption in these models: dispersion. Poisson distributions assume the mean is equal to the variance (i.e., both are equal to lambda). I frequently encounter problems with this in, for example, Poisson regressions, and it might also be a problem here. I inspected the presumably genuine data and noted that the dispersion (i.e., variance/mean) is 1.012 in same-lab colony counts and 7.456 in same-lab Coulter counts. As such, it seems to me that overdispersion might be a problem for the Coulter counts (note that there might still be a problem for the colony counts as well, given that the range of dispersion is 0-10.179). Overdispersion results in underestimation of the p-value you are interested in (i.e., it looks more significant).
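
A dispersion check of this kind can be sketched as follows. This is a minimal illustration, not the exact computation behind the figures quoted above: it assumes the dispersion is taken per triplet as sample variance divided by mean and then averaged, and the function names and triplet values are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by mean; close to 1 is expected under a Poisson model."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


def mean_dispersion(triplets):
    """Average per-triplet dispersion, skipping all-zero triplets."""
    return mean(dispersion_index(t) for t in triplets if sum(t) > 0)


if __name__ == "__main__":
    # Hypothetical triplicate counts, for illustration only.
    hypothetical_triplets = [(98, 105, 110), (47, 52, 58), (201, 190, 215)]
    for t in hypothetical_triplets:
        print(t, round(dispersion_index(t), 3))
    print("average dispersion:", round(mean_dispersion(hypothetical_triplets), 3))
```

Values well above 1 would point to overdispersion relative to the Poisson assumption. With only three observations per triplet the individual ratios are noisy, so an aggregate over many triplets is more informative than any single value.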

#### Hypothesis testing I - a nonparametric test

• This subsection provides an intuitive test, but given the next subsection, it seems somewhat redundant. This method provides an overestimate of the probability (as you mentioned), so the second step needs to be taken anyway if the p-value is large. Moreover, given the confusing structure of the section "Analysis of triplicate data", I think it is worth considering taking this out and reducing it to a sentence in "Hypothesis II". This will help make the section denser and reduce its complexity.

#### Hypothesis testing II - using lambda to obtain p-values

• As suggested earlier, there are a lot of method descriptions in here that could be described separately, which would provide an easier structure.

#### Hypothesis testing III - normal estimation of p-values

• This method was not mentioned previously in the paper and hence surprised me somewhat. Please mention it earlier (e.g., in a methods section). I also do not see the added value over the Poisson model itself, except for computational parsimony at the cost of an additional assumption. Please discuss this further if it is an important aspect of the paper (and if it is not, why does it get an entire subsection?).

#### Application to Coulter counts

• As mentioned before, why do Coulter counts get a separate subsection and colony do not? (I understand this is not the idea, but it seems like this is the case)

#### Probability model for mid-ratios

• Why is this section on mid-ratio probabilities so far removed from the other mid-ratio section?

• Methods are described, but no results are given? It almost seems as if this section is misplaced.

### Summary

This seems like a redundant recap to me; I suggest removing it.

### Discussions

#### Limitations

• This might be an appropriate place to discuss the limitations of the Poisson model, with regards to the dispersion mentioned earlier.

#### Power of statistics

• Although I agree in principle that statistics provide many fruitful avenues in this area, there are certain precautions that should be taken into account. For example, what does a result of anomalous data mean? Or, should the data simply be dredged for evidence of suspicions, as you seem to suggest in the final sentence? If this section is retained, a discussion of the dangers of statistics should also be included, in my opinion.

#### Are RTS data real

• Here you clearly show that the conclusions of these statistical methods are limited, by only concluding that the data collection methods are not equal. This would fit nicely in the previous subsection.

### Remedies

This section includes redundant subsections; these can easily be reduced to just one paragraph.

#### Data available

• Please add the R script and the spreadsheet to the OSF, considering that the data are also available there.

• Please add the direct OSF link to the project to prevent potential problems in finding the files.

## Minor remarks

• Please check for "psychological research" or other instances of "psychological"; it should be "psychology research"

• Table 1 has an uppercase lambda in the first column, please change to lowercase lambda for consistency

• Table 1, p-value for lambda = 13 shows .0301; I think this should be .301? If not, then this becomes a major remark and needs to be checked, considering it would be an aberrant result.

• Please ensure that all Roman variables are italicized for consistency.

• Please double check the appendix for typographic errors, for example:

• Aj,k = he

• to toccur

• P(A). tt a triplet

• 10^−9. o equivalently,