2020-11-06

Philip Stark

Small samples, unreasonable generalizations, and outliers: Gender bias in student evaluation of teaching or three unhappy students?

A paper which embodies several common statistical misconceptions

Average rating: | Rated 3 of 5. |

Level of importance: | Rated 1 of 5. |

Level of validity: | Rated 1 of 5. |

Level of completeness: | Rated 4 of 5. |

Level of comprehensibility: | Rated 5 of 5. |

Competing interests: | Two of us are co-authors of Boring et al.: Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research, 2016, one of the papers criticized by Uttl and Violo |


Bob Uttl (corresponding), Victoria Violo (2020)

10.14293/S2199-1006.1.SOR-EDU.APUTIGR.v1.RHKDLN

This work has been published open access under Creative Commons Attribution License **CC BY 4.0**, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at **www.scienceopen.com**.

Keywords: | student evaluation of teaching, SET, small samples, outliers, generalization, reproducibility |

Reviewed by Philip B. Stark, Amanda K. Glazer, and Anne Boring

Uttl and Violo [3] claim that the work of Boring et al. [1] and MacNell et al. [2] on student evaluations of teaching (SET) is “unwarranted and uninterpretable” because:

- The sample size is too small, likely leading to “low statistical power, inflated discovery rate, inflated effect size estimation, low replicability, low generalizability, and high sensitivity to outliers.”
- The MacNell et al. [2] study only includes one male and one female instructor, so it is not generalizable to all male and female instructors.
- The presence of three “outliers” produced incorrect results.

Two of us are co-authors of Boring et al. [1], one of the papers criticized by Uttl and Violo [3], so we have a *prima facie* conflict of interest in reviewing their paper, which we were invited to do by the editors of *ScienceOpen*. We agreed to review it, reluctantly, because it embodies several common statistical misconceptions. We hope to shed some light on these issues, which are broader than the current manuscript. In particular, Uttl and Violo [3] misuse statistical terms such as “power” and “outlier,” use an inappropriate parametric statistical test (Student’s *t*-test) where its assumptions are not satisfied, and consider a research question different from the one considered in the papers they criticize, blurring the distinction between inductive and deductive reasoning.

The MacNell et al. [2] experiment and data seem reliable to us, based on information in the paper itself and correspondence with one of the authors; Uttl and Violo [3] do not seem to disagree. At issue are the statistical analysis of the data and the adequacy of the data to support the conclusion: SET are subject to large biases that can be larger than any “signal” from teaching effectiveness.

As we explain below, while we agree with Uttl and Violo [3] that the analysis in MacNell et al. [2] is problematic, the alternative analysis by Uttl and Violo [3] commits the same errors—and new ones. In contrast, an appropriate nonparametric analysis such as that conducted by Boring et al. [1] supports the conclusion that bias in SET can swamp any information SET might contain about teaching effectiveness.

The subjects in the MacNell et al. [2] study are not a “sample” *per se*. They are a population (all students who took the course that semester) randomized into six groups, four of which are considered in the analysis. (These four are the students assigned to the four sections taught by the two graduate student instructors; the other two groups are students who were assigned to sections taught by the faculty member in charge of the entire course.)

The *power* of a statistical test against a particular alternative is the probability that the test correctly rejects the null hypothesis when that alternative hypothesis is true. A test has low power if it is unlikely to provide strong evidence that the null hypothesis is false when it is in fact false.
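As a toy illustration of this definition (not a model of the SET experiment), the sketch below computes the exact power of a one-sided binomial test of the null hypothesis p = 0.5. The alternative p = 0.8, the 5% significance level, and the group sizes are assumptions chosen only to show how power depends on the number of observations.

```python
from math import comb

def binom_tail(n, p, c):
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(n, alpha=0.05, p0=0.5):
    """Smallest c such that P(X >= c | p0) <= alpha.

    Returns n + 1 if even X = n is not extreme enough, meaning
    the test can never reject at this level.
    """
    for c in range(n + 2):
        if binom_tail(n, p0, c) <= alpha:
            return c

def power(n, p_alt, alpha=0.05, p0=0.5):
    """Probability that the level-alpha test rejects when the true p is p_alt."""
    return binom_tail(n, p_alt, critical_value(n, alpha, p0))

# Hypothetical alternative p = 0.8 at several hypothetical sample sizes.
for n in (5, 10, 20, 50):
    print(n, round(power(n, 0.8), 3))
```

Under these assumed values, the same alternative is detected far more reliably with 50 observations than with 5, which is the sense in which small groups limit power.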

The *false discovery* rate is the fraction of rejected null hypotheses that were rejected erroneously, that is, the number of erroneously rejected null hypotheses divided by the number of rejected null hypotheses. *Selective inference* occurs when analysts use the data to decide what analyses to perform, what models to fit, what results to report, and so on. A canonical example of selective inference is to report estimates of parameters only if the estimates are significantly different from zero. The “file-drawer effect,” i.e., submitting only positive results for publication, is another common example of selective inference. In general, selective inference inflates apparent effect sizes and increases the false discovery rate. Methods to control for selective inference are relatively new.
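A small simulation can make the effect of the “file-drawer” form of selective inference concrete. The sketch below is illustrative only: the number of studies, the group size, the true effect of 0.5 standard deviations, the fraction of true nulls, and the one-sided z-test are all assumptions, not features of any study discussed here.

```python
import random
import statistics

def simulate_studies(n_studies=4000, n_per_group=10, true_effect=0.5,
                     frac_null=0.5, z_crit=1.645, seed=0):
    """Simulate many small two-group studies with unit-variance responses.

    Each result is (is_null, estimate, rejected), where estimate is the
    difference in group means and rejected marks a one-sided z-test
    rejection at roughly the 5% level.
    """
    rng = random.Random(seed)
    se = (2.0 / n_per_group) ** 0.5  # standard error of the difference in means
    results = []
    for _ in range(n_studies):
        is_null = rng.random() < frac_null
        effect = 0.0 if is_null else true_effect
        estimate = rng.gauss(effect, se)
        results.append((is_null, estimate, estimate / se > z_crit))
    return results

def summarize(results):
    """Compare all real-effect studies with those that would be 'published'."""
    rejected = [r for r in results if r[2]]
    real = [r[1] for r in results if not r[0]]
    real_reported = [r[1] for r in rejected if not r[0]]
    return {
        # erroneous rejections divided by all rejections
        "false_discovery_proportion": sum(r[0] for r in rejected) / len(rejected),
        "mean_estimate_all_real": statistics.mean(real),
        "mean_estimate_reported_real": statistics.mean(real_reported),
    }

print(summarize(simulate_studies()))
```

Reporting only the significant estimates keeps just those that exceed the rejection threshold, so their average overstates the true effect even though each individual study was analyzed correctly.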

Small experimental groups are an issue for power: if the groups are too small, no test can have much power, so real effects are unlikely to be detected. Here, however, the power (of an appropriate test) evidently was not too low to find an effect: appropriate statistical analysis of the MacNell et al. [2] data yields strong statistical evidence that the null hypothesis is false, despite the small sample size.

Small experimental groups are not an issue for the false discovery rate, which is controlled by the significance level of tests, the number of tests performed, and selective inference, regardless of the sample size. Relying on small experimental groups may *indirectly* lead to false discoveries and inflated effect sizes through selective inference: it is cheaper to conduct small-group studies than larger studies, making it easier to conduct many studies and cherry-pick positive results. The inflation of the discovery rate and of estimated effect sizes is not *caused* by the size of the sample; rather, using fewer subjects just facilitates statistical malfeasance by making it less expensive to conduct many studies and discard those that do not produce positive results. There was no issue of cherry-picking in the MacNell et al. [2] study: it is evidently the only study of its kind. To the best of our knowledge, they ran the experiment only once. Moreover, it was a controlled, randomized experiment, quite rare in studying SET.

What matters for replicability and generalizability is not the number of subjects, but how the subjects were selected from the larger population to which one seeks to generalize. (Selective inference also tends to undermine replicability.) In the present matter, neither the students nor the instructors were selected at random from any larger population of students or instructors, so there is no statistical basis for extrapolating quantitative estimates based on this experiment to other populations. Again, this has nothing to do with “sample size” (the number of students in the experiment), only with how those students came to be in the experiment.

MacNell et al. [2] and Boring et al. [1] do not “generalize” inductively, for instance by claiming that the biases observed in this experiment occur to the same degree in every classroom. Indeed, Boring et al. [1] point out that the effect of gender seems to vary across disciplines, based on data from the other dataset examined by Boring et al. [1], a natural experiment at the French university Sciences Po comprising 23,000 observations of SET scores.

However, the results generalize *deductively* in the same way that observing a single black swan refutes the hypothesis that all swans are white. The MacNell et al. [2] experiment shows that biases in SET can be large enough to obscure large differences in teaching effectiveness. Because that is true in the MacNell et al. [2] experiment, the assumption that such biases are always negligible is false, shifting the burden onto institutions to show that bias in SET is negligible for each class, each semester, and each instructor—if SET are to be relied upon for employment decisions.

Uttl and Violo [3] confuse “influential observation” with “outlier.” An “outlier” is an observation that is drawn from a different distribution than the bulk of the data, for instance, because it is contaminated by a gross error resulting from a typo, from misreading an instrument, or because the measuring apparatus was hit by a cosmic ray.

There is no comparable notion of “outlier” here. (Outlier tests are used to detect gross errors and measurements that are contaminated in some way. Implicitly, outlier tests ask, “would this datum be unlikely if it came from the same underlying distribution as the rest?” The outlier identification rule Uttl and Violo [3] used presumes that the “true” distribution of the data is Gaussian: observations that would be unlikely under a Gaussian model because they are too many standard deviations from the mean are flagged as suspect. Only about 5% of a Gaussian distribution is two or more standard deviations from the mean, but for other distributions, up to 25% of the probability can be two or more standard deviations from the mean. Ironically, applying the rule Uttl and Violo [3] adopted to data that genuinely come from a Gaussian distribution would be expected to discard 5% of the data. They discard 7% of the data. Given that each datum represents 1/43 = 2.3% of the data, this is gosh darned close to what would be expected if the data were indeed Gaussian and there were no genuine outliers.) It is not as if the three students in question were not actually enrolled in that class and their responses were accidentally commingled with the responses of enrolled students, nor that there was unusually large error in measuring those students’ responses.
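The percentages in this argument can be checked with a few lines of arithmetic. The sketch below assumes a two-standard-deviation cutoff, as the discussion above implies, and uses a three-point distribution that we chose ourselves as one example attaining the 25% figure (the Chebyshev bound); neither the function name nor that distribution comes from the papers under discussion.

```python
from math import erfc, sqrt

def gaussian_two_sided_tail(k):
    """P(|Z| >= k) for a standard Gaussian Z."""
    return erfc(k / sqrt(2))

# Fraction of a Gaussian distribution at least 2 SD from the mean.
gauss_tail = gaussian_two_sided_tail(2)  # about 0.0455, i.e. "about 5%"

# Illustrative three-point distribution (our choice): -2, 0, +2.
values = [-2.0, 0.0, 2.0]
probs = [0.125, 0.75, 0.125]
mean = sum(v * p for v, p in zip(values, probs))
sd = sqrt(sum(p * (v - mean) ** 2 for v, p in zip(values, probs)))
# Probability mass lying two or more SD from the mean.
heavy_tail = sum(p for v, p in zip(values, probs) if abs(v - mean) >= 2 * sd)

n_students = 43                            # class size stated in the text
expected_flagged = n_students * gauss_tail # expected discards under a Gaussian
observed_fraction = 3 / n_students         # three students were discarded
one_student = 1 / n_students               # share of the data per student

print(f"Gaussian tail beyond 2 SD: {gauss_tail:.4f}")
print(f"Three-point tail beyond 2 SD: {heavy_tail:.2f}")
print(f"Expected flagged out of 43: {expected_flagged:.2f}")
print(f"Observed fraction: {observed_fraction:.4f}; one student: {one_student:.4f}")
```

With 43 students, the rule would be expected to flag about two observations even if the data were exactly Gaussian, so flagging three is within one student of that expectation.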

The SET ratings of those students are not observations from some different distribution. They are the ratings actually given by 7% (3 of 43 students) of the students enrolled in the class. We are not aware of any university that eliminates extreme ratings before calculating mean evaluation scores. The scores of those three students were included in the mean SET scores “in real life.” The data are *influential* but they are not *outliers*.

A relatively small number of students with extreme views can drive large differences in mean scores. That does not mean that those students’ scores are outliers, somehow erroneous, or that there is a statistical or substantive justification for discarding them.

Figure 1 plots the distribution of overall satisfaction scores by student and instructor gender, for the Sciences Po data examined by Boring et al. [1]. Here too, the gender bias is driven by a small percentage of students, male students who give male instructors high ratings. The percentage of female students giving excellent scores to male instructors is 32.2%, compared to 41.9% of male students rating male instructors; the difference amounts to 4.2% of the student responses. That 4.2% creates statistically significant differences in the average scores received by male and female instructors. These are the scores instructors received and that administrators use to assess teaching effectiveness. (The influence of extreme views may be even larger when SET are not mandatory if students with extreme views are more likely to return SET than students with more moderate views.)

Because of the randomized design, one can calculate the chance that the three students who gave the lowest scores all would have been in the apparent female group if the treatment (instructor name) had no effect. The answer is about 7%: modest evidence of bias even before considering the numerical scores.
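A calculation of this kind can be sketched as a hypergeometric probability. The code below assumes, as a simplification, that every student was equally likely to land in the apparent-female group and that the group's size was fixed in advance. The group sizes in the loop are hypothetical placeholders, since the size is not stated here; only the total of 43 students comes from the text.

```python
from math import comb

def prob_all_lowest_in_group(n_total, group_size, n_lowest=3):
    """Chance that the n_lowest lowest-scoring students all fall in one
    pre-specified group of size group_size, if group membership is a
    simple random draw from n_total students (no treatment effect)."""
    return comb(group_size, n_lowest) / comb(n_total, n_lowest)

n_total = 43  # students in the four sections analyzed (from the text)
# Hypothetical group sizes, shown only to illustrate the sensitivity.
for group_size in (17, 18, 19, 20, 21):
    p = prob_all_lowest_in_group(n_total, group_size)
    print(f"group of {group_size}: {p:.3f}")
```

The probability depends on the group size, so reproducing the exact figure requires the actual allocation used in the experiment.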

As mentioned above, institutions generally report and use average SET scores: they do not remove extreme values before computing the average. The three scores in question would be included in the average, just as they were in the analyses of MacNell et al. [2] and Boring et al. [1]. They should be taken into account, not discarded.

That said, the parametric *t*-test Uttl and Violo [3] and MacNell et al. [2] used is statistically inappropriate, as Boring et al. [1] point out:

- SET scores are on a Likert scale. They are ordinal categorical values, not a continuous linear scale. In particular, they do not follow a normal distribution. But the parametric *t*-test assumes that the scores in the treatment and control groups follow normal distributions with the same variance, but possibly different means.
- The parametric *t*-test assumes that the various “treatment” groups are independent random samples from two normal populations with the same variance. But in the MacNell et al. [2] experiment, they are not: they are an entire population randomized into 6 groups. As a result, the groups are dependent: if a student is in one group, that student is not in some other group. The scores are not normally distributed, and the groups are not independent.

The permutation tests employed by Boring et al. [1] have none of those flaws: the permutation *t*-test honors the actual randomization that was used to allocate students to class sections. The probabilities come from the randomization actually performed, not from counterfactual assumptions involving sampling independently from hypothetical populations, which Student’s *t*-test assumes. The permutation test makes no assumption whatsoever about the distribution of responses; the inference is conditional on the observed values. In particular, while the parametric *t*-test is sensitive to extreme observations, the permutation *t*-test correctly takes the observed values into account, no matter how extreme they are.
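The general idea of a two-group permutation test can be sketched as follows. This is a minimal illustration, not the code or the exact procedure of Boring et al. [1]: it uses the absolute difference in group means as the test statistic rather than the *t* statistic, approximates the permutation distribution by random reshuffling, and ignores the section structure of the real experiment. The ratings in the usage example are made up.

```python
import random

def permutation_p_value(treatment, control, reps=10_000, seed=0):
    """Two-sided Monte Carlo permutation P-value for a difference in means.

    Group labels are reshuffled at random while the observed responses are
    held fixed, mimicking re-running the random assignment under the
    hypothesis that the label has no effect on any response.
    """
    rng = random.Random(seed)
    pooled = list(treatment) + list(control)
    n_t = len(treatment)

    def mean_diff(xs):
        first, rest = xs[:n_t], xs[n_t:]
        return sum(first) / len(first) - sum(rest) / len(rest)

    observed = abs(mean_diff(pooled))
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # one random reassignment of labels
        # Small tolerance so ties are not lost to floating-point rounding.
        if abs(mean_diff(pooled)) >= observed - 1e-12:
            hits += 1
    # Counting the observed assignment itself keeps the P-value valid.
    return (hits + 1) / (reps + 1)

# Hypothetical Likert-scale ratings, invented for illustration only.
apparent_male = [5, 4, 4, 5, 3, 4, 5, 4]
apparent_female = [4, 3, 1, 4, 2, 3, 4, 1]
print(permutation_p_value(apparent_male, apparent_female))
```

Because the responses are held fixed and only the labels move, extreme ratings are handled automatically: they appear in every reshuffled data set just as they do in the observed one.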

In short, the *P*-values from Student’s *t*-test are uninterpretable for this experiment: they are not the probability of observing differences as large or larger than those observed, on the assumption that the instructor’s name has no effect on ratings. Rather, they are the probability of observing differences as large or larger than those observed on the assumption that the ratings are independent random samples from Gaussian distributions with equal variances and equal means. That has nothing whatsoever to do with the experiment conducted by MacNell et al. [2].

Boring et al. [1] show that the parametric *t*-test used by MacNell et al. [2] and Uttl and Violo [3] finds statistical significance where there is none and misses actual statistical significance. The conclusions from the parametric and permutation tests are summarized in Table 8 of Boring et al. [1]. Table 1 gives the results in simpler, color-coded form. In every case where the permutation *t*-test rejects the null hypothesis at significance level 5%, the parametric *t*-test does not, and vice versa. This results from the failure of the assumptions of the parametric *t*-test.

Uttl and Violo [3] rely on the parametric *t*-test despite the fact that its assumptions contradict how the experiment was conducted and despite the fact that it has been shown to yield spurious results.

Uttl and Violo [3] are correct that there is a problem with the statistical analysis in [2], but not the one they identify. Indeed, their analysis repeats the problem: the parametric *t*-test does not produce a meaningful *P*-value for the experiment, because the experiment does not satisfy the assumptions of the test. Contrary to the claims of Uttl and Violo [3], the three influential observations are not “outliers” in the sense in which the term is used in statistics: they are the actual scores given by three students in the class, and there is no statistical basis for excluding those students’ scores. The fact that all three students were in the “female persona treatment group” is itself evidence of gender bias: the chance of that occurring is roughly 7% if the instructor’s name had no effect on SET scores.

While small experimental groups can result in low power, *a posteriori* the MacNell et al. [2] experiment included enough subjects to detect an effect, in part because the effect was evidently so large. Small experimental groups do not *ipso facto* elevate the false discovery rate nor inflate effect sizes: *selective inference* does both, but there was no selective inference in this case.

Uttl and Violo [3] are correct that the MacNell et al. [2] study does not “generalize” statistically to other students and classes, but the study does generalize logically: it shows that bias is sometimes substantial. That is a serious problem.

[1] A. Boring, K. Ottoboni, and P. Stark. Student evaluations of teaching (mostly) do not measure teaching effectiveness. *ScienceOpen Research*, 2016.

[2] L. MacNell, A. Driscoll, and A. N. Hunt. What’s in a name: Exposing gender bias in student ratings of teaching. *Innovative Higher Education*, 40:291–303, 2015.

[3] B. Uttl and V. Violo. Small samples, unreasonable generalizations, and outliers: Gender bias in student evaluation of teaching or three unhappy students? *ScienceOpen Research*, 2020.
