Review of 'Small samples, unreasonable generalizations, and outliers: Gender bias in student evaluation of teaching or three unhappy students?'

Reviewer: Philip Stark

Inviter role(s): EDITOR

Publication date of review: 2020-11-06

A paper which embodies several common statistical misconceptions

Average rating: 3 of 5
Level of importance: 1 of 5
Level of validity: 1 of 5
Level of completeness: 4 of 5
Level of comprehensibility: 5 of 5
Competing interests: Two of us are co-authors of Boring et al., "Student evaluations of teaching (mostly) do not measure teaching effectiveness," ScienceOpen Research, 2016, one of the papers criticized by Uttl and Violo.

Reviewed article

Small samples, unreasonable generalizations, and outliers: Gender bias in student evaluation of teaching or three unhappy students?

Bob Uttl, Victoria Violo (2020)

In a widely cited and widely talked about study, MacNell et al. (2015) examined SET ratings of one female and one male instructor, each teaching two sections of the same online course, one section under their true gender and the other section under false/opposite gender. MacNell et al. concluded that students rated perceived female instructors more harshly than perceived male instructors, demonstrating gender bias against perceived female instructors. Boring, Ottoboni, and Stark (2016) re-analyzed MacNell et al.'s data and confirmed their conclusions. However, the design of MacNell et al. study is fundamentally flawed. First, MacNell et al. section sample sizes were extremely small, ranging from 8 to 12 students. Second, MacNell et al. included only one female and one male instructor. Third, MacNell et al.'s findings depend on three outliers -- three unhappy students (all in perceived female conditions) who gave their instructors the lowest possible ratings on all or nearly all SET items. We re-analyzed MacNell et al.'s data with and without the three outliers. Our analyses showed that the gender bias against perceived female instructors disappeared. Instead, students rated the actual female vs. male instructor higher, regardless of perceived gender. MacNell et al.'s study is a real-life demonstration that conclusions based on extremely small sample-sized studies are unwarranted and uninterpretable.

Preprint version 1

Review information

DOI: 10.14293/S2199-1006.1.SOR-EDU.APUTIGR.v1.RHKDLN

License:

This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.

Keywords: student evaluation of teaching, SET, small samples, outliers, generalization, reproducibility

Review text

Review of Uttl and Violo (2020) Small samples, unreasonable generalizations, and outliers: Gender bias in student evaluation of teaching or three unhappy students?

Reviewed by Philip B. Stark, Amanda K. Glazer, and Anne Boring

Uttl and Violo [3] claim that the work of Boring et al. [1] and MacNell et al. [2] on student evaluations of teaching (SET) is “unwarranted and uninterpretable” because:

The sample size is too small, likely leading to “low statistical power, inflated discovery rate, inflated effect size estimation, low replicability, low generalizability, and high sensitivity to outliers.”
The MacNell et al. [2] study only includes one male and one female instructor, so it is not generalizable to all male and female instructors.
The presence of three “outliers” produced incorrect results.

Two of us are co-authors of Boring et al. [1], one of the papers criticized by Uttl and Violo [3], so we have a prima facie conflict of interest in reviewing their paper, which we were invited to do by the editors of ScienceOpen. We agreed to review it, reluctantly, because it embodies several common statistical misconceptions. We hope to shed some light on these issues, which are broader than the current manuscript. In particular, Uttl and Violo [3] misuse statistical terms such as “power” and “outlier,” use an inappropriate parametric statistical test (Student’s t-test) where its assumptions are not satisfied, and consider a research question different from the one considered in the papers they criticize, blurring the distinction between inductive and deductive reasoning.

The MacNell et al. [2] experiment and data seem reliable to us, based on information in the paper itself and correspondence with one of the authors; Uttl and Violo [3] do not seem to disagree. At issue are the statistical analysis of the data and the adequacy of the data to support the conclusion: SET are subject to large biases that can be larger than any “signal” from teaching effectiveness.

As we explain below, while we agree with Uttl and Violo [3] that the analysis in MacNell et al. [2] is problematic, the alternative analysis by Uttl and Violo [3] commits the same errors—and new ones. In contrast, an appropriate nonparametric analysis such as that conducted by Boring et al. [1] supports the conclusion that bias in SET can swamp any information SET might contain about teaching effectiveness.

1 Sample Size

The subjects in the MacNell et al. [2] study are not a “sample” per se. They are a population (all students who took the course that semester) randomized into six groups, four of which are considered in the analysis. (These four are the students assigned to the four sections taught by the two graduate student instructors; the other two groups are students who were assigned to sections taught by the faculty member in charge of the entire course.)

The power of a statistical test against a particular alternative is the probability that the test correctly rejects the null hypothesis when that alternative hypothesis is true. A test has low power if it is unlikely to provide strong evidence that the null hypothesis is false when it is in fact false.

The false discovery rate is the fraction of rejected null hypotheses that were rejected erroneously, that is, the number of erroneously rejected null hypotheses divided by the number of rejected null hypotheses. Selective inference occurs when analysts use the data to decide what analyses to perform, what models to fit, what results to report, and so on. A canonical example of selective inference is to report estimates of parameters only if the estimates are significantly different from zero. The “file-drawer effect,” i.e., submitting only positive results for publication, is another common example of selective inference. In general, selective inference inflates apparent effect sizes and increases the false discovery rate. Methods to control for selective inference are relatively new.
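The inflation of apparent effect sizes by selective inference can be illustrated with a small simulation. The sketch below is not taken from the review or from the papers it discusses: the true effect, the group size, the known unit variance, and the one-sided z-test are all assumptions chosen for illustration. It simulates many small studies of the same true effect and compares the average estimate over all studies with the average over only those that reach significance.

```python
import random
from statistics import NormalDist, mean

def simulate_selective_reporting(true_effect=0.2, n=10, n_studies=2000,
                                 alpha=0.05, seed=1):
    """Return (mean estimate over all studies, mean over 'significant' ones).

    Assumptions (illustrative only): each study observes n independent
    N(true_effect, 1) values and runs a one-sided z-test with known variance.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    all_estimates, reported = [], []
    for _ in range(n_studies):
        estimate = mean(rng.gauss(true_effect, 1.0) for _ in range(n))
        all_estimates.append(estimate)
        # The "file drawer": keep the estimate only if it is significant.
        if estimate * n ** 0.5 > z_crit:
            reported.append(estimate)
    return mean(all_estimates), mean(reported)

mean_all, mean_reported = simulate_selective_reporting()
```

In this setup every reported estimate must exceed z_crit / sqrt(n), about 0.52, so the reported average necessarily sits well above the assumed true effect of 0.2, while the average over all studies stays close to it. The inflation comes from the selection step.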

Small experimental groups are an issue for power: if the sample size is too small, no test can have much power, so real effects are unlikely to be noticed. The power (of an appropriate test) evidently was not too low to find an effect: appropriate statistical analysis of the MacNell et al. [2] data yields strong statistical evidence that the null hypothesis is false, despite the small sample size.
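How power depends jointly on group size and effect size can be sketched with a textbook calculation. The example assumes a one-sided two-sample z-test on independent Gaussian samples with known, equal variance; that idealization is ours, is not the design of the MacNell et al. experiment, and is not the test the reviewers recommend. The function name and the effect sizes are hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect_in_sd, n_per_group, alpha=0.05):
    """Power of a one-sided two-sample z-test with equal group sizes.

    Assumes independent Gaussian samples with known, equal variance: a
    textbook idealization, not the design of the experiment discussed here.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Expected value of the z statistic under the alternative.
    shift = effect_in_sd * (n_per_group / 2) ** 0.5
    return 1 - std_normal.cdf(z_crit - shift)

moderate_effect_small_groups = z_test_power(0.5, 10)
moderate_effect_large_groups = z_test_power(0.5, 100)
large_effect_small_groups = z_test_power(1.5, 10)
```

Under these assumptions, a moderate effect of half a standard deviation is detected only about 30% of the time with 10 subjects per group, whereas a large effect is detected more than 95% of the time with the same 10 subjects per group. Small groups limit power against small effects, not against large ones.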

Small experimental groups are not an issue for the false discovery rate, which is controlled by the significance level of tests, the number of tests performed, and selective inference, regardless of the sample size. Relying on small experimental groups may indirectly lead to false discoveries and inflated effect sizes through selective inference: it is cheaper to conduct small-group studies than larger studies, making it easier to conduct many studies and cherry-pick positive results. The inflation of the discovery rate and of estimated effect sizes is not caused by the size of the sample; rather, using fewer subjects just facilitates statistical malfeasance by making it less expensive to conduct many studies and discard those that do not produce positive results. There was no issue of cherry-picking in the MacNell et al. [2] study: it is evidently the only study of its kind. To the best of our knowledge, they ran the experiment only once. Moreover, it was a controlled, randomized experiment, quite rare in studying SET.

What matters for replicability and generalizability is not the number of subjects, but how the subjects were selected from the larger population to which one seeks to generalize. (Selective inference also tends to undermine replicability.) In the present matter, neither the students nor the instructors were selected at random from any larger population of students or instructors, so there is no statistical basis for extrapolating quantitative estimates based on this experiment to other populations. Again, this has nothing to do with “sample size” (the number of students in the experiment), only with how those students came to be in the experiment.

MacNell et al. [2] and Boring et al. [1] do not “generalize” inductively, for instance by claiming that the biases observed in this experiment occur to the same degree in every classroom. Indeed, Boring et al. [1] point out that the effect of gender seems to vary across disciplines, based on data from the other dataset examined by Boring et al. [1], a natural experiment at the French university Sciences Po comprising 23,000 observations of SET scores.

However, the results generalize deductively in the same way that observing a single black swan refutes the hypothesis that all swans are white. The MacNell et al. [2] experiment shows that biases in SET can be large enough to obscure large differences in teaching effectiveness. Because that is true in the MacNell et al. [2] experiment, the assumption that such biases are always negligible is false, shifting the burden onto institutions to show that bias in SET is negligible for each class, each semester, and each instructor—if SET are to be relied upon for employment decisions.

2 Outliers

Uttl and Violo [3] confuse “influential observation” with “outlier.” An “outlier” is an observation that is drawn from a different distribution than the bulk of the data, for instance, because it is contaminated by a gross error resulting from a typo, from misreading an instrument, or because the measuring apparatus was hit by a cosmic ray.

There is no comparable notion of “outlier” here. (Outlier tests are used to detect gross errors and measurements that are contaminated in some way. Implicitly, outlier tests ask, “would this datum be unlikely if it came from the same underlying distribution as the rest?” The outlier identification rule Uttl and Violo [3] used presumes that the “true” distribution of the data is Gaussian: observations that would be unlikely under a Gaussian model because they are too many standard deviations from the mean are flagged as suspect. Only about 5% of a Gaussian distribution is two or more standard deviations from the mean, but for other distributions, up to 25% of the probability can be two or more standard deviations from the mean. Ironically, applying the rule Uttl and Violo [3] adopted to data that genuinely come from a Gaussian distribution would be expected to discard 5% of the data. They discard 7% of the data. Given that each datum represents 1/43 = 2.3% of the data, this is gosh darned close to what would be expected if the data were indeed Gaussian and there were no genuine outliers.) It is not as if the three students in question were not actually enrolled in that class and their responses were accidentally commingled with the responses of enrolled students, nor that there was unusually large error in measuring those students’ responses.
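The two-standard-deviation figures quoted above can be checked directly. The sketch below computes the Gaussian tail probability, the general bound given by Chebyshev's inequality, and the shares 1/43 and 3/43. The three-point distribution is a hypothetical example of ours, included only to show that the 25% bound can actually be attained.

```python
from statistics import NormalDist

# Probability that a Gaussian observation lies 2 or more SDs from the mean.
gaussian_two_sd_tail = 2 * (1 - NormalDist().cdf(2.0))

# Chebyshev's inequality: for any distribution with finite variance,
# P(|X - mean| >= k * SD) <= 1 / k**2.
chebyshev_bound = 1 / 2 ** 2

# A three-point distribution that attains the bound: values -2, 0, 2 with
# probabilities 1/8, 3/4, 1/8 (mean 0, variance 1).
values = [-2.0, 0.0, 2.0]
probs = [0.125, 0.75, 0.125]
mu = sum(v * p for v, p in zip(values, probs))
sd = sum(p * (v - mu) ** 2 for v, p in zip(values, probs)) ** 0.5
mass_beyond_two_sd = sum(p for v, p in zip(values, probs)
                         if abs(v - mu) >= 2 * sd)

# Shares quoted in the text: one student of 43, and three of 43.
share_one_student = 1 / 43
share_three_students = 3 / 43
```

The Gaussian tail comes to roughly 4.6%, which the text rounds to 5%; the three-point example places exactly 25% of its probability two standard deviations from the mean.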

The SET ratings of those students are not observations from some different distribution. They are the ratings actually given by 7% (3 of 43 students) of the students enrolled in the class. We are not aware of any university that eliminates extreme ratings before calculating mean evaluation scores. The scores of those three students were included in the mean SET scores “in real life.” The data are influential but they are not outliers.

A relatively small number of students with extreme views can drive large differences in mean scores. That does not mean that those students’ scores are outliers, somehow erroneous, or that there is a statistical or substantive justification for discarding them.

Figure 1 plots the distribution of overall satisfaction scores by student and instructor gender, for the Sciences Po data examined by Boring et al. [1]. Here too, the gender bias is driven by a small percentage of students, male students who give male instructors high ratings. The percentage of female students giving excellent scores to male instructors is 32.2%, compared to 41.9% of male students rating male instructors; the difference amounts to 4.2% of the student responses. That 4.2% creates statistically significant differences in the average scores received by male and female instructors. These are the scores instructors received and that administrators use to assess teaching effectiveness. (The influence of extreme views may be even larger when SET are not mandatory if students with extreme views are more likely to return SET than students with more moderate views.)

Figure 1: Distribution of overall satisfaction scores by student and teacher gender. Data from a natural experiment at Sciences Po. See [1].

Because of the randomized design, one can calculate the chance that the three students who gave the lowest scores all would have been in the apparent female group if the treatment (instructor name) had no effect. The answer is about 7%: modest evidence of bias even before considering the numerical scores.
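The logic of this calculation can be sketched as a counting argument: if the name had no effect, the same three students would have given the lowest scores whichever group they were assigned to, so the question is how often random assignment alone would put all three in one group. The sketch below assumes students are assigned to a single group completely at random. It ignores the stratification by instructor and the treatment of nonresponders in the actual experiment, so it is not intended to reproduce the 7% figure. The total of 43 respondents comes from the text; the group sizes are hypothetical.

```python
from math import comb

def prob_all_in_group(n_total, n_group, k):
    """Chance that k specified students all land in a group of size n_group
    when n_total students are assigned to groups completely at random."""
    return comb(n_group, k) / comb(n_total, k)

# 43 respondents is taken from the text; the group sizes below are
# hypothetical, used only to show how the chance varies with group size.
chances = {n_group: prob_all_in_group(43, n_group, 3)
           for n_group in (18, 20, 22)}
```

For group sizes in this range the chance is on the order of 7% to 12%, the same order of magnitude as the figure reported above.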

As mentioned above, institutions generally report and use average SET scores: they do not remove extreme values before computing the average. The three scores in question would be included in the average, just as they were in the analyses of MacNell et al. [2] and Boring et al. [1]. They should be taken into account, not discarded.

That said, the parametric t-test Uttl and Violo [3] and MacNell et al. [2] used is statistically inappropriate, as Boring et al. [1] point out:

SET scores are on a Likert scale. They are ordinal categorical values, not a continuous linear scale. In particular, they do not follow a normal distribution. But the parametric t-test assumes that the scores in the treatment and control groups follow normal distributions with the same variance, but possibly different means.
The parametric t-test assumes that the various “treatment” groups are independent random samples from two normal populations with the same variance. But in the MacNell et al. [2] experiment, they are not: they are an entire population randomized into 6 groups. As a result, the groups are dependent: if a student is in one group, that student is not in some other group. The scores are not normally distributed, and the groups are not independent.

The permutation tests employed by Boring et al. [1] have none of those flaws: the permutation t-test honors the actual randomization that was used to allocate students to class sections. The probabilities come from the randomization actually performed, not from counterfactual assumptions involving sampling independently from hypothetical populations, which Student’s t-test assumes. It makes no assumption whatsoever about the distribution of responses; the inference is conditional on the observed values. In particular, while the parametric t-test is sensitive to extreme observations, the permutation t-test correctly takes the observed values into account, no matter how extreme they are.
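The idea of a permutation test can be sketched in a few lines. The example below is a simplified illustration, not the analysis of Boring et al. [1]: it uses the absolute difference in means as the test statistic, permutes labels across two groups without stratifying by instructor, approximates the P-value by random permutations, and runs on made-up ratings. The function name and all data values are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Approximate two-sided permutation P-value for a difference in means.

    Under the null hypothesis that group labels have no effect, every
    reassignment of the observed values to the two groups is equally likely.
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the observed values
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Count the observed labeling itself so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)

# Made-up ratings on a 1-5 scale (illustrative only, not the MacNell data).
ratings_a = [4, 5, 3, 4, 5, 4, 2, 5]
ratings_b = [3, 4, 1, 4, 2, 3, 1, 4]
p_value = permutation_p_value(ratings_a, ratings_b)
```

The only source of probability in this sketch is the random relabeling, which mirrors the point made above: the inference conditions on the observed values and needs no assumption about their distribution.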

In short, the P-values from Student’s t-test are uninterpretable for this experiment: they are not the probability of observing differences as large or larger than those observed, on the assumption that the instructor’s name has no effect on ratings. Rather, they are the probability of observing differences as large or larger than those observed on the assumption that the ratings are independent random samples from Gaussian distributions with equal variances and equal means. That has nothing whatsoever to do with the experiment conducted by MacNell et al. [2].

Boring et al. [1] show that the parametric t-test used by MacNell et al. [2] and Uttl and Violo [3] finds statistical significance where there is none and misses actual statistical significance. The conclusions from the parametric and permutation tests are summarized in Table 8 of Boring et al. [1]. Table 1 gives the results in simpler, color-coded form. In every case where the permutation t-test rejects the null hypothesis at significance level 5%, the parametric t-test does not, and vice versa. This results from the failure of the assumptions of the parametric t-test.

Uttl and Violo [3] rely on the parametric t-test despite the fact that its assumptions contradict how the experiment was conducted and despite the fact that it has been shown to yield spurious results.

Table 1: Table 8 of Boring et al. [1] simplified to show which hypotheses the permutation t-test and parametric t-test reject at 5%.

3 Conclusion

Uttl and Violo [3] are correct that there is a problem with the statistical analysis in [2], but not the one they identify. Indeed, their analysis repeats the problem: the parametric t-test does not produce a meaningful P-value for the experiment, because the experiment does not satisfy the assumptions of the test. Contrary to the claims of Uttl and Violo [3], the three influential observations are not “outliers” in the sense in which the term is used in statistics: they are the actual scores given by three students in the class, and there is no statistical basis for excluding those students’ scores. The fact that all three students were in the “female persona treatment group” is itself evidence of gender bias: the chance of that occurring is roughly 7% if the instructor’s name had no effect on SET scores.

While small experimental groups can result in low power, a posteriori the MacNell et al. [2] experiment included enough subjects to detect an effect, in part because the effect was evidently so large. Small experimental groups do not ipso facto elevate the false discovery rate nor inflate effect sizes: selective inference does both, but there was no selective inference in this case.

Uttl and Violo [3] are correct that the MacNell et al. [2] study does not “generalize” statistically to other students and classes, but the study does generalize logically: it shows that bias is sometimes substantial. That is a serious problem.

References

[1] A. Boring, K. Ottoboni, and P. Stark. Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research, 2016.

[2] L. MacNell, A. Driscoll, and A. N. Hunt. What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40:291–303, 2015.

[3] B. Uttl and V. Violo. Small samples, unreasonable generalizations, and outliers: Gender bias in student evaluation of teaching or three unhappy students? ScienceOpen Research, 2020.

Comments

Bob Uttl wrote:

REPLY TO PHILIP STARK

Stark asserts that “MacNell et al. data are indeed for an entire population.” This assertion is factually incorrect. MacNell et al. analyzed only those students who actually responded to the SET (43 of them) and only provided data for those students who responded (43 of them). Boring, Ottoboni and Stark simply could not and did not analyze SET responses that the non-respondents did not provide. One cannot squeeze blood out of a stone and one cannot squeeze SET responses from the non-respondents.

Stark asserts that “the use of means of Likert scores is not meaningful...” However, he explains that Boring, Ottoboni, and Stark used means because that is what others (e.g., institutions) do. So why exactly did Stark choose to criticize us for following what others – MacNell et al. and Boring, Ottoboni, and Stark – do when we were following what they were doing while replicating their analyses?

Stark says that the MacNell et al. study “was recently replicated in an experiment involving 136 students (115 responders).” He says: “I have not seen the underlying data nor attempted to check or replicate the analysis, but here is an announcement...” and he refers to a media report. In behavioral sciences, we teach our students not to take media reports seriously and we also teach them not to rely on abstracts but to carefully examine method and results sections.

We are happy to report that Stark is correct: Khazan, Borden, Johnson, & Greenhaw (2020) recently replicated MacNell et al.’s finding showing NO gender bias in student evaluation of teaching. MacNell et al. found no gender difference in overall/average SET ratings (p = .128); Boring, Ottoboni, and Stark (2016) confirmed MacNell et al.’s finding of no gender difference in overall SET ratings (p = .12, see Table 8, using permutation test); and Khazan et al. now replicated MacNell et al.’s findings of no gender difference in a study more than twice as large (but still underpowered) (p = .73). Khazan et al. wrote:

“The mean score for TAM [male TA] (17.9, SD = 7.89) was higher than that for TAF [female TA] (17.3, SD = 11.1), but the means were not statistically different (chi2 = 0.12, p = 0.73).”

In contrast to Stark, we read Khazan et al.’s article (including the method and results), we reanalyzed the data, and we confirm that Khazan et al. found no evidence of gender bias whatsoever. Unfortunately, Stark’s source -- the insidehighered.com (and numerous other news outlets) -- got it wrong when they announced Khazan et al.’s study as a new study that found “Gender bias in TA evals.” Why? We believe they may not have read Khazan et al.’s method and results section.

Our detailed review and re-analysis of Khazan et al.'s study is available on ScienceOpen.com, titled: "Gender bias in student evaluation of teaching or a mirage" at https://www.scienceopen.com/document?id=1761de99-8a62-42ce-b6de-4a0b8514a587

2020-11-27 06:41 UTC

Philip Stark wrote:

The comment by Uttl misrepresents both the review and the papers Uttl and Violo criticize.

For instance, the MacNell et al. data are indeed for an entire population, not a sample: every student in the course was included in the analysis by Boring et al., whether the student submitted an evaluation or not. See the section entitled "The US Randomized Experiment" for how the four nonresponders were treated in the permutation tests. The data are here: https://github.com/kellieotto/SET-and-Gender-Bias/blob/master/Code/Macnell-RatingsData.csv Under the strong null hypothesis, responses--including the failure to respond at all--do not depend on the name the instructor used, only on the actual identity of the instructor.

The use of means of Likert scores is not meaningful, as Stark and Freishtat have written. We are not advocating using the mean--nor indeed any other function of the SET scores. But educational institutions nonetheless rely on such means for employment decisions. We are studying what institutions do, not what we wish they did. If gender bias reduces the mean Likert scores of female instructors, those instructors will be disadvantaged by employment policies that rely on the mean scores, whether taking the mean makes sense or not. The analysis of the MacNell et al. data by Boring et al. provides statistical evidence that mean Likert scores were affected by the name the instructor used, with the female name receiving lower mean scores by large, statistically significant amounts for promptness, fairness, and praise. As discussed elsewhere, the difference in "promptness" is especially troubling because assignments were returned at exactly the same time in all sections of the course: objectively there was no difference.

The MacNell et al. study was recently replicated in an experiment involving 136 students (115 responders). I have not seen the underlying data nor attempted to check or replicate the analysis, but here is an announcement: https://www.insidehighered.com/news/2020/11/02/study-finds-gender-bias-ta-evals-too

2020-11-09 23:54 UTC

Bob Uttl wrote:

REPLY TO PHILIP STARK, AMANDA GLAZER, AND ANNE BORING'S REVIEW

Stark et al. assert that we made numerous statistical errors, that we do not know what outliers are (they say we are “confused”), that the MacNell et al. study is not a study of a sample but of a population, that we used improper parametric statistics, etc.

We address some of the Stark et al.’s criticisms below.

RE: SAMPLE SIZE

Stark et al. assert that “The subjects in the MacNell et al. [2] study are not a ‘sample’ per se. They are a population – all students who took the course that semester – randomized into six groups...” While the entire “population” of students in that particular class was randomized to sections/groups, when these randomized students were invited to complete SETs, some of them chose to not participate. MacNell et al. stated (p. 297): “Over 90% of the class completed the evaluation.” If 100% of the class completed SETs, Stark et al. would be correct to say that the population/the entire class were the subjects but this is not the case. MacNell et al. did not say 100%; they said “over 90%” only. Thus, Stark et al.’s assertion is incorrect for at least two reasons: (1) The subjects in the MacNell et al. study were the self-selected sample from the population of all students registered in that particular class; and (2) Students who registered in the class were a self-selected sample of other students who were eligible to register in it but chose not to do so.

Stark et al. contradict themselves when they claim that “Small experimental groups are not an issue for the false discovery rate” and a few sentences later state that “The inflation of the discovery rate and of estimated effect sizes is not caused by the size of the sample; rather using fewer subjects just facilitates statistical malfeasance by making it less expensive to conduct many studies and discard those that do not produce positive results.” The discarding of the studies that did not produce positive results is precisely the problem; in behavioral sciences, this problem is discussed under various labels including “file drawer effect” and “publication bias” (no “malfeasance” is necessary). Stark et al. go on and claim that “There was no issue of cherry-picking in the MacNell et al. [2] study; it is evidently the only study of its kind.” Unfortunately, Stark et al.’s claim that the MacNell et al. study “is evidently the only study of its kind” is supported by zero evidence and Stark et al. have no way of knowing how many studies of its kind were performed around the world. MacNell et al.’s study may be the only published study of its kind. For all we know, there may be hundreds of such studies that resulted in no statistically significant findings and were not published because (a) authors did not submit them for publication and/or (b) journals rejected them because of small samples, insufficient statistical power, and no statistically significant findings.

With respect to Stark et al.’s “a single black swan” argument, we agree that the single black swan demonstrates that not all swans are in fact white. However, if one sees a black swan that does not actually prove that black swans actually exist as a distinct species. A single black swan can be a white swan with a bucket of black paint spilled over it – it merely masquerades as a black swan. Similarly, the effect in MacNell et al. study – what MacNell et al. and Stark et al. interpret as gender bias – may be nothing else but the three students’ unhappiness/anger masquerading as gender bias (three students giving their instructor low ratings simply because of some perceived slight).

RE: OUTLIERS

Stark et al. assert that we “confuse ‘influential observation’ with ‘outlier’”. We disagree. Stark et al. state that “an ‘outlier’ is an observation that is drawn from a different distribution than the bulk of the data...” It is unclear what Stark et al. mean by “drawn from a different distribution.” Regardless, a widely accepted definition of an outlier is that it is an observation far removed from the bulk of the data. Even Stark’s own webpages (https://www.stat.berkeley.edu/~stark/SticiGui/Text/gloss.htm, retrieved November 6, 2020) define an outlier as such: “An outlier is an observation that is many SD’s from the MEAN.”

Stark et al. assert that “there is no comparable notion of ‘outlier’ here [in MacNell et al. data].” We disagree. As we stated (see our Method): “We formally examined MacNell et al’s data for outliers using Tukey’s rule for identifying outliers as values more than 1.5 interquartile range from the quartiles...”

Once an observation is identified as an outlier, a question arises as to what is the cause for the outlier to be so far removed from the bulk of the data. Stark et al. state that an outlier can be caused by “a gross error resulting from a typo, from misreading an instrument, or because the measuring apparatus was hit by a cosmic ray.” However, hits by cosmic rays aside, an outlier can also be caused by a student not answering questions on SET form but going down the form and giving the instructor the lowest ratings in retaliation for some perceived slight, out of anger. Stark and Freishtat (2014) themselves argued precisely that point; they say: “For instance, anger motivates people to action more than satisfaction does.” Thus, given that the majority of the ratings given by these three outliers are extremely low (the lowest possible), the three outliers could be women haters/students who are extremely biased against women or they could simply be angry for something the instructor did or did not do. Notably, two of the three “gender biased” students/possible women haters/angry students were women and one was a man. It is impossible to tell which one of these possibilities is correct, if any. Hence, MacNell et al.’s results are not interpretable and claims of gender bias unwarranted.

Stark et al. assert that “The SET ratings of those students are not observations from some different distribution... The scores of those three students were included in the mean SET scores ‘in real life’. The data are influential but they are not outliers.” First, whether the three outliers are from “some different distribution” depends on what is meant by a different distribution. If, for example, the three outliers decided to ignore SET questions and to give the instructor the low ratings in revenge for some perceived slight, they were from a different distribution – a population of students who were exacting revenge rather than rating the instructor. It happens in real life all the time. Second, the three scores are outliers; they are far removed from the bulk of the distribution. Third, they are influential; when the three outliers were removed and the data re-analyzed, “gender bias” against women disappeared and students rated the actual female instructor higher regardless of perceived gender.

IN CONCLUSION

We agree with Stark that institutions ought to validate whatever instruments they use in high stakes personnel decisions prior to using them and to ensure that those instruments are free of various biases.

We demonstrated that MacNell et al. summaries depend critically on 3 outliers, 3 students who rated their instructors very differently than the rest of the students, who were far removed from the bulk of the distribution.

With respect to statistics, we used the exact same statistic as MacNell et al. (2015). Our goal was to see how MacNell et al.’s results change when the three outliers are removed – they changed hugely. We did not choose the way MacNell et al. chose to analyze the data; MacNell et al. did. We merely replicated it.

Moreover, if one insists on statistical purity, one ought to practice the purity oneself. Stark et al. say that “SET scores are on a Likert scale. They are ordinal categorical values, not a continuous linear scale...” Elsewhere, Stark argued that it does not make sense to average these ordinal data: “SET scores are ordinal, categorical variables... It does not make sense to average labels.” (Stark & Freishtat, 2014). Yet, both MacNell et al. and Boring, Ottoboni, and Stark (2016) used means to summarize these ordinal data, even though it “does not make sense”, and argued that those differences in means are evidence of bias. We quote from Boring et al.: “To test where there is a systematic difference in how students rate apparently male and apparently female TAs, we use the difference in pooled means...” (p. 8, Boring et al., 2016). We note, however, that ordinal rating data are summarized by means and analyzed by parametric tests all the time as numerous empirical studies demonstrated that it makes very little difference as long as one does not use tiny sized samples (MacNell et al. cite some of these studies).

Stark et al. claim that “The fact that all three students were in the ‘female persona treatment group’ is itself evidence of gender bias: the chance of that occurring is roughly 7% if the instructor’s name had no effect on SET scores.” This argument assumes that the three students/outliers (two women and one man) were gender biased against women/women haters, and when they were assigned to perceived female instructor, they let their bias/woman hating loose. However, there is no evidence that the three students/outliers were in fact gender biased/women haters. The three students may have responded to something the instructor did or did not do instead.

Finally, we note that even Boring, Ottoboni, and Stark’s permutation test (with outliers included) on the overall SET resulted in p-value of .12. Boring, Ottoboni and Stark say that this is “weak evidence that the overall SET score depends on the perceived gender (p-value 0.12).” They go on and say that “The evidence is stronger for several other items student rated...” However, if they adjusted their tests for the number of tests performed, none of them would be statistically significant at p < .05.

We see no statistical errors in our re-analyses. The outliers are outliers and they are 7% of the data precisely because the samples were so tiny and one single student comprised 2.3% of the sample.

In summary, the MacNell et al. study is inherently uninterpretable and its conclusion that the mean differences across conditions are due to gender bias against women is unwarranted. Neither of us (Uttl and Violo) would wager a single dollar that the MacNell et al. findings would replicate in the very next attempt at such replication. A priori statistical power calculation is clear: it is very unlikely to happen and we do not like to throw our money away.

---

Stark, P.B., and R. Freishtat, 2014. An Evaluation of Course Evaluations, ScienceOpen, DOI 10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1

2020-11-07 23:18 UTC

Version and Review History

Published version 1

Preprint version 1

Reviewed by Philip Stark Reviewed by Wolfgang Stroebe