Small samples, unreasonable generalizations, and outliers: Gender bias in student evaluation of teaching or three unhappy students?

In a widely cited and widely discussed study, MacNell et al. (2015) examined SET ratings of one female and one male instructor, each teaching two sections of the same online course, one section under their true gender and the other under the false/opposite gender. MacNell et al. concluded that students rated perceived female instructors more harshly than perceived male instructors, demonstrating gender bias against perceived female instructors. Boring, Ottoboni, and Stark (2016) re-analyzed MacNell et al.'s data and confirmed their conclusions. However, the design of MacNell et al.'s study is fundamentally flawed. First, MacNell et al.'s section sample sizes were extremely small, ranging from 8 to 12 students. Second, MacNell et al. included only one female and one male instructor. Third, MacNell et al.'s findings depend on three outliers: three unhappy students (all in the perceived female conditions) who gave their instructors the lowest possible ratings on all or nearly all SET items. We re-analyzed MacNell et al.'s data with and without the three outliers. With the outliers removed, the gender bias against perceived female instructors disappeared; instead, students rated the actual female instructor higher than the actual male instructor, regardless of perceived gender. MacNell et al.'s study is a real-life demonstration that conclusions based on extremely small samples are unwarranted and uninterpretable.

In an article entitled "What's in a name: Exposing gender bias in student ratings of teaching", MacNell, Driscoll, and Hunt (2015) examined whether students are biased against female faculty when completing student evaluation of teaching (SET) questionnaires. MacNell et al. examined SET ratings of one female and one male instructor teaching an online course under two conditions: students were either truthfully told the gender of each instructor (True Gender condition) or were misled and told that each instructor's gender was the opposite of what it actually was (False Gender condition). Accordingly, students evaluated a single identical female instructor under either the perceived female/actual female (pF/aF) or the perceived male/actual female (pM/aF) condition, and evaluated a single identical male instructor under either the perceived female/actual male (pF/aM) or the perceived male/actual male (pM/aM) condition. In each condition, the instructors were evaluated by only 8 to 12 students. MacNell et al. stated that both instructors interacted with their students exclusively online (allowing them to mislead students about their genders), through discussion boards and emails only; graded students' work at the same time; used the same grading rubrics; and coordinated their grading to ensure that grading was equitable in all four sections. MacNell et al. (2015) concluded that their study demonstrated gender bias in student ratings of teaching. They stated: "Our findings show that the bias we saw here is not [emphasis in original] a result of gendered behavior on the part of the instructor, but of actual bias on the part of the students. Regardless of actual gender or performance, students rated the perceived female instructor significantly more harshly than the perceived male instructor, which suggests that a female instructor would have to work harder than a male to receive comparable ratings...."
(p. 301). A year later, MacNell et al.'s (2015) data were re-analyzed by Boring, Ottoboni, and Stark (2016) using non-parametric permutation tests rather than the parametric tests used by MacNell et al. Boring et al. similarly concluded that "The results suggests that students rate instructors more on the basis of the instructor's perceived gender than on the basis of the instructor's effectiveness. Students of the TA who is actually female did substantially better in the course, but students rated apparently male TAs higher."
However, MacNell et al.'s (2015) study is fundamentally flawed. First, the number of students in each of the four conditions was extremely small, ranging from only 8 to 12. Results based on such small samples typically have low statistical power, inflated false discovery rates, inflated effect size estimates, low replicability, low generalizability, and high sensitivity to outliers (Ioannidis, 2005).
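The non-parametric approach Boring et al. took, a two-sample permutation test on the difference in group means, can be sketched in Python. This is a minimal illustration under our own assumptions, not their actual code; the ratings below are made up for demonstration and are not MacNell et al.'s data:

```python
import random
from statistics import mean

def perm_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the fraction of label shuffles whose absolute mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign condition labels
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Illustrative (hypothetical) ratings for two small sections:
perceived_male = [4.5, 4.0, 4.8, 4.2, 4.6, 4.4, 4.1, 4.7, 4.3, 4.5]
perceived_female = [4.4, 4.1, 4.6, 1.0, 4.5, 4.2, 4.0, 1.2, 4.3, 4.4]
print(perm_test(perceived_male, perceived_female))
```

With groups this small, the permutation distribution is coarse, which is one reason permutation methods are often preferred over parametric t-tests at these sample sizes.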
Second, MacNell et al.'s (2015) study included only one female and one male instructor. It is difficult to see how one could make valid generalizations about how students rate female vs. male instructors based on how students rate one particular male and one particular female instructor. Third, MacNell et al.'s (2015) Table 2 as well as their Figure 2 suggest that the variability of SET ratings is much larger in some conditions than in others, indicating the likely presence of outliers inflating variability in some but not other conditions. In fact, MacNell et al.'s data, shown in Table 1, include three obvious outliers: three unhappy students who gave their instructors the lowest possible ratings on all or nearly all SET items (a familiar scenario to anyone who has ever taught such small courses). The three outliers are printed in bold in Table 1; all three occurred in the perceived female conditions. Accordingly, we examine the effect of these three outliers on MacNell et al.'s (2015) results by re-analyzing their data (1) with the three outliers kept in the analyses and (2) with the three outliers removed from the data set.
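Before turning to the re-analysis, it is worth seeing how fragile a section mean is at these sample sizes. The sketch below uses made-up ratings (not MacNell et al.'s data) to show how a single lowest-possible rating moves the mean of a ten-student section:

```python
from statistics import mean

# Hypothetical section of ten students: nine satisfied students and
# one "unhappy student" who gives the lowest possible rating.
section = [4.5] * 9 + [1.0]

with_outlier = mean(section)          # 4.15
without_outlier = mean(section[:-1])  # 4.5

# A single student shifts this hypothetical section's mean by 0.35
# points, a swing on the order of the between-condition differences
# at issue in studies of this size.
print(with_outlier, without_outlier)
```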

Method
We downloaded MacNell et al.'s (2015) data from http://n2t.net/ark:/b6078/d1mw2k, via the link provided in Boring, Ottoboni, and Stark (2016). We formally examined MacNell et al.'s data for outliers using Tukey's rule, which identifies outliers as values more than 1.5 times the interquartile range below the first quartile or above the third quartile, and then re-analyzed MacNell et al.'s data with and without the three outliers plainly visible in Table 1.
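Tukey's rule as described above can be sketched in Python; the ratings here are illustrative, not MacNell et al.'s data:

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3 (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Illustrative 12-item SET means for one small section (n = 10):
ratings = [4.2, 4.5, 4.0, 4.8, 4.3, 4.6, 4.4, 4.1, 4.7, 1.0]
print(tukey_outliers(ratings))  # [1.0]: the single extreme rating is flagged
```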
Based on a preliminary principal component factor analysis of their data, MacNell et al. (2015) used 12 of the 15 SET items in their analyses. We set out to replicate MacNell et al.'s Table 2 and Figure 1 and to see how these summaries would change when the three outliers were removed. Notably, neither MacNell et al. nor Boring et al. (2016) mentioned outliers in their analyses of MacNell et al.'s data.

Figure 1, Panel A, shows the boxplot of SET ratings, that is, of the mean of the 12 items used by MacNell et al. (2015). The boxplot shows the three outliers: three students giving their instructors the lowest possible ratings on all or nearly all items. Panel B shows the same data but for the mean of all 15 items; the same three outliers are identified in this boxplot. Panel C shows the near-identity relationship between the average of the 12 items and the average of all 15 items (r = .998). This suggests that MacNell et al. would have obtained nearly identical results had they used all 15 SET items rather than only the selected 12. Panel D shows the stripchart of the 12-item means for each of the four experimental conditions: pF/aF, pM/aM, pM/aF, and pF/aM. The stripchart shows that the three outliers occurred in the two perceived female conditions (i.e., pF/aF and pF/aM) and highlights the extremely small number of students in each of the four conditions, with ns ranging from 8 to 12.

Table 2 shows the mean student ratings for each of the 12 SET items used by MacNell et al. (2015). The top third shows the means, standard deviations, and other statistics for the 12 SET items, comparing the actual male with the actual female instructor and the perceived male with the perceived female instructor, as reported by MacNell et al. in their Table 2. MacNell et al. did not report actual p-values but only whether a given p-value was < .10 or < .05. The middle third of Table 2 shows our re-analysis of MacNell et al.'s (2015) data with the outliers retained.
Accordingly, the values in the middle third ought to be identical to those reported by MacNell et al. and shown in the top third of the table. The values are indeed identical (we count differences in the last significant digit as identical), with two notable exceptions: in the comparison of the actual male with the actual female instructor, the r² values match except for the last value in the column, whereas in the comparison of the perceived male with the perceived female instructor, the r² values do not match except for the last value in the column. Nevertheless, the statistically significant differences between male and female, using the p < .05 standard, occurred only in the perceived gender conditions and only for the fair, praise, and prompt SET items, replicating MacNell et al.'s inferential conclusions.

The bottom third of Table 2 shows the identical analyses with the three outliers removed. As expected, the values change considerably, except in the perceived male conditions, which did not include any outliers. First, in the actual gender conditions, the female instructor was rated higher than the male instructor on all 12 items, by 0.08 to 0.54 points; for two items, these differences were statistically significant at p < .05. Second, in the perceived gender conditions, the female and male instructors were rated comparably, with no difference statistically significant at the p < .05 level. Accordingly, these item-level analyses showed that when the three outliers were removed, the SET effects favouring males over females reported by MacNell et al. (2015) were wiped out, and some SET effects favouring females over males emerged instead.

Figure 2 shows the mean SET ratings for the 12 items. Panel A shows the SET ratings for the actual male vs. female instructor and for the perceived male vs. female instructor for all data.
The Actual Gender bars show the data for the actual male and actual female instructor, collapsed across the True and False Gender conditions. The Perceived Gender bars show the data for the perceived male and perceived female instructor, collapsed across actual gender. This figure highlights that students rated the actual female instructor numerically higher than the actual male instructor. In contrast, when the data were collapsed across actual gender, students rated the perceived male instructor higher than the perceived female instructor. Panel A directly replicates MacNell et al.'s (2015) analyses reported in their Figure 2. Figure 2, Panel B, shows SET ratings by the four experimental conditions (i.e., with no collapsing across conditions). This figure highlights that in the True Gender conditions, the male instructor was rated higher than the female instructor, whereas in the False Gender conditions, students rated the same female instructor presented as male higher than the same male instructor presented as female. Taken at face value, this data pattern supports MacNell et al.'s (2015) claim that it is the perception of the instructor as male vs. female that matters, rather than anything the male vs. female instructors actually did.

Results
However, when the three outliers are removed, the findings change. Panel C shows the identical analyses to those in Panel A but with the three outliers removed. In the Actual Gender conditions, the female instructor is rated higher than the male instructor, whereas in the Perceived Gender conditions the differences between the perceived female and perceived male instructors all but disappear. Panel D shows the identical analyses to those in Panel B but with the three outliers removed. The data show that the female instructor was rated higher than the male instructor in both the True Gender and False Gender conditions.

Conclusions
MacNell et al. (2015) claimed that their findings demonstrated actual bias on the part of students against perceived female instructors, rather than a reaction to gendered behavior on the part of the instructors. Boring, Ottoboni, and Stark (2016) re-analyzed MacNell et al.'s data, confirmed MacNell et al.'s findings, and concluded that students (1) rated instructors on the basis of perceived gender rather than teaching effectiveness, and (2) rated apparently male teachers higher than female teachers even though students learned more from the actually female teacher.
Importantly, MacNell et al.'s (2015) published data highlight nothing short of the absurd practice of interpreting the mean SET ratings from a small number of students as saying anything about the instructor. The same identical instructor who received a SET rating of ?? in one section received widely discrepant ratings of ?? or ?? in the other section, depending on whether the two outliers, two unhappy students, were retained in or excluded from the means, respectively. Such data suggest that professors ought to focus principally on students' satisfaction and ought not to do anything to lower it: for example, they ought not to call out students for academic dishonesty or adhere to academic standards. Moreover, given the Kruger-Dunning effect (Kruger & Dunning, 1999) and the SET-destroying effect of one or two outliers in small classes, professors must focus principally on satisfying the least able students, who would perceive the greatest discrepancy between the grades reflecting their achievement and the grades they believe their work deserves if their grades were not inflated (Uttl et al., 2017).

MacNell et al.'s (2015) study is a real-life demonstration that conclusions based on small sample-size studies are unwarranted and uninterpretable. MacNell et al.'s study design, with its extremely small samples and its use of a single woman and a single man to represent female and male professors, is simply insufficient to answer their research question. Combined with the small samples, the failure to examine the data and to recognize that the summaries depended critically on three outliers, three unhappy students, was the final fatal flaw rendering the study uninterpretable and its conclusions unwarranted. In the meantime, however, the world, or at least the hundreds of researchers citing MacNell et al. and Boring, Ottoboni, and Stark (2016), falsely believes that MacNell et al.'s study demonstrated that students are biased against female professors.
It is not true; MacNell et al. did not demonstrate students' bias against female professors. If anything, their results suggest that students rate female professors higher than male professors, but it would be foolish to make that claim based on the same fundamentally flawed small-sample design.

Table 1. Note. Group: pM/aF = perceived male/actual female, pM/aM = perceived male/actual male, pF/aF = perceived female/actual female, pF/aM = perceived female/actual male; sex: 1 = male student, 2 = female student; ag = actual gender: 0 = female, 1 = male; pg = perceived gender: 0 = female, 1 = male; SET Item: 1 = professional, 2 = respect, 3 = caring, 4 = ...

Table 2. Note. † p < .10; * p < .05; pM/aF = perceived male/actual female, pM/aM = perceived male/actual male, pF/aF = perceived female/actual female, pF/aM = perceived female/actual male; SET Item: 1 = professional, 2 = respect, 3 = caring, 4 = enthusiastic, 6 = helpful, 7 = feedback, 8 = prompt, 9 = consistent, 10 = fair, 11 = responsive, 12 = praised, 13 = knowledgeable

Figure 1
MacNell et al.'s (2015) data. Panel A shows the boxplot of SET ratings (the mean of the 12 items used by MacNell et al.). The boxplot highlights the presence of three outliers: three students giving their instructors the lowest possible rating on all or nearly all SET items. Panel B shows the same data but for the mean of all 15 items. The same three outliers are visible. Panel C shows the near-identity relationship between the average of the 12 items and the average of all 15 items (r = .998). Panel D shows the strip chart of the 12-item means for each of the four experimental conditions and highlights the extremely small number of students in each condition. It also shows that the three outliers occurred in the two perceived female conditions.

Figure 2
SET ratings for the 12-item averages. Panel A shows the SET ratings for the actual male vs. female instructor and for the perceived male vs. female instructor for all data. Panel B shows the SET ratings by the four experimental conditions for all data. The instructor perceived as male received higher ratings than the instructor perceived as female. Panel C shows the SET ratings for the actual male vs. female instructor and for the perceived male vs. female instructor when the three outliers are removed. Panel D shows the SET ratings by the four experimental conditions when the three outliers are removed. The actual female instructor received higher ratings than the actual male instructor, regardless of perceived gender.