Review of 'Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness'

Average rating:
    Rated 4.5 of 5.
Level of importance:
    Rated 4 of 5.
Level of validity:
    Rated 4 of 5.
Level of completeness:
    Rated 5 of 5.
Level of comprehensibility:
    Rated 5 of 5.
Reviewed article

Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness

Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show:

  • SET are biased against female instructors by an amount that is large and statistically significant
  • the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded
  • the bias varies by discipline and by student gender, among other things
  • it is not possible to adjust for the bias, because it depends on so many factors
  • SET are more sensitive to students' gender bias and grade expectations than they are to teaching effectiveness
  • gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors.

These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university.

    Review information


    This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at

    Review text

    This article presents findings from two studies on the validity of student evaluations of teaching as a measure of teaching effectiveness. The first study is a five-year natural experiment at a French university (23,001 evaluations of 379 instructors by 4,423 students in six mandatory first-year courses), and the second is a randomized, controlled, blind experiment at a US university (43 evaluations for four sections of an online course). The methodology uses nonparametric statistical tests, which are presented as an improvement over previous analyses of the French dataset.

    Overall, the designs of the two studies are interesting and the findings are eye-opening. Below are my comments, mostly on the methodology used in the analysis and on some of the assumptions made.


    • Comparison to previous work

    The methods section of the article references a previous analysis of the data where “the tests assumed that SET of male and female instructors are independent random samples from normally distributed populations with equal variances and possibly different means”. I assume the authors are referring to MacNell et al. [2014]. This article states the following:

    We used Welch’s t-tests (an adaptation of the Student’s t-test that does not assume equal variance) to establish the statistical significance of each difference. We also ran two general linear multivariate analyses of variance (MANOVAs) on the set of 12 variables to test the effects of instructor gender (perceived and actual) on all of the questions considered as a group.

    While some of the analyses used in the previous study required normally distribution of data and equal variances, others didn't. The text in this article makes it seem like all previous analyses used methods requiring these conditions. It would be best to clarify exactly where improvements to methods were introduced and present evidence that the data did not meet these conditions.

    Another issue is that the authors do not compare their findings to those from MacNell et al. [2014]. It should be explicitly stated in the paper whether the findings agreed or disagreed.


    • Assumptions on relationship between teaching effectiveness and student performance

    Section 3.1 on page 9 states the following:

    To the extent that final exams are designed well, scores on these exams reflect relevant learning outcomes for the course. Hence, in each course each semester, students of more effective instructors should do better on the final, on average, than students of less effective instructors.

    This would definitely be true if the material covered in each section were exactly the same. It is not clear from the article whether this was the case or how this was evaluated. It is possible for students of a very effective teacher to score poorly on an exam if the exam assesses material that was not emphasized in their section of the course.


    • Controls for non-independence

    In the French dataset there are 23,001 student evaluations of 379 instructors. We are told that instructors see the students in groups of 10-24 students. It seems like the same instructor must appear in the dataset multiple times. The analyses do not appear to control for this non-independence structure in the data. 

    It also seems like students might be repeated in the dataset as well since each student has to take these required courses. There does not appear to be any control for this either in the analysis.


    • Multivariate analysis

    All analyses presented in the article are bivariate, exploring the relationship between a given factor and the response variable of SET scores. 

    First, this can potentially result in problems associated with multiple testing. The authors address this issue in Section 6; however, I would expect this discussion to appear earlier in the paper.

    Second, it would potentially be more informative to analyze multivariate relationships instead of exclusively bivariate relationships.  A regression based approach, instead of a series of tests, would allow for this exploration, as well as conclusions that evaluate the relationship between SET and certain factors while controlling for others.
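To make the multiple-testing concern raised in this section concrete, here is a minimal sketch of one standard correction, the Holm step-down adjustment. It is offered only as an illustration: the function name `holm_adjust` and the p-values are hypothetical, and nothing here is meant to describe how the article's Section 6 actually handles the issue.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order.

    Illustrative helper (hypothetical name); not taken from the article.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values from three separate bivariate tests.
    print(holm_adjust([0.01, 0.04, 0.03]))
```

A hypothesis is then rejected at level alpha when its adjusted p-value is at most alpha, which controls the family-wise error rate across the whole series of tests.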


    • Discussion of effect size and practical significance

    The article does not discuss effect sizes or considerations of practical significance. This is especially important for the French dataset, which has a large number of observations. Indeed, in Section 6 the authors discuss the importance of finding small p-values with the small US dataset, but a similar discussion is not included for the French dataset.


    We thank Professor Cetinkaya-Rundel for her thoughtful review.

    Most of her concerns are already addressed in the paper; others we agree with in part but not entirely. For instance, Prof. Cetinkaya-Rundel asserts that

    “while some of the analyses in the previous study required normally distribution (sic) of data and equal variances, others didn’t.”

    She points to Welch’s two-sample t-test and MANOVA as examples. Both of these tests rely on the assumption that data are sampled independently at random from populations with normal distributions. That stochastic assumption contradicts the actual randomization in the MacNell et al. experiment and the as-if randomization in the SciencesPo natural experiment: the data arise from randomizing a fixed group of individuals into different treatments, not from sampling from larger populations at random. And the variables involved do not have normal distributions; indeed, they are dichotomous (e.g., gender), ordinal categorical (e.g., SET), or have compact support (e.g., grades). The null hypotheses for Welch’s t-test and MANOVA are simply not relevant to the problem. Asymptotically, those tests might have the correct level under the randomization actually performed, but for finite populations, the actual levels may differ considerably from the nominal levels, as Table 8 of the paper shows.
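The contrast drawn here, between sampling from normal populations and randomizing a fixed group of individuals into treatments, can be illustrated with a minimal sketch of a two-sample permutation test. The ratings below are made-up, the function name `permutation_test` is hypothetical, and the difference in means is an assumed test statistic chosen for simplicity; this is not the code or the exact statistic used in the paper.

```python
import random


def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means.

    The responses are treated as fixed; only the assignment of responses
    to the two groups is re-randomized.
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # one random re-assignment of the fixed responses
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # Count the observed assignment itself so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical ordinal ratings on a 1-5 scale.
    perceived_male = [5, 5, 4, 5, 5, 4, 5, 5]
    perceived_female = [1, 2, 1, 2, 1, 1, 2, 1]
    print(permutation_test(perceived_male, perceived_female))
```

No normality or equal-variance assumption enters: the reference distribution comes entirely from re-randomizing the observed values, so ordinal ratings with many ties pose no special difficulty.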

    Prof. Cetinkaya-Rundel writes:

    “Another issue is that the authors do not compare their findings to those from MacNell et al. [2014]. It should be explicitly stated in the paper whether the findings agreed or disagreed.”

    Table 8 compares our results to those from MacNell et al. [2014] (the rightmost columns show the p-values from our permutation test side-by-side with the p-values from t-tests in MacNell et al.) and we discuss the differences near the end of page 8, in the section entitled “SET and perceived instructor gender.” Our omnibus test of the hypothesis that perceived instructor gender plays no role in SET matches the spirit of the MANOVA test (but for a more relevant null hypothesis); they do not report details of those tests, only that the p-values were less than 0.05. The permutation omnibus test we performed gives a p-value of essentially zero for the hypothesis that students rate instructors the same, regardless of the gender of the instructor’s name.

    Prof. Cetinkaya-Rundel raises the concern that instructors and students appear in the data more than once, across years and across courses, producing correlation. In the Neyman model, the responses are fixed, not random: only the assignment of instructors to class sections is random. There is no assumption that responses are random at all, much less independent. Moreover, instructors appear at most once per course per year (i.e., once per stratum). Because the natural experiment amounts to randomizing within each stratum, independently across strata, there is no dependence across strata, regardless of the identity of the instructors. Prof. Cetinkaya-Rundel continues:

    “A regression based approach, instead of a series of tests, would allow for this exploration, as well as conclusions that evaluate the relationship between SET and certain factors while controlling for others.”

    A regression-based approach requires far stronger modeling assumptions (e.g., linearity of effects and IID additive errors), assumptions for which there is no basis here. Indeed, those assumptions seem rather far-fetched, for instance, because SET is an ordered categorical scale, not a linear scale. A model that posits that SET (or other variables) are observed with independent, identically distributed additive errors is also unjustified for these data: the randomness is in the assignment of students to instructors, not measurement noise. The Neyman model allows each individual (student or class section) to have its own response to each treatment condition, with no assumption about the relationship among responses of different individuals. Randomization in the experimental design mitigates confounding with no need to model the effect of covariates: that is the power of a randomized experiment. Moreover, the variables one would presumably want to control for are pretreatment variables, such as prior knowledge of the subject matter, while the covariates available (e.g., final exam scores) are measured post-treatment.
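The within-stratum randomization described in this response (random assignment within each course-year stratum, independently across strata) can be sketched as a stratified permutation test. Everything in the sketch is an assumption made for illustration: the function name `stratified_permutation_test`, the toy data, the 0/1 labels, and the choice of statistic (the sum over strata of within-stratum mean differences). It is not the authors' implementation.

```python
import random


def stratified_permutation_test(strata, n_perm=5000, seed=0):
    """Permutation test that re-randomizes labels only within each stratum.

    strata: list of (scores, labels) pairs, one per stratum. Labels are 0/1
    and each stratum must contain at least one unit with each label.
    """
    rng = random.Random(seed)

    def statistic(assignments):
        # Sum over strata of (mean score, label 1) - (mean score, label 0).
        total = 0.0
        for (scores, _), labels in zip(strata, assignments):
            ones = [s for s, lab in zip(scores, labels) if lab == 1]
            zeros = [s for s, lab in zip(scores, labels) if lab == 0]
            total += sum(ones) / len(ones) - sum(zeros) / len(zeros)
        return total

    observed = statistic([labels for _, labels in strata])
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for _, labels in strata:
            perm = list(labels)
            rng.shuffle(perm)  # labels never move between strata
            shuffled.append(perm)
        if abs(statistic(shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Three hypothetical strata (e.g., course-year cells) with toy scores.
    toy = [([5, 5, 5, 5, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0])] * 3
    print(stratified_permutation_test(toy))
```

Because the shuffles are drawn separately for each stratum, the reference distribution mirrors a design that randomizes within strata, and the scores themselves are never modeled as random.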

    Prof. Cetinkaya-Rundel also advocates reporting effect sizes. We reported empirical mean effect sizes in the tables (e.g., column 2 of Tables 3, 4, 5, 7, and 8). We intentionally did not report confidence intervals for mean effect sizes. Confidence intervals could be constructed by inverting the permutation tests, but that would require strong assumptions about the structure of the effect that have no basis here. For instance, it would suffice to assume that the effect of instructor gender on SET is the same for every student—but the data give evidence that the effect varies by student gender, at least. Indeed, one of our main conclusions is that there are such differences! Finally, from a policy perspective, effect size per se isn’t particularly interesting: the issue is disparate impact. The data show that the effect of gender on SET can overwhelm differences in actual instructor effectiveness, as measured by exam performance.
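For readers unfamiliar with what "inverting the permutation tests" would involve, the following sketch shows the idea under exactly the constant-effect assumption that this response argues has no basis for these data. It is illustrative only: the function names, the toy numbers, and the grid of candidate shifts are all hypothetical, and the authors deliberately did not carry out this computation.

```python
from itertools import combinations


def exact_p_value(treated, control):
    """Exact two-sided permutation p-value for a difference in means,
    enumerating every way to choose which units are 'treated'."""
    pooled = list(treated) + list(control)
    n_t, n_c = len(treated), len(control)
    total = sum(pooled)
    observed = abs(sum(treated) / n_t - sum(control) / n_c)
    count = 0
    extreme = 0
    for idx in combinations(range(len(pooled)), n_t):
        s_t = sum(pooled[i] for i in idx)
        diff = abs(s_t / n_t - (total - s_t) / n_c)
        count += 1
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / count


def shift_confidence_set(treated, control, candidate_shifts, alpha=0.05):
    """Candidate constant effects that the permutation test does not reject.

    ASSUMPTION: treatment adds the same constant to every unit's response.
    """
    kept = []
    for delta in candidate_shifts:
        # Remove the hypothesized effect, then test for 'no difference'.
        adjusted = [y - delta for y in treated]
        if exact_p_value(adjusted, control) > alpha:
            kept.append(delta)
    return kept


if __name__ == "__main__":
    # Toy data, small enough for exact enumeration (252 assignments).
    print(shift_confidence_set([6, 7, 8, 9, 10], [1, 2, 3, 4, 5], range(11)))
```

The retained shifts form a confidence set only if the constant-effect model holds; when the effect differs across students, as the data suggest it does by student gender, the set has no clear interpretation.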

    2016-03-17 11:27 UTC