
    Review of 'Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness'

    Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness
    The Boring et al. study falls short of other studies investigating gender and student ratings.
    Average rating:
        Rated 2.5 of 5.
    Level of importance:
        Rated 2 of 5.
    Level of validity:
        Rated 2 of 5.
    Level of completeness:
        Rated 3 of 5.
    Level of comprehensibility:
        Rated 3 of 5.
    Competing interests:
    Jason Barr, Ph.D. is employed as a researcher for The IDEA Center, a nonprofit whose mission is to improve learning in higher education through research, assessment and professional development. The IDEA Center provides Student Ratings of Instruction (SRI) instruments to colleges and universities.

    Reviewed article


    Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness

    Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show: SET are biased against female instructors by an amount that is large and statistically significant; the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded; the bias varies by discipline and by student gender, among other things; it is not possible to adjust for the bias, because it depends on so many factors; SET are more sensitive to students' gender bias and grade expectations than they are to teaching effectiveness; gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors. These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university.

      Review information

      This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.

      ScienceOpen disciplines: Clinical Psychology & Psychiatry
      Keywords: nonparametric statistics, disparate impact, gender bias, permutation tests

      Review text

      Boring et al. report the results of two studies conducted on separate samples, one from six courses offered in France, the other from one course in the U.S. The authors claim to have found gender bias in both SET instruments. So, it is logical to ask, “What exactly did those SET measure?” Regarding the French sample, the only possible answer is that we don’t know what the SET measured. Readers are simply told it included closed-ended and open-ended questions. No information is provided about any of the items on the SET or whether they correlate with any relevant measure of teaching effectiveness. So, we really do not know what construct is being correlated with instructor gender.

      The SET used in the U.S. sample was described previously in MacNell, Driscoll, and Hunt (2014). The 15-item instrument consisted of Likert-type items inviting students to respond from 1 = Strongly disagree to 5 = Strongly agree. Six items were intended to measure effectiveness (e.g., professionalism, knowledge, objectivity); six were for interpersonal traits (e.g., respect, enthusiasm, warmth); two were included for communication skills; and one was “to evaluate the instructor’s overall quality as a teacher.” No information about the exact wording of the items was provided. Moreover, the authors provided no theoretical explanation for item development and did not say whether the “student ratings index” correlates with any other relevant measures.

      So, in the French study we do not know exactly what aspect of teaching effectiveness is being correlated with instructor gender. In the U.S. study, we know that overall teaching quality is NOT associated with instructor gender.

      A review of the study raises other concerns:

      1. What validity and reliability evidence is there for the learning measure?
      2. What effect did researcher expectancy effects have in the U.S. study?
      3. What effect did having only male lecturers have on French students?
      4. Many of the correlations reported are very weak and non-significant.
      5. Why should we assume assignment of instructors to sections in the French sample was “as if at random”?
      6. Correlation is not causation.
      7. How generalizable are these findings?

      My colleagues and I examined each concern in turn and looked closely at the shortcomings behind it. Our editorial note responds to “Bias Against Female Instructors,” a column based on the study that was posted January 8, 2016 in Inside Higher Education. The note can be found in full at http://ideaedu.org/research-and-papers/editorial-notes/response-to-bias-against-female-instructors/.

      Our conclusion was that the Boring et al. study falls short of other studies investigating gender and student ratings. In studies of ratings of actual teachers, there is only a very weak relationship that favors female instructors (Centra, 2009; Feldman, 1993). This is not to say that gender bias does not exist. We grant that it can be found in all walks of life and professions. But a single study fraught with confounding variables and weak correlations should not be cause for alarm. The gender differences in student ratings reported previously (e.g., Centra & Gaubatz, 2000; Feldman, 1992, 1993) and in Boring et al. (2016) are not large and should not greatly affect teaching evaluations, especially if SET are not the only measure of teaching effectiveness. But even if they are the only measure, this study shows that gender accounts for only about 1% of the variance in student ratings, hardly a “large and statistically significant” amount as stated by the authors.
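
      To put the 1% figure in perspective, here is a minimal sketch of the arithmetic, assuming the share of variance is read as a squared correlation coefficient (r squared). That reading is an assumption made for illustration; the review does not state how the 1% was derived, and the function names below are hypothetical.

```python
import math


def variance_share(r: float) -> float:
    """Share of variance accounted for by a correlation r (assumed to be r**2)."""
    return r ** 2


def correlation_magnitude(share: float) -> float:
    """Absolute correlation implied by a given share of variance (0 to 1)."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return math.sqrt(share)


# Illustrative input only: the "about 1%" figure quoted in the review.
implied_r = correlation_magnitude(0.01)
print(f"|r| implied by a 1% variance share: {implied_r:.2f}")
```

      Under that assumption, a 1% share of variance corresponds to a correlation of roughly 0.1 in absolute value. Whether a correlation of that size counts as large is a separate question from whether it is statistically significant.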



      Thank you for being the first to review our paper. Your concerns were already addressed: we provided full information regarding the French data, including the full survey that students completed, here. Also, as mentioned in the paper, the US data, including the survey items, are here. The statistical method is explained in detail in the paper, and code implementing all the tests is here, if you would like to replicate our results.


      We look forward to additional reviews by people who do not have a financial interest in SET.

      2016-02-17 19:27 UTC