Review of 'Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness'

Reviewer: Jason Barr

Publication date of review: 2016-02-14

The Boring et al. study falls short of other studies investigating gender and student ratings.

Average rating:	    Rated 2.5 of 5.
Level of importance:	    Rated 2 of 5.
Level of validity:	    Rated 2 of 5.
Level of completeness:	    Rated 3 of 5.
Level of comprehensibility:	    Rated 3 of 5.
Competing interests:	Jason Barr, Ph.D. is employed as a researcher for The IDEA Center, a nonprofit whose mission is to improve learning in higher education through research, assessment and professional development. The IDEA Center provides Student Ratings of Instruction (SRI) instruments to colleges and universities.

Reviewed article

Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness

Anne Boring, Kellie Ottoboni, Philip Stark (2016)

Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show:

SET are biased against female instructors by an amount that is large and statistically significant;
the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded;
the bias varies by discipline and by student gender, among other things;
it is not possible to adjust for the bias, because it depends on so many factors;
SET are more sensitive to students' gender bias and grade expectations than they are to teaching effectiveness;
gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors.

These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university.

Review information

ScienceOpen disciplines: Clinical Psychology & Psychiatry

Review text

Boring et al. report the results of two studies conducted on separate samples, one from six courses offered in France, the other from one course in the U.S. The authors claim to have found gender bias in both SET instruments. So it is logical to ask, “What exactly did those SET measure?” Regarding the French sample, the only possible answer is that we do not know what the SET measured. Readers are simply told that it included closed-ended and open-ended questions. No information is provided about any of the items on the SET or about whether they correlate with any relevant measure of teaching effectiveness. So we really do not know what construct is being correlated with instructor gender.

The SET used in the U.S. sample was described previously in MacNell, Driscoll, and Hunt (2014). The 15-item instrument consisted of Likert-type items inviting students to respond from 1 = Strongly disagree to 5 = Strongly agree. Six items were intended to measure effectiveness (e.g., professionalism, knowledge, objectivity); six were for interpersonal traits (e.g., respect, enthusiasm, warmth); two were included for communication skills; and one was “to evaluate the instructor’s overall quality as a teacher.” No information about the exact wording of the items was provided. Moreover, the authors provided no theoretical rationale for item development and no indication of whether the “student ratings index” correlates with any other relevant measures.

So, in the French study we do not know exactly what aspect of teaching effectiveness is being correlated with instructor gender. In the U.S. study, we know that overall teaching quality is NOT associated with instructor gender.

Other concerns become apparent on review of the study:

What validity and reliability evidence is there for the learning measure?
What effect did researcher expectancy effects have in the U.S. study?
What effect did having only male lecturers have on French students?
Many of the correlations reported are very weak and non-significant.
Why should we assume assignment of instructors to sections in the French sample was “as if at random”?
Correlation is not causation.
How generalizable are these findings?

My colleagues and I addressed each of these concerns in turn, examining the shortcomings in detail. The editorial note, which references the column “Bias Against Female Instructors” (based on the study and posted January 8, 2016 in Inside Higher Education), can be found in full at http://ideaedu.org/research-and-papers/editorial-notes/response-to-bias-against-female-instructors/.

Our conclusion was that the Boring et al. study falls short of other studies investigating gender and student ratings. In studies of ratings of actual teachers, there is only a very weak relationship that favors female instructors (Centra, 2009; Feldman, 1993). This is not to say that gender bias does not exist. We grant that it can be found in all walks of life and professions. But a single study fraught with confounding variables and weak correlations should not be cause for alarm. The gender differences in student ratings reported previously (e.g., Centra & Gaubatz, 2000; Feldman, 1992, 1993) and in Boring et al. (2016) are not large and should not greatly affect teaching evaluations, especially if SET are not the only measure of teaching effectiveness. But even if they are the only measure, this study shows that gender accounts for only about 1% of the variance in student ratings. That is hardly a “large and statistically significant” amount, as stated by the authors.
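
To put the 1% figure in context, the short sketch below converts a share of variance explained into the size of the correlation it implies. It assumes the figure refers to a squared correlation (r squared) between instructor gender and student ratings; the review does not state how the figure was computed, and the function name is hypothetical, chosen only for illustration.

```python
import math


def correlation_from_variance_share(r_squared: float) -> float:
    """Return the absolute correlation |r| implied by a share of variance explained.

    Assumption: the share of variance is a squared correlation (r squared).
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    return math.sqrt(r_squared)


# About 1% of variance, the figure quoted in the review text.
implied_r = correlation_from_variance_share(0.01)
print(f"|r| implied by 1% of variance explained: {implied_r:.2f}")
```

Under that assumption, 1% of the variance corresponds to a correlation of roughly 0.1 in absolute value, which is consistent with the description of the reported correlations as weak.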

Comments

Philip Stark wrote:

Thank you for being the first to review our paper. Your concerns were already addressed: we provided full information regarding the French data, including the full survey that students completed, here. Also, as mentioned in the paper, the US data, including the survey items, are here. The statistical method is explained in detail in the paper, and code implementing all the tests is here, if you would like to replicate our results.

We look forward to additional reviews by people who do not have a financial interest in SET.

2016-02-17 19:27 UTC

Version and Review History

Preprint

Reviewed by Mine Cetinkaya-Rundel
Reviewed by Jason Barr