Anne Boring^{1,2},
Kellie Ottoboni^{3},
Philip B. Stark^{*,3}

07 January 2016

ScienceOpen Research – Section: SOR-EDU

10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1

Assessment, Evaluation & Research methods, Labor law, Nonparametric Statistics, Disparate Impact, Gender Bias, Permutation Tests

Review statistics

Level of importance: | Rated 3 of 5. |

Level of validity: | Rated 3 of 5. |

Level of completeness: | Rated 4 of 5. |

Level of comprehensibility: | Rated 4 of 5. |


Mine Cetinkaya-Rundel evaluated the article as: Rated 4.5 of 5.

#### Comments

Publication date: | 06 March 2016 |

DOI: | 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1.RUIBVH |

Level of importance: | Rated 4 of 5. |

Level of validity: | Rated 4 of 5. |

Level of completeness: | Rated 5 of 5. |

Level of comprehensibility: | Rated 5 of 5. |

Competing interests: | None |

Recommend this review: | 2 people recommend this |

This article presents findings from two studies on the effectiveness of student evaluations of teaching (SET) as a measure of teaching effectiveness. The first study is from a five-year natural experiment at a French university (23,001 evaluations of 379 instructors by 4,423 students in six mandatory first-year courses), and the second is from a randomized, controlled, blind experiment at a US university (43 evaluations for four sections of an online course). The methodology uses nonparametric statistical tests, which are presented as an improvement over previous analyses of the French dataset.

Overall, the designs of the two studies are interesting and the findings are eye-opening. Below are my comments, mostly on the methodology used in the analysis and on some of the assumptions made.

**Comparison to previous work**

The methods section of the article references a previous analysis of the data where "the tests assumed that SET of male and female instructors are independent random samples from normally distributed populations with equal variances and possibly different means". I assume the authors are referring to MacNell et al. [2014]. That article states the following:

We used Welch’s t-tests (an adaptation of the Student’s t-test that does not assume equal variance) to establish the statistical significance of each difference. We also ran two general linear multivariate analyses of variance (MANOVAs) on the set of 12 variables to test the effects of instructor gender (perceived and actual) on all of the questions considered as a group.

While some of the analyses used in the previous study required normally distributed data and equal variances, others did not. The text in this article makes it seem as though all previous analyses used methods requiring these conditions. It would be best to clarify exactly where improvements to the methods were introduced and to present evidence that the data did not meet these conditions.

Another issue is that the authors do not compare their findings to those from MacNell et al. [2014]. It should be explicitly stated in the paper whether the findings agreed or disagreed.
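For readers unfamiliar with the kind of nonparametric test at issue here, the following is a minimal sketch of a two-sample permutation test for a difference in mean ratings. It is not the authors' code: the function name, the number of permutations, and the ratings are all made up for illustration. Unlike a t-test, it relies on no normality or equal-variance assumption, only on the labels being exchangeable under the null hypothesis.

```python
import random


def permutation_test_mean_diff(x, y, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for mean(x) - mean(y)."""
    rng = random.Random(seed)
    n_x = len(x)
    observed = sum(x) / n_x - sum(y) / len(y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled ratings.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_x]) / n_x - sum(pooled[n_x:]) / (len(pooled) - n_x)
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # The add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical ratings on a 1-5 scale (not data from either study).
ratings_a = [4, 5, 3, 4, 5, 4, 3, 5]
ratings_b = [3, 4, 3, 3, 4, 2, 4, 3]
observed, p_value = permutation_test_mean_diff(ratings_a, ratings_b)
print(f"difference in means = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

Running the same data through a Welch's t-test and a permutation test of this kind would be one direct way to show where the two approaches agree or disagree.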

**Assumptions on relationship between teaching effectiveness and student performance**

Section 3.1 on page 9 states the following:

To the extent that final exams are designed well, scores on these exams reflect relevant learning outcomes for the course. Hence, in each course each semester, students of more effective instructors should do better on the final, on average, than students of less effective instructors.

This would definitely be true if the material covered in each section were exactly the same. It is not clear from the article whether this was the case or how this was evaluated. It is possible that students of a very effective teacher score poorly on an exam if the exam assesses material that was not emphasized in the course.

**Controls for non-independence**

In the French dataset there are 23,001 student evaluations of 379 instructors. We are told that instructors see the students in groups of 10-24 students. It seems like the same instructor must appear in the dataset multiple times. The analyses do not appear to control for this non-independence structure in the data.

Students might also be repeated in the dataset, since each student has to take these required courses. The analysis does not appear to control for this either.
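One way to respect the repeated appearance of instructors, sketched below under stated assumptions, is to treat the instructor rather than the individual evaluation as the unit of analysis: average each instructor's scores, then permute gender labels across instructors. The record layout `(instructor_id, gender, score)`, the labels `"M"` and `"F"`, and the data are hypothetical; this is an illustration of the idea, not a description of what the authors did.

```python
import random
from collections import defaultdict


def cluster_permutation_test(records, n_perm=5000, seed=0):
    """Permute gender labels across instructors, not across evaluations.

    records: iterable of (instructor_id, gender, score), gender in {"M", "F"}.
    Returns (observed difference in instructor-level means, p-value).
    """
    scores = defaultdict(list)
    gender_of = {}
    for instructor_id, gender, score in records:
        scores[instructor_id].append(score)
        gender_of[instructor_id] = gender
    ids = sorted(scores)
    means = [sum(scores[i]) / len(scores[i]) for i in ids]
    labels = [gender_of[i] for i in ids]

    def mean_diff(lbls):
        male = [m for m, g in zip(means, lbls) if g == "M"]
        female = [m for m, g in zip(means, lbls) if g == "F"]
        return sum(male) / len(male) - sum(female) / len(female)

    observed = mean_diff(labels)
    rng = random.Random(seed)
    shuffled = labels[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the number of "M" and "F" labels fixed
        if abs(mean_diff(shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical evaluations: several scores per instructor.
records = [
    ("i1", "M", 4), ("i1", "M", 5), ("i2", "M", 3), ("i2", "M", 4),
    ("i3", "F", 4), ("i3", "F", 3), ("i4", "F", 3), ("i4", "F", 4),
]
print(cluster_permutation_test(records))
```

Averaging within instructor discards information about group size, so this is a conservative simplification; a stratified or mixed-effects analysis would be an alternative.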

**Multivariate analysis**

All analyses presented in the article are bivariate, exploring the relationship between a given factor and the response variable of SET scores.

First, this can potentially result in problems associated with multiple testing. The authors address this issue in Section 6; however, I would expect this discussion to appear earlier in the paper.
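To make the multiple-testing concern concrete, here is a minimal sketch of Holm's step-down adjustment, one standard way to control the family-wise error rate across a series of tests. The p-values are hypothetical and the function name is illustrative; nothing here reproduces the article's own adjustment.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values from several bivariate tests.
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

A result is then declared significant at level 0.05 only if its adjusted p-value is at most 0.05.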

Second, it would potentially be more informative to analyze multivariate relationships instead of exclusively bivariate relationships. A regression-based approach, instead of a series of tests, would allow for this exploration, as well as for conclusions that evaluate the relationship between SET and certain factors while controlling for others.
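As a rough illustration of what "controlling for others" means in practice, the sketch below fits an ordinary least-squares model with two predictors using only the standard library. The covariates (an instructor-gender indicator and an interim grade) and all values are hypothetical, and the helper `ols` is written for this example rather than taken from any package.

```python
def ols(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Build the augmented system [X'X | X'y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


# Hypothetical rows: (instructor_is_female, interim_grade) -> SET score.
X = [(0, 10), (1, 12), (0, 14), (1, 9), (0, 11), (1, 15)]
set_scores = [3.1, 3.0, 3.9, 2.6, 3.4, 3.7]
intercept, b_gender, b_grade = ols(X, set_scores)
print(f"gender coefficient, holding interim grade fixed: {b_gender:.3f}")
```

The gender coefficient here is the estimated difference in SET between instructor genders at a fixed interim grade, which is the kind of adjusted comparison a series of bivariate tests cannot provide.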

**Discussion of effect size and practical significance**

The article does not discuss effect sizes or considerations of practical significance. This is especially important for the French dataset, which has a large number of observations. Indeed, in Section 6 the authors discuss the importance of finding small p-values with the small US dataset; however, a similar discussion is not included for the French dataset.
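A standardized effect size is one simple way to report practical significance alongside a p-value. The sketch below computes Cohen's d with a pooled standard deviation; the ratings are hypothetical, and the choice of Cohen's d (rather than another effect-size measure) is an assumption made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample variance."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


# Hypothetical ratings for two groups of instructors.
group_a = [4, 5, 3, 4, 5, 4, 3, 5]
group_b = [3, 4, 3, 3, 4, 2, 4, 3]
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")
```

With tens of thousands of observations, even a very small d can yield a tiny p-value, which is why the two quantities are best reported together.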

Jason Barr evaluated the article as: Rated 2.5 of 5.

#### Comments

The Boring et al. study falls short of other studies investigating gender and student ratings.

Publication date: | 14 February 2016 |

DOI: | 10.14293/S2199-1006.1.SOR-SOCSCI.AETBZC.v1.RPNWYZ |

Level of importance: | Rated 2 of 5. |

Level of validity: | Rated 2 of 5. |

Level of completeness: | Rated 3 of 5. |

Level of comprehensibility: | Rated 3 of 5. |

Competing interests: | Jason Barr, Ph.D. is employed as a researcher for The IDEA Center, a nonprofit whose mission is to improve learning in higher education through research, assessment and professional development. The IDEA Center provides Student Ratings of Instruction (SRI) instruments to colleges and universities. |


Boring et al. report the results of two studies conducted on separate samples, one from six courses offered in France, the other from one course in the U.S. The authors claim to have found gender bias in both SET instruments. So, it is logical to ask, “What exactly did those SET measure?” Regarding the French sample, the only possible answer is that we don’t know what the SET measured. Readers are simply told it included closed-ended and open-ended questions. No information is provided about any of the items on the SET or about whether they correlate with any relevant measure of teaching effectiveness. **So, we really do not know what construct is being correlated with instructor gender.**

The SET used in the U.S. sample was described previously in MacNell, Driscoll, and Hunt (2014). The 15-item instrument consisted of Likert-type items inviting students to respond from 1 = *Strongly disagree* to 5 = *Strongly agree*. Six items were intended to measure effectiveness (e.g., professionalism, knowledge, objectivity), six were for interpersonal traits (e.g., respect, enthusiasm, warmth), two were included for communication skills, and one was “to evaluate the instructor’s overall quality as a teacher.” No information about the exact wording of the items was provided. Moreover, the authors provided no theoretical explanation for item development and did not report whether the “student ratings index” correlates with any other relevant measures.

**So, in the French study we do not know exactly what aspect of teaching effectiveness is being correlated with instructor gender. In the U.S. study, we know that overall teaching quality is NOT associated with instructor gender.**

Other concerns become apparent on reviewing the study:

- What validity and reliability evidence is there for the learning measure?
- What effect did researcher expectancy effects have in the U.S. study?
- What effect did having only male lecturers have on French students?
- Many of the correlations reported are very weak and non-significant.
- Why should we assume assignment of instructors to sections in the French sample was “as if at random”?
- Correlation is not causation.
- How generalizable are these findings?

My colleagues and I examined each concern in detail, with a thorough look at the shortcomings behind each. Our editorial note, which responds to “Bias Against Female Instructors,” a column based on the study posted January 8, 2016 in *Inside Higher Education*, can be found in full at http://ideaedu.org/research-and-papers/editorial-notes/response-to-bias-against-female-instructors/.

Our conclusion was that the Boring et al. study falls short of other studies investigating gender and student ratings. In studies of ratings of actual teachers there is only a very weak relationship that favors female instructors (Centra, 2009; Feldman, 1993). This is not to say that gender bias does not exist. We grant that it can be found in all walks of life and professions. But a single study fraught with confounding variables and weak correlations should not be cause for alarm. The gender differences in student ratings reported previously (e.g., Centra & Gaubatz, 2000; Feldman, 1992, 1993) and in Boring et al. (2016) are not large and should not greatly affect teaching evaluations, *especially if SET are not the only measure of teaching effectiveness*. But even if they are the only measure, this study shows that gender contributes only about 1% of the variance in student ratings. That is hardly a “large and statistically significant” amount, as stated by the authors.