Average rating: 4.5 of 5
Level of importance: 4 of 5
Level of validity: 4 of 5
Level of completeness: 5 of 5
Level of comprehensibility: 5 of 5
Competing interests: None
This article presents findings from two studies on the effectiveness of student evaluations of teaching (SET) as a measure of teaching effectiveness. The first study is a five-year natural experiment at a French university (23,001 evaluations of 379 instructors by 4,423 students in six mandatory first-year courses), and the second is a randomized, controlled, blind experiment at a US university (43 evaluations across four sections of an online course). The analysis uses nonparametric statistical tests, which is presented as an improvement over previous analyses of the French dataset.
Overall, the designs of the two studies are interesting and the findings are eye-opening. Below are my comments, mostly on the methodology used in the analysis and on some of the assumptions made.
The methods section of the article references a previous analysis of the data where "the tests assumed that SET of male and female instructors are independent random samples from normally distributed populations with equal variances and possibly different means". I assume the authors are referring to MacNell et al. [2014]. That article states the following:
We used Welch’s t-tests (an adaptation of the Student’s t-test that does not assume equal variance) to establish the statistical significance of each difference. We also ran two general linear multivariate analyses of variance (MANOVAs) on the set of 12 variables to test the effects of instructor gender (perceived and actual) on all of the questions considered as a group.
While some of the analyses used in the previous study required normally distributed data and equal variances, others did not. The text in this article makes it seem as though all previous analyses used methods requiring these conditions. It would be best to clarify exactly where improvements to the methods were introduced and to present evidence that the data did not meet these conditions.
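To make the distinction concrete, here is a minimal sketch (Python/SciPy, using synthetic score arrays as stand-ins for the actual SET data) of how a Welch's t-test and an assumption-free permutation test of the same comparison would be run:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical SET scores on a 1-5 scale for two instructor groups.
scores_m = rng.integers(1, 6, size=200).astype(float)
scores_f = rng.integers(1, 6, size=200).astype(float)

# Welch's t-test: does not assume equal variances, but still relies on
# approximate normality of the sampling distribution of the means.
t_stat, t_p = stats.ttest_ind(scores_m, scores_f, equal_var=False)

# A permutation test of the difference in means drops both assumptions;
# it only assumes exchangeability of group labels under the null.
perm = stats.permutation_test(
    (scores_m, scores_f),
    statistic=lambda a, b: np.mean(a) - np.mean(b),
    permutation_type="independent",
    n_resamples=10_000,
    random_state=0,
)
print(t_p, perm.pvalue)
```

Spelling out which of the previous tests corresponds to the first kind of analysis, and where the permutation-style approach actually changes conclusions, would make the claimed methodological improvement much easier to evaluate.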
Another issue is that the authors do not compare their findings to those of MacNell et al. [2014]. The paper should state explicitly whether the findings agree or disagree.
Section 3.1 on page 9 states the following:
To the extent that final exams are designed well, scores on these exams reflect relevant learning outcomes for the course. Hence, in each course each semester, students of more effective instructors should do better on the final, on average, than students of less effective instructors.
This would definitely be true if the material covered in each section were exactly the same. It is not clear from the article whether this was the case or how it was evaluated. It is possible for students of a very effective teacher to score poorly on an exam if the exam assesses material that was not emphasized in the course.
The French dataset contains 23,001 student evaluations of 379 instructors. We are told that instructors teach the students in groups of 10 to 24, so the same instructor must appear in the dataset many times. The analyses do not appear to control for this non-independence in the data.
Students are also likely to appear multiple times in the dataset, since each student must take all of these required courses. The analysis does not appear to control for this either.
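One standard way to account for this clustering is a mixed-effects model with a random intercept per instructor. A minimal sketch with statsmodels, on synthetic data with hypothetical column names (set_score, instructor_gender, instructor_id), follows:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical long-format data: one row per evaluation.  Each of 50
# instructors contributes ~20 evaluations, so rows are not independent.
n_instructors, n_per = 50, 20
df = pd.DataFrame({
    "instructor_id": np.repeat(np.arange(n_instructors), n_per),
    "instructor_gender": np.repeat(rng.integers(0, 2, n_instructors), n_per),
})
instructor_effect = np.repeat(rng.normal(0, 0.5, n_instructors), n_per)
df["set_score"] = (3.5 + 0.1 * df["instructor_gender"]
                   + instructor_effect + rng.normal(0, 1, len(df)))

# A random intercept per instructor absorbs the within-instructor clustering;
# without it, standard errors for instructor-level covariates are too small.
model = smf.mixedlm("set_score ~ instructor_gender", data=df,
                    groups=df["instructor_id"])
print(model.fit().summary())
```

Adding a crossed random effect for students is more involved, but even an instructor-level random intercept alone would address the most obvious source of non-independence.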
All analyses presented in the article are bivariate, exploring the relationship between a given factor and the response variable, SET scores.
First, this can potentially lead to problems associated with multiple testing. The authors address this issue in Section 6; however, I would expect the discussion to appear earlier in the paper.
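For example, assuming a hypothetical list of p-values from the family of bivariate tests, a standard correction could be reported alongside the raw p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from a family of bivariate tests.
pvals = [0.001, 0.012, 0.034, 0.21, 0.047, 0.63]

# Benjamini-Hochberg controls the false discovery rate across the family;
# 'reject' flags which tests survive the correction.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject, p_adj)
```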
Second, it would potentially be more informative to analyze multivariate relationships instead of exclusively bivariate ones. A regression-based approach, rather than a series of tests, would allow for this exploration, as well as for conclusions about the relationship between SET and certain factors while controlling for others.
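A sketch of what I have in mind (statsmodels, synthetic data, hypothetical column names such as final_exam and course):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
# Hypothetical per-evaluation data with instructor gender, the student's
# final-exam score, and the course the evaluation belongs to.
df = pd.DataFrame({
    "instructor_gender": rng.integers(0, 2, n),
    "final_exam": rng.normal(70, 10, n),
    "course": rng.choice(list("ABCDEF"), n),
})
df["set_score"] = (3.5 + 0.1 * df["instructor_gender"]
                   + 0.01 * (df["final_exam"] - 70) + rng.normal(0, 1, n))

# One model estimates the association between SET and instructor gender
# while holding exam performance and course fixed, instead of running a
# separate bivariate test for each factor.
fit = smf.ols("set_score ~ instructor_gender + final_exam + C(course)",
              data=df).fit()
print(fit.summary())
```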
The article does not discuss effect sizes or considerations of practical significance. This is especially important for the French dataset, which has a large number of observations. In Section 6 the authors discuss the importance of finding small p-values with the small US dataset; however, a similar discussion is not included for the French dataset.
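For illustration, with roughly this many observations a tiny difference in means can produce a very small p-value while the standardized effect size remains negligible (synthetic data, Python):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# With ~23,000 observations, even a tiny difference can yield a small p-value.
a = rng.normal(3.50, 1.0, 12_000)   # hypothetical SET scores, group A
b = rng.normal(3.55, 1.0, 11_000)   # hypothetical SET scores, group B

t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# Cohen's d: difference in means in units of the pooled standard deviation.
pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                    / (len(a) + len(b) - 2))
d = (a.mean() - b.mean()) / pooled_sd
print(f"p = {p_value:.4g}, Cohen's d = {d:.3f}")  # small p, negligible d
```

Reporting an effect size of this kind for the French results would let readers judge whether the statistically significant differences are practically meaningful.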