Student evaluations of teaching (mostly) do not measure teaching effectiveness

Review statistics
  • Level of importance: 3 of 5
  • Level of validity: 3 of 5
  • Level of completeness: 4 of 5
  • Level of comprehensibility: 4 of 5
Mine Cetinkaya-Rundel evaluated the article as: Rated 4.5 of 5.
Publication date:
06 March 2016
DOI:
10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1.RUIBVH
  • Level of importance: 4 of 5
  • Level of validity: 4 of 5
  • Level of completeness: 5 of 5
  • Level of comprehensibility: 5 of 5
Competing interests:
None

Comments

This article presents findings from two studies on the effectiveness of student evaluations of teaching (SET) as a measure of teaching effectiveness. The first study is a five-year natural experiment at a French university (23,001 evaluations of 379 instructors by 4,423 students in six mandatory first-year courses), and the second is a randomized, controlled, blind experiment at a US university (43 evaluations for four sections of an online course). The methodology uses nonparametric statistical tests, which is presented as an improvement over previous analyses of the French dataset.

Overall, the designs of the two studies presented are interesting and the findings are eye-opening. Below are my comments, mostly on the methodology used in the analysis and on some of the assumptions made.

 

  • Comparison to previous work

The methods section of the article references a previous analysis of the data in which “the tests assumed that SET of male and female instructors are independent random samples from normally distributed populations with equal variances and possibly different means.” I assume the authors are referring to MacNell et al. [2014]. That article states the following:


We used Welch’s t-tests (an adaptation of the Student’s t-test that does not assume equal variance) to establish the statistical significance of each difference. We also ran two general linear multivariate analyses of variance (MANOVAs) on the set of 12 variables to test the effects of instructor gender (perceived and actual) on all of the questions considered as a group.

While some of the analyses used in the previous study required normally distribution of data and equal variances, others didn't. The text in this article makes it seem like all previous analyses used methods requiring these conditions. It would be best to clarify exactly where improvements to methods were introduced and present evidence that the data did not meet these conditions.
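As an illustration of the distinction (a minimal sketch with simulated ratings, not data from either study): Welch's t-test drops the equal-variance assumption but still models the two groups as independent samples from normal populations, whereas a permutation test relies only on the reshuffling of group labels.

```python
# Minimal sketch, simulated data only: compare Welch's t-test with a
# two-sided permutation test of the difference in group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical SET-like ratings on a 1-5 scale for two instructor groups.
group_a = rng.integers(1, 6, size=20)
group_b = rng.integers(1, 6, size=22)

# Welch's t-test: unequal variances allowed, but a normal-population
# sampling model is still assumed.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Permutation test: reshuffle the group labels and recompute the statistic.
observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
perm_diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)
    perm_diffs.append(pooled[:n_a].mean() - pooled[n_a:].mean())
perm_p = np.mean(np.abs(perm_diffs) >= abs(observed))

print(f"Welch p = {t_p:.3f}, permutation p = {perm_p:.3f}")
```

On large, well-behaved samples the two p-values are often close; the choice matters most for small samples and non-normal responses, which is where the clarification requested above would be most useful.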

Another issue is that the authors do not compare their findings to those from MacNell et al. [2014]. It should be explicitly stated in the paper whether the findings agreed or disagreed.

 

  • Assumptions on relationship between teaching effectiveness and student performance

Section 3.1 on page 9 states the following:

To the extent that final exams are designed well, scores on these exams reflect relevant learning outcomes for the course. Hence, in each course each semester, students of more effective instructors should do better on the final, on average, than students of less effective instructors.

This would definitely be true if the material covered in each section were exactly the same. It is not clear from the article whether this was the case or how this was evaluated. It is possible that students of a very effective teacher score poorly on an exam if the exam assesses material that was not emphasized in the course.

 

  • Controls for non-independence

In the French dataset there are 23,001 student evaluations of 379 instructors, and we are told that instructors see the students in groups of 10-24. It seems that the same instructor must appear in the dataset multiple times, yet the analyses do not appear to control for this non-independence structure in the data.

It also seems that students are repeated in the dataset, since each student has to take these required courses. There does not appear to be any control for this in the analysis either.

 

  • Multivariate analysis

All analyses presented in the article are bivariate, exploring the relationship between a given factor and the response variable of SET scores. 

First, this can potentially result in problems associated with multiple testing. The authors address this issue in Section 6; however, I would expect this discussion to appear earlier in the paper.
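For concreteness, one standard adjustment for multiple testing is the Holm-Bonferroni correction; a minimal sketch with made-up p-values (not values from the paper) is below.

```python
# Minimal sketch, hypothetical p-values only: Holm-Bonferroni adjustment
# for a family of tests run on the same data.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.012, 0.030, 0.045, 0.200, 0.640])
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(p_adjusted)  # adjusted p-values
print(reject)      # which hypotheses survive the correction
```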

Second, it would potentially be more informative to analyze multivariate relationships instead of exclusively bivariate relationships.  A regression based approach, instead of a series of tests, would allow for this exploration, as well as conclusions that evaluate the relationship between SET and certain factors while controlling for others.
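A minimal sketch of the kind of regression-based analysis meant here, on simulated data with hypothetical covariate names (none of the variables or coefficients come from either study):

```python
# Minimal sketch, simulated data only: regress an SET-like score on an
# instructor-gender indicator while controlling for other covariates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
male = rng.integers(0, 2, size=n)        # hypothetical gender indicator
seniority = rng.integers(1, 30, size=n)  # hypothetical covariate
exam = rng.normal(12, 3, size=n)         # hypothetical final exam score
set_score = 3 + 0.2 * male + 0.01 * seniority + rng.normal(0, 1, size=n)

X = sm.add_constant(np.column_stack([male, seniority, exam]))
model = sm.OLS(set_score, X).fit()
print(model.summary(xname=["const", "male", "seniority", "exam"]))
```

Whether such a model is appropriate for these data is a separate question, which the authors take up in their response below.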

 

  • Discussion of effect size and practical significance

The article does not discuss effect sizes or considerations of practical significance. This is especially important for the French dataset, which has a large number of observations. Indeed, in Section 6 the authors discuss the importance of finding small p-values with the small US dataset; however, a similar discussion is not included for the French dataset.
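To illustrate why this matters (a minimal sketch on simulated data, not the French data): with tens of thousands of observations, a practically negligible difference can still produce a very small p-value, which is why effect sizes deserve discussion.

```python
# Minimal sketch, simulated data only: a tiny mean difference is highly
# "statistically significant" at this sample size, yet the effect is small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(3.00, 1.0, size=11_500)  # hypothetical SET-like scores
b = rng.normal(3.08, 1.0, size=11_500)  # mean shifted by 0.08 points

t_stat, p = stats.ttest_ind(a, b, equal_var=False)
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p = {p:.2g}, Cohen's d = {cohens_d:.3f}")  # small p, small effect
```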

We thank Professor Cetinkaya-Rundel for her thoughtful review.

Most of her concerns are already addressed in the paper; others we agree with in part but not entirely. For instance, Prof. Cetinkaya-Rundel asserts that

“while some of the analyses in the previous study required normally distribution (sic) of data and equal variances, others didn’t.”

She points to Welch’s two-sample t-test and MANOVA as examples. Both of these tests rely on the assumption that data are sampled independently at random from populations with normal distributions. That stochastic assumption contradicts the actual randomization in the MacNell et al. experiment and the as-if randomization in the SciencesPo natural experiment: the data arise from randomizing a fixed group of individuals into different treatments, not from sampling from larger populations at random. And the variables involved do not have normal distributions; indeed, they are dichotomous (e.g., gender), ordinal categorical (e.g., SET), or have compact support (e.g., grades). The null hypotheses for Welch’s t-test and MANOVA are simply not relevant to the problem. Asymptotically, those tests might have the correct level under the randomization actually performed, but for finite populations, the actual levels may differ considerably from the nominal levels, as Table 8 of the paper shows.
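One way to see the point about levels (a minimal sketch with hypothetical ratings, not the MacNell et al. data) is to simulate the actual randomization: fix the students' ratings, randomly split the students into two sections many times under the strong null of no effect, and check how often a nominal 5% Welch test rejects.

```python
# Minimal sketch, hypothetical data only: estimate the actual level of
# Welch's t-test when the only randomness is the random assignment of a
# fixed group of students to two sections.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Fixed ratings on a 1-5 scale for 43 students; under the strong null each
# student would give the same rating regardless of section assignment.
ratings = rng.integers(1, 6, size=43)

reps, alpha = 10_000, 0.05
rejections = 0
for _ in range(reps):
    perm = rng.permutation(ratings)
    group_a, group_b = perm[:21], perm[21:]  # random split into two sections
    _, p = stats.ttest_ind(group_a, group_b, equal_var=False)
    rejections += p < alpha

print(f"estimated actual level: {rejections / reps:.3f} (nominal {alpha})")
```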

Prof. Cetinkaya-Rundel writes:

“Another issue is that the authors do not compare their findings to those from MacNell et al. [2014]. It should be explicitly stated in the paper whether the findings agreed or disagreed.”

Table 8 compares our results to those from MacNell et al. [2014] (the rightmost columns show the p-values from our permutation test side-by-side with the p-values from t-tests in MacNell et al.) and we discuss the differences near the end of page 8, in the section entitled “SET and perceived instructor gender.” Our omnibus test of the hypothesis that perceived instructor gender plays no role in SET matches the spirit of the MANOVA test (but for a more relevant null hypothesis); they do not report details of those tests, only that the p-values were less than 0.05. The permutation omnibus test we performed gives a p-value of essentially zero for the hypothesis that students rate instructors the same, regardless of the gender of the instructor’s name.

Prof. Cetinkaya-Rundel raises the concern that instructors and students appear in the data more than once, across years and across courses, producing correlation. In the Neyman model, the responses are fixed, not random: only the assignment of instructors to class sections is random. There is no assumption that the responses are random at all, much less independent. Moreover, instructors appear at most once per course per year, i.e., once per stratum. Because the natural experiment amounts to randomizing within each stratum, independently across strata, there is no dependence across strata, regardless of the identity of the instructors.
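For concreteness, a minimal sketch of a within-stratum permutation test of the kind described (hypothetical data and function name; this is not the code used in the paper):

```python
# Minimal sketch, hypothetical data only: permute group labels independently
# within each stratum (e.g., course-by-year), so no dependence is induced
# across strata.
import numpy as np

rng = np.random.default_rng(4)

def stratified_permutation_p(values, labels, strata, reps=10_000):
    """Two-sided permutation p-value for the difference in group means,
    permuting the 0/1 labels only within each stratum."""
    values, labels, strata = map(np.asarray, (values, labels, strata))

    def mean_diff(lab):
        return values[lab == 1].mean() - values[lab == 0].mean()

    observed = mean_diff(labels)
    count = 0
    for _ in range(reps):
        permuted = labels.copy()
        for s in np.unique(strata):
            idx = np.flatnonzero(strata == s)
            permuted[idx] = rng.permutation(permuted[idx])
        count += abs(mean_diff(permuted)) >= abs(observed)
    return count / reps

# Tiny hypothetical example: SET-like scores, a 0/1 instructor attribute,
# and a stratum identifier.
scores = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2, 3, 4]
attr   = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
strata = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print(stratified_permutation_p(scores, attr, strata))
```

Prof. Cetinkaya-Rundel continues: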

“A regression based approach, instead of a series of tests, would allow for this exploration, as well as conclusions that evaluate the relationship between SET and certain factors while controlling for others.”

A regression-based approach requires far stronger modeling assumptions (e.g., linearity of effects and IID additive errors), assumptions for which there is no basis here. Indeed, those assumptions seem rather far-fetched, for instance, because SET is an ordered categorical scale, not a linear scale. A model that posits that SET (or other variables) are observed with independent, identically distributed additive errors is also unjustified for these data: the randomness is in the assignment of students to instructors, not measurement noise. The Neyman model allows each individual (student or class section) to have its own response to each treatment condition, with no assumption about the relationship among responses of different individuals. Randomization in the experimental design mitigates confounding with no need to model the effect of covariates: that is the power of a randomized experiment. Moreover, the variables one would presumably want to control for are pretreatment variables, such as prior knowledge of the subject matter, while the covariates available (e.g., final exam scores) are measured post-treatment.

Prof. Cetinkaya-Rundel also advocates reporting effect sizes. We reported empirical mean effect sizes in the tables (e.g., column 2 of Tables 3, 4, 5, 7, and 8). We intentionally did not report confidence intervals for mean effect sizes. Confidence intervals could be constructed by inverting the permutation tests, but that would require strong assumptions about the structure of the effect that have no basis here. For instance, it would suffice to assume that the effect of instructor gender on SET is the same for every student—but the data give evidence that the effect varies by student gender, at least. Indeed, one of our main conclusions is that there are such differences! Finally, from a policy perspective, effect size per se isn’t particularly interesting: the issue is disparate impact. The data show that the effect of gender on SET can overwhelm differences in actual instructor effectiveness, as measured by exam performance.

2016-03-17 11:27 UTC
Jason Barr evaluated the article as: Rated 2.5 of 5.
The Boring et al. study falls short of other studies investigating gender and student ratings.
Publication date:
14 February 2016
DOI:
10.14293/S2199-1006.1.SOR-SOCSCI.AETBZC.v1.RPNWYZ
  • Level of importance: 2 of 5
  • Level of validity: 2 of 5
  • Level of completeness: 3 of 5
  • Level of comprehensibility: 3 of 5
Competing interests:
Jason Barr, Ph.D. is employed as a researcher for The IDEA Center, a nonprofit whose mission is to improve learning in higher education through research, assessment and professional development. The IDEA Center provides Student Ratings of Instruction (SRI) instruments to colleges and universities.

Comments

Boring et al. report the results of two studies conducted on separate samples, one from six courses offered in France, the other from one course in the U.S. The authors claim to have found gender bias in both SET instruments. So, it is logical to ask, “What exactly did those SET measure?” Regarding the French sample, the only possible answer is that we don’t know what the SET measured. Readers are simply told it included closed-ended and open-ended questions. No information is provided about any of the items on the SET, nor about whether they correlate with any relevant measure of teaching effectiveness. So, we really do not know what construct is being correlated with instructor gender.

The SET used in the U.S. sample was described previously in MacNell, Driscoll, and Hunt (2014). The 15-item instrument consisted of Likert-type items inviting students to respond from 1 = Strongly disagree to 5 = Strongly agree. Six items were intended to measure effectiveness (e.g., professionalism, knowledge, objectivity); six were for interpersonal traits (e.g., respect, enthusiasm, warmth); two were included for communication skills; and one was “to evaluate the instructor’s overall quality as a teacher.” No information about the exact wording of the items was provided. Moreover, the authors provided no theoretical explanation for item development or whether the “student ratings index” correlates with any other relevant measures.

So, in the French study we do not know exactly what aspect of teaching effectiveness is being correlated with instructor gender. In the U.S. study, we know that overall teaching quality is NOT associated with instructor gender.

Other concerns become apparent on reviewing the study:

  1. What validity and reliability evidence is there for the learning measure?
  2. What effect did researcher expectancy effects have in the U.S. study?
  3. What effect did having only male lecturers have on French students?
  4. Many of the correlations reported are very weak and non-significant.
  5. Why should we assume assignment of instructors to sections in the French sample was “as if at random”?
  6. Correlation is not causation.
  7. How generalizable are these findings?

My colleagues and I took each concern to task, with a thorough look at the shortcomings of each. The editorial note, which references a column based on the study titled “Bias Against Female Instructors,” posted January 8, 2016 in Inside Higher Education, can be found in full at http://ideaedu.org/research-and-papers/editorial-notes/response-to-bias-against-female-instructors/.

Our conclusion was that the Boring et al. study falls short of other studies investigating gender and student ratings. In studies of ratings of actual teachers, there is only a very weak relationship, one that favors female instructors (Centra, 2009; Feldman, 1993). This is not to say that gender bias does not exist; we grant that it can be found in all walks of life and professions. But a single study fraught with confounding variables and weak correlations should not be cause for alarm. The gender differences in student ratings reported previously (e.g., Centra & Gaubatz, 2000; Feldman, 1992, 1993) and in Boring et al. (2016) are not large and should not greatly affect teaching evaluations, especially if SET are not the only measure of teaching effectiveness. But even if they are the only measure, this study shows that gender contributes only about 1% of the variance in student ratings, hardly a “large and statistically significant” amount as stated by the authors.

 

Thank you for being the first to review our paper. Your concerns were already addressed: we provided full information regarding the French data, including the full survey that students completed, here. Also, as mentioned in the paper, the US data, including the survey items, are here. The statistical method is explained in detail in the paper, and code implementing all the tests is here, if you would like to replicate our results.

 

We look forward to additional reviews by people who do not have a financial interest in SET.

2016-02-17 19:27 UTC