Stark and Freishtat do a good job of reviewing the literature on the problems with student course evaluations. Since thousands of articles have been written on this topic, "good" is high praise. Their most valuable addition to this copious literature is a critique of the validity of the statistics provided by SETs (Student Evaluations of Teaching). This critique is particularly helpful because it is written in clear language that non-statisticians can understand. Since so many of the academics who use these evaluations, and rely on them for tenure and promotion decisions, are non-statisticians, this is a valuable service. As Beyers (1) remarks, “These evaluations are telling us things we know to be wrong” (p. 103). Beyers (1) discusses, for example, several classroom experiments that demonstrate how inaccurate SETs can be on straightforward items such as whether instructors return tests on time or show up to all classroom sessions. Given the problems with these straightforward questions, the usefulness of evaluations of more nebulous concepts like teaching effectiveness is questionable. Stark and Freishtat provide the deconstruction of statistics necessary to explain how the results on teaching effectiveness may seem clear-cut but are actually as flawed as Beyers’ test-return data suggest. While many teachers know intuitively, and sometimes concretely (à la Beyers), that evaluation data can be inaccurate, making the case against evaluation forms can be difficult, and Stark and Freishtat provide the necessary ammunition.
The ammunition would have been stronger if Stark and Freishtat had considered something like Benton and Cashin’s 2012 literature review (2) of 40 years of research on teaching evaluations, which defends teaching evaluations as both reliable and valid. Stark and Freishtat do a great job of discrediting the notion of SETs as reliable, but Benton and Cashin argue that they are also valid because they correlate with student achievement and instructor self-ratings, among a long list of items. It would be helpful for Stark and Freishtat to deconstruct these and similar results for the non-statistician and explain whether, and why, we can ignore them.
For example, Stark and Freishtat state “SET seems to be influenced by the gender, ethnicity and attractiveness of the instructor” (p. 6). Benton and Cashin consider ethnicity and gender and find them to have no impact. The problem is that, given the huge database of small studies, one can pick and choose and come to diametrically opposed conclusions. It would be very helpful for Stark and Freishtat to give us the statistical tools to know what to believe. No one expects them to counter every study that disagrees with their conclusions but, given their statistical sophistication, they could have indicated the criteria they used to include a study on gender, ethnicity or attractiveness. They do mention “controlled randomized experiments” as opposed to observational studies, but as a non-statistician I would love to know how these can address the biases generated by ethnicity, gender or attractiveness. Overall, these are minor concerns and simply ways of making this article even more useful than it already is.
Stark and Freishtat argue for teaching portfolios and peer observation of teaching as better routes to measuring and improving teaching effectiveness. This is not a new idea (3), but that does not mean it is not an excellent one. The problem with this strategy is that it is, as Stark and Freishtat mention, time-consuming. Asking students to pick numbers that can be averaged and compared across instructors is quick and painless from an institution’s point of view. It is, however, much like looking for the lost earring near the street light rather than on the dark sidewalk where it fell. Making jobs easy can sometimes render them useless. Stark and Freishtat make an excellent case for the effectiveness of their teaching evaluation methods but may want to argue that their methods are more efficient than they may seem. My own experience is that peer reviewers also learn ways to improve their own teaching every time they step into someone else’s classroom or read over another instructor’s materials.
(1) Beyers C. The Hermeneutics of Student Evaluations. College Teaching. 2008;56(2):102–106.
(2) Benton SL, Cashin WE. Student Ratings of Teaching: A Summary of Research and Literature. IDEA Paper No. 50. Manhattan (KS): The IDEA Center; 2012.
(3) Lauer C. A comparison of faculty and student perspectives on course evaluation terminology. In: Groccia J, Cruz L, editors. To improve the academy: resources for faculty, instructional, and organizational development. San Francisco (CA): Wiley & Sons, Inc; 2012. p. 195–212.