Nice summary of why student evaluations are not good measures of teaching quality
This article nicely summarises the reasons why student evaluations of teaching (SET) are not good measures of teaching quality. It also describes alternative practices adopted at the Statistics department of UC Berkeley. It is a very useful article that presents the material in a simple and accessible way, thus helping to disseminate the results to an audience of non-specialists. I would simply like to add a couple of elements to the debate.
First, I believe it is important to understand more precisely the reasons why SET cannot be used as indicators of teaching quality. Above and beyond the statistical complexities discussed in the article, there are compelling theoretical reasons to argue that students are not in the right position to evaluate their teachers. They go to university to learn material that they do not know in advance and for this very reason they cannot be fair judges of how well the material has been taught. Imagine for example a teacher who explains very clearly but wrongly. Students who are clearly taught wrong notions do not know that they are wrong. The problem is even worse when the teacher can manipulate the content of the course, which is a very common situation. I teach econometrics and I am pretty sure I can improve my evaluations by taking out of the syllabus those 2-3 papers that are particularly difficult (but important for the field). They are difficult for the students to understand and, since they are difficult, I also find them the most difficult to teach. If I took them out, I would presumably take out of the course my worst teaching performances and I would have more time to teach the easy stuff. Would that make me a better teacher? There is a fundamental problem of asymmetric information between students and teachers that makes it very difficult for the former to evaluate the latter.
For this reason I tend to disagree with the idea that a good evaluation of teaching should combine SET with other elements, which is often the reaction to the research findings about their lack of correlation with harder measures of teaching quality. I think we should accept the simple fact that SET measure some kind of satisfaction of the students with the learning experience. That is a very important piece of information for universities, but it is not teaching quality.
The second point I would like to make goes beyond the role of SET and is related to the notion of teaching quality. There is a lively debate about how to measure teaching quality or effectiveness and very little about what we mean by good teaching. The definition of the objective of teaching should precede its measurement. Teaching good students is not the same as teaching the average or the bad students. Improving students' performance in their follow-on courses does not necessarily also improve students' labour market outcomes, as my co-authors and I show in a paper that will be published in the Journal of Labor Economics. For example, it is well known that activities like group discussions and presentations develop a set of soft skills that are highly valued by employers. However, if I do a lot of them I have less time to go over the core material of the course, at the expense of students' learning in follow-on courses that build on that material. For the good students the trade-off between more academic and more interactive teaching activities is not particularly sharp, as they only need a little lecturing time to understand the core material and one can then spend the rest of the time on discussions and presentations. For the other students the choice of the teaching style is less obvious. I do not see any easy way out here. Each course in each institution should set out clearly what the objective of teaching is. Perhaps basic introductory courses should target the average student or even the students at the bottom of the distribution, whereas very advanced courses should probably aim at teaching the best students, and professors should be evaluated and rewarded accordingly. Similarly, some courses may be more academically oriented and others more practical, and the way teaching is evaluated should reflect this approach. Similar choices can be made at the level of the institution.
In countries like Germany or Switzerland there are both traditional universities and tertiary institutions with a less academic and more practical approach (the Fachhochschulen). Teaching should probably be evaluated differently at different institutions.
Finally, I would like to comment briefly on the alternative practice discussed at the end of the article. I like the idea of a more thorough evaluation that takes place only at some specific steps of one's career. However, I find the proposed practice a bit too time consuming and infrequent to allow timely intervention in the case of problems. I would rather experiment with a system that resembles more closely the blind peer-review model that is widely adopted to evaluate research output. Imagine that each professor is evaluated once every 2 or 3 years. During the 2-3 years between the evaluations one knows that a random sample of a few (2-3) teaching sessions will be recorded. The recording can be done without the teacher knowing by means of webcams or other technological tools that are now cheap to install and operate. Then, the recordings together with all the teaching material (syllabuses, slides, lectures, problem sets, etc.) are sent to external anonymous referees for evaluation. It is admittedly a complex system that has a number of shortcomings, but it seems to me that all such shortcomings are common to the system we use for the evaluation of research and are accepted almost universally across disciplines and countries. Of course there is a serious problem of coordination in the implementation of the system (why should I referee for another institution if my institution does not implement the same system?) but it should be possible to start a little experiment with a handful of institutions and, provided results are encouraging, start creating a wider consensus around this practice.