Since 1975, course evaluations at University of California, Berkeley, have asked:
Considering both the limitations and possibilities of the subject matter and course, how would you rate the overall teaching effectiveness of this instructor?
1 (not at all effective), 2, 3, 4 (moderately effective), 5, 6, 7 (extremely effective)
We review statistical issues in analyzing and comparing SET scores, problems defining and measuring teaching effectiveness, and pernicious distortions that result from using SET scores as a proxy for teaching quality and effectiveness. We argue here—and the literature shows—that students are in a good position to evaluate some aspects of teaching, but SET are at best tenuously connected to teaching effectiveness (defining and measuring teaching effectiveness are knotty problems in themselves; we discuss this below). Other ways of evaluating teaching can be combined with student comments to produce a more reliable and meaningful composite. We make recommendations regarding the use of SET and discuss new policies implemented at University of California, Berkeley, in 2013.
SET scores are the most common method to evaluate teaching [1–4]. They define “effective teaching” for many purposes. They are popular partly because the measurement is easy and takes little class or faculty time. Averages of SET ratings have an air of objectivity simply by virtue of being numerical. And comparing an instructor's average rating to departmental averages is simple. However, questions persist about using SET as the sole source of evidence about teaching for merit and promotion, and about the efficacy of evaluation questions and methods of interpretation.
STATISTICS AND SET
Some students do not fill out SET surveys. The response rate will be less than 100%. The lower the response rate, the less representative the responses might be: there is no reason nonresponders should be like responders—and good reasons they might not be. For instance, anger motivates people to action more than satisfaction does. Have you ever seen a public demonstration where people screamed “we're content!”? (see, e.g., http://xkcd.com/470/).
Nonresponse produces uncertainty: Suppose half the class responds and they rate the instructor's handwriting legibility as 2. The average for the entire class might be as low as 1.5, if all the nonresponders would have rated it 1. Or it might be as high as 4.5, if all the nonresponders would have rated it 7.
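The arithmetic behind these bounds can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and parameters are our own, and it assumes that every nonresponder would have given a rating somewhere between the two ends of the scale.

```python
def class_average_bounds(responder_mean, response_rate, lo=1, hi=7):
    """Return (lowest, highest) possible average for the whole class.

    responder_mean: average rating among the students who responded
    response_rate:  fraction of the class that responded, in (0, 1]
    lo, hi:         ends of the rating scale (1 and 7 on the scale above)
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    missing = 1 - response_rate
    # Worst case: every nonresponder would have given the lowest rating.
    lowest = response_rate * responder_mean + missing * lo
    # Best case: every nonresponder would have given the highest rating.
    highest = response_rate * responder_mean + missing * hi
    return lowest, highest


# Half the class responds with an average of 2, as in the example above.
print(class_average_bounds(2, 0.5))  # (1.5, 4.5)
```

The interval narrows as the response rate rises and collapses to a single value only when everyone responds.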
Some schools require faculty to explain low response rates. This seems to presume that it is the instructor's fault if the response rate is low, and that a low response rate is in itself a sign of bad teaching. Consider these scenarios:
The instructor has invested an enormous amount of effort in providing the material in several forms, including online materials, online self-test exercises, and webcast lectures; the course is at 8 am. We might expect attendance and response rates to in-class evaluations to be low.
The instructor is not following any text and has not provided notes or supplementary materials. Attending lecture is the only way to know what is covered. We might expect attendance and response rates to in-class evaluations to be high.
The instructor is exceptionally entertaining, gives “hints” in lecture about exams; the course is at 11 am. We might expect high attendance and high response rates for in-class evaluations.
Averages of small samples are more susceptible to “the luck of the draw” than averages of larger samples. This can make SET in small classes more extreme than evaluations in larger classes, even if the response rate is 100%. And students in small classes might imagine their anonymity to be more tenuous, perhaps reducing their willingness to respond truthfully or to respond at all.
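A small simulation illustrates the “luck of the draw.” It is a sketch under stated assumptions: the rating distribution (the weights below) is hypothetical, every student in every class draws independently from that same distribution, and the function name is our own. Averages are used only because they are what personnel reviews typically compare.

```python
import random
import statistics


def spread_of_class_averages(class_size, n_classes=1000, seed=0):
    """Standard deviation of class-average ratings across simulated classes."""
    rng = random.Random(seed)
    ratings = [1, 2, 3, 4, 5, 6, 7]
    weights = [1, 2, 4, 8, 12, 10, 6]  # hypothetical rating distribution
    averages = [
        statistics.mean(rng.choices(ratings, weights=weights, k=class_size))
        for _ in range(n_classes)
    ]
    return statistics.pstdev(averages)


small = spread_of_class_averages(10)
large = spread_of_class_averages(200)
print(f"SD of class averages, 10 students:  {small:.2f}")
print(f"SD of class averages, 200 students: {large:.2f}")
```

Although every simulated class draws from the same distribution, the averages of 10-student classes scatter far more than those of 200-student classes, so extreme averages appear more often in small classes by chance alone.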
Personnel reviews routinely compare instructors' average scores to departmental averages. Such comparisons make no sense, as a matter of statistics. They presume that the difference between 3 and 4 means the same thing as the difference between 6 and 7. They presume that the difference between 3 and 4 means the same thing to different students. They presume that 5 means the same thing to different students and to students in different courses. They presume that a 3 “balances” a 7 to make two 5s. For teaching evaluations, there is no reason any of those things should be true.
SET scores are ordinal categorical variables: The ratings fall in categories that have a natural order, from worst (1) to best (7). But the numbers are labels, not values. We could replace the numbers with descriptions and no information would be lost: The ratings might as well be “not at all effective,” …, “extremely effective.” It does not make sense to average labels. Relying on averages equates two ratings of 5 with ratings of 3 and 7, since both sets average to 5.
They are not equivalent, as this joke shows: Three statisticians go hunting. They spot a deer. The first statistician shoots; the shot passes a yard to the left of the deer. The second shoots; the shot passes a yard to the right of the deer. The third one yells, “we got it!”
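The same point in numbers, as a minimal Python sketch (the helper name is ours): two pairs of ratings can share an average while having nothing else in common.

```python
from collections import Counter
from statistics import mean


def distribution(ratings, scale=range(1, 8)):
    """Count how many ratings fall in each category of the 1-7 scale."""
    counts = Counter(ratings)
    return {r: counts[r] for r in scale}


agree = [5, 5]  # two students who both rate the instructor 5
split = [3, 7]  # two students who disagree sharply

print(mean(agree) == mean(split))                  # True
print(distribution(agree) == distribution(split))  # False
```

The average hides exactly the disagreement that the distribution shows.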
Comparing an individual instructor's average with the average for a course or a department is meaningless: Suppose that the departmental average for a particular course is 4.5 and the average for a particular instructor in a particular semester is 4.2. The instructor's rating is below average. How bad is that?
If other instructors get an average of exactly 4.5 when they teach the course, 4.2 might be atypically low. On the other hand, if other instructors get 6s half the time and 3s half the time, 4.2 is well within the spread of scores. Even if averaging made sense, the mere fact that one instructor's average rating is above or below the departmental average says little. We should report the distribution of scores for instructors and for courses: the percentage of ratings in each category (1–7). The distribution is easy to convey using a bar chart.
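Reporting a distribution takes little more work than reporting an average. The sketch below uses hypothetical ratings and our own function names; it computes the percentage of ratings in each category and draws a plain-text bar chart.

```python
from collections import Counter


def rating_percentages(ratings, scale=range(1, 8)):
    """Percentage of ratings in each category of the scale."""
    if not ratings:
        raise ValueError("no ratings to summarize")
    counts = Counter(ratings)
    n = len(ratings)
    return {r: 100 * counts[r] / n for r in scale}


def text_bar_chart(percentages, width=40):
    """One line per category; bar length is proportional to the percentage."""
    lines = []
    for rating, pct in percentages.items():
        bar = "#" * round(width * pct / 100)
        lines.append(f"{rating} | {bar} {pct:.0f}%")
    return "\n".join(lines)


# Hypothetical class: half the students give a 3 and half give a 6.
ratings = [3] * 10 + [6] * 10
percentages = rating_percentages(ratings)
print(text_bar_chart(percentages))
```

For these hypothetical ratings the chart shows two equal bars at 3 and 6 and nothing at 4 or 5, a pattern that the average of 4.5 would conceal.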
All the children are above average
At least half the faculty in any department will have average scores at or below the median for that department. Deans and Chairs sometimes argue that a faculty member with below-average teaching evaluations is an excellent teacher, just not as good as the other, superlative teachers in that department. With apologies to Garrison Keillor, all faculty members in all departments cannot be above average.
Students' interest in courses varies by course type (e.g., prerequisite versus major elective). The nature of the interaction between students and faculty varies with the type and size of courses. Freshmen have less experience than seniors. These variations are large and may be confounded with SET [7–9]. It is not clear how to make fair comparisons of SET across seminars, studios, labs, prerequisites, large lower-division courses, required major courses, etc.
Students are ideally situated to comment about their experience of the course, including factors that influence teaching effectiveness, such as the instructor's audibility, legibility, and perhaps the instructor's availability outside class. They can comment on whether they feel more excited about the subject after taking the class, and, for electives, whether the course inspired them to take a follow-up course. They might be able to judge clarity, but clarity may be confounded with the difficulty of the material. While some student comments are informative, one must be careful interpreting them: faculty and students use the same vocabulary differently, ascribing quite different meanings to words such as “fair,” “professional,” “organized,” “challenging,” and “respectful.” Moreover, it is not easy to compare comments across disciplines [7, 8, 12, 13] because the depth and quality of students' comments vary widely by discipline. In context, these comments are all glowing:
Humanities class:
“Before this course I had only read two plays because they were required in High School. My only expectation was to become more familiar with the works. I did not expect to enjoy the selected texts as much as I did, once they were explained and analyzed in class. It was fascinating to see texts that the authors were influenced by; I had no idea that such a web of influence in Literature existed. I wish I could be more “helpful” in this evaluation, but I cannot. I would not change a single thing about this course. I looked forward to coming to class everyday. I looked forward to doing the reading for this class. I only wish that it was a year-long course so that I could be around the material, graduate instructor's and professor for another semester.”
WHAT SET MEASURE
If you can't prove what you want to prove, demonstrate something else and pretend that they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anybody will notice the difference. 
What is effective teaching? One definition is that an effective teacher is skillful at creating conditions conducive to learning. Some learning happens no matter what the instructor does. Some students do not learn much no matter what the instructor does. How can we tell how much the instructor helped or hindered?
Measuring learning is hard: Grades are poor proxies because courses and exams can be easy or hard. If exams were set by someone other than the instructor, as they are in some universities, we might be able to use exam scores to measure learning (see, e.g., http://xkcd.com/135/). But that is not how most universities work, and teaching to the test could be confounded with learning.
Performance in follow-on courses and career success may be better measures, but those measurements are hard to make. And how much of someone's career success can be attributed to a given course, years later?
There is a large research literature on SET, most of which addresses reliability: Do different students give the same instructor similar marks [16–21]? Would a student rate the same instructor consistently later [17, 22–25]? That has nothing to do with whether SET measure effectiveness. A hundred bathroom scales might all report your weight to be the same. That does not mean the readings are accurate measures of your height, or even of your weight, for that matter.
Moreover, inter-rater reliability is an odd thing to worry about, in part because it is easy to report the full distribution of student ratings, as advocated earlier. Scatter matters, and it can be measured in situ in every course.
Observation versus randomization
Most of the research on SET is based on observational studies, not experiments. In the entire history of science, there are few observational studies that justify inferences about causes. (A notable exception is John Snow's research on the cause of cholera; his study amounts to a “natural experiment.” See http://www.stat.berkeley.edu/~stark/SticiGui/Text/experiments.htm#cholera for a discussion). In general, inferring causes, such as whether good teaching results in good evaluation scores, requires a controlled, randomized experiment: individuals are assigned to groups at random; the groups get different treatments; the outcomes are compared statistically across groups to test whether the treatments have different effects and to estimate the sizes of those differences.
Randomized experiments use a blind, non-discretionary chance mechanism to assign treatments to individuals. Randomization tends to mix individuals across groups in a balanced way. Absent randomization, other things can confound the effect of the treatment (see, e.g., http://xkcd.com/552/).
For instance, suppose some students choose classes by finding the professor reputed to be the most lenient grader. Such students might then rate that professor highly for an “easy A.” If those students choose sequel courses the same way, they may get good grades in those easy classes too, “proving” that the first ratings were justified.
The best way to reduce confounding is to assign students randomly to classes. That tends to mix students with different abilities and from easy and hard sections of the prequel across sections of sequels. This experiment has been done at the U.S. Air Force Academy and Bocconi University in Milan, Italy.
These experiments found that teaching effectiveness, as measured by subsequent performance and career success, is negatively associated with SET scores. While these two student populations might not be representative of all students, the studies are the best we have seen, and their findings are concordant.
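Random assignment itself is simple to carry out. The following Python sketch shows one way to do it; the function name, the student labels, and the number of sections are hypothetical, and it is not a reconstruction of the protocol used in either study cited above.

```python
import random


def assign_at_random(students, n_sections, seed=None):
    """Assign students to sections by a blind chance mechanism.

    The roster is shuffled and then dealt out like cards, so section
    sizes differ by at most one and nobody chooses who goes where.
    """
    rng = random.Random(seed)
    shuffled = list(students)  # copy, so the caller's roster is untouched
    rng.shuffle(shuffled)
    return [shuffled[i::n_sections] for i in range(n_sections)]


students = [f"student_{i}" for i in range(10)]  # hypothetical roster
sections = assign_at_random(students, n_sections=3, seed=1)
for number, section in enumerate(sections, start=1):
    print(number, section)
```

Because the shuffle ignores everything about the students, characteristics such as prior preparation or a taste for lenient graders tend to be spread evenly across sections, which is what allows differences in outcomes to be attributed to the sections themselves.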
What do SET measure?
SET may be reliable, in the sense that students often agree [17, 22–25]. But that is an odd focus. We do not expect instructors to be equally effective with students with different background, preparation, skill, disposition, maturity, and “learning style.” Hence, if ratings are extremely consistent, they probably do not measure teaching effectiveness: If a laboratory instrument always gives the same reading when its inputs vary substantially, it is probably broken.
There is no consensus on what SET do measure:
SET scores and enjoyment scores are related. (In the UC, Berkeley, Department of Statistics in fall 2012, for the 1486 students who rated the instructor's overall effectiveness and their enjoyment of the course, the correlation between instructor effectiveness and course enjoyment was 0.75, and the correlation between course effectiveness and course enjoyment was 0.8.)
SET can be predicted from the students' reaction to 30 seconds of silent video of the instructor; physical attractiveness matters.
Omnibus questions about curriculum design, effectiveness, etc. appear most influenced by factors unrelated to learning.
What good are SET?
Students are in a good position to observe some aspects of teaching, such as clarity, pace, legibility, audibility, and their own excitement (or boredom). SET can measure these things; the statistical issues raised above still matter, as do differences between how students and faculty use the same words.
But students cannot rate effectiveness—regardless of their intentions. Calling SET a measure of effectiveness does not make it one, any more than you can make a bathroom scale measure height by relabeling its dial “height.” Averaging “height” measurements made with 100 different scales would not help.
WHAT IS BETTER?
Let us drop the pretense. We will never be able to measure teaching effectiveness reliably and routinely. In some disciplines, measurement is possible but would require structural changes, randomization, and years of follow-up.
If we want to assess and improve teaching, we have to pay attention to the teaching, not the average of a list of student-reported numbers with a troubled and tenuous relationship to teaching. Instead, we can watch each other teach and talk to each other about teaching. We can look at student comments. We can look at materials created to design, redesign, and teach courses, such as syllabi, lecture notes, websites, textbooks, software, videos, assignments, and exams. We can look at faculty teaching statements. We can look at samples of student work. We can survey former students, advisees, and graduate instructors. We can look at the job placement success of former graduate students, etc.
We can ask: Is the teacher putting in appropriate effort? Is she following practices found to work in the discipline? Is she available to students? Is she creating new materials, new courses, or new pedagogical approaches? Is she revising, refreshing, and reworking existing courses? Is she helping keep the curriculum in the department up to date? Is she trying to improve? Is she supervising undergraduates for research, internships, and honors theses? Is she advising graduate students? Is she serving on qualifying exams and thesis committees? Do her students do well when they graduate?
Or, is she “checked out”? Does she use lecture notes she inherited two decades ago the first time she taught the course? Does she mumble, facing the board, scribbling illegibly? Do her actions and demeanor discourage students from asking questions? Is she unavailable to students outside of class? Does she cancel class frequently? Does she routinely fail to return student work? Does she refuse to serve on qualifying exams or dissertation committees?
In 2013, the University of California, Berkeley, Department of Statistics adopted as standard practice a more holistic assessment of teaching. Every candidate is asked to produce a teaching portfolio for personnel reviews, consisting of a teaching statement, syllabi, notes, websites, assignments, exams, videos, statements on mentoring, or any other materials the candidate feels are relevant. The chair and promotion committee read and comment on the portfolio in the review. At least before every “milestone” review (mid-career, tenure, full, step VI), a faculty member attends at least one of the candidate's lectures and comments on it in writing. These observations complement the portfolio and student comments. Distributions of SET scores are reported, along with response rates. Averages of scores are not reported.
Classroom observation took the reviewer about four hours, including the observation time itself. The process included conversations between the candidate and the observer, the opportunity for the candidate to respond to the written comments, and a provision for a “no-fault do-over” at the candidate's sole discretion. The candidates and the reviewer reported that the process was valuable and interesting. Based on this experience, the Dean of the Division now recommends peer observation prior to milestone reviews.
Observing more than one class session and more than one course would be better. Adding informal classroom observation and discussion between reviews would be better. Periodic surveys of former students, advisees, and teaching assistants would bring another, complementary source of information about teaching. But we feel that using teaching portfolios and even a little classroom observation improves on SET alone.
The following sample letter is a redacted amalgam of chair's letters submitted with merit and promotion cases since the Department of Statistics adopted a policy of more comprehensive assessment of teaching, including peer observation:
Smith is, by all accounts, an excellent teacher, as confirmed by the classroom observations of Professor Jones, who calls out Smith's ability to explain key concepts in a broad variety of ways, to hold the attention of the class throughout a 90-minute session, to use both the board and slides effectively, and to engage a large class in discussion. Prof. Jones's peer observation report is included in the case materials; conversations with Jones confirm that the report is Jones's candid opinion: Jones was impressed, and commented in particular on Smith's rapport with the class, Smith's sensitivity to the mood in the room and whether students were following the presentation, Smith's facility in blending derivations on the board with projected computer simulations to illustrate the mathematics, and Smith's ability to construct alternative explanations and illustrations of difficult concepts when students did not follow the first exposition.
While interpreting “effectiveness” scores is problematic, Smith's teaching evaluation scores are consistently high: in courses with a response rate of 80% or above, less than 1% of students rate Smith below a 6.
Smith's classroom skills are evidenced by student comments in teaching evaluations and by the teaching materials in her portfolio.
Examples of comments on Smith's teaching include:
I was dreading taking a statistics course, but after this class, I decided to major in statistics.
the best I've ever met … hands down best teacher I've had in 10 years of university education
overall amazing … she is the best teacher I have ever had
absolutely love it
loves to teach, humble, always helpful
extremely clear … amazing professor
just an amazing lecturer
great teacher … best instructor to date
inspiring and an excellent role model
the professor is GREAT
Critical student comments primarily concerned the difficulty of the material or the homework. None of the critical comments reflected on the pedagogy or teaching effectiveness, only the workload.
I reviewed Smith's syllabus, assignments, exams, lecture notes, and other materials for Statistics X (a prerequisite for many majors), Y (a seminar course she developed), Z (a graduate course she developed for the revised MA program, which she has spearheaded), and Q (a topics course in her research area). They are very high quality and clearly the result of considerable thought and effort.
In particular, Smith devoted an enormous amount of time to developing online materials for X over the last five years. The materials required designing and creating a substantial amount of supporting technology, representing at least 500 hours per year of effort to build and maintain. The undertaking is highly creative and advanced the state of the art. Not only are those online materials superb, they are having an impact on pedagogy elsewhere: a Google search shows over 1,200 links to those materials, of which more than half are from other countries. I am quite impressed with the pedagogy, novelty, and functionality. I have a few minor suggestions about the content, which I will discuss with Smith, but those are a matter of taste, not of correctness.
The materials for X and Y are extremely polished. Notably, Smith assigned a term project in an introductory course, harnessing the power of inquiry-based learning. I reviewed a handful of the term projects, which were ambitious and impressive. The materials for Z and Q are also well organized and interesting, and demand an impressively high level of performance from the students. The materials for Q include a great selection of data sets and computational examples that are documented well. Overall, the materials are exemplary; I would estimate that they represent well over 1,500 hours of development during the review period.
Smith's lectures in X were webcast in fall, 2013. I watched portions of a dozen of Smith's recorded lectures for X, a course I have taught many times. Smith's lectures are excellent: clear, correct, engaging, interactive, well paced, and with well organized and legible boardwork. Smith does an admirable job keeping the students involved in discussion, even in large (300+ student) lectures. Smith is particularly good at keeping the students thinking during the lecture and at inviting questions and comments. Smith responds generously and sensitively to questions, and is tuned in well to the mood of the class.
Notably, some of Smith's lecture videos have been viewed nearly 300,000 times! This is a testament to the quality of Smith's pedagogy and reach. Moreover, these recorded lectures increase the visibility of the Department and the University, and have garnered unsolicited effusive thanks and praise from across the world.
Conversations with teaching assistants indicate that Smith spent a considerable amount of time mentoring them, including weekly meetings and observing their classes several times each semester. She also played a leading role in revising the PhD curriculum in the department.
Smith has been quite active as an advisor to graduate students. In addition to serving as a member of sixteen exam committees and more than a dozen MA and PhD committees, she advised three PhD recipients (all of whom got jobs in top-ten departments), co-advised two others, and is currently advising three more. Smith advised two MA recipients who went to jobs in industry, co-advised another who went to a job in government, and advised one who changed advisors. Smith is currently advising a fifth. Smith supervised three undergraduate honors theses and two undergraduate internships during the review period.
This is an exceptionally strong record of teaching and mentoring for an assistant professor. Prof. Smith's teaching greatly exceeds expectations.
RECAPITULATION
SET do not measure teaching effectiveness.
Controlled, randomized experiments find that SET ratings are negatively associated with direct measures of effectiveness. SET seem to be influenced by the gender, ethnicity, and attractiveness of the instructor.
Summary items such as “overall effectiveness” seem most influenced by irrelevant factors.
Student comments contain valuable information about students' experiences.
Survey response rates matter. Low response rates make it impossible to generalize reliably from the respondents to the whole class.
It is practical and valuable to have faculty observe each other's classes.
It is practical and valuable to create and review teaching portfolios.
Teaching is unlikely to improve without serious, regular attention.
RECOMMENDATIONS
Drop omnibus items about “overall teaching effectiveness” and “value of the course” from teaching evaluations: They are misleading.
Do not average or compare averages of SET scores: Such averages do not make sense statistically. Instead, report the distribution of scores, the number of responders, and the response rate.
When response rates are low, extrapolating from responders to the whole class is unreliable.
Pay attention to student comments but understand their limitations. Students typically are not well situated to evaluate pedagogy.
Avoid comparing teaching in courses of different types, levels, sizes, functions, or disciplines.
Use teaching portfolios as part of the review process.
Use classroom observation as part of milestone reviews.
To improve teaching and evaluate teaching fairly and honestly, spend more time observing the teaching and looking at teaching materials.