      CBE—Life Sciences Education: the story of a “great journal scientists might be caught reading”

      Molecular Biology of the Cell
      The American Society for Cell Biology


          Abstract

How did a moderately sized scientific society create what many consider to be the leading journal in biology education? As Editor-in-Chief of the education journal of the American Society for Cell Biology (ASCB), CBE—Life Sciences Education (LSE), and recipient of the 2018 Bruce Alberts Award for Excellence in Science Education, I tell the story of the establishment, growth, and impact of ASCB’s “other journal.”


Most cited references (31)


          Assessment of Course-Based Undergraduate Research Experiences: A Meeting Report

          Students can work with the same data at the same time and with the same tools as research scientists. iPlant Education, Outreach & Training Group (2008, personal communication) INTRODUCTION Numerous calls for reform in undergraduate biology education have emphasized the value of undergraduate research (e.g., American Association for the Advancement of Science [AAAS], 2011). These calls are based on a growing body of research that documents how students benefit from research experiences (Kremer and Bringle, 1990; Kardash, 2000; Rauckhorst et al., 2001; Hathaway et al., 2002; Bauer and Bennett, 2003; Lopatto, 2004, 2007; Lopatto and Tobias, 2010; Seymour et al., 2004; Hunter et al., 2007; Russell et al., 2007; Laursen et al., 2010; Thiry and Laursen, 2011). Undergraduates who participate in research internships (also called research apprenticeships, undergraduate research experiences, or research experiences for undergraduates [REUs]) report positive outcomes, such as learning to think like a scientist, finding research exciting, and intending to pursue graduate education or careers in science (Kardash, 2000; Laursen et al., 2010; Lopatto and Tobias, 2010). Research experiences are thought to be especially beneficial for women and underrepresented minority students, presumably because they support the development of relationships with more senior scientists and with peers who can offer critical support to students who might otherwise leave the sciences (Gregerman et al., 1998; Barlow and Villarejo, 2004; Eagan et al., 2011). Yet most institutions lack the resources to involve all or even most undergraduates in a research internship (Wood, 2003; Desai et al., 2008; Harrison et al., 2011). Faculty members have developed alternative approaches to engage students in research with the aim of offering these educational benefits to many more students (Wei and Woodin, 2011). One approach that is garnering increased attention is what we call a course-based undergraduate research experience, or CURE. CUREs involve whole classes of students in addressing a research question or problem that is of interest to the scientific community. As such, CUREs have the potential to expand undergraduates’ access to and involvement in research. We illustrate this in Table 1 by comparing CUREs with research internships, in which undergraduates work one-on-one with a mentor, either a graduate student, technician, postdoctoral researcher, or faculty member. Table 1. Features of CUREs compared with research internships CUREs Research internships Scale Many students Few students Mentorship structure One instructor to many students One instructor to one student Enrollment Open to all students in a course Open to a selected or self-selecting few Time commitment Students invest time primarily in class Students invest time primarily outside class Setting Teaching lab Faculty research lab CUREs offer the capacity to involve many students in research (e.g., Rowland et al., 2012) and can serve all students who enroll in a course—not only self-selecting students who seek out research internships or who participate in specialized programs, such as honors programs or programs that support research participation by disadvantaged students. 
Moreover, CUREs can be integrated into introductory-level courses (Dabney-Smith, 2009; Harrison et al., 2011) and thus have the potential to exert a greater influence on students’ academic and career paths than research internships that occur late in an undergraduate's academic program and thus serve primarily to confirm prior academic or career choices (Hunter et al., 2007). Entry into CUREs is logistically straightforward; students simply enroll in the course. Research internships often require an application (e.g., to REU sites funded by the National Science Foundation [NSF]) or searching and networking to find faculty interested in involving undergraduates in research. For students, CUREs may reduce the stress associated with balancing a research internship with course work during a regular academic term (Rowland et al., 2012). CUREs may also offer different types of opportunities for students to develop ownership of projects, as they ask their own questions or analyze their own samples. Although this can be the case for research internships, it may be less common, given the pressure on research groups to complete and publish the work outlined in grant proposals. In both environments, beginning undergraduate researchers more often contribute to ongoing projects rather than developing their own independent projects. Opportunities for the latter are important, as work from Hanauer and colleagues (2012) suggests that students’ development of a sense of ownership can contribute to their persistence in science. The Course-Based Undergraduate Research Experiences Network (CUREnet; http://curenet.franklin.uga.edu) was initiated in 2012 with funding from NSF to support CURE instruction by addressing topics, problems, and opportunities inherent to integrating research experiences into undergraduate courses. During early discussions, the CUREnet community identified a need for a clearer definition of what constitutes a CURE and a need for systematic exploration of how students are affected by participating in CUREs. Thus, a small working group with expertise in CURE design and assessment was assembled in September 2013 to: Draft an operational definition of a CURE; Summarize research on CUREs, as well as findings from studies of undergraduate research internships that would be useful for thinking about how students are influenced by participating in CUREs; and Identify areas of greatest need with respect to evaluation of CUREs and assessment of CURE outcomes. In this paper, we summarize the meeting discussion and offer recommendations for next steps in the assessment of CUREs. CUREs DEFINED The first aim of the meeting was to define a CURE. We sought to answer the question: How can a CURE be distinguished from other laboratory learning experiences? This allows us to make explicit to students how a CURE may differ from their other science course work and to distinguish a CURE from other types of learning experiences for the purposes of education research and evaluation. We began by discussing what we mean by “research.” We propose that CUREs involve students in the following: Use of scientific practices. 
Numerous policy documents, as well as an abundance of research on the nature and practice of science, indicate that science research involves the following activities: asking questions, building and evaluating models, proposing hypotheses, designing studies, selecting methods, using the tools of science, gathering and analyzing data, identifying meaningful variation, navigating the messiness of real-world data, developing and critiquing interpretations and arguments, and communicating findings (National Research Council [NRC], 1996; Singer et al., 2006; Duschl et al., 2007; Bruck et al., 2008; AAAS, 2011; Quinn et al., 2011). Individuals engaged in science make use of a variety of techniques, such as visualization, computation, modeling, and statistical analysis, with the aim of generating new scientific knowledge and understanding (Duschl et al., 2007; AAAS, 2011). Although it is unrealistic to expect students to meaningfully participate in all of these practices during a single CURE, we propose that the opportunity to engage in multiple scientific practices (e.g., not only data collection) is a CURE hallmark. Discovery. Discovery is the process by which new knowledge or insights are obtained. Science research aims to generate new understanding of the natural world. As such, discovery in the context of a CURE implies that the outcome of an investigation is unknown to both the students and the instructor. When the outcomes of their work are not predetermined, students must make decisions such as how to interpret their data, when to track down an anomaly and when to ignore it as “noise,” or when results are sufficiently convincing to draw conclusions (Duschl et al., 2007; Quinn et al., 2011). Discovery carries with it the risk of unanticipated outcomes and ambiguous results because the work has not been done before. Discovery also necessitates exploration and evidence-based reasoning. Students and instructors must have some familiarity with the current body of knowledge in order to contribute to it and must determine whether the new evidence gathered is sufficient to support the assertion that new knowledge has been generated (Quinn et al., 2011). We propose that discovery in the context of a CURE means that students are addressing novel scientific questions aimed at generating and testing new hypotheses. In addition, when their work is considered collectively, students’ findings offer some new insight into how the natural world works. Broadly relevant or important work. Because CUREs provide opportunities for students to build on and contribute to current science knowledge, they also present opportunities for impact and action beyond the classroom. In some CUREs, this may manifest as authorship or acknowledgment in a science research publication (e.g., Leung et al., 2010; Pope et al., 2011). In other CUREs, students may develop reports of interest to the local community, such as a report on local water quality or evidence-based recommendations for community action (e.g., Savan and Sider, 2003). We propose that CUREs involve students in work that fits into a broader scientific endeavor that has meaning beyond the particular course context. (We choose the language of “broader relevance or importance” rather than the term “authenticity” because views on the authenticity of a learning experience may shift over time [Rahm et al., 2003] and may differ among students, instructors, and the broader scientific community.) Collaboration. 
Science research increasingly involves teams of scientists who contribute diverse skills to tackling large and complex problems (Quinn et al., 2011). We propose that group work is not only a common practical necessity but also an important pedagogical element of CUREs because it exposes students to the benefits of bringing together many minds and hands to tackle a problem (Singer et al., 2006). Through collaboration, students can improve their work in response to peer feedback. Collaboration also develops important intellectual and communication skills as students verbalize their thinking and practice communicating biological ideas and interpretations either to fellow students in the same discipline or to students in other disciplines. This may also encourage students’ metacognition—solidifying their thinking and helping them to recognize shortcomings in their knowledge and reasoning (Chi et al., 1994; Lyman, 1996; Smith et al., 2009; Tanner, 2009). Iteration. Science research is inherently iterative because new knowledge builds on existing knowledge. Hypotheses are tested and theories are developed through the accumulation of evidence over time by repeating studies and by addressing research questions using multiple approaches with diverse methods. CUREs generally involve students in iterative work, which can occur at multiple levels. Students may design, conduct, and interpret an investigation and, based on their results, repeat or revise aspects of their work to address problems or inconsistencies, rule out alternative explanations, or gather additional data to support assertions (NRC, 1996; Quinn et al., 2011). Students may also build on and revise aspects of other students’ investigations, whether within a single course to accumulate a sufficiently large data set for analysis or across successive offerings of the course to measure and manage variation, further test preliminary hypotheses, or increase confidence in previous findings. Students learn by trying, failing, and trying again, and by critiquing one another's work, especially the extent to which claims can be supported by evidence (NRC, 1996; Duschl et al., 2007; Quinn et al., 2011). These activities, when considered in isolation, are not unique to CUREs. Rather, we propose that it is the integration of all five dimensions that makes a learning experience a CURE. Of course, CUREs will vary in the frequency and intensity of each type of activity. We present the dimensions in Table 2 and delineate how they are useful for distinguishing between the following four laboratory learning environments: Table 2. 
Dimensions of different laboratory learning contexts Dimension Traditional Inquiry CURE Internship Use of science practices Students engage in … Few scientific practices Multiple scientific practices Multiple scientific practices Multiple scientific practices Study design and methods are … Instructor driven Student driven Student or instructor driven Student or instructor driven Discovery Purpose of the investigation is … Instructor defined Student defined Student or instructor defined Student or instructor defined Outcome is … Known to students and instructors Varied Unknown Unknown Findings are … Previously established May be novel Novel Novel Broader relevance or importance Relevance of students’ work … Is limited to the course Is limited to the course Extends beyond the course Extends beyond the course Students’ work presents opportunities for action … Rarely Rarely Often Often Collaboration Collaboration occurs … Among students in a course Among students in a course Among students, teaching assistants, instructor in a course Between student and mentor in a research group Instructor's role is … Instruction Facilitation Guidance and mentorship Guidance and mentorship Iteration Risk of generating “messy” data are … Minimized Significant Inherent Inherent Iteration is built into the process … Not typically Occasionally Often Often A traditional laboratory course, in which the topic and methods are instructor defined; there are clear “cookbook” directions and a predetermined outcome that is known to students and to the instructor (Domin, 1999; Weaver et al., 2008); An inquiry laboratory course, in which students participate in many of the cognitive and behavioral practices that are commonly performed by scientists; typically, the outcome is unknown to students, and they may be challenged to generate their own methods. The motivation for the inquiry is to challenge the students, rather than contribute to a larger body of knowledge (Domin 1999; Olson and Loucks-Horsley, 2000; Weaver et al., 2008); A CURE, in which students address a research question or problem that is of interest to the broader community with an outcome that is unknown both to the students and to the instructor (Domin 1999; Bruck et al., 2008; Weaver et al., 2008); and A research internship, in which a student is apprenticed to a senior researcher (faculty, postdoc, grad student, etc.) to help advance a science research project (Seymour et al., 2004). The five dimensions comprise a framework that can be tested empirically by characterizing how a particular dimension is manifested in a program, developing scales to measure the degree or intensity of each dimension, and determining whether the dimensions in part or as a whole are useful for distinguishing CUREs from other laboratory learning experiences. Once tested, we believe that this framework will be useful to instructors, institutional stakeholders, education researchers, and evaluators. Instructors may use the framework to delineate their instructional approach, clarify what students will be expected to do, and articulate their learning objectives. For example, in traditional laboratory instruction, students may collect and analyze data but generally do not build or evaluate models or communicate their findings to anyone except the instructor. During inquiry laboratory instruction, students may be able to complete a full inquiry cycle and thus engage at some level in the full range of scientific practices. 
Students in CUREs and research internships may engage in some scientific practices in depth, but neglect others, depending on the particular demands of the research and the structure of the project. As instructors define how their course activities connect to desired student outcomes, they can also identify directions for formative and summative assessment. Education researchers and evaluators may use the framework to characterize particular instructional interventions with the aim of determining which dimensions, to what degree and intensity, correlate with desired student outcomes. For instance, students who engage in the full range of scientific practices could reasonably be expected to improve their skills across the range of practices, while students who participate in only a subset of practices can only be expected to improve in those specific practices. Similarly, the extent to which students have control over the methods they employ may influence their sense of ownership over the investigation, thus increasing their motivation and perhaps contributing to their self-identification as scientists. Using this framework to identify critical elements of CUREs and how they relate (or not) to important student outcomes can inform both the design of CUREs and their placement in a curriculum. CURRENT KNOWLEDGE FROM ASSESSMENT OF CUREs With this definition in mind, the meeting then turned to summarizing what is known from the study of CUREs, primarily in biology and chemistry. Assessment and evaluation of CUREs has been limited to a handful of multisite programs (e.g., Goodner et al., 2003; Hatfull et al., 2006; Lopatto et al., 2008, Caruso et al., 2009; Shaffer et al., 2010; Harrison et al., 2011) and projects led by individual instructors (e.g., Drew and Triplett 2008; Siritunga et al., 2011). For the most part, these studies have emphasized student perceptions of the outcomes they realize from participating in course-based research, such as the gains they have made in research skills or clarification of their intentions to pursue further education or careers in science. To date, very few studies of student learning during CUREs have been framed according to learning theories. With a few exceptions, studies of CUREs have not described pathways that students take to arrive at specific outcomes—in other words, what aspects of the CURE are important for students to achieve both short- and long-term gains. Some studies have compared CURE instruction with research internships and have found, in general, that students report many of the same gains (e.g., Shaffer et al., 2010). A handful of studies have compared student outcomes from CUREs with those from other laboratory learning experiences. For example, Russell and Weaver (2011) compared students’ views of the nature of science after completing a traditional laboratory, an inquiry laboratory, or a CURE. The researchers used an established approach developed by Lederman and colleagues (2002) to assess students’ views of the nature of science, but it is not clear whether students in this study chose to enroll in a traditional or CURE course or whether the groups differed in other ways that might influence the extent to which their views changed following their lab experiences. Students in all three environments—traditional, inquiry, and CURE—made gains in their views of the nature of scientific knowledge as experimental and theory based, but only students in the CURE showed progress in their views of science as creative and process based. 
When students who participated in a CURE or a traditional lab were queried 2 or 3 yr afterward, they continued to differ in their perceptions of the gains they made in understanding how to do research and in their confidence in doing research (Szteinberg and Weaver, 2013). In another study, Rowland and colleagues (2012) compared student reports of outcomes from what they called an active-learning laboratory undergraduate research experience (ALLURE, which is similar to a CURE) with those from a traditional lab course. Students could choose the ALLURE or traditional instruction, which may have resulted in a self-selection bias. Students in both environments reported increased confidence in their lab skills, including technical skills (e.g., pipetting) and analytical skills (e.g., deciding whether one experimental approach is better than another). Generally, students reported similar skill gains in both environments, indicating that students can develop confidence in their lab skills during both traditional and CURE/ALLURE experiences. Most studies reporting assessment of CUREs in the life sciences have made use of the Classroom Undergraduate Research Experiences (CURE) Survey (Lopatto and Tobias, 2010). The CURE Survey comprises three elements: 1) instructor report of the extent to which the learning experience resembles the practice of science research (e.g., the outcomes of the research are unknown, students have some input into the focus or design of the research); 2) student report of learning gains; and 3) student report of attitudes toward science. A series of Likert-type items probe students’ attitudes toward science and their educational and career interests, as well as students’ perceptions of the learning experience, the nature of science, their own learning styles, and the science-related skills they developed from participating in a CURE. Use of the CURE Survey has been an important first step in assessing student outcomes of these kinds of experiences. Yet this instrument is limited as a measure of the nature and outcomes of CUREs because some important information is missing about its overall validity. No information is available about its dimensionality—that is, do student responses to survey items meant to represent similar underlying concepts correlate with each other, while correlating less with items meant to represent dissimilar concepts? For example, do responses to items about career interests correlate highly with one another, but correlate less with items focused on attitudes toward science, a dissimilar concept? Other validity questions are also not addressed. For instance, does the survey measure all important aspects of CUREs and CURE outcomes, or are important variables missing? Is the survey useful for measuring a variety of CUREs in different settings, such as CUREs for majors or nonmajors, or CUREs at introductory or advanced levels? Finally, is the survey a reliable measure—does the survey measure outcomes consistently over time and across different individuals and settings? To be consistent with the definition of CUREs given above, an assessment instrument must both touch on all five dimensions and elicit responses that capture other important aspects of CURE instruction that may be missing from this description. This will help ensure that the instrument has “content validity” (Trochim, 2006), meaning that the instrument can be used to measure all of the features important in a CURE learning experience.
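One way to make the dimensionality question concrete is to inspect an inter-item correlation matrix: items written to tap the same construct should correlate more strongly with one another than with items written for a different construct. The sketch below is a generic illustration with invented item names and simulated responses; it is not an analysis of the CURE Survey itself.

# Hypothetical sketch of a dimensionality check: items intended to measure the
# same construct (here, "career interest" vs. "attitudes toward science")
# should correlate more strongly within a construct than across constructs.
# Item names and responses are invented for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200  # simulated respondents


def noise():
    """Small item-specific measurement noise."""
    return rng.normal(scale=0.6, size=n)


career = rng.normal(size=n)    # latent "career interest"
attitude = rng.normal(size=n)  # latent "attitudes toward science"

items = pd.DataFrame({
    "career_1": career + noise(),
    "career_2": career + noise(),
    "career_3": career + noise(),
    "attitude_1": attitude + noise(),
    "attitude_2": attitude + noise(),
    "attitude_3": attitude + noise(),
})

# Same-construct items should form high-correlation blocks; cross-construct
# correlations should be noticeably lower.
print(items.corr().round(2))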
The CURE Survey relies on student perceptions of their own knowledge and skill gains, and like other such instruments, it is subject to concerns about the validity of self-report of learning gains. There is a very broad range of correlations between self-report measures of learning and measurements such as tests or expert judgments. Depending on which measures are compared, there may be a strong correlation, or almost no correlation, between self-reported data and relevant criteria (Falchikov and Boud, 1989). Validity problems with self-assessment can result from poor survey design, with survey items interpreted differently by different students, or from items designed in such a way that students are unable to recall key information or experiences (Bowman, 2011; Porter et al., 2011). The tendency of respondents to give socially desirable answers is a familiar problem with self-reporting. Bowman and Hill (2011) found that student self-reporting of educational outcomes is subject to social bias; students respond more positively because they are either implicitly or explicitly aware of the desired response. A guarantee of anonymity mitigates this validity threat (Albanese et al., 2006). Respondents also give more valid responses when they have a clear idea of what they are assessing and have received frequent and clear feedback about their progress and abilities from others, and when respondents can remember what they did during the assessment period (Kuh, 2001). For example, in her study of the outcomes of undergraduate science research internships, Kardash (2000) compared the perceptions of student interns and faculty mentors regarding the gains interns made from participating in research. She found good agreement between interns and mentors on some skills, such as understanding concepts in the field and collecting data, but statistically significant differences between mentor and intern ratings of other skills, with interns rating themselves more positively on their understanding of the importance of controls in research, their abilities to interpret results in light of original hypotheses, and their abilities to relate results to the “bigger picture.” More research is needed to understand the extent to which different students (majors, nonmajors, introductory, advanced, etc.) are able to accurately self-assess the diverse knowledge and skills they may develop from participating in CUREs. A few studies have focused on the psychosocial outcomes of participating in CUREs. One such study, conducted by Hanauer and colleagues (2012), documented the extent to which students developed a sense of ownership of the science projects they completed in a traditional laboratory course, a CURE involving fieldwork, or a research internship. Using linguistic analysis, the authors found that students in the CURE reported a stronger sense of ownership of their research projects compared with students who participated in traditional lab courses and research internships (Hanauer et al., 2012; Hanauer and Dolan, in press, 2014); these students also reported higher levels of persistence in science or medicine (Hanauer et al., 2012). Although the inferred relationship needs to be explored with a larger group of students and a more diverse set of CUREs, these results suggest that it is important to consider ownership and other psychosocial outcomes in future research and evaluation of CUREs.
A few studies have explored whether and how different students experience CUREs differently and, in turn, realize different outcomes from CUREs. This is an especially noteworthy gap in the knowledge base, given the calls to engage all students in research experiences and that research has suggested that different students may realize different outcomes from participating in research (e.g., AAAS, 2011; Thiry et al., 2012). In one such study, Alkaher and Dolan (in press, 2014) interviewed students enrolled in a CURE, the Partnership for Research and Education in Plants for Undergraduates, at three different types of institutions (i.e., community college, liberal arts college, research university) in order to examine whether and how their sense of scientific self-authorship shifted during the CURE. Baxter-Magolda (1992) defined self-authorship as the “internal capacity to define one's beliefs, relations, and social identity” or, in this context, how one sees oneself with respect to science knowledge—as a consumer, user, or producer. Developing a sense of scientific self-authorship may be an important predictor of persistence in science, as students move from simply consuming science knowledge as it is presented to becoming critical users of science, and to seeing themselves as capable of contributing to the scientific body of knowledge. Alkaher and Dolan (in press, 2014) found that some CURE students made progress in their self-authorship because they perceived the CURE goals as important to the scientific community, yet the tasks were within their capacity to make a meaningful contribution. In contrast, other students struggled with the discovery nature of the CURE in comparison with their prior traditional lab learning experiences. They perceived their inability to find the “right answer” as reflecting their inability to do science. More research is needed to determine whether and how students’ backgrounds, motives, and interests influence how they experience CUREs, and whether they realize different outcomes as a result. NEXT STEPS FOR CURE ASSESSMENT Our discussion and collective knowledge of research on CUREs and undergraduate research internships revealed several gaps in our understanding of CUREs, which can be addressed by: Defining frameworks and learning theories that may help explain how students are influenced by participating in CUREs, and utilizing these frameworks or theories to design and study CUREs; Identifying and measuring the full range of important outcomes likely to occur in CURE contexts; Using valid and reliable measures, some of which have been used to study research internships or other undergraduate learning experiences and could be adapted for CURE use, as well as developing and testing new tools to assess CUREs specifically (see Weiss and Sosulski [2003] or Trochim [2006] for general explanations of validity and reliability in social science measurement); Establishing which outcomes are best documented using self-reporting, and developing new tools or adapting existing tools to measure other outcomes; and Gathering empirical evidence to identify the distinctive dimensions of CUREs and ways to characterize the degree to which they are present in a given CURE, as well as conducting investigations to characterize relationships between particular CURE dimensions or activities and student outcomes. 
Following these recommendations will require a collective, scholarly effort involving many education researchers and evaluators and many CUREs that are diverse in terms of students, instructors, activities, and institutional contexts. We suggest that priorities of this collective effort should be to: Use current knowledge from the study of CUREs, research internships, and other relevant forms of laboratory instruction (e.g., inquiry) to define short-, medium-, and long-term outcomes that may result from student participation in CUREs; Observe and characterize many diverse CUREs to identify the activities within CUREs likely to directly result in these short-term outcomes, delineating both rewards and difficulties students encounter as they participate; Use frameworks or theories and current knowledge to hypothesize pathways students may take toward achieving long-term outcomes—the connections between activities and short-, medium-, and long-term outcomes; Determine whether one can identify key short- and medium-term outcomes that serve as important “linchpins” or connecting points through which students progress to achieve desired long-term outcomes; and Assess the extent to which students achieve these key outcomes as a result of CURE instruction, using existing or novel instruments (e.g., surveys, interview protocols, tests) that have been demonstrated to be valid and reliable measures of the desired outcomes. At the front end, this process will require increased application of learning theories and consideration of the supporting research literature, but it is likely to result in many highly testable hypotheses and a more focused and informative approach to CURE assessment overall. For example, if we can define pathways from activities to outcomes, instructors will be better able to select activities to include or emphasize during CURE instruction and decide which short-term outcomes to assess. Education researchers and evaluators will be better able to hypothesize which aspects of CURE instruction are most critical for desired student outcomes and most salient to study. Drawing from many of the references cited in this report, we have drafted a logic model for CURE instruction (Figure 1) as the first step in this process. (For more on logic models, see guidance from the W. K. Kellogg Foundation [2006].) The model includes the range of contexts, activities, outputs, and outcomes of CUREs that arose during our discussion. The model also illustrates hypothetical relationships between time, participation in CUREs, and short- and long-term outcomes resulting from CURE activities.

Figure 1. CURE logic model. This model depicts the set of variables at play in CUREs identified by the authors. During CUREs, students can work individually, in groups, or with faculty (context, green box on left) to perform corresponding activities (middle, red boxes) that yield measurable outputs (middle, pink boxes). Activities and outputs are grouped according to the five related elements of CUREs (orange boxes and arrow). Possible CURE outcomes (blue) are ordered left to right according to when students might be able to demonstrate the outcome (blue arrow) and whether the outcome is likely to be achievable from participation in a single vs. multiple CUREs (blue triangle). It is important to recognize that, given the limited time frame and scope of any single CURE, students will not participate in all possible activities or achieve all possible outcomes depicted in the model.
Rather, CURE instructors or evaluators could define a particular path and use it as a guide for designing program evaluations and assessing student outcomes. Figure 2 presents an example of how to do this with a focus on a subset of CURE activities and outcomes. It is a simplified pathway model based on findings from the research on undergraduate research internships and CUREs summarized above. Boxes in this model are potentially measurable waypoints, or steps, on a path that connects student participation in three CURE activities with the short-term outcomes students may realize during the CURE, medium-term outcomes they may realize at the end of or after the CURE, and potential long-term outcomes. Although each pathway is supported by evidence or hypotheses from the study of CUREs and research internships, these are not the only means to achieve long-term outcomes, and they do not often act alone. Rather, the model is intended to illustrate that certain short- and medium-term outcomes are likely to have a positive effect on linked long-term outcomes. See Urban and Trochim (2009) for a more detailed discussion of this approach. Figure 2. Example of a pathway model to guide CURE assessment. This model identifies a subset of activities (beige) students are likely to do during a CURE and the short- (pink), medium- (blue), and long- (green) term outcomes they may experience as a result. The arrows depict demonstrated or hypothesized relationships between activities and outcomes. (This figure is generated using software from the Cornell Office of Research and Evaluation [2010].) We explain below the example depicted in Figure 2, referencing explicit waypoints on the path with italics. This model is grounded in situated-learning theory (Lave and Wenger, 1991), which proposes that learning involves engagement in a “community of practice,” a group of people working on a common problem or endeavor (e.g., addressing a particular research question) and using a common set of practices (e.g., science practices). Situated-learning theory envisions learning as doing (e.g., presenting and evaluating work) and as belonging (e.g., interacting with faculty and peers, building networks), factors integral to becoming a practitioner (Wenger, 2008)—in the case of CUREs, becoming a scientist. Retention in a science major is a desired and measurable long-term outcome (bottom of Figure 2) that indicates students are making progress in becoming scientists and has been shown to result from participation in research (Perna et al., 2009; Eagan et al., 2013). Based on situated-learning theory, we hypothesize that three activities students might engage in are likely to lead to retention in a science major: design methods, present their work, and evaluate their own and others’ work during their research experience (Caruso et al., 2009; Harrison et al., 2011; Hanauer et al., 2012). These activities reflect the dimensions of “use of scientific practices” and “collaboration” described above. Following the right-hand path in the model, when students present their work and evaluate their own and others’ work, they will likely interact with each other and with faculty (Eagan et al., 2011). Interactions with faculty and interactions with peers may lead to improvements in students’ communication and collaboration skills, including their abilities to defend their work, negotiate, and make decisions about their research based on interactions (Ryder et al., 1999; Alexander et al., 2000; Seymour et al., 2004). 
Through these interactions, students may expand their professional networks, which may in turn offer increased access to mentoring (Packard, 2004; Eagan et al., 2011). Mentoring relationships, especially with faculty, connect undergraduates to networks that promote their education and career development by building their sense of scientific identity and defining their role within the broader scientific community (Crisp and Cruz, 2009; Hanauer, 2010; Thiry et al., 2010; Thiry and Laursen, 2011; Stanton-Salazar, 2011). Peer and faculty relationships also offer socio-emotional support that can foster students’ resilience and their ability to navigate the uncertainty inherent to science research (Chemers et al., 2011; Thiry and Laursen, 2011). Finally, research on factors that lead to retention in science majors indicates that increased science identity (Laursen et al., 2010; Estrada et al., 2011), ability to navigate uncertainty, and resilience are important precursors to a sense of belonging and ultimate retention (Gregerman et al., 1998; Zeldin and Pajares, 2000; Maton and Hrabowski, 2004; Seymour et al., 2004). The model also suggests that access to mentoring is a linchpin, a short- to medium-term outcome that serves as a connecting point through which activities are linked to long-term outcomes. Thus, access to mentoring might be assessed to diagnose students’ progress along the top pathway and predict the likelihood that they will achieve long-term outcomes. (For more insight into why assessing linchpins is particularly informative, see Urban and Trochim [2009].) Examples of measures that may be useful for testing aspects of this model and for which validity and reliability information is available include: the scientific identity scale developed by Chemers and colleagues (2011) and revised by Estrada and colleagues (2011); the student cohesiveness, teacher support, and cooperation scales of the What Is Happening in This Class? questionnaire (Dorman, 2003); and the faculty mentorship items published by Eagan and colleagues (2011). Data will need to be collected and analyzed using standard validation procedures to determine the usefulness of these scales for studying CUREs. Qualitative data from interviews or focus groups can be used to determine that students perceive these items as measuring relevant aspects of their CURE experiences and to confirm that they are interpreting the questions as intended. For example, developers of the Undergraduate Research Student Self-Assessment instrument used extensive interview data to identify key dimensions of student outcomes from research apprenticeship experiences, and then think-aloud interviews to test and refine the wording of survey items (Hunter et al., 2009). Interviews can also establish whether items apply to different groups of students. For example, items in the scientific identity scale (e.g., “I feel like I belong in the field of science”) may seem relevant, and thus “valid,” to science majors but not to non–science majors. Similarly, the faculty-mentoring items noted above (Eagan et al., 2011) include questions about whether faculty provided, for example, “encouragement to pursue graduate or professional study” or “an opportunity to work on a research project.” The first item will be most relevant to students who are enrolled in an advanced rather than an introductory CURE, while the second may be relevant only to students early enough in their undergraduate careers to have time to pursue a research internship. 
In addition, students may interpret the phrase “opportunity to work on a research project” in ways that are unrelated to mentorship by faculty, especially in the context of a CURE class with its research focus. Statistical analyses (e.g., factor analysis, calculation of Cronbach's alpha; Netemeyer et al., 2003) should confirm that the scales are consistent and stable—are they measuring what they are intended to measure and do they do so consistently? Such analyses would help determine whether students are responding as anticipated to particular items or scales and whether instruments developed to measure student outcomes of research internships can detect student growth from participation in CUREs, which are different experiences. We can also follow the left-hand path in this model with a focus on the CURE activities of designing methods and presenting work. This path is grounded in Baxter Magolda's (2003) work on students’ epistemological development and her theory of self-authorship. Specifically, as students take ownership of their learning, they transition from seeing themselves as consumers of knowledge to seeing themselves as producers of knowledge. Some students who design their own methods and present their work report an increased sense of ownership of the research (Hanauer et al., 2012; Hanauer and Dolan, 2014). Increased ownership has been shown to improve motivation and self-efficacy. Self-efficacy and motivation work in a positive-feedback loop to enhance one another and contribute to development of long-term outcomes, such as increased resilience (Graham et al., 2013). Social cognitive theory is useful for explaining this relationship: if people believe they are capable of accomplishing a task—described in the literature as self-efficacy—they are more likely to put forth effort, persist in the task, and be resilient in the face of failure (Bandura, 1986; Zeldin and Pajares, 2000). Self-efficacy has also been positively related to science identity (Zeldin and Pajares, 2000; Seymour et al., 2004; Hanauer, 2010; Estrada et al., 2011; Adedokun et al., 2013). Thus, self-efficacy becomes a linchpin that interacts closely with motivation and can be connected to retention in a science major. Existing measures that may be useful for testing this model and for which validity and reliability information is available include: the Project Ownership Survey (Hanauer and Dolan, 2014), scientific self-efficacy and scientific identity scales (Chemers et al., 2011; Estrada et al., 2011); and the self-authorship items from the Career Decision Making Survey (Creamer et al., 2010). Again, data would need to be collected and analyzed using standard validation procedures to determine the usefulness of these scales for studying CUREs. When considering what to include in a model or which pathways to emphasize, we encourage CURE stakeholders to remember that each CURE is in its own stage of development and has its own life cycle. Some are just starting and others are well established. CUREs at the beginning stages of implementation are likely to be better served by evaluating how well the program is being implemented before evaluating downstream student outcomes. Thus, early in the development of a CURE, those who are assessing CUREs may want to model a limited set of activities, outputs, and short-term outcomes. 
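To illustrate the kind of consistency check mentioned above, the sketch below computes Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), for a small set of made-up Likert responses. It is a generic illustration of the statistic, not a validation of any instrument named in this report.

# Minimal sketch of Cronbach's alpha for a set of scale items.
# Rows = respondents, columns = items; all values below are invented.
import numpy as np


def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item across respondents
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of each respondent's summed score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


# Made-up Likert-style responses (1-5) from five respondents on four items.
responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]
print(round(cronbach_alpha(responses), 2))

Conventionally, values in roughly the 0.7 to 0.9 range are read as evidence of internal consistency, although the threshold that matters depends on how the scale will be used.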
CUREs at later stages of development may focus more of their evaluation efforts on long-term student outcomes because earlier evaluations have demonstrated stability of the program's implementation. At this point, findings regarding student outcomes can more readily be attributed to participation in the CURE. Last, we would like to draw some comparisons between CUREs and research internships because these different experiences are likely to offer unique and complementary ways of engaging undergraduates in research that could be informative for CURE assessment. As noted above, a handful of studies indicate that CURE students may realize some of the same outcomes observed for students in research internships (Goodner et al., 2003; Drew and Triplett 2008; Lopatto et al., 2008; Caruso et al., 2009; Shaffer et al., 2010; Harrison et al., 2011). Yet, differences between CUREs and research internships (Table 1) are likely to influence the extent to which students achieve any particular outcome. For example, CUREs may offer different opportunities for student input and autonomy (Patel et al., 2009; Hanauer et al., 2012; Hanauer and Dolan, 2014; Table 2). The structure of CUREs may allow undergraduates to assume more responsibility in project decision making and take on leadership roles that are less often available in research internships. CUREs may involve more structured group work, providing avenues for students to develop analytical and collaboration skills as they explain or defend their thinking and provide feedback to one another. In addition, CURE students may have increased opportunities to develop and express skepticism because they are less likely to see their peers as authority figures. Alternatively, some CURE characteristics may limit the nature or extent of outcomes that students realize. CUREs take place in classroom environments with a much higher student–faculty ratio than is typical of UREs. With fewer experienced researchers to model scientific practices and provide feedback, students may be less likely to develop a strong understanding of the nature of science or a scientific identity. The amount of time students may spend doing the work in a CURE course is likely to be significantly less than what they would spend in a research internship. Students who enroll in CURE courses may be less interested in research, which may affect their own and classmates’ motivation and longer-term outcomes related to motivation. Research interns are more likely to develop close collegial relationships with faculty and other researchers, such as graduate students, postdoctoral researchers, and other research staff, who can in turn expand their professional network. In addition, CURE instructors may have limited specialized knowledge of the science that underpins the CURE. Thus, CURE students may not have access to sufficient mentorship or expertise to maximize the scientific and learning outcomes. SUMMARY This report is a first attempt to capture the distinct characteristics of CUREs and discuss ways in which they can be systematically evaluated. Utilizing current research on CUREs and on research internships, we identify and describe five dimensions of CURE instruction: use of science practices, discovery, broader relevance or importance, iteration, and collaboration. We describe how these elements might vary among different laboratory learning experiences and recommend an approach to CURE assessment that can characterize CURE activities and outcomes. 
We hope that our discussion draws attention to the importance of developing, observing, and characterizing many diverse CUREs. We also hope that this report successfully highlights the enormous potential of CUREs, not only to support students in becoming scientists, but also to provide research experiences to increasing numbers of students who will enter the workforce as teachers, employers, entrepreneurs, and young professionals. We intend for this report to serve as a starting point for a series of informed discussions and education research projects that will lead to far greater understanding of the uses, value, and impacts of CUREs, ultimately resulting in cost-effective, widely accessible, quality research experiences for a large number of undergraduate students.

            Getting Under the Hood: How and for Whom Does Increasing Course Structure Work?

            INTRODUCTION Studies across the many disciplines in science, technology, engineering, and mathematics (STEM) at the college level have shown that active learning is a more effective classroom strategy than lecture alone (reviewed in Freeman et al., 2014). Given this extensive evidence, a recent synthesis of discipline-based education research (DBER; Singer et al., 2012) suggests that it is time to move beyond simply asking whether or not active learning works to more focused questions, including how and for whom these classroom interventions work. This type of research is being referred to as second-generation education research (Eddy et al., 2013; Freeman et al., 2014) and will help refine and optimize active-learning interventions by identifying the critical elements that make an intervention effective. Identifying these elements is crucial for successful transfer of classroom strategies between instructors and institutions (Borrego et al., 2013). Using these DBER recommendations as a guide, we have replicated a course intervention (increased course structure; Freeman et al., 2011) that has been demonstrated to increase student achievement at an R1 university and explored its effectiveness when transferred to a different university with a different instructor and student population. Specifically, we expanded on the original intervention studies by exploring 1) how different student subpopulations respond to the treatment in terms of achievement and 2) course-related behaviors and perceptions. These two forms of assessment will help us both elucidate how this intervention achieves the observed increases in student achievement and identify the elements critical for the intervention's success. Are Active-Learning Interventions Transferable? The transferability of active-learning interventions into novel educational contexts is critical to the successful spread of active learning across universities (National Science Foundation, 2013). Unfortunately, transferability of an intervention across contexts cannot be assumed, as there is some evidence that the success of classroom interventions depends on the student populations in the classroom (Brownell et al., 2013), instructor classroom management style (Borrego et al., 2013), and the topics being taught (Andrews et al., 2011). Thus, interventions that work with one instructor at one institution in one class may not necessarily transfer into novel contexts. Yet the majority of published active-learning interventions at the college level have been tested with at best one or two instructors who are usually at the same institution. We test the transferability of the increased course structure intervention (Freeman et al., 2011), which was effective at a Pacific Northwest R1 university with a predominately white and Asian student body, in a Southern R1 university with a different instructor (who had no contact with the original authors) and a more diverse student body. Additionally, the original study was an introductory biology course for aspiring majors, while the current implementation included mostly nonmajors in a mixed-majors general education course. Thus, in this study, we test the transferability of the increased course structure intervention across three contexts: 1) different instructors, 2) different student body, and 3) different courses (majors vs. nonmajors). Do Course Interventions Differentially Impact Achievement in Some Student Subpopulations? 
There is emerging evidence that classroom interventions could have different impacts on students from different cultural contexts. For example, Asian-American students learn less when they are told to talk through problems out loud compared with when they think through them silently. White students, on the other hand, performed just as well, and in some cases better, when allowed to talk through problems (Kim, 2002, 2008). This finding has implications for a differential impact of peer instruction on Asian students relative to their white classmates. In addition to different cultural norms for learning, students from different subpopulations bring different value sets into the classroom that can influence how they learn in different classroom environments. For example, one study found that when a setting is perceived as interdependent (rather than independent), first-generation students perform better, but continuing-generation students do not differ (Stephens et al., 2012). Positive interpersonal feelings also increased the performance of Mexicans but not European Americans on a learning task (Savani et al., 2013). Thus, the classroom environment itself could have differential impacts on different students. Findings like these begin to call into question whether “one-size-fits-all” classroom interventions are possible and encourage researchers to disaggregate student response data by subpopulations (Singer et al., 2012). Up until now, the majority of college-level program evaluations that have disaggregated student groups have done so broadly based on their historical presence in science (underrepresented minority [URM] vs. majority students). Also, most of these studies have explored the impact of supplemental instruction outside an actual science course on student achievement (reviewed in Tsui, 2007; Fox et al., 2009). Only a few STEM course–based curricular interventions have disaggregated student performance (physics: Etkina et al., 1999; Hitt et al., 2013; math: Hooker, 2010; physical science: Poelzer and Zeng, 2008). In biology, two course-based active-learning interventions have been shown to reduce achievement gaps between historically underrepresented students and majority students. Preszler (2009) replaced a traditional course (3 h of lecture each week) with a reformed course that combined 2 h of lecture with 1 h of peer-led workshop. This change in class format increased the grades of all participating students, and the performance of URM students and females increased disproportionately. The second intervention was the increased course structure intervention (Haak et al., 2011). This intervention decreased the achievement gap between students in the Educational Opportunities Program (students from educational or economically disadvantaged backgrounds) and those not in the program by 45% (Haak et al., 2011). Studies that cluster students into two categories (URM vs. majority) assume that students within these clusters respond in the same way to classroom interventions. Yet the URM label includes black, Latin@, Native American, Hawaiian and Pacific Islander students, and the majority designation often includes both white and Asian students. Clustering students in this way leads to conclusions that are too generalized; for example, that black students will respond in a similar way to a treatment as do Latin@ students (Carpenter et al., 2006).
Yet the different racial and ethnic groups that are included in the URM designation have very different cultures, histories, and exposure to college culture that could impact whether a particular classroom strategy is effective for them (Delpit, 2006). National trends in K–12 education, revealing different achievement patterns and trajectories for black and Latin@ students, also challenge the assumption that URMs are a homogeneous group (Reardon and Galindo, 2009). To our knowledge, only two college-level curricular interventions in STEM, and none in biology, have subdivided the URM category into more fine-grained groups to explore the effectiveness of classroom interventions for these different student populations. In these studies, students of different racial/ethnic groups responded differently to the classroom interventions (Etkina et al., 1999; Beichner et al., 2007). This was demonstrated most dramatically by Beichner et al. (2007), in whose study white and black students were the only groups to benefit significantly from an active-learning intervention. These findings highlight the need for more studies to analyze college course performance by racial/ethnic groups. These smaller categories can still be problematic, as they still combine students with very different cultural backgrounds and experiences into broad categories such as white, Asian, Native American, and Latin@ (Lee, 2011; Carpenter et al., 2006), but disaggregating students to this level will provide a finer-grained picture of the classroom than has been previously reported. A second population of students of concern is first-generation students. These students have limited exposure to the culture of college and are often from working-class backgrounds that may be at odds with the middle-class cultural norms of universities (e.g., the emphasis on abstract over practical knowledge and independence over interdependence; Stephens et al., 2012; Wilson and Kittleson, 2013). The differences between first- and continuing-generation students have been shown to change how they respond to “best-practices” in teaching at the college level, sometimes to the extent that they respond oppositionally (Padgett et al., 2012). In biology, we are not aware of any studies that have explored the response of this population to an active-learning intervention, although there has been promising work with a psychology intervention (Harackiewicz et al., 2014). In our study, we explored whether racial (black, white, Native American, Asian) and/or ethnic (Latin@) identity and first-generation versus continuing-generation status influenced a student's response to the increased course structure. We hypothesized that different student groups would vary in the extent to which an active-learning intervention would influence their exam performance. How Do Active-Learning Interventions Change Course-Related Behaviors and Attitudes of Students? Understanding how interventions change course-related behaviors and attitudes is an important next step in education research, as these behaviors and attitudes mediate how the course structure influences performance (Singer et al., 2012). Some work has already described how active learning increases achievement at the college level, although this work is lacking in the STEM disciplines and usually only looks at the student body as a whole. 
Courses with more active learning are positively correlated with increased student self-reported motivation and self-efficacy (van Wyk, 2012) and a deeper approach to learning (Eley, 1992). Unfortunately, this work has been done only in active-learning classrooms, and either there is no control group (cf. Keeler and Steinhorst, 1995; Cavanagh, 2011) or the study asks students to compare their experience with a concurrently enrolled course that has a different instructor and different content (cf. Sharma et al., 2005). In our study, we examine how student attitudes and course-related behaviors change between a traditionally taught and an increased-structure course with the same content and instructor.
Reviewing the elements of successful classroom interventions suggests possible factors that could contribute to the increase in student achievement. For example, the increased course structure intervention involves the addition of three elements: graded preparatory assignments, extensive student in-class engagement, and graded review assignments (Table 1). Proponents of the increased course structure intervention have hypothesized that the additional practice led to the rise in student performance (Freeman et al., 2011). Yet providing opportunities for practice might not be enough. When and what students practice, as well as the context of the practice and their perceptions of it, may influence the impact of the extra practice on learning.
Table 1. The elements of a low-, moderate-, and high-structure course
Structure level | Graded preparatory assignments (example: reading quiz) | Student in-class engagement (example: clicker questions, worksheets, case studies) | Graded review assignments (example: practice exam problems)
Low (traditional lecture) | None or <1 per week | <15% of course time | None or <1 per week
Moderate | ≥1 per week(a) | 15–40% of course time | ≥1 per week(a)
High | ≥1 per week | >40% of course time | ≥1 per week
(a) Need either a preparatory or review assignment once per week, but not both.
There are many possible factors that change with the implementation of increased course structure. We focus on three candidate factors, but it is important to recognize that these factors are not mutually exclusive or exhaustive.
Factor 1. Time allocation: Increasing course structure will encourage students to spend more time each week on the course, particularly on preparation. How students allocate their out-of-class study time can greatly influence their learning and course achievement. Many students adopt the strategy of massing their study time and cramming just before exams (Michaels and Miethe, 1989; McIntyre and Munson, 2008). Yet distributed practice is a more effective method for learning, particularly for long-term retention of knowledge (Dunlosky et al., 2013). The increased course structure helps students distribute their study time for the class by assigning daily or weekly preparatory and review assignments. These assignments 1) spread out the time students spend on the course throughout the term (distributed practice, rather than cramming just before exams) and 2) encourage students to engage with a topic before class (preparatory assignment), then again in class (in-class activities), and again after class (review assignments). In addition, the preparatory assignments not only encourage students to read the book before class, but also have students answer questions related to the reading, which is a more effective method for learning new material than simply highlighting a text (Dunlosky et al., 2013).
We believe that the outside assignments scaffold how students spend time on the course and are one of the primary factors by which increased course structure impacts student performance. However, this idea has never been explicitly tested. In this study, we asked students to report how much time they spent outside of class on the course weekly and what they spent that time doing. We predicted that students would spend more time each week on the course and would spend more time on the parts associated with course points. These results would imply an increase in distributed practice and demonstrate that the instructor can successfully guide what students spend time on outside of class.
Factor 2. Classroom culture: Increasing course structure will encourage students to perceive the class as a community. To learn, students must feel comfortable enough to be willing to take risks and engage in challenging thinking and problem solving (Ellis, 2004). High-stakes competitive classrooms dominated by a few student voices are not environments in which many students feel safe taking risks to learn (Johnson, 2007). The increased-structure format has students work in small groups, which may help students develop a more collaborative sense of the classroom. Collaborative learning in college has been shown to increase a sense of social support in the classroom as well as the sense that students like each other (Johnson et al., 1998). This more interdependent environment also decreases anxiety and leads to increased participation in class (Fassinger, 2000) and critical thinking (Tsui, 2002). Increased participation in in-class practice alone could lead to increased performance on exams. In addition, a more interdependent environment has been shown to be particularly important for the performance of first-generation students and Mexican students (Stephens et al., 2012; Savani et al., 2013). Finally, feeling like they are part of a community increases both performance and motivation, especially for historically underrepresented groups (Walton and Cohen, 2007; Walton et al., 2012). We predicted that students in an increased-structure course would change how they viewed the classroom, specifically, that they would feel an increased sense of community relative to students in low-structure courses.
Factor 3. Course value: Increasing course structure will increase the perceived value of the course to students. In the increased-structure course, students come to class having read the book, or at least having worked through the preparatory assignment, and thus have begun the knowledge acquisition stage of learning. This shift of content acquisition from in class to before class opens up time in the classroom for the instructor to help students develop higher-order cognitive skills (Freeman et al., 2011), providing opportunities to encourage students to make connections between course content and real-world impacts and to work through challenging problems. These opportunities for practice and real-world connections are thought to be more engaging to students than traditional lecture (Handelsman et al., 2006). Thus, through increased engagement with the material (because of increased interest in it), student performance will increase (Carini et al., 2006). We predicted students in the increased-structure course would feel more engaged by the material and thus would value the course more.
We considered these three factors—time allocation, classroom culture, and course value—when surveying students about their perceptions and behaviors. We analyzed student survey responses in both the traditional and increased-structure course to identify patterns in responses that support the impact of these three factors on student performance. In summary, we test the transferability of one active-learning intervention (increased course structure; Freeman et al., 2011) into a novel educational context. We expand upon the initial studies by 1) disaggregating student performance to test the hypothesis that student subpopulations respond differently to educational interventions and 2) using student self-reported data to identify possible factors (time allocation, classroom culture, course value) through which the intervention could be influencing student achievement. METHODS AND RESULTS The Course and the Students The course, offered at a large research institution in the Southeast that qualifies as a more selective, full-time, 4-yr institution with a low transfer-in rate on the Carnegie scale, is a one-semester general introduction to biology serving a mixed-majors student population. The course is offered in both Fall and Spring semesters. Course topics include general introductions to the nature of science, cell biology, genetics, evolution and ecology, and animal physiology. The class met three times a week for 50 min each period. An optional laboratory course is associated with the lecture course, but lab grades are not linked to lecture grade. Although multiple instructors teach this course in a year, the data used in this study all come from six terms taught by the same instructor (K.A.H.). The instructor holds a PhD in pathology and laboratory medicine and had 6 yr of experience teaching this course before any of the terms used in this study. The majority of students enrolled in the course were in their first year of college (69%), but the course is open to all students. The class size for each of the six terms of the study averaged 393 students. The most common majors in the course include biology, exercise and sports science, and psychology. The combined student demographics in this course during the years of this study were: 59% white, 13.9% black, 10.3% Latin@, 7.4% Asian, 1.1% Native American, and 8% of either undeclared race, mixed descent, or international origin. In addition, 66.3% of the students identified as female, 32.1% male, and 1.6% unspecified gender, and 24% of these students were first-generation college students. The Intervention: Increasing Course Structure Throughout our analyses, we compared the same course during three terms of low structure and three terms of moderate structure (Table 1). How these designations—low and moderate—were determined is explained later in the section Determining the Structure Level of the Intervention. During the low-structure terms of this study (Spring 2009, Fall 2009, Spring 2010), the course was taught in a traditional lecture format in which students participated very little in class. In addition, only three homework assignments were completed outside the classroom to help students prepare for four high-stakes exams (three semester exams and one cumulative final). In the reformed terms (Fall 2010, Spring 2011, Fall 2011), a moderate-structure format was used with both in-class and out of class activities added. 
The elements added (guided-reading questions, preparatory homework, and in-class activities) are detailed below, and Table 2 gives some specific examples across one topic.
Table 2. Sample question types associated with the three assignment types added during the moderate-structure terms
Example learning objective: Determine the possible combinations of characteristics produced through independent assortment and correlate this to illustrations of metaphase I of meiosis.
Example guided-reading questions (preclass, ungraded):
1. Examine Figure 8.14. Why are the chromosomes colored red and blue in this figure? What does red or blue represent?
2. Describe in words and draw how independent orientation of homologues at metaphase I produces variation.
Example preparatory homework question (preclass, graded): Independent orientation of chromosomes at metaphase I results in an increase in the number of: a) sex chromosomes; b) homologous chromosomes; c) points of crossing over; d) possible combinations of characteristics; e) gametes.
Example in-class questions (extra credit): Students were shown an illustration of a diploid cell in metaphase I with the genotype AaBbDd. For all questions, students were told to "ignore crossing over."
1. For this cell, what is n?
2. How many unique gametes can form? That is, how many unique combinations of chromosomes can form?
3. How many different ways in total can we draw metaphase I for this cell?
4. How many different combinations of chromosomes can you make in one of your gametes?
Guided-Reading Questions. Twice a week, students were given ungraded, instructor-designed guided-reading questions to complete while reading their textbook before class. These questions helped to teach active reading (beyond highlighting) and to reinforce study skills, such as drawing, using the content in each chapter (Table 2; Supplemental Material, section 1). Although these questions were not graded, the instructor set the expectation that the daily in-class activities would build on and refer to them without covering the same content in the same format. Keys were not posted.
Preparatory Homework. Students were required to complete online graded homework associated with assigned readings before coming to class (Mastering Biology for Pearson's Campbell Biology: Concepts and Connections). The instructor used the program's settings to coach the students and help them assess their own knowledge before class. Students were given multiple opportunities to answer each question (between two and six attempts, depending on question structure) and were allowed to access hints and immediate correct/incorrect answer feedback. The questions were typically at the knowledge and comprehension levels of Bloom's taxonomy (Table 2).
In-Class Activities. As course content previously covered by lecture was moved into the guided-reading questions and preparatory homework, on average 34.5% of each class session was now devoted to activities that reinforced major concepts, study skills, and higher-order thinking skills. Students often worked in informal groups, answering questions similar to exam questions by using classroom-response software (www.polleverywhere.com) on their laptops and cell phones. Thirty-six percent of these questions required a student to apply higher-order cognitive skills, such as application of concepts to novel scenarios or analysis (see Supplemental Material, section 2, for methods).
Although responses to in-class questions were not graded, students received 1–2 percentage points of extra credit on each of four exams if they participated in a defined number of in-class questions. The remaining 65.5% of class time involved the instructor setting up the activities, delivering content, and handling course logistics. These percentages are based on observations of four randomly chosen class-session videos. The course was videotaped routinely, so the instructor did not know in advance which class sessions would be scored.
Determining the Structure Level of the Intervention
Using the data from two articles by Freeman and colleagues (Freeman et al., 2007, 2011) and consulting with Scott Freeman (personal communication) and the Biology Education Research Group at the University of Washington, we identified the critical elements of low, moderate, and high structure (Table 1). Based on these elements, our intervention was a "moderate" structure course: we had weekly graded preparatory homework, students were talking on average 35% of class time, and there were no graded review assignments.
Study 1: Does the Increased Course Structure Intervention Transfer to a Novel Environment?
Total Exam Points by Course Structure. Our measure of achievement was total exam points. We chose this measure over final grade, because the six terms of this course differed in the total points coming from homework (3 vs. 10%) and the opportunity for bonus points could inflate the final grade in the reformed class. Instead, we compared the total exam points earned out of the possible exam points. As total exam points varied across the six terms by 5 points (145–150), all terms were scaled to be out of 145 points in the final data set. As this study took place over 4 years, we were concerned that term-to-term variation in student academic ability and exam difficulty could confound our survey and achievement results. To be confident that any gains we observed were due to the intervention and not these other sources of variation, we controlled for both exam cognitive level (cf. Crowe et al., 2008) and student prior academic achievement (for more details, see Supplemental Material, section 2). We found that exams were similar across all six terms and that the best control for prior academic achievement was a student's combined SAT math and SAT verbal score (Table 3; Supplemental Material, section 2). We therefore used SAT scores as a control for student-level variation in our analyses and did not further control for exams.
Table 3. Regression models used to determine whether 1) increased structure can be transferred to a novel environment (study 1) and 2) student subpopulations vary in their response to increased course structure (study 2)(a)
Base model (student performance influenced by course structure): Outcome ~ Term + Combined SAT scores + Gender + Course Structure
Model 2 (impact of course structure on student performance varies by race/ethnicity/nationality): Outcome ~ Term + SAT scores + Gender + Course Structure + Race + Race × Course Structure
Model 3 (impact of course structure on student performance varies by first-generation status): Outcome ~ Term + SAT scores + Gender + Course Structure + First-generation + First-generation × Course Structure
(a) The terms added in models 2 and 3 (Race, First-generation, and their interactions with Course Structure) test the specific hypotheses that the impact of course structure will vary by student population. The outcome variable is either student achievement on exams or student failure rates.
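The models in Table 3 map directly onto standard regression formulas. The sketch below is purely illustrative: it uses Python's statsmodels on synthetic data, and every column name (exam_points, sat_combined, structure, and so on) is a hypothetical stand-in rather than the study's actual variables or analysis code, which this excerpt does not specify.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the real study used ~2,300 students across six terms.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "exam_points": rng.normal(100, 15, n),       # total exam points (scaled to 145 possible)
    "sat_combined": rng.normal(1100, 120, n),    # SAT math + verbal control
    "term": rng.choice(["Fall", "Spring"], n),
    "gender": rng.integers(0, 2, n),             # 0 = male, 1 = female
    "structure": rng.integers(0, 2, n),          # 0 = low structure, 1 = moderate structure
    "race": rng.choice(["white", "black", "latin", "asian"], n),
})

# Base model (study 1): is course structure associated with exam performance
# after controlling for term, SAT scores, and gender?
base = smf.ols("exam_points ~ term + sat_combined + gender + structure", data=df).fit()

# Model 2 (study 2): the race:structure interaction carries the hypothesis that
# the effect of increased structure differs across racial/ethnic groups.
model2 = smf.ols(
    "exam_points ~ term + sat_combined + gender + structure + race + race:structure",
    data=df,
).fit()

print(base.params)
print(model2.params)
```

A model for first-generation status (model 3) would follow the same pattern, swapping the race terms for a binary first-generation indicator and its interaction with structure.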
Course and Exam Failure Rates by Course Structure. To become a biology major, students must earn a minimum of a “C−” in this course. Thus, for the purpose of this study, we considered a grade below 72.9% to be failing, because the student earning this would not be able to move on to the next biology course. We measured failure rates in two ways: 1) final grade and 2) total exam points. Although the components contributing to final course grade changed across the study, this “C−” cutoff for entering the biology major remained consistent. This measure may be more pertinent to students than overall exam performance, because it determines whether or not they can continue in the major. To look more closely at whether increased student learning was occurring due to the intervention, we looked at failure rates on the exams themselves. This measure avoids the conflation of any boost in performance due to extra credit or homework points or deviations from a traditional grading scale but is not as pertinent to retention in the major as course grade. The statistical analysis for this study is paired with that of study 2 and is described later. Study 2. Does the Effectiveness of Increased Course Structure Vary across Different Student Populations? In addition to identifying whether an overall increase in achievement occurred during the moderate-structure terms, we included categorical variables in our analyses to determine whether student subpopulations respond differently to the treatment. We focused on two designations: 1) student ethnic, racial, or national origin, which included the designations of Asian American, black, Latin@, mixed race/ethnicity, Native American, white, and international students; and 2) student generational status (first-generation vs. continuing-generation college student). Both of these factors were determined from student self-reported data from an in-class survey collected at the end of the term. Statistical Analyses: Studies 1 and 2 Total Exam Points Earned by Course Structure and Student Populations. We modeled total exam points as continuous response and used a linear regression model to determine whether moderate course structure was correlated with increased exam performance (Table 3). In our baseline model, we included student combined SAT scores, gender identity (in this case, a binary factor: 0 = male, 1 = female), and the term a student was in the course (Fall vs. Spring) as control variables. Term was included, because the instructor has historically observed that students in the Spring term perform better than students in the Fall term. To test our first hypothesis, that increasing the course structure would increase performance (study 1), we included treatment (0 = low structure, 1 = moderate structure) as a binary explanatory variable. To test our second hypothesis, that students from distinct populations may differ in their response to the classroom intervention, we ran two models (Table 3) that included the four variables described above and either 1) student racial and ethnic group (a seven-level factor) or 2) student first-generation status (a binary factor: 1 = first generation, 0 = continuing generation). If any of these demographic descriptors were not available for a student, that student was not included in the study. 
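Because the failure outcomes are binary (a student either falls below the C− cutoff or does not), they lend themselves to logistic rather than linear regression with the same covariates. The sketch below is a hypothetical illustration only: the 72.9% cutoff and the failure rates quoted later in the Discussion come from the text, but the column names and the use of statsmodels are assumptions, not the authors' code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "final_grade_pct": rng.uniform(55, 100, n),   # hypothetical final course grade (%)
    "sat_combined": rng.normal(1100, 120, n),
    "gender": rng.integers(0, 2, n),
    "term": rng.choice(["Fall", "Spring"], n),
    "structure": rng.integers(0, 2, n),           # 0 = low, 1 = moderate
})

# A final grade below C- (< 72.9%) blocks entry into the next biology course.
df["failed_course"] = (df["final_grade_pct"] < 72.9).astype(int)

# Binary outcome with the same controls as the exam-point models.
fail_model = smf.logit(
    "failed_course ~ term + sat_combined + gender + structure", data=df
).fit(disp=False)
print(np.exp(fail_model.params))   # exponentiated coefficients = odds ratios for failing

# The "% reduction in failure rate" figures reported later are relative changes,
# e.g., a drop from 26.6% to 15.6% is a (26.6 - 15.6) / 26.6 ~ 41% reduction.
def relative_reduction(before_pct, after_pct):
    return 100 * (before_pct - after_pct) / before_pct

print(relative_reduction(26.6, 15.6))
```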
We ran separate regression models for race/ethnicity and generation status, because we found these terms were correlated in an initial test of correlations between our possible explanatory variables (Kruskal-Wallis χ2 = 68.1, df = 5, p < 0.0001). […]
Table 5. Student survey responses under low and moderate course structure(a)
Survey item (response scale) | Median, low structure | Median, moderate structure | Odds ratio for course structure (95% CI) | Odds ratio for SAT scores (95% CI)
Factor 1. Time allocation: Increasing course structure will encourage students to spend more time each week on the course, particularly on preparation.
Hours spent on the course per week (…, >10 h) | 1–3 h | 4–7 h | 2.60 (2.02–3.35) | 0.982 (0.974–0.990)
Complete readings before class (Never, Rarely, Sometimes, Often) | Rarely | Sometimes | 1.97 (1.54–2.52) | 0.994 (0.985–1.00)
Preparatory homework importance (Not at all, Somewhat, Important, Very) | Somewhat | Important | 4.6 (3.56–5.85) | 0.98 (0.97–0.98)
Review notes after class (Never, Rarely, Sometimes, Often) | Sometimes | Sometimes | 0.738 (0.583–0.933) | 0.972 (0.965–0.980)
Complete textbook review questions (Never, Rarely, Sometimes, Often) | Rarely | Rarely | 0.50 (0.400–0.645) | 0.98 (0.972–0.99)
Factor 2. Classroom culture: Increasing course structure will encourage students to perceive the class as more of a community.
Contribute to classroom discussions (Never, Rarely, Sometimes, Often) | Never | Rarely | 1.13 (0.890–1.44) | 0.99 (0.988–1.00)
Work with a classmate outside of class (Never, Rarely, Sometimes, Often) | Sometimes | Sometimes | 0.83 (0.664–1.06) | 0.984 (0.977–0.991)
Believe students in class know each other (Strongly disagree, Disagree, Neutral, Agree, Strongly agree) | Neutral | Neutral | 2.4 (1.92–3.09) | 0.996 (0.989–1.00)
Believe students in class help each other (Strongly disagree, Disagree, Neutral, Agree, Strongly agree) | Agree | Agree | 1.22 (0.948–1.57) | 1.01 (0.999–1.02)
Perceive class as a community (Strongly disagree, Disagree, Neutral, Agree, Strongly agree) | Neutral | Neutral | 1.99 (1.57–2.52) | 0.986 (0.979–0.993)
Factor 3. Course value: Increasing course structure will increase the value of the course to students.
Amount of memorization (Most, Quite a bit, Some, Very Little, None) | Some | Some | 1.07 (0.84–1.35) | 0.98 (0.982–0.997)
Attend lecture (Never, Rarely, Sometimes, Often) | Often | Often | 0.72 (0.471–1.09) | 0.984 (0.971–0.997)
Use of skills learned (Strongly disagree, Disagree, Neutral, Agree, Strongly agree) | Agree | Agree | 0.909 (0.720–1.15) | 0.991 (0.983–0.998)
Lecture importance (Not at all, Somewhat, Important, Very) | Very Important | Important | 0.57 (0.448–0.730) | 0.998 (0.991–1.01)
(a) The second and third columns are the raw median responses under each structure. The fourth and fifth columns are the odds ratios from the log-odds regression including course structure and SAT scores as explanatory variables (>1 = students more likely to report a higher value; <1 = students more likely to report a lower value).
[…] (p < 0.0001). Interestingly, even with the additional investment of hours each week, a focus on preparation seemed to represent a trade-off with time spent reviewing: after we controlled for SAT math and reading scores, students were 1.4 times less likely to review their notes after class as frequently (β = −0.30 ± 0.12 SE, p = 0.011) and 1.9 times less likely to complete the practice questions at the end of each book chapter (β = −0.68 ± 0.12 SE, p < 0.0001). After we controlled for SAT math and reading scores, students also did not vary in their frequency of lecture attendance (although this could be because attendance was high to begin with; β = −0.32 ± 0.21 SE, p = 0.13). Student perception of the importance of the skills they learned in the class did not vary between course structures (β = −0.09 ± 0.12 SE, p = 0.42), nor did students perceive that the moderate-structure course involved cognitive skills other than memorization to a greater extent (β = 0.07 ± 0.12 SE, p = 0.58).
Population-Specific Patterns
Black Students Demonstrate Differences in Behaviors and Perceptions among Student Populations.
On the basis of the results in study 1, which demonstrated that increased course structure was most effective for black and first-generation students, we explored student survey responses to determine whether we could document what was different for these populations of students. We identified one behavior and three perception questions for which adding a binomial variable identifying whether a student was part of the black population or not increased the fit of the log-odds regression to the data. These differential responses may help us elucidate why this population responded so strongly to the increased-structure treatment. The one behavior that changed disproportionately for black students relative to other students in the class was speaking in class. Under low structure, black students were 2.3 times more likely to report a lower level of in-class participation than students of other ethnicities (β = −0.84 ± 0.35 SE, p = 0.012). The significant interaction between being black and being enrolled in the moderate-structure course (β = 0.89 ± 0.38 SE, p = 0.019) means this difference in participation completely disappears in the modified course. Perception of the course also differed for black students compared with the rest of the students in three ways. First, black students were more likely to report that the homework was important for their understanding relative to other students in the class under both low and moderate structure. (β = 1.06 ± 0.31 SE, p = 0.0006). The significant interaction term between course structure and black racial identity indicates the difference between black students and other students in the class decreases under moderate structure (Table 5; β = 1.06 ± 0.31 SE, p = 0.0006), but this seems to be due to all students reporting higher value for the homework under moderate structure. In addition, black students perceived that there were less memorization and more higher-order skills in the class relative to other students in the class (β = −0.39 ± 0.59 SE, p = 0.024) under both low and moderate structures. Finally, there was a trend for black students to be 1.3 times more likely to report that the skills they learned in this course would be useful for them (β = 0.29 ± 0.16 SE, p = 0.07). Unlike the clear patterns with black students, we found no significant differences in survey responses based on first-generation status. Behaviors and Perceptions That Correlate with Success Are More Numerous under Moderate Structure. During the low-structure term, only lecture attendance impacted exam performance (i.e., significantly improved the fit of the models to the exam performance data after we controlled for student SAT scores; F = 9.59, p < 0.0001). Specifically, students who reported attending fewer lectures performed worse on exams. Students who reported accessing the textbook website more tended to perform better on exams (F = 2.48, p = 0.060), but this difference did not significantly improve the fit of the model. In the moderate-structure terms, attending class (F = 9.59, p < 0.0001), speaking in class (F = 9.03, p < 0.0001), and hours spent studying (F = 10.6, p < 0.0001), reviewing notes (F = 3.19, p = 0.023), and seeking extra help (F = 5.94, p < 0.0001) all impacted student performance on exams. Additionally, one perception changed significantly: students with a higher sense of community performed better (F = 4.14, p = 0.0025). 
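The "X times more (or less) likely" statements and the odds ratios in Table 5 come from log-odds regressions on the survey responses. As a simplified, hypothetical sketch, the snippet below binarizes a single made-up item ("speaks_in_class") so that an ordinary logistic regression can stand in for the full ordinal model; the variable names and data are invented, and the race-by-structure interaction mirrors the comparisons described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 800
df = pd.DataFrame({
    "speaks_in_class": rng.integers(0, 2, n),  # 1 = reports contributing at least "Sometimes"
    "structure": rng.integers(0, 2, n),        # 0 = low, 1 = moderate
    "is_black": rng.integers(0, 2, n),         # binomial indicator, as in the text
    "sat_combined": rng.normal(1100, 120, n),
})

# Log-odds (logistic) regression with an interaction term: does the association
# between structure and in-class participation differ for black students?
fit = smf.logit(
    "speaks_in_class ~ structure + is_black + is_black:structure + sat_combined",
    data=df,
).fit(disp=False)

# Exponentiated coefficients are odds ratios; "2.3 times less likely" corresponds
# to an odds ratio of roughly 1/2.3 on the relevant term.
print(np.exp(fit.params))
print(np.exp(fit.conf_int()))   # 95% CIs on the odds-ratio scale
```

Whether adding such an indicator or interaction "increased the fit of the model to the data," as described above, can then be checked by comparing the model with and without those terms.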
DISCUSSION With large foundation grants working toward improving STEM education, there has been a push for determining the transferability of specific educational innovations to “increase substantially the scale of these improvements within and across the higher education sector” (NSF, 2013). In this study, we provide evidence that one course intervention, increased course structure (Freeman et al., 2011), can be transferred from one university context to another. In addition to replicating the increase in student achievement across all students, we were able to elaborate on the results of prior research on increased course structure by 1) identifying which student populations benefited the most from the increased course structure and 2) beginning to tease out the factors that may lead to these increases. The Increased-Structure Intervention Can Transfer across Different Instructors, Different Student Bodies, and Different Courses (Majors vs. Nonmajors) One of the concerns of any classroom intervention is that the results depend on the instructor teaching the course (i.e., the intervention will work for only one person) and the students in it. We can test the independence of the intervention by replicating it with a different instructor and student body and measuring whether similar impacts on student achievement occur. The university at which this study took place is quite different from the university where the increased course structure intervention was developed (Freeman et al., 2011). Both universities are R1 institutions, but one is in the Southeast (and has a large black and Latin@ population), whereas the original university was in the Pacific Northwest (and has a high Asian population). Yet we find very similar results: in the original implementation of moderate structure in the Pacific Northwest course, the failure rate (defined as a course grade that would not allow a student to continue into the next course in the biology series) dropped from 18.2% to an average of 12.8% (a 29.7% reduction; Freeman et al., 2011). In our implementation of moderate structure, the failure rate dropped by a similar magnitude: from 26.6% to 15.6% (a 41.3% reduction). This result indicates that the impact of this increased-structure intervention may be independent of instructor and that the intervention could work with many different types of students. Some Students Benefit More Than Others from Increased Course Structure We found that transforming a classroom from low to moderate structure increased the exam performance of all students by 3.2%, and black students experienced an additional 3.1% increase (Figure 1A), and first-generation students experienced an additional 2.5% increase relative to continuing-generation students (Figure 1B). These results align with the small body of literature at the college level that indicates classroom interventions differ in the impact they have on student subpopulations (Kim, 2002; Preszler, 2009; Haak et al., 2011). Our study is novel in that we control for both student past academic achievement and disaggregate student racial/ethnic groups beyond the URM/non-URM binary. Our approach provides a more nuanced picture of how course structure impacts students of diverse demographic characteristics (independent of academic ability). One of the most exciting aspects of our results is that we confirm that active-learning interventions influence the achievement of student subpopulations differentially. 
This finding is supported by both work in physics (Beichner et al., 2007), which found an intervention only worked for black and white students, and work in psychology, which revealed Asian-American students do not learn as well when they are told to talk through problems out loud (Kim, 2002). These studies highlight how important it is for us to disaggregate our results by student characteristics whenever possible, as overall positive results can mask actual differential outcomes present in the science classroom. Students come from a range of educational, cultural, and historical backgrounds and face different challenges in the classroom. It is not surprising that in the face of this diversity one intervention type does not fit all students equally. Comparing our results with published studies in STEM focused on historically underrepresented groups, we see that our achievement results are of a similar magnitude to other interventions. Unlike our intervention, previous interventions generally are not implemented within an existing course but are either run as separate initiatives or separate courses or are associated with a series of courses (i.e., involved supplemental instruction [SI]; cf. Maton et al., 2000; Matsui et al., 2003). These SI programs are effective, but can be costly (Barlow and Villarejo, 2004), and because of the cost, they are often not sustainable. Of seven SI programs that report data on achievement and retention in the first term or first two terms of the program, and thus are directly comparable to our study results, failure rate reductions ranged from 36.3 to 77%, and achievement increased by 2.4–5.3% (Table 7). In our study, the failure rate reduction was 41.3%, and overall exam performance increased by 3.2% (6.2% for black students and 6.1% for first-generation students), which is within the range of variation for the short-term results of the SI studies. These short-term results may be an underestimate of the effectiveness of the SI programs, as some studies have shown that their effectiveness increases with time (Born et al., 2002). Yet the comparison still reveals promising results: one instructor in one course, without a large influx of money, can make a difference for students as large in magnitude as some supplemental instruction programs.
Table 7. Changes in achievement and failure rate for SI programs in the first term of their implementation(a)
Study | Classroom | Failure rate, non-SI | Failure rate, SI | % change in failure rate | Achievement, non-SI | Achievement, SI | % change in achievement
Fullilove and Treisman, 1990 | Calculus I | 41% | 7% | 77 | NA | NA | NA
Wischusen and Wischusen, 2007 | Biology I | 18.6% | 6.9% | 62.9 | ∼85% | ∼87% | 2.4
Rath et al., 2007 | Biology I | 27% | 15% | 44.4 | ∼75% | ∼79% | 5.3
Peterfreund et al., 2007 | Biology I | 27% | 15% | 44 | ∼75% | ∼79% | 5.3
Minchella et al., 2002 | Biology I and II | 30.2% | 16.9% | 44 | ∼75% | ∼78% | 4
Barlow and Villarejo, 2004 | General Chemistry | 44% | 28% | 36.3 | ∼80% | ∼83% | 3.8
Dirks and Cunningham, 2006 | Biology I | NA | NA | NA | ∼81% | ∼84% | 3.7
(a) Most achievement data were reported on the 4.0 scale, and the percentage of points earned was approximated using a conversion scale. In comparison, in the current student population, we saw a 41.3% reduction in the failure rate and a 3.2–6.3% increase in achievement, depending on which student subpopulation was the focus.
Exploring How Increased Course Structure Increases Student Performance
Survey data allowed us to explore how student course-related behaviors and attitudes changed with increased course structure.
We focused on three specific factors and found evidence that changes in time allocation contributed to increased performance and some support for changes in classroom culture also impacting learning. We did not find evidence to support the idea that the value students found in the course influenced their performance. Factor 1. Time Allocation. Under low structure, students on average spent only 1–3 h on the course outside of class, rarely came to class having read the assigned readings, and were highly dependent on the lecture for their learning. Students also placed little value on the occasional preparatory homework assignments. With the implementation of moderate structure, students increased the amount of time they spent on the course each week to 4–7 h, were twice as likely to come to class having read the assigned readings, and saw the preparatory assignments as being equally as important for their learning as the actual lecture component. These shifts in behaviors and perceptions support our hypothesis that increased course structure encourages students both to distribute their studying throughout the term and to spend more time on behaviors related to graded assignments. We believe that these changes in student behaviors and perceptions occurred because of the structure of accountability built into the moderate-structure course. Students reading before class is an outcome almost all instructors desire (based on the ubiquitous syllabus reading lists), but it is evident from our study and others that, under low structure, students were on average “rarely” meeting this expectation (see also Burchfield and Sappington, 2000). We found the dual method of assigning preparatory homework and making the reading more approachable with ungraded guided-reading questions increased the frequency of students reading before class. It seemed that course points (accountability) were necessary to invoke this change in student behavior, because we did not see a similar increase in the frequency with which students reviewed notes after class. It is possible that moving to high structure (Freeman et al., 2011), with its weekly graded review assignments, could increase the achievement of our students even more, because they would be held accountable for reviewing their notes more frequently. Factor 2. Classroom Culture. We found some evidence to support the hypothesis that increased course structure creates a community environment rather than a competitive environment. Under low structure, students did not seem to get to know the other students in the class and did not positively view the class as a community (although they did believe that students in the class tried to help one another). With increased structure, students were two times more likely to view the class as a community and 2.4 times more likely to say students in the class knew each other. This result is a critical outcome of our study, arguably as important as increased performance, because a sense of being part of a community (belonging) is crucial for retention (Hurtado and Carter, 1997; Hoffman et al., 2002) and has been correlated with increased performance for first-generation students (Stephens et al., 2012). When discussing reasons for leaving STEM, many students, particularly students of color and women, describe feelings of isolation and lack of belonging (Hewlett et al., 2008; Cheryan et al., 2009; Strayhorn, 2011). 
Because introductory courses are some of the first experiences students have in their major, these could potentially play a role in increasing retention simply by facilitating connections between students through small-group work in class. Factor 3. Course Value. We did not find support for the hypothesis that students in the moderate-structure class found the course to be more valuable than students in the low-structure course. First, there was no difference in how much students valued the skills they learned in the course, but this could be because they did not recognize that the low- and moderate-structure terms were asking them to do different things. Across both terms, students on average believed that they were doing the same amount of memorizing versus higher-order skills such as application and analysis, even though the instructor emphasized higher-order skills more in the moderate-structure terms. In addition, behaviorally, we did not see any evidence of a higher value associated with the course in terms of increased attendance. In fact there was no difference in attendance across treatments. The attendance result was surprising to us, because increased attendance has been shown to be a common result of making a classroom more active (Caldwell, 2007; Freeman et al., 2007); however, these previous interventions all assigned course points to in-class participation, whereas our interventions only gave students bonus points for participation. In a comparison of in-class attendance with and without points assigned to class participation, Freeman et al. (2007) found that attendance dropped in the class in which no points were assigned. Thus, it is possible that attendance in these classes could be increased in the future if points rather than extra credit were assigned for participation. This idea is supported by our data that it is actually the students with the highest predicted achievement (i.e., highest SAT scores) who are more likely to miss lecture. Because these students already were doing well in the course, it may be that the motivation of receiving a few bonus points for attending class was not enough encouragement. Additional evidence that changes in time allocation and classroom culture contribute to achievement comes from the correlation between survey responses and exam performance. Under moderate structure, the number of hours a student spent studying per week and a higher sense of community were both positively correlated with exam performance. The support for these two factors, time allocation and classroom culture, helps us identify potential critical elements for the implementation of the increased-structure intervention. First, students need to be made accountable for preparing before attending class. This can take multiple forms, including guided-reading questions, homework, and/or reading quizzes before class or at the start of class, but the key is that they need to be graded. Without this accountability in the low-structure terms, students were not doing the reading and were likely cramming the week before the exam instead of distributing their study time. The second critical element seems to be encouraging the students in the class to view themselves as a community through small-group work in class. Further research could explore how best to approach in-class work to develop this sense of community rather than competition. 
Changes in Achievement, Behaviors, and Perceptions Vary among Student Populations In addition to looking at overall patterns in student behaviors and perceptions, we can also disaggregate these data to begin to understand why some groups might benefit more from the intervention. From the achievement data, we identified black and first-generation students as populations who responded most strongly to the treatment. Patterns in behaviors and attitudes were apparent for one of these populations (black students) and not the other (first-generation students). The response of black students on our survey questions differed from other students in the class in three ways. First, under both classroom structures, black students were more likely to report that the homework contributed to their learning in the course, and there was a trend for black students more than any other student groups to report that they valued the skills they developed from this class more than other students. Second, black students perceived the class to require more higher-order skills. These results imply that these students had a greater need for the kind of guidance provided by instructor-designed assignments. Thus, the addition of more homework and more explicit practice may have had a disproportionate impact on these students' achievement. Third, black students were significantly less likely than other students to speak up in class, but this disparity disappeared under moderate structure. We suspect that the increased sense of the classroom as a community may have contributed to this increased participation. Although first-generation students did not differ in how they responded to survey questions versus continuing-generation students, they could still differ in how valuable the changes in the course were to them. In particular, the increased sense of community that seemed to correlate with the implementation of moderate structure could have helped them disproportionately, as has been demonstrated in a previous study (Stephens et al., 2012). In addition, although students grouped in the category first generation share some characteristics, they are also very different from one another in terms of culture, background, and the barriers they face in the classroom (Orbe, 2004; Prospero et al., 2012). For example, in our university setting, 55% of first-generation students have parents with low socioeconomic status and 50% transfer in from community colleges. The variation in students could thus obscure any patterns in their responses. Future analyses will attempt to distinguish subpopulations to identify patterns potentially hidden in our analysis. Limitations of This Work One of the major purposes of this article is to recognize that classroom interventions that work in one classroom may not work in others because 1) student populations differ in how they respond to classroom treatments, and 2) instructors do not always implement the critical elements of an active-learning intervention. Thus, it is important for us to note that, although we have shown that increased structure can work with both majors and nonmajors and with students from a range of racial and ethnic groups, we are still working in an R1 setting. More work needs to be done to establish the effectiveness of the increased course structure intervention in community college or comprehensive university settings (although the evidence that it works well for first-generation students is a good sign that it could transfer). 
In addition, this study was with one instructor, thus we can now say increased course structure has worked for two independent instructors (the instructor of the current course and the instructor of the original course; Freeman et al., 2011), but further work is necessary to establish its general transferability. In addition, this study has suggested two factors by which increased course structure seems to be working by 1) encouraging distributed practice with a focus on class preparation and 2) helping students view the class as more of a community. Yet these are only two of many possible hypotheses for how this intervention works. It is possible that assigned preparatory assignments and small-group work to encourage community are not the only elements critical for this intervention's success. Further studies could explore how to best implement activities in class or the impact of adding graded review assignments on achievement. Implications for Instructor and Researcher Best Practices As a result of implementing an increased course structure and examining student achievement and survey results, we identified the following elements critical for student success and the success of future implementations: Students are not a monolithic group. This result is not surprising. Students vary in many ways, but currently we do not know much about the impact of these differences on their experience with and approach to a college-level course. Future studies on student learning should disaggregate the students involved in the study (if possible), so instructors looking to implement an intervention can determine whether, and potentially how well, a particular intervention will work for their population of students. Accountability is essential for changing student behaviors and possibly grades. We found that without accountability, students were not reading or spending many hours each week on the course. With weekly graded preparatory homework, students increased the frequency of both behaviors. We did not provide them credit for reviewing each week, and we found the overall frequency of this behavior decreased (even though our results demonstrate that students who did review notes performed better). Survey questions are a useful method of identifying what behaviors an instructor might target to increase student performance. From our survey results, it seems that creating weekly review assignments might increase the frequency that students review their notes and thus increase their grades. Without the survey, we would not have known which behaviors to target. Overall, this work has contributed to our understanding of who is most impacted by a classroom intervention and how those impacts are achieved. By looking at the achievement of particular populations, we can begin to change our teaching methods to accommodate diverse students and possibly increase the effectiveness of active-learning interventions. Supplementary Material Supplemental Material
              • Record: found
              • Abstract: found
              • Article: not found

              The Classroom Observation Protocol for Undergraduate STEM (COPUS): A New Instrument to Characterize University STEM Classroom Practices

              INTRODUCTION A large and growing body of research indicates that undergraduate students learn more in courses that use active-engagement instructional approaches (Prince, 2004; Knight and Wood, 2005; Michael, 2006; Blanchard et al., 2010). As a result, the importance of teaching science, technology, engineering, and mathematics (STEM) courses more effectively has been stressed in numerous reports, including the President's Council of Advisors on Science and Technology Engage to Excel report (2012), the National Science Foundation/American Association for the Advancement of Science Vision and Change report (AAAS, 2010), and the National Research Council Discipline-Based Education Research report (Singer et al., 2012). Given these compelling, evidence-based recommendations and the recognized need for measures of teaching effectiveness beyond student evaluations (Association of American Universities, 2011), higher education institutions are struggling to determine the extent to which faculty members are teaching in an interactive manner. This lack of information is a major barrier to transforming instruction and evaluating the success of programs that support such change. To collect information about the nature of STEM teaching practices as a means to support institutional change, faculty at both the University of British Columbia (UBC) and the University of Maine (UMaine) created classroom observation programs. The results of such observations were needed to: 1) characterize the general state of STEM classroom teaching at both institutions, 2) provide feedback to instructors who desired information about how they and their students were spending time in class, 3) identify faculty professional development needs, and 4) check the accuracy of the faculty reporting on the Teaching Practices Survey that is now in use at UBC (CWSEI Teaching Practices Survey, 2013). To achieve these goals, the programs needed an observation protocol that could be used by faculty member observers to reliably characterize how students and instructors were spending their time in undergraduate STEM classrooms. A critical requirement of the protocol was that observers who were typical STEM faculty members could achieve those results with only 1 or 2 hours of training, as it is unrealistic to expect they would have more time than that available. In the quest for a suitable observation protocol, multiple existing options were considered, and ultimately rejected. The observation protocols considered were divided into two categories: open-ended or structured. When observers use open-ended protocols, they typically attend class, make notes, and respond to such statements as: “Comment on student involvement and interaction with the instructor” (Millis, 1992). Although responses to these types of questions can provide useful feedback to observers and instructors, the data are observer dependent and cannot easily be standardized or compared across multiple classrooms (e.g., all STEM courses at UBC or UMaine). Alternatively, structured protocols provide a common set of statements or codes to which the observers respond. Often, these protocols ask observers to make judgments about how well the teaching conforms to a specific standard. Examples of such protocols include the Inside the Classroom: Observation and Analytic Protocol (Weiss et al., 2003) and the Reformed Teaching Observation Protocol (RTOP; Sawada et al., 2002). 
These protocols consist of statements that observers typically score on a Likert scale from “not at all” to “to a great extent” and contain such statements as: “The teacher had a solid grasp of the subject matter content inherent in the lesson” (from RTOP; Sawada et al., 2002). The RTOP in particular has been used to observe university STEM instruction. For example, it has been used to evaluate university-level courses at several different institutions to measure the effectiveness of faculty professional development workshops (Ebert-May et al., 2011) and to compare physics instructors in a study examining coteaching as a method to help new faculty develop learner-centered teaching practices (Henderson et al., 2011). The RTOP is also being used to characterize classroom practices in many institutions and in all levels of geoscience classes (Classroom Observation Project, 2011). The RTOP was found to be unsuitable for the UBC and UMaine programs for two main reasons. The first is that the protocol involves many observational judgments that can be awkward to share with the instructor and/or the larger university community. The second is that observers must complete a multiday training program to achieve acceptable interrater reliability (IRR; Sawada et al., 2002). More recently, new observation protocols have been developed that describe instructional practices without any judgment as to whether or not the practices are effective or aligned with specific pedagogic strategies. These observation protocols use a series of codes to characterize instructor and/or student behaviors in the classroom; observers indicate how often each behavior occurs during a class period (Hora et al., 2013; West et al., 2013). One observation protocol in particular, the Teaching Dimensions Observation Protocol (TDOP), was expressly developed to observe postsecondary nonlaboratory courses. For this protocol, observers document classroom behaviors in 2-min intervals throughout the duration of the class session (Hora et al., 2013). The possible classroom behaviors are described in 46 codes in six categories, and observers make a checkmark when any of the behaviors occur. The TDOP instrument avoids the judgment issues associated with the RTOP, but it still requires substantial training, as one might expect for a protocol that was designed to be a complex research instrument. Preliminary work suggests that, after a 3-day training session, observers have acceptable IRR scores when using the TDOP (Hora et al., 2013). Observers at our institutions tried using this instrument, but without the full training, they found it difficult to use the TDOP in a reliable way, due to the complexity of the items being coded and the large number of possible behavior codes. We also found that the particular research questions it was designed to address did not entirely align with our needs. For example, it covers some aspects that are not necessary for faculty observation programs, such as whether an instructor uses instructional artifacts (e.g., a laser pointer or computer; Hora et al., 2013) and fails to capture others that are needed, such as whether an instructor encourages peer discussion along with clicker questions (Mazur, 1997; Smith et al., 2009, 2011). We also wanted to better characterize the student behaviors during the class period than the TDOP easily allowed. Out of necessity, we created a new protocol called the Classroom Observation Protocol for Undergraduate STEM, or COPUS. 
Like the TDOP, this new protocol documents classroom behaviors in 2-min intervals throughout the duration of the class session, does not require observers to make judgments of teaching quality, and produces clear graphical results. However, COPUS is different in that it is limited to 25 codes in only two categories (“What the students are doing” and “What the instructor is doing”) and can be reliably used by university faculty with only 1.5 hours of training (Figure 1 has a description of the codes; the Supplemental Material includes the full protocol and coding sheet). Observers who range from STEM faculty members without a background in science education research to K–12 STEM teachers have reliably used this protocol to document instruction in undergraduate science, math, and engineering classrooms. Taken together, their results show the broad usability of COPUS. Figure 1. Descriptions of the COPUS student and instructor codes. DEVELOPMENT The development of COPUS was an evolutionary process extending across more than 2 years, involving many iterations and extensive testing. It began at UBC, where science education specialists (SESs) who were working with science faculty on improving teaching (Wieman et al., 2010) wanted to characterize what both the students and instructors were doing during class. The SESs began testing various existing protocols, including the TDOP, in different classes at UBC in late 2011 and early 2012. The original TDOP did not meet our needs (as described above), so we iteratively modified the protocol through nine different versions. These changes resulted in a format, procedure, data structure, and coding strategy that was easy to implement on paper or electronically and convenient for analysis and display. The overall format of the observation protocol remained largely stable, but the categories and codes continued to evolve. During the Fall term of 2012, 16 SESs, who are highly trained and experienced classroom observers, used this evolving protocol to observe a variety of courses in singles, pairs, or trios across most of the departments in the UBC Faculty of Science (including the disciplines of biology, computer science, earth sciences, mathematics, physics, and statistics). We analyzed the SES generated observation data to identify coding disagreements and met with the SESs to discuss the evolving protocol and coding. These discussions covered observed behaviors they found difficult to code and/or hard to interpret, and other important elements of instructor or student behavior they felt were not being adequately captured. The protocol evolved through five different versions during this stage of testing and feedback. The final version had substantially simplified categories and all identified problems with the wording on the codes had been eliminated. Notably, it was quite simple to reliably code classes taught with traditional lectures, as a very small number of behaviors need to be coded. Therefore, the majority of the work went into improving the protocol so it could reliably characterize classes that had substantial and varied interactions between instructor and students and multiple student activities. One substantial change during Fall 2012 was eliminating a category for judging the cognitive level of the activities. Observers had been asked to code the level of cognitive sophistication of current classroom activities, based on Bloom's taxonomy of educational objectives (Bloom et al., 1956). 
After multiple unsuccessful attempts to find a simple and reliable coding scheme that could capture this aspect of the classroom activities, we dropped this category. Our decision is supported by recent work showing that, when faculty members write and evaluate higher-order questions, they use several criteria beyond Bloom's level, including question difficulty, the time required to answer the questions, whether students are using a new or well-practiced approach, and whether the questions have multiple reasonable solutions (Lemons and Lemons, 2012).

The second substantial change during this time was changing another category, coding the level of student engagement, from required to optional. Having a measure of student engagement is useful for providing feedback to the instructor and for judging the overall effectiveness of many instructional activities. With the coding of engagement simplified to discriminating only between low (0–20% of the students engaged), medium, and high (≥80% of the students engaged), some observers, particularly those who had some experience with observing levels of student engagement, could easily code engagement along with the other two categories, and there was reasonable consistency between observers. However, less-experienced observers found it quite hard to simultaneously code what the students were doing, what the instructor was doing, and the student engagement level. Also, there were difficulties in obtaining consistent coding of student engagement across all observers; the judgments were often dependent on the levels of engagement common to the specific disciplines and courses with which the observers were familiar. For this reason, the student engagement category was made optional. We recommend that observers not try to code it until after they have become experienced at coding the “What the students are doing” and “What the instructor is doing” categories.

Another recurring theme of the discussions with the SESs was the extent to which classroom observations could accurately capture the quality of instruction or the efficacy of student work. In the end, after the SESs had observed different classes across many disciplines, there was a consensus that accurately evaluating the quality of instruction and the efficacy of student work was generally not possible. These highly trained and experienced observers concluded that such evaluations require a high degree of training of the observer in the material and the pedagogic strategies, as well as familiarity with the student population (prior knowledge, typical classroom behaviors, etc.). We concluded that quality judgments of this type were not realistic goals for limited classroom observations carried out by STEM faculty members. Thus, the present version of COPUS captures the actions of both instructors and students but does not attempt to judge the quality of those actions for enhancing learning.

After the completion of this development work at UBC, the COPUS was further tested by 16 K–12 teachers participating in a teacher professional development program at UMaine. The teachers used the COPUS to observe 16 undergraduate STEM courses in five different departments (biology, engineering, math, chemistry, and physics). While the teachers easily interpreted many of the codes, they found a few to be difficult and suggested additional changes.
For example, the student code “Listening: paying attention/taking notes, etc.” was changed to “Listening to instructor/taking notes, etc.” The code was clarified so observers knew they should select it only when the students were listening to their instructor, not when students were listening to their peers. Also, new codes were added to capture behaviors the teachers thought were missing, such as the instructor code “AnQ: Listening to and answering student questions with entire class listening.”

The coding patterns of the two teacher observers in the same classroom were also compared to determine which specific codes were difficult to use consistently. An example comparing two teachers employing the student code “Ind” is shown in Figure 2. Figure 2A compares how two observers marked this code in the first iteration of testing, when it was described as “Ind: Individual thinking/problem solving in response to assigned task.” Observer 2 marked this code throughout most of the class, whereas observer 1 marked it only intermittently. Follow-up conversations with observer 2 and other teachers indicated that some observers were marking this code throughout the duration of the class, because they assumed individual students were thinking while they were taking notes, working on questions, and so on, but other observers were not. Therefore, we clarified the code to be: “Ind: Individual thinking/problem solving. Only mark when an instructor explicitly asks students to think about a clicker question or another question/problem on their own.” Figure 2B shows the same observer pair using the revised “Ind” code; their coding is now closely aligned.

Figure 2. A comparison of how two observers coded the student code “Ind.” (A) When the code was described as “Ind: Individual thinking/problem solving in response to assigned task,” observer 2 marked this code more often than observer 1 did. (B) Coding after the description of the code was revised.

In addition, the teacher observation data revealed a more general problem: there was a lower degree of consistency in coding student behaviors than in coding instructor behaviors, and the teachers used a very limited set of codes for the student behaviors. The earlier coding by the SESs had shown similar, but less dramatic, trends. We realized that this problem was due to a natural tendency of observers to focus on the instructor, combined with the fact that the instructor-related codes came first on the survey form. Therefore, the protocol was reordered so that the student codes appear first, and we emphasized coding student behaviors during subsequent training sessions (see further details below in the Training section). As shown below, these changes appear to have fixed this problem.

These further revisions culminated in a final version of the COPUS. This version was tested by having the same 16 K–12 teachers use it to observe 23 UMaine STEM classes, and by having seven STEM faculty observers use it, in pairs, to observe eight UBC classrooms after 1.5 hours of training. Information about the types of classes observed is given in Table 1. The seven UBC STEM faculty volunteers who used the final protocol had not previously used the protocol and were not involved in the development process. Thus, the IRR of the protocol has been tested with a sample of observers with a wide range of backgrounds and perspectives. As discussed in Validity and Reliability below, the IRR was high.
Table 1. Information on the courses observed using the final version of the COPUS

Institution | Number of classes observed | Number of different STEM departments | Percentage of courses at the introductory level(a) | Percentage of classes with >100 students
UBC | 8 | 4(b) | 100 | 63
UMaine | 23 | 7(c) | 96 | 35

(a) STEM courses at the first- and second-year levels. (b) Biology, chemistry, math, and physics. (c) Biology, molecular biology, engineering, chemistry, math, physics, and geology.

TRAINING

A critical design feature of the COPUS is that college and university faculty who have little or no observation protocol experience and minimal time for training can use it reliably. We summarize the training steps in the following paragraphs, and we have also included a step-by-step facilitator guide in the Supplemental Material.

The first step in the training process is to have the observers become familiar with the codes. At UBC, facilitators displayed the student and instructor codes (Figure 1) and discussed with the observers what each behavior typically looks like in the classroom. At UMaine, the teacher observers played charades. Each teacher randomly selected a code description from a hat and silently acted out the behavior. The remaining observers had the code descriptions in front of them and guessed the code. The remainder of the training was the same for both groups, with a total training duration of 2 hours for the K–12 teachers and 1.5 hours for the UBC faculty members.

Second, observers were given paper versions of the coding sheet and practiced coding a 2-min segment of a classroom video. An excerpt from the coding sheet is shown in Figure 3, and the complete coding sheet is included in the Supplemental Material. Observers often mark more than one code within a single 2-min interval. The first video we used showed an instructor making administrative announcements and lecturing while the class listened. After 2 min, the video was paused, and the group discussed which codes they selected. Because faculty at other institutions may have difficulty capturing videos for training, we have included web URLs to various video resources that can be used for training (Table 2).

Figure 3. An excerpt of the COPUS coding form. Observers place a single checkmark in the box if a behavior occurs during a 2-min segment. Multiple codes can be marked in the same 2-min block.

Table 2. Video resources that may be helpful for COPUS training

Description of video | URL
Demonstration, clicker questions, and lecture | http://harvardmagazine.com/2012/02/interactive-teaching
Group activities and lecture | http://podcasting.gcsu.edu/4DCGI/Podcasting/UGA/Episodes/12746/614158822.mov
Clicker questions and lecture | http://podcasting.gcsu.edu/4DCGI/Podcasting/UGA/Episodes/22253/27757327.mov
Clicker, real-time writing, and lecture | http://ocw.mit.edu/courses/chemistry/5-111-principles-of-chemical-science-fall-2008/video-lectures/lecture-19
Real-time writing, asking/answering questions, and lecture | http://ocw.mit.edu/courses/biology/7-012-introduction-to-biology-fall-2004/video-lectures/lecture-6-genetics-1
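The structure of the coding form is simple enough to mirror directly in software for observers who prefer to record their checkmarks electronically. The sketch below shows one possible in-memory representation of a few filled-in rows; the Interval class, its field names, and the example excerpt are hypothetical illustrations, not part of the published COPUS materials.

```python
from dataclasses import dataclass, field

@dataclass
class Interval:
    """One 2-min row of the coding sheet; several codes may be checked per row."""
    start_min: int
    student_codes: set = field(default_factory=set)
    instructor_codes: set = field(default_factory=set)

# A hypothetical 6-min excerpt: lecture, then a clicker question with peer discussion.
observation = [
    Interval(0, {"L"}, {"Lec", "RtW"}),
    Interval(2, {"Ind", "CG"}, {"CQ"}),
    Interval(4, {"L", "SQ"}, {"FUp", "AnQ"}),
]

# Fraction of intervals in which a given code was checked
listening = sum("L" in row.student_codes for row in observation) / len(observation)
print(f"'L: Listening' was checked in {listening:.0%} of the intervals")  # -> 67%
```

A representation like this keeps the paper form's logic intact (one row per 2-min interval, multiple checkmarks allowed) while making later comparisons between observer pairs straightforward.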
The observers were then asked to form pairs and code 8 min of a video from a large-enrollment, lecture-style science class at UMaine that primarily shows an instructor lecturing and students listening, with a few questions asked by both the instructor and students. To keep the observers synchronized and ensure they were filling out a new row in the observation protocol at identical 2-min intervals, they used either cell phones set to count time up or a sand timer. At the end of 8 min, the observers compared their codes with their partners. Next, as a large group, observers took turns stating what they coded for the students and the instructor every 2 min for the 8-min video clip. At this point, the observers talked about the relationship between a subset of the student and instructor codes. For example, if the observers check the student code “CG: Discuss clicker question,” they will also likely check the instructor code “CQ: Asking a clicker question.”

To provide the observers with practice coding a segment that has more complicated student and instructor codes, they next coded a different classroom video segment from the same large-enrollment, lecture-style science class at UMaine, but this time the camera was focused on the students. This video segment included students asking the instructor questions, students answering questions from the instructor, and clicker questions with both individual thought and peer discussion. The observers coded 2 min and then paused to discuss the codes. Then observers in pairs coded for an additional 6 min, again taking care to use synchronized 2-min increments. The observer pairs first compared their codes with each other, and then the whole group discussed the student and instructor codes for each of the 2-min segments of the 6-min clip. At this point, the training was complete.

VALIDITY AND RELIABILITY

COPUS is intended to describe the instructor and student actions in the classroom, but it is not intended to be linked to any external criteria. Hence, the primary criterion for validity is that experts and observers with the intended background (STEM faculty and teachers) see it as describing the full range of normal classroom activities of students and instructors. That validity was established during the development process by the feedback from the SESs, the K–12 teachers, and those authors (M.S., F.J., C.W.) who have extensive experience with STEM instruction and classroom observations.

A major concern has been to ensure that there is a high level of IRR when COPUS is used after the brief period of training described above. To assess the IRR, we examined the agreement between pairs of observers as they used the final version of COPUS in STEM classes at both UBC and UMaine. The two observers sat next to each other in the classroom so they could keep identical 2-min time increments, but they were instructed not to compare codes with each other. To summarize how similarly observer pairs used each code on the final version of the COPUS, we calculated Jaccard similarity scores (Jaccard, 1901) for each code and then averaged the scores for both the UBC and UMaine observers (Table 3). For single codes, we calculated Jaccard similarity scores instead of Cohen's kappa IRR values, because observer pairs occasionally marked the same code for every 2-min increment throughout the duration of the class. For example, in a class that is lecture-based, observers would likely mark the student code “L: Listening” for the entire time. In such a case, the observers' coding of that code is a constant rather than a variable, which interferes with the IRR calculation.
Table 3. Average Jaccard similarity scores for COPUS codes across all pairs observing in all courses, for both UBC faculty observers and Maine K–12 teacher observers; scores closer to 1 indicate greater similarity between the two observers

Student code | UBC | UMaine | Instructor code | UBC | UMaine
L: Listening | 0.95 | 0.96 | Lec: Lecturing | 0.91 | 0.92
Ind: Individual thinking/problem solving | 0.97 | 0.91 | RtW: Real-time writing | 0.93 | 0.93
CG: Discuss clicker question | 0.98 | 0.97 | FUp: Follow-up on clicker questions or activity | 0.92 | 0.85
WG: Working in groups on worksheet activity | 0.98 | 0.99 | PQ: Posing nonclicker questions | 0.86 | 0.80
OG: Other group activity | Not used | 0.97 | CQ: Asking a clicker question | 0.93 | 0.97
AnQ: Students answer question posed by instructor | 0.91 | 0.84 | AnQ: Answering student questions | 0.94 | 0.89
SQ: Student asks question | 0.96 | 0.93 | MG: Moving through the class | 0.96 | 0.97
WC: Engaged in whole-class discussion | 0.96 | 0.98 | 1o1: One-on-one discussions with students | 0.94 | 0.96
Prd: Making a prediction about the outcome of demo or experiment | Not used | 1.00 | D/V: Conducting a demo, experiment, etc. | 0.97 | 0.98
SP: Presentation by students(a) | Not used | Not used | Adm: Administration | 0.94 | 0.97
TQ: Test or quiz(a) | Not used | Not used | W: Waiting | 0.95 | 0.98
W: Waiting | 0.99 | 0.98 | O: Other | 0.97 | 1.00
O: Other | 0.94 | 0.99 | | |

(a) “SP: Presentation by students” and “TQ: Test/quiz” were not selected in any of the observations at UBC or UMaine. This result likely occurred because, when we asked UBC and UMaine faculty members if we could observe their classes, we also asked them if there was anything unusual going on in their classes that day. We avoided classes with student presentations and tests/quizzes, because these situations would limit the diversity of codes that could be selected by the observers.

The equation for the Jaccard coefficient is T = n_c/(n_a + n_b − n_c), where n_c is the number of 2-min increments that are marked the same (either checked or not checked) by both observers, n_a is n_c plus the number of 2-min increments that observer 1 marked but observer 2 did not, and n_b is n_c plus the number of 2-min increments that observer 2 marked but observer 1 did not. For example, for the data in Figure 2B, the class period is 42 min in length, so there are 21 possible 2-min segments. The student code “Ind: Individual thinking” was marked by both observers in 12 segments, marked by neither observer in eight segments, and marked by observer 2 in one segment that observer 1 did not mark. Therefore, n_c = 20, n_a = 20, and n_b = 21, and the calculation is 20/(20 + 21 − 20) = 20/21 ≈ 0.95. Scores closer to 1 indicate greater consistency between how the two observers coded the class.

Eighty-nine percent of the similarity scores are greater than 0.90, and the lowest is 0.80. These values indicate strong similarity between how two observers use each code. The lowest score for both the UBC and UMaine observers was for the instructor code “PQ: Posing nonclicker questions.” Comments from observers suggest that, when instructors were following up on or giving feedback about clicker questions or activities, they often posed questions to the students. Observers checked the instructor code “FUp: Follow-up” to describe this behavior but stated they occasionally forgot to also select the instructor code “PQ.”
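For readers who want to reproduce these scores, the Jaccard calculation defined above takes only a few lines of code. In the sketch below, each observer's checkmarks for a single code are represented as a list of booleans, one per 2-min interval; the function name and data layout are illustrative rather than part of the published protocol, and the example simply re-creates the Figure 2B numbers.

```python
def jaccard_similarity(obs1, obs2):
    """Jaccard-style similarity for one COPUS code, following the definition above.

    obs1, obs2: lists of booleans, one entry per 2-min interval, True if the
    observer checked the code in that interval.
    """
    n_c = sum(a == b for a, b in zip(obs1, obs2))             # intervals coded the same (checked or unchecked)
    n_a = n_c + sum(a and not b for a, b in zip(obs1, obs2))  # n_c plus intervals checked only by observer 1
    n_b = n_c + sum(b and not a for a, b in zip(obs1, obs2))  # n_c plus intervals checked only by observer 2
    return n_c / (n_a + n_b - n_c)

# Figure 2B example: 21 intervals; both observers checked "Ind" in 12 intervals,
# neither checked it in 8, and observer 2 alone checked it in 1.
obs1 = [True] * 12 + [False] * 9
obs2 = [True] * 12 + [False] * 8 + [True]
print(round(jaccard_similarity(obs1, obs2), 2))  # -> 0.95
```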
To compare observer reliability across all 25 codes in the COPUS protocol, we calculated Cohen's kappa IRR scores using SPSS (IBM, Armonk, NY). To compute the kappa values for each observer pair, we added up the total number of times: 1) both observers put a check in the same box, 2) neither observer put a check in a given box, 3) observer 1 put a check in a box when observer 2 did not, and 4) observer 2 put a check in a box when observer 1 did not. For example, at UBC, when looking at all 25 codes in the COPUS, one observer pair had the following results: 1) both observers put a check in 83 of the same boxes, 2) neither observer put a check in 524 of the boxes, 3) observer 1 marked six boxes when observer 2 did not, and 4) observer 2 marked 12 boxes that observer 1 did not. Using data such as these, we computed the kappa score for each of the eight UBC and 23 UMaine pairs and report the average scores in Table 4. We also repeated this calculation using either the subset of 13 student codes or the subset of 12 instructor codes (Table 4).

Table 4. Average IRR kappa scores from the observations at UBC and UMaine

Observers | All codes (± SE) | Student codes (± SE) | Instructor codes (± SE)
Faculty observing UBC courses | 0.83 (0.03) | 0.87 (0.04) | 0.79 (0.04)
Teachers observing UMaine courses | 0.84 (0.03) | 0.87 (0.04) | 0.82 (0.04)

The average kappa scores ranged from 0.79 to 0.87 (Table 4). These are considered to be very high values for kappa and thus indicate good IRR (Landis and Koch, 1977). Notably, the kappa values, as well as the Jaccard similarity scores, are comparably high for both UBC faculty and UMaine K–12 teacher observers, indicating that COPUS is reliable when used by observers with a range of backgrounds and 2 hours or fewer of training.

ANALYZING COPUS DATA

To determine the prevalence of different codes in various classrooms, we added up how often each code was marked by both observers and then divided by the total number of code marks on which both observers agreed. For example, if both observers marked “Instructor: Lecture” at the same 13 time intervals in a 50-min class period and agreed on 25 instructor code marks in total for the duration of the class, then the lecture code accounts for 13/25, or 52%, of the instructor codes. We visualized the prevalence of the student and instructor codes using pie charts. Figure 4 shows observation results from two illustrative classes: one that is primarily lecture-based and one in which a combination of active-learning strategies is used. The latter class is clearly differentiated from the lecture-based class. This example illustrates how, at a glance, this visual representation of the COPUS results provides a highly informative characterization of the student and instructor activities in a class.

Figure 4. A comparison of COPUS results from two courses that have different instructional approaches.
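Although the kappa scores reported here were computed in SPSS, the kappa and prevalence calculations described above are easy to reproduce with a short script. The sketch below follows the counting scheme given in the text; the function names are illustrative only, and the kappa value printed for the example pair (roughly 0.89) is simply what the reported counts imply rather than a number stated in the text.

```python
def cohens_kappa(both_checked, neither_checked, only_obs1, only_obs2):
    """Cohen's kappa from the four agreement counts described above."""
    n = both_checked + neither_checked + only_obs1 + only_obs2
    p_observed = (both_checked + neither_checked) / n
    # Chance agreement, based on each observer's overall rate of checking boxes
    p1 = (both_checked + only_obs1) / n
    p2 = (both_checked + only_obs2) / n
    p_expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_observed - p_expected) / (1 - p_expected)

def code_prevalence(agreed_marks_for_code, total_agreed_marks):
    """Fraction of all agreed-upon marks accounted for by a single code."""
    return agreed_marks_for_code / total_agreed_marks

# The UBC pair reported above: 83 boxes checked by both, 524 by neither,
# 6 by observer 1 only, 12 by observer 2 only (25 codes x 25 intervals = 625 boxes).
print(round(cohens_kappa(83, 524, 6, 12), 2))  # -> 0.89 for this example pair
# The "Lec" example: agreed on in 13 of 25 agreed instructor code marks.
print(code_prevalence(13, 25))                 # -> 0.52
```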
At a department- or institution-wide level, there are several ways to categorize the range of instructional styles. One of the simplest is to look at the prevalence of the student code “L: Listening to instructor/taking notes, etc.” across all courses observed, because this student code is the most indicative of passive student behavior in response to faculty lecturing (“Lec”) with or without real-time writing (“RtW”). Figure 5 shows that, in most of the observed classes at both institutions, the “L” code was marked 26–75% of the time. However, at UMaine, some of the classes have greater than 76% of the student codes devoted to listening. Faculty who teach these classes may benefit from professional development activities about how to design an effective active-learning classroom.

Figure 5. Prevalence of the student code “L: Listening” across several UBC and UMaine classes.

In addition, the data can be analyzed for the subset of faculty members who are using active-learning strategies, such as asking clicker questions. Thirty-eight percent of the UBC and 43% of the UMaine classes that were observed used clickers. However, the student code prevalence in these classes shows that not all faculty members used clicker questions accompanied by recommended strategies, such as peer discussion (Mazur, 1997; Smith et al., 2009, 2011; Figure 6). Faculty members who are not allowing time for peer discussion may benefit from professional development on how to integrate peer discussion into clicker questions.

Figure 6. Prevalence of student codes in four example courses that use clickers. In courses that use clickers with no or minimal peer discussion, the students are passively listening the majority of the time.

DISCUSSION AND SUMMARY

COPUS was developed because university observation programs needed a protocol to: 1) characterize the general state of teaching, 2) provide feedback to instructors who desired information about how they and their students were spending class time, and 3) identify faculty professional development needs. COPUS meets all of these goals by allowing observers with little observation protocol training and experience to reliably characterize what both faculty and students are doing in a classroom.

There are several uses for COPUS data. On an individual level, faculty members can receive pie charts with their code prevalence results (examples in Figure 4). These results provide a nonthreatening way to help faculty members evaluate how they are spending their time. We discovered that faculty members often did not have a good sense of how much time they spent on different activities during class, and that they found the COPUS data helpful. In addition, faculty members can use COPUS data in their tenure and promotion documents to supplement their normal documentation, which typically includes student evaluation information and a written description of classroom practices. Having observation data gives faculty members substantially more information to report about their use of active-learning strategies than is usually the case.

COPUS data can also be used to develop targeted professional development. For example, anonymized, aggregate COPUS data across all departments have been shared with the UMaine Center for Excellence in Teaching and Assessment, so workshops and extended mentoring opportunities can better target the needs of the faculty. One area in particular that will be addressed in an upcoming professional development workshop is using clickers in a way that promotes peer discussion. The idea for this workshop came about as a result of COPUS evidence showing the prevalence of UMaine STEM classes that were using clickers but allowing no or minimal time for recommended student peer discussions (Figure 6).

Other planned uses for COPUS include carrying out systematic observations of all instructors in a department at UBC in order to characterize teaching practices. This information will be used with other measures to characterize the current usage of research-based instructional practices across the department's courses and curriculum. In the end, the choice of observation protocol and strategy will depend on the needs of each unique situation.
COPUS is easy to learn, characterizes nonjudgmentally what instructors and students are doing during a class, and provides data that can be useful for a wide range of applications, from improving an individual's teaching or a course to comparing practices longitudinally or across courses, departments, and institutions.
Author and article information

Erin L. Dolan, Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602. Erin L. Dolan is the recipient of the 2018 Bruce Alberts Award for Excellence in Science Education. *Address correspondence to: Erin L. Dolan (eldolan@uga.edu).

Journal: Molecular Biology of the Cell, The American Society for Cell Biology; ISSN 1059-1524 (print), 1939-4586 (online). Published 01 November 2018; volume 29, issue 22, pages 2611–2613. Article E18-06-0410; DOI: 10.1091/mbc.E18-06-0410.

© 2018 Dolan. “ASCB®,” “The American Society for Cell Biology®,” and “Molecular Biology of the Cell®” are registered trademarks of The American Society for Cell Biology. This article is distributed by The American Society for Cell Biology under license from the author(s). Two months after publication it is available to the public under an Attribution–Noncommercial–Share Alike 3.0 Unported Creative Commons License.

History: 15 August 2018; 17 August 2018; 22 August 2018.

Category: ASCB Award Essay.