Using systematic reviews and evidence-based software engineering with masters students

Context: The problem of teaching research skills to masters students, in particular improving their literature reviews, assessing them and providing good feedback. Objectives: To introduce systematic reviews and evidence-based software engineering (EBSE) guidance into our teaching, provide an experience report and empirical data, and investigate the results. Methods: A systematic review requirement was introduced into the students’ assessed work. The format of the assessment brief (also provided in this paper) was influenced by previous research on EBSE work with students. Qualitative and quantitative data were generated, and statistical analysis investigated the students’ performance across the different elements of the systematic review. Results: Most students could do a systematic review and more useful feedback could be given. The assessment brief deviated from the normal EBSE guidelines in order to address previous difficulties. This modification was successful. Differences were found in student marks for different elements of the systematic reviews, with a large effect size for differing scores between ‘search’ and ‘criteria’, and between ‘search’ and ‘evaluation’. Conclusions: Introducing systematic reviews and EBSE guidance can improve students’ literature handling skills and support improved feedback. The EBSE guidance should be modified for students and novice researchers to incorporate the process of developing a well-defined research question. Further work should investigate the differing performance across different elements of the systematic review.


INTRODUCTION
All masters students in our School of Computing take the same 2-semester research methods module. Some of these students are studying software engineering modules on MSc programmes such as MSc Software Engineering and MSc Applied Computing, but others are on MA programmes such as MA Computer Games Art and MA Creative Digital Media. Teaching appropriate research methods to such a varied cohort is challenging in itself! In this paper we concentrate on the assessment of the students' learning, and in particular on the problem of giving good feedback on the students' skills in searching, evaluating and using the research literature. This problem was addressed by the introduction of a systematic review into the students' assessed work. Systematic reviews are part of the evidence-based software engineering (EBSE) approach. The paper is thus an experience report, concerning the introduction of systematic reviews into the teaching and learning of research methods. It offers empirical evidence, practical experience and personal opinion. The paper can help academics and practitioners by:
• Providing a case study, with both qualitative and quantitative empirical data, of systematic reviews and EBSE in practice, across a wide range of computing disciplines.
• Improving the teaching and learning of research methods and EBSE, which should consequently promote the value of evidence-based computing practice.
• Furthering the development of the EBSE approach, particularly the systematic review element, through the modification of guidelines and addition of experience-based heuristics.
There have been few publications on the use of EBSE in teaching and learning. Hence the paper makes a contribution to both the software engineering and the pedagogical literature. Since many of the students were on routes other than software engineering, the paper also makes a contribution to the wider computing literature.

CONTEXT
Pedagogical researchers have shown that one of the most important requirements for effective learning is good feedback. Such feedback should not just tell students what they have done well or poorly, but also "feed forward", that is, provide advice for how the students can improve in their next piece of work (Lunsford, 1997; Nicol & Macfarlane-Dick, 2004; Petty, 2006). Good assessment feedback should also inform teachers about where they need to improve their teaching (Hattie & Timperley, 2007). In the UK an annual National Student Survey is undertaken across all final year undergraduates asking about their views on their university and degree course (HEFCE, 2008). At our university and many others, the students rate the feedback they receive more poorly than other aspects of their university education. Increasing attention is therefore being paid to the problem of improving feedback from staff to students. This paper discusses an attempt to improve feedback by introducing systematic reviews into the students' assessed work. A systematic review is part of the evidence-based practice paradigm, which has become very important in medicine and health sciences, is spreading to other disciplines, but is so far not well-established in computing. Evidence-based software engineering (EBSE) gathers and appraises existing evidence on a technology via a five step methodology (Dybå, Kitchenham, & Jorgensen, 2005):
1. Convert problem or information need into an answerable question.
2. Search the literature for the best available evidence to answer the question.
3. Critically appraise the evidence.
4. Integrate the appraised evidence with practical experience and the customer's values and circumstances to make decisions about practice.
5. Evaluate performance in 1-4 and seek ways to improve it.
Steps 1-3 and their evaluation (i.e. Step 5) are achieved via a systematic review. A systematic review searches the literature and appraises the evidence in a systematic and transparent way. In a traditional literature review the search strategy and evaluation criteria for the results found are normally hidden from the reader, meaning that the review could well have been conducted in an unstructured, ad hoc way and evidence that does not support the researcher's preferred hypothesis could have been simply ignored. However, in a systematic literature review the search strategy and the evaluation criteria are made explicit, and all relevant evidence is included in the appraisal (Kitchenham, 2004). A recent study (Kitchenham, 2007) found only 20 systematic reviews had been published in the software engineering literature, and that is the only computing discipline that has really begun to explore the evidence-based practice paradigm.
There has also been little published work on using the EBSE approach and systematic reviews with students, meaning that this paper makes an important additional contribution. Jorgensen et al. (2005) discuss the structure, content and assessment of their undergraduate modules on EBSE but offer no empirical data. This paper responds to their call for more university employees to report their experience and measurements of the effects, and move towards sharing course material and teaching experiences. Janzen and Ryoo (2008) describe teaching EBSE to students by asking them to summarise individual publications which present empirical research on a topic and upload the summaries to a web-based database. However, the students were not asked to evaluate critically the individual studies, nor to synthesise the findings from a set of empirical research projects on the same topic. Rainer et al. (2006) report on the experiences of undergraduate students using EBSE for their assessed work on a module titled 'Empirical evaluation in software engineering'. The students were required to evaluate a technology of their choice using EBSE, and empirical data is provided via student comments and an analysis of the students' use of the EBSE guidelines. Some of the findings of Rainer et al.'s study influenced the design of the student assessment tasks reported in this paper:
• Students had problems composing well-formulated EBSE questions.
• Students used a limited number of search terms.
• Students' main search engine was Google, but AskJeeves and MSN were also used.
• Students made little use of material in scientific journals, but many of them used technology/practitioner websites.

THE PROBLEM
In our two semester compulsory masters module in research methods the set text is Oates (2006). The students are introduced to design and creation as a research strategy, other empirical research strategies and data generation methods, data analysis techniques, legal aspects of research and research ethics, and the philosophical paradigms of research (the scientific method, interpretive research and critical research). They learn to read critically and assess the evidence in published articles that present empirical data on computing-related topics.
They also learn about the difference between academic research literature and non peer-reviewed literature, the purpose of a literature review, how to conduct a literature review and search online databases, referencing techniques and avoiding plagiarism.
For several years the assessed course work required students to propose a research project on a topic of their own choosing that would require 12-36 months' work. They had to provide the rationale for the need for this research and its anticipated contribution to knowledge by placing it in the context of previous research, outline their intended research methodology and reflect on the research project's underlying philosophy.
Reflection on the students' performance led to the realisation that the rationale part of the students' work was often poor. Feedback comments included: "Unconvincing rationale." "Few/no references used." However, these comments are vague and unhelpful, both to the teacher seeking to improve her teaching, and to students wanting to know how to improve their work next time. The students' difficulty appeared to involve using the literature successfully, but only the outcomes of their review could be read, while the process, and so where they needed help, was hidden. It was not possible to say whether the work could be improved by better selection of keywords, better selection of databases, improved construction of search commands, removal of bias (e.g. not concentrating on websites that were readily available), better critical analysis of search results, or improved synthesis of the search results. In short, more needed to be known about the students' review process, and a possible solution to the problem was offered by the introduction of systematic literature reviews, drawn from EBSE. The research questions were:
1. Can students do systematic reviews?
2. Does the introduction of systematic reviews enable better feedback?
3. What features should be incorporated to avoid the problems observed by Rainer et al. (2006)?
4. What are students' views on systematic reviews?
5. Are there significant differences in their marks for different elements of the systematic reviews?

METHOD
In the 2007-8 academic year students were given a new one hour lecture on the evidence-based practice paradigm, its increasing importance in software engineering and the need for more recognition and adoption in other computing disciplines. The lecture also explained the main stages of a systematic review, and suggested further reading (Dybå et al., 2005; Kitchenham, 2004; Runeson et al., 2006; Torchiano & Morisio, 2004).
The in-course assessed work was changed. Students were now required to conduct a systematic literature review on a topic of their choice before they proposed a research project as before. The student brief concerning the required systematic review is in the Appendix. In order to gain further insight into their experiences and learning, and gain qualitative empirical data, they were also asked to write a short reflective essay (750 words maximum) discussing what they had learnt about doing research by undertaking the assignment. Note that students were offered the option to work either individually or in pairs: Kitchenham (2004) recommends that systematic reviews are undertaken in pairs, to remove potential individual bias and force explicit attention to the evaluation criteria. Of the 43 submissions, 9 were jointly authored. In the statistical analysis in the results section, each jointly authored submission has been treated in the same way as a single-author submission.
To avoid the problems reported by Rainer et al. (2006), the following features were included in the briefing. Each reported problem is listed, followed by the feature intended to address it:
• Students had problems composing well-formulated EBSE questions.
Students were advised that their initial question should be, "What has been published already on Topic X?" Depending on the number of results returned they could then refine their research question and search terms to narrow or widen their search.
• Students used a limited number of search terms.
Students were advised to keep on refining their search process until they had a final set of results comprising 10-30 articles.
• Students made little use of material in scientific journals, but many of them used technology/practitioner websites.
Students were instructed that the literature search must use the online databases and e-journals provided by the university library (Learning Resource Centre).
• Students' main search engine was Google, but AskJeeves and MSN were also used.
Students were told they could not use Google or other search engines which searched all the web, and they could use Google Scholar only in exceptional circumstances.

RESULTS
There were 43 submissions for assessment. In this section personal observations and empirical data (qualitative and quantitative) are presented and discussed.

Research Question 1: Can students do systematic reviews?
The majority of students, across all masters programmes, could do a systematic review. Only eight submissions failed to achieve the pass mark (50%) for the systematic review. Some reviews were of a high, publishable standard. Software engineering-related research questions included:
• What technologies and design strategies have been used and evaluated for creating a web-based geographic information system?
• What evidence is there to support that teaching functional programming to first year students will make them better computer programmers?
• What evaluation has been done of the network technologies to be used in Smart Home environments?
Research questions explored by students on other masters programmes included:
• In comparison to 'keyframing' techniques, is 'motion capture' a more efficient method, in terms of time and money, to create realistic character animations for 3D real time games?
• Does the literature show that users are aware of how their personal information can be used on the social networking site Facebook, i.e. by the site (advertising) and other users?
• Can computer games benefit learning?
• Is fur rendering using volumetric techniques justified in practice?

Research Question 2: Does the introduction of systematic reviews enable better feedback?
The systematic reviews gave better insight into the students' literature searching, so that better feedback could be given. Examples of feedback comments given to students show that it was now possible to home in on which aspects of their literature use should be improved: "Instead of simply summarising the content of each article, try to evaluate them in order to decide whether they are good pieces of research and helpful to you and your research question." "Try to summarise & critically evaluate each article studied as well as synthesising the full collection." "Be sure you have answered your own question. I couldn't see how your analysis of the published research answered the question re evidence for making better computer programmers."
In assessing the work and giving feedback it became clear that the cycle described in the brief, and the associated assessment criteria (see Appendix), were each missing one important element. Students were asked to provide criteria for evaluating their final set of articles, but they were not asked to explain how they had previously reduced their search results down to 10-30 except by refining their research question and search terms. In practice many did describe such criteria: they scanned titles and abstracts for perceived relevance, as well as using pragmatic criteria such as, "Is an electronic version of the full text readily available?" Hence, criteria for inclusion/exclusion in the list of potential articles for evaluation should also have been included in the brief, as indeed Kitchenham (2004) recommends. Any subsequent work which builds on the assessment and feedback process described here should rectify this oversight.

Research Question 3: What features should be incorporated to avoid the problems observed by Rainer et al. (2006)?
The features introduced were explained in the previous section. The following observations were made. The EBSE guidance Step 1 is "start with a well-defined research question". Rainer et al. (2006) report that students find this difficult to do. Using the initial question "What has been published already on Topic X?" gave students a starting point. In most cases the number of results returned was initially too high. Setting a target number of results for analysis and requiring documentation of the process made students persevere with their search, usually by narrowing down to a more focussed research question and more sophisticated search strategies. Starting with "What has been published already on Topic X?" also makes the process of formulating a well-defined research question visible within the systematic review. It is therefore recommended that students and other novices who do not have a good research question already should start with the question, "What has been published already on Topic X?"
EBSE guidelines state that the review protocol should be defined in advance (Kitchenham, 2004), but in practice this is also often very difficult to achieve. The search strategy is likely to be adjusted as the results are inspected and the research question evolves, so that the process of adjustment and refinement should be included in the reporting of the systematic review. Experienced researchers may be able to follow a top-down, linear approach: planning, conducting, and then reporting the review (Kitchenham, 2004). However, novices need iterative and incremental cycles of planning and conducting the review (cf. the traditional waterfall model of software development and prototyping approaches). The EPPI-Centre (http://eppi.ioe.ac.uk/cms/) (Evidence for Policy and Practice Information and Co-ordinating Centre) is a well-recognised leader undertaking systematic reviews and developing review methods in social science and public policy. Its guidance on systematic reviews notes that "In some reviews, the question and method is not so pre-specified, so allowing for a more iterative method of review. These reviews tend to have broader questions and take a more investigative approach to examining the evidence rather than pre specifying every aspect of the review." This recognition should also be made within the EBSE guidance, with new initial steps added to the guidance for students and other research novices:
1. Define a topic of interest.
2. Define a search strategy to answer the question, "What has been published already on topic X?"
3. REPEAT: Refine the question and/or search strategy UNTIL you have found all articles relevant to your interest or a manageable number of articles within the resources available.
Since formulating a well-defined research question is so crucial to the success of a systematic review, it would help the education of inexperienced systematic reviewers if experienced systematic reviewers also described how they arrived at a good research question and appropriate search protocol.
Students were set an advisory target for their final set of results of 10-30 articles. In practice, they found that to keep within the overall 2500 word limit for the systematic review, 10-12 articles was the maximum number they could assess and synthesise. If a teacher requires a wider, more comprehensive review, a higher word limit would be needed. All students used the library resources and the academic literature rather than websites, although for those on MA programmes it was observed that their subject databases (e.g. Design and Applied Arts Index) did contain some articles that were not peer reviewed and were aimed more at practitioners than academic researchers. Google Scholar was occasionally tried, with mixed success, where the University library was not able to provide access to a particular paper (e.g. because the library did not subscribe to a particular journal and obtaining the article from the British Library would take too long).
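The REPEAT ... UNTIL cycle recommended for novices can be sketched in code. The following is only an illustrative sketch, not part of the student brief: the function name `refine_until_manageable`, the `search` and `refine` callables, and the `max_rounds` safety limit are all hypothetical. `search` stands in for querying a library database, and `refine` stands in for the student's own judgement about how to narrow or widen the question. The 10-30 target is the advisory range given to the students.

```python
from typing import Callable, List, Tuple


def refine_until_manageable(
    topic: str,
    search: Callable[[str], List[str]],   # hypothetical: returns the articles found
    refine: Callable[[str, int], str],    # hypothetical: returns a revised question
    min_articles: int = 10,
    max_articles: int = 30,
    max_rounds: int = 10,                 # assumed safety limit, not from the brief
) -> Tuple[str, List[str], List[Tuple[str, int]]]:
    """Repeat the search, refining the question, until the result set is manageable."""
    question = f"What has been published already on {topic}?"
    results = search(question)
    # Record every question and its hit count so that the review is repeatable.
    log = [(question, len(results))]
    rounds = 1
    while not (min_articles <= len(results) <= max_articles) and rounds < max_rounds:
        question = refine(question, len(results))
        results = search(question)
        log.append((question, len(results)))
        rounds += 1
    return question, results, log
```

The returned log corresponds to the "iterative cycle" that the brief asks students to document: each question tried and the number of publications it returned.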

Research Question 4: What are students' views on systematic reviews?
Students' comments in reflective essays that are assessed should be viewed cautiously, since the students may write what they think their teacher wants to read. On the other hand, the students were not aware that the introduction of a systematic review into the assessed work was novel and of particular interest to me. Their comments included: "Now I realise research isn't just looking things up." "I used to just Google and take the first page of results." "Initially I struggled to see the relevance to a creative. Now I'm aware of the host of information available." "I hadn't released before how organised and careful you have to be when doing research. Now I think I'd like to do a PhD." Such comments suggest that they did now appreciate the need to be systematic and recognised the existence of useful literature beyond websites.

Research Question 5: Are there significant differences in student marks for different elements of the systematic reviews?
Marks were given between 0 and 5 under the following headings:
A. Full description of a repeatable process?
B. Adequate literature search?
C. Appropriate criteria, explained well?
D. Sources evaluated well against the criteria?
E. Reasoned synthesis and conclusions?
F. Good discussion of limitations of systematic review?
G. All references cited and listed correctly?
The mean and standard error for each element of the in-course assessment (ICA) were calculated and are presented in Table 1. The standard error (SE) for each element is relatively small, indicating that the sample mean is an accurate reflection of the population from which the sample was taken. Figure 1 shows a box plot of the student marks for each element, with the median and the range of the middle two quartiles. Note that Search has a high median, but two outliers: students who did not perform an adequate search by utilising the library's resources. Unfortunately registers are not available to see whether these two students did not attend the classes which explained the types of searches and resources required.
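As a reminder of what the standard error measures, the sketch below computes the mean and the standard error of the mean (sample standard deviation divided by the square root of the sample size) for one element's marks. It is a minimal illustration: the function name and the marks in the usage comment are invented, and it is not the analysis tool used in the study.

```python
import math
import statistics


def mean_and_standard_error(marks):
    """Return (mean, standard error of the mean) for a list of marks."""
    n = len(marks)
    mean = statistics.mean(marks)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    standard_error = statistics.stdev(marks) / math.sqrt(n)
    return mean, standard_error


# Invented marks on the 0-5 scale, for illustration only:
# mean, se = mean_and_standard_error([4, 5, 3, 4, 2, 5, 4, 3])
```

Because the standard error shrinks with the square root of the number of submissions, a small SE reflects both the limited spread of marks on a 0-5 scale and the sample size.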

FIGURE 1: BOXPLOT OF STUDENT MARKS
A Kolmogorov-Smirnov test on each element (see Table 2) showed significant values for all elements, with the exception of the C element, Criteria, indicating that our sample distributions deviate from normal. The box plot (Figure 1) clearly shows how the distributions are skewed, with the exception of the element 'Criteria'. To test the hypothesis that the difference in the scores between elements is significant, a non-parametric test, Friedman's ANOVA, was therefore used. This showed that the score obtained in each section was significantly different over the 7 sections (χ²(6) = 76.15, p < .01). Wilcoxon tests were used to follow up this finding (effectively a series of multiple t-tests). A Bonferroni correction was applied so that all effects are reported at a .0238 level of significance. Table 3 shows those element-pairs where the score differed significantly. Only two pairs, search-criteria and search-evaluation, showed a large effect size (i.e. >0.5). This indicates that students could score well on one element and not the other. This is not unexpected. Students had been given clear requirements that they must use the online databases and e-journals provided by the university library, and find 10-30 relevant articles. Almost all were able to follow these Search instructions satisfactorily. However, students had to develop their own criteria for evaluating the articles, depending on their research question, and then apply those criteria. This requires higher order cognitive skills and it is likely they will find these tasks much harder. These tasks may provide a better assessment mechanism for distinguishing good students from poorer ones.
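For readers unfamiliar with Friedman's ANOVA, the following sketch shows how the test statistic is built from within-submission ranks. It is an illustration under stated assumptions, not the procedure used to obtain the figures reported above: the function names are hypothetical, and the formula is the basic chi-square form without the correction for tied ranks that statistical packages normally apply. With marks on a 0-5 scale ties are frequent, so a package would generally report a different value from this uncorrected one.

```python
def average_ranks(values):
    """Rank values from 1 to k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_chi_square(rows):
    """rows: one list of k element marks per submission.

    Returns (statistic, degrees_of_freedom), without tie correction.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    statistic = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return statistic, k - 1
```

Each submission's seven marks are ranked against each other, so the test compares elements within a submission rather than comparing students; with 7 elements the statistic has 6 degrees of freedom, matching the χ²(6) reported.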

VALIDITY THREATS
Caution should be exercised in attempting to generalise from this study. It was based on masters students only; it is not known whether similar findings would result with undergraduates. The experiences and results discussed are based on one cohort of masters students at one UK university. Further similar work is required before we can make generalisations about all masters students and students at other universities. The student work was marked by one assessor, with second marking of 10 of the submissions. A single assessor ensures consistency of marking, but it is possible another assessor might interpret the marking scheme differently. Any teaching innovation also faces the 'problem' of researcher/teacher enthusiasm, where improvements can be a result only of increased motivation and effort by the teacher who initiates the changes (Petty, 2006). The statistical analysis also has the limitation that there were only 43 submissions and the range of marks for each element was only 0-5.

CONCLUSIONS
This paper has reported our experiences of introducing systematic reviews and evidence-based software engineering (EBSE) guidance into our teaching and assessment of masters students on a range of computing programmes. Both qualitative and quantitative data have been provided. It has shown that most of the students could do systematic reviews, but we recommend that the EBSE guidelines on systematic reviews be modified for students and other novice researchers, to allow for a more iterative method of review. As hoped, introducing systematic reviews helped the students to approach the literature search in a structured fashion and enabled the provision of better feedback. Statistical analysis indicates differing scores across the different elements of the systematic review, with two pairs, search-criteria and search-evaluation, showing a large effect size (i.e. >0.5). Further work could investigate the reasons for differing marks across the different elements of the systematic review. It will also be interesting to discover whether there was any effect related to those students who chose to work in pairs on the systematic review versus those who worked individually. A dissemination outlet is now being sought for the student work that was of publishable standard.
With the subsequent cohort we have asked the students to analyse and evaluate the research methodology reported in each article by applying the evaluation guides of Oates (2006).This may improve the students' critical abilities.

APPENDIX: STUDENT BRIEF FOR THE ASSESSED WORK
Part 1 (40 marks)
You are asked to carry out a systematic review of the literature, on a computing/digital media topic of your choice, using the online databases and e-journals of the University's Learning Resources Centre. Your question (or set of questions) should be clearly stated, and the process clearly described, so that the systematic review is repeatable:
• Initial question(s)
• The iterative cycle: Initial question(s), keywords used, databases searched, number of publications found, revised question(s), keywords used, databases searched etc.
• Criteria/question(s) for appraising the final set of articles found (ideally 10-30 articles, but the actual number will depend upon your topic)
• Appraisal/evaluation of the articles
• Synthesis of the articles and your conclusions
• Discussion of the limitations of your systematic review
• Full reference details for the final set of articles you analyse/synthesise (Harvard style)
Your systematic review should be 1200-2500 words, excluding references.
Criteria to be used in marking the systematic review:
A. Adequacy of literature searching
B. Full description of a repeatable process
C. Appropriate evaluation/appraisal criteria
D. Satisfactory evaluation/appraisal
E. Satisfactory synthesis and conclusions
F. Discussion of limitations
G. Complete references in correct format

TABLE 3: WILCOXON TESTS