Flexibility in Research Designs in Empirical Software Engineering

Problem outline: It is common to classify empirical research designs as either qualitative or quantitative. Typically, particular research methods (e.g., case studies, action research, experiments and surveys) are associated with one or the other of these types of design. Studies in empirical software engineering (ESE) are often exploratory and often involve software developers and development organizations. As a consequence, it may be difficult to plan all aspects of the studies, and to be successful, ESE studies must often be designed to handle possible changes during the conduct of the study. A problem with the above classification is that it does not cater for flexibility in design. Position: This paper suggests viewing research in ESE along the axis of flexible and fixed designs, which is both orthogonal to the axis of quantitative and qualitative designs and independent of the particular research method. According to the traditional view of ESE, changes to the research design in the course of a study are typically regarded as threats to the validity of the results. However, when study designs are viewed as flexible, practical challenges can provide useful information. The validity of the results of studies with flexible research designs can be established by applying techniques that are traditionally used for qualitative designs. This paper urges an increased recognition of flexible designs in ESE and discusses techniques for establishing trustworthiness in flexible designs.


INTRODUCTION
Empirical software engineering (ESE) studies often involve humans, as individuals or as part of a software development organization, who have their own constraints and expectations. It may be difficult for the researcher to predict these constraints and expectations, but in practice the researcher must adapt to them throughout the research (Anda, Hansen et al. 2006; Conradi, Dybå et al. 2006). Furthermore, studies in ESE are seldom based on established theories (Hannay, Sjøberg et al. 2007), and as a consequence elements of the research design, such as the research question or the concepts investigated, may need to be refined during the study. These features of research in ESE demand that the researcher be flexible, in order to manage research that takes unanticipated directions. To enable the researcher to be flexible, the research design must also be flexible.
We define flexibility simply as the capacity to adapt (Golden and Powell 2000), although a number of alternative definitions of flexibility of projects and organizations exist; see, for example, (DeLeeuw and Volberda 1996; Golden and Powell 2000; Olsson 2006).
Research designs are commonly classified as either quantitative or qualitative. Only qualitative designs are viewed as being flexible. The terms "quantitative" and "qualitative" are also used for the data collected in the empirical studies. Our experience is that using the same terms for both the characteristics of the data collected and the features of the research design leads to some confusion among software engineering researchers. Qualitative designs can, for example, incorporate quantitative methods of data collection. To better describe the degree of flexibility in research designs, leaving the type of data optional, Anastas and MacDonald (1994) and Robson (2002) use the terminology of fixed and flexible designs in social science. Because there is a need for increased awareness of the flexibility in ESE research designs, we suggest that this terminology for, or perspective on, design also be used in ESE.
In fixed designs, the design is specified early in the research process, whereas in flexible designs, it is allowed to evolve during the research. Both qualitative and quantitative data may be used in both fixed and flexible designs. A research design may be either completely fixed, completely flexible, or have degrees of flexibility. We believe that there are completely flexible designs conducted in ESE, but that these typically follow the traditional qualitative framework, collecting qualitative data in ethnographies, action research, or exploratory case studies.
In our experience, other types of study in ESE also face some form of uncertainty in the planning phase and thus require a degree of flexibility in the design. Hence, our main concern is to find a design perspective that embraces these studies. In particular, we believe that many experiments need a degree of flexibility in the design. A review of the literature on the type of evidence produced by empirical software engineers, performed by Segal (2005), shows that laboratory experiments dominate evaluations. Hence, our perspective might influence many empirical studies in software engineering.
Our main aims in this paper are to increase the awareness of why and how some ESE studies are flexible, and to initiate a discussion of how to handle this flexibility and simultaneously conduct methodologically sound research. We suggest using flexible designs, including appropriate techniques for establishing trustworthiness.
The remainder of this paper is organized as follows. Section 2 describes the features of fixed and flexible designs and gives examples of how the need for flexibility occurs. Section 3 suggests factors to consider when choosing a design. Section 4 suggests techniques for establishing trustworthiness in studies with a flexible design. Section 5 concludes.

TYPES OF RESEARCH DESIGNS
We view a research design as consisting of the following elements, as shown in Figure 1: purpose(s), theories, research questions, methods and sampling strategies. Both the purpose(s) and the theories help to specify the research questions. Once the research questions have been specified, decisions can be made regarding the methods to use and the sampling strategies. The methods include the following: the research strategy, such as the case study, survey, or experiment; the constructs and measures; the methods of collecting data, such as interviews, observations, or questionnaires; the methods of analysis; the techniques for establishing the trustworthiness of the study; and the research schedule. Finally, the sampling strategy includes descriptions of the study units and how to select them.

FIGURE 1: A framework for a research design (Robson 2002)

In a research study that follows a fixed design, the elements presented in Figure 1 must be specified before the data collection starts. More specifically, applying a fixed research design entails following the procedure that is presented in Figure 2a. Ideas are generated and the research is designed at the beginning of the procedure, and it is here that plans for collecting and analysing data are made. Any methods or data can be used, provided that they can be specified early in the research process.
Examples of typical types of fixed design are experiments that test theories and use statistical methods as a decision tool for drawing conclusions (Arisholm, Gallis et al. 2007), replications (Laitenberger, Emam et al. 2001), systematic reviews of well understood phenomena (Kitchenham, Mendes et al. 2006), and surveys that are based on questionnaires (Dybå 2005).
Another example of fixed designs would be studies that are not necessarily based on theories, but that have short time schedules that allow no flexibility. An example would be experiments performed at developer seminars (Grimstad and Jørgensen 2007).
In flexible research designs, the design components that are presented in Figure 1 are specified during the course of the study. Hence, when applying a flexible research design, the methods of inquiry evolve incrementally in response to the data obtained (Robson 2002). The generation of ideas, designing, data collection, and analysis and writing proceed together or in iterations, rather than in separate stages, as shown in Figure 2b.
So, whereas the procedure of following a fixed research design is analogous to the waterfall method of designing software, the procedure of following a flexible research design is similar to iterative software development or agile methods. The researcher's inability to fix one or several design elements at the beginning of the research, as well as practical constraints during the study, create the need for a flexible design. Examples are the specification of research questions, constructs and measures, and the research schedule:
• Research questions. A study may set out with a tentative research question that is refined in the course of the study, because the understanding of the phenomenon under study and of what can actually be studied empirically increases.
• Constructs and measures. The mostly immature theories in software engineering mean that there will often be a corresponding lack of established constructs associated with the phenomenon under study. Further, the constructs may lack empirical validation. Consequently, constructs and measures may be refined during the study. Moreover, the knowledge about potential data sources and their quality may be limited at the outset of a study, so that the collection of data must be adapted to the actual data available.
• Research schedule. The research schedule may have to be revised during the research, because of unforeseen events.
To give practical examples of how the need for flexibility occurs, we present in Table 1 experiences from two studies in ESE: a systematic review of the literature and a series of experiments. These studies were initially planned with a fixed design. However, because of a lack of theories of the phenomenon under investigation, it became evident that some flexibility was required.

Example 1: a systematic review of the literature
The review was a quantitative investigation of the literature on experimentation over a decade. The first part of the review selected the relevant articles and summarized the characteristics of the experiments. The last part of the review investigated the reporting of effect sizes and quasi-experimentation.
Research question: In the last part of the review, the initial research question asked whether there was a difference in effect sizes between randomized experiments and quasi-experiments. This proved difficult to answer, because the experiments did not report the information necessary for estimating the effect sizes. As a consequence, the review ended up investigating the state of practice of effect size reporting and quasi-experimentation.
Constructs and measures: In the first part of the review, the operational definition of a software engineering experiment evolved with the selection of articles. Several researchers were involved in the study and the final criteria for inclusion were the result of several discussions. Further, the definition of effect size changed throughout the study, and ended up including the unstandardized effect size, because this type appeared to be reported in some articles and seemed very useful for describing the practical importance of the result. Furthermore, types of quasi-experiments in software engineering were not known in advance and therefore the catalogue of quasi-experiments was continuously refined.
Research schedule: The time schedule was revised continuously throughout the review.
Lessons learned: The decision not to follow the initial plan, but to account for new insight during the work, was important for the final quality of the review. All the refinements of research questions, constructs, and measures were valuable for the final results. However, the iterative process made the study more resource-demanding than planned; a flexible design requires a flexible budget. The iterative process was sometimes frustrating. If we had known the framework of flexible designs, we would probably have been more comfortable with all the refinements. Pre-review mapping and piloting the review protocol, as suggested by Brereton, Kitchenham et al. (2007), might have helped to reduce the number of iterations. However, many changes appeared late in the process and a flexible approach would still have been valuable for this type of review.

Example 2: a series of experiments
The first experiment was a pilot study with 26 students as participants; the second was conducted with 53 students as participants; and the third was conducted with 22 professional software developers. The experiments were motivated by common claims in software engineering textbooks, but there were no established theories on the topic.
Research question: The initial research question was whether there was a difference, regarding the time spent on design and quality of the final class diagrams, between a use-case-driven development process and a responsibility-driven development process. However, during the analysis and writing up of the experiments, we realized that the experiments had compared a more specific aspect of the two processes, i.e. the transition from use cases to class diagrams. Consequently, the research question was changed to whether there was a difference, regarding time spent on design and quality of the final class diagrams, when classes were derived by analyzing the use cases compared to when the use cases were used to validate the class diagram.
Constructs and measures: The exploratory nature of the experiments meant that the constructs used for the independent variable, the process, and for one of the dependent variables, quality, were not well established at the outset. Therefore, qualitative data was collected during the experiments to allow us to understand how the participants worked when performing the experimental tasks. The assessment of the quality of the final solutions was qualitative and was refined slightly on the basis of the actual data. The procedure for data collection mostly remained as planned at the outset of the study, but we made some changes in response to the specific features of each experiment. In addition, a few of the participants did not manage to follow the process description that was part of the experimental material, so their solutions were discarded.
Research schedule: The study procedure was revised for each experiment.
Lessons learned: Revising the initial research questions and central constructs during the course of the study was important for the quality of the final study, because it allowed us to take into account what we had previously learned. Furthermore, the collection of qualitative data, in particular about how the participants worked during the study, was valuable in ensuring the validity of the results. Conducting a pilot experiment is recommended in empirical research before fixing the design for the main experiment. Therefore, some revisions of the research design are also catered for in the existing literature on software engineering experiments. In this case, the first experiment can be characterized as a pilot. However, we found that it was difficult to fix all aspects of the design on the basis of the relatively small pilot, and some flexibility was also useful in the later experiments.
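The distinction between unstandardized and standardized effect sizes that arose in the review example can be illustrated with a minimal sketch. The formulas below are the common textbook ones (a raw difference in means, and Cohen's d with a pooled standard deviation); they are an assumption for illustration and not necessarily the definitions used in the review. The function names and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def unstandardized_effect(treatment, control):
    """Raw difference in means, expressed in the unit of the measurements."""
    return mean(treatment) - mean(control)


def cohens_d(treatment, control):
    """Difference in means divided by the pooled standard deviation.

    Assumes two independent groups with at least two observations each.
    """
    n_t, n_c = len(treatment), len(control)
    if n_t < 2 or n_c < 2:
        raise ValueError("each group needs at least two observations")
    pooled_variance = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    if pooled_variance == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return unstandardized_effect(treatment, control) / sqrt(pooled_variance)


# Hypothetical task times in minutes for two groups.
treatment_minutes = [10, 12, 14]
control_minutes = [7, 9, 11]
raw_difference = unstandardized_effect(treatment_minutes, control_minutes)
standardized = cohens_d(treatment_minutes, control_minutes)
```

The raw difference keeps the original unit (here, minutes), which is why it can be useful for describing practical importance, whereas the standardized value is unit-free and depends on the variability in the samples.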

CHOOSING AN APPROPRIATE RESEARCH DESIGN
Early in the research process, the researcher must decide whether to use a fixed design or a design that allows a certain degree of flexibility. In this decision process, we suggest considering the maturity of the research, the purpose of the research, the research setting and the time schedule of the research.
The maturity of research can be catalogued according to the extent of previous work in the field, for example as nascent, intermediate, or mature theory, as shown in Table 2.
The purpose of research is commonly divided into exploratory, descriptive and explanatory; see Table 3. The purpose of the research often depends on the maturity of the research, but not in a deterministic way. Purpose and maturity represent two different perspectives, and both must be considered when choosing a design. In general, the less that is known about a specific topic, the greater the need for flexibility in the design. However, the research setting and the time schedule must also be considered.
Exploratory research: Research in which the primary purpose is to examine a little understood issue or phenomenon to develop preliminary ideas and move toward refined research questions by focusing on the "what" question.
Descriptive research: Research in which the primary purpose is to "paint a picture" using words or numbers and to present a profile, a classification of types, or an outline of steps to answer questions such as who, when, where, and how.
Explanatory research: Research in which the primary purpose is to explain why events occur and to build, elaborate, extend, or test theory.
The research setting can be divided into two categories, which are based on the extent of control. In laboratories and classrooms, more control is possible than in studies that are conducted in a field setting. A controlled setting may enable a fixed design, even if the study is exploratory, whereas a field setting often requires a flexible design. Edmondson and McManus (2007) describe the process of field research on management as a journey that may involve almost as many steps backwards as forwards, in an iterative way. We think that their description fits well into the perspective of a flexible design. Moreover, Edmondson and McManus argue that this iteration is present in all types of field research on management, but that the timing and intensity of the iterations depend on the level of maturity of the research. They also argue that field research is exposed to so many unforeseen events that it must be viewed as a continuous learning process, and that the aim of the learning process is to achieve methodological fit. We present part of their model in Table 4. They suggest using qualitative data for nascent research, a combination of data types (hybrid or mixed methods) for intermediate research, and quantitative data for mature research. This recommendation is in line with our view of the type of data being orthogonal to the choice of fixed and flexible design. Further to this, we believe that quantitative data is sometimes useful for nascent research and qualitative data is sometimes useful for mature research.
A fourth factor to consider is the time schedule of the research. Studies that have a short time schedule (perhaps as short as an hour) often require a fixed design, whereas studies that have a long perspective (perhaps a day or more) often require a flexible design. Sometimes, participants in experiments perform tasks at different times. Such experiments might last for several weeks, allowing the researcher to influence the later part of the experiment using experiences obtained in the early part as a basis. In addition, the chances that other unexpected events will occur increase with time.

ESTABLISHING TRUSTWORTHINESS
An important part of the research design is to establish trustworthiness. In a fixed research design, trustworthiness is established by producing a research plan, which includes control of potential biases that can influence the result, and by performing the study according to the plan. Central concepts when talking about trustworthiness in fixed designs are validity and reliability, as shown for example by the descriptions in (Shadish, Cook et al. 2002) and (Wohlin, Runeson et al. 1999). Examples of ways of establishing trustworthiness in fixed designs are randomization, blinding, random sampling, and computations of researchers' agreement scores.
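As an illustration of the last of these techniques, agreement between two researchers who independently classify the same items can be expressed as a chance-corrected score. The sketch below uses Cohen's kappa; this choice is an assumption, since the text does not name a particular agreement score, and the function name and the include/exclude labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical inclusion decisions by two researchers for four articles.
researcher_1 = ["include", "include", "exclude", "exclude"]
researcher_2 = ["include", "exclude", "exclude", "exclude"]
kappa = cohen_kappa(researcher_1, researcher_2)
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance; how large a value is acceptable is a judgement that depends on the study.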
In flexible designs, there is no fixed plan up front against which the performance can be compared at the end of the study, and there might be types of biases different from those that apply to fixed designs.
In the remainder of this section, we describe techniques for establishing trustworthiness in flexible designs. We will use the definitions of validity and reliability that are suited to all types of research, as suggested in (Hinds, Scandrett-Hibden et al. 1990). We start by describing validity.
Validity is established when the findings reflect reality, and the meaning of the data is accurately interpreted. (Hinds, Scandrett-Hibden et al. 1990, p. 431)
One main threat to validity in studies that have a flexible design comes from the researcher's involvement in the study. It is the researcher's role to be deeply involved in every iteration and decision. In contrast to using a fixed design, where the researcher can concentrate on the planning in a specific time period followed by phases of practical work and analyses according to the plan, in a flexible design he or she must continuously handle all aspects of the research: planning, performance, and analyses. This situation is very demanding. The researcher must ensure that the research is not influenced more by his or her personal assumptions than by the data. This threat from researcher bias, and hence to valid interpretation, can be reduced or eliminated by the techniques described in the literature on qualitative research; see for example (Kvale 1989; Huberman and Miles 2002; Creswell 2007).
In addition to the potential researcher bias, we believe there are two other main threats to validity in flexible designs. One is collecting data that is not the best suited for answering the research questions. This might occur when the research question changes in response to the research and the procedure for collecting data is not sufficiently flexible to account for these changes. The other threat to validity occurs when the researcher does not account for the flexibility of the design when analysing and reporting the results. The flexibility in the design will influence the inferences made from the results. For example, the assumptions for the statistical analyses might not be fulfilled. In such cases, the results can be regarded as justifying the formulation of hypotheses, rather than the formulation of firm conclusions. Furthermore, the reporting of the study must account for the insight obtained through the flexible approach. Hence, both the limitations and the gains obtained through the flexibility must be reported.
We suggest considering these threats to validity, and corresponding techniques for reducing them, when performing studies in ESE that need a flexible design. In particular, we are concerned with those studies that traditionally do not use such techniques, for example experiments, systematic reviews, and other studies that use quantitative data. We recommend the following, which are mostly based on the descriptions in (Robson 2002):
• Strive for the right researcher skill. The researcher must be able to manage unanticipated directions in the research and to balance adaptation and rigour. Moreover, the researcher must know the issue under investigation, because the information gathered is interpreted, not only recorded. Finally, he or she must be open to contrary findings and ask for critical views on the work.
• Use multiple researchers. There is probably more need for multiple researchers in the conduct and analyses in flexible designs than in fixed designs. Arrange peer debriefing and support group sessions.
• Be aware of your value system. Write a description of your pre-assumptions and value system and keep a journal of your reflections.
• Document everything. Produce an archive of your activities, raw data, analysis notes, etc. and let others inspect it (audit trail). Document the analysis process so that you can trace the route by which you came to your interpretation.
• Use the strategy of triangulation. Use multiple sources to enhance the rigour of the research. For example, collect both qualitative and quantitative data.
• Collect data on a broad basis. Be open to the need for data that are related to, but that do not contribute directly to, answering the initial research questions.
• Perform member checking. Check with the respondents to determine whether your interpretations are correct from their view. For example, interview the participants after their performance in experiments.
• Account for flexibility in the analysis and reporting of the study. Both the limitations and the gains obtained through flexibility must be considered in the analysis and reporting.
Generalizability is one part of validity. In flexible designs, generalizability can be supported by providing sufficient information in the reporting of the study to enable readers to determine whether the findings are applicable to their own situation (Robson 2002).
Reliability is the second concept of trustworthiness.
Reliability is established when the repeatability of scientific observations, and sources that could influence the stability and consistency of those observations, have been identified and evaluated. (Hinds, Scandrett-Hibden et al. 1990, p. 431)
Subjectivity and objectivity in research are often connected to the question of reliability. The researcher's role in the flexible design makes it easy to consider flexible design to be subjective, and thereby unreliable. Patton (1990) prefers to avoid using the terms "subjectivity" and "objectivity". He strives for empathic neutrality, by which he means being non-judgemental and reporting what is found in a balanced way. Phillips (1990) claims that "All good research is objective in the sense that it has been open to criticism and withstood serious scrutiny." Hence, a way of establishing reliability in flexible designs is to let other researchers evaluate all aspects of the research.
We have presented ways of establishing trustworthiness in the research to handle the challenges that arise from the flexibility of the design. In addition, worldviews (such as positivism, constructivism, and pragmatism) and particular choices of research methods and type of data gathered must be considered. For example, Lee (1989) discusses conducting case studies that are consistent with the conventions of positivism, Klein and Myers (1999) discuss how to conduct interpretive field studies, and Höst and Runeson (2007) have suggested a checklist to use in case studies in software engineering (see also the book by Yin (2003) for general descriptions of case studies). Moreover, the recent special issue of Information and Software Technology on qualitative software engineering research provides many useful examples of approaches for study designs, data collection, and analysis that should be relevant for future studies of software development that employ flexible designs (Dittrich, John et al. 2007). Finally, issues regarding mixed methods are presented in (Tashakkori and Teddlie 2003).

CONCLUSION
We have suggested that research designs in ESE often need to be flexible. The rationale for this perspective is that studies in ESE are often exploratory, immature, or performed in a field setting. Moreover, the studies involve people, whose behaviour or skills we cannot predict exactly. Because such studies are difficult to plan in detail, the researcher must be flexible and be prepared to adapt when the research takes an unanticipated direction. This requires the use of flexible research designs.
Our impression is that most research in ESE uses fixed designs, in the form of experiments and surveys, probably because this type of design is traditionally regarded as the most reliable, or the easiest to implement. This strategy might imply that the full potential of the study is not achieved, for example because deviations from the plan are regarded as threats to validity. In a flexible design, such deviations are regarded as learning opportunities; they are used to adjust the design for the remainder of the research and also form part of the results. Moreover, flexible research requires a flexible budget. Hence, planning for flexibility will help to formulate a realistic budget.
A flexible design can be used in all types of ESE research, the extent and timing of the flexibility being study-specific. In order to establish trustworthiness, techniques for reducing researcher bias must be used and the reporting of the study must account for both the limitations and the insight obtained through the flexible approach.
We hope that the work presented herein will promote discussion on how to handle the need for flexibility in research designs in ESE and simultaneously perform methodologically sound studies.

FIGURE 2: Visualizations of the procedures that follow a fixed research design and a flexible research design.
TABLE 1: Examples of how flexibility might occur in studies in ESE
Example 2. A series of three laboratory experiments that investigated the effects of different ways of applying use cases in the construction of class diagrams (Anda and Sjøberg 2003; Syversen, Anda et al. 2003; Anda and Sjøberg 2005)

TABLE 2: (Edmondson and McManus 2007)
Mature theory presents well-developed constructs and models that have been studied over time with increasing precision by a variety of scholars.
Intermediate theory presents provisional explanations of phenomena, often introducing a new concept and proposing relationships between it and established constructs. Although the research question may allow the development of testable hypotheses, similar to mature theory research, one or more of the constructs involved is often still tentative, similar to nascent theory research.
Nascent theory proposes tentative answers to novel questions and suggests new connections among phenomena.

TABLE 4: Methodological fit for research in a field setting (Edmondson and McManus 2007)