An assessment of published evaluations of requirements management tools

Context: The traditional literature review is a low cost, relatively quick but potentially ineffective method for evaluating tools. Practitioners appear to place a greater emphasis on the practical constraints of an evaluation (e.g. that it is low cost and quick) and the efficacy of the technology to the company, rather than on generic scientific results. By contrast, academia appears to place greater emphasis on theory confirmation, rigour and validity, and their literature reviews focus on literature published in peer-reviewed journals and conferences, and tend not to consider the trade and ‘grey’ literature.


INTRODUCTION
Leading researchers within the empirical software engineering community [2][3][4][15] have for some time advocated the conduct of empirical evaluations as part of technology adoption decisions. In promoting empirical evaluations, researchers have in the past tended to concentrate on methods and methodologies for collecting new evidence, rather than on identifying and appraising existing evidence. More recently, however, the Evidence Based Software Engineering (EBSE) methodology, together with the related Systematic Literature Review (SLR) protocols, has concentrated on identifying and appraising existing evidence.
Researchers have recognised the benefits and drawbacks of different methods of empirical evaluation and have also considered the contexts in which the different methods are more and less appropriate. The DESMET methodology (e.g. [6]) provides an exemplar of attempts to organise methods of empirical evaluation into a framework and to guide evaluators on when it would be appropriate to use particular methods within that framework. Drawing on the work of Zelkowitz et al. (e.g. [14]), Wohlin et al. [13] and Herceg [5] have also organised empirical methods into frameworks. These frameworks have recognised that the traditional literature review is a low cost, relatively quick but potentially ineffective method for evaluating tools (e.g. [5,13,14]). While EBSE, and SLRs in particular, could address some of the limitations of traditional literature reviews, the resources required to undertake an SLR indicate that SLRs are neither low cost nor quick. Also, the majority of SLRs conducted to date focus on literature found in peer-reviewed academic journals and conferences, and do not appear to consider practitioners' trade literature or the so-called 'grey' literature.
In this paper, we report a preliminary assessment of the quality and quantity of published evaluations of requirements management tools (RMTs) that have been reported in the academic, grey and trade literatures. We have undertaken this assessment to better understand the potential value of the literature review as a method of tool evaluation for technology adoption. For example, if many high-quality evaluations have already been published, then it may be more sensible for practitioners first to undertake a systematic literature review of previous evaluations before deciding whether it is appropriate for them to collect new evidence, perhaps through a pilot study, from within the company. Conversely, if there are few published evaluations and these are of low quality, then there would be a stronger case for a company to conduct its own evaluation and collect new evidence specific to the company.
We base our assessment on three literature reviews that were conducted to identify published evaluations of RMTs. The three literature reviews were conducted by three different types of reviewer: a practitioner in a company, an experienced researcher, and 19 final-year undergraduate students. The reviews undertaken by the 19 students are treated here as one literature review. The researcher and the students followed a version of EBSE to undertake the literature reviews. The practitioner undertook a traditional, ad hoc literature review. We only briefly consider here the possible effects of using an EBSE-based protocol on the outcomes of the literature review; see [11] for a more detailed discussion of possible effects. All three literature reviews were motivated by a collaborative project being undertaken between a company and researchers at the University of Hertfordshire, as part of a Knowledge Transfer Partnership (KTP). To protect the commercial interests of the company we cannot name the company here or identify the specific KTP.
The remainder of the paper is organised as follows: section 2 briefly reviews related work in this area; section 3 describes the design of our study; section 4 presents the results of our study; finally, section 5 briefly discusses our results and identifies areas for further research.

RELATED ACTIVITY
Zelkowitz et al. found a notable contrast between the methods preferred by research and those preferred by industry. In particular, Zelkowitz et al. found [14] that the research community is more concerned with theory confirmation and validity of experiments and less concerned with costs, whereas the converse is true for industry: organisations within industry are more concerned with costs and the applicability of a technology in their respective organisation's environment, and are less concerned with general scientific results which would aid the community at large. This distinction suggests that academics and practitioners would tend to value different guidelines in, for example, a methodology like DESMET: the academics would 'gravitate' toward the guidelines based on the technical strengths and weaknesses of a method of evaluation, whilst the practitioners would 'gravitate' toward the guidelines based on practical constraints. Wohlin et al. [13] present a framework for organising methods of evaluation that recognises the escalation in costs that would be incurred by an organisation when conducting different kinds of evaluation. The framework recognises that as one proceeds from literature studies (presumably traditional literature reviews rather than SLRs) through to case studies of standard projects, the costs of performing the evaluation increase, the similarity of context increases and the confidence in the evaluation increases. Given industry's focus on costs, it is reasonable to suppose that an organisation would undertake some kind of formal or informal literature review as a first step in evaluating a proposed technology. Indeed, Wohlin et al. state that "... a suitable first step [to evaluate new technology] is always to study the available literature to obtain a baseline concerning the state of the art in the area and also to get some information about best practices." ([13]; emphasis added).
Aside from the costs of conducting an evaluation, Herceg [5] recognises the costs incurred from the adoption of an inappropriate technology. These costs include: the opportunity cost of not finding critical information of interest, the sunk costs of integrating the inappropriate technology, costs arising from reversing the integration of the inappropriate technology, and finally the additional costs of finding and inserting an appropriate technology. The first of these costs can be directly addressed with a literature review, and such a review could provide a foundation for subsequent decisions over technology evaluation and adoption. Herceg combined the methods proposed by Zelkowitz et al. and Wohlin et al. to develop a revised classification. One notable element of Herceg's list is the distinction between a literature study of vendor literature and that of academic literature. As with Wohlin et al., Herceg recognises an increase in both costs and benefits as one proceeds from the desk-based to production-targeted evaluations.
Whether one conducts a more traditional, ad hoc literature review or a more formal literature review such as an SLR, there is the issue of whether there is literature available 'out there' that is appropriate to the review. Recently, some researchers have begun to investigate Systematic Mapping Studies (e.g. [7]). Systematic Maps (which appear to be similar to, if not the same as, scoping studies, e.g. [1]) are an attempt to 'map' a 'terrain' of literature that may be relevant to a broad topic of interest. Formally, the study we report here is not a Systematic Mapping Study; however, our study does explore the same kinds of issues, e.g. to what degree are suitable publications available in the academic, trade and grey literatures for some kind of structured identification and appraisal of published evaluations of RMTs? Like systematic mapping studies, our study concentrates on the identification of appropriate literature prior to a structured appraisal of that literature.

Overview of the three literature reviews
The assessment reported here was originally motivated by a collaborative project undertaken between the University and a commercial company. The company is a small to medium enterprise (SME) with approximately 100 employees, has been ISO9001:2000 compliant since 2000 and before that ISO9001:1994, and is estimated to be approaching CMM Level 2. The company wanted to improve its requirements management and had already decided to adopt an RMT; however, the company had not decided which tool to adopt.
The University and the company worked together to identify, evaluate and decide on an RMT, and subsequently to deploy the tool and then review its deployment. A recent graduate (the third author of the current paper) was recruited full-time to work onsite as a practitioner at the company, and to undertake the RMT evaluation and RMT deployment. The graduate received academic support from the University and commercial support from the company. The collaborative project provided the opportunity for a coursework assignment for undergraduate students in the final year of the BSc(Hons) Computer Science degree programme. The assignment was also undertaken independently by an experienced researcher (the second author of the current paper, who did not teach the students). The undergraduate students and the experienced researcher used a set of EBSE Supplementary Guidelines to assist them with their evaluations. Further information on the Supplementary Guidelines can be found in [10]. The undergraduate students' experiences of using the Guidelines can be found in [9]. Other investigations we have conducted using these datasets can be found in [11] and [8]. A thorough description of the RMT evaluation, deployment and post-deployment review undertaken for the company can be found in [12]. We emphasise that in the current paper we focus on the identification of evaluations of RMTs in the public domain. We briefly appraise the evaluations for their quality. We do not consider which RMT is the 'best' RMT, either in general or with regard to the specific company. Neither do we consider in detail how these evaluations influenced the tool adoption decision of the company.

Comparison of the three sets of literature searches
There are differences between the literature searches conducted by the three groups of reviewers. These differences are summarised in Table 1. A fundamental distinction between the three reviews may be the focus of the reviews, i.e. the practitioner is asking the questions 'What RMTs are available, and which is the more appropriate?' whereas each student and the researcher is asking the question 'Is RMT x better than RMT y?' (or, where only one RMT is evaluated in the EBSE question, 'Is RMT x fit for purpose?'). Another significant distinction may be the time each evaluator took to identify the literature. The practitioner estimates that they took 12 hours for their literature search. This is over twice the time recommended to the students and the researcher for the literature identification part of the coursework assignment. The relatively short durations of the literature searches (as opposed to the entire courseworks) are consistent with Zelkowitz et al.'s and Wohlin et al.'s description of industry undertaking initial, low-cost, quick literature-based evaluations prior to undertaking more expensive, but potentially more reliable, further evaluations.

Datasets
The practitioner and the researcher each produced one set of publications from their respective literature searches. The 19 students produced 19 sets. For the 19 sets of student-produced publications, we removed duplicates to produce one set of unique publications for the students. A small number of duplicates still remained in the students' combined set of publications. These duplicates occur when, for example, two or more students downloaded the same article but from different websites. In the subsequent analysis of the students' combined set of publications, we identified and removed each of these more subtle duplicates.
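The two-stage de-duplication described above can be sketched in code. This is a minimal illustration only, not the procedure actually used in the study: the record fields (`title`, `url`), the `normalise_title` helper and the sample records are all hypothetical, and matching on a normalised title is just one plausible way to catch the same article downloaded from different websites.

```python
import re
from typing import Iterable


def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation and whitespace (hypothetical rule)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def unique_publications(student_sets: Iterable[list[dict]]) -> list[dict]:
    """Merge per-student sets, keeping the first copy of each publication.

    Two records count as the same publication when their normalised titles
    match, even if they were downloaded from different URLs.
    """
    seen: set[str] = set()
    merged: list[dict] = []
    for publications in student_sets:
        for pub in publications:
            key = normalise_title(pub["title"])
            if key not in seen:
                seen.add(key)
                merged.append(pub)
    return merged


# Hypothetical records, for illustration only.
student_a = [{"title": "Comparing RM Tools", "url": "http://example.org/a.pdf"}]
student_b = [
    {"title": "Comparing  RM tools.", "url": "http://example.com/mirror/a.pdf"},
    {"title": "A Survey of RM Tool Features", "url": "http://example.org/b.pdf"},
]
combined = unique_publications([student_a, student_b])
print(len(combined))
```

Matching on URL alone would miss the mirrored copy in this example, which is why the sketch keys on the title instead; in practice a manual check of near-matches would probably still be needed.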

Selecting the higher quality publications
A preliminary examination of the full set of resources indicates that there is considerable diversity in the resources selected by the reviewers, particularly those selected by the students. This examination also suggests that many of the resources do not constitute reports of tool evaluations but rather provide information on, for example: requirements management, the general concepts of a requirements management tool, or product information from vendors on their tools. In addition, the EBSE and SLR protocols that we are using to define a literature review as a method of evaluation recommend the application of selection and rejection criteria during the literature search process (cf. step 2 of EBSE) to ensure that the resources selected are of higher quality and relevance to the information need (e.g. as stated by the EBSE question) that has motivated the search. While the researcher explicitly applied selection and rejection criteria, the practitioner and the students did not.
The practitioner appears to have informally applied one selection criterion and one rejection criterion. We have compensated for the differences between the researcher's, the practitioner's and the students' use of selection and rejection criteria by screening all of the resources found by all three sets of evaluators. In effect, we are applying simple selection and rejection criteria ex post facto to select the higher quality evaluations.
We identified two types of evaluation, summarised here in Table 2. We then examined each evaluation to identify the specific RMTs considered in that evaluation. We also noted the date of publication of the evaluation: more recent publications are more likely to be relevant because they are evaluating the more recent RMTs.
Table 2 The two higher quality types of evaluation selected

Structured comparison
Design and use of an explicit framework (often in the form of a set of criteria or features) to evaluate more than one RMT and to allow comparison of RMTs against each other.

Multiple product review
More than one RMT is discussed, and strengths and weaknesses may be identified. The discussion is not undertaken in a way that allows a structured comparison of RMTs, e.g. a categorical scale of measurement is used at best. Typically a multiple product review takes the form of each RMT being presented (either in a sub-section or a series of table entries) and discussed.
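The two definitions in Table 2 amount to a simple screening rule, which can be sketched as follows. This is an illustrative reading of the table, not the authors' instrument: the `Resource` fields and the `classify` function are hypothetical names, and deciding whether a resource uses an "explicit framework" remains a judgement made by the person screening it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    title: str
    rmts_discussed: int       # number of distinct RMTs the resource covers
    explicit_framework: bool  # evaluates against a stated set of criteria or features


def classify(resource: Resource) -> Optional[str]:
    """Return the Table 2 type of a resource, or None if it is screened out."""
    if resource.rmts_discussed < 2:
        # Both types in Table 2 require more than one RMT.
        return None
    if resource.explicit_framework:
        return "structured comparison"
    return "multiple product review"


# Hypothetical resources, for illustration only.
print(classify(Resource("Vendor datasheet", 1, False)))
print(classify(Resource("Criteria-based comparison of four tools", 4, True)))
print(classify(Resource("Round-up of three tools", 3, False)))
```

Under this reading, single-product material such as vendor product information is rejected regardless of how it is structured.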

The 'coverage' of RMTs in published evaluations
Table 3 summarises the evaluations identified by the three sets of evaluators. Fourteen software applications were treated as RMTs for this investigation. The INCOSE Requirements Survey (http://www.paper-review.com/tools/rms/read.php) recognises over 40 commercial RMTs. With the exception of the Oracle and Sparx software applications (included in the table because the company shortlisted them), we have listed the 12 RMTs that occurred most frequently in the publications that reported an evaluation. One of the software applications, Oracle Database 10g, is not strictly an RMT. The application has been included because it was shortlisted by the company for evaluation, because the company expected that this application could be modified to manage requirements, and because the company used other software applications built on Oracle. Similarly, the Sparx application is not an RMT itself, but includes an RMT plug-in. Table 3 identifies seven RMTs that were shortlisted by the practitioner for detailed evaluation, and a further three RMTs that were held in reserve by the practitioner in case none of the first seven RMTs were satisfactory. The first four RMTs listed in the table (i.e. CaliberRM, Optimal Trace, DOORS and ARTS) all satisfied the mandatory and highly desirable requirements for an RMT, as defined by the practitioner as part of the detailed company evaluation. These four RMTs were also the RMTs available for evaluation by the students and the researcher in the coursework assignment.
In Table 3, there are duplicate resources analysed across the groups of evaluators (e.g. a resource found by the practitioner may also have been found by a student or the researcher) but there are no duplicates within each group (i.e. a resource found by more than one student is only analysed once for Table 3). Taking account of the fact that different evaluators found the same evaluations, we have identified 22 unique structured comparisons and multiple product reviews.
Recognising the constraints of our study design, it is nevertheless surprising that, given both the number of RMTs listed in the table and the number of commercial RMTs available on the market (i.e. over 40), there are so few structured comparisons and multiple product reviews reported in the literature.
The table also indicates that there is no uniform coverage of RMTs across the evaluations. Three RMTs (i.e. Telelogic DOORS, Borland CaliberRM, IBM RequisitePro) are included in almost all of the identified evaluations, with a fourth RMT (Serena RTM) occurring in half of the evaluations. Of the other RMTs, each occurs in only a quarter or less of the evaluations. As a possible explanation for the skewed coverage of RMTs, several publications report the percentage of market share for the market-leading RMTs. Table 4 summarises these percentages. DOORS and RequisitePro have consistently remained the market-leading RMTs, with CaliberRM improving its market share more recently. Although the four main RMTs considered here may not be consistent in their rankings between Table 3 and Table 4, it is clear that the top four RMTs in Table 3 are the same RMTs reported in the sources listed in Table 4. We return to this point in section 5.
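The notion of 'coverage' used above, the fraction of evaluations in which a given RMT appears, can be made concrete with a short sketch. The evaluation contents below are invented purely for illustration and do not reproduce Table 3; the `coverage` function name is likewise hypothetical.

```python
from collections import Counter


def coverage(evaluations: list[set[str]]) -> dict[str, float]:
    """Fraction of evaluations that include each RMT, most covered first."""
    total = len(evaluations)
    if total == 0:
        return {}
    counts = Counter(rmt for rmts in evaluations for rmt in rmts)
    return {rmt: n / total for rmt, n in counts.most_common()}


# Hypothetical data: each set lists the RMTs considered in one evaluation.
evaluations = [
    {"DOORS", "CaliberRM", "RequisitePro"},
    {"DOORS", "RequisitePro", "RTM"},
    {"DOORS", "CaliberRM"},
    {"DOORS", "RequisitePro", "CaliberRM", "RTM"},
]
shares = coverage(evaluations)
for rmt, share in shares.items():
    print(f"{rmt}: {share:.0%}")
```

Because each evaluation is held as a set, an RMT mentioned several times within one evaluation is still counted once, which matches the per-evaluation counting that the discussion of Table 3 implies.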

The frequency of evaluations
Figure 1 illustrates the number of evaluations published each year for the 22 unique evaluations. The noticeable concentration of evaluations since 2004 may reflect a combination of the three sets of reviewers in the study looking for more recent studies and a form of publication bias, with more recent publications more likely to still be on the internet and internet search engines more likely to rank the more recent publications more highly. Again, given the number of commercial RMTs on the market, it is surprising how few evaluations appear to be reported each year.
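A per-year tally of the kind plotted in Figure 1 can be sketched as below. The publication years are hypothetical placeholders, not the study's data, and `evaluations_per_year` is an illustrative name; years with no evaluations are filled with zero so that gaps remain visible.

```python
from collections import Counter


def evaluations_per_year(years: list[int]) -> dict[int, int]:
    """Count evaluations per publication year, including years with none."""
    if not years:
        return {}
    counts = Counter(years)
    return {year: counts.get(year, 0) for year in range(min(years), max(years) + 1)}


# Hypothetical publication years, for illustration only.
years = [2000, 2004, 2004, 2006, 2007, 2007]
per_year = evaluations_per_year(years)
mean_per_year = sum(per_year.values()) / len(per_year)
print(per_year)
print(mean_per_year)
```

Applying the same averaging to the figures reported in this paper, 22 evaluations spread over roughly 2000 to 2007 (eight years) gives 22 / 8 = 2.75, i.e. about three evaluations per year.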

The bias toward evaluating leading RMTs
The results presented in section 4 suggest that there is a bias toward evaluating the market-leading RMTs. This bias may partly be self-perpetuating, i.e. practitioners are more aware of the market-leading RMTs, which encourages evaluators (e.g. consultancies, researchers) to evaluate the market-leading RMTs, which in turn helps to sustain the awareness of these RMTs amongst the practitioner community. Viewed from a different perspective, the evaluations have advertising value separate from their value as technical assessments.
In addition to short-listing the leading RMTs because practitioners are more aware of them, there is also a natural risk mitigation strategy being applied, i.e. it is human nature to assume that a market-leading RMT is more likely to meet a company's needs because it appears to have met the needs of many other companies. A notable weakness in this approach is that a company may choose to ignore an RMT because it is not a market leader and, in so doing, may overlook a tool that is appropriate to the company and that may cost considerably less in purchase and operational costs. Indeed, the company that collaborated with the University on this project identified and ultimately adopted an RMT that was not a market leader. The adoption decision has saved the company approximately £90K in initial purchase costs. In this particular case, the company reviewer was careful to ensure that, when undertaking the literature review, the RMTs short-listed as a result of the literature review were not only those RMTs that were market leaders. In so doing, the company reviewer ensured that the RMT that was eventually chosen was retained past the literature review phase.

The impact of differences between the three sets of evaluations
The students and the researcher adopted a much more structured literature search than the practitioner (see Table 1); however, the single practitioner appears to have found the most publications (see Table 3, normalising for the number of reviewers in each of the three groups of reviewers). This difference is probably at least partly explained by the following factors: the researcher and the students were constrained by their respective EBSE questions to search for only one or two RMTs in particular; the researcher diligently followed the EBSE guidelines, including the principles of SLRs, and as a result constrained her searches to only the academic, peer-reviewed bibliographies; and the practitioner ultimately had a different objective for their searches compared to the researcher and the students. As another point of difference, with 19 students each spending up to approximately 5 hours on searching, this is potentially 90 hours of searching compared to the 12 hours undertaken by the practitioner. We think this difference could potentially be explained by: students (significantly) over-reporting the time spent actually searching; duplication of searching amongst students; restriction of student searching to one or two RMTs; students selecting a market-leading RMT for their coursework because there was obviously more evidence available on those RMTs; students' limited understanding of RMTs (with the result that they searched for resources not directly relating to RMT evaluations); and, conversely, the practitioner having a more developed understanding of what specifically to search for.

The quantity and quality of the evaluations found by the three groups of searchers
We noted earlier that the researcher was the only reviewer to explicitly apply selection and rejection criteria, and that the practitioner appeared to implicitly apply one selection criterion and one rejection criterion. Indeed, we applied ex post facto criteria to select the higher quality evaluations, i.e. structured comparisons and multiple product reviews. We have as a result selected the best evidence available, but this evidence may still not be of very good quality. Overall, there appears to be a limited quantity of evidence and limited quality of evidence. Furthermore, although not shown explicitly in the results, there were very few publications found by the students and the researcher that were not found by the practitioner. In other words, the practitioner found the greatest number of evaluations, the higher quality evaluations, and the widest coverage of evaluations. The implication is that if we take the company problem (adopting an RMT) as the thing we are trying to improve, then the students and the researcher added very little beyond what the practitioner found.

Assessing our analysis with a focus on the quality of evidence
Two items of evidence independent of our analysis suggest that we should be cautious in our assessment. The first item concerns whether the evidence relates to the latest changes to the technology. The second item concerns the objectivity or independence of the evaluations. With regard to the first item, the Yphise organisation has conducted at least three evaluations of commercial RMTs and has done so every two years (i.e. 2002, 2004, and 2006, with a related evaluation of RMTs and Agile in 2008). The two-yearly frequency of these evaluations provides an indicator of how quickly RMTs change and how quickly they may need to be re-evaluated.
A related implication is that many of the 22 unique evaluations we have identified here may be out of date. Also, the Yphise evaluations appear to be restricted to about six RMTs per evaluation, most of which are market leaders, with some indication that the same six RMTs are included in each evaluation. This suggests practical constraints on the evaluations, which provides limited corroboration of the arguments raised in section 2 concerning practical constraints on evaluations. While we do not have space here to report details on all 22 evaluations, we note that all three sets of reviewers only found the most recent Yphise report; in other words, all three sets of reviewers failed to find at least two relevant evaluations. One might argue that it is only the most recent Yphise report that is relevant, the other two now being redundant. Figure 1 shows, however, that the three sets of reviewers found other evaluations that are likely to now be redundant. Alternatively, it may be that the previous two publications are no longer on the Internet. The INCOSE website provides the date of every RMT assessment posted on the site, and approximately half a dozen assessments are posted each year. Given that these are vendor-supplied assessments, one should be cautious about their validity. But the frequency of postings (i.e. half a dozen per year) suggests that some information relevant to evaluating RMTs is being posted more frequently than suggested in Figure 1.
The second item concerns the objectivity or independence of the reports. Many of the evaluations considered appear to be subjective. For example, the INCOSE website provides the largest repository of RMT evaluations, both in terms of the number of RMTs considered and the number of criteria used to evaluate the RMTs. But all of the tool assessments reported on the INCOSE website are undertaken by the vendors.

Further research
There are a number of directions in which to extend this research. Briefly, these include:
• Investigating how appraisals of evidence from existing evaluations can effectively complement the conduct of a new evaluation, e.g. combining a DESMET qualitative screening with a pilot study, and undertaking a Systematic Mapping study prior to deciding on whether to conduct a literature review.
• Quantifying the costs and benefits to a company of undertaking an appraisal of existing evidence.
• Developing criteria for evaluating the quality of industry-conducted evaluations.

Conclusion
Using three independent literature reviews of published evaluations of commercial RMTs, we have reported on the coverage of RMTs in those published evaluations. We found that there are few evaluations (22 in number) of commercial RMTs (where an evaluation has been defined as a structured comparison or a multiple product review), and this is particularly surprising when one considers that there are over 40 commercial RMTs available on the market. Coverage of evaluations is biased toward the market-leading RMTs, with at most only 4 of the 14 RMTs we consider here being covered in a half or more of the evaluations. Indeed, market share appears to relate to whether an RMT will be included in an evaluation. The three literature reviews identified evaluations from around the year 2000 to 2007, with on average about 3 evaluations being published each year. Again, there is a bias toward more recent evaluations, which may be explained by a combination of the reviewers' search strategy (i.e. search for more recent publications) and an inherent publication bias on the Internet (with recent publications being published and indexed by search engines). Overall, our findings corroborate claims in previous work that the traditional literature review is a low cost, quick but ineffective method of evaluation. The main contributions of the traditional literature review for evaluating commercial tools would seem to be to identify candidate tools for a subsequent evaluation conducted by the company, and to provide some benchmarks. We caution against shortlisting candidate tools primarily on the basis of their market share.

Figure 1 The annual frequency of Structured Comparisons (SC) and Multiple Product Reviews (MPRs)