IR-based traceability recovery as a plugin - an industrial case study

Large-scale software development is a complex undertaking and generates an ever-increasing amount of information. To work efficiently under such circumstances, engineers need support for navigating all available data. Maintaining traceability links between software artefacts is one approach to structuring the information space and addressing this challenge. Several researchers have proposed traceability recovery by applying information retrieval (IR) methods, based on textual similarities between artefacts. Early studies have shown promising results, but no large-scale in vivo evaluations have been made. Currently, there is a trend among our industrial partners to collect artefacts in a specific new software engineering tool. Our goal is to develop an IR-based traceability recovery plugin for this tool. From this position, in the environment of possible future users, the usefulness of supported findability in a software engineering context can be explored with industrial validity.


INTRODUCTION
In large-scale software development, coordination between different organizational units is a key success factor to develop high-quality products on time and within budget. Software development results in a myriad of information entities. Apart from the source code itself, requirements and design specifications at various abstraction levels, test descriptions, test results and defect reports are examples of produced software artefacts. The term software artefact refers to any piece of information, a final or intermediate work product, which is produced and maintained during software development (Kruchten, 2004). Software artefacts are the tangible results of the development process. They are typically of volatile nature and subject to version control.
Developing techniques to navigate all this growing information is crucial. Current state-of-practice is to structure the information space by manually maintaining traceability links between software artefacts. This is widely recognized as an important factor for efficient development, since it supports verification, change impact analysis, program comprehension and software reuse (Antoniol et al., 1999). Lack of traceability has been identified as one of the top factors causing delays in software engineering projects (Dömges and Pohl, 1998). Since a traceability link can be established between any software artefacts, defining a suitable trace granularity is an important decision in a development project (Cleland-Huang et al., 2007).
Software artefacts can consist of source code, UML models, diagrams, state machines, graphics, binary files etc. However, text in natural language is the common form of information representation during all development phases (Marcus and Maletic, 2003). Source code also contains natural language content in identifiers and comments. Consequently, several researchers have proposed standard IR approaches to semi-automatically trace software artefacts by presenting candidate links. The trend in this research has been to chase recall and precision values on a rather limited set of small publicly available datasets, often from student projects or the open source domain (Huffman Hayes et al., 2006, Zou et al., 2006, Capobianco et al., 2009). Recently, case studies have been conducted using proprietary data from industry, but they are still in the minority.
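As an illustration of the two measures mentioned above, the following minimal sketch computes precision and recall for a set of candidate links against a known set of correct links. The artefact identifiers and the function name `precision_recall` are hypothetical and chosen only for this example; they do not come from any of the cited studies.

```python
def precision_recall(candidate_links, true_links):
    """Return (precision, recall) of candidate links against known correct links.

    Both arguments are iterables of (source_id, target_id) pairs.
    """
    candidates = set(candidate_links)
    truth = set(true_links)
    correct = candidates & truth
    # Precision: share of the proposed links that are correct.
    precision = len(correct) / len(candidates) if candidates else 0.0
    # Recall: share of the correct links that were proposed.
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical requirement-to-test-case links, for illustration only.
candidates = [("REQ-1", "TC-1"), ("REQ-1", "TC-2"), ("REQ-2", "TC-3"), ("REQ-3", "TC-1")]
truth = [("REQ-1", "TC-1"), ("REQ-2", "TC-3"), ("REQ-3", "TC-4")]

precision, recall = precision_recall(candidates, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Both measures require a validated set of correct links, which is one reason evaluations have tended to rely on small datasets where such a set exists.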
The goal of our research is not primarily to study how IR methods can be improved and configured to perform better in an industrial setting, but rather to evaluate the IR-based approach in general and study how software engineers can benefit from increased findability through traceability recovery. To reach that goal, we plan to implement the functionality as a plugin in an existing tool.

RELATED WORK
The most cited definition of traceability has been given by Gotel and Finkelstein (1994): "Requirements traceability refers to the ability to describe and follow the life of a requirement, in both a forward and backward direction (i.e. from its origin, through its development and specification, to its subsequent deployment and use, and through periods of on-going refinement and iteration in any of these phases)". [...] (Cleland-Huang et al., 2007). Traceability in the context of (2) software understanding and reuse supports maintenance and reengineering of legacy systems (Antoniol et al., 2002).
Fiutem and Antoniol (1998) did early work on recovering traceability links between design and source code. They used basic string comparisons and edit distances to suggest links between design documentation and source code. In the following years, Antoniol et al. (2002) continued by using the Vector Space Model (VSM) and probabilistic models to recover traceability links between source code and textual documentation in natural language. Marcus and Maletic (2003) introduced Latent Semantic Indexing (LSI) to recover traceability, also between code and documentation. They also showed that LSI can achieve good results without the need for stemming. The risk of spending too much effort on improving techniques for document retrieval without considering the actual needs of the users has been known for decades (Lancaster, 1968). Directing effort on increasing the size of datasets instead of spending time on optimizing algorithms on small corpora is important, since methods might converge (Banko and Brill, 2001). Recently, Oliveto et al. (2010) presented a case study on traceability recovery where VSM, LSI and the Jensen-Shannon method were compared and the results were almost equivalent.
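To make the VSM idea concrete, the sketch below ranks candidate links by cosine similarity between TF-IDF vectors. It is a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the tokenizer, the smoothed IDF variant, the function names and the toy requirement and test case texts are all hypothetical, and no stemming or stop word removal is applied.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each artefact id to a sparse TF-IDF vector stored as {term: weight}."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: the number of artefacts in which each term occurs.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        # Smoothed inverse document frequency; one common variant among several.
        vectors[doc_id] = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_links(sources, targets, top_k=2):
    """For each source artefact, return the top_k most similar target artefacts."""
    # Assumes source and target ids are distinct, since they share one index.
    vectors = tfidf_vectors({**sources, **targets})
    links = {}
    for source_id in sources:
        scored = [
            (target_id, cosine(vectors[source_id], vectors[target_id]))
            for target_id in targets
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        links[source_id] = scored[:top_k]
    return links


# Hypothetical artefacts, for illustration only.
requirements = {
    "REQ-1": "The system shall log failed login attempts",
    "REQ-2": "The operator can export the alarm list to a file",
}
test_cases = {
    "TC-1": "Verify that a failed login attempt is written to the log",
    "TC-2": "Export the alarm list and check the file content",
    "TC-3": "Check pump start sequence timing",
}

links = candidate_links(requirements, test_cases)
for source_id, ranked in links.items():
    print(source_id, [(target_id, round(score, 2)) for target_id, score in ranked])
```

In this toy data, TC-1 ranks first for REQ-1 and TC-2 ranks first for REQ-2. The ranked list is only a set of candidate links; in the semi-automatic setting described above, an engineer still decides which candidates to accept.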

EXPLORING STATE-OF-PRACTICE AND STATE-OF-ART
A large in-depth exploratory interview study was initiated in 2009 to investigate software engineers' views on alignment between requirements and test activities. We have conducted 30 interviews in 6 different companies with interviewees representing different roles in the development process. The overall goal of the study was to better understand the context in order to focus our future research. Our study identified poor tool support, information distributed in separate systems with poor interoperability, and lack of traceability as factors contributing to misalignment (Sabaliauskaite et al., 2010).
To investigate the state-of-art of IR-based traceability recovery, we are working on a systematic mapping study (Kitchenham, 2004). Preliminary results from the meta-analysis show a need for in vivo evaluations of the approach; most previous evaluations involving human subjects have been conducted in university settings with student subjects. The final step of the study will map empirical results according to IR techniques, validity of datasets and types of traceability links established.
Another parallel activity, a master's thesis project, found the public availability of the research prototypes to be low. The thesis evaluated IR-based tools for traceability recovery using requirements and test case descriptions collected from safety-critical development in the domain of power and automation (Brodén, 2011).

DEVELOP PLUGIN IN STATE-OF-PRACTICE TOOL
Some of our industrial partners are working on introducing HP Quality Center (QC) as a new software engineering tool. A direct outcome of this transition will be that requirements, test cases and defect reports will be accessible in the same tool. This means the issue of poor tool interoperability highlighted by practitioners in our case study will no longer be a major obstacle. Another major advantage of this tool change in industry is that QC has good support for plugin development, thus it can be used as a test bed for our approach. This would enable us to implement an IR-based traceability tool within the system, right in the centre of the information hub.

EVALUATE APPROACH IN INDUSTRIAL CASE STUDY
The aim of this study will be to evaluate how well the IR-based approach to traceability recovery works in a real industrial setting. With the plugin in place, we will be able to study the performance of IR-based approaches for traceability recovery with industrial validity. It will also enable us to study software engineers and their artefacts without introducing any additional external tools. The focus will be less on recall and precision, since the real question is to what extent the approach actually supports engineers. Instead, aspects such as how much engineers benefit from improved findability of traceability information, how it affects the way they work, and how much time can be saved should be addressed.
A suitable method for the empirical evaluation is a case study (Runeson and Höst, 2009). In vivo studies are hard to conduct as experiments, since the level of control usually is too low. Collected data will include tool usage statistics complemented by answers from interviews and a questionnaire distributed among involved practitioners. The plugin solution would also simplify expanding the study to multiple companies.