Software Developers’ Information Needs: Towards the Development of Intelligent Recommender Systems

Software development is a process that relies heavily on information and, in the context of the Internet, on Information Retrieval (IR) tools. Approximately 20%-30% of software developers’ work time is spent on information retrieval, and this proportion would be significantly higher if it were not for time constraints and pressure to deliver code. Even though a number of IR solutions exist, 86% of work-related search sessions start with a general-purpose search engine. There is therefore significant potential for research and development of ubiquitous, domain-specific tools to support the IR process. This paper discusses how knowledge of the work tasks and information needs of software developers can be used to deliver ubiquitous, highly context-sensitive search and intelligent recommendation tools. We present a detailed review of software developers’ work-related tasks and habits. We also discuss factors that can be used as implicit feedback indicators for further collaborative filtering.


INTRODUCTION
Software development is a cognitively complex task (Détienne & Bott, 2002; Xu, 2004). As a consequence, even before the widespread use of the internet, software developers spent a significant proportion of their work-related time searching for various sources of information (Curtis et al. 1988; King & Griffiths, 1991; Seaman, 2002; Liu, Chen, Lakshminarayanan & Perry, 2005; Singer et al. 2010). The introduction of the internet into the software developers' workplace did not change this proportion significantly (Freund, Toms & Waterhouse, 2005; Brandt et al. 2009), and the ability to find and use various sources of information is still one of the key abilities determining engineers' efficiency and success (Tenopir & King, 2004). The availability of the internet did change the way developers acquire information, which previously depended heavily on communication with co-workers (Hertzum & Pejtersen, 2000). Research suggests that the easy availability of online information changes the social settings of work and discourages developers from interacting with their colleagues unless it is essential (Ye, Yamamoto & Nakakoji, 2007). On the other hand, it allows developers to distribute their cognitive resources (Hollan et al. 2000) across the web and engage in practices of opportunistic programming (Hartmann, Doorley & Klemmer, 2008; Brandt et al. 2009), reusing sections of code varying in length and complexity from single lines of text to entire frameworks and libraries. With the increasing number of open source projects and frameworks, and the popularisation of technical blogs and libraries, software developers have started to perceive the internet as their "key information resource" (Brandt et al. 2009) and feel that, if it were not for commercial constraints, they would spend an even higher proportion of time searching it for project-related information (Freund et al. 2005).
The results illustrated in Table 1, along with the reports of a continuous feeling of "time famine" among software developers (Perlow, 1999), motivate research into domain-specific IR tools for software developers. Even though a significant number of search tools and utilities designed for software developers exist (Poshyvanyk et al. 2006; Begel, 2007; Hoffmann et al. 2007; Gallardo-Valencia et al. 2009; Bajracharya et al. 2009; Brandt et al. 2010), they tend not to be used by software developers, who in most cases rely on a general-purpose search engine (Umarji et al. 2008).
The reason for this behaviour is not fully understood, but there are indications that it is due to insufficient understanding of the tasks the software developer engages in. This in turn makes it very difficult not only to design usable and task-focused user interfaces but also to control the results of the IR system so that they meet the needs of the current task at hand.
In order to deliver fully effective domain / task specific IR tools, the design process has to take all of the characteristics of the task and technological limitations into account (Järvelin & Ingwersen, 2004).
According to the Cognitive Approach to IR (Ingwersen & Järvelin, 2005), to be able to design usable IR solutions we have to take into account the individual user's cognitive space, the user's social-organizational environment, the IR interface / intermediary, information objects and the IR system settings.
Even though a significant amount of literature aims to understand software developers' information behaviour, both in the domain of Psychology of Software Development (Sim et al. 1998; LaToza et al. 2006; Umarji et al. 2008; Janjic et al. 2010) and in Human Information Behaviour research (Ellis & Haugan, 1997; Freund et al. 2005; Buckley et al. 2006), it does not provide a consistent description of software developers' task distribution and does not address the variations of the factors laid out in the Cognitive Approach Theory.
This paper provides a detailed overview of the distribution of the tasks and information needs of the software developer / software engineer. It argues that a very detailed understanding of the tasks of the software developer can lead to the development of very efficient and highly usable IR solutions. The argument is supported by a detailed analysis of the task of software delivery, especially with regard to the practice of opportunistic programming. The paper concludes with an analysis of the results of two user studies supporting the feasibility of dedicated IR tool development for this particular activity.

SOFTWARE DEVELOPERS' TASK DISTRIBUTION
One of the most comprehensive investigations of Software Engineers' task distribution in the context of IR was carried out by Freund et al. (2005). The research provides an overview of the changing information needs and the strategies used to find the required information. Software Developers also have significant information needs across their work-related tasks and use the internet even more frequently than the more general group of Software Engineers. They use the internet not only as a knowledge source but also as a source of snippets of varying complexity and length that are reused in the process of opportunistic programming (Brandt et al. 2009).
Their work distribution differs significantly from that of Software Engineers in the broad sense, as they focus mainly on low-level analysis and design, implementation, testing, optimization and debugging, code deployment, and code maintenance.
To understand the task distribution as well as the tools and most common practices of Software Developers, LaToza et al. (2006) conducted two surveys and eleven interviews with a total of 187 software developers at Microsoft Corporation. The results revealed the existence of 9 main development tasks. They also revealed a correlation between the stage of the project and the amount of time spent on particular tasks. The overall distribution of the tasks and their quantity remained the same across the stages. Most of the developers engage simultaneously in more than one activity during the week, with the majority engaging in all 9 activities all the time. The results pointed to statistically significant correlations, both positive and negative, between the various tasks, revealing the sequential nature of many of them.
The research carried out by Brandt et al. (2009) suggests that the way software developers search for code can be classified on a linear scale depending on the novelty of the information required. The position of the information on that scale affects not only the way users search for code and the type of code, but also the characteristics of the search process. A similar classification was reported by Umarji, Sim & Lopes (2008), who also classified the web searching behavior of coders based on the same spectrum (code for reuse/reference example). An additional dimension was added to the classification: the size of the code-related content being investigated. The experiment was carried out on a group of 69 professional software developers and, based on the results and the proposed classification, 9 archetypes of behavior were identified, spanning most of the code development related IR cases. This archetypal classification was then further extended to support the variation of developers' behavior across different stages of the project (Janjic et al. 2010). The results illustrate that the different tasks / archetypes of behavior are not equally probable during different stages of the process. It is also important to note that the stage of the process has a direct impact on the information needs, decreasing their abstraction level with the maturity of the project or project iteration. This supports the results from LaToza et al. (2006). Their empirical results also clearly demonstrate the variation in the importance of different tasks across different stages of the development lifecycle. The results of their study show that, as the project matures, developers spend less time delivering new features and focus more on bug fixes and code maintenance.
Additionally, their results suggest that most of the specific information, related to the usage of code, a library or a framework, is reported during the implementation and testing stages of the project. Other stages of the project involve the retrieval of more abstract information. In this research, though, the notion of Software Developer was used quite arbitrarily, and in a significant proportion of projects the phases of analysis and design as well as deployment and maintenance would be assigned to different engineers in the company.
The questionnaire-based user study carried out on a group of 67 software developers by Sim et al. (1998) also identified a similar set of work-related tasks. The tasks were classified based on the search motivation and were reported to span the majority of software development activities. The most common search tasks reported were: defect repair, code reuse, program understanding, feature addition, and impact analysis. On top of this classification, 11 archetypes of software development behavior were designed and used for an in-depth analysis of the search process, search behavior and the decision making process involved in search.
Overall the literature is quite consistent in its description of Software Developers' work tasks, and similar or even identical tasks are observed. The literature is also quite consistent regarding the impact that the project role and project stage have on the set of software-related responsibilities and, as a consequence, on the search requirements. In all of the cases, architectural, code maintenance, code testing and development tasks required the use of different data sources and different search strategies, and also resulted in different uses of the captured information. The next section of this paper focuses exclusively on tasks related to code delivery and, more precisely, on the opportunistic programming approach to code development.

SOFTWARE DELIVERY AND OPPORTUNISTIC PROGRAMMING
The proportion of time software developers spend on code delivery varies with the development team, the distribution of responsibilities and the project. Code delivery is obstructed by multiple interruptions and by gaps in developers' knowledge, which require the use of various forms of communication. For all of them, though, the delivery of code is the key activity and the overall goal of their work. Software engineers are strongly results-driven and, unless it is necessary, in many cases they are not interested in fully understanding the problem or the piece of code being implemented. As a consequence, software engineers often copy solutions that match their problem only very loosely and "see what happens", engaging in the process of programming by example/iterative programming (Brandt et al. 2008; Yeh et al. 2008). The ability to search, copy and paste and then experiment on sections of third-party code is very valuable to programmers because it allows them to distribute their cognitive resources across the internet (Hollan et al. 2000). Thanks to the availability of online resources they can ignore the implementation details of simple snippets of code, bigger sections and modules, focusing exclusively on the task of gluing them together. The literature suggests that the extent of code reuse through copy and paste is very high and reaches up to 30% of the code in the solution (Brandt et al. 2009). The research carried out by Kim et al. (2004) identified that the average user performs approximately 16 copy and paste operations per hour, of which 2 involve non-trivial sections of code, and all of them carry relevant information. Similarly, the research carried out by Umarji et al. (2008) reveals that 73% of search sessions could be classified as "searching for code for reuse". From the perspective of this discussion, the most interesting type of search identified, namely the one that relates to the non-trivial copy and paste operations in the previously discussed research, is the search for subsystem components (41%).
The information listed above points to significant potential for the development of ubiquitous search and recommendation tools aiming to solve subsystem-related code retrieval problems. From this information a number of design-related conclusions can be drawn:
(i) The act of copy and pasting is an implicit indicator of relevance. The ability to capture the information related to copy and paste allows us to create a knowledge base containing the keywords used for search, the content that was copied and the ID of the user who copied the information. This knowledge can quite easily be used with existing collaborative filtering approaches to provide recommendations of related content.
(ii) The information regarding subsystem-level code should not necessarily be provided directly in the development environment, as it cannot simply be copy and pasted into the solution but will instead require significant modifications to contextualize it. Instead it should be accessible through the browser or even the search engine page, providing a single and consistent search and recommendation user interface.
(iii) The summary information related to the recommended content should be more focused on the functionality and on the textual content used by other developers to describe the problem.
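As an illustration of the first design conclusion, the following minimal sketch treats each captured copy and paste as an implicit relevance vote and applies a very simple user-based collaborative filtering step. It is not the system used in this study: the `CopyPasteEvent` record, its field names, the sample data and the overlap-count scoring are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class CopyPasteEvent:
    """One captured copy and paste; all field names are hypothetical."""
    user_id: str     # ID of the developer who copied the content
    query: str       # keywords of the search that led to the content
    content_id: str  # identifier of the copied content (e.g. a hash)


def build_user_items(events: List[CopyPasteEvent]) -> Dict[str, Set[str]]:
    """Treat every copy and paste as an implicit 'relevant' vote."""
    user_items: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        user_items[event.user_id].add(event.content_id)
    return user_items


def recommend(user_id: str, events: List[CopyPasteEvent], top_n: int = 3) -> List[str]:
    """Rank content the user has not copied yet, weighting each candidate
    by how much its copiers overlap with the user's own copy history."""
    user_items = build_user_items(events)
    seen = user_items.get(user_id, set())
    scores: Dict[str, int] = defaultdict(int)
    for other, items in user_items.items():
        if other == user_id:
            continue
        overlap = len(seen & items)
        if overlap == 0:
            continue  # no shared history, so no evidence of similar needs
        for item in items - seen:
            scores[item] += overlap
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Invented sample data, for illustration only.
events = [
    CopyPasteEvent("dev_a", "list query", "snippet_1"),
    CopyPasteEvent("dev_a", "list query", "snippet_2"),
    CopyPasteEvent("dev_b", "query example", "snippet_1"),
    CopyPasteEvent("dev_b", "query example", "snippet_3"),
    CopyPasteEvent("dev_c", "deployment", "snippet_4"),
]
print(recommend("dev_a", events))
```

The search keywords are stored but not used in this scoring; a fuller design could use them to restrict recommendations to content copied after similar queries.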
The next sections of this paper will describe the user studies designed to validate those design conclusions.

METHODOLOGY
The goal of the experiment was to assess the following:
• Whether the copy and paste action is indeed an indicator of relevance.
• Whether copy and paste occurs frequently enough to make it possible to quickly and effectively build a comprehensive knowledge base.
• What the nature of the copy and pasted content is (whether block, subsystem or system information is most frequently copy and pasted).
In order to answer these questions, a two-stage experimental process was designed.

Feasibility assessment
The feasibility assessment stage was carried out with 12 professional software developers. All participants were men with at least 5 years of professional software development experience. All of the participants specialized in Microsoft .NET technologies with a special emphasis on MOSS2007 development. All of the participants were engaged almost exclusively in code delivery and tests. The user study took the form of a questionnaire with a variety of open-ended and close-ended questions, followed by a short interview. Most of the questions provided the user with a set of exemplary answers, followed by a number of empty text fields requiring more than one answer from the user.

Automated observation of user behaviour
A user study was carried out on the same group of software developers. The study used a number of browser add-ins (Chrome, Firefox, IE) and a development environment add-in (Visual Studio 2003 and upwards) designed to automatically capture user interaction with information.
The automated observation lasted for a different period of time for each participant, depending on the participant's availability. For the purpose of the data analysis a common period of 2 months was chosen, between the 4th of January 2011 and the 1st of March 2011. The results reported in this paper are based mainly on the output of the development environment add-ins. The output of the add-ins is composed of a number of reports. A report with a detailed description of user interaction is sent when the document on which the developer worked is closed. For the purpose of this analysis 665 non-trivial reports were selected. A report is classified as non-trivial if the time the user spent working on the document is above the selected threshold (20 seconds) or the report contains either 'copy & paste' or 'focus change' information.
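The non-trivial report rule described above can be expressed as a simple predicate. The sketch below is illustrative only: the `Report` structure and its field names are assumptions, since the actual report format produced by the add-ins is not described here.

```python
from dataclasses import dataclass

ACTIVE_TIME_THRESHOLD_SECONDS = 20  # threshold stated in the text


@dataclass
class Report:
    """Hypothetical shape of one add-in report (one interaction with a document)."""
    active_seconds: float        # time the user spent working on the document
    copy_paste_count: int = 0    # number of 'copy & paste' entries in the report
    focus_change_count: int = 0  # number of 'focus change' entries in the report


def is_non_trivial(report: Report) -> bool:
    """A report is non-trivial if the active time is above the threshold
    or it contains any copy & paste or focus change information."""
    return (
        report.active_seconds > ACTIVE_TIME_THRESHOLD_SECONDS
        or report.copy_paste_count > 0
        or report.focus_change_count > 0
    )


# Invented sample reports, for illustration only.
reports = [Report(25.0), Report(5.0, copy_paste_count=1), Report(5.0)]
non_trivial = [r for r in reports if is_non_trivial(r)]
print(len(non_trivial))
```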

RESULTS

Feasibility assessment
The first goal of the user study was to identify how frequently software developers used the internet to solve their work-related problems and how frequently they reused online content in their applications. Table 2 shows the normalized frequencies on a per-day and per-working-year basis. It is important to note that those results are self-reported and therefore had to be verified by the second stage of the user study. The second goal of the feasibility study was to identify what happens to the relevant information once it is found. Table 3 summarizes the most frequently reported actions. All of the participants reported that they often copy and paste the relevant information or, alternatively, use it directly and immediately to solve their development problems. Two respondents, when discussing the question during the short interview, reported that they often copy even very loosely related code to obtain a template for further work.
The third goal was to identify the information that is searched for on the internet. Table 4 lists the most frequently reported information needs. Only the search for code examples / snippets / modules was selected by all of the users, but there is still very high consistency in the answers provided by the users. During the post-questionnaire interview participants highlighted the fact that they most frequently search for highly processed information.
If code examples / code solutions are not available or not appropriate to the problem, they usually search through highly processed information sources such as blogs containing other people's descriptions of the problems.

Automated observation of user behaviour
We recorded a total of 1296 non-trivial copy and pastes, of which 379 originated from outside the development environment. Our results indicate that the presence of copy and pasting from the browser is an indicator of higher problem complexity, which is manifested by a much longer average active time across sessions that involved copy and pasting from the browser. Figure 1 illustrates the distribution of copy and paste actions in time. The results show that even though copy and pasting from the browser is not as frequent as copy and pasting within the development environment, it is much more frequent than reported in the feasibility study. The distribution of browser interaction seems to have a normal shape, with an average of 120 minutes, Q1 = 26 min and Q3 = 132 min (as shown in Table 5).
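Summary statistics of this kind (mean and quartiles of the time between consecutive copy and paste actions) can be computed as sketched below. The timestamps are invented for illustration and are not the study data. The quartile method (`"inclusive"`) is also an assumption, as the method used for Table 5 is not stated.

```python
from statistics import mean, quantiles
from typing import Dict, List


def gap_summary(paste_times_min: List[float]) -> Dict[str, float]:
    """Summarise the gaps (in minutes) between consecutive paste events.

    Needs at least three timestamps, because quartiles require
    at least two gaps.
    """
    ordered = sorted(paste_times_min)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    q1, median, q3 = quantiles(gaps, n=4, method="inclusive")
    return {"mean": mean(gaps), "q1": q1, "median": median, "q3": q3}


# Hypothetical paste timestamps in minutes since the start of observation.
summary = gap_summary([0, 20, 50, 150, 400])
print(summary)
```

Comparing the mean with the median and the quartiles in this way is also a quick check on how symmetric the distribution of gaps actually is.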
Table 6 shows the results of the analysis of the copy and pasted text depending on the source of the copy and paste. The average number of characters copy and pasted from the browser was more than two times higher than the average number of characters copied from other sources. On the other hand, users copied information from the browser much less frequently than internally in the development environment. Single copy and pastes account for 60% of copied data, with a maximum of 8 copy and pastes in a single report (a single interaction with a source code file).

Figure 1: The average time between the copy and paste actions.
The data for development environment copy and pasting also contains a lot of isolated events, with 60% of copy and pastes occurring only 1-3 times in the report (a single interaction with a source code file). The remainder of the reports contain significantly higher values, ranging 44-46, indicating a significant amount of work carried out on the document.
Finally, the relationship between copy and pasting and project building was investigated:
• When the user copy and pasted information from the browser, the build action was executed in 57.5% of cases after the paste and with the focus in the same window.
• When the user copy and pasted information from the development environment, the build took place in 11.14% of cases after the paste and with the focus in the same window.
This indicates that the user was not certain about the results of copy and pasting from the browser and used the build option to validate the correctness of the code and to further test it.

Average number of copy and pastes from the browser in a single report (reports having just one copy and paste were excluded from the calculation): 3.86
Average number of copy and pastes internally in the development environment: 5
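The build-after-paste proportions reported above could be derived from a report's action log along the following lines. This is a sketch under stated assumptions, not the analysis code used in the study: the `Action` record and its fields are hypothetical, and "after the paste and with the focus in the same window" is interpreted here as "a build in the same window occurs before the next focus change".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Action:
    """Hypothetical log entry: kind is 'paste', 'build' or 'focus_change'."""
    kind: str
    window: str                   # window (document) that has the focus
    source: Optional[str] = None  # 'browser' or 'ide'; only set for pastes


def build_after_paste_rate(actions: List[Action], source: str) -> float:
    """Share of pastes from `source` followed by a build in the same
    window before the focus changes."""
    pastes = 0
    followed = 0
    for index, action in enumerate(actions):
        if action.kind != "paste" or action.source != source:
            continue
        pastes += 1
        for later in actions[index + 1:]:
            if later.kind == "focus_change":
                break  # focus left the window before any build
            if later.kind == "build" and later.window == action.window:
                followed += 1
                break
    return followed / pastes if pastes else 0.0


# Invented action log, for illustration only.
log = [
    Action("paste", "Form1.cs", "browser"),
    Action("build", "Form1.cs"),
    Action("paste", "Form1.cs", "ide"),
    Action("focus_change", "Utils.cs"),
    Action("paste", "Utils.cs", "browser"),
    Action("focus_change", "Form1.cs"),
    Action("build", "Form1.cs"),
]
print(build_after_paste_rate(log, "browser"), build_after_paste_rate(log, "ide"))
```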

Summary
Overall, the copy and paste action was indeed reported by all of the participants as an indicator of relevance. The results of our feasibility study and the automated observation were further supported by the interviews that concluded the feasibility study. On the other hand, the perception of relevance reported by the participants is quite complex and related not only to the similarity of the code to the problem they want to solve but also to the completeness of the code and its usage license.
Secondly, the copy and paste action of subsystem components did occur frequently enough to consider its usage as a source of implicit relevance feedback, with a non-trivial copy and paste occurring on average every 120 minutes.
The code that was copy and pasted from the browser was complex and averaged 230.05 characters in length (with values ranging from 6 to 2031 characters).

CONCLUSIONS
The literature review presented in this paper illustrates that the software development process is composed of a fixed number of repeatable development tasks. Across each of those tasks, and also across different stages of the software development lifecycle, a software developer faces a fixed number of information need types. One such need is the need to reuse code found on the internet to decrease the cognitive burden of software development and increase the speed of development. The further analysis and the user studies indicate that it is possible to infer content relevance from the copy and paste action, which can be perceived as a strong implicit feedback indicator.
Additionally, the analysis of the software development tasks allowed us to create a number of user interface guidelines for the process of retrieving non-trivial code snippets.
More importantly, the paper provided an example of the process that can be undertaken in order to design and then deliver domain-specific recommendation systems. The process is based on a detailed analysis of the user group's task distribution, isolation of the information-intensive tasks that are characteristic of the user group, and step-by-step analysis of user behaviour during the execution of the task in the work environment. By applying this process, the results of this user study can be further generalised to other domains.
It is also worth noting that such domain-specific recommendation systems are already designed and successfully used, especially in e-commerce, even though a detailed design methodology does not yet exist:
• Amazon uses the events of purchase and product visit as indicators of relevance.
• Last.FM uses the event of music playback as a source of implicit information.
• Gmail uses a combination of email-related indicators, such as reading or replying, to infer the relevance / priority of new incoming email.
Therefore further formalisation of the process has significant potential and can lead to the development of numerous domain/task-specific IR tools supporting business-critical tasks in the modern enterprise.

Table 1: Average amount of time software developers spend on information retrieval / communication

Table 2: Frequency of Internet resource usage

Table 3: The reported usage of relevant information

Table 4: The most frequently required information

Table 5: Active times of interaction: general statistics

Table 6: Copy and pasted content analysis