Investigating effort prediction of web-based applications using CBR on the ISBSG dataset



INTRODUCTION
With the increasing prevalence of web-based applications comes the requirement for the accurate estimation of the development costs associated with such applications. Accurate estimates are essential for companies to make competitive bids in the market and to efficiently resource development projects. In spite of considerable research in this area, no prediction techniques have proven to be consistently accurate. However, case-based reasoning (CBR) has been shown to be one of the stronger performing techniques, although this is usually in the context of traditional rather than web-based applications. This paper investigates the application of CBR to a set of web-based application data. Web applications are typically characterised by shorter development cycles, smaller teams, a variety of programming languages or frameworks, and less formal process control and estimation strategies (Reifer [11] contains a far more comprehensive comparison). The main purpose behind the study is to investigate the optimal number of analogies (i.e. how many of the most similar cases should be taken into account) to employ when making an estimate, but the aim is also to note any other issues that arise when using CBR on web application data.
The main finding of the paper is that there is no consistent improvement in the accuracy of the estimates as the number of analogies increases. This is shown to be due to a number of factors, including the similarity measure employed, which appears to bias numeric data, and the distribution of the data, in particular the presence of very small or very large values and the lack of an even distribution of values.

RELATED WORK
Recently, research on cost estimation for web applications has started attracting more attention. Various modeling techniques drawn from statistics, machine learning and knowledge acquisition have been used with various degrees of success and on a limited number of mostly small and medium size data sets [2,4,16].
Cost estimation methods can be divided into two categories: non-model based methods (expert knowledge) and model-based methods (cost estimation tools). Combinations of both approaches are known as composite methods [12].
Non-model based estimation methods require the heavy involvement of experts to generate an estimate of the new project [19]. The estimate will be based on the experts' accumulated experience rather than any particular model.
Model-based, or algorithmic, estimation methods are not dependent on an individual's capabilities but require past project data for model building. Examples are OLS regression, the Constructive Cost Model (COCOMO) [20], and Classification and Regression Trees (CART). COCOMO, for example, provides equations that incorporate system size as the principal effort driver. Predicted development effort is then adjusted to accommodate the influence of 15 additional cost drivers. The major drawback of this model-based approach is that such models perform poorly when applied to other environments [18].
In recent years, machine learning techniques have been used as a complement or alternative to the previous categories. A variety of machine learning methods have been applied, including artificial neural networks [4], rule induction algorithms [2], case-based reasoning [9,13,16,17], and hybrid approaches such as neuro-fuzzy methods and multiple learners [14].
The machine learning approach that has been explored most in the context of cost estimation is CBR. Case-based reasoning was first formalised in the 1980s, following on from the work of Schank and others on memory, and is based upon the fundamental premise that similar problems are best solved with similar solutions. CBR draws on the psychological concepts of analogical reasoning, dynamic memory and the role of previous situations in learning and problem solving. A CBR processing cycle is composed of four stages [1]: (1) Retrieve the most similar project case; (2) Reuse the project to attempt to solve the problem; (3) Revise the suggested solution if necessary; (4) Retain the solution and the new problem as a new project.
The appeal of CBR may rest on the fact that users may be more willing to accept a solution from a form of reasoning which is more akin to human problem solving, and even though there is no single best software cost estimation model, CBR is rated among the best methods in a variety of circumstances [12]. In addition to being intuitive and having a reasonable level of accuracy, CBR is also simple and flexible, and may be applied to both qualitative and quantitative data, reflecting typical industrial datasets [8].
CBR also has some disadvantages. As with algorithmic models, the effect of old data points is not clear. As an organization develops and successively introduces new technology, the older data points may become increasingly irrelevant and potentially misleading [13,16]. This needs more investigation, especially in the area of web applications, as there has been rapid change in terms of languages and technologies even in the short time that such systems have been around.
There are also a number of challenges to the effective application of CBR, some of which are general to a domain and others which may only be relevant to a particular dataset. The problems that most researchers encounter in applying CBR fall into the following categories [15]:

(i) Feature Selection
There are many features in the dataset but not all of them are necessarily relevant for predicting the project effort. Some may be redundant or contain errors.

(ii) Scaling
Scaling or standardization represents the transformation of attribute values according to a defined rule such that all attributes are measured using the same unit. Angel, for example, assigns zero to the minimum observed value and one to the maximum observed value.

(iii) Similarity Measure
A distance measure in CBR expresses the degree of similarity between two projects in terms of their effort drivers. Euclidean distance is most commonly used for this purpose. Similarity measures for categorical data typically employ a value of 1 to represent a match and 0 otherwise. This is an interesting point that demands further investigation.

(iv) How Many Analogies To Use
The number of analogies refers to the number of most similar cases that will be used to generate the estimate. Most of the previous work employs 1, 2 or 3 analogies, but there is no clear rule on how many analogies to use [6,9,10].

(v) Analogy Adaptation
Analogy adaptation concerns how to generate the estimate once the analogies are retrieved. Different approaches include using the mean of the analogies or the nearest neighbour.
Several papers have investigated this last aspect in detail [6,16], focusing on dataset size as one of the major factors concerning the accuracy of analogy-based methods by analyzing the trends in estimation accuracy as the datasets grow. Although the work of Kadoda et al. confirmed that analogy-based estimation achieves better results by employing larger training sets [6], Shepperd and Schofield claim that accuracy in analogy-based estimation does not always increase with the number of projects in the dataset, showing instead that it can be affected greatly by the introduction of outlying projects [16].
As discussed above, the question "Does accuracy improve as the number of project cases increases?" is still open. Much of the work in this area uses public datasets, many of which are old and do not contain web application data. Therefore it may be fruitful to investigate this question using a web application dataset.
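
The scaling, similarity and adaptation steps in categories (ii) to (v) above can be illustrated with a minimal Python sketch. It is not the Angel implementation: the function names, the feature keys and the sample values are hypothetical, categorical mismatches are assumed to add 1 to the squared distance, and adaptation is assumed to be the unweighted mean of the k nearest analogies.

```python
import math

def min_max_scale(values):
    """Map the smallest observed value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]

def distance(a, b, numeric_keys, categorical_keys):
    """Euclidean distance; a categorical mismatch is assumed to contribute 1."""
    total = 0.0
    for key in numeric_keys:
        total += (a[key] - b[key]) ** 2
    for key in categorical_keys:
        total += 0.0 if a[key] == b[key] else 1.0
    return math.sqrt(total)

def estimate_effort(target, cases, k, numeric_keys, categorical_keys):
    """Unweighted mean effort of the k cases closest to the target."""
    ranked = sorted(
        cases, key=lambda c: distance(target, c, numeric_keys, categorical_keys)
    )
    nearest = ranked[:k]
    return sum(c["effort"] for c in nearest) / len(nearest)

# Hypothetical data: three past projects and one new project (last size value).
sizes = min_max_scale([100, 200, 300, 150])
cases = [
    {"size": sizes[0], "language": "ASP", "effort": 400},
    {"size": sizes[1], "language": "Java", "effort": 1000},
    {"size": sizes[2], "language": "ASP", "effort": 2000},
]
target = {"size": sizes[3], "language": "ASP"}

# One estimate per number of analogies.
estimates = {
    k: estimate_effort(target, cases, k, ["size"], ["language"]) for k in (1, 2, 3)
}
```

With these made-up values the estimate moves from 400 for k=1 to 1200 for k=2, because the second-ranked analogy is a much larger project. This is only an illustration of how the choice of k can change an estimate, not a reproduction of any result in this paper.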

THE DATASET
The investigations in this paper are all based upon the International Software Benchmarking Standards Group (ISBSG) Release 10 dataset [5]. The data in the ISBSG repository have come from over twenty-five countries, with 60% of projects being less than 7 years old. Software practitioners voluntarily submitted the projects in the ISBSG dataset, which was collected using a questionnaire. The ISBSG collection pays much attention to the quality of the gathered data. There are special data validation forms, and the project managers are asked to report the confidence they have in the information they have provided [21]. A specific field contains a rating code of A, B, or C, applied to the project data by the ISBSG quality reviewers to denote the following: A = The submission satisfies all the criteria for seemingly sound data. B = The submission appears fundamentally sound but there is some evidence to question some of the supplied data. C = The submission has some fundamental shortcomings in the data.
As ISBSG point out, in any statistical analysis only projects with A and B ratings should be used. Of the 4,106 project summaries in the repository, 422 are related to web applications, and it is this subset which is the subject of this study.
The dataset covers a wide range of applications, development techniques and tools, languages and platforms. Of the total of 109 features that may potentially appear in the ISBSG dataset, just 9 were selected which are considered relevant to this work, or which could potentially have an impact on effort. Table 1 lists the features used in this study.

METHODOLOGY
The main aim of this paper is to investigate the impact of the number of analogies on the accuracy of estimates obtained through case-based reasoning. Consequently, the large dataset needs to be broken down into smaller subsets in order to provide more opportunities to experiment with using different numbers of analogies, and also to mimic more closely the data set size that is likely to be available in an industrial context. The 422 web application records in the ISBSG dataset were divided into 3 groups, each consisting of 67 unique records (cases). Care was also taken not to include any cases that are incomplete. Similarly to previous studies (e.g. [6]), in order to explore the impact of the number of cases, these three datasets are further subdivided (randomly again) to populate smaller datasets consisting of 17, 33, and 47 records. This exercise yields a total of twelve data sets: three initial groups (labelled G1, G2 and G3) each containing 67 cases, each randomly subdivided into groups of 17, 33 and 47, and labelled G1-Ran1-17, G1-Ran1-33, G1-Ran1-47, G1-Ran1-67, G2-Ran1-17, G2-Ran1-33, … G3-Ran1-67. This procedure is then repeated a further two times to guard against any freak results introduced by the randomising process [2], producing a second (G1-Ran2-17, G1-Ran2-33, … G3-Ran2-67) and third (G1-Ran3-17, G1-Ran3-33, … G3-Ran3-67) collection: thirty-six data sets in all.
The CBR tool Angel [16] was used for this experiment to determine the predicted value of the effort according to the jack-knife method (also known as leave-one-out cross-validation). This procedure is the same as that adopted by others, including Mendes et al. [10], and follows the procedure outlined below. This was applied to all 36 datasets.
In the Angel tool, similarity is defined as Euclidean distance in n-dimensional space, where n is the number of project features. Each dimension is standardized so that all dimensions have equal weight.
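
The labelling scheme for the thirty-six data sets can be sketched in Python as follows. This is a hedged illustration only: the function name `build_datasets`, the integer case identifiers and the fixed seed are hypothetical, and the sketch assumes that each smaller subset is drawn independently from its 67-case group and that the full group is reused unchanged in every randomisation.

```python
import random

def build_datasets(groups, sizes=(17, 33, 47), repeats=3, seed=0):
    """Return a dict mapping labels such as 'G1-Ran2-33' to lists of cases."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    datasets = {}
    for g, group in enumerate(groups, start=1):
        for r in range(1, repeats + 1):
            for size in sizes:
                # Assumption: each subset is an independent random draw.
                datasets[f"G{g}-Ran{r}-{size}"] = rng.sample(group, size)
            # The complete group (67 cases here) is kept as the largest set.
            datasets[f"G{g}-Ran{r}-{len(group)}"] = list(group)
    return datasets

# Hypothetical case identifiers standing in for three groups of 67 records.
groups = [list(range(start, start + 67)) for start in (0, 67, 134)]
datasets = build_datasets(groups)
```

Three groups, three randomisations and four sizes give the 36 labelled sets described above.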

RESULTS AND ANALYSIS
To gauge the accuracy of each estimated effort value, two values are calculated for each number of analogies (k) used for each dataset: the Mean Magnitude of Relative Error (MMRE) and PRED(25). The jack-knife procedure is as follows:

For each case in the data set:
  Discard the effort data for that case (marked as "unconfirmed", in order to simulate a new project)
  Using from 1 to 7 analogies:
    Use the remaining cases to estimate the effort for the unconfirmed case
  Restore the original effort value for the unconfirmed case and return it to the dataset

Commonly cited thresholds for an acceptable prediction system are an MMRE of at most 25% and PRED(25) values of at least 75%. The results for the MMRE are shown below. It has not been possible to include those for PRED(25) for reasons of space. The results are shown graphically in figures 1 to 9, where the number of analogies (k) is shown on the x-axis and the value of MMRE on the y-axis.
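
The jack-knife procedure and the two accuracy measures described above can be sketched in Python. The sketch uses the usual definitions, MRE = |actual - predicted| / actual, MMRE as the mean of the MRE values, and PRED(25) as the proportion of cases with an MRE of at most 0.25. The estimator `nearest_by_size` is a hypothetical stand-in that compares cases on a single numeric feature; it is not the Angel similarity measure, and the four sample cases are invented.

```python
def mre(actual, predicted):
    """Magnitude of relative error for one case."""
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    """Mean of the MRE values over all cases."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)

def pred(actuals, predictions, level=0.25):
    """Proportion of cases whose MRE is at most `level` (PRED(25) by default)."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)

def jack_knife(cases, k, estimate):
    """Leave-one-out: estimate each case's effort from all the other cases."""
    predictions = []
    for i, held_out in enumerate(cases):
        remaining = cases[:i] + cases[i + 1:]  # the held-out case is excluded
        predictions.append(estimate(held_out, remaining, k))
    return predictions

def nearest_by_size(target, cases, k):
    """Hypothetical estimator: mean effort of the k cases closest in size."""
    ranked = sorted(cases, key=lambda c: abs(c["size"] - target["size"]))
    nearest = ranked[:k]
    return sum(c["effort"] for c in nearest) / len(nearest)

# Invented sample cases, for illustration only.
cases = [
    {"size": 100, "effort": 500},
    {"size": 110, "effort": 600},
    {"size": 300, "effort": 2000},
    {"size": 320, "effort": 2400},
]
actuals = [c["effort"] for c in cases]

# One (MMRE, PRED(25)) pair per number of analogies.
results = {}
for k in (1, 2, 3):
    predictions = jack_knife(cases, k, nearest_by_size)
    results[k] = (mmre(actuals, predictions), pred(actuals, predictions))
```

Passing the estimator in as a function keeps the cross-validation loop independent of the similarity measure, so a different distance calculation can be substituted without changing the evaluation code.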

Fig. 9. Result of MMRE vs Analogies on Group3Ran3
There are two immediately notable results concerning the MMRE values. Firstly, none of the averages is anywhere near the 25% value; in fact, values below 100% are rare. Secondly, the graphs typically do not display any common trends. In some cases there is a general lowering of the MMRE values as k (the number of analogies) increases (for example Group2Ran3, which shows a gradual convergence as k gets larger), whilst other cases show completely the opposite trend, and others still display sudden peaks or troughs. The remainder of this section will attempt to provide an explanation for some of these more pronounced patterns by considering some particular questions.

What is the reason for the peak in the results for G1Ran2-33?
As can be seen, the results for this set show a very different pattern compared to G1Ran1-33 and G1Ran3-33 (drawn from the same set of 67 cases) and even to other configurations of the Group1 data (a similar shape can be observed in G1Ran3-49, but the peak value is considerably lower). Also, it is unusual that the MMRE value starts off as one of the lowest for k=1 and climbs to one of the highest for k=4. To investigate this result in more detail it is necessary to look more closely at the dataset (up to k=4 for space reasons), shown in table 2.
As can be seen for k=1, the most frequently predicted effort value is 47. This can be examined in more detail by looking at two different cases (those named 13700 and 10566) which have very different values of actual effort (352 and 8580 respectively) but which show the same predicted effort value of 47 when k=1. Clearly, the effort associated with all these closely ranked cases is some way off the target value (47), but that associated with the second, third, and particularly fourth cases is substantially different. So as k increases the MRE gets significantly larger: 115 (refer to footnote 4) for k = 3 and 146 (refer to footnote 5) when k = 4. Admittedly, this data point is the only one that has an MRE value of more than 100; the rest of the cases result in values less than 8, and the majority of them are less than 1. Nevertheless, this is the main reason that the MMRE is so large. It is a poignant illustration of the impact that outliers, or even the lack of close matches in the dataset, can have on the accuracy of effort predictions. Furthermore, it also demonstrates the rather unpredictable effect of increasing the number of analogies.
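
The arithmetic behind these observations can be checked with a short Python sketch. The actual efforts (352 and 8580), the predicted effort of 47, the single MRE of 146 and the set size of 33 are taken from the discussion above; the value of 0.5 assumed for each of the other 32 cases is purely hypothetical and is used only to show how one extreme value dominates the mean.

```python
def mre(actual, predicted):
    """Magnitude of relative error: |actual - predicted| / actual."""
    return abs(actual - predicted) / actual

# Predicting 47 for two very different projects: both underestimates
# give an MRE below 1, however poor the match is.
mre_13700 = mre(352, 47)    # (352 - 47) / 352
mre_10566 = mre(8580, 47)   # (8580 - 47) / 8580

# Hypothetical MRE values for the other 32 cases in a 33-case set.
other_cases = [0.5] * 32

# MMRE with and without the single extreme case (MRE = 146).
mmre_without_outlier = sum(other_cases) / len(other_cases)
mmre_with_outlier = (sum(other_cases) + 146) / (len(other_cases) + 1)

# Share of the summed error that comes from the one extreme case.
outlier_share = 146 / (sum(other_cases) + 146)
```

Under these assumptions a single case accounts for roughly nine tenths of the summed error, which is consistent with the point that one poorly matched project can determine the MMRE for the whole set. It also shows why an underestimate can never produce an MRE above 1, whereas an overestimate is unbounded.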

Why does G1-Ran3-33 display such a different trend to G1-Ran2-33?
In contrast to G1-Ran2-33, G1-Ran3-33 has a very different trend of MMRE values, showing a slight downward trend until k = 4 and a very slight increase thereafter. There are no peaks or extreme values as in the case of G1-Ran2-33, and the MMRE values range between 1.716 and 0.909. In some ways this is curious, as the pattern of data in the two sets is not dissimilar, as can be seen from the summary table below. Both have the same minimum and maximum values, so why does G1-Ran3-33 not display any of the extreme values of G1-Ran2-33? From tables 4 and 5 it can be seen that the MRE for the predicted effort based on one analogy is better for G1-Ran2-33 than for G1-Ran3-33. This is caused largely by the poor initial matches for G1-Ran3-33, but also by the frequent predicted effort of 47 for G1-Ran2-33, which is often a very poor match but still yields an MRE value of less than 1 (one of the weaknesses of the MRE calculation). In contrast, when four analogies are used the position is reversed and the top MRE values for G1-Ran2-33 are much higher (the value of 146 has already been illustrated) than those for G1-Ran3-33. These values are summarised in tables 6 and 7 below. Although the worst case for G1-Ran3-33 produces a very high MRE value (8.625), this is substantially lower than the value of 146 which is primarily responsible for the overall high MMRE for G1-Ran2-33. Looking at this worst case in more detail, it can be seen that the predicted effort values get closer to the actual effort (having started off some considerable distance away), which reduces the MRE. This is in contrast with the case of 13319 in G1-Ran2-33, where the values deviate even further as more analogies are brought into play.
13700, IFPUG, 352, Enhancement, Process Control, ASP, SQL SERVER, 133, 133
The nearest data points for 13700 are:
From this it could be argued that the distribution of projects in the dataset is important: rather obviously, a case base that does not contain projects that are remotely close to those for which predictions are being made is unlikely to produce accurate results. This point is illustrated by group G2-Ran3. The trend for all subcategories in this group is the same: initially disparate values for k=1 quickly converge to a much smaller range as k increases. The MMRE values are still too high for this to be considered a "good" prediction, but the pattern of the graph follows the shape that might intuitively be expected. The reason for this is that the group (and its subgroups) consists of data which is spread evenly from the lowest to the highest value. All groups have the same maximum (21700) but also contain other large values (19306, 14992 and 11165) which tend to be chosen as close matches to each other and result in relatively good estimates, or at least not very poor ones.
This appears to confirm the observations of Kadoda et al. [6] and Shepperd and Kadoda [15] that there is likely to be a strong interaction between the accuracy of a given prediction system and the underlying characteristics of the dataset it is applied to. However, looking at the graphs of the results, it does not appear that increasing the size of the dataset improves the accuracy of the prediction: larger datasets appear to display similarly erratic results to the smaller ones. This interaction between the dataset and the predictions can be clearly observed in figures 10 to 12, which group the results by different sized datasets.
... except where there is a close categorical match, where the effort is almost consistently 5.5 times the size. This may be coincidence but may also indicate data which comes from the same company or even the same team. Unfortunately, such information is not available in the data set for reasons of privacy, even though it is potentially useful in finding matching cases.

Questions arising from the Pred(25) results.
As mentioned at the start, the PRED(25) results are not included for reasons of space, even though they are considered preferable to MMRE for assessing the accuracy of prediction mechanisms, given the weaknesses associated with MMRE [3]. The PRED(25) results display similar characteristics to the MMRE results: no general trends regarding the accuracy of the estimate and the number of analogies, and a clear indication of the impact of the underlying data set.

CONCLUSIONS
The main finding of this study is that no reliable guidance can be given regarding the number of analogies that should be employed in making a prediction. In some cases there is a tendency for the data to converge as k increases, whilst in others it diverges. Most of the graphs suggest that the underlying data has a strong influence on the results. The results also do not give any confidence that increasing the size of the dataset results in more accurate predictions. In some cases the smallest set (17 cases) is the least accurate, but in others it is the most! The larger datasets (with 33, 49 and 67 values) tend to gravitate towards each other more and display less volatility, but their relationship to each other is not always predictable.
It was also found that outliers in the form of large or small values could affect the predictions. Related to this is the distribution of data within the dataset: those with a more even spread of data tended to produce lower MMRE values. The quality of the data set seems to play a major role in the precision of the prediction.
Another important result of this study is the relationship between the features used and the distance calculation. In this study only 8 features are employed, and only 2 of these are numeric: Functional Size and Adjusted Function Points (Effort is also numeric but is not employed in the distance measure as it is the value which is being predicted). The rest are categorical. Again, the characteristics of the dataset could influence the prediction accuracy because categorical data contributes either 1 or 0 to the distance calculation depending on whether there is a match or not. As a consequence, the numeric values appear to dominate the distance calculation, resulting in cases which are arguably slightly poorer matches being ranked higher than apparently better ones.
Future work in this area will aim to address these issues, particularly those relating to the spread of data and the distance calculation (and the subsequent adaptation of the analogy) with the aim of making the use of CBR for effort prediction more reliable.

Table 1. Description of selected features

Table 3. Summary of Particular Group Dataset

Table 8: Predicted
When using only one analogy there is obviously no opportunity to average the results, and so the difference in the value of effort could affect the result. In G2Ran2-17 there are two big values (14992 and 11165); the next value is 5018, followed by 3303 and lower. The presence of these high values could skew the effort predictions. We investigate this further by looking at the result of the data set in both groups (see table 8).