Investigating the Use of Chronological Splitting to Compare Software Cross-company and Single-company Effort Predictions: a Replicated Study

CONTEXT: Three previous studies have investigated the use of chronological splitting to compare cross- to single-company effort predictions, and all three used the ISBSG dataset release 10. These studies therefore need to be replicated using different datasets, so that the patterns previously observed can be compared and contrasted, and a better understanding of the use of chronological splitting can be reached. OBJECTIVE: The aim of this study is to replicate [17] using the same chronological splitting but a different database, the Finnish dataset. METHOD: Chronological splitting was compared with two forms of cross-validation. The chronological splitting used was the project-by-project chronological split, in which a validation set contains a single project, and a regression model is built from scratch using as training set the set of projects completed before the validation project's start date. We used 201 single-company projects and 593 cross-company projects from the Finnish dataset. RESULTS: Single-company models presented significantly better prediction accuracy than cross-company models. Chronological splitting provided significantly worse accuracy than leave-one-out and leave-two-out cross-validation when based on single-company data, and provided similar accuracy when based on cross-company data. CONCLUSIONS: Results did not seem promising when using project-by-project splitting; however, in a real scenario, companies that use their own data can only apply some sort of chronological splitting when obtaining effort estimates for their new projects. We therefore urge the use of chronological splitting in effort estimation studies, so that more realistic results can be provided to inform industry.


INTRODUCTION
Numerous software companies find it difficult to accumulate data on their past finished projects, and yet, to remain competitive, they must provide project effort estimates that are accurate. This situation has motivated researchers to investigate the use of cross-company project data to estimate effort for projects that belong to a single company, where this company's projects are omitted from model construction [11]. Numerous studies have been conducted to date, but their results have not converged. Some have found that cross-company models did not present significantly worse prediction accuracy than single-company models [1], [5], [24], [27], [19], [17]. Some have found that cross-company models did present significantly worse prediction accuracy than single-company models [8], [9], [11], [23], [27]. Others were inconclusive for various reasons. The normal approach in these studies, and indeed in almost all work in software engineering that builds effort estimation models from historical data, involves separating the data into a training set (from which the model is built) and a validation set (used to assess a model's accuracy). When using leave-one-out cross-validation, each of the N projects in the data set is estimated in turn using a model built from the other N-1 projects. Alternatively, the data can be allocated randomly or in a stratified way into a single training set (usually of about two thirds of the data) and a single validation set. Bootstrap and k-fold validation fall somewhere in between, involving multiple repetitions of model building and validation on different training and validation sets. Almost invariably, the assignment of projects to training and validation sets is done without regard to the completion date of the projects. This makes it very likely (in leave-one-out cross-validation it is certain) that the data used to build a model to estimate effort for a given project p include projects that were completed after p was finished.
However, in a real scenario, a developer building a model to estimate effort for a new project could only consider a set of projects already completed. In other words, future projects could not be considered. To reflect this reality when building effort estimation models from historical data, the allocation of projects to training and validation sets should be based on a chronological split. Three previous studies have recently investigated the use of different types of chronological splitting for software effort estimation. Lokan and Mendes (2008a) (CS1) [17], and Lokan and Mendes (CS2) [19] investigated chronological splitting when comparing cross- to single-company effort models. CS1 used a project-by-project chronological split, whereas CS2 used a date-based chronological split. A date-based chronological split consists of choosing a splitting date d, such that both the cross- and single-company training sets contain only projects completed prior to d, and the validation set contains only projects started at or after d. Later, Lokan and Mendes (2008b) compared project-by-project splitting to date-based chronological splitting, but without taking into account the issue of cross- versus single-company predictions. All three studies used the ISBSG dataset release 10, and unanimously found no differences in prediction accuracy between the use of chronological splitting and other commonly used techniques (e.g. leave-one-out cross-validation). Given that these three studies used a single dataset, we conjectured as to whether results would differ if a different dataset were employed. Therefore, the aim of this paper is to replicate CS1 using the Finnish dataset.
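As an illustration of the date-based split just described, the following minimal Python sketch partitions a project list around a splitting date d. It is not the code used in CS2; the Project record, its field names and the sample dates are hypothetical and serve only this example.

```python
from datetime import date
from typing import NamedTuple


class Project(NamedTuple):
    """Hypothetical project record; only the dates matter for the split."""
    name: str
    start: date
    end: date


def date_based_split(projects, d):
    """Training set: completed before d. Validation set: started at or after d."""
    training = [p for p in projects if p.end < d]
    validation = [p for p in projects if p.start >= d]
    return training, validation


# Made-up sample data
projects = [
    Project("A", date(2001, 1, 10), date(2001, 6, 30)),
    Project("B", date(2001, 5, 1), date(2002, 2, 15)),
    Project("C", date(2002, 1, 5), date(2002, 9, 1)),
]
training, validation = date_based_split(projects, date(2002, 1, 1))
```

Under this definition a project that is still active at d, such as B in the sample data, falls into neither set.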
As in CS1, the research questions addressed by this study are as follows:
1. Using a project-by-project chronological splitting for both cross- and single-company projects:
a) How successful is a cross-company model at estimating effort for projects from a single company, when the model is built from a data set that does not include projects from that company?
b) How successful is a cross-company model, compared to a single-company model?
2. Using a leave-one-out cross-validation approach:
a) How successful is a cross-company model at estimating effort for projects from a single company, when the model is built from a data set that does not include projects from that company?
b) How successful is a cross-company model, compared to a single-company model?
3. Using a leave-two-out cross-validation approach:
a) How successful is a cross-company model at estimating effort for projects from a single company, when the model is built from a data set that does not include projects from that company?
b) How successful is a cross-company model, compared to a single-company model?
4. Does the use of a project-by-project chronological split, instead of a leave-one-out cross-validation, affect the accuracy of the models?
5. Does the use of a project-by-project chronological split, instead of a leave-two-out cross-validation, affect the accuracy of the models?
6. Does the use of a leave-two-out cross-validation, instead of a leave-one-out cross-validation, affect the accuracy of the models?
The main contribution of this paper is therefore to replicate CS1 using a different dataset.
The remainder of the paper is organised as follows: Section 2 briefly summarizes related work. Section 3 describes the research method employed in this study. Results are presented in Section 4, and research questions are addressed in Section 5. A comparison between our results and those from the original study is discussed in Section 6, followed by a Section on threats to validity, and finally conclusions and directions for future work in Section 7.

Cross- versus Single-company Models
Two systematic reviews have been published on how the accuracy of cross-company models compares with that of single-company models [12][21]. Each found ten studies. Five further papers have since addressed the question [16], [19], [24], [27], [29].
The conclusions of the studies vary, and it is not clear what characteristics of the data sets and analysis methods affect the outcome. Kitchenham et al. [12] note that there are no consistent patterns concerning quality controls on data collection, the quality of the overall study, the size metrics used, the procedure used to build the single-company model, or the strength of the cross-company relationship.
If there is a trend emerging, it is to do with the homogeneity of the data. Kitchenham et al. [12] also note that all studies in which single-company models were significantly better than cross-company models used small single-company data sets; they conjecture that large data sets from a single company may be an indication of the size of the company and the homogeneity of the data set. The range of effort values in the single-company data set, compared to the range in the cross-company data set, also seems important: the greater the difference, the less likely it is that the cross-company model will be accurate for single-company projects. Premraj and Zimmermann [29] studied the issue of homogeneity by grouping projects by business sector before building cross-company models. While their results varied for different companies, they concluded that it is better to train models using only homogeneous data rather than all data available. Lokan and Mendes [16] applied Mendes et al.'s (S2a) [25] experimental procedure to the ISBSG database version used in Jeffery et al. (S1a) [9], to assess whether differences in experimental procedure contributed to the different results of [9] and [25]. This work was later extended [24] by applying the experimental procedure of S1a to the data set used in S2a. By investigating the effect of all the variations between S1a and S2a, they concluded that differences in data preparation and analysis procedures did not affect the outcome of the analysis. Thus, the different results of S1a and S2a were due to fundamental differences in the data sets. Mendes et al. [27] replicated the study of Mendes and Kitchenham [23] using data on Web projects from the Tukutuku database, volunteered after [23] was conducted. Their results corroborated most of those in [23], except for one of their regression-based cross-company models, which showed predictions similar to those of the single-company model, thus contradicting the findings from [23].
As previously mentioned in the Introduction, Lokan and Mendes (2008a) [17], and Lokan and Mendes [19] investigated chronological splitting when comparing cross- to single-company effort models. The former used a project-by-project chronological split, whereas the latter used a date-based chronological split. Both found no differences in prediction accuracy between cross- and single-company predictions.

Use of Chronological Data
Although chronological splitting has been applied as a data-splitting approach in several other domains, few papers in software engineering have used chronological splitting. Two studies, Lefley and Shepperd [15] and Sentas et al. [30], used a chronological splitting where a set date was chosen and used to split their data; however, they did not investigate chronological splitting as a research question in its own right. Lokan and Mendes (2008a) [17], and Lokan and Mendes (2009) [19] investigated as a research question in its own right the use of chronological splitting to compare cross- and single-company predictions. They found that the use of chronological split, and the use of two different dates to split the data chronologically, did not seem to improve the accuracy of either single- or cross-company estimates. Lokan and Mendes (2008b) also investigated the use of chronological splitting as a research question in its own right; however, they did not compare cross- to single-company predictions. Auer and Biffl [2] and Auer et al. [3] considered chronology directly in their research into estimation by analogy. They tracked changes in accuracy as the portfolio of completed projects grew. However, they did not consider chronological splitting vs. other methods as a separate research question. We are also aware of one study, by Kitchenham et al. [13], which considered a growing data set and whether a sliding window should be used. They found that when they ordered their projects by start date and then divided them into four equal-sized subsets, the regression models changed between the subsets. As a result they argued that old projects should be removed from the data set as new ones were added, so that the size of the data set remained constant. They recommended that the estimate for project n should be based on projects n-30 to n-1: a sliding window of 30 projects.
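The sliding-window recommendation can be sketched in a few lines of Python, assuming the projects are held in a list already ordered by start date and indexed from zero. The function name and signature are illustrative and do not come from [13].

```python
def sliding_window(projects_by_start, n, window=30):
    """Return the training projects n-window .. n-1 for the project at index n.

    Early in the sequence fewer than `window` projects are available, so the
    slice is simply shorter.
    """
    return projects_by_start[max(0, n - window):n]


# Hypothetical portfolio of 100 projects, identified by their position
portfolio = list(range(100))
recent = sliding_window(portfolio, 50)
early = sliding_window(portfolio, 10)
```

With a window of 30, the training set for project 50 is projects 20 to 49, so its size stays constant once the portfolio is large enough.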
Another two studies have addressed a slightly different problem, using information from previous phases within a project to estimate effort in later tasks in the same project. MacDonell and Shepperd [20] studied effort distribution in major waterfall phases in 16 projects. They found the patterns of effort distribution between phases varied too much for this approach to work by itself, though it was helpful in conjunction with expert estimation. Abrahamsson et al. [1] were more successful using two projects developed using an extreme programming environment, finding that data from early iterations produced increasingly accurate effort estimates for later iterations. The chronology of projects was also important in [28], which investigated changes in software development productivity over time. The focus in that paper was on characterising productivity in different years, rather than using past data to estimate future projects.

Data set Description
The analysis presented in this paper was based on software projects from the Finnish dataset (the version of that data set available as at May 2008). This dataset contains data on 856 projects, including 201 from the single company that we studied. Rules of confidentiality mean that the single company's identity is not disclosed.
To form a data set suitable for our analysis, we removed projects according to the following criteria:
• Remove projects if they were assigned a low data quality rating (X).
• Remove any duplicate projects.
• Remove projects if their size is measured in COSMIC, rather than FiSMA FPs.
We finished with a set of 794 software projects, of which 201 came from a single company and 593 came from other companies. All have high data quality, and comparable definitions for size and effort. We applied the project selection criteria carefully in order to maximise comparability between projects.
The Finnish dataset provides data on many variables. The fundamental variables are size, effort, and four basic project classifiers: development type, hardware platform, development language, and business sector. Other variables include several situation analysis variables, analogous to general system characteristics in IFPUG function points, or to COCOMO cost drivers. In this study we restricted our attention to the six fundamental variables. Over 50 development languages are represented; we consolidated these into a broad classification by language type. Our final variables are presented in Table 1.
Summary statistics for the ratio-scale variables are presented in Table 2. We also include the project delivery rate (PDR), calculated as Effort/Size, to give readers an additional way to compare cross- to single-company projects. This measure is often used to measure productivity, where high values indicate low productivity. Table 2 suggests that size, effort and PDR are all slightly smaller in the single-company data than in the cross-company data. We do not present group statistics for the nominal variables because their description would use a large amount of space. To summarise, the single-company projects include fewer new developments (56% compared to 70% for cross-company projects), more mainframe projects (62% compared to 41%), and more 3GL projects (82% compared to 70%). They are almost entirely from the insurance sector, whereas the cross-company projects are varied (though insurance is still the most common sector). The single-company and cross-company projects differ notably in age: although the overall range of dates is similar, only 8 of 201 single-company projects started before 1998, compared to 232 of 593 cross-company projects.

Project-by-project Chronological Splitting
Project-by-project chronological splitting was used both with the single- and cross-company data sets.
When employed with the single-company data set to obtain predictions for the single-company projects, the project-by-project chronological splitting was used as follows:
a. A project p in the single-company data set was selected as the target project, for which estimated effort was to be obtained.
b. The starting date (sd) for p was used to split the remaining single-company projects into two groups:
i. Completed: projects that had finished prior to sd.
ii. Active/future: projects that were active or had not yet started at sd.
c. The set of completed projects was used as the training set in order to build a regression model R.
d. Cook's distance [7] was used to determine whether any highly influential completed projects should be removed; if any were removed, R was then refitted using the reduced data set.
e. R was applied to p's data in order to obtain an effort estimate for p.
f. Project p was returned to the single-company data set.
g. Steps a to f were repeated until effort estimates were obtained for all the projects in the single-company data set.
When employed with the cross-company data set to obtain predictions for the single-company projects, the procedure was the same, except that cross-company projects replaced single-company projects as the training set.
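The selection of training projects in this procedure can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation: the Project record, the function name and the sample dates are hypothetical, and model fitting and the Cook's distance check are left out. The default minimum of 12 training projects is the threshold reported in the Results section.

```python
from datetime import date
from typing import NamedTuple


class Project(NamedTuple):
    """Hypothetical project record; only the dates matter for the split."""
    name: str
    start: date
    end: date


def chronological_folds(targets, pool, min_training=12):
    """Yield (target, training_set) pairs for a project-by-project split.

    For each target p, the training set holds the projects in `pool` that
    finished before p started. Targets with too few completed projects
    are skipped, so no model is built for them.
    """
    for p in targets:
        completed = [q for q in pool if q is not p and q.end < p.start]
        if len(completed) >= min_training:
            yield p, completed


# Made-up single-company data; min_training is lowered for the tiny sample
single = [
    Project("A", date(2000, 1, 1), date(2000, 6, 30)),
    Project("B", date(2000, 3, 1), date(2000, 9, 15)),
    Project("C", date(2000, 10, 1), date(2001, 3, 31)),
    Project("D", date(2001, 4, 1), date(2001, 12, 1)),
]
folds = [
    (p.name, [q.name for q in train])
    for p, train in chronological_folds(single, single, min_training=2)
]
```

Passing the cross-company projects as `pool` while keeping the single-company projects as `targets` gives the cross-company variant of the same split.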

Modelling Techniques
All the models used in this investigation were built using an automated process programmed in the statistical programming language R. The procedure followed in the automated process was as follows:
• The first step in building every regression model was to ensure numerical variables were normally distributed. We used the Shapiro-Wilk test on both the training and validation sets to check if Effort and Size were normally distributed. Statistical significance was set at α = 0.05. In every case, Size and Effort were not normally distributed, and were therefore transformed to a natural logarithmic scale. The transformed variables' names were preceded by 'L'; so Effort became LEffort, and Size became LSize.
• Models were built using backward stepwise multivariate regression.
• The size of the training set was always considered: no model was investigated involving more than N/10 independent variables, for a training set of N projects.
• To verify the stability of the effort model, the automatic process used the following step [11]: Cook's distance values were calculated for all projects to identify influential data points. Any projects with distances higher than 3 × (4/N), where N represents the total number of projects, are immediately removed from the data analysis. Those with distances higher than 4/N but smaller than 3 × (4/N) are removed temporarily in order to test the model stability, by observing the effect of their removal on the model. If the model coefficients remain stable and the goodness of fit improves, the highly influential projects are retained in the data analysis [16].
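The two Cook's distance cut-offs can be illustrated with a minimal Python sketch for the simplest case, a regression of LEffort on LSize alone. The study's models were built in R with backward stepwise selection over several predictors; this sketch only shows how the cut-offs partition the projects, and the function names and sample values are assumptions made for the example.

```python
def fit_ols(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cooks_distances(xs, ys):
    """Cook's distance of each point for a one-predictor model (p = 2)."""
    n, p = len(xs), 2
    a, b = fit_ols(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s2 = sum(r * r for r in residuals) / (n - p)  # residual variance
    distances = []
    for x, r in zip(xs, residuals):
        h = 1 / n + (x - mx) ** 2 / sxx  # leverage of this point
        distances.append((r * r / (p * s2)) * h / (1 - h) ** 2)
    return distances


def classify(distances):
    """Apply the 4/N and 3 x (4/N) cut-offs described in the text."""
    n = len(distances)
    low, high = 4 / n, 3 * (4 / n)
    return ["remove" if d > high else "test" if d > low else "keep"
            for d in distances]


# Hypothetical, already log-transformed data with one outlying project
lsize = [float(k) for k in range(1, 11)]
leffort = lsize[:-1] + [15.0]
labels = classify(cooks_distances(lsize, leffort))
```

Projects labelled "test" are the ones that would be removed temporarily to check model stability; that refitting step is not shown here.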

Prediction Accuracy Measures
To date the four measures most commonly used in software engineering to compare different effort estimation techniques have been [6]:
• Magnitude of Relative Error (MRE).
• Mean Magnitude of Relative Error (MMRE).
• Median Magnitude of Relative Error (MdMRE).
• Prediction at level l (Pred(l)).
MRE is defined as:
$$\mathrm{MRE} = \frac{|e - \hat{e}|}{e} \qquad (1)$$
where e represents actual effort and ê estimated effort. MMRE is the mean of all MREs. An alternative to the mean is the median, which also represents a measure of central tendency; however, it is less sensitive to extreme values. The median of MRE values is called the MdMRE. The Prediction at level l, also known as Pred(l), measures the fraction of estimates that are within l% of the actual values. Suggestions have been made [6] that l should be set at 25% and that a good prediction system should offer this accuracy level 75% of the time.
Although MMRE, MdMRE and Pred(l) are often used as evaluation criteria to compare different effort estimation techniques, Kitchenham et al. [12] showed that MMRE and Pred(l) are respectively measures of the spread and kurtosis of z, where z = ê / e. They suggested the use of box plots of z and box plots of the residuals (ê - e) as useful alternatives to simple summary measures, since they can give a good indication of the distribution of residuals and z and can help explain summary statistics such as MMRE and Pred(.25). We use MMRE, MdMRE, z and Pred(.25) to compare the effort models built in this study, and test for significant differences in absolute residuals.
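A small worked example may help to fix these measures. The Python sketch below assumes the usual definition MRE = |e - ê| / e, with e the actual and ê the estimated effort, and treats an estimate as within level l when its MRE is at most l. The function names and the effort values are made up for illustration.

```python
from statistics import mean, median


def mre(actual, estimate):
    """Magnitude of relative error for one project."""
    return abs(actual - estimate) / actual


def accuracy(actuals, estimates, level=0.25):
    """Summary accuracy measures for paired actual and estimated efforts."""
    mres = [mre(e, e_hat) for e, e_hat in zip(actuals, estimates)]
    return {
        "MMRE": mean(mres),
        "MdMRE": median(mres),
        "Pred": sum(m <= level for m in mres) / len(mres),
        "z": [e_hat / e for e, e_hat in zip(actuals, estimates)],
        "abs_residuals": [abs(e - e_hat) for e, e_hat in zip(actuals, estimates)],
    }


# Hypothetical efforts in person-hours
actuals = [100, 200, 400, 800]
estimates = [110, 150, 400, 1600]
result = accuracy(actuals, estimates)
```

Here the four MREs are 0.1, 0.25, 0 and 1.0, giving MMRE = 0.3375, MdMRE = 0.175 and Pred(.25) = 0.75. The single large overestimate pulls the mean well above the median, which is why both are reported.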
To compare the statistical significance of predictions we used the paired-samples t-test, and set the statistical significance at 5%.All calculations were carried out using the statistical language R.
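The test statistic behind this comparison can be sketched in Python as follows, assuming two lists of absolute residuals for the same projects in the same order. The function name and the sample values are illustrative. The p-value is omitted because it needs the t distribution, which the Python standard library does not provide; in R, t.test with paired = TRUE reports it directly.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical absolute residuals from two models on the same five projects
model_one = [10, 12, 9, 14, 11]
model_two = [8, 11, 9, 10, 9]
t_stat, dof = paired_t(model_one, model_two)
```

For these values the differences are 2, 1, 0, 4 and 2, so t is about 2.71 with 4 degrees of freedom.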

RESULTS
The single-company data set contained data on 201 projects. However, with project-by-project chronological splitting we only built regression models whenever there were at least 12 training projects that had been completed prior to the start of the single-company project for which effort was to be estimated. Using single-company training data we were only able to obtain predictions for 183 of the 201 single-company projects; using cross-company training data we could obtain predictions for 199 of the 201 single-company projects. For comparability, all models were evaluated using the 183 projects that could be estimated using project-by-project chronological splitting with single-company training data.

Cross-company Model
The cross-company models used as part of the project-by-project chronological splitting procedure will not be described in this Section as there were 199 different models that were automatically fit using the statistical language R. Therefore, the only cross-company model that will be detailed here is the one to be compared with predictions obtained from applying leave-one-out and leave-two-out cross-validations in the single-company data set.
In the cross-company data set, the categorical variables Platform, DevType, LangType and Sector had six, three, three and seven levels respectively. They were replaced by five, two, two, and six dummy variables respectively (the dummy variable corresponding to the most common level was omitted). In addition, Effort and Size were transformed to a natural logarithmic scale to approximate a normal distribution. Table 3 presents the final set of variables used to build the cross-company model. 199 of 201 single-company projects provided data for every variable. The best cross-company model, based on the 577 of the 593 cross-company projects that also had data for all variables, included all of size, development type, platform, language type, and sector. Its adjusted R² was 0.754 (see Equations 2 and 3).
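The dummy coding described above, with the most common level omitted as the baseline, can be sketched in Python as follows. The function name and the level labels are hypothetical; the actual levels are those listed in Table 1.

```python
from collections import Counter


def dummy_code(values):
    """Encode a categorical variable as 0/1 dummies, omitting the most common level.

    Returns the omitted baseline level, the ordered list of remaining levels,
    and one row of dummy values per input value.
    """
    counts = Counter(values)
    baseline = counts.most_common(1)[0][0]
    levels = sorted(level for level in counts if level != baseline)
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return baseline, levels, rows


# Hypothetical development types: three levels give two dummy variables
dev_type = ["New", "Enhancement", "New", "Maintenance", "New"]
baseline, levels, rows = dummy_code(dev_type)
```

A project at the baseline level is represented by all zeros, so a variable with k levels needs only k - 1 dummies, matching the counts given in the text.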

Cross-company Models applied to Single-company projects
Two different sets of prediction accuracy statistics are presented in Table 4:
1. Predictions based on the application of project-by-project chronological split to each of the automatically fit cross-company models (CC1), applied to each of the 201 single-company projects. Each of the cross-company models was built from scratch for each single-company project, using a procedure automated in the statistical language R.
2. Predictions based on the application of the cross-company model represented by Equation 4 (CC2) to the same 201 single-company projects. (The absolute residuals and z values resulting from CC2 will later be compared to those obtained from leave-one-out and leave-two-out cross-validations.)
Overall both CC1 and CC2 presented poor prediction accuracy (high mean and median MREs, and low Pred(.25) values). However, CC2's median MRE and Pred(.25) seemed better than the values obtained for CC1. This trend was confirmed by the paired t-test, where CC2 was found to provide significantly better predictions than CC1.

Single-company Models
Three different sets of prediction accuracy statistics are presented in Table 5:
1. Predictions based on project-by-project chronological splitting, using single-company data, for each of the same 201 single-company projects. Each of the single-company models (SC1) was built from scratch, using a procedure automated in the statistical language R.
2. Predictions based on the use of leave-one-out cross-validation applied to the 201 single-company projects, where each of the single-company regression models (SC2) was built from scratch, using a procedure automated in the statistical language R.
3. Predictions based on the use of leave-two-out cross-validation applied to the 201 single-company projects, where each of the single-company regression models (SC3) was built from scratch, using a procedure automated in the statistical language R.
The training set was small for early projects in the chronological sequence, but by the end of the sequence it was almost the complete data set. Overall all single-company models presented poor prediction accuracy (high mean and median MREs, and low Pred(.25) values). However, accuracy statistics for SC2 and SC3 were in general very similar and better than the accuracy presented by SC1. This trend was confirmed by the paired t-test, when applied to absolute residuals: SC1 was found to present significantly worse accuracy than either SC2 or SC3, and SC2 and SC3 were found to present similar accuracy.
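For contrast with the chronological split, leave-one-out and leave-two-out cross-validation can be sketched in Python as below. The paper does not state how the held-out pairs were chosen for leave-two-out, so this sketch assumes every possible pair is held out once; the function name is illustrative.

```python
from itertools import combinations


def leave_k_out(n, k):
    """Yield (training_indices, validation_indices) for every k-subset of n projects.

    Assumption: folds are exhaustive, i.e. each possible group of k projects
    is held out exactly once.
    """
    indices = range(n)
    for held_out in combinations(indices, k):
        held = set(held_out)
        yield [i for i in indices if i not in held], list(held_out)


# Four hypothetical projects, identified by position
loo = list(leave_k_out(4, 1))
lto = list(leave_k_out(4, 2))
```

Unlike the chronological folds, each training set here contains every other project regardless of its dates, including projects completed after the held-out ones started.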

Comparing Single- to Cross-company Predictions
Table 6 shows the results for the statistical significance tests comparing predictions between single- and cross-company predictions. All results were obtained using the paired t-test (α = 0.05) and absolute residuals. Absolute residuals unanimously indicate significant differences in prediction accuracy between cross- and single-company models, where superior predictions were presented by all three single-company models.

ANSWERING OUR RESEARCH QUESTIONS
Research questions 1a), 2a) and 3a) from Section 1 ask how accurate cross-company models are for single-company data. These questions are addressed in Table 4. None of the estimates obtained for the single-company projects using cross-company models indicate good prediction accuracy. MMRE values range between 0.827 and 0.952, which are considered poor (0.25 is considered "good" [6]). The same applies to Pred(.25), which ranges from 0.2896 to 0.3388 (0.75 indicates a good prediction model).
Research questions 1b), 2b) and 3b) compare cross- and single-company models applied to single-company data. These questions are addressed in Table 6. Absolute residuals show significant differences between all the cross- and single-company predictions, when used to estimate effort for single-company projects. Single-company models unanimously present significantly better predictions than cross-company models when estimating effort for single-company data.
To answer the fourth research question, comparing project-by-project chronological splitting with leave-one-out cross-validation, we used the paired t-test to compare absolute residuals between models CC1 and CC2 and models SC1 and SC2. No significant differences were found when comparing single-company models; however CC2 showed significantly better predictions than CC1. So the answer to question four is 'yes' if based on single-company models and 'no' otherwise.
Question five (comparing leave-two-out cross-validation, rather than leave-one-out cross-validation, with project-by-project chronological splitting) had the same answer as question four: the answer to question five is 'yes' if based on single-company models and 'no' otherwise.
To answer question six, comparing the two cross-validation approaches, we also used the paired t-test to compare absolute residuals between models SC2 and SC3. No significant differences were found, therefore the answer to this question is 'no'.
In terms of the comparison between cross- and single-company predictions, results showed that single-company models significantly outperformed cross-company models. The two clear differences between the single-company and cross-company projects are that the single-company projects are more homogeneous in both business sector and age. In other respects the two data sets are both large and diverse; our other studies have found single- and cross-company models based on data with these characteristics to be similar in accuracy. This suggests that similar business sector, or age, or both, are particularly important when considering the use of cross-company data for estimation.
In relation to the use of project-by-project chronological splitting, our results showed that in the single-company data set it either showed similar accuracy to, or significantly worse accuracy than, both leave-one-out and leave-two-out cross-validations. Also, project-by-project chronological estimation using cross-company data showed significantly worse accuracy than regression models based on the entire cross-company data set. In both situations, estimates that use the whole data set as training data are more accurate than estimates that can only use the data available so far for each individual project. In other words, research methods that draw on the entire data set to build estimation models are likely to over-estimate the accuracy that could be obtained in practice by estimators who can only make use of data available on a project-by-project basis. This is the most important finding in this paper: it suggests that it is important for researchers in this field to consider chronology, if the research results are to be relevant to industry.

COMPARISON WITH PREVIOUS RESULTS
In this Section we compare the results presented in this paper to those from [17], using as basis the same six research questions addressed in both studies.
1. Using a project-by-project chronological splitting for both cross- and single-company projects: a) How successful is a cross-company model at estimating effort for projects from a single company, when the model is built from a data set that does not include projects from that company? Both studies showed the same trend: the estimates obtained for the single-company projects using cross-company models indicated poor prediction accuracy. MMRE values ranged between 0.952 and 0.982, considered poor (0.25 is considered "good" [6]); Pred(.25) ranged between 0.184 and 0.290 (0.75 indicates a good prediction model), also considered poor. Therefore, the results of the replicated study corroborated those from the original study.
b) How successful is a cross-company model, compared to a single-company model? Lokan and Mendes [17] showed no significant differences between the predictions obtained for single-company projects using cross-company data and predictions obtained using single-company data; however, these results were not corroborated by those reported herein, where predictions obtained for single-company projects using single-company data presented significantly better accuracy than predictions using cross-company data. Therefore, the results of the replicated study did not corroborate those from the original study.

2. Using a leave-one-out cross-validation approach: a) How successful is a cross-company model at estimating effort for projects from a single company, when the model is built from a data set that does not include projects from that company? Both studies presented a similar trend, where the prediction accuracy using cross-company data to estimate effort for single-company projects was poor. MMRE values ranged between 0.827 and 1.477, and Pred(.25) ranged between 0.180 and 0.339. Therefore, the results of the replicated study corroborated those from the original study.
b) How successful is a cross-company model, compared to a single-company model? The same pattern observed for 1.b) occurred here too. Therefore, the results of the replicated study did not corroborate those from the original study.

3. Using a leave-two-out cross-validation approach:
a) How successful is a cross-company model at estimating effort for projects from a single company, when the model is built from a data set that does not include projects from that company? Once again both studies presented a similar trend, where the prediction accuracy using cross-company data to estimate effort for single-company projects was poor. MMRE values ranged between 0.8272 and 1.477, and Pred(.25) ranged between 0.180 and 0.339. Therefore, the results of the replicated study corroborated those from the original study.
b) How successful is a cross-company model, compared to a single-company model? The same pattern observed for 1.b) and 2.b) was repeated here. Therefore, the results of the replicated study did not corroborate those from the original study.
4. Does the use of a project-by-project chronological split, instead of a leave-one-out cross-validation, affect the accuracy of the models? Lokan and Mendes [17] reported no significant differences between predictions obtained using project-by-project chronological split and leave-one-out cross-validation. The same pattern was observed in this study when comparing SC1 to SC2; however the pattern differed when comparing cross-company models given that CC2 showed significantly better predictions than CC1. Therefore, the results of the replicated study partially corroborated those from the original study.
5. Does the use of a project-by-project chronological split, instead of a leave-two-out cross-validation, affect the accuracy of the models? The same results as for Research question 4. Therefore, the results of the replicated study partially corroborated the results of the original study.
6. Does the use of a leave-two-out cross-validation, instead of a leave-one-out cross-validation, affect the accuracy of the models? Both studies showed no significant differences between predictions obtained using leave-one-out or leave-two-out cross-validation. Therefore, the replicated study corroborated the findings from the original study.
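The accuracy figures quoted in the answers above rest on two summary statistics, MMRE and Pred(.25). The following is a minimal sketch of how they are conventionally computed, assuming the standard definition of the magnitude of relative error, |actual - estimate| / actual. The function names and the effort values are hypothetical and are not taken from the Finnish or ISBSG data.

```python
from typing import Sequence


def mre(actual: float, estimate: float) -> float:
    """Magnitude of relative error for a single project."""
    return abs(actual - estimate) / actual


def mmre(actuals: Sequence[float], estimates: Sequence[float]) -> float:
    """Mean MRE over all projects; lower is better (0.25 is considered good)."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    return sum(errors) / len(errors)


def pred(actuals: Sequence[float], estimates: Sequence[float],
         level: float = 0.25) -> float:
    """Fraction of projects whose MRE is at most `level`; higher is better."""
    hits = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) <= level)
    return hits / len(actuals)


# Hypothetical effort values in person-hours, for illustration only.
actual_effort = [100.0, 200.0, 400.0, 800.0]
estimated_effort = [110.0, 160.0, 800.0, 780.0]

print("MMRE:", mmre(actual_effort, estimated_effort))
print("Pred(.25):", pred(actual_effort, estimated_effort))
```

In this toy data a single badly overestimated project raises MMRE above the 0.25 threshold even though three of the four estimates fall within 25% of the actual effort, which is one reason the two statistics are usually reported together.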
An important difference between the single-company and cross-company datasets used in [17] and herein relates to the project delivery rate (PDR). The single-company projects from the ISBSG dataset presented similar PDR to the cross-company projects; conversely, the single-company projects from the Finnish dataset presented superior PDR when compared to the cross-company projects. This may help explain the differences observed between the two studies for research questions 1.b), 2.b) and 3.b).

THREATS TO THE VALIDITY
This study has some limitations and threats to validity. First, the Finnish dataset is a convenience sample, and does not represent a random sample of projects. Therefore these results are only applicable to those companies that volunteered data to the Finnish dataset, and companies that manage software projects similar to those used in this study. Second, the models were built automatically. Automating the process necessarily involved making some assumptions, and the validity of our results depends on those assumptions being reasonable. For example, log transformation is assumed to be adequate to transform numeric data to an approximately normal distribution; residuals are assumed to be random and normally distributed, without this actually being checked; when choosing between two models in which all independent variables were significant, the one with the higher adjusted R² is assumed to be preferred; multi-collinearity between independent variables is handled automatically in the stepwise procedure used to build the models. Based on our past experience with manual model building (from ISBSG data), we believe that these assumptions are acceptable. One would not want to base significant decisions on a single model built automatically, without at least doing some serious manual checking. But for calculations such as leave-one-out/leave-two-out cross-validation, or project-by-project chronological estimation across a substantial data set, we believe that the process here is reasonable.
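To make the assumptions listed above concrete, the sketch below fits a regression on log-transformed data, reports the adjusted R² used to choose between candidate models, and transforms a prediction back to the raw scale. It is a simplified illustration only: it uses a single predictor (size) and ordinary least squares, whereas the models in this study were built with a stepwise procedure over several variables. The function names and data values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_log_log(sizes: Sequence[float],
                efforts: Sequence[float]) -> Tuple[float, float, float]:
    """Fit ln(effort) = intercept + slope * ln(size) by least squares.

    Returns (intercept, slope, adjusted R^2), all on the log scale.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    p = 1  # number of independent variables in this simplified model
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return intercept, slope, adjusted_r2


def predict_effort(intercept: float, slope: float, size: float) -> float:
    """Back-transform to the raw scale: effort = e^intercept * size^slope."""
    return math.exp(intercept) * size ** slope


# Hypothetical (size, effort) pairs, for illustration only.
sizes = [50.0, 120.0, 200.0, 350.0, 600.0]
efforts = [400.0, 1100.0, 1500.0, 3300.0, 5200.0]

intercept, slope, adj_r2 = fit_log_log(sizes, efforts)
print("adjusted R^2 (log scale):", adj_r2)
print("estimated effort for size 250:", predict_effort(intercept, slope, 250.0))
```

The adjusted R² here describes fit on the log scale, so a high value does not by itself guarantee good accuracy on the raw scale; that is measured separately by statistics such as MMRE and Pred(.25).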

CONCLUSIONS
This paper replicated a previous study and investigated the use of project-by-project chronological splitting when comparing cross- and single-company prediction models, and compared it with two cross-validation approaches. Data from the Finnish dataset was used to answer our research questions. When comparing cross- and single-company predictions, we found that single-company models presented significantly better prediction accuracy than cross-company models. These results contradict those obtained using the ISBSG dataset, where cross-company predictions showed similar accuracy to single-company predictions. Project-by-project chronological splitting presented either similar accuracy or significantly worse accuracy than the other two commonly used techniques (leave-one-out and leave-two-out cross-validation). Except for the comparison between CC1 and CC2, all remaining results (SC1 significantly worse than SC2 & SC3; SC2 presented similar accuracy to SC3) were common between this paper and the previous study it replicated. However, in a real setting, the only realistic option available for a company that employs its own data is to use some form of chronological splitting. If, however, a company relies on cross-company data, the results presented herein suggest that the use of common techniques such as leave-one-out cross-validation for effort estimation may be more beneficial than applying a project-by-project chronological split. We found no difference between the results from leave-one-out and leave-two-out cross-validation, which corroborates previous results. Leave-two-out cross-validation has the advantage of providing a distribution of estimates for each project, rather than a single estimate; it has the disadvantage that it involves a great deal more computation, proportional to the square of the size of the data set instead of linear in the size of the data set. Our results suggest that the computational cost of leave-two-out cross-validation is not worthwhile.
Future work includes repeating this experimental approach using other data sets; particularly, more homogeneous data sets, including Web projects. We are also interested in tracing the evolution of prediction models and their accuracy as a training set grows. Another research question to be investigated is whether it is best to use the entire history of past projects for project-by-project estimation, or whether it is more appropriate in a rapidly changing world to use a window of recent projects.
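The three ways of forming training sets compared in this paper, and the difference in computational cost between them, can be sketched as follows. This is an illustrative reading of the procedures rather than the scripts used in the study: the `Project` record, the function names, and the four example projects are hypothetical, and the leave-two-out variant assumes that each estimate for a target project comes from a model built with one additional project held out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Project:
    name: str
    start: date
    end: date


def chronological_training_set(projects: List[Project],
                               target: Project) -> List[Project]:
    """Projects completed before the target project's start date."""
    return [p for p in projects if p is not target and p.end < target.start]


def leave_one_out_training_set(projects: List[Project],
                               target: Project) -> List[Project]:
    """All other projects, regardless of when they were completed."""
    return [p for p in projects if p is not target]


def leave_two_out_training_sets(projects: List[Project],
                                target: Project) -> List[List[Project]]:
    """One training set per choice of a second held-out project.

    Each training set yields one estimate for the target, which gives a
    distribution of estimates rather than a single value.
    """
    others = [p for p in projects if p is not target]
    return [[p for p in others if p is not second] for second in others]


def model_counts(n: int) -> Tuple[int, int]:
    """Models to build for n projects: (leave-one-out, leave-two-out)."""
    return n, n * (n - 1) // 2


# Hypothetical projects, for illustration only.
projects = [
    Project("A", date(2001, 1, 1), date(2001, 6, 30)),
    Project("B", date(2001, 3, 1), date(2002, 1, 15)),
    Project("C", date(2002, 2, 1), date(2002, 9, 1)),
    Project("D", date(2001, 9, 1), date(2001, 12, 1)),
]
target = projects[3]  # project D

print([p.name for p in chronological_training_set(projects, target)])
print([p.name for p in leave_one_out_training_set(projects, target)])
print(len(leave_two_out_training_sets(projects, target)))
print(model_counts(201))
```

For project D, only project A had finished before D started, so the chronological training set is much smaller than the leave-one-out set, which also contains projects B and C that were completed after D began. With the 201 single-company projects used here, leave-one-out needs 201 models while leave-two-out needs 20,100 distinct models, one per pair of held-out projects.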

TABLE 1: Variables used in this study

TABLE 2: Project summary statistics for the ratio-scaled variables

TABLE 3: Variables used to build the cross-company model

One single-company project did not provide data about the platform. The best cross-company model, based on the 588 cross-company projects that had data for the same set of variables, was (after transformation back to the raw scale):

TABLE 4: Prediction accuracy statistics for cross-company models applied to single-company projects

TABLE 5: Prediction accuracy statistics for single-company models applied to single-company projects

TABLE 6: Comparing prediction accuracy between single- and cross-company models