Evaluation and Assessment in Software Engineering Prediction of Overoptimistic Predictions

Overoptimistic predictions are common in software engineering projects, e.g., the average software project cost overrun is about 30%. This paper examines the use of two popular general tests of optimism (the ASQ and the LOT-R test) to select software engineers that are less likely to provide overoptimistic predictions. A necessary, but not sufficient, condition for this use is that there is a strong relationship between optimism score, as measured by the ASQ and LOT-R tests, and predictions. We report from two experiments on this topic. The experiments suggest that the relation between optimism score as measured by ASQ or LOT-R and predictions is too weak to enable a use of these optimism measurement instruments to select more realistic estimators in software organizations. Our results also suggest that a person's general level of optimism and over-optimistic predictions of performance are, to a large extent, unrelated.


Introduction
Organizations that develop software have, in general, a bad reputation for effort overruns. Studies reporting a high degree of over-optimism in software-related and other types of projects include (Buehler, Griffin et al. 1994; Buehler, Griffin et al. 1997; Newby-Clark, Ross et al. 2000; Moløkken-Østvold and Jørgensen 2005). According to (Moløkken-Østvold and Jørgensen 2003), the average software project cost overrun is about 30%. Large cost overruns may lead to dissatisfied customers, low-quality software and frustrated software developers.
One possible way to reduce or eliminate the strong bias towards over-optimistic cost predictions is to use formal estimation models. Proper formal prediction models are unbiased and are not affected by political issues or wishful thinking. Due to these positive factors, there has been much research on formal cost estimation models, e.g., the COCOMO model (Boehm 1981). However, empirical studies suggest that the more flexible method "expert estimation" is just as accurate; see (Jørgensen 2004) for a review of the studies. Possibly for that reason, most software companies rely on expert estimation and seldom use formal cost estimation models, as documented in (Heemstra and Kusters 1991; Hihn and Habib-Agahi 1991; Paynter 1996; Hill, Thomas et al. 2000). Applying expert estimation means that it is critical to select proper estimation experts. A proper estimation expert should have relevant experience and expertise. In addition, we would like the estimation expert to be as realistic as possible, e.g., not systematically subject to wishful thinking.
This paper examines the possibility of selecting realistic expert estimators based on the two widely used optimism measurement tools ASQ (Attributional Style Questionnaire) (Seligman 1995) and LOT-R (Life Orientation Test-Reduced) (Scheier and Carver 1985). Our research question is as follows: How well do standard optimism measurement instruments, i.e., ASQ and LOT-R test values, correlate with optimistic predictions provided by software engineers? Similar uses of the optimism measurement instruments to select between people have been studied before, e.g., (Seligman 1995) reports studies on successful use of the ASQ instrument to select high-performance insurance salesmen and to predict winners of political elections.
The remainder of this paper is organized as follows: Section 2 discusses optimism and relevant previous studies. Section 3 reports and discusses results from the empirical studies. Section 4 concludes.

Optimism
Optimists expect good things to happen more often than pessimists and interpret events and states more positively. A common illustration is the optimist's description of a glass as "half full", while a pessimist describes the same glass as "half empty". For the purpose of scientific studies there is frequently a need to decompose optimism, and to find operational procedures that measure the various aspects of optimism. Aspects of optimism that researchers have attempted to measure include dispositional optimism, generalized self-efficacy, optimistic explanatory style, unrealistic optimism, and strategic optimism (Fournier, Ridder et al. 1999).
In this paper, we will focus on (i) optimistic explanatory style, defined as the pattern of explaining positive events as permanent, general and internal to the subject, while explaining negative events as unstable, specific and external to the subject (Seligman 1995), and (ii) dispositional optimism, defined as the tendency to believe that one will generally experience good outcomes in life (Scheier and Carver 1985). The fact that these quite different descriptions are the basis for two measurements of optimism suggests that the concept of optimism is not well defined. The main reason for the selection of these two aspects of optimism is that there exist operational, widely used measurement tools for them, i.e., the ASQ test (Attributional Style Questionnaire) for optimistic explanatory style and the LOT-R test (Life Orientation Test-Reduced) for dispositional optimism. Both measures are based on questionnaires with predefined statements to which the subjects respond. The questionnaires are enclosed as Appendices 1 and 2.
There have been many studies on optimism with relevance to our research question. Relevant and representative research findings include the following:
• Most people are overoptimistic about future life events and their own performance (Steen 2004). A frequently quoted finding is that 80% of drivers in Texas believe their driving ability is above average (Svenson 1981).
• A frequently reported observation is that people who are prone to depression seem to have the most realistic self-perception, e.g., (Taylor and Brown 1988).
• The level of overoptimism may be culture- and time-dependent. In some cultures, overpessimism may be more common (Chang 2000).
• A high level of optimism is typically an indicator of well-being, e.g., psychological functioning, effective coping with stress, psychological well-being and physical health. Pessimism, on the other hand, has been found to be linked to learned helplessness, apathy and depression (Ek, Remes et al. 2004).
• Traditional theories of economics, e.g., expected utility theory, assume that rational actors have advantages over non-rational actors and will, in the long run, be winners. Recently, however, studies have pointed out that there are reasons to believe that moderately overoptimistic actors in some contexts have advantages, e.g., a better motivation to work harder, and that a society will benefit from containing overoptimistic, e.g., risk-seeking, people. See, for example, (Manove 1995; Steen 2004) for a discussion of this topic.
There are diverging opinions on most of the results on optimism. In addition, although the studies typically show statistically significant differences in performance between optimists and pessimists, the effect sizes are typically small. Our impression from reading the studies is that variations in the level of optimism typically account for less than 10% of the variation in performance.
We searched for studies on the use of the optimism measurement tools ASQ and LOT-R to select realistic estimators (forecasters, predictors, judgment experts), but were unable to find any. There is, however, some evidence suggesting that dispositional optimism, e.g., as measured by LOT-R, and comparative optimism, e.g., predicting that oneself will perform better than the average performer, are not strongly correlated; see (Shepperd, Carroll et al. 2002) for an overview. Although the idea that optimists have more optimistic performance predictions than pessimists is intuitively appealing, the available evidence seems to favor no or only a small difference in prediction realism.

Empirical Studies
We include two experiments in this section. Section 3.1 describes an experiment on software engineering students predicting examination marks. Section 3.2 describes an experiment on software professionals estimating the effort for a software project. Section 3.3 discusses limitations of the studies.

Experiment 1: Prediction of Examination Marks
Twenty-five software engineering students at the University of Oslo volunteered to participate. The participants were paid for participation in a study that collected several characteristics about their study expectations, study technique and examination results. We collected information about their general level of optimism using the ASQ measurement tool. The questions (events) used in the questionnaire are shown in Appendix 1. The distribution of ASQ scores is described in Table 1. The number of very pessimistic individuals was surprisingly high, e.g., more than half of the students were categorized as "very pessimistic". This does not correspond well with our impression of Norwegian software engineering students, and it might be that the category labels of the ASQ instrument are not well calibrated for them. The correctness of the category labels is, however, not the main issue in our study. As stated earlier, we want to study whether differences in ASQ score can be used to predict optimism in predictions. The distributions of examination mark predictions and actual examination marks are described in Table 2. We asked the students to predict their examination marks early in the semester, then just before the examination, and finally a few days after the examination (before they got their examination marks).
Table 2 shows an initially (Early Prediction) strong bias towards overoptimism, e.g., 16 students believed they would get A or B, while only nine students actually achieved one of these marks. This overoptimism regarding prediction is somewhat surprising in light of the high number of students categorized as very pessimistic according to the ASQ score. It is interesting to note that proximity to the date of the examination seems to affect the level of optimism, in that the students get more realistic, i.e., less overoptimistic, as the date of the examination looms closer. One potential reason for this is the conscious or unconscious strategic use of optimism. At the beginning of the semester, increased optimism may stimulate harder work. Around the examination period, when there is little or no possibility of affecting the outcome, a less optimistic outlook may be useful to avoid disappointments. This explanation may also be relevant for predictions of software development effort, i.e., overoptimism may be higher when the time of evaluation is far away. An alternative explanation is that the students know more about their own performance when they get closer to the examination. The explanation that there is a shift from examination mark prediction optimism towards more realism and even pessimism when getting closer to the examination is, however, supported by findings reported in several studies, e.g., (Manger and Teigen 1988). It therefore seems to be a robust finding that the time horizon has an important role to play regarding the level of optimism.
The main analysis of this study is the relation between difference in general level of optimism (the ASQ score) and the accuracy of examination mark prediction. For this purpose, we labeled all predictions that were better than the actual outcome as over-optimistic, those identical to the outcome as realistic, and those worse than the outcome as overpessimistic. Tables 3a-3c show results related to the relation between ASQ score and accuracy in examination mark prediction. Notice that we have separated the "Very pessimistic" category into two sub-categories: "Very pessimistic I" (ASQ score in the interval [-5, 0]) and "Very pessimistic II" (ASQ score lower than -5). Since we had no observations in the "moderately optimistic" category, we removed it from the table. Tables 3a-3c suggest that there is no clear pattern regarding the relation between ASQ score and overoptimistic prediction. This lack of relation is well illustrated by the observation that the most pessimistic student (ASQ score of -12) had the most overoptimistic prediction! His/her early prediction was A while the actual examination mark was D!
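The labeling rule for predictions can be sketched in a few lines. This is a minimal illustration and not the analysis code used in the study; the mark scale A to F (with A as the best mark) and the function name `classify_prediction` are assumptions made here for the example.

```python
# Assumption: marks are letters ordered from best (A) to worst (F).
MARK_ORDER = "ABCDEF"


def classify_prediction(predicted: str, actual: str) -> str:
    """Label a mark prediction relative to the actual outcome."""
    predicted_rank = MARK_ORDER.index(predicted)  # lower index = better mark
    actual_rank = MARK_ORDER.index(actual)
    if predicted_rank < actual_rank:
        return "over-optimistic"   # predicted a better mark than achieved
    if predicted_rank == actual_rank:
        return "realistic"         # prediction identical to outcome
    return "overpessimistic"       # predicted a worse mark than achieved


# The student described in the text: early prediction A, actual mark D.
print(classify_prediction("A", "D"))
```

Under these assumptions, the student with an ASQ score of -12 falls into the over-optimistic class, since the predicted mark A is better than the achieved mark D.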

Experiment 2: Software Development Effort Predictions
Fourteen senior project managers from the same Norwegian software development company participated in this experiment. Their task was to estimate the most likely effort necessary to complete a specified software project. All participants received the same information (a requirement specification), and were instructed to base the effort estimate on the same assumptions. The project was a real project, where the development work was about to start. None of the participants in our experiment had participated in the previous cost estimation of the project.
The participants' general level of optimism was measured using the LOT-R tool, i.e., measurement of so-called dispositional optimism. The LOT-R instructions are described in Appendix 2. Notice that statements 2, 5, 6, and 8 of the questionnaire are just filler items and not used for the analysis of optimism. Statements 1, 4, and 10 are positive statements, while statements 3, 7 and 9 are negative statements. The response to a negative statement should be "inverted" to correspond to the response to a positive statement. For example, the response "strongly agree" (A) to a positive statement corresponds to the response "strongly disagree" (E) to a negative statement in relation to level of optimism. Table 4 describes the distribution of LOT-R answers. Table 4 shows a strong tendency towards dispositional optimism, i.e., most software professionals agreed with the positive statements and disagreed with the negative statements. To compare individuals' LOT-R optimism scores we proceeded as follows:
1) All answers by an individual were valued as follows:
• A on positive statements and E on negative statements gave the value 2
• B on positive statements and D on negative statements gave the value 1
• C gave the value 0
• D on positive statements and B on negative statements gave the value -1
• E on positive statements and A on negative statements gave the value -2
2) We excluded the filler items and calculated the average optimism value for an individual (AvOptLOT-R) as the average value of the answers on Q1, Q3, Q4, Q7, Q9 and Q10.
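The two scoring steps can be sketched as follows. This is an illustrative sketch of the procedure as described in the text, not the authors' own tooling; the function name `av_opt_lot_r`, the dictionary-based input format and the example answers are hypothetical.

```python
# Item roles as stated in the text; statements 2, 5, 6 and 8 are fillers.
POSITIVE_ITEMS = {1, 4, 10}
NEGATIVE_ITEMS = {3, 7, 9}

# Step 1: value of each response on a positive statement (A = strongly agree).
VALUE_ON_POSITIVE = {"A": 2, "B": 1, "C": 0, "D": -1, "E": -2}


def av_opt_lot_r(answers: dict) -> float:
    """Average optimism value over the six scored LOT-R statements.

    `answers` maps statement number (1-10) to a response letter 'A'..'E'.
    Filler items may be present; they are ignored.
    """
    values = []
    for item in sorted(POSITIVE_ITEMS | NEGATIVE_ITEMS):
        value = VALUE_ON_POSITIVE[answers[item]]
        if item in NEGATIVE_ITEMS:
            value = -value  # invert responses to negative statements
        values.append(value)
    # Step 2: average over Q1, Q3, Q4, Q7, Q9 and Q10.
    return sum(values) / len(values)


# Hypothetical respondent (invented for illustration, not study data).
example = {1: "A", 2: "C", 3: "D", 4: "B", 5: "A",
           6: "B", 7: "E", 8: "C", 9: "C", 10: "B"}
print(av_opt_lot_r(example))
```

With this valuation AvOptLOT-R ranges from -2 (consistently pessimistic answers) to 2 (consistently optimistic answers); the hypothetical respondent above scores 7/6, i.e., about 1.17.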
We then analyzed the relation between effort predictions and AvOptLOT-R. The data are displayed in Figure 1 (a dot corresponds to an individual). Figure 1 displays, at best, a weak relationship between general level of optimism and predicted effort (the correlation is -0.17). This correlation may, however, be due to chance and is not significant.
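A rough check of why a correlation of -0.17 among 14 participants is not significant can be sketched as below. The text does not state which correlation coefficient or test was used, so the Pearson coefficient and a two-sided t-test at the 5% level are assumptions made here; the helper name `t_statistic` is likewise hypothetical.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation r against zero, n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r, n = -0.17, 14           # values reported in the text
t = t_statistic(r, n)

# Two-sided 5% critical value of Student's t with n - 2 = 12 degrees of freedom.
T_CRITICAL = 2.179

print("t statistic:", round(t, 2))
print("significant at 5%:", abs(t) > T_CRITICAL)
print("share of variance explained (r squared):", round(r * r, 3))
```

Under these assumptions the t statistic is about -0.60, far inside the interval of plus or minus 2.179, and the optimism score accounts for only about 3% of the variance in predicted effort.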
The project is currently not completed. Indications from the project manager about the ongoing project suggest that the actual total effort will be somewhere between 1700 and 2200 work hours, i.e., in this case the lowest predictions may be the most realistic! This further illustrates the problem of identifying the most realistic predictions using individuals' level of general optimism as an indicator.

Limitations of the Studies
The most important objections to the validity of the results are, we believe, related to the artificial experimental context. All our subjects knew they were part of a study, which may have affected their behavior. A common objection is that results achieved in artificial settings do not generalize to real-world contexts. Generalization of results from experiments should, however, not be a naïve generalization to field settings.
We believe that an important role of experiments in artificial settings is to understand a phenomenon with reduced noise from the environment compared to field settings. The generalization then happens through better understanding of basic relationships, i.e., by theory. In our case, the role of the experiment is to increase understanding of the (lack of) relation between general level of optimism (as measured by ASQ and LOT-R) and optimism in predictions. This increased understanding, together with other knowledge, should be used to make testable hypotheses about the real world. This may be just as valid a method of generalization as statistical generalization from sample to population. To illustrate the difference in roles between laboratory experiment and field study, assume that we had conducted a field study of the use of ASQ and LOT-R to predict over-optimistic software development effort predictions. We would then not be able to use the same project for all predictions and there would be more non-controlled factors. The realism and the representativeness would be higher in a field study, but the added noise would make it difficult to draw conclusions about what the ASQ and LOT-R tools had measured. The ultimate test of whether a method works in the software industry or not should obviously be based on field data. Deliberately introduced artificiality may, however, be useful to reduce noise and examine causal relationships.
In both experiments, the participants first predicted performance (examination marks or effort) and then completed the ASQ or LOT-R questionnaire. Is it possible that the completion of the ASQ and LOT-R questionnaires was affected by their previous predictions? If so, we have studied the impact of optimistic predictions on responses to ASQ and LOT-R questionnaires and are not able to answer the research question. We do not believe that this "opposite cause-effect" happened in our experiments, for three reasons: 1) The participants did not know the design of our study. This, we believe, excludes the effect of "pleasing the experimenter". 2) The ASQ and LOT-R questionnaires are not related to the type of prediction provided by the participants, i.e., the connection between them is not obvious to the participants. 3) There was at least one question not related to these experiments between the prediction and the ASQ/LOT-R questionnaire. It would, nevertheless, be interesting to study whether a reverse completion of tasks, i.e., first the questionnaire and then the prediction, would generate different results.
Another possible objection to our studies is that we were doomed to fail, in that it should have been obvious from earlier studies that ASQ and LOT-R would not have sufficient explanatory power to be used as a means of improving the selection of subjects more likely to predict realistically. We partly agree with this objection. A more thorough reading of previous studies might have made us less optimistic about the usefulness of the ASQ and LOT-R measurement tools. On the other hand, optimism is by definition related to optimistic predictions about the future. If tests of optimism are not closely related to overoptimism in predictions, then what are they supposed to measure? In addition, scientific studies should, in our opinion, not only focus on discovering new relationships. They should also evaluate claimed or intuitively obvious relationships. We would categorize the claim that optimists have overoptimistic predictions as intuitively obvious and, for that reason, worth a study. In addition, other researchers may build on our negative findings and try to use refined versions of general optimism tests to build tests that have a stronger relation to the identification of people who are more likely to produce realistic predictions.

Conclusion
The main goal of our studies was to test the extent to which the general level of optimism, as measured by ASQ and LOT-R, is a useful means of selecting software engineers with realistic predictions. A precondition for this use of optimism tests is that there is a strong relationship between the optimism test scores and overoptimistic predictions.
Previous optimism studies have typically used the measurement tools LOT-R and ASQ to examine how level of optimism affects health, success in work, and educational performance. We have been unable to find any study on the relation between these measures of general level of optimism and realism in quantitative predictions of performance. One reason for this lack of studies on overoptimistic predictions may be that it is believed to be obvious that very optimistic people make overoptimistic predictions. After all, overoptimistic expectation is part of the typical definition of optimism. If the tools for measuring optimism do not predict overoptimistic predictions, this would suggest severe shortcomings of these instruments.
Our results suggest that the general level of optimism is, at best, a very weak predictor of optimistic predictions. This result is consistent with the results of previous studies where optimism is a weak, and not very robust, indicator of individuals' behavior and success.
The main consequences of our results are, we believe, twofold. 1) The ASQ and LOT-R optimism measurement tools do not have sufficient predictive power to be used to select expert estimators less likely to be overoptimistic in, for example, software cost estimation contexts. 2) It may be incorrect to categorize estimators as optimists or pessimists based on ASQ and LOT-R scores. Very pessimistic individuals may make overoptimistic predictions, and vice versa.

Figure
Figure 1: Relation Between Effort and Average LOT-R Score

Appendix 1: ASQ Questionnaire
Read the description of each situation and vividly imagine it happening to you. Then click on the cause that is likelier to apply to you. You may not like the way some of the responses sound, but don't choose what you think you should say or what would sound right to other people; choose the response that's most like you. Your answers are not being recorded.
1. The project you are in charge of is a great success. I kept a close watch over everyone's work. Everyone devoted a lot of time and energy to it.
8. You miss an important engagement. Sometimes my memory fails me. I sometimes forget to check my appointment book.
9. You run for a community office and you lose. I didn't campaign hard enough. The person who won knew more people.
10. You host a successful dinner. I was particularly charming that night. I am a good host.
11. You stop a crime by calling the police. A strange noise caught my attention. I was alert that day.
12. You buy your spouse/partner/boyfriend/girlfriend a gift and he/she doesn't like it. I don't put enough thought into things like that. He/she has very picky tastes.
13. You gain weight over the holidays and can't lose it. Diets don't work in the long run. The diet I tried didn't work.
14. Your stocks make you a lot of money. My broker decided to take on something new. My broker is a top notch investor.
15. You win an athletic contest. I was feeling unbeatable. I train hard.
16. You fail an important examination. I wasn't as smart as the other people taking the exam. I didn't prepare for it well.
17. Your boss gives you too little time to finish a project, but you get it finished anyway. I am good at my job. I am an efficient person.
18. You lose a sporting event for which you have been training for a long time. I'm not very athletic. I'm not good at that sport.
19. Your car runs out of gas on a dark street late at night. I didn't check to see how much gas was in the tank. The gas gauge was broken.
20. You lose your temper with a friend. He/she is always nagging me. He/she was in a hostile mood.
21. You are penalized for not returning your income tax forms on time. I always put off doing my taxes. I was lazy about getting my taxes done this year.
22. You ask a person out on a date and he/she says "no." I was a wreck that day. I got tongue-tied when I asked him/her to the dance.
23. A game show host picks you out of the audience to participate in the show. I was sitting in the right seat. I looked the most enthusiastic.
24. You save a person from choking to death. I know a technique to stop someone from choking. I know what to do in a crisis situation.