Are users more diverse than designs? Testing and extending a 25 years old claim

Twenty-five years ago, Dennis Egan published a review on the impact of individual differences in human-computer interaction, where he claimed that users are more diverse than designs are [5]. While being cited frequently, this claim has not been tested since then. An efficient research design for separating and comparing variance components is presented, together with a statistical model to test Egan’s claim. The results of a pilot study indicate that Egan’s claim does not universally hold. An extension to the claim is suggested, capturing the trade-offs when prioritizing user tasks. An alternative strategy towards universal design is proposed.


INTRODUCTION
When the aim is to optimally serve a diverse population of users, understanding the interplay of design options and individual differences is crucial [13]. For example, Jennings et al. suggested that systems could adapt their interfaces to the cognitive abilities and styles of users [9]. With the emergence of the concepts of accessibility and universal design, the idea of adaptive systems received broader attention [12]. An alternative approach to universal design is to care for the least capable users, which again requires a good understanding of those users' abilities. For example, Freudenthal studied the effects of hypertext structure and age-related cognitive abilities on browsing performance, concluding that broad structures are preferable to deep structures when designing for populations that include elderly users [7].
One of the first systematic treatments of individual differences in HCI was a review by Egan, presented in 1988 [5]. A main conclusion was that individual differences have a strong impact on performance in using computer systems. Egan's report has been widely recognized in HCI and has been cited more than 300 times since it was first published (Google Scholar lists 328 citations as of 2013-06-11; 36 of these fall in the period 2010-2013). Many authors refer to either the predictors for performance, as identified by Egan [3], the proposed approach to robust designs [8], or the unmatched amount of variability in human-computer interaction, e.g. "a far greater range than usually found in human factors work" [4]. Some authors [1,7] directly referred to a specific claim of Egan on the relative impact of sources of variability: "differences among people usually account for much more variability in performance than differences in system designs" [5:543].
For this claim, Egan provided evidence from three application domains: text editing, programming and information retrieval. For example, Egan and Gomez [6] compared performance on two different editor designs and found that individual differences caused 20 times higher variability than the two designs under comparison.
Almost all studies mentioned in Egan's review had used factorial experimental designs, comparing groups of users (e.g., novices and experts) across a small number of designs (e.g., command vs. menu based control). As Monk pointed out, designs should rather be regarded as a population, as opposed to fixed effects, in HCI studies [10]. This is particularly relevant as, nowadays, many more design variants exist than in the 1980s, for example the hundreds of municipal websites in a country.
Based on these considerations, we introduce a research design that allows efficient comparison of large samples of designs. In line with Monk's suggestion to view designs as populations, we introduce a statistical model that uses multiple random effects to dissect overall variability into its components, thereby allowing us to test Egan's claim. The approach is demonstrated by a pilot study on university websites.

METHOD
Forty-one Dutch students from a variety of social science and engineering disciplines participated in the study (29 were male). Five Belgian and five Dutch university websites were selected for the study. No strict sampling procedure was used; instead, websites were selected by visual inspection to represent a good range of different designs and to minimize the possibility that a participant had used the website before.
Every participant was asked to complete ten different tasks on ten different websites, for example:

"Find the schedule of the first year bachelor biology."
"You have a complaint about how you were treated by a teacher. Find an ombudsperson or complaints desk."

Several performance measures were taken, such as time-to-completion, mental workload and path length until the desired information was reached. Here we will only report on the path length.
The aim of the study is to decompose the variance of performance into components for users, designs (websites) and tasks. Such a decomposition becomes possible through having repeated measures on every component of interest.
Obviously, a complete design in which every participant encounters all 100 combinations of website and task is not practical, not least because of undesired learning effects when the same website is visited multiple times.
The experimental design therefore rested on two principles: first, every participant must encounter each task and each website exactly once. Second, every combination of website and task must be encountered about the same number of times in the sample. This results in an incomplete design that is balanced over users, designs, tasks and the combinations of design and task. Note that for future studies with larger samples of websites and tasks, the first principle can be relaxed to "encountered at most once" without compromise.
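The two principles can be illustrated with a cyclic Latin square. The following is a minimal sketch, not the assignment procedure actually used in the study: the function name `cyclic_design` and the cyclic pairing rule are assumptions made for illustration, and only the sample sizes (41 participants, ten websites, ten tasks) are taken from the text above.

```python
from collections import Counter

def cyclic_design(n_participants, n_items):
    """Pair tasks with websites via a cyclic Latin square (illustrative only).

    Participant p performs task t on website (t + p) mod n_items, so every
    participant meets each task and each website exactly once.
    """
    return [
        [(p, t, (t + p) % n_items) for t in range(n_items)]
        for p in range(n_participants)
    ]

# Sample sizes as reported in the study.
plan = cyclic_design(n_participants=41, n_items=10)
trials = [trial for row in plan for trial in row]

# Count how often each (task, website) combination occurs in the sample.
combo_counts = Counter((t, w) for _, t, w in trials)
print(len(trials), min(combo_counts.values()), max(combo_counts.values()))
# prints: 410 4 5
```

Under this pairing rule, all 100 combinations of task and website occur four or five times. In a real study one would presumably also randomize the order of trials and the labelling of tasks and websites.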
Since we are interested in variance components, rather than a direct comparison between levels of any factor, a multiple random-effects model was constructed. The variance components are represented by non-nested (cross-classified) random effects for participants, websites and tasks. Furthermore, we added a fourth random effect representing the variance in the design-task combinations.
Statistical inference on multiple random effects is notoriously unreliable when using asymptotic procedures from least squares or maximum likelihood estimation [2]. For that reason, the Bayesian estimation method of Markov-Chain Monte-Carlo (MCMC) sampling was used. Such an analysis results in a posterior distribution of belief, which can be interpreted in much the same way as confidence intervals. Using uninformative priors ensures that the estimates are consistent with maximum likelihood estimation.
For the data model, a Poisson term was chosen, which typically is appropriate for count data. In order to account for likely over-dispersion, an observation-level random coefficient was added. All computations were done with the Bayesian modelling software Stan [11].
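One way to write down a model of this kind is sketched below. The notation is ours and is an assumption, as are the log link and the normal distributions for the random effects (both conventional choices for a Poisson model with random effects); the specification actually used is the Stan model in the appendix. Here y_i is the path length in trial i, and p[i], d[i] and t[i] denote the participant, design and task of that trial.

```latex
\begin{align*}
y_i &\sim \mathrm{Poisson}(\lambda_i) \\
\log \lambda_i &= \beta_0 + u_{p[i]} + v_{d[i]} + w_{t[i]} + x_{d[i],\,t[i]} + \epsilon_i \\
u_p &\sim \mathcal{N}(0, \sigma_{\mathrm{user}}^2), \qquad
v_d \sim \mathcal{N}(0, \sigma_{\mathrm{design}}^2), \qquad
w_t \sim \mathcal{N}(0, \sigma_{\mathrm{task}}^2) \\
x_{d,t} &\sim \mathcal{N}(0, \sigma_{\mathrm{design \times task}}^2), \qquad
\epsilon_i \sim \mathcal{N}(0, \sigma_{\mathrm{obs}}^2)
\end{align*}
```

In this formulation, Egan's claim corresponds to the standard deviation of the user effect being clearly larger than that of the design effect, and the observation-level term absorbs over-dispersion relative to a pure Poisson model.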

RESULTS
Out of the 410 trials, 367 were completed successfully, whereas in 22 trials a wrong answer was given. In 19 trials the participant gave up. Half of all trials were completed with three or fewer steps. However, 25% of all trials took more than seven steps, with a maximum of 41 steps.
Figure 1 shows that even within one design, path lengths are strongly skewed and widely spread, with considerable differences in range between designs. With the exception of the Hasselt and Leiden websites, the mean path length does not differ much between designs. In contrast, much stronger variance is observed within and between tasks (Figure 2).
Overall, we obtained ten measures per participant, 41 measures per website and per task and between four and five measures per combination of task and website. For these four components, the standard deviations of random effects were estimated via MCMC sampling. The Stan model specification can be found in the appendix.
The posterior distribution for the standard deviation per random effect was recorded and is shown in Figure 4. If Egan's claim were true, the standard deviation of the user-level random effect should clearly exceed that of the design-level random effect. Indeed, it appears that the variance of the design random effect leans towards zero, making the smallest contribution to overall variance. The user-level random effect is clearly above zero; however, there is a strong overlap with the design-level random effect. The strongest impact on variance comes from tasks and the 100 combinations of designs and tasks.
The narrow posterior distribution of design x task represents a rather firm belief that the standard deviation of the respective random effect is around 0.5. The 95% credibility interval of the design x task standard deviation does not overlap with those of subject and design. Hence, we may view this difference to be statistically significant (α<.05).
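The comparison of credibility intervals can be sketched as follows. This is an illustration under stated assumptions, not the analysis code of the study: the helper names `credible_interval` and `intervals_overlap` are hypothetical, the interval is a simple equal-tailed one based on sorted draws, and the two lists of draws are made up to stand in for MCMC output.

```python
def credible_interval(draws, mass=0.95):
    """Equal-tailed interval covering `mass` of the posterior draws."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    tail = (1.0 - mass) / 2.0
    # Nearest-rank positions of the lower and upper quantiles.
    lower = ordered[round(tail * last)]
    upper = ordered[round((1.0 - tail) * last)]
    return lower, upper

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up draws for two standard deviations; NOT the study's posterior.
sd_design_task = [0.4 + 0.2 * i / 1000 for i in range(1001)]
sd_design = [0.3 * i / 1000 for i in range(1001)]

ci_design_task = credible_interval(sd_design_task)
ci_design = credible_interval(sd_design)
print(ci_design_task, ci_design, intervals_overlap(ci_design_task, ci_design))
```

With real MCMC output, the lists would hold the sampled standard deviations of the respective random effects, one value per retained iteration.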

DISCUSSION
Egan's claim was disconfirmed in our study. Differences between users do not cause much stronger variance than designs do, as both random effects were on about the same level.
Furthermore, the results point to an extension of Egan's claim: performance varied most strongly across tasks and across designs conditional on the task. The possibility of task-based variability is plausible for informational websites, where thousands of information items compete for promotion to the most reachable positions. The strong design x task variability indicates that information architects do not fully agree on the priority of tasks on university websites.
Prioritizing tasks is an issue of user requirements analysis. The priority of any task requirement depends on its impact, its frequency, and the development costs. Setting the development costs aside, the priority for an information item can be approximated as

priority = frequency x impact

Frequency of a task can be estimated as the number of expected transactions per time period. For example, schedules are (in our experience) frequently accessed items on university websites. Yet finding the schedule of the biology study was one of the most difficult tasks in the study (Figure 2).
Impact of tasks depends on the expected consequences. For example, we would regard the consequences as severe when a student with personal trouble fails to become aware of the university's psychological counselling service. (This was another difficult task as observed in the study, see Figure 2.) High losses can occur for the student and the university if this student has to repeat courses due to untreated troubles, or drops out completely.
One would not necessarily try to minimize only the average loss. Long before the ideas of universal design and accessibility became widespread, Egan and Gomez outlined an approach to what they called robust designs [6]. A robust design "should result in more uniformly high performance across users. This approach is similar to standard human-interface design, except that it is shaped by a concern for the variability among users." Given this definition, robust designs can be said to adhere to the minimax principle by maximizing the minimum performance of users. Reviewing Figure 1 once again, some designs do better than others in that respect: the website of RU Groningen can be regarded as a robust design, as all trials could be performed in fewer than 15 steps. In contrast, on the Antwerp university website, two attempts at finding the ombudsperson took 19 and 22 steps.
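The difference between minimizing average loss and the minimax principle can be made concrete with a small sketch. It treats path length as the loss; the function names and the three designs with their path lengths are hypothetical and are not data from the study.

```python
def best_on_average(designs):
    """Name of the design with the smallest mean path length."""
    return min(designs, key=lambda name: sum(designs[name]) / len(designs[name]))

def most_robust(designs):
    """Name of the design with the smallest worst-case path length (minimax)."""
    return min(designs, key=lambda name: max(designs[name]))

# Hypothetical path lengths per design; NOT taken from the study.
example = {
    "design_a": [2, 3, 5, 14],
    "design_b": [1, 1, 2, 22],
    "design_c": [5, 6, 7, 9],
}

print(best_on_average(example))  # prints: design_a
print(most_robust(example))      # prints: design_c
```

In this made-up example, the design that is best on average is not the most robust one: the minimax rule accepts a somewhat longer typical path in exchange for avoiding very long worst cases.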
Egan and Gomez's approach rests on assaying the sources of human variability (e.g., spatial ability), isolating and accommodating the design parameters that cause the most variance. For example, changing the command key control of a text editor into a menu based interaction effectively reduced the number of errors made by elderly users. Interestingly, the benefit for the elderly came at the expense of the performance of younger users. In Figure 4 of [5], a clear interaction effect is visible, reflecting the general complication of Egan and Gomez's approach: design parameters and user traits are in many ways conditional on each other, resulting in an enormous complexity for design. The whole approach may have worked well for the comparably simple (and few) computer system designs of 1985. But it is unlikely that the same strategy is efficient for modern systems, where the user population is much broader and thousands of design parameters have to be orchestrated, such as the many options for structuring a website.
Whereas designing for diverse users and multiple tasks is a burden, the multitude of available designs of today's computer systems is a resource. As an alternative to the factorial approach of Egan and Gomez, we propose to regard designs as a population and select the "fittest" by testing samples of designs against samples of users and tasks. Designs with uniformly high performance could be selected as references for systems yet to be built. In our small scale example, we would perhaps select the website of RU Groningen as a reference, as it minimizes variance at an acceptable average level.
The depicted experimental design and method of data analysis can be extended to support such a "cherry picking" strategy. The incomplete balanced block design scales up to larger populations of designs, tasks and users, simply by increasing the sample sizes. The statistical model can be extended to also capture variance within individual designs. The process of measurement can further be simplified by analysing time-to-completion measures instead of path length. This would allow for self-administered remote tests, or even fully crowd-sourced studies.
Furthermore, the Bayesian approach connects well to rational decision making. Posterior distributions can easily be connected to loss functions, preparing for decision rules, such as the minimax principle.
The cherry picking approach is fundamentally different from most research in universal design, as it builds solely on empirical measures, in contrast to psychological or design theory. It appears most promising in domains where serving a diverse population of users is mandatory. But it also requires a large population of diverse designs. Example domains with such characteristics are municipal websites, e-commerce and online banking websites, and, perhaps, certain categories of smartphone apps.
The approach outlined here requires modern statistics and decision theory, but at the same time rests on a fundamental idea of HCI: performance lies in the interaction of users, designs and tasks.

Figure 1. Distribution of path length per design. Dots indicate average path length.

Figure 4. Posterior distribution of standard deviation estimates for random effects.