Measuring User Experience in Conversational Interfaces: A Comparison of Six Questionnaires

User experience (UX) has become an important aspect in the evaluation of interactive systems. In parallel, conversational interfaces have been increasingly used in many work and everyday settings. Although various methods have been developed to evaluate conversational interfaces, there has been a lack of methods focusing specifically on evaluating user experience. This study reviews the six main questionnaires for evaluating conversational systems in order to assess their potential suitability for measuring various UX dimensions. We found that (i) four questionnaires included assessment items, to varying extents, to measure hedonic, aesthetic and pragmatic dimensions of UX; (ii) two questionnaires assessed affect, and one assessed the frustration dimension; and (iii) the enchantment, playfulness and motivation dimensions have not been covered sufficiently by any questionnaire. We recommend using multiple questionnaires to obtain a more complete measurement of user experience or to improve the assessment of a particular UX dimension.


INTRODUCTION
User experience (UX) has become an important aspect of interactive system evaluations in the last two decades. Although the need to understand and evaluate user experience is increasingly acknowledged, two extensive reviews found no consensus on how to define and evaluate it (Law et al. 2009; Bargas-Avila and Hornbaek 2011).
According to ISO (2010), user experience is a "person's perceptions and responses resulting from the use and/or anticipated use of a product, system or service". According to a survey study of 275 UX researchers and practitioners (Law et al. 2009), this definition is in line with the views of most respondents about the subjectivity of UX. There are also three notes added to this definition. The first emphasises the extensive range of UX dimensions to be considered at different stages of using a product: "User experience includes all the users' emotions, beliefs, preferences, perceptions, physical and psychological responses, behaviours and accomplishments that occur before, during and after use". The second note draws our attention to the various brand-related, system-related, user-related, and context-related factors shaping UX: "User experience is a consequence of brand image, presentation, functionality, system performance, interactive behaviour and assistive capabilities of the interactive system, the user's internal and physical state resulting from prior experiences, attitudes, skills and personality, and the context of use". Finally, the third note aims to clarify the relationship between usability and UX: "Usability, when interpreted from the perspective of the users' personal goals, can include the kind of perceptual and emotional aspects typically associated with user experience. Usability criteria can be used to assess aspects of user experience". Although the ISO's main definition and the notes provide a valuable understanding of UX and its broad scope, from a more practical viewpoint, it is still not clear what constitutes UX for a particular user interacting with a particular product in a particular context. There is a lack of information explaining what UX dimensions need to be considered for different types of projects.
there are more UX dimensions, but they are less frequently assessed. In our study, we will use the list of UX dimensions offered by Bargas-Avila and Hornbaek (2011) as a basis for our assessment.
It is important to situate the focus of this study within the larger landscape of UX approaches. Battarbee and Koskinen (2005) usefully summarised UX approaches in three categories: the measuring approach, the empathic approach, and the pragmatist approach. The measuring approach focuses on the aspects of user experience that can be measured directly by physical reactions of bodies or by subjective reporting. The empathic approach is based on developing a rich understanding of users' needs, desires, dreams, and motivations through various formative methods involving visual and textual data, and creative tasks in the design phase. This approach aims to project future user experience and inspire designers rather than assess a current user experience with a system. The pragmatist approach, which is informed by pragmatist philosophy (Dewey 1934), provides a holistic view of user experience focusing on understanding interactions between users, technologies and environment as the indivisible constituents of experience. Studies employing the pragmatist approach tend to be theoretical and do not provide practical guidance on design and evaluation. Rather, they focus on increasing awareness of and sensitivity to the irreducibility and embodied nature of experience. This study is situated within the measuring approach. The focus is upon questionnaires used for evaluating user experience and/or subjective user satisfaction in conversational interfaces.
To understand what the current standardised questionnaires for evaluating conversational systems can offer in measuring user experience, we analysed the assessment items listed in the six main questionnaires and coded them according to their association with UX dimensions frequently assessed in the UX literature. This paper presents work focusing on understanding the extent to which UX-related items are present in these questionnaires. Our intention is to use the questionnaires' coverage of UX dimensions as a preliminary step towards assessing their suitability to measure user experience. Our assessment is based solely on the presence of UX-related items in the questionnaires. Therefore, what this study offers is an initial assessment of the potential suitability of the commonly used questionnaires to measure UX.

USER EXPERIENCE IN CONVERSATIONAL INTERFACES
McTear, Callejas and Griol (2016) define conversational interface as "the technology that supports conversational interaction with virtual personal assistants by means of speech or other modalities" (p. 11). They explain that the rising popularity of conversational interfaces has been facilitated by a renaissance in Artificial Intelligence, the development of powerful processors supporting deep learning algorithms, and advances in the Semantic Web making large amounts of knowledge available online.
A conversational system comprises various modules including Automatic Speech Recognition, Natural Language Understanding, Dialogue Management, Natural Language Generation, and Speech Synthesis (López-Cózar et al. 2011). There are specific measures defined for each module. For example, the word error rate is used for Automatic Speech Recognition, and the rate of out of vocabulary words is used for Natural Language Generation (López-Cózar et al. 2011). In terms of UX, there is no specific module directly responsible for UX; rather, all system modules play a role in shaping UX. There is very little work on evaluating UX in conversational interfaces. One approach specifically focusing on assessing user experience of multimodal dialogue systems is SUXES proposed by Turunen et al. (2009). It is a complete evaluation procedure with four phases involving three questionnaires.
Although the authors of SUXES provided their user experience questionnaire's constructs, the actual questionnaire statements were not presented, making their proposal difficult to evaluate.
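As an illustration of the module-level measures mentioned above, the word error rate can be sketched as the word-level edit distance between a reference transcript and the recogniser's output, divided by the number of reference words. The function name and the whitespace tokenisation below are our own assumptions for illustration; they are not taken from López-Cózar et al. (2011).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("book a table for two", "book the table for two"))  # 0.2
```

A measure of this kind describes a single module only; it says nothing directly about the experiential dimensions discussed in this paper.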
There are many studies explicitly mentioning UX in conversational systems literature in which UX has been variably defined as: usability (Bijani et al. 2013; Tchankue et al. 2010), something going beyond usability (De Carolis et al. 2010), an aspect of usability (Soronen et al. 2009), user satisfaction (Goulati & Szostak 2011; Xu et al. 2013), and a combination of ease of use, overall feeling and user satisfaction (Wulf et al. 2014; Lee & Choi 2017). Therefore, in parallel to the lack of a clear understanding of UX in the field of HCI, there are large variations between the conceptions of UX in conversational systems as well. In general, the main tendency appears to be conceiving UX as a design and evaluation factor going beyond usability and closely associated with user satisfaction.
In the next section, we will introduce some major standardised questionnaires used in conversational interfaces literature. In addition, we will present some other frequently used non-standardised questionnaires.

STANDARDISED QUESTIONNAIRES
A standardised questionnaire involves an established procedure for collecting and presenting the measurement, and psychometric qualification (Lewis 2016). Standardised questionnaires provide higher levels of reliability (Sauro & Lewis 2009; Hornbaek & Law 2007) and facilitate an easier comparison between similar studies (Hornbaek 2006).
To determine a list of questionnaires for our assessment, we examined recent review studies (Dybkjaer et al. 2004; Kühnel 2012; Larsen 2003; Lewis 2016; McTear, Callejas & Griol 2016; Wechsung 2014; Wechsung & Naumann 2008; Wechsung et al. 2012) and identified the major questionnaires for evaluating conversational interfaces. Then, we compiled a list including the standardised questionnaires on evaluating user experience or subjective user satisfaction in conversational interfaces. Our final list included the AttrakDiff, the Subjective Assessment of Speech System Interfaces (SASSI), the Speech User Interface Service Quality (SUISQ), the Mean Opinion Scale (MOS), the Paradigm for Dialogue System Evaluation (PARADISE), and the System Usability Scale (SUS). We eliminated some other well-known questionnaires including the Software Usability Measurement Inventory (SUMI), the Questionnaire for User Interaction Satisfaction (QUIS), SUXES, the TRINDI Tick List and the DISC Dialogue Management Grid, because the SUMI and QUIS focus heavily on graphical user interfaces, and TRINDI, DISC and SUXES do not have a validated questionnaire. We included the non-standardised PARADISE and SUS questionnaires in our final list as both of them have been widely used as measurement tools.

AttrakDiff
The AttrakDiff (Hassenzahl et al. 2003) is one of the most frequently used standardised questionnaires in the HCI field to measure hedonic qualities. Although it has an explicit focus on hedonic qualities, it also measures pragmatic qualities and overall appeal of a product. The AttrakDiff has a strong theoretical basis informed by Hassenzahl's (2003) model of user experience. The model proposes that a product can have two main qualities: hedonic and pragmatic. Hedonic qualities refer to the capacity of a product to "support the achievement of 'be-goals,' such as 'being competent,' 'being related to others,' 'being special'" (Hassenzahl et al. 2008, p. 473). Pragmatic qualities refer to the capacity of a product to "support the achievement of 'do-goals,' such as 'making a telephone call,' 'finding a book in an online bookstore' or 'setting-up a webpage'" (Hassenzahl et al. 2008, p. 473). While hedonic qualities support stimulation, communicate identity and provoke memory, pragmatic qualities support instrumental and task-related features of a product, ensuring effective and efficient means to perform a task (Hassenzahl 2003). In addition to hedonic and pragmatic qualities, the AttrakDiff measures overall appeal of a product as a result of its hedonic and pragmatic qualities. The AttrakDiff contains 28 items in three categories: pragmatic quality, hedonic quality, and attractiveness. It is important to note that the theoretical model behind the AttrakDiff does not try to measure emotions such as fun, satisfaction, joy, or anger as they are considered "consequences of a cognitive appraisal process" (Hassenzahl 2003, p. 483). As we will discuss later, the AttrakDiff may need to be complemented by another measuring tool to capture affect-related dimensions of UX.

SASSI
The Subjective Assessment of Speech System Interfaces (SASSI) (Hone & Graham 2000) questionnaire is one of the most commonly used standardised measuring tools for assessing subjective satisfaction with speech-based interfaces (Larsen 2003). SASSI focuses on the speech input quality while excluding the speech output. Although this is a limitation, a wide range of factors related to user experience are assessed by the SASSI's 34 items in six categories: system response accuracy, likeability, cognitive demand, annoyance, habitability, and speed (Hone & Graham 2000). Different from the other questionnaires, the SASSI also assesses habitability, which "refers to the extent to which the user knows what to do and knows what the system is doing" (Hone & Graham 2000, p. 300). Hone and Graham (2000) explain that speech systems needed a design quality similar to the visibility used in graphical user interfaces. Since the term visibility was not suitable for voice interfaces, they preferred to use the term habitability to assess the degree of a match between the user's conceptual model and the actual system and its behaviour.

SUISQ
The Speech User Interface Service Quality (SUISQ) questionnaire is a measuring instrument developed to assess service quality of speech interfaces (Polkosky 2005, 2008). The SUISQ has a total of 25 items in four categories validated by a principal component analysis (Polkosky 2005): user goal orientation, speech characteristics, verbosity, and customer service behaviour. Polkosky (2008) found that all four categories significantly correlated with customer satisfaction. Compared to the SASSI, the SUISQ has five items in the speech characteristics category assessing speech output quality. One unique category of the SUISQ is customer service behaviour, which assesses "the extent to which the system's behaviour is similar to the expectations of human service providers" (Polkosky 2008, p. 48). This category covers aspects of the system such as politeness, friendliness, and professional attitude. Polkosky's (2008) research shows that expectations associated with human conversations, customer service, and interpersonal interaction play a role in people's judgement of speech user interfaces.

MOS-X
The Mean Opinion Scale (MOS) (Schmidt-Nielsen 1995) is a widely used measurement tool for evaluating the quality of synthetically created speech. It is a Likert-style questionnaire assessing intelligibility and naturalness of the synthetic speech with seven items: (i) Global Impression, (ii) Listening Effort, (iii) Comprehension Problems, (iv) Speech Sound Articulation, (v) Pronunciation, (vi) Speaking Rate, and (vii) Voice Pleasantness. In order to assess a larger range of voice characteristics, MOS-Expanded (MOS-X) has been developed with eight additional assessment items: (i) Voice Quality, (ii) Emphasis, (iii) Rhythm, (iv) Intonation, (v) Trust, (vi) Confidence, (vii) Enthusiasm, and (viii) Persuasiveness (Polkosky & Lewis 2003). With its fifteen items covering naturalness, intelligibility, prosody, and social impression, the MOS-X is a very valuable instrument for assessing synthetic voice and speech quality, which are important constituents of the user experience of conversational interfaces. However, given its very specific focus on voice quality, the MOS-X by itself is not sufficient to evaluate core usability and many other user experience dimensions of such systems.

PARADISE
The Paradigm for Dialogue System Evaluation (PARADISE) is a general framework involving a model for predicting user satisfaction and a user satisfaction survey. PARADISE's model is based on a weighted linear combination of task success measures, dialogue costs, and a user satisfaction survey with eight questions on usability aspects of interacting with a system. Despite being widely used, PARADISE has been criticised for not describing how user satisfaction is actually measured and for the omission of psychometric validation of its user satisfaction survey (Hone & Graham 2001). Another limitation is that PARADISE's model is based on the assumption that minimising dialogue cost and maximising task success would maximise user satisfaction. Although dialogue cost and task success play a key role in improving the usability of a system, this formulation of user satisfaction is too reductive, ignoring various experiential and emotional aspects of user-system interactions.
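To make the idea of a weighted linear combination concrete, the sketch below scores each dialogue as a weighted, normalised task success value minus a weighted sum of normalised dialogue costs. This follows the form in which the PARADISE performance function is commonly described; the function names, the example cost measure and the weights are hypothetical, and in practice the weights would be estimated from data (typically by regressing user satisfaction ratings on the measures) rather than chosen by hand.

```python
from statistics import mean, pstdev


def z_normalise(values):
    """Return z-scores so that measures on different scales can be combined."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def performance_scores(task_success, costs, alpha, cost_weights):
    """Weighted linear combination of task success and dialogue costs.

    task_success: one task success value per dialogue.
    costs: mapping from a cost name to one value per dialogue.
    alpha, cost_weights: hypothetical weights chosen for illustration.
    """
    norm_success = z_normalise(task_success)
    norm_costs = {name: z_normalise(vals) for name, vals in costs.items()}
    scores = []
    for d in range(len(task_success)):
        cost_term = sum(cost_weights[name] * norm_costs[name][d] for name in costs)
        scores.append(alpha * norm_success[d] - cost_term)
    return scores


# Two dialogues: the second succeeds but needs twice as many turns.
print(performance_scores([0.0, 1.0], {"turns": [10, 20]}, 1.0, {"turns": 0.5}))
```

The sketch also makes the criticism above visible: nothing in the score reflects affect, hedonic quality or any other experiential dimension.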

SUS
The System Usability Scale (SUS) (Brooke 1996) is the most well-known and widely used questionnaire to evaluate the usability of interactive systems. It is a very simple ten-item Likert scale assessing the perceived ease of use and learnability of a system. Each questionnaire item corresponds to a statement expressed in a very general form from the perspective of a user. The non-specific formulation of the statements has allowed the SUS to be used in many different contexts to evaluate many different systems. Many variations of the SUS have replaced the term "system" in the original statements with the terms "website" or "mobile app" to make them suitable for their own application domain. The SUS has been employed in evaluating many conversational systems (e.g. Hoque et al. 2013; DeVault et al. 2014).
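For readers unfamiliar with how the ten SUS ratings are turned into a single score, the following sketch applies the standard scoring rule from Brooke (1996): odd-numbered (positively worded) items contribute the rating minus 1, even-numbered (negatively worded) items contribute 5 minus the rating, and the sum is multiplied by 2.5 to give a score between 0 and 100. The function name is our own.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten ratings on a 1-5 scale.

    responses: ratings in questionnaire order (item 1 first).
    """
    if len(responses) != 10:
        raise ValueError("the SUS has exactly ten items")
    if any(r not in range(1, 6) for r in responses):
        raise ValueError("each rating must be an integer from 1 to 5")

    total = 0
    for index, rating in enumerate(responses, start=1):
        if index % 2 == 1:
            total += rating - 1   # positively worded item
        else:
            total += 5 - rating   # negatively worded item
    return total * 2.5


print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
print(sus_score([3] * 10))                        # 50.0
```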

Others
The TRINDI Tick-List (Bos et al. 1999) and DISC Dialogue Management Grid (Bernsen et al. 1999) are two similar non-standardised measuring tools focusing on evaluating conversational capabilities of a system. TRINDI with its seventeen assessment items covers a larger range of factors than DISC with its nine items. Their questions are not targeted to users; rather, they are formulated for system designers to be used as a heuristic assessment tool. Some example dialogue capabilities to be assessed include: "Is the utterance interpretation sensitive to dialogue context?" (TRINDI), "Can the system deal with answers to questions that give different information than was actually requested?" (TRINDI), and "Can indirect speech acts be handled?" (DISC). All the assessment items of both TRINDI and DISC are associated with the instrumental aspects of conversational interfaces. Neither of them is a complete evaluation tool, but they provide a useful set of questions for system designers to think about and test the conversational competence of the systems they are designing.

ASSESSMENT OF THE QUESTIONNAIRES
In this section, we will first explain our method to assess the six questionnaires. Then, we will present the results with a table and six radar charts.

Method
To understand what these questionnaires can offer in measuring user experience, the first step was to identify the main dimensions of UX. However, as discussed earlier, there are no commonly accepted UX dimensions or factors. Therefore, we decided to use the most commonly assessed UX dimensions in the literature as identified by the systematic review study of Bargas-Avila and Hornbaek (2011).

Enjoyment/Fun
This dimension refers to playful interactions characterised by perceptions of pleasure and involvement (Webster et al. 1993). The focus is on how enjoyable, fun, or playful it is to use a system.

Aesthetics/Appeal
This dimension refers to classical aesthetics emphasising clean and orderly design, and expressive aesthetics associated with the qualities of creativity and novelty (Lavie & Tractinsky 2004). It also covers physical and sensory features of a product or an interaction that result in attractiveness or appeal (Hassenzahl 2001).

Hedonic Quality
This dimension refers to "the product's perceived ability to support the achievement of 'be-goals,' such as 'being competent,' 'being related to others,' 'being special'" (Hassenzahl et al. 2008, p. 473).

Engagement/Flow
A person's feelings such as control, attention focus, curiosity, and intrinsic interest (Jackson & Marsh 1996).

Motivation
This dimension refers to "internal factors that impel action and to external factors that can act as inducements to action" (Locke & Latham 2004, p. 388).
Product or interaction attributes such as motivating and discouraging (O'Brien & Toms 2008), or any expressions of rationale behind using or interacting with a product (Jacobson & Pirinen 2007).

Enchantment
This dimension refers to a state in which a sense of disorientation co-exists with heightened levels of perception and attention (McCarthy et al. 2006).

Frustration
This dimension refers to the disliked aspects of a product or an interaction (Blythe et al. 2006;Hone & Graham 2000).
Product or interaction attributes such as repetitive, boring, irritating and frustrating (Hone & Graham 2000).

Pragmatic Quality b
This dimension refers to "the product's perceived ability to support the achievement of 'do-goals,' such as 'making a telephone call,' 'finding a book in an online bookstore' or 'setting-up a webpage'" (Hassenzahl et al. 2008, p. 473). Pragmatic qualities correspond to instrumental and task-related features of a product providing effective and efficient means to perform a task (Hassenzahl 2001).
a Adapted from Bargas-Avila and Hornbaek (2011). b Added because instrumental and ergonomic aspects are considered part of UX dimensions (Hassenzahl 2001; ISO 2010).
We added pragmatic quality to our assessment scheme as an additional dimension. However, it is possible to exclude pragmatic quality from our final assessment to understand the non-instrumental UX dimensions of a questionnaire.
After determining the UX dimensions, we performed an extensive literature search to obtain relevant attributes characterising each UX dimension. We started with the references cited in Bargas-Avila and Hornbaek's (2011) study to determine a few attributes, and then used these attributes as a guide to compile a set of relevant attributes for each UX dimension. It is important to note that the set of attributes we identified is not complete; rather, it is representative and indicative.
The aim of having a set of attributes was to use them as a frame of reference to understand the overall scope and characteristics of a particular UX dimension. Table 1 shows the final assessment scheme with the UX dimensions, definitions, and corresponding lists of attributes. The next step involved a coding process with two researchers coming together and aligning their understandings of each UX dimension and their attributes. This was followed by the actual coding activity in which the two researchers performed coding of all the items in each questionnaire independently. After completing the coding, the researchers came together again to compare their assessments. In the cases of conflicting assessments involving differently coded items, the researchers worked together to agree on a final assessment for each item by discussing the

rationale behind their initial coding. A few sample questionnaire items with their associated UX dimensions are listed in Table 2.

Table 2. Sample questionnaire items and their associated UX dimensions
Questionnaire Items | UX Dimension(s)
I felt tense using the system. (SASSI) | Affect/Emotion
The product is: (AttrakDiff) | Motivation, Engagement/Flow
The system's voice sounded like people I hear on the radio or television. (SUISQ) | Aesthetics/Appeal
Did the voice appear to be trustworthy? (MOS-X) | Hedonic Quality
From your current experience with using the system, do you think you'd use the system regularly when you are away from your desk? (PARADISE) | Generic UX
I found the system unnecessarily complex. (SUS) | Pragmatic Quality
Notes. Double- or triple-coding of an item was possible because one item could be associated with multiple UX dimensions.
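The comparison step of the coding process can be sketched as follows: each researcher's coding is represented as a mapping from a questionnaire item to a set of UX dimensions, and any item whose two sets differ is flagged for discussion. The function name, the data layout and the second coder's codes are hypothetical illustrations; they do not reproduce our actual coding sheets.

```python
def compare_codings(coder_a, coder_b):
    """Return the items coded differently and the share of items coded identically.

    coder_a, coder_b: mapping from item text to an iterable of UX dimensions.
    """
    items = sorted(set(coder_a) | set(coder_b))
    conflicts = {}
    for item in items:
        codes_a = set(coder_a.get(item, ()))
        codes_b = set(coder_b.get(item, ()))
        if codes_a != codes_b:
            conflicts[item] = (codes_a, codes_b)
    agreement = 1 - len(conflicts) / len(items) if items else 1.0
    return conflicts, agreement


# Hypothetical codings of two sample items by two researchers.
coder_a = {
    "I felt tense using the system.": {"Affect/Emotion"},
    "I found the system unnecessarily complex.": {"Pragmatic Quality"},
}
coder_b = {
    "I felt tense using the system.": {"Affect/Emotion"},
    "I found the system unnecessarily complex.": {"Pragmatic Quality", "Frustration"},
}

conflicts, agreement = compare_codings(coder_a, coder_b)
for item, (codes_a, codes_b) in conflicts.items():
    print(item, sorted(codes_a), sorted(codes_b))
print(agreement)  # 0.5
```

Using sets rather than single labels reflects the fact that one item may be double- or triple-coded.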

Results
The results of our assessment are available in Table 3. The other three questionnaires, the MOS-X, SUS and PARADISE, provided very little coverage of UX dimensions. Amongst the three, the MOS-X was the only questionnaire with assessment items in the hedonic (4 items) and aesthetics/appeal (6 items) categories. The SUS and PARADISE provided the majority of their assessment items in pragmatic quality (9 and 7 items respectively) while having a single item in generic UX and engagement/flow. Across all questionnaires, enchantment, motivation and enjoyment/fun were the most neglected UX dimensions, whereas the pragmatic quality dimension was the most commonly assessed. Another important missing dimension was affect/emotion, which was only assessed by the SASSI with eight items and the SUISQ with three items. The dimension of enchantment was not assessed by any questionnaire at all. This is possibly due to the ambiguous and complex nature of the concept. No studies in Bargas-Avila and Hornbaek's (2011) systematic review reported how enchantment could be measured, suggesting that enchantment is possibly a UX dimension better suited to qualitative, interview-style approaches.
Notes. The sum of the items per questionnaire is greater than a questionnaire's actual total number of items, because one item can be associated with several UX dimensions.
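The way per-dimension item counts are obtained, and why they can sum to more than the number of items, can be sketched as below. The three coded items are taken from the sample items listed with their UX dimensions earlier; treating them as one small questionnaire, and the function name, are assumptions made purely for illustration.

```python
from collections import Counter


def dimension_coverage(coded_items):
    """Count how many items are associated with each UX dimension.

    coded_items: mapping from item text to the UX dimensions it was coded with.
    """
    counts = Counter()
    for dimensions in coded_items.values():
        counts.update(set(dimensions))  # an item counts once per dimension
    return counts


# Illustrative mini-questionnaire built from three sample items.
coded_items = {
    "I felt tense using the system.": {"Affect/Emotion"},
    "The product is:": {"Motivation", "Engagement/Flow"},
    "I found the system unnecessarily complex.": {"Pragmatic Quality"},
}

coverage = dimension_coverage(coded_items)
print(dict(coverage))
# The double-coded item makes the total exceed the number of items.
print(sum(coverage.values()), len(coded_items))  # 4 3
```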

DISCUSSION
The results indicate that there is no questionnaire providing sufficient coverage across all UX dimensions. This is understandable: having assessment items for every dimension would require a large number of items in a single questionnaire, making it very long, less practical, and daunting for users to complete. Bradley and Lang (1994), the authors of the Self-Assessment Manikin (SAM), have raised a similar point on practicality. They explained that the Semantic Differential Scale (SDS) devised by Mehrabian and Russell (1974) has 18 bipolar adjective pairs to be rated along a 9-point scale, requiring a heavy investment of time and effort (Bradley & Lang 1994). Their study showed that ratings on the SAM, with its much simpler three major affective dimensions, correlated highly with ratings on the SDS, suggesting that a simpler assessment method could effectively and efficiently provide an understanding of the overall affective states of users.
Similarly, it was argued that expressive aesthetics and hedonic quality are strongly overlapping constructs (Hassenzahl & Monk 2010), and a further consolidation of these constructs is attainable (Bargas-Avila & Hornbaek, 2011).

Recommended Questionnaires
In terms of selecting which questionnaire(s) to use, it is possible to recommend a few alternatives based on the needs of different cases. Our main rationale for recommending a particular questionnaire is its coverage of the relevant UX dimension. The questionnaire with the highest number of items associated with a UX dimension is recommended. Table 4 shows the recommended questionnaires to use according to the desired UX dimension and the preference on using single or multiple questionnaires. While the second column includes the questionnaires with the highest coverage of the relevant UX dimension assessed by this study, the third column offers some other widely used and sufficiently generic questionnaires to complement the single questionnaire:  . 1989). While the items in the SAM, APS, and FSS can be used without much modification, the IMI's items need to be modified slightly to fit the specific activities in focus.
To illustrate how our recommendation can be understood, if a study is interested in the affective states of people when interacting with a conversational agent, the SASSI would be recommended as a single questionnaire of choice; or, the SASSI and the SAM as a combination of multiple questionnaires. Here, the reason for using the SASSI as a single questionnaire is that it has the highest number of items in affect/emotion, as well as many other items assessing other major UX dimensions. Therefore, in addition to having a large coverage in a desired UX dimension, the SASSI provides an assessment of other potentially relevant UX dimensions.
If feasible, using multiple questionnaires could be very useful. For example, in a study to understand the effects of an individual's motivational orientation and particular product attributes on the perceived value of interactive products, Hassenzahl et al. (2008) used the SAM's affective measurements to complement the AttrakDiff's assessment of hedonic, aesthetic, and pragmatic qualities. In this respect, their strategy was to obtain a more complete assessment that covers a larger range of UX dimensions rather than focusing on a particular UX dimension. Moreover, in some cases, a questionnaire with a large number of items in one dimension may not be needed, as the same assessment might be performed with fewer items by another questionnaire. Practicality and putting less demand on users are important concerns when conducting user testing and evaluation studies. Therefore, depending on a study's needs and focus, the questionnaire with fewer items on a desired UX dimension might be preferred if the validity of the shorter questionnaire for the desired factor has been established.
If it is only feasible to use one questionnaire for assessing the overall UX, then the SASSI is potentially the most suitable choice with its assessment items associated with a large range of UX dimensions. However, coupling it with the AttrakDiff would provide an even larger coverage of the UX dimensions with an increased focus on hedonic, engagement and aesthetic qualities.
Depending on the availability of resources, these two could be further complemented by the MOS-X with its items assessing voice- and speech-focused aspects of user experience.
In terms of assessing mainly instrumental aspects of a conversational interface, again the SASSI would be recommended as a single measurement tool. However, the 10-item SUS with its effective and efficient assessment of usability could be a more economical and practical alternative. In addition, the TRINDI Tick List and DISC with their specific set of items focusing on conversational competence would provide useful ways to assess some core functionalities needed in a typical conversational system.
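The selection rule described above (recommend the questionnaire with the most items in the desired UX dimension) can be sketched as a simple lookup. The counts below are only those quoted in the Results section for affect/emotion, hedonic quality and aesthetics/appeal, so the table is deliberately partial; the function name and data layout are our own assumptions.

```python
def recommend(coverage, dimension):
    """Return the questionnaire(s) with the most items in the given dimension.

    coverage: mapping questionnaire -> {UX dimension: number of items}.
    Returns an empty list when no questionnaire covers the dimension.
    """
    counts = {name: dims.get(dimension, 0) for name, dims in coverage.items()}
    best = max(counts.values(), default=0)
    if best == 0:
        return []
    return sorted(name for name, n in counts.items() if n == best)


# Partial coverage table: only counts stated in the Results section.
coverage = {
    "SASSI": {"Affect/Emotion": 8},
    "SUISQ": {"Affect/Emotion": 3},
    "MOS-X": {"Hedonic Quality": 4, "Aesthetics/Appeal": 6},
}

print(recommend(coverage, "Affect/Emotion"))  # ['SASSI']
print(recommend(coverage, "Enchantment"))     # []
```

An empty result, as for enchantment, signals a dimension for which a complementary instrument or a qualitative method would be needed.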

Combining Subscales of Different Standardised Questionnaires
Instead of using multiple questionnaires separately to obtain a more complete assessment of UX, combining the subscales of some of the standardised questionnaires into a single questionnaire might be an alternative solution. For example, Polkosky and Lewis (2003) worked on the MOS questionnaire to improve its reliability and added new items to the questionnaire to extend its coverage. Similarly, Lewis and Hardzinski (2015) worked on the SUISQ questionnaire and produced a shortened version referred to as the SUISQ-R. In both cases, the authors retested the validity and reliability of the new questionnaires, and they obtained results comparable with the original questionnaires. Therefore, a similar approach can be employed across the subscales of the standardised questionnaires presented in this paper to construct a new questionnaire with a larger coverage of UX dimensions. However, this alternative has some potential problems in relation to internal reliability, content validity, and construct validity. Although the subscales of the original questionnaires are reliable and validated, they need to be revalidated and their reliability needs to be retested as part of a new questionnaire. In addition, the subscales of these questionnaires are not fully aligned with the UX dimensions identified in this paper. In some cases, the questionnaires' subscales include a few items that are not relevant to measuring the desired UX dimension, and using a subscale with some irrelevant items can violate content validity (Haynes, Richard & Kubany 1995). There are also other difficulties associated with the different types of scales employed by different questionnaires. For example, the AttrakDiff uses a semantic differential scale, but the SASSI and SUISQ use a Likert scale. Therefore, some kind of scale conversion is required. In addition, the differences between the degrees of specificity of semantic differential scale items and Likert scale items can be confusing for the questionnaire participants. As a result, there are some challenges against combining the subscales of the current standardised questionnaires. However, using these questionnaires' subscales as a starting point can shorten the development time of a new standardised questionnaire, which requires a substantial amount of work (Lewis & Hardzinski 2015).
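One simple form the scale conversion mentioned above could take is a linear rescaling of a rating from one numeric range to another. The sketch below is only an illustration under assumed ranges (a semantic differential item scored from -3 to 3 and a Likert item scored from 1 to 5); it puts ratings on a common numeric range but does not by itself establish that the converted scores are psychometrically equivalent, which would still require revalidation.

```python
def rescale(value, src_min, src_max, dst_min, dst_max):
    """Linearly map a rating from [src_min, src_max] to [dst_min, dst_max]."""
    if src_max <= src_min or dst_max <= dst_min:
        raise ValueError("each range must have min < max")
    if not src_min <= value <= src_max:
        raise ValueError("value lies outside the source range")
    fraction = (value - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


# Assumed ranges: semantic differential -3..3 mapped onto a 1..5 Likert range.
print(rescale(-3, -3, 3, 1, 5))  # 1.0
print(rescale(0, -3, 3, 1, 5))   # 3.0
print(rescale(3, -3, 3, 1, 5))   # 5.0
```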

Further UX Considerations and Dimensions
An evaluation consideration that can potentially gain more importance in relation to UX is the personality and attitude of a conversational agent. In this respect, the SUISQ is unique with its eight items from a customer service behaviour perspective. A few sample SUISQ statements include:
• The system seemed professional in its speaking style.
• The system seemed polite.
• The system used terms I am familiar with.
We believe UX measurement instruments for conversational interfaces can benefit from engaging with social-communicative theory and studies focusing on interpersonal communication and customer service behaviour. Polkosky's (2008) research provides a useful starting point.
Another potentially important UX dimension specific to conversational interfaces is habitability, defined as 'the extent to which the user knows what to do and knows what the system is doing' (Hone & Graham 2000, p. 300). Although it has not received enough attention in the field of spoken language systems, with the recent increase in interest in speech interfaces, habitability is likely to play an instrumental role in the design and evaluation of such systems, because it provides a fundamental design construct that can define "visibility" in voice user interfaces. Hone and Baber (2001) proposed a conceptualisation of habitability in relation to semantic, syntactic, lexical, dialogue, and recognition constraints operating over user utterances. While their proposal is useful, we believe the scope of habitability extends beyond the notions of constraints and visibility and includes other factors such as familiarity, emotional connection, and sense of agency.
Our study has focused on evaluating user experience from a measurement-oriented perspective. However, there are also other, more empathic and pragmatic perspectives on user experience evaluation that involve qualitative methods such as conversation analysis (Porcheron et al. 2018) and in-depth interviews with users (Luger & Sellen 2016). Conversation analysis studies allow us to understand the evolution of user experience over time indirectly, without asking users any questions, whereas in-depth interviews can provide richer data for understanding the various factors shaping users' experience with conversational interfaces.

CONCLUSION
In this paper, we have briefly reviewed UX approaches in the field of HCI and presented varied definitions of UX in conversational systems. We used the most frequently assessed UX dimensions and their relevant attributes to obtain an assessment scheme. Then, we employed the assessment scheme to understand the questionnaires' coverage of dimensions as a preliminary step towards assessing their suitability to measure UX. We found that (i) four questionnaires included assessment items, to varying extents, to measure the hedonic, aesthetic and pragmatic dimensions of UX; (ii) two questionnaires assessed affect, and one assessed the frustration dimension; and (iii) the enchantment, playfulness and motivation dimensions have not been covered sufficiently by any questionnaire. Our assessment suggests that the SASSI, with its large coverage spanning eight UX dimensions, would be a suitable questionnaire for overall UX measurement. Using multiple questionnaires may prove useful for obtaining a more complete measurement of user experience or improving the assessment of a particular UX dimension. Future work involves assessing the actual performance of these questionnaires in measuring the relevant UX dimensions.

Figure 1. The radar charts for the six questionnaires' coverage of UX dimensions. Notes. The dimension of Enchantment is excluded because none of the questionnaires contained any relevant items. The values on each UX dimension have been normalised to a range between 0 and 100.

Table 1. A User Experience Assessment Scheme for Standardised Questionnaires. Hedonic qualities support stimulation, communicate identity and provoke memory (Hassenzahl et al. 2008).

Table 2. Sample questionnaire items with their associated UX dimension(s)

Table 3. The number of items associated with each UX dimension in the six questionnaires

Table 4. Some recommended questionnaires when a single questionnaire or a combination of questionnaires is to be used. * Overall UX corresponds to the combination of all other UX dimensions.