A new metric scale for measuring trust towards holographic intelligent agents

Trust is an essential attitude in social relationships, but it also mediates our approach to certain technology. The definition of interpersonal trust, however, is too wide to expound our understanding of how trust impedes such interaction with technology, and the lack of an applicable quantifiable model in particular presents an obstacle to our quest of building reliable, trusted, and intelligent holographic agents. In this paper, we therefore develop a novel metric scale to measure trust. We identify, select, and refine over a hundred items related to trust, check their precision and validity with the help of a judges panel, and select polarising items that are able to bring out the distinctive characteristics regarding people’s trust towards intelligent agents. Our findings indicate that an assessment of trust involves looking at the user’s belief about the agent’s competence, integrity, benevolence, and compassion, which drive the attitude of trust, influenced by the user’s general propensity to trust. Trust then drives intention to engage and ultimately engagement, which, if successful, results in the establishment of a trust relationship with the agent. While we propose an item-response scale for measuring this model of trust, we also add our deliberations on how elements of it could be replaced with alternative means that possibly offer more immediacy than self-inspection, discussing in particular alternatives for measuring elements of compassion, competence, and social relationships.


INTRODUCTION
With the rise of compatible portable consumer devices (both mobile and head-mounted), Mixed Reality (MR) and Augmented Reality (AR) have become commonplace in Human-Computer Interaction, not least due to their ability to integrate both digital and physical information. Inspired by their success in VR games, the use of agents in AR and MR is well investigated. Holz et al. (2011) propose a taxonomy for Mixed Reality Agents focused on agency, corporeal presence, and their interactive capacity (see Figure 1). They define a Mixed Reality agent (MiRA) as "an agent embodied in a mixed reality environment", emphasising 'embodiment' in their circular definition and including robots. Based on that, Campbell et al. (2013) explore Augmented Reality agents (AuRAs) and their environmental contexts. Both definitions bring out how agents integrate into the physical world and what degree of behavioural realism they express. Although MiRAs emphasise the relationship of intelligence and adaptivity, the taxonomy neither indicates whether responses happen in real-time nor further defines the intelligent features of virtual agents in the mixed context. In our view, spatial mapping and interactive/environmental responsiveness are key to distinguishing holographic intelligent agents from MiRAs and AuRAs.
Besides, the intelligence of virtual agents is more than just adapting to a physical world with the help of embodiment; it also includes multimodal interaction (e.g., using natural language processing and dialogue understanding), as well as the capability to perceive and match users' demands. Moreover, responsiveness requires the agent to be able to interact with both virtual and physical surroundings in real-time. In our view, intelligent holographic agents ('holographic AIs') should have the capacity to react in real-time, driven by their analytical and inferential abilities and aimed at supporting the execution of the users' tasks in the mixed reality environment. Figure 2 represents this difference in a new Venn diagram.
With the help of Augmented Reality, we can project intelligent agents 'holographically' into a user's view of the physical surroundings. For such agents, we have coined the notion of 'Holographic Artificial Intelligences (AIs)'. They have humanoid appearance and behaviour and possess enough intelligence to adapt experiences to context, offering responsive face-to-face and environmental interaction (Huang, Wild, and Whitelock, 2021). Virtual agents are sometimes even treated as equivalent to interactive partners (Pinxteren et al., 2020). Positive interaction between users and holographic AIs, however, relies on their affective attachment and a rational assessment of whether to invest in a reciprocal relationship. Trust is a key driver, influencing this relationship. Mayer et al. (1995) propose competence, integrity, and benevolence as the essential elements to establish a model of trust.
Computer agents are social actors (Nass et al. 1994). In human-computer interaction (HCI), competence means that computer agents possess the necessary skills, abilities, and knowledge to execute and complete tasks. Integrity is defined as consistency of behaviour and promise in human-computer trust, and it allows systems to take responsibility and fulfil promises (Dobel, 1999). The evaluation of integrity is a rational process of cognitive judgement. Benevolence reflects whether agents care about users and execute tasks based on their interests (Phillip et al., 2020). Competence and integrity are objective evaluation criteria, whereas benevolence has an emotional component.
Cognitive perception is based on judgements or evaluations that make one feel confident in the other party's knowledge and ability (Borum, 2010). The cognitive aspect involves the trustors' belief, here the users' belief, that agents are trustworthy in digital environments. Thus, behaviours and competence drive cognitive trust. The affective foundation, on the other hand, is emotion. Affective trust emerges from the level of concern, caring, and benevolence (Borum, 2010).
Figure 3, from Mayer et al. (1995), depicts how the perception of values influences the nature of trust. Values, attitudes, and emotions are three trust traits in interpersonal trust (Jones and George, 1998). In terms of HCI, values are employed to shape agents' features or personalisation to anticipate user experience, and emotions influence attitudes and relationships. Trust involves intentions, perceptions, and beliefs. It relies on belief, and belief, behaviour, intention, and attitude in turn affect the degree of trust (Kulms and Kopp, 2018). Besides, trust reflects a state of mind, a willingness and predisposition, formed by an association of cognitive- and affective-based behaviours (Sousa, Lamas, and Dias, 2014, see Figure 4). Attitudes are emotional judgements of belief that are influenced by the accuracy of information and personal experiences, leading to the adoption of intentions and producing corresponding behaviours (Lee et al., 2013). Similarly, trust in HCI can be seen as an attitude that intelligent agents are willing to help achieve intentions (Kulms and Kopp, 2018). Although trust towards automation can be defined as an attitude that computer agents help achieve goals in uncertain contexts (Lee and See, 2004), there is little evidence that this definition can be used for humanoid holographic AIs. With this research project, we intend to investigate exactly this: what is trust towards a holographic AI, and how can we measure it using a questionnaire to quantify the sense of trust?
The work we present here focuses on identifying the relevant constructs that predict trust towards a holographic AI. For simplicity, we conducted this work using the traditional method of Likert scaling, resulting in the validation of an 11-item scale. Moving forward, however, we are keen to further investigate how the items of this trust model relate to each other and how they can be supplemented with other, more innovative methods for quantifying user response.
The rest of this paper is structured as follows. The second section deals with related work. Section 3 elaborates on the methodology, introducing a new metric scale for measuring trust towards holographic AIs. Section 5 presents the discussion and methods to measure values of trust. Finally, conclusions and future work are presented in Section 6.

RELATED WORK
Gupta et al. (2019) investigate the sense of trust in virtual reality based on cognitive load levels, physiological sensor data (such as galvanic skin response and heart-rate variability), and a system as well as a subjective mental effort questionnaire. Although this approach combines multiple methods and equipment, the processes are complex and difficult to implement.
Interviews, competitive research, and surveys are often employed to measure human feelings, but it is hard to transform emotional values into numbers.
Likert scales are a psychometric instrument used to measure human attitudes and are commonly used in questionnaires. Likert scales apply a trust scale (Borum, 2010) which is used to assign subjective and abstract statements to semantics, translating attitudes, opinions, and feelings to a rated value set in the form of a disagree-agree response scale. The format of five-level items is: 1. Strongly disagree; 2. Disagree; 3. Undecided; 4. Agree; 5. Strongly agree (Pimentel, 2010).
Likert scales have two parts: statements and a response scale. Each statement should be simple, short, and clear (Bryman, 2012). Five-point scales, for example, allow participants to express how much they agree with each statement, yielding explicit differences and interval values while limiting bias. The responses are equated with corresponding integers, for instance, strongly unfavourable = 1; unfavourable = 2; undecided = 3; favourable = 4; strongly favourable = 5. A high level of agreement with a positive item is endorsed by participants with positive attitudes, and items expressing negative perceptions are reverse scored (Kocaballi, Laranjo, and Coiera, 2019). An undecided or neutral option can help avoid bias.
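The integer coding and reverse scoring described here can be illustrated with a minimal Python sketch. The function names, item identifiers, and answers are hypothetical and serve illustration only; they are not taken from any of the cited instruments.

```python
# Five-level response format; the labels follow the scale described in the text.
SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "undecided": 3,
    "agree": 4,
    "strongly agree": 5,
}


def score_response(label, reverse=False):
    """Map a response label to its integer value, optionally reverse scored."""
    value = SCALE[label.strip().lower()]
    if reverse:
        # Mirror the value around the scale midpoint: 1 <-> 5, 2 <-> 4, 3 stays.
        return min(SCALE.values()) + max(SCALE.values()) - value
    return value


def score_questionnaire(responses, reversed_items=()):
    """Sum the item scores of one participant.

    `responses` maps a (hypothetical) item id to the chosen label;
    `reversed_items` lists the ids of negatively worded items.
    """
    return sum(
        score_response(label, reverse=item in reversed_items)
        for item, label in responses.items()
    )


# Made-up answers for illustration only.
answers = {"q1": "agree", "q2": "strongly agree", "q3": "disagree"}
total = score_questionnaire(answers, reversed_items={"q3"})
print(total)
```

In this sketch, "disagree" on the negatively worded item q3 counts as 4 rather than 2, so that a higher sum score consistently indicates a more favourable attitude.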
For example, in order to analyse the perceptions of interactions with computer assistants, as well as to investigate the impressions and experiences of the comprehensibility of the computer agent's communication (including trust), Hanna and Richards (2019) applied Likert scaling to survey degrees of satisfaction with a virtual assistant's verbal conversation. They recruited 73 undergraduate students and tracked the behaviours of all participants, such as inputs and keystrokes. This was done to investigate the relationship between trust, promises, and performance.
Additionally, Kulms and Kopp (2018) use Likert scales to measure the warmth of computers and how it influences trust. However, they replace the agree-disagree response format with ratings of the attributes trustworthy, good, truthful, well-intentioned, unbiased, and honest. As a self-reporting measure of trust, Likert scales require enough data and a sufficient number of participants to support evidence and to limit the influence of extreme cases with biased attitudes. A lack of sufficient statements with accurate wording for the specific context (e.g., synonyms that carry different meanings) can prevent users from making the right choices. Observing users' reactions, decisions, and behaviours can serve as an indirect measure for understanding their expectations (Pimentel, 2010).
On the other hand, Thayer proposes a two-dimensional mood model in which valence (happy to unhappy) and degree of arousal, plotted on the X and Y axes, describe mood. A multidimensional view of emotion is considered more reliable than a single-dimensional view (Lang 1980). The Self-Assessment Manikin (SAM) is a subjective assessment of emotions based on the valence-arousal model that applies graphics to judge the pleasure, arousal, and dominance associated with affective reactions (Bradley and Lang 1994). To indicate valence, the images of SAM range from happy (smiling face) to sad (frowning face); to indicate arousal, they range from excited (wide-eyed image) to calm (sleepy image). A small picture of a man means minimum control, while a large one shows maximum dominance. SAM can serve as a reference for user experience in order to explore the sense of trust.

METHODOLOGY: A NEW SCALE FOR 'TRUST'
There is no validated Likert scale for evaluating trust towards holographic AIs that we consider applicable. We therefore set out to construct a new scale, using Likert scaling (Trochim, 2021) as methodology. First, we pool items extracted from literature and extended through brainstorming among the researchers. Subsequently, we use a panel of 15 judges to rate the pool of items from 'strongly unfavourable to the concept' to 'strongly favourable to the concept'. Item selection is then pursued by dropping statements that do not correlate well with the sum scores of all statements. Moreover, via item-item correlations, we can further reduce the number of items by eliminating closely related items. We discard items with lower correlation values (e.g., keeping only those above 0.6 or 0.7) to achieve an administrable scale that forms a group of optimal statements (Roberts, Laughlin, and Wedell, 1999). Finally, a t-test between top and bottom quarter answers serves to identify those items most suited to polarise answers.
We identified 104 statements to explore the concept of trust towards holographic AIs, extracted from literature and further extended through brainstorming by the authors. To cut down the selection of items to a reasonable number that can be administered well, we used a pool of judges to explore which of these statements were loading on trust (via the correlation with the sum score) and were polarising.
We eliminated the items whose correlation with the sum scores was less than 0.6, retaining 22 statements (see Figure 5). This removes constructs from the pool that do not load directly onto the general direction of all statements, using that direction as a proxy for the core concept of trust.
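As an illustration of this selection step, the following minimal Python sketch correlates each item with the sum score over all items and keeps those at or above the threshold of 0.6. It is a sketch under assumptions, not our analysis script: the function names and the toy ratings are hypothetical, and the sum score here includes the item itself (an uncorrected item-total correlation).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant item carries no information about the sum score
    return cov / (sd_x * sd_y)


def select_items(ratings, threshold=0.6):
    """Keep items whose correlation with the judges' sum scores is >= threshold.

    `ratings` maps an item id to a list of ratings, one per judge,
    with judges in the same order for every item.
    """
    n_judges = len(next(iter(ratings.values())))
    sum_scores = [sum(r[j] for r in ratings.values()) for j in range(n_judges)]
    return {
        item: r for item, r in ratings.items()
        if pearson(r, sum_scores) >= threshold
    }


# Toy ratings from five hypothetical judges (illustration only).
ratings = {
    "item_a": [1, 2, 3, 4, 5],
    "item_b": [2, 2, 3, 5, 5],
    "item_c": [3, 3, 3, 2, 3],
}
kept = select_items(ratings)
print(sorted(kept))
```

In the toy data, item_c runs against the general direction of the other statements and is therefore dropped.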
Next, we calculated the t-test for the mean value of ratings given by the top quarter judges (judges assigning the highest scores) and tested them against the mean values given by the bottom quarter judges (judges assigning the lowest scores), checking whether their t-values indicated strong polarisation. The higher the t-value, the bigger the difference in judgement between the judges assigning top scores and those assigning the lowest scores. This means that items with high t-values are more discriminating, i.e., better able to separate high from low scorers. The literature recommends relying on one's own judgement in selecting the right trade-off between discriminance and number of items. Based on our analysis, we chose a t-value of over 5.5, resulting in 11 question items (see Figure 6).
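The polarisation check can be sketched in a few lines of Python. This is an illustrative sketch only: the t-test variant is not specified above, so Welch's unequal-variance t statistic is assumed here, and the function names and toy ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Both groups are constant: infinitely sharp or no separation at all.
        return float("inf") if mean(a) != mean(b) else 0.0
    return (mean(a) - mean(b)) / se


def discriminating_items(ratings, min_t=5.5):
    """Return {item: t} for items whose top-vs-bottom-quarter t-value exceeds min_t.

    `ratings` maps an item id to a list of ratings, one per judge.
    Judges are ranked by their sum score over all items.
    """
    n_judges = len(next(iter(ratings.values())))
    totals = [sum(r[j] for r in ratings.values()) for j in range(n_judges)]
    order = sorted(range(n_judges), key=lambda j: totals[j])
    quarter = max(2, n_judges // 4)  # at least two judges, so a variance exists
    bottom, top = order[:quarter], order[-quarter:]
    t_values = {
        item: welch_t([r[j] for j in top], [r[j] for j in bottom])
        for item, r in ratings.items()
    }
    return {item: t for item, t in t_values.items() if t > min_t}


# Toy ratings from eight hypothetical judges (illustration only).
ratings = {
    "sharp": [1, 1, 2, 3, 3, 4, 4, 5],
    "flat": [3, 2, 3, 3, 2, 3, 2, 3],
}
result = discriminating_items(ratings)
print(result)
```

With eight judges, each quarter holds two judges; the item "sharp" separates the two groups clearly, whereas "flat" receives nearly the same ratings from both and is not retained.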
To check whether we missed any important items, we additionally conducted a cluster analysis over the Euclidean distances of the questions, using hierarchical clustering (hclust, see R Core Team, 2021, package "stats"). We cut the resulting cluster hierarchy in the dendrogram so that questions were clustered into k=20 groups, which, from visual analysis of the cluster dendrogram, looked like a good homogeneity level for the resulting clusters.
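The analysis itself was run with hclust in R. Purely for illustration, the following Python sketch reimplements the idea with the standard library: agglomerative clustering over Euclidean distances, stopped once k clusters remain, which corresponds to cutting the dendrogram into k groups. Complete linkage is assumed because it is the default method of hclust; the linkage actually used is not stated above. The function name and the toy ratings are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def complete_linkage_clusters(vectors, k):
    """Merge the closest clusters until only k clusters remain.

    `vectors` maps a question id to its list of ratings (one per judge).
    The distance between two clusters is the largest pairwise Euclidean
    distance between their members (complete linkage, an assumption).
    """
    clusters = [[name] for name in vectors]

    def linkage(c1, c2):
        return max(dist(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy ratings of five hypothetical questions by three judges.
questions = {
    "q1": [1, 1, 2],
    "q2": [1, 2, 2],
    "q3": [5, 5, 4],
    "q4": [5, 4, 4],
    "q5": [3, 3, 3],
}
clusters = complete_linkage_clusters(questions, k=3)
print(clusters)
```

This naive version recomputes all distances in every step, which is acceptable for a pool of about a hundred items but would be slow for much larger data sets.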
Subsequently, we manually analysed all resulting groups to systematically double-check whether we had missed particular groups of items. There were several groups that were not retained and looked interesting at first, but when investigating the correlation of the items contained in these groups with the sum scores, it turned out that in all cases the correlation was too low (below the threshold of 0.6). This clearly indicates that these questions ask about something else, not (or 'not only') about 'trust' (see Table 1). The order of items presented in Table 1 has no priority.
We arrived at the conclusion that indeed the statistically motivated selection is the best selection and upon inspection of groupings in the 'raw' statements, we can clearly see that other groups measure (also) something else, not just trust.
We pre-tested the questionnaire with the help of a student in a small group (5 participants) (Watts, 2020). All questions are positive statements to ensure uniform polarity, and the questionnaire, with the response options strongly agree, agree, undecided, disagree, and strongly disagree, can provide evidence of user experience regarding holographic AIs.
Besides, the feasibility of the questionnaire has been verified. We employed it to validate a holographic assistant and to gain insight into interactions by testing the sense of trust. The questionnaire makes it possible to clearly understand and analyse users' intuitive feelings and opinions. The final selection retained provides a set of distinct items, all of which contribute to exploring critical elements of trust and explaining whether holographic AIs are trustworthy or not. In line with earlier models published (see introduction), these items can be grouped along constructs as follows.

DISCUSSION
Statements #3 and #4 (see Table 1) can be interpreted to reflect the aspect of integrity, in agreement with Mayer et al.'s proposal (1995).
Competence is instrumental to trust. In the words of McLeod (2020), trust requires that we "rely on others to be competent". Competence is a potential for human action (Wild, 2016) and, when put into practice, turns into performance. Conversely, the visible demonstration of performance helps establish trust.
Competence is typically defined to subsume skills, knowledge, and abilities (Hager and Gonczi, 2009), and competent agents require a certain degree of domain-specific expertise. Chatbots, for instance, call for communication skills. For intelligent tutors, pedagogical expertise is key. The agent's capabilities and their competent application are preconditions to their success. Users consider whether the agents can successfully achieve their goals and whether this matches their expectations. The user's optimism about the agent's competence (and its subordinate concepts) is assessed through statements #1 and #2.
"The hologram feels real to me" (#7) involves empathy. Empathy is the ability to resonate with others' emotions. Holographic AIs can be empathic by picking up on affective input of the user and then performing corresponding emotive expressions or other matching behaviour. Compassion casts the net a bit wider and has two components, i.e., empathy and the motivation to help (Singer and Klimecki, 2014; Gilbert, 2014). In terms of HCI, compassionate agents not only respond to users' feelings but are also able to remediate problematic states (Ray, 2018). Compassion refers to both the motivation to help and actual helping in response to the sensed needs and feelings of others, expressed in statements #8 and #9.
Tseng and Fogg (1999) define trust as "a positive belief about the perceived reliability of, dependability of, and confidence in a person, object, or process". In our model, trust is an attitude that users have towards holographic AIs based on their belief about whether the agents help achieve goals with good intentions and positive behaviour, ultimately aimed at building a positive relationship with the user. Mayer et al. (1995) proposed further that interpersonal trust is moderated by a propensity to trust, i.e., the willingness to be vulnerable and accept risk based on expectations regarding another person's actions.
We propose that trust is the attitude resulting from the belief assumptions listed above, which then, following Sousa et al. (2014), results in the predisposition to engage, an intention, which ultimately drives engagement behaviour. A measurable result of trust is the establishment of a relationship, as assessed by #11 (see Figure 7).
This could form the basis for a testable structural equation model with trust being a latent variable predicted from the observable ones.
For some of these items, there may also be a more direct approach as an alternative to quantifying the users' response from Likert scale items. This relates particularly to elements in 'compassion', such as the affective state of the user:
• Sentiment analysis over the dialogue transcripts could provide accurate classifiers detecting affect expressed by the user (and by the AI).
• Facial expressions could provide further insight.
• Prosody of speech can be used to pick up on intonation, as a proxy for affect.
• EEGs are now routinely used to gauge affective state / facial expression of the user in laboratory settings. The imaging technology could offer a useful comparison of how the inner, neurological state relates to the more conscious self-assessment expressed in the answers to the questionnaire items.
• Self-assessment manikins (SAMs) can be used as a more graphical replacement, especially useful when dealing with underage participants.
Relationships can be investigated with the help of social network analysis. There are proposals for proxies of engagement in the literature looking at interaction data in the speech dialogue (e.g., interanimation score, see Rebedea et al., 2010).
Competence could be further assessed using a Turing test (with a Wizard-of-Oz control group agent), providing insights into how the user interface elements influence perception of competence.

CONCLUSION AND FUTURE WORK
In this paper, we reported on the construction of a new metric scale for conceptualising (also quantitatively) an extended model of trust towards holographic AIs, using Likert scaling as development methodology. We identified and refined 104 statements, and then retained eleven of them as a comprehensive model for predicting the level of trust people develop towards holographic AIs.
Trust of holographic AIs is different from interpersonal trust, but also from e-trust towards non-anthropomorphic technology (such as banking apps). Trust is an attitude that such agents help achieve goals, as well as devote themselves to building a positive and interactive relationship with their users. Competence, integrity, benevolence, commitment, and empathy are all key dimensions in this model of trust.
The model provides a questionnaire with Likert scales that we can apply to measure the degree of trust.
In the future, we intend to further investigate whether the questionnaire items of the model can be replaced with other valid measurement methods, such as direct observation. We would also like to investigate how the questionnaire approach could be emulated with other modalities. For example, we would like to analyse what would change if the questions were asked by the holographic agent directly ("Do you find me competent?", "Do I feel real to you?", etc.) and whether that would trigger the same response.
Furthermore, we plan to use self-assessment manikins and a SWOT analysis (focusing on strengths, weaknesses, opportunities, and threats) to explore user experience and the nature of the relationship between human and agent. Marinaccio et al. (2015) and Kim and Song (2021) investigated how to recover trust from error. Building challenging situations into our planned pilots, in order to derail the relation under stress, could therefore be a good idea. Moreover, inappropriate reliance (misuse) and disuse of computer agents may cause distrust (Lee and See, 2004), which we could use to create contrasting behaviour, able to shed more light on the development of trust and how we can influence it.

Table 1: Final metric scale for measuring 'trust'