Robot public speakers’ effect on audience affective reaction and attention allocation

Social robots delivering public speeches have a wide range of practical applications as stand-ins for educators, experts, or entertainers. The goal of our work is to investigate how a social robot should be programmed to deliver an effective public speech. Applying a mixed methods research design to collect quantitative and qualitative data, we conducted a study that compares a human speaker with a semi-anthropomorphic social robot speaker (the SoftBank Pepper robot). The robot was programmed to mimic the behaviour patterns of the human delivering the same speech. The study results show that the robot is perceived as intelligent and rational, which contributes to effective delivery of the message content. However, the robot struggles with actively engaging the audience and with establishing an emotional connection. In addition, the behavioural patterns that appear natural in the human speaker tend to be distracting in the robot. We discuss suggestions for the improved design of robot public speakers, which include implementing rhetoric skills, exploiting and synchronising the robot’s specific communication channels, and creating a robot persona.


INTRODUCTION
Public speaking is a form of communication in which a single speaker addresses an audience to inform, persuade, or entertain [Trenholm 2017]. The speaker is usually physically distant from the audience and therefore must speak clearly and use gestures and visual aids to be seen and heard. Generally, we would expect a human speaker to deliver the message. But what about a social robot delivering a public speech or a lecture? Could a social robot deliver a message as well as a human speaker? A robot public speaker could be useful in a variety of circumstances. Robots could represent speakers or teachers around the world, increasing the reach of the message. The "human speaker" would not have to travel and could deliver a speech at the same time in different locations. In educational environments, a robot speaker could deliver lecture content in a manner similar to a human lecturer, freeing up teacher resources for individual student support.
Robots certainly still lack the social skills necessary to deliver an effective speech. Restrictions in movement and emotional expressiveness limit the range of expressions a robot can exhibit. On the other hand, because of their enduring novelty and appeal, robots can attract large crowds wanting to listen to them. Building on the fascination many have with robots, we believe that robot public speakers can be ambassadors for a good purpose and deliver important messages to a wider audience.
The goal of our research is to investigate how a robot public speaker should be programmed to deliver an effective message by comparing the robot speaker to a skilled human speaker. Although we expect the robot speaker to under-perform compared to the human speaker at this stage, we believe that such a comparison is useful to identify the weak as well as the strong points of the robot speaker. We conducted a study that collected quantitative as well as qualitative data to inform suggestions for the improvement of the robot speaker. The results of this study will help develop robotic speakers that can deliver engaging, entertaining, and compelling messages to a wider audience, bringing us one step closer to a realistic future with stand-in robotic speakers. The study was conducted with social distancing measures in place, so participants rated videos of the robot and human speech in an online experiment.

Delivering an effective message
How to successfully deliver a message has been studied for many centuries. In his studies of rhetoric, the ancient Greek philosopher Aristotle already described how to successfully influence an audience using three methods: ethos, pathos, and logos. Ethos is about establishing personal credentials, persuading the audience to trust the speaker and to acknowledge the speaker's competence; pathos is about the ability to evoke an emotional response and get the audience to feel and be emotionally involved; and logos is about making reasonable, logical arguments to get the audience thinking.
Modern communication theory has taken a slightly different approach to communication skills and to what makes a good speech. In the education domain, teachers' communication skills, which are identified as one of the most effective contributors to students' achievement, have been examined under five dimensions: empathy, transparency, equality, effectiveness, and competence [Gulec and Leylek 2018]. They have also been described as involving verbal, nonverbal and para-verbal components [Muste 2016]. In [Bambaeeroo and Shokrpour 2017], the impact of teachers' non-verbal communication is emphasised.
Overall, modern approaches agree with ancient philosophers that credibility (e.g., trustworthiness), identification (e.g., perceived connection between the speaker and the listener), and appeal (e.g., emotional appeal) are important factors in effective delivery of a message [Hovland et al. 1953] [Higgins and Walker 2012]. We have therefore based our evaluation of the effective delivery of a speech on these three aspects, naming them for simplicity after the Greek rhetoric doctrine of ethos, pathos, and logos.

Social robots in public settings
Social robots can be an effective medium for instruction and communication with users in a public setting. Robots show promise as information assistants and store clerks, among others. Numerous studies have used robots as tour guides in museums and exhibitions [Matsumoto et al. 2020; Wang and Christensen 2018; You and Lin 2019]. Many of these studies focus on how robots can initiate interactions with visitors [Iio et al. 2020] [Rashed et al. 2015] or convey information [Velentza et al. 2019] [Velentza et al. 2020]. A meta-review on robots in public spaces [Mubin et al. 2018] showed that educational scenarios for public speaking robots are the most popular applications and that most of these focus on providing information. In such positions, robots will be expected to be experts in their field. However, expertise alone is not enough; effectively communicating expertise is also necessary to build trust and ensure compliance with the information they provide.
The presence of a physical robot in the same location as the learner enables co-location and can support collaboration between the robot and the learner. The most common form of robot tutoring is one-to-one tutoring, where a robot teaches a learner individually, e.g., [Ramachandran et al. 2019] [Baxter et al. 2017], and only a few studies have examined robots teaching a group of learners, e.g., [Huang and Hoorn 2018] [Edwards et al. 2016]. A review of social robots as tutors [Belpaeme et al. 2018] showed that robots perform well when the educational task is limited, approximating the performance of human tutors in cognitive tasks. However, it remains unclear what characteristics of a robot tutor may contribute to learning success. In a study by Striepe et al. [Striepe et al. 2021], an emotional and a neutral robot produced comparable effects on participants' sense of immersion in the content and on their affective responses during a storytelling task. Konijn and Hoorn [Konijn and Hoorn 2020] reported that underachieving students benefited more from a neutral robot than from one that exhibited social behaviour. In contrast to previous research, our study does not focus on learning outcomes but rather on message delivery as one aspect that might be important for a successful robot speaker and robot teacher alike.

Research question
Previous studies have demonstrated that robots can effectively be used for storytelling [Striepe et al. 2021], teaching [Saerbeck et al. 2010] [Brown and Howard 2013], social interaction, rehabilitation and companionship [Uluer et al. 2020], but very few studies have compared humans and robots in these roles. The few studies that have compared human and robot speakers focused on the audience's comprehension and retention of the information delivered by the speaker [Li et al. 2016] [Palanica et al. 2019]. For our study, we created and filmed a speech delivered by a robot that mimicked the gestures and motions of a human speaker but fell short of being completely human-like, as it did not display facial expressions and spoke in a monotonous voice. The choices we made when creating the robot's behaviours were driven by two main considerations: (1) dealing with the mechanical constraints of the robot, and (2) avoiding the uncanny valley effect [Seyama and Nagayama 2007]. The main research question was: Can a social robot that mimics the gestures and motions of a skilled human speaker deliver a speech that triggers a desirable affective response in the audience?
The main aim of this research was to uncover the strengths and weaknesses of a social robot when given the role of a public speaker or teacher, and to devise robot-specific speech delivery strategies.

Participants
Twenty-eight university students (14 males, 14 females) participated voluntarily in this study. Participants were on average 25 years old (M = 24.61; SD = 7.51). Approval was granted by the Ethics Committee of the University. Informed consent was obtained from all individual participants.

Material
We programmed the social robot Pepper [Robotics 2021] to deliver a short speech in its robotic voice while mimicking the behaviour of a human speaker delivering the same speech (gestures, pace, and body movements). The Choregraphe software used to program Pepper has a built-in virtual robot that can be used for the design and testing of the robot's movements. The movements are composed from motion boxes, and when several boxes are chained from the input to the output box, the sequence of the robot's motion and behaviours is complete and can be run using a real robot.
A five-and-a-half-minute TED talk by Moriba Jah about "the world's first crowd-sourced space traffic monitoring system" was chosen [Jah 2019]. Choosing a TED talk implied that the speaker was likely to demonstrate high public speaking skills, which set the bar high for the robot but also offered an opportunity to instil realistic public speaking behaviours in Pepper by mimicry.
The two speakers, human and robot, had in common the content of the speech (same words), the pace of delivery, and the body gestures and movements, including arm and hand gestures and torso and neck orientations (within the mechanical constraints of the robot). The differences lay in their appearance (see Fig. 1), their voice and the expressivity of their face.

Questionnaire (see Appendix)
The videos were embedded in a questionnaire, and the order of the videos was counterbalanced across participants. After watching each video, participants responded to questions about their affective reaction and attention allocation.
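The paper does not describe how participants were assigned to the two video orders. As a minimal sketch, assuming a simple alternating assignment (the function name and the participant numbering are illustrative only), counterbalancing could look like this:

```python
def counterbalance(participant_ids,
                   orders=(("human", "robot"), ("robot", "human"))):
    """Alternate the presentation orders so each one is used equally often."""
    return {pid: orders[i % len(orders)]
            for i, pid in enumerate(participant_ids)}


# 28 participants, as in the study; the IDs 1..28 are hypothetical.
assignment = counterbalance(range(1, 29))
human_first = sum(1 for order in assignment.values() if order[0] == "human")
```

With an even number of participants, this gives each order to exactly half of the sample, so that order effects are balanced between the two speaker conditions.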

Self-reported affective reaction
Participants' affective reactions were measured on three dimensions: unpleasant/pleasant (i.e., valence), calm/excited (i.e., arousal), and tired/awake, using 7-point Likert scales. Questions were based on the valence-arousal model of emotions [Watson and Tellegen 1985]. The dimension tired/awake was added based on a previous study by McAdams et al. [McAdams et al. 2017].

Perception of the speech and attention distribution
Participants rated ten statements on 7-point Likert scales regarding the perceived logos (3 items), ethos (4 items) and pathos (3 items) of the speeches. Two questions regarded attention distribution and distraction. In these questions, participants ranked to what extent different features of the presenter and the background (e.g., face, hand and arm gestures, body torso and chest, motion of the entire body, background) attracted their attention or were seen as distracting.
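The text does not state how the ten items were combined into the three dimensions. A minimal sketch, assuming each dimension score is the plain mean of its items (the short item keys are hypothetical labels for the statements listed in the Appendix):

```python
from statistics import mean

# Item-to-dimension mapping; the keys are hypothetical shorthand
# for the ten statements in Appendix A.2.
DIMENSIONS = {
    "pathos": ["imagination", "fun", "engaging"],
    "ethos": ["identify", "competent", "knowledgeable", "trustworthy"],
    "logos": ["smart", "rational", "critical"],
}


def dimension_scores(ratings):
    """Average the 1-7 ratings of the items belonging to each dimension."""
    return {dim: mean(ratings[item] for item in items)
            for dim, items in DIMENSIONS.items()}


# Made-up ratings from one participant for one speaker (not study data).
example = {
    "imagination": 3, "fun": 2, "engaging": 4,
    "identify": 4, "competent": 5, "knowledgeable": 5, "trustworthy": 6,
    "smart": 6, "rational": 7, "critical": 5,
}
scores = dimension_scores(example)
```

Averaging rather than summing keeps the three scores on the same 1 to 7 scale even though ethos has four items and the other dimensions have three.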

Self-reported impressions
Additionally, participants were asked to provide qualitative feedback in the form of a short essay describing their impressions of the speakers.

Evaluation of affective reactions
The robot speaker was rated as giving a neutral impression (valence), a feeling of calm (arousal dimension), and a feeling of tiredness (awake dimension). The human speaker created a positive impression, triggering arousal and a feeling of being awake (see Fig. 2). In comparison, the human speaker evoked a more positive affect (valence) (t(27) = −5.41, p < .001), and was rated as being more exciting (arousal) (t(27) […]).
The robot speaker was perceived as being smart, critical, and rational (logos), was evaluated as neutral in terms of identifiability, competence and trustworthiness (ethos), and received the lowest ratings for being engaging and emotionally stimulating (pathos) (see Fig. 3). The human speaker received high scores across all three dimensions: it was rated as able to engage the audience emotionally and stimulate imagination (pathos), as trustworthy and competent (ethos), and as rational, smart and critical (logos). Compared to the robot, the human speaker received higher ratings across all dimensions, i.e., for pathos (t(27) […]).
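The test behind these comparisons is not named in the text. The 27 degrees of freedom reported with 28 participants are consistent with a paired-samples t-test, in which each participant's rating of the robot is compared with the same participant's rating of the human. Under that assumption, the statistic can be sketched as follows (the ratings are made-up illustrations, not study data):

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t, df) for a paired-samples t-test on two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical valence ratings from five participants.
robot = [4, 3, 5, 4, 2]
human = [6, 5, 6, 6, 5]
t_value, df = paired_t(robot, human)
```

A negative t here means the robot was rated lower than the human, which matches the sign convention of the value reported for valence.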

Attention allocation
Participants ranked five different areas of the speaker and the video according to how much attention they paid to each area, as a previous study based on eye tracking [Bourguet et al. 2020] has shown that viewers are able to accurately report on their visual attention distribution when viewing a video.
For the robot speaker, participants allocated most of their attention to the hands (Mdn = 2.00), body (Mdn = 2.00), face (Mdn = 3.00) and torso (Mdn = 3.00), and least attention was given to the background (Mdn = 5.00). For the human speaker, most attention was given to the face (Mdn = 1.00), followed by the hands (Mdn = 2.00), body motion (Mdn = 3.00), and torso (Mdn = 4.00), and least attention was given to the background (Mdn = 5.00).
In a similar way, we asked participants which areas of the speaker and the video they found most distracting. Interestingly, the areas that were regarded as attracting attention in the video of the robot speaker were also evaluated as being distracting: hand (Mdn = 2.00) and body motion (Mdn = 2.00), torso (Mdn = 3.00), face (Mdn = 4.00), and background (Mdn = 5.00). For example, the robot's hands and body motion were regarded as attracting attention but were also seen as causing distraction. In contrast, for the human speaker, the areas causing distraction differed from the areas attracting attention, with the background causing the most distraction (Mdn = 1.50), followed by body motion (Mdn = 2.50), hands (Mdn = 3.00) and face (Mdn = 3.00), and the torso causing the least (Mdn = 4.00). As indicated by the data, the face of the human speaker attracted the most attention but was ranked only 3rd in being distracting.
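The summary statistic behind the reported rank values is not stated explicitly. Half-step values such as 1.50 and 2.50 suggest median ranks, so the sketch below assumes medians and uses made-up rankings from four participants (1 = most attention):

```python
from statistics import median

AREAS = ["face", "hands", "torso", "body", "background"]


def median_ranks(rankings):
    """rankings: one dict per participant, mapping area -> rank (1..5)."""
    return {area: median(r[area] for r in rankings) for area in AREAS}


# Hypothetical rankings, not study data.
rankings = [
    {"face": 1, "hands": 2, "body": 3, "torso": 4, "background": 5},
    {"face": 1, "hands": 2, "body": 3, "torso": 4, "background": 5},
    {"face": 2, "hands": 1, "body": 3, "torso": 4, "background": 5},
    {"face": 1, "hands": 3, "body": 2, "torso": 4, "background": 5},
]
result = median_ranks(rankings)
```

The median suits ranked data because it relies only on the order of the ranks, not on the distances between them.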

Self-reported impressions
Twenty out of twenty-eight participants provided detailed feedback in the form of a short essay about their impressions of the speakers. The essays were qualitatively analysed aiming to explore significant themes mentioned by the participants. They reflected especially upon four areas: 1) The voice of the robot; 2) Gestures; 3) Motions; and 4) Facial expressions.

Voice
Ten participants mentioned that the voice of the robot lacked intonation and was monotonous. The lack of intonation in the voice was associated with difficulty understanding the important points in the speech (n = 2), and with making the speech less engaging (n = 3). This might also be related to difficulties in staying focused and to needing more energy to understand what the robot had said (n = 2). Furthermore, eight participants mentioned a lack of emotionality in the robot's speech. However, one participant mentioned that the lack of emotion might be an advantage, because the speech sounded more scientific and trustworthy. Positive aspects of the robot speaker were that it conveyed information well (n = 3), had a pleasant appearance (n = 1), and created a good first impression (n = 2).

Gestures
The gestures of the robot speaker were modelled on those of the human speaker. However, participants specifically mentioned that the gestures were distracting (n = 5) and unnatural (n = 6). From the listeners' perspectives, the robot's gestures appeared to be incongruent (n = 2) and did not seem to support the speech (n = 4). In this regard, participants mentioned that gestures "did not provide additional information", "seemed meaningless", "did not change with the speech content", and "were hard to understand".

Motions
Similarly, participants reported that the whole body motion of the robot was distracting (n = 4). The sound coming from the motors of the robot when moving around seemed to have especially contributed to the feeling of distraction. Furthermore, participants felt that the motions were unnatural (n = 9). This impression seemed to come from unsynchronised motions (n = 1), and from the robot's body facing forward while moving sideways (n = 1).

Facial expressions
Six participants mentioned the lack of facial expressions of the robot speaker. The lack of facial expressions was said to make the speech less engaging (n = 3), to give the impression of an emotionless speaker (n = 1), and to make the speech boring (n = 1).
Participants also provided some suggestions on how the robot speaker could be improved. While seven participants suggested that the robot should be more human-like, especially its voice (n = 3) or movements (n = 1), two participants preferred a robot that was less human-like. Four participants suggested that the robot should interact and engage more with the audience by maintaining eye contact, asking questions, or adding humour. Furthermore, it was suggested that the robot speaker would improve by having facial expressions and showing emotions.

DISCUSSION
The goal of our study was to investigate how a robot public speaker should be programmed to deliver an effective message by comparing it to a skilled human speaker. The robot speaker showed potential in being perceived as smart and rational (important factors in delivering the content of a speech) but had difficulty engaging the audience because of its lack of emotional interaction. This was most evident in the measured affective responses on the dimensions of valence, arousal, and wakefulness, all of which trended more toward neutral. In other words, the robot did not elicit negative affective responses, but it failed to evoke the positive affective responses necessary to spark interest and engage the audience in meaningful ways. Although the robot's body movements and gestures mimicked those of the human speaker, they were perceived as distracting in the robot, and were criticised as unnatural and incoherent.
The audience's comments on the incoherence of the robot's behaviour are interesting and have potential implications on the design of a robot speaker. Significantly, the audience's attention allocation is very different for the robot and for the human speaker. In the human speaker condition, the audience allocates most of its attention to the face, stressing the importance of facial expressions and possibly eye and lip movements when giving a speech (one participant felt disturbed by the fact that the robot's mouth wasn't moving). In the robot speaker condition, the audience allocates most of its attention to the most mobile parts of the robot, i.e., its hands and body, which, in the absence of facial expressions, eye and lip movements, become the main nonverbal channels that participants rely on for a better reception of the speech. However, the robot's gestures and motion alone failed to create adequate emotional interaction and affective response in the audience, resulting in the low ratings of the robot's pathos. Furthermore, these gestures, when performed by the robot are judged distracting and unhelpful, which is not the case for the human speaker.
A speaker's behaviour is fundamentally multimodal and is globally perceived as the result of combining various communication channels (e.g., facial expressions, voice, intonation, gestures, motion, gaze). Designing a robot speaker by mimicking some of the behaviours of a human speaker is not enough to create an effective robot speech performance. It is in fact counterproductive to strive for human-likeness on one of the communication channels when the other channels are perceived as non-human-like. Instead, completely new sets of multimodal behaviours should be devised, taking advantage of robot-specific communication channels (e.g., availability of screens, lights, and sound effects) alongside human-inspired gestures, postures, and facial expressions when available. These robot-specific multimodal behaviours are more likely to meet the audience's expectations and create an effective speech delivery.
This study had to be conducted online, but ideally, we wanted to compare the video of a human speaker with a robot giving a speech in front of a larger audience. We expect that a robot "in person" can create a stronger effect on the audience than just watching a video of it. A further limitation of the study is that we relied on participants' self-evaluation for affective reactions. Adding physiological measures (e.g., eye movements, posture) as done in a study by Bourguet et al. [Bourguet et al. 2020] would increase the reliability of the measurements.

IMPLICATIONS
Effective human public speaking skills are still too complex to be imitated by social robots. They include not only appropriate use of gestures and precise control and inflection of the voice, but also choice of vocabulary and register, use of humour and enthusiasm, the ability to develop a good rapport with the audience and effective use of questions [Abella and Cutamora 2019]. Given the complexity of the task, a better approach to the design of a robot speaker should take advantage of the robot's specificity. Its behaviour should be designed to reinforce its existing strengths rather than undermine them. We recommend the following four areas for improving a robot speaker: (1) implementing rhetoric skills; (2) exploiting robot-specific communication channels; (3) synchronising communication channels; and (4) creating a robot persona.
For a human speaker, speech preparation and training often involve a considerable amount of effort.
Considering the limited emotional expressiveness of a robot and the problems associated with the uncanniness of overly anthropomorphic and emotional robots, a robot's public speech must first of all effectively use language. This can be achieved by telling a compelling story, using humour, and emphasising the core message. In addition, greater audience engagement can be achieved through rhetorical questions and by delivering a message from the robot's perspective.
Robots should take advantage of their unique communication capabilities that come from their appearance, built-in sensors, and/or unique output channels. For example, screens, projections, specific body movements, and sounds can accompany the speech to increase comprehension and improve entertainment value. Nonverbal communication is an important part of every speech and robots could use unique nonverbal codes such as olfactory displays or light and colours.
Creating redundancy by using multiple synchronised communication channels is also important. A human's speech is a synchronised display of voice, gestures, facial expressions, and body movements, and humans are sensitive to conflicting messages being sent through different communication channels [Trenholm 2017]. Therefore, it is important to synchronise the messages sent by a robot's multimodal channels to avoid uncanny expression (e.g., synchronisation of speech content, gestures, body movements, and voice).
Creating a robot that displays specific personality characteristics could increase trust, perceived competence and identification with the speaker, thereby increasing its perceived ethos. Depending on the purpose of the speech or the audience, the robot can be an expert, an outsider bringing in a new perspective, or the person from next door. It would be very interesting to see what kind of persona is the most successful in delivering a specific message to a specific audience. The goal would be to design a set of behaviours that reinforce the robot's public identity (ethos) while matching the topic of the speech and the audience it is delivered to.

ROBOT EMBODIMENT
Finally, the question remains of how a robot's embodiment affects its ability to deliver an effective speech. Pepper is a human-sized semi-anthropomorphic social robot, friendly looking and approachable, which has already been used in many different social settings. It can perform a wide range of arm, hand, and body movements, although it cannot display facial expressions. But what effect would a completely non-anthropomorphic robot speaker have on its audience?
We have decided to replicate the Pepper study using a different robot, which is non-anthropomorphic in its shape but can display facial expressions. We chose the Anki Vector robot (see Fig. 4), with the expectation that well-timed facial expressions by Vector will add the social cues that have been missing in Pepper. Facial expressions are a powerful tool to convey emotions, and our study has shown that participants paid close attention to the face of the human speaker.

Figure 4: The non-anthropomorphic Anki Vector robot
By displaying facial expressions, we expect that Vector will be better able than Pepper to establish an emotional connection to the audience. However, programming a non-anthropomorphic robot to mimic a human speaker poses new challenges. A systematic scheme to retarget human motions and facial expressions into the various modalities of the robot is necessary. In [venture et al], such a scheme is described, based on video analysis that extracts relevant information from a human speaker's behaviour and sends it to the robot (see Fig. 5). The robot is then controlled using this information depending on its interaction modalities and movement abilities. We have used this scheme to program Vector to deliver the same speech as Pepper and will soon conduct a new user study to compare the effect of the two robot speakers on their audiences' affective reactions and attention allocation.

CONCLUSION
We believe that employing robot speakers as stand-ins for public speakers is feasible if the robot's speech and behaviour are designed to be engaging and if the robot is able to create an emotional connection to the audience.
The Pepper study investigated the impact of a semi-anthropomorphic social robot with no facial expressions on audience affective reactions and attention allocation. It provides valuable insights into the possible strengths and limits of a robot speaker. Nonverbal communication is an important part of every speech, and robots should use unique nonverbal codes to accompany their speech. With Pepper, we found that the number and amplitude of gestures (especially the beat gestures) should be reduced to achieve a better match with the robot's perceived persona, and to avoid the unwanted distraction they created.
Future work will compare Pepper and Vector in terms of their ability to evoke the positive affective responses necessary to spark interest, motivate the audience to listen, and engage the audience in meaningful ways. In a future study, the gestures of Vector's single arm will still be inspired by human performance, using video data information extraction and retargeting. They will, however, be more limited in both frequency and amplitude to achieve a better match with the speaker itself, i.e., Vector. A new user study will be designed to find out how a "human-like" performance (Pepper's) compares with a more robot-like performance, where the Vector robot's gestures coincide with the human gestures but are very different in terms of type and amplitude. We anticipate that Vector's gestures, combined with facial expressions, will be found more natural and less distracting. It is, however, harder to anticipate if and how this behaviour will contribute to evoking positive affective responses from the audience, which is something Pepper failed to achieve despite its semi-anthropomorphic and friendly appearance.
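To illustrate what limiting gestures in both frequency and amplitude could mean in practice, the sketch below thins out a sequence of retargeted joint angles, shrinks them towards a rest pose, and clamps them to a joint range. It is not the retargeting scheme referred to above: the function, the scale factor, the thinning rate, and the joint limits are all hypothetical and do not reflect Vector's actual specifications.

```python
def attenuate_gestures(keyframes, rest_angle=0.0, amplitude_scale=0.5,
                       keep_every=2, limits=(-1.0, 1.0)):
    """Reduce gesture frequency and amplitude.

    keyframes: joint angles in radians, one per human gesture.
    Keeps every `keep_every`-th gesture, scales each kept angle towards
    `rest_angle`, then clamps it to the (hypothetical) joint `limits`.
    """
    low, high = limits
    kept = keyframes[::keep_every]                      # lower frequency
    scaled = [rest_angle + amplitude_scale * (angle - rest_angle)
              for angle in kept]                        # lower amplitude
    return [min(high, max(low, angle)) for angle in scaled]


# Made-up angles extracted from a human performance.
human_angles = [1.2, -0.4, 2.4, 0.8, -3.0]
robot_angles = attenuate_gestures(human_angles)
```

Because the kept gestures retain their original positions in the sequence, they still coincide with the human gestures in time while differing in amplitude.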

ACKNOWLEDGMENTS
We thank all the volunteers who participated in our study.

A QUESTIONS USED IN THE SURVEY TO EVALUATE THE HUMAN AND ROBOT SPEAKERS
A.1 Self-reported affective response
• How negative / positive did the speech make you feel?

A.2 Evaluation of speech performance
Please indicate on a scale of 1 (strongly disagree) to 7 (strongly agree) how much you agree or disagree with the following statements.
• The speech stimulated my imagination (Pathos)
• It was fun to watch the speech (Pathos)
• The speech was engaging (Pathos)
• I can identify with the speaker (Ethos)
• The speaker is competent (Ethos)
• The speaker is knowledgeable (Ethos)
• The speaker is trustworthy (Ethos)
• The speaker is smart (Logos)
• The speaker is rational (Logos)
• The speaker is critical (Logos)

A.3 Attention allocation
• Where do you look at most often while watching the video? Please rank the following items starting with the item you pay the most attention to and finishing with the item you pay the least attention to!
-Face
-Hand and arm gestures
-Body torso / chest
-Entire body moving
-Background
• Which areas of the video were distracting for you while watching? Please rank the following items starting with the item that was most distracting and finishing with the item that was least distracting.
-Face
-Hand and arm gestures
-Body torso / chest
-Entire body moving
-Background

A.4 Direct comparison of both speakers
In the last five questions, please compare the two presenters/presentations with each other. Move the slider either towards the Robot Pepper or Moriba Jah according to your preference. Moving the slider to the left indicates you prefer Robot Pepper. Moving the slider to the right indicates you prefer Moriba Jah.
• Which presentation was more impressive?
• Which presenter did you like most?
• Whose performance was better?
• If you could watch another presentation of the presenters, whose presentation do you want to see?
• Who would you choose as an instructor for an online class?