The Effects of Video Instructor's Body Language on Students' Distribution of Visual Attention: An Eye-tracking Study

Previous studies have shown that the instructor's presence in video lectures has a positive effect on learners' experience. However, it increases the cost of video production and may increase learners' cognitive load. An alternative to the instructor's presence is the use of embodied pedagogical agents that display limited but appropriate social signals. In this extended abstract, we report a small experimental study into the effects of a video instructor's behaviour on students' learning experience, with the long-term aim of better understanding which of the instructor's social signals should be applied to pedagogical agents. We used eye-tracking technology and data visualisation techniques to collect and analyse students' distribution of visual attention in relation to the instructor's speech and body language. Participants also answered questions about their attitudes toward the instructor. The results suggest that neither the instructor's gaze towards the lecture's slides nor a pointing gesture towards them is, on its own, enough to shift viewers' attention; the combination of both, however, is effective. An embodied pedagogical agent should therefore be able to display multimodal behaviour, combining gaze and gestures, to effectively direct the learners' visual attention towards the relevant material. Furthermore, to make learners pay attention to the lecturer's speech, the instructional agent should make use of pauses and emphasis.


INTRODUCTION
In remote learning, videos have the potential to offer many of the advantages of a classroom-like experience and, in addition, they give students control over the pace of their learning (Yousef et al., 2014). Various studies have looked at the effects of different video-based instruction designs in relation to students' engagement, attention, emotion, cognitive load, knowledge transfer and recall (Chen & Wu, 2015; Guo et al., 2014). Based on the eye-mind assumption that eye fixation locations reflect attention distributions (Just & Carpenter, 1980), an increasing number of studies are using eye-tracking techniques to understand how students learn using videos (Lai et al., 2013; Sharma et al., 2014), and especially how the instructor's presence in the video affects students' distribution of visual attention (Garrett, 2015; Kizilcec et al., 2014).
A general positive effect of the instructor's presence in instructional videos has been found (Wang & Antonenko, 2017). For example, it contributes to increasing students' "with-me-ness", i.e. the extent to which the learner succeeds in following the content that is being explained (Sharma et al., 2016). Moreover, as lecturers' hand gestures and facial expressions are often linked to their pedagogical intentions (Tian & Bourguet, 2016; Zhang, 2012), the availability of social signals such as the instructor's pointing gestures and gaze can improve the learning experience and performance (Ouwehand et al., 2015; Pi et al., 2017).
However, including the lecturer's presence in videos entails a high production cost (Hollands & Tirthali, 2014). Moreover, there is a concern that it may increase the learners' cognitive load (Chandler & Sweller, 1991; Mayer, 2001) by inducing a split-attention effect (when learners must divide their attention across multiple information sources). For example, it has been found that learners look at the instructor's face up to 65% of the time on average, and that they switch between the lecturer's face and the instructional material up to every 2.4 seconds, depending on the multimedia design (Garrett, 2015).
A low-cost and accessible alternative to the instructor's presence in videos is the use of embodied pedagogical agents (Li et al., 2015). Agents that display limited but appropriate social signals may also incur less cognitive load than their human models. In this work-in-progress paper, we report the results of an experimental study into the effects of a video instructor's behaviour on students' learning experience, with the long-term aim of better understanding which of the instructor's social signals should be applied to pedagogical agents. The scale of the study is small (8 participants), but at this early stage of the research, the intention is to capture some of the instructor's important social signals in order to build a first prototype that can be used for further studies. We briefly describe our pedagogical agent prototype in the conclusion of the paper.

Method
We used eye-tracking technology and data visualisation techniques (Bojko, 2009) to collect and analyse students' distribution of visual attention in relation to the video instructor's speech and body language. Participants in the experiment also answered questions about their attitudes toward the instructor.

Video Stimulus
All participants watched the same video (duration of 4 minutes and 13 seconds) on the topic of "Design Techniques" (covering brainstorming, mind maps and storyboards), extracted from a 3rd-year undergraduate telecommunications engineering course. The video showed the instructor's head and upper body on the right side of the lecture's slides, all within the same frame (see Figure 1). Prior to conducting the experiment, the video was manually annotated with instructor behaviour markers, using the ANVIL annotation tool (Kipp, 2014). The markers included three for gaze (looking towards the camera, i.e. the viewer; looking towards the slides; looking elsewhere); seven for hand gestures (pointing towards the slide; waving hands; clasping hands; unfolding hands; ball; other gesture; no gesture); and three for speech (speaking with direct reference to the slide's content; not directly referring to the slide's content; no speech). The annotations were not displayed to the participants.

Participants
Ten undergraduate students from an International Bachelor's degree in Electronic Engineering, delivered in English in China, were recruited. Prior to the study, each participant completed a background questionnaire to ensure that all participants shared a similar level of prior domain knowledge (all of them had taken the module of the video in the previous semester) and English comprehension (CET6 level). Two participants had to be excluded due to problems with their eye-tracking data, leaving a sample of eight participants (three males and five females, aged 20 to 22). None of them had abnormal vision or hearing.

Procedure and Equipment
The experiment was conducted in individual sessions of approximately 10 minutes. Before the video stimulus started, the experimenter gave participants a brief introduction to the experiment and the eye-tracking equipment, and each participant was asked to follow a simple procedure for equipment calibration purposes. The participants were then asked to watch the instructional video without being able to pause or stop it. To ensure that they were paying attention and trying to learn from the video, they were told that they would have to write a summary of the video content immediately after watching it.
The participants' eye position was measured using the Tobii 4C eye tracker. The device was mounted at the bottom of the computer monitor on which the lecture video was displayed. The Tobii 4C operates at a distance of 50-95 cm, has an accuracy of 0.4 degrees, and samples at 90 Hz. The computer's screen size was 13.3 inches and the monitor resolution was 1440 x 900 pixels.

Measurements
Visual attention is typically measured in the form of fixations, which (in our study) are periods of at least 200 ms during which the viewer's gaze stays within a small area of the screen (an area of side limited to 10 pixels). Fixations are connected by saccades, and a sequence of fixations and saccades is called a scanpath.
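The thresholds above (200 ms minimum duration, 10-pixel dispersion) correspond to a standard dispersion-based fixation filter. The following is a minimal illustrative sketch of such a filter, not the software actually used in the study; the function names and the 90 Hz sampling rate (that of the Tobii 4C) are our assumptions.

```python
def dispersion(points):
    """Bounding-box dispersion of a set of (x, y) gaze points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def detect_fixations(samples, max_dispersion=10, min_duration_ms=200, hz=90):
    """Dispersion-based (I-DT-style) fixation detection sketch.

    samples: list of (x, y) gaze points recorded at a fixed rate `hz`.
    A fixation is a run of samples whose dispersion stays within
    `max_dispersion` pixels for at least `min_duration_ms`.
    """
    min_samples = int(min_duration_ms / 1000 * hz)  # 18 samples at 90 Hz
    fixations = []
    i = 0
    while i <= len(samples) - min_samples:
        j = i + min_samples
        if dispersion(samples[i:j]) <= max_dispersion:
            # grow the window while dispersion stays within the threshold
            while j < len(samples) and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            window = samples[i:j]
            fixations.append({
                "x": sum(x for x, _ in window) / len(window),  # centroid
                "y": sum(y for _, y in window) / len(window),
                "duration_ms": len(window) * 1000 / hz,
            })
            i = j  # continue after the fixation
        else:
            i += 1  # slide the window forward
    return fixations
```

On a synthetic stream of 20 stable samples followed by 20 widely scattered ones, the sketch returns a single fixation at the stable location lasting roughly 222 ms.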

Areas of Interest
Areas of Interest (AOIs) are parts of the video frame that are of high importance for the hypothesis of the study. Two non-overlapping AOIs were determined: the instructor area and the slide area (see Figure 1). We found that, on average, participants spent 95.33% of their time (percentage of gaze point distribution) watching one of the two AOIs. They spent slightly more time on the instructor AOI (M = 49.09%, SD = 14.12) than on the slide AOI (M = 46.23%, SD = 13.58), although a paired-sample t-test showed that the difference is not significant (t(7) = 0.29, p > .05, ns). After dividing the instructor AOI into two (the face area and the body area), we observed that students looked more at the instructor's face than at the body gestures (M = 75.66%, SD = 11.28).
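The gaze-point distribution percentages above reduce to counting which rectangular AOI each gaze sample falls in. A minimal sketch, assuming rectangular AOIs given as screen-pixel bounding boxes (the function name and coordinates are illustrative, not taken from the study):

```python
def dwell_percentages(gaze_points, aois):
    """Percentage of gaze points falling in each rectangular AOI.

    gaze_points: list of (x, y) screen positions.
    aois: dict mapping an AOI name to its (x0, y0, x1, y1) bounding box.
    Points outside every AOI are counted in none, so percentages
    need not sum to 100 (cf. the 95.33% figure reported above).
    """
    n = len(gaze_points)
    return {name: 100 * sum(1 for x, y in gaze_points
                            if x0 <= x < x1 and y0 <= y < y1) / n
            for name, (x0, y0, x1, y1) in aois.items()}
```

For example, with two boxes standing in for the slide and instructor AOIs, four gaze points split evenly between them yield 50% dwell on each.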
Table 1 shows, for each AOI and each instructor behaviour, the average fixation rate, i.e. the average fixation count divided by the total duration of the behaviour (note that the gaze, hand and speech behaviours are not mutually exclusive). Surprisingly, behaviours that are meant to attract attention to the lecture's slides (e.g. gaze towards the slide, hand pointing and speech with reference) have a higher fixation rate on the instructor AOI than on the slide AOI. This could be explained by a delay between the behaviour and its effect on the students' visual attention. It could also be explained by a higher rate of transitions between the two AOIs (see next section). For a better explanation, combinations of behaviours should in fact be scrutinised (see the visualisation sections).
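The fixation-rate metric of Table 1 can be sketched as follows, assuming fixations labelled with an onset time and an AOI, and behaviour spans taken from the ANVIL annotation; the data layout is our assumption for illustration, not the study's actual format.

```python
def fixation_rate(fixations, behaviour_intervals, aoi):
    """Average fixation rate (count / second) on one AOI while a
    given behaviour is active.

    fixations: list of (onset_time_s, aoi_label) pairs.
    behaviour_intervals: list of (t0, t1) spans during which the
    behaviour (e.g. 'gaze towards slide') was annotated as active.
    """
    total = sum(t1 - t0 for t0, t1 in behaviour_intervals)
    count = sum(1 for t, label in fixations
                if label == aoi
                and any(t0 <= t < t1 for t0, t1 in behaviour_intervals))
    return count / total if total else 0.0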

Transitions
A transition is a movement from one AOI to another. The typical measure related to transitions is the transition count, i.e. the number of transitions between two AOIs.
Table 2 shows average transition counts across participants in relation to different instructor behaviours. Given that the total duration of each behaviour is variable, we computed an average transition rate (transition count/duration) for each behaviour. We can see that when the instructor is looking at the slide, the transition rate is relatively high (1.195), which contributes to lowering the fixation rate on the slide AOI and corroborates the findings of Table 1.
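Both measures follow directly from the sequence of AOI labels along a scanpath. A minimal sketch (names are illustrative):

```python
def transition_count(fixation_aois):
    """Number of AOI-to-AOI transitions in a scanpath: count adjacent
    pairs of fixations that land on different AOIs."""
    return sum(1 for a, b in zip(fixation_aois, fixation_aois[1:]) if a != b)

def transition_rate(fixation_aois, duration_s):
    """Transitions per second over a behaviour span of duration_s,
    normalising for the variable duration of each behaviour."""
    return transition_count(fixation_aois) / duration_s
```

For example, the scanpath slide, slide, instructor, slide contains two transitions; over a 4-second span that is a rate of 0.5 transitions per second.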

Attention maps
An attention map (or heat map) is a graphical representation of the attention distribution. Different kinds of attention maps have been proposed (Bojko, 2009), e.g. the "fixation count heat map", which results from the aggregation of fixation counts across time and participants (also called a bee swarm), and the "absolute gaze duration heat map", which is the aggregation of absolute gaze durations across time and participants.
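The fixation count heat map boils down to binning fixation positions over a spatial grid and summing across time and participants. A minimal binned sketch for our 1440 x 900 screen follows; the bin size is an arbitrary assumption, and a production heat map would typically also apply Gaussian smoothing and a colour map.

```python
def fixation_count_heatmap(fixations, width=1440, height=900, cell=20):
    """Aggregate fixation (x, y) positions into a grid of
    cell x cell pixel bins, summed across time and participants."""
    rows, cols = height // cell, width // cell
    grid = [[0] * cols for _ in range(rows)]
    for x, y in fixations:
        # clamp to the last bin so edge coordinates stay in range
        grid[min(int(y) // cell, rows - 1)][min(int(x) // cell, cols - 1)] += 1
    return grid
```

Rendering the grid with a colour scale (and, for the gaze-duration variant, accumulating fixation durations instead of counts) yields maps like those of Figure 2.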
Figure 2 (left) shows a fixation count heat map calculated on a 5.32-second clip during which the instructor is performing a pointing gesture, looking at the slide and delivering speech that directly refers to the slide content. With the three behaviours combined, the students' visual attention is clearly directed towards the slide AOI, where gaze duration is also longer. Figure 2 (right) shows a fixation count heat map calculated on a 4.52-second clip during which the instructor is performing a pointing gesture and delivering speech that directly refers to the slide content, but is looking at the camera. The students' visual attention is scattered over the slide, as if the gesture alone did not allow them to find the relevant information, and the gaze duration is actually longer on the instructor's face.

Temporal Evolution of Scanpaths
Figure 3 shows horizontal fixation positions on the vertical axis and time on the horizontal axis. Each line corresponds to a different participant. The top image was calculated on the clip of Figure 2 (left) (slightly extended to a 6-second duration), during which the instructor is pointing at the slide while looking towards it. The bottom image was calculated on the clip of Figure 2 (right) (also extended to the same 6-second duration), during which the instructor is pointing at the slide while looking at the audience.
We can clearly observe fewer transitions between the two AOIs in the top image, showing that the instructor's gaze towards the slide, when combined with a pointing gesture, helps students maintain their attention on the slide. The pointing gesture alone does not prevent students from shifting their attention back and forth between the slide and the instructor's face, hence potentially increasing their cognitive load. In the bottom image, some students keep staring at the instructor's face, whereas in the top image the opposite can be observed: some students keep staring at the slide without shifting their attention back to the instructor.

QUESTIONNAIRE RESULTS
After watching the video, the participants answered a short questionnaire about their attitude towards the video instructor.
The results show that all participants consider the instructor's presence useful, and 75% of them think that the instructor's behaviour helps them understand the lecture's content. Two behaviours in particular, 'Hand pointing' and 'Speech with emphasis', are regarded as particularly important.
When the instructor performs a pointing gesture, 87.5% of the participants thought that there must be something worthy of attention in the slide, which corroborates the results of the eye-tracking experiment. Similarly, 62.5% of the participants believed that 'Speech with emphasis' means that the instructor is saying something important.
Further results show that participants feel most concerned with the instructor's speech, followed by the slide area, and finally the instructor's body. Indeed, we already know from the eye-tracking experiment that participants spend much more time looking at the instructor's face area than at the body area. The main function of the instructor's behaviour is thus to help shift the students' visual attention between the teaching material and the teacher's face, i.e. the speech.

CONCLUSION AND FURTHER WORK
In this paper, we reported a small experimental study into the effects of a video instructor's behaviour on students' distribution of visual attention. The results suggest that pointing gestures combined with gaze constitute an important and useful social signal. An embodied pedagogical agent should be able to display multimodal behaviour, combining gaze and gestures, to effectively direct the learners' visual attention towards the relevant material. Furthermore, to make learners pay attention to the speech, the instructional agent should make use of pauses and emphasis.
We have implemented a prototype of an embodied pedagogical agent for further studies on what social signals such an agent should display (Figure 4). We chose the social robot Pepper (SoftBank Robotics, 2017) because of its neutrality (e.g. it is non-gendered), because a robot looks playful and non-judgmental (Clark & Mayer, 2011), and because it is not expected to display the complex, but not always useful, behaviour of a human instructor. Pepper's main social signals currently include gaze (head direction) and pointing gestures. Further studies using Pepper are being conducted to test the acceptability of a robot as instructor and the social signals it should display to support learners.

Figure 2: Fixation count heat maps where the instructor looks at the slide (left) versus the camera (right).


Figure 3: Scanpaths when the instructor is looking at the slide (top image) versus the camera (bottom image). The instructor AOI is the top dark grey area.

Figure 4: Pepper, the virtual social robot and embodied pedagogical agent.

Table 1: Average fixation rate (count/duration) on each AOI in relation to instructor behaviour, and total duration (in seconds) of the observed behaviours.

Table 2: Average transition count [standard deviation] and transition rate (count/duration) between the two AOIs in relation to instructor behaviour.