Multitasking and Monetary Incentive in a Realistic Phishing Study

This paper introduces an empirical study that places participants in task settings similar to real-world ones and captures fine-grained user behavioral information. In online experiments, participants recruited from the Mechanical Turk human subject pool sorted legitimate and phishing emails. Subgroups of these remote users performed a secondary question-answering task and/or were incentivized by a monetary reward based on email sorting accuracy. This web-based framework automates the complete process, from the informed consent to a post-study questionnaire, and can be scaled up to a large number of human subjects. Preliminary analysis indicates that the monetary incentive can positively affect users' behavior and performance, but not in a straightforward manner. Multitasking, on the other hand, has a negative effect on users' ability to correctly classify emails.


INTRODUCTION
Technological countermeasures do not always protect information assets when human elements fail due to distraction or a lack of awareness (Thomson & Solms, 1998; Willison & Warkentin, 2013). Active research efforts have studied the risks of phishing and other computer security issues. Many are conducted in a lab environment that can significantly change the attitude and behavior of participants. Moreover, data collection often relies on video recording or self-reporting, which is hard to scale up and does not reflect realistic scenarios such as an employee's office.
Our contributions are three-fold: (1) the user study focuses on multitasking and a monetary incentive in sorting real legitimate and phishing emails; (2) the experimentation framework supports large-scale "in the wild" experiments in an automated and unattended manner similar to real-world settings; (3) the implementation enables data capture and collection of micro-level user behaviors. We developed a web-based solution using JavaScript and the LAMP stack (Lawton, 2005). It integrates Roundcube (https://roundcube.net), a webmail system, and Qualtrics (https://www.qualtrics.com), an online survey system.

Benítez et al. (2017) proposed a web-based tool that enables researchers to manage questionnaires and visualize the collected data. Kaczmarek et al. (2015) presented an unattended study of users performing security tasks such as pairing wireless devices. Gajos et al. (n.d.) have been conducting an online user study on multitasking with the help of Google Analytics. Ollesch et al. (2006) found no significant difference between psychometric data collected in a lab setting and its online, virtual counterpart. Many successful studies have recruited remote participants from Amazon Mechanical Turk to perform research-focused tasks (Bartneck et al., 2015; Kittur et al., 2008; Layman & Sigurdsson, 2013). For example, Bianchi et al. (2015) utilized Amazon Mechanical Turk to disseminate noVNC clients via HTTP to end users to study Android GUI design-based attacks. Atterer et al. (2006) proposed a framework that uses web technologies (e.g., JavaScript, proxies) to track user interactions with a web page. We expanded upon their idea to track users' interactions with a webmail client.

DESIGN OF USER STUDY EXPERIMENT
A key challenge here is how to balance the uncertainty and familiarity of an email's source to participants. Participants in this user study were instructed to assume the role of an administrative assistant working for the department chair, Dr. Jane Smith. They did not need to respond to any of the 40 emails, only to sort each one into either a "Keep" or "Suspicious" folder, without using the internet or other sources.

Condition-based User Tasks
During the pre-study survey, participants were instructed in how they were expected to complete the experiment. Once a participant ran out of time, or finished early and chose to move on, s/he was taken to the post-study survey in Qualtrics. For the incentive condition, participants could earn additional monetary compensation, up to $8.00, based on the number of correctly sorted emails, in a tiered scheme. Participants in the combined incentive and multitasking condition, in order to be eligible for the incentive, had to correctly sort 30 out of 40 emails (75%) and correctly answer 15 out of 20 multitasking questions (75%).
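The eligibility rule for the combined condition can be sketched as a simple check; the two 75% thresholds come from the description above, while the function name and the idea of a single boolean gate are illustrative assumptions (the tier amounts of the payment scheme are not reproduced here).

```javascript
// Sketch of the incentive-eligibility rule for the combined
// incentive + multitasking condition. Thresholds (30/40 emails,
// 15/20 questions, both 75%) are taken from the study description;
// the function itself is a hypothetical illustration.
function isEligibleForIncentive(correctEmails, correctQuestions) {
  return correctEmails >= 30 && correctQuestions >= 15;
}

console.log(isEligibleForIncentive(32, 16)); // true
console.log(isEligibleForIncentive(32, 14)); // false
```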

Email Design and Phishing Cues
The 40 emails were presented in a random order for each participant. Twenty phishing emails were derived from a semi-random sample of emails in Cornell University's "Phish Bowl" database (it.cornell.edu/phish-bowl). The 20 legitimate emails were derived from emails received by the research team. Their selection and design for experimental use considered 14 different phishing cue categories, including Sender's Display Name, URL Hyperlink, and Spelling and Grammar Errors, among others.

User Self-reported Information
Like other user studies, we were interested in acquiring demographics and other self-reported information on participants' experience to better interpret the experiment results. For example, in the post-study survey, each participant reported whether s/he had previously taken a network or cybersecurity course/certificate and estimated the number of correctly sorted emails. Moreover, during email sorting, each participant's confidence in classifying each email was collected by selecting a rating (1 = not confident at all, to 10 = extremely confident).

THE EXPERIMENTATION FRAMEWORK
The solution supported user tasks conducted remotely, managed concurrent user experiments, and logged participants' in-experiment actions and survey responses in real time.

System Workflow
The system consists of four major components: on the client-side web browser, a JavaScript-based Data Capturer collects participants' input and an AJAX-based Data Sender communicates the captured data to the server; on the server side, a PHP Listener receives the data sent via AJAX and a Logger records it. The Qualtrics view is embedded as an HTML Inline Frame (IFrame) in Roundcube's interface. As in Figure 2, participants at Amazon Mechanical Turk were led to an online pre-study survey for demographic information and the informed consent, powered by Qualtrics. They were then redirected to the modified webmail client with user account information passed as URL parameters. The post-study survey, including questions on participants' experiment experience, varies according to their assigned experimental conditions.
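The capture-and-send path on the client side can be sketched as follows. This is a minimal illustration, not the study's actual code: the endpoint path (`/log.php`), the payload field names, and the helper names are assumptions.

```javascript
// Sketch of the client-side Data Capturer / Data Sender pair.
// Field names and the /log.php endpoint are hypothetical; they stand
// in for the study's actual PHP Listener interface.
function buildLogRecord(participantId, eventType, detail) {
  return {
    participant: participantId,  // assigned account/worker identifier
    event: eventType,            // e.g., "click", "hover", "scroll"
    detail: detail,              // event-specific data
    timestamp: Date.now()        // client-side time in milliseconds
  };
}

function sendLogRecord(record) {
  // AJAX POST to the server-side PHP Listener (browser only).
  if (typeof XMLHttpRequest === 'undefined') return; // not in a browser
  var xhr = new XMLHttpRequest();
  xhr.open('POST', '/log.php', true); // hypothetical endpoint
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify(record));
}
```

Keeping the record-building step separate from the transport step makes the captured data easy to serialize and the sender trivial to swap out.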

User Interface Design
We disabled some of Roundcube's functionalities to prevent unexpected operations, such as the "New" and "Reply" menus for email editing.We added new elements including the "Keep" and "Suspicious" mail folders and a "Rating" drop-down list menu for reporting classification confidence.
For data collection, we identified UI artifacts in the source code of Roundcube and added listeners on their respective components, focusing on behaviors that a user commonly performs:
- Click: an event triggered when the user left-clicks on an object;
- Hover: an event triggered when the mouse hovers over certain interactive objects;
- Scroll: an event triggered when the user scrolls the mouse in the email body view;
- Mouse Movement: mouse cursor coordinates recorded.
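The four behavior types above can be wired up with standard DOM listeners. This is a sketch under assumptions: the `root` element, the callback shape, and the detail fields are illustrative, not Roundcube's actual components.

```javascript
// Illustrative wiring for the four captured behaviors (click, hover,
// scroll, mouse movement). The root element and onEvent callback are
// assumed; Roundcube's real UI components are not shown here.
function attachBehaviorListeners(root, onEvent) {
  root.addEventListener('click', function (e) {
    onEvent('click', { x: e.clientX, y: e.clientY });
  });
  root.addEventListener('mouseover', function (e) {
    onEvent('hover', { target: e.target.tagName });
  });
  root.addEventListener('scroll', function () {
    onEvent('scroll', { top: root.scrollTop });
  }, true); // capture phase, so scrolls inside the body view are seen
  root.addEventListener('mousemove', function (e) {
    onEvent('mousemove', { x: e.clientX, y: e.clientY });
  });
}
```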

Data Collection
Tables 1-3 show a sample of the information collected on the Roundcube webmail client and the online Qualtrics survey system.

PRELIMINARY RESULT ANALYSIS
As in Table 4, we performed the experiments in batches of 40 participants at a time, for 177 participants in total. They averaged 34 years of age. Sixty participants were female and 117 were male. Sixteen participants were students. One participant noted that English was not his/her first language. Only 146 participants were able to finish sorting all 40 emails in the given time. The condition 1 group, where participants took on the two concurrent tasks of email sorting and question-answering under the monetary reward, had the lowest completion rate. One potential explanation is that the tasks under this condition were cognitively demanding. The condition 2 group, where participants concentrated on email sorting with the monetary incentive, had the highest completion rate.

Email Sorting Accuracy
Shown in Table 5 for all participants, hypothesis tests indicated a significant difference in the email sorting scores between condition 1 and condition 2, and between condition 2 and condition 3, using a significance level α of 0.05. Overall, multitasking significantly worsened a participant's sorting accuracy. No-multitasking combined with the incentive helped participants carry out the tasks. However, the incentive alone did not make a difference in either the multitasking or no-multitasking case. Using the Bonferroni correction for multiple comparisons, where the number of hypotheses m is 6, the level of significance α drops to 0.05/6 ≈ 0.0083; under this stricter threshold, we did not find significance in these results. As shown in Table 6, the subset of 146 participants who sorted all 40 emails displayed similar differences across conditions with lower

Figure 1: User Study Task Interface

Participants were randomly assigned to one of four experimental conditions determined by two factors: multitasking/no-multitasking and incentive/no-incentive. For the multitasking condition in Figure 1(a), participants answered 20 sets of questions in Qualtrics on the right side while completing the email sorting task in Roundcube. Each question set was presented for two minutes; participants could manually advance to the next question set after one minute had elapsed. For no-multitasking in Figure 1(b), participants only had the email sorting task and had 30 minutes to complete it. Second, participants were in either the incentive or no-incentive condition, with compensation and eligibility as described under Condition-based User Tasks.
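The question-set pacing in the multitasking condition (two minutes per set, manual advance allowed after one minute) reduces to two timing checks. This is a sketch of that pacing logic only; the constant and function names are assumptions, and the actual Qualtrics timer configuration is not shown.

```javascript
// Pacing rules for one multitasking question set, per the figure
// caption: auto-advance at 2 minutes, manual advance allowed after 1.
// Names are hypothetical; only the two thresholds come from the text.
var MANUAL_ADVANCE_MS = 60 * 1000;  // 1 minute
var AUTO_ADVANCE_MS = 120 * 1000;   // 2 minutes

function canManuallyAdvance(elapsedMs) {
  return elapsedMs >= MANUAL_ADVANCE_MS;
}

function mustAutoAdvance(elapsedMs) {
  return elapsedMs >= AUTO_ADVANCE_MS;
}

console.log(canManuallyAdvance(30 * 1000)); // false
console.log(canManuallyAdvance(90 * 1000)); // true
console.log(mustAutoAdvance(90 * 1000));    // false
```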

Table 1: Features Collected on Roundcube

Table 2: Features Collected on Qualtrics

Table 3: Information Collected of User Operations Switching Between Roundcube and Qualtrics

Table 4: Participants by Condition

Table 5: Overall Sorting Accuracy for All 177 Participants