Designing Mobile Friendly CAPTCHAs: An Exploratory Study

CAPTCHAs (Completely Automated Public Turing Test to Tell Computers and Humans Apart) are one of the most widely used authentication mechanisms that help to prevent online service abuse. With the advent of mobile computing, mobile devices such as smartphones and tablets have become the primary way people access the Internet. As a result, increasing attention has been paid to designing CAPTCHAs that are mobile friendly. Although such CAPTCHAs generally show their advantages over traditional ones, it is still unclear what the best practices are for designing a CAPTCHA scheme that is easy to use on mobile devices. In this paper, we present an exploratory study that focuses on developing a more holistic view of usability issues with interactive CAPTCHAs to inform design guidance. This is done through investigating the usability performance of seven mobile friendly CAPTCHA schemes representing five different CAPTCHA types.


INTRODUCTION
CAPTCHAs (Completely Automated Public Turing Test to Tell Computers and Humans Apart) are a kind of authentication mechanism that utilises hard Artificial Intelligence (AI) problems to create tests that distinguish humans from computers (Von Ahn et al. 2004). CAPTCHAs have been widely used on webpages to protect online services from automated attacks and abuse since the term was coined in 2003. The tests in a CAPTCHA scheme are often referred to as challenges or Human Interactive Proofs (HIPs), and a popular type of test is based on character recognition (Chellapilla et al. 2005). For example, Google ReCAPTCHA (version 1) asks users to recognise two separate, distorted words and/or number plates to prove they are real humans (Figure 1). With the advent of mobile computing, mobile devices such as smartphones and tablets have overtaken laptops and desktop computers as the most used devices for accessing the Internet and online services (Ofcom 2015; comScore 2015; Pew Research Center 2015). Unlike desktop and laptop computers, which often feature large displays with peripheral input devices (e.g., mouse/trackpad, keyboard), smartphones and tablets are usually equipped with small, touch-enabled displays. When CAPTCHAs are used on such a device, the challenge lies not only in the display size but also in the shift of the main interaction methods from point and click + type to touch gestures (Wismer et al. 2012).
Various approaches have been taken to address this challenge. One approach is to use interaction techniques that are more 'compatible' with touch screens (e.g., using mouse point + click instead of keyboard input) (Chow et al. 2008; Desai & Patadia 2009; Chaudhari et al. 2011; Lin et al. 2011; Ye et al. 2013; Conti et al. 2016). The other approach is to rely purely on interaction methods that are native to mobile devices (Okada & Matsuyama 2012; Saxena et al. 2012; Jiang & Tian 2013; Jiang & Dogan 2015; Leiva & Alvaro 2015; Tsuruta et al. 2013). However, unlike character recognition based CAPTCHAs, for which some general design guidelines for improving usability exist (Yan et al. 2008; Bursztein et al. 2014), guidelines for designing mobile friendly CAPTCHA schemes remain underexplored.
In this paper, we present an exploratory study that aims to develop a more holistic view of usability issues in mobile friendly CAPTCHA design to inform design guidance. In the study, the usability performance of seven mobile friendly CAPTCHAs representing five different types of mobile friendly CAPTCHA schemes is examined. The paper is organised as follows. The methodology is explained in Section 2 followed by the results and discussion in Section 3. The design implications are introduced in Section 4 and the conclusion is drawn in Section 5.

CAPTCHA selection and benchmarking
The selection needs to focus on CAPTCHA schemes that are mobile friendly. In other words, it should consider whether a CAPTCHA can utilise input interaction methods that are native to mobile devices (i.e., touch gestures) rather than only being 'supported' on mobile devices (i.e., soft/virtual keyboards). Therefore, CAPTCHAs that require users to recognise characters in an image, audio clip or video clip were excluded. Moreover, as the selected CAPTCHAs needed to be installed and configured in a testing environment for the experiment, their maturity and availability were also key considerations for inclusion. Table 1 lists the CAPTCHAs selected for this study together with their success criteria, interaction method and challenge type. Note that, except for TapCHA v2, all CAPTCHAs in the table are commercial solutions, but they also represent some of the most common design approaches seen in the research literature. For example, FunCaptcha uses a face recognition scheme (D'Souza et al. 2012; Kim et al. 2014; Kalsoom et al. 2012) and an image orientation scheme (Gossweiler et al. 2009; Kim et al. 2010; Banday & Shah 2015). KeyCAPTCHA uses a puzzle scheme similar to that of Gao et al. (2010). It should also be noted that Google ReCAPTCHA version 1, a character-recognition based CAPTCHA, was chosen as the benchmarking CAPTCHA for two main reasons. First, it is the most widely used 'traditional' CAPTCHA, so any mobile friendly CAPTCHA should achieve better usability performance than it on a mobile device. Second, whereas Google ReCAPTCHA version 1 is the most widely used character-recognition based CAPTCHA solution, there is no universally recognised mobile CAPTCHA scheme.

Apparatus
CAPTCHAs are widely used to protect online services, and a typical application area is website signup and login. A dummy user login form was therefore created using Bootstrap, a responsive front-end framework, to ensure the form would always be displayed properly on a mobile device. Note that the user information was hardcoded in this form, so participants were only required to complete the CAPTCHA challenge in order to submit the form. All 7 selected CAPTCHAs were integrated into this form individually, so 7 forms were prepared. An Apple iPhone 6, which features a 4.7-inch 1334x750 display, was used as the testing device to enable participants to access these online login forms.

Design
A within-subjects design was chosen for this study, so each participant had to complete all CAPTCHA tests on the mobile device. Moreover, for counterbalancing, each participant was asked to complete Google ReCAPTCHA v1, then a tap type CAPTCHA, followed by a drag & drop type CAPTCHA (as listed in Table 1) in an interleaved manner. The order of types remained the same, but the actual CAPTCHA selected for each type was randomised using a script.
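The ordering described above could be generated along the following lines. This is a minimal sketch of one possible reading of the design, not the script used in the study: the function name `build_session_order` and the scheme names are hypothetical placeholders (the real grouping by type is the one given in Table 1), and the assumption is that the session starts with the benchmark and then alternates between the two interaction types.

```python
import random


def build_session_order(benchmark, tap_schemes, drag_schemes, rng):
    """Return one participant's presentation order.

    The sequence of types is fixed (benchmark first, then tap and
    drag & drop alternating); only the scheme chosen for each slot
    is randomised.
    """
    taps = list(tap_schemes)
    drags = list(drag_schemes)
    rng.shuffle(taps)
    rng.shuffle(drags)

    order = [benchmark]
    # Interleave the two shuffled groups: tap, drag, tap, drag, ...
    for i in range(max(len(taps), len(drags))):
        if i < len(taps):
            order.append(taps[i])
        if i < len(drags):
            order.append(drags[i])
    return order


# Hypothetical placeholder names, not the schemes' actual grouping.
TAP = ["tap_scheme_1", "tap_scheme_2", "tap_scheme_3"]
DRAG = ["drag_scheme_1", "drag_scheme_2", "drag_scheme_3"]

order = build_session_order("benchmark", TAP, DRAG, random.Random(7))
print(order)
```

Passing a seeded `random.Random` instance keeps each participant's order reproducible, which helps when the session logs are matched against the video recordings later.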

Procedure
Each participant was briefed on the purpose of the study and their consent was obtained before the actual experiment. They were then presented with all CAPTCHAs on the testing device and asked to try each of them several times to familiarise themselves with the schemes and the mobile device. After that, each participant was asked to complete 6 challenges for each CAPTCHA, so a total of 120 tests per CAPTCHA were logged (20 participants x 6 challenges). Their performance was video recorded by an observer for further analysis. After completing all tests, they were asked to fill in a questionnaire and provide oral feedback about their experience. Before the smartphone was handed to the next participant, its screen was cleaned to remove finger smudges and the tests were reset.

Measurement
Common usability metrics were used in the measurement, including completion time (efficiency), completion rate (effectiveness) and errors (effectiveness). In addition, the System Usability Scale (SUS) was used as the questionnaire method to obtain participants' subjective assessment of the CAPTCHA schemes they had tried.

Completion time
The completion time was defined as the average time participants spent on completing a CAPTCHA test. It was recorded in seconds, starting as soon as the CAPTCHA was fully displayed in the login form and stopping when the submit button in the form was tapped. The results are shown in Figure 2 (min = 4.95; max = 8.8; SD = 1.53; Mean = 6.71), where the horizontal line across all columns is the benchmark (Google ReCAPTCHA v1: 8.55 seconds). All CAPTCHAs, except for KeyCAPTCHA, were completed more quickly than Google ReCAPTCHA v1, and four CAPTCHAs reported an average completion time of less than 6 seconds (p < 0.05, one-way ANOVA). It was noticed that there were two obvious lags when participants were completing a KeyCAPTCHA test. The first lag was related to the position of the puzzle pieces, as many participants tended to separate them from the stack first before deciding where to place them. This happened more often when it was not easy to distinguish puzzle pieces with similar patterns and colours (Figure 3). Second, many participants tried to place each puzzle piece at its exact location, as if they were completing a real jigsaw puzzle. This caused some lag because this kind of control requires precision, whilst a participant's finger is usually larger than a puzzle piece on the screen (i.e., the "fat finger problem" (Vogel et al. 2007; Roudaut et al. 2008)).
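The timing definition above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the timestamps are made-up values rather than the study's data, and the one-way ANOVA reported above is not reproduced here.

```python
from statistics import mean


def completion_time(displayed_at, submitted_at):
    """Seconds from the CAPTCHA being fully displayed to the submit tap."""
    if submitted_at < displayed_at:
        raise ValueError("submit time precedes display time")
    return submitted_at - displayed_at


def mean_times(trials_by_scheme):
    """Average completion time per scheme from (displayed, submitted) pairs."""
    return {
        scheme: mean(completion_time(start, stop) for start, stop in trials)
        for scheme, trials in trials_by_scheme.items()
    }


def quicker_than(means, benchmark):
    """Schemes whose mean time is below the benchmark scheme's mean."""
    limit = means[benchmark]
    return sorted(s for s, m in means.items() if s != benchmark and m < limit)


# Illustrative timestamps only, not the study's data.
trials = {
    "benchmark": [(0.0, 8.0), (10.0, 19.0)],
    "scheme_a": [(0.0, 5.0), (20.0, 26.0)],
    "scheme_b": [(0.0, 9.0), (30.0, 40.0)],
}
means = mean_times(trials)
print(means)
print(quicker_than(means, "benchmark"))
```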

Completion rate
The completion rate was defined as the percentage of CAPTCHAs that were completed successfully by participants. Note that, as a CAPTCHA challenge is considered solved only when all success criteria in the test are met, no partial success was considered in the measurement. The results are shown in Figure 4.

Figure 4: Average completion rate of CAPTCHAs.
It was noticed that failures recorded with FunCaptcha were related to both types of challenges it used. For the "Roll the ball" challenge, the main issue concerned the progress indication when a sub-test was completed, as a tick would always appear regardless of whether the test had been completed successfully. For the "Move the woman in the middle" challenge, the other 7 images were created using one contrasting face at different angles and distances, which sometimes confused participants. For Google ReCAPTCHA v2, most failures were related to one particular type of test in which participants were asked to select all images containing a specific object (e.g., a store front). As the images used in these tests were very small and were presented at different levels of detail, participants could easily miss some relevant images. Moreover, as there was no undo mechanism after selecting an image, any accidental wrong selection would cause the whole test to fail.
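The all-or-nothing completion rate defined in this section can be sketched as follows. The function names and the sample attempts are hypothetical; each attempt is represented as a list of booleans, one per success criterion.

```python
def is_solved(criteria_met):
    """A challenge counts as solved only if every success criterion is met."""
    # Guard against an empty list, for which all() would return True.
    return len(criteria_met) > 0 and all(criteria_met)


def completion_rate(attempts):
    """Percentage of attempts that were fully solved (no partial credit)."""
    if not attempts:
        raise ValueError("no attempts recorded")
    solved = sum(1 for criteria in attempts if is_solved(criteria))
    return 100.0 * solved / len(attempts)


# Illustrative attempts only: the second one met one of two criteria
# and therefore counts as a failure.
attempts = [[True, True], [True, False], [True], [False]]
rate = completion_rate(attempts)
print(rate)
```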

Errors
Errors were defined as the total number of excessive, missing and/or wrong user actions in comparison to the minimal user actions required for completing a test successfully. Errors were counted and grouped into four categories as shown in Table 2. Note that the number of errors was calculated as the difference between the total number of actual user actions and the total number of correct minimal actions.

Accidental (A)
Errors caused by accidental user actions (e.g., a user chose a wrong object by accident).

Unnecessary (U)
Errors caused by unnecessary user actions (e.g., a user clicked an object twice to confirm the selection).

Defective (D)
Errors caused by system defects (e.g., an object was not responsive to touch operation, or the CAPTCHA was not displayed properly on the screen so the user had to zoom in and move the screen).

Misinterpreted or misjudged (M)
Errors caused by incorrect user actions due to the user's misunderstanding of instructions or misjudgement of success criteria (e.g., a user clicked on an object which they thought was relevant).
The detailed error breakdowns for each CAPTCHA are shown in Figure 5. In general, sweetCaptcha, TapCHA and visualCAPTCHA reported the fewest errors across all categories, and TapCHA also reported no errors in Categories U and M (p < 0.05). Errors in Category M were only found with Confident CAPTCHA and Google ReCAPTCHA v2, indicating that the rest of the CAPTCHAs had reasonably clear instructions and easily distinguishable objects. As these two CAPTCHAs were based on checking image relevancy, it was noted that participants sometimes over-interpreted what was required for the challenge. For example, when asked to select an image where people are present in a Confident CAPTCHA test, some participants chose not to select the only relevant image, in which some LEGO men were present, which led to a test failure. This also happened with Google ReCAPTCHA v2, where participants selected not only the parking sign facing forward but also the sign shown from the back in the image (Figure 6) when they were asked to choose all street signs. The two CAPTCHAs also reported a high number of errors in Categories A and D. As both were based on image recognition and images were displayed at different levels of detail, it was observed that many participants had to zoom in on an image to check its relevancy and zoom out again to continue (Category D). Moreover, when participants were zooming in and out to inspect the detail, they might have tapped on an image by accident, as the images in such tests occupied the whole screen (Category A). Such issues suggest that image recognition based CAPTCHAs need to use clearer instructions to avoid confusion and to choose images that are more appropriate (e.g., in clarity, representation and distinctiveness) for display on a mobile screen. Moreover, they should provide a mechanism that allows users to restart after making mistakes instead of carrying on with the test.
There was also a specific issue with Google ReCAPTCHA v2 when a test asked participants to select all images in one of two scenarios: (1) click verify once there are none left, and (2) if there are none, click skip. Some participants were unaware of the dynamics of the first scenario. That is, if they chose only the relevant images displayed initially and clicked verify without continuing to select the new images that appeared, the test would fail.
Last but not least, it was found that most Category D errors reported with KeyCAPTCHA were due to the lack of a quick restart button. When participants moved puzzle pieces into the wrong places, it was sometimes not easy to locate those wrong ones as they were already blended into the image. In that case, many of them preferred to reload the form to start again.
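The error measure used in this section can be sketched as follows. This is a sketch under stated assumptions: the function names are hypothetical, the example numbers are illustrative, and the absolute difference is an assumption made here so that missing actions also count, since the text only speaks of "the difference" between actual and minimal actions.

```python
from collections import Counter

# The four error categories used in this section.
CATEGORIES = {
    "A": "Accidental",
    "U": "Unnecessary",
    "D": "Defective",
    "M": "Misinterpreted or misjudged",
}


def error_count(actual_actions, minimal_actions):
    """Errors as the gap between the actions performed and the minimum needed.

    Assumption: the absolute value is taken so that fewer actions than
    the minimum (missing actions) are also counted as errors.
    """
    if actual_actions < 0 or minimal_actions < 0:
        raise ValueError("action counts must be non-negative")
    return abs(actual_actions - minimal_actions)


def tally_by_category(codes):
    """Count logged errors per category, reporting zero for unused ones."""
    unknown = set(codes) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown category codes: {sorted(unknown)}")
    counts = Counter(codes)
    return {code: counts.get(code, 0) for code in CATEGORIES}


# Illustrative values only: 9 actions were logged for a test that
# needs 6, and the observer labelled the 3 surplus actions.
errors = error_count(9, 6)
breakdown = tally_by_category(["A", "D", "D"])
print(errors, breakdown)
```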

System Usability Scale (SUS)
The SUS scores for all CAPTCHAs are shown in Figure 7 (min: 65%, max: 85%, SD: 6.07%, Mean: 74.29%), where the benchmarking score is 62.5%. The highest SUS score, 85%, was found with TapCHA. Note that various studies have suggested that an average system should achieve a SUS score of at least 60 (Bangor et al. 2009; Lewis & Sauro 2009).
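For reference, a single SUS questionnaire is scored with the standard SUS procedure: ten items answered on a 1 to 5 scale, where odd-numbered items contribute the answer minus 1, even-numbered items contribute 5 minus the answer, and the sum is multiplied by 2.5 to give a score from 0 to 100. The sketch below implements that procedure; the function name and the sample responses are illustrative and are not taken from the study.

```python
def sus_score(responses):
    """Score one SUS questionnaire (ten answers on a 1-5 scale) from 0 to 100."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for item, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError(f"item {item}: answer must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if item % 2 == 1 else (5 - answer)
    return total * 2.5


# Illustrative responses only.
neutral = sus_score([3] * 10)
favourable = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
print(neutral, favourable)
```

The score for a scheme is then the mean of the individual participants' scores, which is the kind of per-scheme figure reported above.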

DESIGN IMPLICATIONS
Based on the discussions above, the following guidelines for designing mobile friendly CAPTCHA schemes are proposed.
Guideline 1: The CAPTCHA scheme must be designed in a way to only rely on native gesture interactions on mobile devices (e.g., tap and swipe).
Our study shows that CAPTCHA schemes that are native to mobile devices, such as sweetCaptcha, visualCAPTCHA and TapCHA, generally performed better than the compatible ones, which map desktop interactions onto touch gestures (e.g., click -> tap, drag and drop -> swipe), such as Google ReCAPTCHA v2 and KeyCAPTCHA.

Guideline 2: The CAPTCHA scheme should utilise system feedback methods to help users monitor their actions constantly when solving the challenge. This contrasts with character-recognition based CAPTCHAs, where such feedback is only needed at the challenge level (i.e., after typing all characters and submitting the answer). For example, when an object is tapped, haptic feedback could be triggered and a corresponding colour change could occur to confirm the selection with the user and help them identify potential issues.
Guideline 3: The CAPTCHA test and the objects presented in the test must be rendered properly on the mobile screen without the need for any additional user action such as pinching (zooming in/out) and swiping (scrolling up/down and/or left/right). For example, the CAPTCHA test needs to fit within one screen to avoid excessive or unnecessary scrolling. Moreover, any object rendered and presented in the CAPTCHA test must be easily recognised and distinguished. If real-life image objects are used, the level of detail should be appropriate to minimise the risk of misinterpretation (e.g., Google ReCAPTCHA version 2). Our study shows that defective user errors were found with all CAPTCHA schemes tested and unnecessary errors were reported with 6 out of 7, all of which are object recognition based.
Guideline 4: The CAPTCHA scheme should be designed to support redo and undo so as to help users recover from errors. It should be noted that supporting redo and undo does not mean the system should provide feedback on the outcome of each completion, as this would weaken the security of a CAPTCHA. For example, a restart/reset button can be provided to help users quickly restart the same test after making mistakes instead of carrying on with the test.
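The per-action feedback and the undo/restart support discussed above can be modelled as a small state object. This is a hedged sketch rather than the design of any scheme tested: the class `TapChallenge`, its method names and the object identifiers are all hypothetical. The point it illustrates is that each tap only reports whether an object is now selected (so a user interface could highlight it or trigger haptic feedback), while correctness is revealed only when the whole challenge is submitted.

```python
class TapChallenge:
    """Selection state for a tap-based challenge with undo and reset."""

    def __init__(self, object_ids, correct_ids):
        self._objects = set(object_ids)
        self._correct = set(correct_ids)
        if not self._correct <= self._objects:
            raise ValueError("correct objects must be part of the challenge")
        self._selected = []  # ordered, so undo can drop the latest selection

    def tap(self, object_id):
        """Toggle an object and report whether it is now selected.

        The return value says nothing about correctness, so it can
        drive a highlight without leaking the answer.
        """
        if object_id not in self._objects:
            raise KeyError(object_id)
        if object_id in self._selected:
            self._selected.remove(object_id)
        else:
            self._selected.append(object_id)
        return object_id in self._selected

    def undo(self):
        """Drop the most recently selected object, if any."""
        if self._selected:
            self._selected.pop()

    def reset(self):
        """Restart the same test with nothing selected."""
        self._selected.clear()

    def selected(self):
        return list(self._selected)

    def submit(self):
        """Reveal the outcome only for the complete answer."""
        return set(self._selected) == self._correct


# Hypothetical objects: the user taps a wrong image, undoes it,
# then selects the two correct ones.
challenge = TapChallenge(["cat", "dog", "car", "tree"], ["cat", "dog"])
challenge.tap("car")
challenge.undo()
challenge.tap("cat")
challenge.tap("dog")
print(challenge.selected(), challenge.submit())
```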

CONCLUSION AND FUTURE WORK
In this paper, we examined the usability performance of 7 CAPTCHAs representing 5 different types of mobile friendly CAPTCHA schemes in a user study with 20 participants. The purpose of this study was to investigate the best design practices based on usability issues identified in these CAPTCHAs rather than to compare them to find out which one is more advantageous. The results show that most schemes outperformed the benchmarking Google ReCAPTCHA v1 in all aspects, indicating their suitability for mobile devices. However, we also noticed that all of these schemes presented usability issues to a certain extent in the test. Interestingly, some issues identified in our study, such as instruction clarity and the lack of support for redo/undo, are often seen as more general design issues for all computer systems (Nielsen 1994; Reynaga et al. 2015). This suggests that designers of a mobile friendly CAPTCHA scheme should not only follow specific design guidelines but also check universal system design guidelines. The list of design guidelines proposed here is based on these findings and needs to be further validated by including more CAPTCHA schemes in the future. It should be noted that a well-designed CAPTCHA scheme should address both usability and security requirements to a high standard. This means the study also needs to be extended to look into the security issues of interactive CAPTCHAs, as overemphasising usability may have an impact on security, which is also essential to a CAPTCHA design.