Volume 5, Issue 3. DOI: 10.1037/tmb0000137
Being surrounded by others has enabled humans to optimize everyday life tasks, from cultivating fields to industrial assembly lines. The mere presence of others can increase an individual’s arousal, resulting in better performance on familiar tasks. However, the presence of an audience can also be detrimental to an individual’s performance, especially when arousal becomes excessive. Still, it is unclear what happens when these “others” include artificial agents, such as humanoid robots. The literature shows mixed results on whether robots act as facilitators or distractors in joint tasks. Thus, to understand the impact that the presence of a robot might have on human attentional mechanisms, we designed a visual search-based game that participants could play alone, under the surveillance of a humanoid robot, or in collaboration with it. Thirty-seven participants completed this experiment (age = 26.44 ± 6.35, 10 males). Attentional processes were assessed using performance metrics (i.e., search times, accuracy) and eye-tracking metrics (i.e., fixation duration, fixation count, and time to first fixation). Results showed that the presence of the robot negatively affected participants’ performance in the game, with longer search times and times to first fixation when the robot was observing them. We hypothesize that the robot acted as a distractor, delaying the allocation of attentional resources to the task and potentially exerting monitoring pressure.
Keywords: human–robot interaction, social presence, attention, monitoring pressure
Funding: This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (Grant 715058 awarded to Agnieszka Wykowska, titled “InStance: Intentional Stance for Social Attunement”). The content of this article is the sole responsibility of the authors. The European Commission or its services cannot be held responsible for any use that may be made of the information it contains.
Disclosures: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
Data Availability: The data, analyses, and study materials for this project have been made publicly available for other researchers. These resources can be accessed via the Open Science Framework at https://osf.io/p24zw/. In the same folder, users can also access the anonymized data of participants and the R script used for data analyses. The data reported in this study have not been used in any prior publications or projects.
Open Access License: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND). This license permits copying and redistributing the work in any medium or format for noncommercial use provided the original authors and source are credited and a link to the license is included in attribution. No derivative works are permitted under this license.
Correspondence concerning this article should be addressed to Agnieszka Wykowska, Istituto Italiano di Tecnologia, Via Morego, 30, 16163, Genova, Italy. Email: [email protected]
Living in social groups has been claimed to be beneficial for many animal species (Krause & Ruxton, 2002). The increased survival rate, accelerated brain development, and enhancement of cognitive abilities are just some of the many advantages of coexisting with other individuals (Neumann, 2009). Throughout the history of humankind, being surrounded by our conspecifics not only granted us a higher survival rate but also allowed for the optimization of everyday life tasks. From cultivating fields to industrial assembly lines, collaboration with others has been crucial for boosting productivity and reducing the workload of an individual.
When it comes to motor tasks, such as coordination or athletic performance, the mere presence of “others” seems to increase an individual’s physiological (and psychological) arousal. Arousal is related to preparatory (or reactive) activity of the body that allows one to be ready to act when the environment requires fast (re)action (e.g., response to threat). Arousal can be defined as an elevated state of activation of the autonomic nervous system, characterized by increased physiological responses such as heart rate, respiration rate, and skin conductance, as well as heightened cognitive alertness (Mullen et al., 1997). Arousal can be influenced by the social context, particularly the presence of others, and it plays a crucial role in modulating an individual’s performance on tasks: it enhances performance on simple or well-learned tasks (facilitation effect) and potentially impairs performance on complex or novel tasks (detrimental effect). In other words, if a task is familiar, a boost in physiological arousal related to social presence tends to result in better performance, through an increase in the frequency of dominant responses (i.e., responses with the greatest habit strength; Platania & Moran, 2001). This effect was initially studied within the drive theory of social facilitation framework (Triplett, 1898; Zajonc, 1965), giving rise to several different models and hypotheses (see, e.g., Carver & Scheier, 1981; Strauss, 2002). Although this theory was originally grounded in observations of facilitation during motor behavior, the presence of others seems to affect individuals’ performance at a cognitive level as well.
In a functional magnetic resonance imaging (fMRI) study, Chib et al. (2018) investigated the neural correlates of social facilitation using a monetarily incentivized reward paradigm. The authors found increased connectivity between participants’ ventral striatum and dorsomedial prefrontal cortex in trials in which social facilitation occurred (i.e., when an audience was watching them). The authors claim that the increased brain activity in the ventral striatum reflects motivational encoding occurring when other individuals watch participants during the task. This might indicate that when we are “watched” by someone while performing a task, we are more motivated to perform better or to focus more on the task than when we perform the same task alone.
Providing a systematic review of social facilitation theories goes beyond the scope of the current article, but it is important to highlight that according to most social facilitation accounts, the (social) presence of an audience can improve individuals’ performance (Steinmetz & Pfattheicher, 2017). Still, there are times in our daily life in which the presence of another individual is not necessarily beneficial. Every one of us has experienced the (sometimes unpleasant) feeling of being observed by someone while performing a casual task, such as cooking or writing an essay. Indeed, even during daily activities, people tend to behave differently when they think they are being watched by someone else, a phenomenon known as the Hawthorne effect (McCambridge et al., 2014). Whether it is simply walking down the street or giving an athletic performance, having an audience observing you can be either facilitating or detrimental, depending on task demands, context, and personality traits (see, e.g., Kobori et al., 2011; Yoshie et al., 2011). The Hawthorne effect has been studied and debated within several theoretical frameworks, highlighting the complex interplay between social presence, contextual factors, and individual differences (Anderson-Hanley et al., 2011; Uziel, 2007).
Several studies have examined the potentially detrimental effects of social presence on individuals’ performance (Yoshie et al., 2016). Indeed, elevated arousal levels can improve individuals’ performance, but only up to a certain point, especially if the task (or the environment) is unfamiliar to the individual (Bond & Titus, 1983). Therefore, in conditions under which arousal becomes excessive, performance might deteriorate dramatically. This effect is also known as the Yerkes–Dodson law, which postulates that an optimal level of arousal can help individuals focus on a task, while an excessive level of arousal can impair an individual’s ability to concentrate. Under certain circumstances, the increase in an individual’s arousal can be connected with feelings of anxiety and stress (Janelle, 2002). In particular, the phenomenon of “choking” (i.e., performing worse than one’s actual skills would allow) under monitoring pressure presents a fascinating counterpart to social facilitation theory (see, e.g., Belletier et al., 2015; Cañigueral & Hamilton, 2019).
In an fMRI study, Yoshie et al. (2016) found that the presence of an evaluative audience can worsen participants’ fine motor performance. While engaged in a motor task (i.e., a feedback-occluded isometric grip task), participants reported higher subjective anxiety in the “observed” condition compared to the “unobserved” one. Interestingly, in the “observed” condition, the authors reported increased activation of the posterior superior temporal sulcus compared to the “unobserved” condition. As the posterior superior temporal sulcus is claimed to be a key neural substrate for social perception based on visual information (Nummenmaa & Calder, 2009), the authors argue that individuals performing the task needed to allocate additional attentional resources to monitor external observers. Such costs (i.e., the reallocation of attentional resources) might conflict with the execution of the task. In another study, Belletier et al. (2015) investigated the same phenomenon, which they define as “monitoring pressure.” The authors observed that being watched by an evaluative audience leads individuals to choke on executive control tasks. This is in line with the distraction-conflict theory (Baron, 1986), which states that, when an individual is performing a task, the mere presence of others generates an attentional conflict between attending to the observers and attending to the task.
The impact of social presence on performance has been widely studied in the context of human–human interaction (Mnif et al., 2022). However, despite a growing body of literature investigating the impact of artificial agents’ presence on performance in collaborative tasks (Riether et al., 2012; Spatola et al., 2019), it remains to be understood whether collaborative artificial agents facilitate or impair individuals’ performance. This question is becoming of great importance, as interaction with social robots is increasingly present in our lives (Vincent et al., 2015). From industrial to clinical applications, robots are becoming part of our everyday activities (Korn, 2019). Within this context, when we mention interaction, we refer to the dynamic exchange of information, (social) signals, or actions between a human and a robot, involving reciprocal influence and responsiveness. In human–robot interaction (HRI), interactions encompass a range of modalities, including verbal communication, nonverbal cues, gestures, and actions, usually expressed bidirectionally, with both the human and the robot contributing to the ongoing communication and shaping the overall nature of the interaction.
Past research demonstrated that the presence of a robot gazing at individuals while they perform a task modulates attentional orienting (Kompatsiari et al., 2017; Kompatsiari, Ciardo, et al., 2021), social decision making (Belkaid et al., 2022), and engagement. Those studies showed that while mutual gaze is indeed engaging at both the behavioral and neural levels (Kompatsiari, Bossi, & Wykowska, 2021; Kompatsiari et al., 2022), it impairs individual performance and cognitive control (Belkaid et al., 2022). On the other hand, in two studies, Spatola et al. (2018, 2019) found that individuals’ performance in a Stroop task (see MacLeod, 2005) improved when participants were observed by a social robot compared to when they were observed by a nonsocial robot or not observed at all (Spatola et al., 2019). These results speak in favor of a social facilitation effect in human–robot interaction. However, in a recent work using a similar Stroop paradigm, Koban et al. (2021) did not find the same results. The authors found no substantial difference in individuals’ performance when participants played alone and when they were observed by a robot or a human, suggesting the existence of additional contextual factors that might influence social facilitation processes in human–robot interaction (e.g., an individual’s familiarity with the task or with the observer; task difficulty).
One potential reason for such conflicting findings might be that the paradigms usually adopted to study social facilitation in human–robot interaction rely on classical attentional tasks designed to be cognitively demanding, but not interactive (e.g., the Stroop task; see MacLeod, 2005). In particular, the complexity of the abovementioned tasks might not be ideal for studying how individuals perform in HRI in daily life. In fact, the cognitive demand elicited by artificial lab tasks, together with the unfamiliarity of the agent, might bias results toward a worsening of performance not because of the robot per se, but because of unfamiliarity with the entire context. On the contrary, defining HRI scenarios that are more ecologically valid might provide useful insights into the social cognition processes occurring in humans during spontaneous interactions. Indeed, in a recent work, Irfan et al. (2018) highlighted the importance of adapting classical experimental paradigms to more natural human–robot interaction environments, with the aim of increasing the ecological validity of the approach. Furthermore, performance measures (i.e., accuracy, response times) are often the only indicators used to assess participants’ cognitive (and attentional) engagement during human–robot interaction experiments. Recent literature has demonstrated the potential that other measures of attention, such as eye-tracking metrics, may have in exploring the social cognition processes that underpin human–robot interaction (see Baxter et al., 2014; Kompatsiari et al., 2019).
Based on the conflicting literature on social facilitation/inhibition in HRI, it is not easy to define a clear a priori hypothesis on the effect that a robot’s presence might have on users interacting with it. We may hypothesize that, by nature, robots are not familiar entities for humans. Thus, the request to be in the presence of (or interact with) a robot might exert a certain degree of pressure, which might hinder task performance, adding complexity to an already unfamiliar context (i.e., the experimental setting). To explore the effect of the presence of a robot on the performance of individuals accomplishing a task, we designed a novel visual search game, in which participants were asked to find a letter hidden within naturalistic photographs. To increase the naturalness of the experiment, the game was designed to be as engaging and intuitive as possible for the participants. We assumed that this would minimize the impact that boredom and fatigue could have on the results. For the same reason, the task was designed to be short, lasting approximately 20 min (depending on the speed of the participant). The game was designed to be played both alone and with another player, and to give participants the freedom to explore the environment (including the social environment) during the experiment. To monitor subtle changes in the allocation of attentional resources, we asked participants to wear a mobile eye-tracker during the task.
We decided to focus on three eye-tracking metrics previously used in research on visual attention (Cheng et al., 2022; Ziv, 2016), namely fixation duration, fixation count, and time to first fixation. The importance of eye-tracking metrics for better understanding human cognition has been recognized in both human–computer interaction (Joseph & Murugesh, 2020) and human–robot interaction (Admoni & Scassellati, 2017) studies. Specifically, fixation duration refers to the length of time an individual’s eyes remain relatively still and focused on a specific region of their visual field. It represents the temporal aspect of visual attention and can be employed to analyze the focus of attention on particular locations within a scene or stimulus. For example, longer fixation durations suggest sustained interest in, or cognitive processing of, the information present in that region (Irwin, 2004). This metric is often analyzed along with fixation count, which is calculated as the number of times a person’s gaze comes to rest (fixates) on a specific point or region of interest. It represents how often the eyes pause or focus on a particular area and can help identify salient features or regions. For example, higher fixation counts in specific areas indicate the importance or relevance of those features in capturing visual attention (Behe et al., 2015). Another metric of interest is the time to first fixation, which represents the time it takes for an individual’s eyes to first settle on a particular point or region of interest after stimulus presentation. It provides insights into the speed of initial attention orienting and is a valuable metric for understanding the immediate visual impact of stimuli. For example, faster times may indicate the salience or significance of specific elements or events, reflecting their quick capture of attention (Wermes et al., 2017). On the contrary, slower times may indicate that the individual was distracted when a certain stimulus was presented. Altogether, these metrics can help us understand where individuals direct their visual attention in social scenes, providing insights into the processing of social information. Thus, we can claim that fixation duration, fixation count, and time to first fixation are robust indicators of cognitive engagement and allow one to infer the level of processing or interest in social content as well.
As these metrics have been widely adopted to assess attentional processes, we decided to include these eye-tracking data along with participants’ performance measures to examine the effects of the robot’s presence on the task we designed.
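To make these definitions concrete, the sketch below shows how the three metrics could be computed from fixation-level data in R, the language of our analysis scripts. The data frame and its column names are illustrative assumptions for demonstration, not the actual eye-tracker export format:

```r
# Illustrative sketch: computing the three metrics from fixation-level data.
# Columns (assumed): trial ID, fixation onset relative to stimulus onset (s),
# fixation duration (ms), and whether the fixation fell inside the AOI.
library(dplyr)

fixations <- data.frame(
  trial   = c(1, 1, 1, 2, 2),
  onset_s = c(0.21, 0.65, 1.40, 0.33, 0.90),
  dur_ms  = c(180, 240, 310, 205, 260),
  in_aoi  = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

metrics <- fixations %>%
  filter(in_aoi) %>%
  group_by(trial) %>%
  summarise(
    fixation_count    = n(),          # how often gaze paused in the AOI
    mean_fix_dur_ms   = mean(dur_ms), # average fixation duration
    time_to_first_fix = min(onset_s)  # latency of the first fixation (s)
  )

metrics
```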
The first group of participants was asked to play the game alone. After 2 weeks, the same participants were asked to play the game again, but with the humanoid iCub robot (Metta et al., 2010) observing them. This first part of the experiment aimed at testing whether social facilitation/inhibition can occur in the presence of a robot. We hypothesized that unfamiliarity with the observing agent would hinder participants’ performance, leading to more frequent distractions. We then recruited a second group of participants to explore whether such hindering can be modulated by the behaviors and the role played by the robot during the interaction (i.e., a robot that is not just “observing” but is also able to play the same game). We hypothesized that the more the agent is involved in the task, the more the participant might be distracted, as part of their cognitive (or, at least, attentional) resources might be allocated to monitoring the robot’s behavior and/or performance. Thus, the second group of participants was asked to play a turn-taking version of the game, in which the robot was both observing and cooperating with them. We defined cooperation in this HRI context as a specific type of interaction that involves mutual efforts between a human and a robot to achieve a shared task. In our experiment, the task execution was indeed shared between the human and the robot, who took turns to achieve the best result together. The combination of these three versions of the game (i.e., “Solo,” “Observation,” and “Interaction”) allowed us to investigate further contextual aspects that might play a role in social facilitation/distraction, such as the difference between a “passive” robot observer and an “active” cooperative one. This difference might be of high relevance, as in everyday life interactions observers are not always “passive,” especially in scenarios where robots can be deployed as human assistants. It is of great importance to understand whether undesired effects (deteriorated performance) occur in contexts that are intended to be facilitatory (cooperative HRI).
Thirty-seven participants completed this experiment (age = 26.44 ± 6.35, 10 males). All participants reported normal or corrected-to-normal vision and no history of psychiatric or neurological diagnosis, substance abuse, or psychiatric medication. The first group of participants (n = 20, age = 24.11 ± 4.11, four males) played the visual search game twice. Participants in this group first played the task alone (solo condition); after 2 weeks, they played an identical game with the same task under the surveillance of an iCub robot (observation condition). The number of participants for this group was based on a previous study conducted by the authors using a screen-based task and similar dependent variables (see Ghiglino et al., 2020). Cohen’s d was equal to .78 for the analysis performed on fixation duration in the previous study. Thus, we used G*Power (see Faul et al., 2007) to estimate the minimal sample size required for the within-group comparison, considering a power of .95 and an error probability of .05. The output of the software suggested collecting a sample of at least 16 participants. The within-subjects design allowed for identifying the effects of the mere presence of the robot on participants’ performance. The second group of participants (n = 17, age = 28.89 ± 7.45, six males) played the same visual search game only once, taking turns with the same robot and cooperating with it to accomplish the task1 (interaction condition). This third condition was added as a pilot exploration of potential differences in performance due to the behaviors displayed by the robot (i.e., observation vs. interaction conditions). We recognize that such a small sample might have increased the risk of Type II errors (i.e., false negatives) for between-groups comparisons, and we discuss this aspect in the Limitations and future directions section below. In a between-subjects design, we aimed at comparing this group with the first group to understand whether active social presence (interaction) has a different impact on performance than mere passive social presence. Our experimental protocols followed the ethical standards laid down in the Declaration of Helsinki and were approved by the local Ethics Committee (Comitato Etico Regione Liguria). All participants provided written informed consent to participate in the experiment. Anonymized data collected for this study are available at a dedicated repository on the Open Science Framework (see Ghiglino, De Tommaso, & Wykowska, 2023), along with the script used for the main analyses, the stimuli used in the task, the Python code to run the experiment, and a video of the experimental setup.
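For transparency, the sketch below approximates the a priori power analysis described above in R using the pwr package rather than G*Power (an assumption on our part; the exact minimal n returned depends on the test family and the number of tails selected in the software):

```r
# Approximate reproduction of the a priori power analysis with the pwr
# package (the original analysis was run in G*Power, so small differences
# in the suggested n are possible depending on settings).
library(pwr)

pwr.t.test(
  d = 0.78,                # Cohen's d from Ghiglino et al. (2020)
  power = 0.95,            # desired statistical power
  sig.level = 0.05,        # alpha (error probability)
  type = "paired",         # within-group (solo vs. observation) comparison
  alternative = "greater"  # one-tailed; two-tailed would require a larger n
)
```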
The main task consisted of a visual search game. Participants were instructed to search for one letter (either a “Z” or an “M”) that was hidden within high-resolution images (3,008 × 2,000 pixels) of real-life environments.2 By pressing the left button of a Logitech Adaptive Gaming Kit, participants communicated that they had detected the “Z” letter, while by pressing the right button they communicated the detection of the “M” letter. For each trial, participants were given 30 s to find the hidden letter and were instructed to be as fast as possible. Such a relatively long timeout was provided to increase the chance that participants would distribute their attention across the entire environment, instead of focusing solely on the visual search task. For each correct answer, participants accumulated 1,000 points, while for each incorrect answer they lost 500 points. Timeouts were associated with the highest penalty, corresponding to a subtraction of 1,000 points. This choice was made to increase the chance that participants’ (overt) attention was focused mainly on the task, rather than solely on the exploration of the environment. Indeed, despite the high penalty in case of timeout, it is important to anticipate that the average response time of participants in this task was 12.26 ± 2.33 s, suggesting that, in most of the trials, participants were able to detect the letter well before the timeout occurred. The experimental setup comprised three screens (23.8′′ LCD screens, resolution: 1,920 × 1,080, see Figure 1): (1) one horizontal, central screen, located in front of the participant, at the center of the table; (2) one vertical lateral screen, located on the participant’s right; (3) a second vertical lateral screen, located on the participant’s left.
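As an illustration, the point scheme amounts to a simple payoff rule; the helper below is a hypothetical R sketch of it, not the actual task code (which was written in Python, as noted in the data availability statement):

```r
# Hypothetical sketch of the trial payoff rule described above.
trial_points <- function(outcome) {
  switch(outcome,
         correct   =  1000,  # target letter correctly identified
         incorrect =  -500,  # wrong button pressed
         timeout   = -1000)  # no response within the 30-s limit
}

trial_points("correct")  # 1000
trial_points("timeout")  # -1000
```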
Visual search stimuli were presented on the horizontal screen, named the “Stage,” while the lateral screens provided additional task-related information. Namely, the right screen, named the “Timer,” presented the time left for the participant to provide the answer, turning on at the beginning of each trial and turning off after the response was provided or the timeout occurred. In addition, after the first 10 s from the presentation of the stimulus on the Stage, if participants hesitated in providing the answer, a hint could appear on the Timer screen at any time before the timeout. When instructing our participants, we made clear that such a hint could appear randomly from the 10th to the 29th second of the trial, to make the environment as immersive as possible. The hint consisted of highlighting, on a black square, the quadrant of the stimulus in which the letter was hidden. At the end of each trial, based on the correctness of the participant’s response, the left screen, named the “Scorer,” turned on, presenting feedback and an update of the score (see Figure 2). The Timer and the Scorer complemented the Stage to make the task more engaging for the participant, simulating a gaming-like scenario. Specifically, we designed the setup to stimulate spontaneous shifts of attention toward the lateral screens, so that the robot was not the only “distracting” element in the environment. However, it is important to point out that attending to the lateral screens was not needed for completing the task, just as attending to the robot (when present) was not needed. This choice also allowed for isolating the impact of social presence per se in the within-group comparison between the solo and observation conditions, as both conditions involved other distractors (i.e., the lateral screens) and differed only in the social presence of the iCub robot.
The experiment comprised three conditions: (1) the solo condition, in which participants played the game alone; (2) the observation condition, in which participants played the game while an iCub robot was “observing” them; (3) the interaction condition, in which participants took turns with the iCub robot in playing the game. The solo and observation conditions comprised 30 trials each, while the turn-taking interaction condition comprised 60 trials (30 per player). During the solo and observation conditions, participants were told their scores would be compared with other players’ scores. During the interaction condition, participants were asked to cooperate with the iCub to achieve the highest team score, which would eventually be compared with other teams’ scores. At the end of each session, a fake top-10 ranking of the other players was presented, so that the actual player/team was always the winning one. In total, 19 participants underwent conditions (1) and (2), while 17 participants underwent only condition (3).
During each trial, we collected participants’ search times (STs), calculated as the time between the stimulus onset and the participant’s response. Participants performed the task wearing an eye-tracker (Tobii Pro Glasses 2), which also enabled collecting gaze-related data (i.e., fixation durations, fixation counts, and times to first fixation) and monitoring participants’ gaze location during each phase of the experimental sessions. It is important to note that we opted for a mobile eye-tracker to make the experiment as naturalistic as possible. We acknowledge that using a static eye-tracker and a chinrest would have allowed us to extract additional (and potentially more detailed) eye-tracking metrics, such as precise heatmaps and gaze trajectories. However, this would have added constraints to participants’ behavior, which we wanted to avoid. Instead, the use of the Tobii Pro Glasses 2 allowed participants to freely move their head (and upper body) during the task, which we believe made the task more engaging and ecologically valid. Still, mobile eye-trackers can provide robust metrics when it comes to the characteristics of fixations (Skaramagkas et al., 2023). Before and after the observation and the interaction conditions, participants were also asked to complete the Intentional Stance Test (IST; see Marchesi et al., 2019), as part of a side project unrelated to the task, aimed at assessing changes in participants’ tendency to adopt the intentional stance in HRI scenarios across different experiments. Even though analyses and discussions of the IST data go beyond the scope of the current article, we report the results of the IST in the Results section.
The behavior of the robot differed between the observation and the interaction conditions. In the former, the robot’s behavior was limited to “monitoring” the participant during the task. Specifically, during each trial the robot mainly observed the Stage screen and turned its head only when participants were looking at one of the lateral screens. In those cases, the robot slightly moved its head and eyes toward the center of the lateral screen at which the participant was looking, kept that pose for 500 ms, and then returned to the Stage screen. In addition, during the short pauses between trials, the robot gazed at the participants, waiting for them to start a new trial before fixating the Stage screen again. The head turns were introduced to make the behavior of the robot contingent on the participants’ behavior, potentially increasing the impression of “monitoring.” During the interaction condition, we kept all the elements of the observation condition during the participants’ turns, and we introduced additional behaviors that the robot displayed when actively playing the game during its own turns. To maximize the human-likeness of this condition, the robot’s behavior was derived from the participants’ recordings collected during the observation condition. Specifically, the robot was programmed to perform as an “average” participant in terms of STs and accuracy. The accuracy of the robot was defined to be identical across participants, meaning that the robot responded correctly in 26 trials and incorrectly in four trials. The robot provided its responses between 9.98 and 14.18 s after the appearance of the stimulus, based on the average response time extracted from participants during the observation condition (i.e., 12.08 ± 2.10 s). In addition, before providing its answers, the head and gaze of the robot were programmed to switch from the Stage to the Timer with the same timing and frequency as an average participant (i.e., four times during its first two trials, two times during the second, third, and fourth trials, and then one time every second trial until the end of the game). After providing its response, the robot immediately turned its head and eyes toward the Scorer. When the feedback on the Scorer disappeared, the robot was programmed to gaze back at the participant to prompt them to start their turn. The robot provided its responses by keypress using both hands, exactly as the participants did. In both the observation and interaction conditions, the robot kept a neutral posture, and the only moving parts were the head, the eyes, and the wrists (the latter only for pressing the buttons to provide responses, and only during the interaction condition).
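To illustrate, the robot’s per-trial response parameters in the interaction condition could be summarized as in the following R sketch. Uniform sampling within the reported interval is our assumption for illustration; the description above specifies only the bounds (9.98–14.18 s) and the fixed 26/4 accuracy split:

```r
# Sketch of the robot's per-trial parameters in the interaction condition.
set.seed(1)
n_trials <- 30

robot_trials <- data.frame(
  trial      = seq_len(n_trials),
  # response latency within the reported bounds (uniform sampling assumed)
  response_s = runif(n_trials, min = 9.98, max = 14.18),
  # fixed accuracy across participants: 26 correct, 4 incorrect responses
  correct    = sample(c(rep(TRUE, 26), rep(FALSE, 4)))
)

head(robot_trials)
```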
To explore the effects of the presence and the role played by the robot on participants’ performance during the task, we fitted a series of mixed models on behavioral and eye-tracking data, using RStudio (RStudio Team, 2022). For all the models, we used the same approach, considering the metrics acquired during the experiment as separate dependent variables and a by-subject random intercept as the random factor. We used within-group comparisons to assess the effect related to the mere presence of the robot, by contrasting the solo condition with the observation condition. Then, we used between-groups comparisons to assess the effect related to the role of the robot, by contrasting the observation condition with the interaction condition. Thus, the experimental condition was included as the fixed factor within each model.
Regarding the performance metrics, we analyzed STs. We refer here to STs instead of reaction times because participants were instructed to search for a hidden element in a visual search task rather than to react to the appearance of a stimulus. Given the positively skewed distribution of STs, we adopted two separate generalized linear mixed models (GLMMs) to analyze the data, in which we considered the inverse γ distribution as a reference for the model. Even though participants showed high accuracy during the task (87.27% on average), we also investigated the effect that the presence of the robot exerted on the number of correct responses provided during the game, as an additional indicator of task performance.
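A minimal sketch of such a model is shown below, assuming that the “inverse γ distribution as a reference” corresponds to a Gamma family with an inverse link function in lme4 (our interpretation); the data are simulated for illustration only:

```r
# Minimal sketch of the ST model: Gamma family with inverse link (assumed),
# condition as fixed factor, by-subject random intercept.
library(lme4)
set.seed(1)

st_data <- expand.grid(subject   = factor(1:20),
                       condition = c("solo", "observation"),
                       trial     = 1:30)
# Simulated STs with means near the observed 11-12 s (toy data only)
st_data$st <- rgamma(nrow(st_data), shape = 30,
                     rate = 30 / ifelse(st_data$condition == "solo", 11, 12))

st_model <- glmer(st ~ condition + (1 | subject),
                  data = st_data,
                  family = Gamma(link = "inverse"))
summary(st_model)
```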
For eye-tracking data, we defined three main areas of interest (AOIs) a priori: (1) the Stage; (2) the Timer; and (3) the Scorer. Overall, 88.12% of total fixations were recorded within the Stage AOI, 4.56% within the Timer, and 7.33% within the Scorer. Considering the insufficient amount of data points within the non-Stage AOIs, we focused our analyses only on the Stage area. We did not include the robot as an area of interest for this task due to the technical constraints of the mobile eye-tracker. More specifically, to keep eye-tracking data processing and analyses as free from artifacts as possible, we placed static markers at each corner of the three screens, which allowed us to extract precise features from the images collected with the eye-tracker. This was possible because the screens had precise positions that were kept constant across the conditions. As the main area of interest for the robot would have been the head, but the head was moving, we could not achieve the same level of precision. Thus, for the sake of robustness, we focused only on the static elements of the setup (i.e., the three screens).
We considered three main parameters for data analyses. As indicators of the attentional engagement displayed by participants during the task, we analyzed fixation duration (i.e., the amount of time a person’s eyes remained still and focused on a specific region of interest) and fixation count (i.e., the number of times a person’s gaze paused and remained still on a particular region of interest; Ghiglino et al., 2020; Papageorgiou et al., 2014). As an indicator of participants’ attentional focus on the task, we also analyzed time to first fixation (i.e., the time it takes for a person’s eyes to initially focus on a specific region of interest; Underwood & Foulsham, 2006). All these metrics were log-transformed before statistical analysis, to make the distributions suitable for applying separate linear mixed models.
Analyses were conducted using the lme4 package (Bates et al., 2014) in R. We report t statistics along with p values and parameter estimates (β), corrected using the Satterthwaite approximation for degrees of freedom (Kuznetsova et al., 2017), to show the magnitude of single effects, with bootstrapped 95% confidence intervals (Efron & Tibshirani, 1994). Due to the way mixed models partition variance, and the lack of consensus on the calculation of standardized effect sizes for individual model terms (Rights & Sterba, 2019), only t and p values are associated with the main effects in the Results section. Nevertheless, it is important to point out that mixed models allow for better control of Type I errors than alternative approaches (Matuschek et al., 2017). In addition, all comparisons report parameter estimates (β) and their confidence intervals, providing a measure of effect size (Kumle et al., 2021; Meteyard & Davies, 2020). We report the mean values of each dependent variable by condition to ease the interpretation of the results.
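For reference, the sketch below illustrates this modeling pipeline for one log-transformed eye-tracking metric, assuming lmerTest for the Satterthwaite approximation and confint() for the bootstrapped confidence intervals; data and variable names are simulated and illustrative:

```r
# Sketch of the eye-tracking models: lmerTest supplies Satterthwaite-
# approximated degrees of freedom; confint() bootstraps CIs for the
# fixed effects (beta). Toy data only.
library(lmerTest)
set.seed(1)

et_data <- expand.grid(subject   = factor(1:20),
                       condition = c("solo", "observation"),
                       trial     = 1:30)
# Simulated fixation durations with a small by-subject effect
subj_eff <- rnorm(20, 0, 0.1)
et_data$fix_dur <- exp(4 + subj_eff[et_data$subject] +
                         rnorm(nrow(et_data), 0, 0.3))

fix_model <- lmer(log(fix_dur) ~ condition + (1 | subject), data = et_data)
summary(fix_model)  # t statistics with Satterthwaite df

# Bootstrapped 95% CIs for the fixed-effect estimates
confint(fix_model, parm = "beta_", method = "boot", nsim = 1000)
```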
To assess the effect of the presence and the role played by the robot on task performance, we analyzed the differences in participants’ average STs across conditions, within and between participants, using separate GLMMs. Within-subjects comparisons revealed a main effect of the presence of the robot during the task, β = −0.007, t(19) = −2.55, p = .011, 95% CI [−0.013, −0.002]; Figure 3. Namely, when the robot was present, participants performed more slowly than when they performed the task alone (MSolo = 11.08 ± 1.94; MObservation = 12.02 ± 2.14). Between-subjects comparisons revealed a main effect of the role played by the robot during the task, β = −0.011, t(34) = −2.54, p = .016, 95% CI [−0.020, −0.003]; Figure 3. Specifically, when the robot was playing with the participants (the interaction condition), they performed more slowly than when the robot was just observing them (MObservation = 12.02 ± 2.14; MInteraction = 13.86 ± 2.14). Although the comparison between the solo and the interaction conditions was significant, t(36) = 773.2, p < .001, with the STs of the interaction condition being longer than in the solo condition, the eigenvalue of the Hessian (inverse curvature matrix) at the maximum likelihood led the model not to converge (for details, see the description of the lme4 package, version 1.1-29, at https://cran.r-project.org/web/packages/lme4/lme4.pdf). As the reference model for this comparison was nearly unidentifiable, the difference between the solo and interaction conditions will not be discussed further.
As a further metric of task performance, we then analyzed participants’ accuracy in finding the target letters during the task. We compared the three conditions using a general linear model (GLM) approach. Our results showed a significant difference in accuracy among conditions, β = 1.64, t(36) = 2.29, p = .026, 95% CI [0.24, 3.05]; MSolo = 26.9 ± 1.77; MObservation = 26.70 ± 2.70; MInteraction = 25.06 ± 1.89; Figure 4. Post hoc comparisons revealed a significant difference between the solo and the interaction conditions, β = −1.84, t(36) = −2.57, p = .034, 95% CI [−3.57, −0.11], but only a trending difference between the observation and the interaction conditions, β = −1.64, t(36) = −2.29, p = .066, 95% CI [−3.37, 0.09]. No difference was found between the solo and the observation conditions in terms of accuracy (p > .05).
The effects of the presence and role of the robot on participants’ attentional engagement were analyzed by examining variations in fixation duration across conditions, using separate GLMM. Within-participants comparisons showed no effect due to the presence of the robot (p > .05; MSolo = 53.71 ± 11.25; MObservation = 49.6 ± 15.27). Similarly, between-participants comparisons showed no effect due to the role of the robot (p > .05; MInteraction = 49.5 ± 12.17).
Similarly to the analysis on accuracy, we compared fixation counts across the three conditions using a GLM. Results showed no effect due to the presence of the robot (p > .05) and no effect due to the role of the robot (p > .05; MSolo = 26,813.3 ± 14,629.23; MObservation = 26,586.2 ± 13,575.09; MInteraction = 24,346.6 ± 7,102.38).
We also analyzed the effect of the presence/role of the robot on the initiation of engagement, by examining variations in times to first fixation (TTFF) across conditions, using separate GLMs. Within-participants comparisons revealed a main effect of the presence of the robot during the task, β = 0.309, t(19) = 2.60, p = .013, 95% CI [0.008, 0.543]; Figure 5. Specifically, the mere presence of the robot during the observation condition negatively affected the initiation of engagement in the task, delaying participants’ TTFF compared to the solo condition (MSolo = 0.08 ± 0.16; MObservation = 0.76 ± 1.66). Between-participants comparisons revealed no effect due to the role played by the robot during the task, β = −0.238, t(34) = −1.75, p = .090, 95% CI [−0.514, 0.039]; MObservation = 0.76 ± 1.66; MInteraction = 0.19 ± 0.33; Figure 5.
To explore whether participants in the interaction condition perceived the robot differently than participants in the observation condition, we also analyzed the scores of the IST, administered before and after each of those conditions. Results showed no significant difference between pre- and postscores (p > .05; MPre = 39.64 ± 21.56; MPost = 37.59 ± 23.00), or between conditions (p > .05; MObservation = 42.01 ± 22.07; MInteraction = 32.16 ± 21.31). In addition, no significant interaction between the time of administration and the condition was found (p > .05), suggesting that the presence or behavior of the robot in this task did not affect participants’ likelihood of adopting the intentional stance toward the robot.
The present study aimed at investigating the impact of the social presence of a robot on human performance in an attentional task. To meet this aim, we designed a game that participants could play alone, under the gaze of the robot, or jointly with it, by taking turns.
Our first result on STs revealed that the presence of the robot negatively affected participants’ performance in the game. Specifically, participants became slower in providing their responses when the robot was observing them, compared to when they played the task alone. This is in line with the body of research highlighting the detrimental effect that social presence sometimes exerts on individuals’ performance. In human–human interaction, choking under monitoring pressure may result from distraction, as well as from the interference of self-focused attention with the execution of automatic responses (Baumeister & Showers, 1986).
Past research also showed that completing a task in front of a robot observer elicits a higher perception of monitoring presence than completing the task alone or in front of another human (Woods et al., 2005). We can hypothesize that this is related to the unfamiliarity individuals might perceive when interacting with robots. Therefore, the mere presence of a robot observer might raise the feeling of monitoring pressure (Bond & Titus, 1983). We can also speculate that the presence of the robot led participants to redefine the context of the task, distributing their attentional resources between the observer and the task itself, as if the observer needed to be monitored as an integral part of the game.
This explanation in terms of distributed attentional resources is supported by the results we found on the TTFFs. Indeed, participants took longer to make their first fixation on the task when the robot was present, relative to when it was absent. This suggests that the robot acted as a distractor for the participants, delaying the allocation of attentional resources to the task and potentially exerting monitoring pressure. We can speculate that this detrimental effect on performance is due to the unfamiliarity that participants perceived with respect to the robot. Indeed, uncertainty about the unfamiliar has been vastly debated as one potential source of social distraction (Kagan et al., 1984). Robots, and humanoids in particular, may constitute for most humans ambiguous and unknown entities that cannot be subsumed under the category of “machines” but, at the same time, cannot be treated as natural agents either (Kahn & Shen, 2017). We speculate that such unfamiliarity contributed to the effect we found in this study. Indeed, performing a task in “co-presence” with such an unknown entity could require participants to “keep an eye” on it, drawing cognitive resources away from the task they are performing. Indeed, TTFFs are thought to reflect preattentive processes (Liechty et al., 2003) or covert attention (Henderson et al., 1989), meaning that they provide a measure of how quickly participants can shift their attention to a new stimulus in their visual field (van der Laan et al., 2015). Longer TTFFs usually indicate that participants are taking longer to process or react to the visual stimulus, which can have negative implications for task performance. Indeed, in our task, the observation condition resulted in longer TTFFs. This indicates that participants took longer to orient their attentional focus to the Stage (the visual search display), spending more time monitoring the robot when the robot was monitoring them.
This interpretation is also in line with the results we found when comparing the two roles played by the robot between the two groups of participants. Indeed, the moment the robot played an active role in the game, participants’ search times worsened even further. This suggests that, while performing the task, participants were monitoring the robot even more extensively. We can speculate that being aware of the robot’s action possibilities required participants to distribute additional peripheral attentional resources between the Stage screen and the artificial co-agent while performing the task. In addition, our results on participants’ accuracy revealed that when the robot assumed the role of interaction partner, participants tended to make more errors with respect to the observation condition. This might suggest that participants got distracted by the robot to a point where their performance dropped.
Interestingly, however, fixation duration data did not show entirely the same pattern. In the comparison between performance in the solo condition and the observation condition, fixation durations on the main screen with the task did not differ across these two experimental conditions. We hypothesize that this divergence in the data reflects two different sides of visual processing, that is, covert and overt attention (Itti & Koch, 2000). Indeed, the duration of fixations is a direct consequence of physically directing the eyes to a stimulus (the Stage screen in our task) and reflects overt attention (Kaspar & König, 2011; Wang et al., 2019) while the delay in STs and TTFFs in the observation condition (relative to solo condition) might be related to covert attentional processes. Therefore, the different pattern between fixation duration and TTFFs suggests that engagement of overt attention in the task does not change across the conditions, but covert attentional resources are being redistributed to monitor the robot when it becomes part of the environment. Indeed, in human–human interaction, when covertly monitoring others, attentional processes engage in a complex interplay involving sustained attention, selective attention, and inhibitory control. Sustained attention facilitates the maintenance of vigilance over the observed individual, while selective attention allows for the extraction of relevant information amidst competing stimuli. Inhibitory control suppresses overt signs of monitoring, enabling discreet observation without betraying awareness. However, this distribution of attentional resources across different mechanisms might have distracted our participants from their main task (i.e., the visual search game).
Interestingly, in the second comparison (observation vs. interaction conditions), neither fixation duration nor time to first fixation differed based on the role played by the robot, even though STs differed across these two conditions. We acknowledge that a potential explanation for the lack of significant differences in fixation duration across groups might be a lack of power. As also suggested by the relatively small β values reported as a measure of effect size, future studies involving eye-tracking metrics in human–robot interaction might need larger samples to clarify whether fixation duration is modulated by the role of an artificial agent. Indeed, for the between-groups comparison, assuming a medium effect size, around 50 participants would be needed to achieve optimal statistical power (i.e., 1 − β = .8). It is important to stress that the decision to run the interaction condition was made to explore potential differences that could be due to the role played by the robot, beyond its mere presence. However, this should be considered a pilot result to guide further exploration of the topic, rather than a definitive result. Indeed, despite the lack of proper power, this result might also suggest that there was an additional process, not related to attentional mechanisms, that delayed STs in the interaction condition. Perhaps the process was related to social cognition, for example, engaging theory of mind regarding the interaction partner (see, e.g., Belkaid et al., 2022; Siri et al., 2022). Belkaid et al. (2022) explored the effects of a humanoid robot’s gaze (mutual or averted) on how people strategically reason in social decision-making situations. In their experiment, participants’ decision times became longer when the robot was gazing at them, compared to when the robot was averting its gaze. Such a “delay” was paralleled by a differential effect in synchronized α activity during the period of eye contact, with higher α synchronization compared to averted gaze. These findings suggest that the more socially the robot behaved, the higher the need to suppress distracting information. Such suppression might affect performance, as happened in our experiment. We speculate that in the interaction condition of our experiment (i.e., where participants were more “socially” engaged with the humanoid, while still not knowing exactly what to expect from its behavior), the “social” presence of the robot influenced participants’ fluency in the task, while not affecting covert attention. This interpretation is in line with another recent study by Roselli et al. (2021) on the vicarious sense of agency in HRI. In their experiment, Roselli et al. showed that participants’ performance in a joint task with a humanoid worsened as a function of the level of intentionality attributed to the robot. The authors claimed that, when another social agent is present in a joint task, it activates spontaneous mentalizing processes, which are fundamental to predicting its actions (Ciardo & Wykowska, 2021). This can disrupt individuals’ ability to make decisions and act smoothly, as mentalizing requires cognitive resources that can compete with action selection processes.
Indeed, anecdotal information collected during the debriefing with the participants highlighted that most of them expected the robot to help them during their turns, as much as they were trying to help it during its turns, as if it were treated as a social partner in that condition. The fact that the robot was neither providing hints to the participants nor responding to their suggestions might have elicited the need to monitor the agent’s behavior and engage in theory of mind processes. However, there is also an alternative explanation for this discrepancy between eye-tracking and performance metrics in the conditions involving the robot. Specifically, the alternation between responses from participants and the robotic agent during the interaction condition might have introduced a potential complication in the task. This could have attenuated the pace of the learning process and impeded the development of an optimized search strategy, thereby contributing to the disparities in STs. In the context of visual search paradigms, performance typically ameliorates with repeated displays, leading to enhanced efficiency over time (Chun & Jiang, 1998; Sisk et al., 2019). Consequently, the act of switching between responses, be it from the participants or the robotic entity, has the potential to adversely affect search performance. While the visual layout in which the letter is presented maintains spatial invariance, the prospect of augmented search performance is jeopardized by the introduced response-switching element. If a computational entity, such as a computer, were to assume the role of responding and alternating with participants’ inputs in this task, it is conceivable that the outcomes might parallel those observed with a robotic agent. The impact of response switching on performance could hinge on participants’ adaptability to the task dynamics. Further research might address these alternative explanations and the potential role of the response-switching mechanism in performance.
It is also quite plausible that, contrary to what was observed in the results of our study, HRI could also lead, in different scenarios, to an improvement of participants’ performance. This hypothesized facilitation effect is based on previous literature (see Ganesh et al., 2014). It might be that facilitation effects are observed when the robot displays behavior contingent on the participant’s behavior (which better mimics natural human–human interaction). Given that in our study the robot was neither providing hints nor responding to participants’ suggestions, the highly “mechanistic” behavior of the robot might have hindered participants’ interaction fluency during the task (see Hoffman, 2019). Even though this alternative explanation might open further directions for future studies, we believe that our study provides useful insights to better understand how humans interact with mechanistic artificial agents, which are not necessarily embedded with enough “intelligence” to appropriately respond to social cues. Indeed, when it comes to human–human interaction and cooperation tasks, we rely on behavioral cues to adapt to others (Sebanz et al., 2006). Although our task did not require social coordination or joint attention, it would be interesting to add more behavioral adaptation to the robot, or to make the robot more “cooperative” (e.g., by providing hints). Additional fluency metrics could also be added to further assess participants’ performance (see, e.g., Hoffman, 2019). Considering our experimental setting, we believe that, to further understand the complexity of spontaneous interaction with a robot, it would be beneficial to embed the robot with technologies that can detect the user’s engagement (e.g., algorithms determining the time spent in mutual gaze), along with recording additional physiological metrics from the participants (e.g., galvanic skin response, heart rate, respiratory frequency) and conducting structured interviews and questionnaires immediately after the interaction.
It is also important to note that the robot used for our experiment was a humanoid. Studies using different robots, with different appearances and behavioral capabilities, are needed to understand whether the results of our study can be generalized to different scenarios and various robot platforms.
Finally, it is important to highlight that the sample size for this study was limited. Although mixed-effects models have been shown to be effective in keeping Type I error down to the nominal α (Matuschek et al., 2017; Murayama et al., 2014), the power of the present study might not be sufficient to test the impact of all the variables we considered.
In terms of limitations of this study, one should also mention the mixed within-/between-subjects design. The decision to include a second group of participants just for the interaction condition was motivated by the idea of comparing the effect of the role of the robot only, without any additional factors (such as robot presence/absence). This might have introduced a further difference between groups, related to order effects. Indeed, in the within-subjects group, we used a fixed order of presentation of the conditions: Participants came first for the solo condition and, after a few weeks, for the observation condition. Thus, this group of participants was more familiarized with the task than the second group, who took part only in the interaction condition. It is important to note, however, that although the task was the same across those two sessions (solo and observation conditions) in terms of instructions, the stimuli were different across all conditions, to minimize carry-over effects. Furthermore, the conditions were not run on the same day, as a further mitigation strategy to avoid habituation to the task. Our results (namely, longer STs in the second session and the absence of a within-subjects difference in accuracy) do not suggest an order effect facilitating performance by means of practice. The increase in STs in the second session might be due to a decline in engagement, as participants were already familiarized with the task. However, the second group of participants, who were naïve to the task, showed longer STs than the first group, which argues against this interpretation. Therefore, our results, taken together, speak rather against potential order effects. For future studies, however, it might be worthwhile to explore whether multiple exposures in a full within-subjects design (or alternative designs with a different balancing of conditions) affect STs.
A further aspect that might be worth investigating is whether familiarity with the robot or with the task itself affects social facilitation. Based on our results, we speculated that unfamiliarity with the humanoid boosted its distracting effect. Indeed, past research showed that increasing familiarity with robots positively affects individual performance in human–robot interaction (Kumle et al., 2021). Similarly, we can hypothesize that individuals who are frequently exposed to robots require less effort to monitor robot actions during the interaction. This might be further tested in a study in which the degree of exposure to robots is experimentally manipulated.
A final potential limitation is that our study focused exclusively on the effects of the social presence of a humanoid artificial agent, without considering how individuals would perform the same task in front of another human. This decision was dictated by our specific interest in human–robot interaction, as a follow-up to the literature on social facilitation in human–human interaction. However, given the novelty of the experimental paradigm and the task, future studies might explore the effects as a function of the type of interaction partner, namely human versus robot partners.
Despite the limitations listed above, our study’s strength is a novel experimental approach for investigating human–robot interaction and its effect on attentional focus and task performance. By designing a novel attentional task embedded in an ecologically valid interaction, we tried to boost the engagement of participants. Indeed, we are already navigating a world where such interactions are happening in airports and malls, which are, by nature, less controlled and potentially more engaging than pure screen-based attentional tasks. We believe there is a need to better understand attentional processes in interactive and diverse environments. Our task was designed specifically to encourage spontaneous shifts of attention. Although this approach requires the adoption of mitigation strategies and entails limitations, it also provides insights into the complexities of studying attention in more ecologically valid interactive setups, accounting for contextual factors that may influence behavior, perception, and responses. Of note here are our results showing a dissociation between covert and overt attention effects. More specifically, our results showed that participants might have covertly monitored the robot’s behavior, without overtly displaying this monitoring process. In fact, this might be a common mechanism that people use daily, but it might not be detectable in restricted classical attentional experiments. It is also in line with the results of Kingstone (2020; see also Foulsham et al., 2011; Laidlaw et al., 2011), who showed that different gaze patterns were observed when participants were walking around campus, relative to a condition in which they observed videos of the same campus scenes recorded from the point of view of a person walking around the campus. Similarly, participants seated in a waiting room were more likely to attend to a video of a confederate than to a confederate who was physically present in person. Our present result, showing a dissociation between covert and overt attention in an interactive setup with the social presence of another agent, highlights that our approach can help bridge the gap between controlled laboratory studies and real-world applications. In the context of the focus of our article, the transfer from controlled lab experimental setups to real-world situations should facilitate the development of socially intelligent robots that can seamlessly integrate into human environments and contribute meaningfully to various societal contexts. The emphasis on ecological validity is fundamental for advancing the field of cognitive psychology and ensuring that research findings are applicable and generalizable beyond controlled experimental settings.
The present study showed that the presence of a robot negatively affects participants’ performance in an attentional task, potentially due to monitoring pressure. It is important to highlight this effect in a context in which robots are being progressively introduced into our environments. It is interesting to note that the presence of robots might actually add further pressure and cognitive load on users instead of helping them perform a task. These findings have practical implications for the design and implementation of robotic technology, particularly in situations where robots are used in conjunction with human operators. They suggest that the design of robots should be carefully considered to ensure that they do not negatively impact human performance, particularly in attentional tasks. Furthermore, the study suggests that the engagement of social cognition processes regarding the interaction partner may also affect performance fluency, particularly in the context of human–robot interaction. This implies that the design of robotic technology should consider the social cues and interactions that are necessary to establish a sense of ease with the robot, which may improve HRIs and performance in tasks that require attentional resources.