Volume 3, Issue 4: Winter 2022. DOI: 10.1037/tmb0000089
In recent years, mindfulness meditation apps—which offer voice-guided exercises for relaxation—have been promoted as an effective tool for self-managing stress. This study examines how the type of voice—human male, human female, synthetic male, or synthetic female—can impact users’ levels of relaxation, perceived usefulness, and enjoyment when following a guided meditation. Participants listened to a guided meditation and then evaluated their feelings toward both the voice and the meditation exercise. Those who listened to a human voice rated the voice as more enjoyable than those who listened to a synthetic voice. Additionally, respondents in the human voice conditions rated the meditation exercise itself as more enjoyable and useful and themselves as more relaxed than did participants guided by a synthetic voice. Finally, the effect of voice human-likeness on perceived usefulness was significantly more pronounced with female voices. These findings suggest that natural-sounding speech is preferable for individuals completing guided mindfulness exercises. In turn, to enhance users’ enjoyment and perceptions of meditation effectiveness, generated speech used in such apps should sound human-like and utilize natural speech patterns.
Keywords: voice, synthetic speech, meditation, mindfulness, mobile health
Disclosures: The authors declare that they have no conflict of interest.
Data Availability: There have been no prior uses of this data. Data and study materials are publicly available to other researchers at https://osf.io/52ta9/ (Menhart & Cummings, 2022).
Open Science Disclosures:
The data are available at https://osf.io/52ta9/.
The experimental materials are available at https://osf.io/52ta9/.
Contact Information: Correspondence concerning this article should be addressed to James J. Cummings, Division of Emerging Media Studies, College of Communication, Boston University, 640 Commonwealth Avenue, Boston, MA 02215-2422, United States. Email: [email protected]
Considered “the new science of health and happiness” (Williams, 2016, p. 13), mindfulness meditation practice has become an increasingly popular trend in personal health and wellness within the past decade. Coinciding with the increasing accessibility of mobile applications in daily life, the recent surge of interest in meditation practice has resulted in the development of several mindfulness meditation apps, advertised by their creators as a way to help individuals manage their stress levels. Some of the most popular of these apps, such as Headspace, claim to alter the parts of the brain responsible for “positive traits like focus and decision making” (Headspace, 2021). Similarly, in reviewing popular meditation apps, Mani et al. (2015) found that many claim regular use can lead to benefits including greater focus, better sleeping, and a more positive outlook.
Notably, modern mindfulness apps use the same general techniques as relaxation tapes of the past, including an auditory component in which a voice guides the user through a series of breathing exercises. While some of these apps may add accompanying visuals, the narrator’s voice remains the key element for guiding the user through the meditation exercise. Drawing upon previous literature on speech-based interfaces, the present study investigates how particular voice attributes—specifically, level of human-likeness and gender—may assist the users’ ability to practice mindfulness meditation, examining how users feel about the different types of voices and how different voices impact relaxation, usefulness, and enjoyment during a guided meditation exercise.
Mindfulness in Western medicine and psychology can be traced back to the growth of Zen Buddhism in the United States in the 1950s and 1960s (Keng et al., 2011). However, it was not until the 1970s that mindfulness meditation was studied as an intervention to enhance psychological well-being. Many studies focused on the mindfulness-based stress reduction technique, in which mindfulness is practiced by “paying attention on purpose, in the present moment, and nonjudgmentally to the unfolding experience moment by moment” (Kabat-Zinn, 1982). Mindfulness meditation in these early studies consisted of a multiweek course with instruction and training intending to instill the practice so that participants would be more aware of their experiences and reactions (Kabat-Zinn, 1990). These studies were guided by an instructor informed of mindfulness meditation techniques. Mindfulness apps available today, such as Headspace and Calm, are designed to train users similarly by creating weekly and monthly programs of exercises.
Notably, mindfulness meditation traditionally only instructs the individual to pay attention to the present moment without judgment (Keng et al., 2011), reducing reactivity to challenging experiences through purposeful self-awareness of physical sensations, thoughts, and emotions (Epstein, 1999). In contrast, traditional relaxation exercises emphasize the goal of attaining a relaxed state, using techniques such as guided imagery or progressive muscle relaxation (Caldera, 2017) to release bodily tension and decrease physiological arousal (Jain et al., 2007). Today’s mindfulness meditation apps combine these goals and are explicitly promoted as tools meant to elicit relaxation (Mani et al., 2015). The apps commonly divide exercises into goal-oriented categories like stress reduction or sleeping soundly. Within each category, users can then pick a week- or month-long program with daily exercises meant to train them in applying mindfulness meditation techniques.
Combining traditional meditation techniques with the convenience of mHealth technologies, mindfulness apps allow users to access ongoing training regimes or complete single exercises regardless of where they are. In this process, users are typically guided by an in-app voice, serving the role of a surrogate instructor. This voice—whether synthetically generated or a recording of actual human speech—is responsible for guiding the user through the daily training as well as stand-alone exercises, together meant to focus attention in a manner similar to traditional mindfulness meditation regimens while also relaxing the user.
This tendency for users to respond automatically and socially to interactive media technologies is the central tenet of the Computers Are Social Actors paradigm (CASA; Nass et al., 1994, 1995). Supporting this perspective, researchers have found that individuals—regardless of level of education or familiarity with technology—intuitively identify and assign human traits such as gender (Lee et al., 2000), politeness (Nass et al., 1999), and expertise (Nass & Moon, 2000) to media technologies that include social cues. Thus, while these automatic and mindless responses can be elicited by text or images, voice interfaces may be particularly powerful when it comes to triggering the social treatment of a technology (Edwards et al., 2019). Voice-based technologies, like the speech of other people, activate all parts of the brain that are associated with social interaction (Nass & Brave, 2005), and “[b]y communicating via the method that humans have evolved to use, voice interfaces should represent an extraordinarily pleasant and effective way to interact with technology” (p. 6).
Notably, users prefer voices they consider appropriate in light of other source cues: for instance, preference for a human voice or a robotic voice depends on the visual appearance of the entity (Gong & Nass, 2007). In the absence of accompanying visuals, users will rely on characteristics of the voice itself, such as tone, to evaluate correspondence with the role or job the voice is enacting in the particular scenario (Nathan et al., 1981). For example, in the context of relaxation tape exercises, calm voices were preferred and rated as more helpful than business-like voices (Morris & Suckerman, 1976). Similarly, a voice-guided mindfulness meditation exercise may cue user schemas of a therapist or instructor as a role occupied by a human actor using a calm and soothing tone, rather than a mechanical voice simply relaying information. With a preference for consistency, human voices should be evaluated more positively than voices synthetically produced by text-to-speech systems (Allen, 1992). In turn, the meditation experience itself should be rated as comparably more enjoyable when guided by a human voice.
Hypothesis 1: (a) Participants completing a guided meditation that includes a human voice will rate the exercise’s voice as more enjoyable than those completing a meditation that includes a synthetic voice. (b) Participants completing a guided meditation that includes a human voice will rate the overall meditation exercise as more enjoyable than those completing a meditation that includes a synthetic voice.
Compared to human speech, synthetic speech is relatively poor with respect to phonetics (Delogu et al., 1998), as the latter is often generated on the basis of a rule-based synthesis in which a limited number of acoustic cues are manipulated (Pisoni, 1981). The resulting speech may impair the recognition of individual words and decoding of messages (Delogu et al., 1998), requiring relatively greater effort on the part of a listener than is the case with human speech (Picou et al., 2011). This increased listening effort includes a greater demand on cognitive resources (Fraser et al., 2010; Hick & Tharpe, 2002), which may be particularly problematic in the case of a guided mindfulness meditation exercise. In this scenario, cognitive effort is directed toward maintaining one’s breathing rate and identifying points of discomfort while verbal prompts encourage users to scan their body and release tension. Thus, self-awareness during the process of mindfulness meditation depends on how well a person can process and follow external directions all the while focusing their attention internally (Epstein, 1999). Cognitive capacity in such a state is likely constrained, such that the attentional demand of external stimuli leaves fewer cognitive resources available for other tasks (Kahneman, 1973; Kantowitz, 1974; Picou et al., 2011), such as monitoring one’s bodily state and functioning. In the context of a meditation app, exercises that require less cognitive effort in processing the spoken prompts should permit users to better focus on their physical state, and in turn, be relatively more effective. Therefore, given the demand differences in processing human versus synthetic speech, it is predicted that the type of voice used within a meditation exercise will impact the level of relaxation achieved.
Hypothesis 2: Participants completing a guided meditation that includes a human voice will report higher levels of relaxation after the exercise than will those completing a meditation that includes a synthetic voice.
Previous research suggests that the perceived usefulness of a voice system may depend on the type of voice used. In human–human interactions, fluency of speech may guide estimates of another party’s competence (Huang et al., 2000). To examine how this may carry over to human–media interactions, Huang et al. (2000) compared evaluations of two different phone-in bidding systems, one using a recorded human voice and the other a synthetic voice. Participants rated the recorded speech system to be more useful despite the two systems being functionally identical. Considering these findings, as well as the hypothesis that voice type will influence levels of relaxation, it is predicted the type of voice used in a meditation exercise will influence perceived usefulness of the exercise:
Hypothesis 3: Participants completing a guided meditation that includes a human voice will report the experience as being more useful than will those following a guided meditation that includes a synthetic voice.
When interacting with machines, humans have the tendency to treat them in a manner similar to how they treat other humans, including extending gender stereotypes to inanimate machines that cue gender (Lee et al., 2000; Nass et al., 1997; Reeves & Nass, 1996). For instance, consistent with gender stereotypes that may guide judgments of other humans, female-voiced computers occupying dominant roles are rated more negatively than male-voiced computers in the same roles (Nass et al., 1997). Similarly, male-voiced computers have been found to exert greater influence on users’ decision-making, while female-voiced computers are perceived as more socially attractive and trustworthy (Lee et al., 2000).
Historically, guided meditation is similar to the procedure a therapist might enact. Prior to today’s meditation apps, guided meditation tapes were sometimes provided by therapists for use by patients both in and outside of office visits (Nathan et al., 1981). Notably, Nathan et al. (1981) found that clients who were asked to rate different voices in recorded meditation tapes generally preferred a female voice over a male voice. Beyond meditation contexts, additional research suggests the gender of the therapist may matter, with previous studies alternately finding more positive evaluations of female therapists (Hill, 1975) as well as preferences for male therapists (Goldman, 1977). Given the mixed findings regarding gender preferences with respect to in-person therapists, meditation tapes, and voice interfaces, this study investigates the following:
Research Question 1: Does the gender of the voice used in a meditation exercise influence the participants’ enjoyment of (a) the voice guiding the meditation or (b) the overall exercise?
Notably, Hill (1975) found that clients visiting a therapist’s office reported relatively higher levels of satisfaction for sessions conducted with female counselors, aligning with the suggestion that female counselors may be perceived as more empathic and understanding (Howard et al., 1970). However, in the context of human–computer interaction, previous research suggests that computer tutors with female voices are perceived as less competent than their male-voiced counterparts (Nass et al., 1997). Given these differing effects of gender for interacting with an in-person human counselor versus a media-based guide, the present study also examines how voice gender may impact perceptions of the usefulness of a voice-guided meditation exercise.
Research Question 2: Does the gender of the voice used in a meditation exercise influence the participants’ perceived usefulness of the overall exercise?
The present study employed a 2 (human-likeness of voice: human vs. synthetic) × 2 (gender of voice: female vs. male) full factorial design to observe the effect of these voice qualities on users’ ratings of enjoyment, relaxation, and usefulness during a voice-guided meditation exercise. A university institutional review board approved all questionnaire items and study procedures. All study materials are available online at https://osf.io/52ta9/.
Participants were recruited from online forums on Reddit to participate in a study exploring how different voice qualities influence user perceptions during a guided meditation exercise. Specifically, participants were recruited from the r/Meditation, r/Zen, and r/guided meditation “subreddits” or topic-focused forums. Beyond these interest-themed subreddits, participants were also recruited from r/samplesize, a general subreddit in which people who are conducting research post calls for study participants. Participation was voluntary with the compensation of entry into a gift card raffle. Participants were randomly assigned to one of four experimental conditions via the Qualtrics online survey platform.
In total, 179 responses were initially collected, with 76 of these removed due to incomplete responses. The average age of the 103 remaining participants was 37.7 years (SD = 14.4), with 74 females and 29 males. The distribution of race within the sample was 6.8% African American or Black, 11.7% Asian or Asian American, 2.9% Hispanic, 1.0% Native American, 1.0% Native Hawaiian or Pacific Islander, 69.9% White, 1.9% identifying as a race not listed, and 4.9% declining to answer. From this sample, there were 27 participants in each of the Human Male Voice and Synthetic Female Voice conditions, 25 in the Synthetic Male Voice condition, and 24 in the Human Female Voice condition.
Post hoc power analyses were conducted using GPower (Erdfelder et al., 1996). While the present study considered multiple dependent user effects (enjoyment, relaxation, perceived usefulness) of the human-likeness and the gender of voices, we were unable to find corresponding effect sizes reported in the literature for each independent effect. That said, Mayer’s (2014) review of the general impact of human versus synthetic voices in instructional or learning contexts observed a relatively large median effect size (d = 0.74). A post hoc power analysis for an omnibus test using this effect size estimate, an α of 0.05, and the study’s final sample size of 103 suggested a statistical power of 0.96. With respect to gender of voices, Ernst and Herm-Stapelberg (2020a) reported a robust effect size (d = .772) for likeability, with female voices being rated as more enjoyable; in contrast, the same researchers (Ernst & Herm-Stapelberg, 2020b) also found that male voices were perceived as more competent than female voices (d = .592). Separate post hoc power analyses found the present study’s power, based on these effect size estimates, to be 0.97 and 0.84, respectively.
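The reported power values can be roughly cross-checked with a short calculation. The sketch below is not the GPower procedure itself: it assumes a simple two-group comparison with the 103 participants split evenly, a two-sided α of .05, and a normal approximation to the noncentral t distribution. The function name `approx_power_two_groups` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_groups(d: float, n_total: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power for a two-group mean comparison.

    Assumptions (not taken from the article): n_total is split evenly
    between the two groups, and the noncentral t distribution is replaced
    by a normal approximation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Noncentrality: d * sqrt(n1 * n2 / (n1 + n2)) with n1 = n2 = n_total / 2.
    ncp = d * sqrt(n_total / 4)
    # Probability of exceeding the critical value in either tail.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


# Effect size estimates cited in the text, with the final sample size of 103.
for label, d in [("human-likeness", 0.74), ("likeability", 0.772), ("competence", 0.592)]:
    power = approx_power_two_groups(d, 103)
    print(f"{label}: d = {d}, approximate power = {power:.2f}")
```

Under these assumptions the approximation lands within roughly .02 of the reported values; an exact noncentral calculation, as performed by dedicated power software, would differ slightly.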
Meditation stimuli were created using either human-recorded voices or computer-generated voices. The script for the meditation (see Appendix) was based on a combination of the transcripts of two different meditation exercises: an introductory 5-min guided meditation from the meditation app Headspace (https://www.headspace.com) and a guided breathing meditation from Inner Health Studio (https://www.innerhealthstudio.com/breathing-meditation.html). Pauses were written in the script to help regulate pacing. To create the human voice conditions, two individuals reading the script were recorded via Audacity recording software. The speakers were American but did not have any specific regional accents. Speakers were asked to speak slowly and clearly. The synthetic voice audio files were generated by inputting the guided meditation script into the default Mac OS text-to-speech program.
The gender of the voice recordings was manipulated by creating two different types of audio meditations for each of the human-likeness stimulus creation procedures. To create the gendered synthetic audio conditions, the Mac text-to-speech program’s male English language voice (“Alex”) and female English language voice (“Samantha”) were used. For the human conditions, a male-presenting volunteer and a female-presenting volunteer were each recorded reading the meditation script.
Enjoyment of the meditation voice was measured using questions adapted from the study by Huang et al. (2000). Participants were asked to rate their agreement with four 7-point Likert-type items. Example items included “The voice was annoying” (reverse coded) and “The voice was likable” (α = .91, M = 3.64, SD = 1.71). Item scores for each participant were averaged to determine a final score for the scale.
Enjoyment of the meditation exercise itself was also measured using adapted versions of the questions from Huang et al. (2000). Participants were asked to rate their agreement with three 7-point Likert-type items: “The meditation was enjoyable,” “The meditation was fun,” and “The meditation was interesting” (α = .93, M = 4.34, SD = 1.71). Item scores for each participant were averaged to determine a final score for the scale.
Relaxation level was measured using the Relaxation Inventory (α = .97, M = 4.91, SD = 1.05) developed by Crist et al. (1989). The inventory consisted of three distinct subscales (Physiological Tension, Physical Assessment, and Cognitive Tension) that asked participants to assess mental and bodily states on 45 7-point Likert-type items. All 15 items on the Physiological Tension scale were reverse coded as a measure of relaxation. Example items from the Physiological Tension scale include “My jaw is set tight” and “My palms are sweaty” (α = .93, M = 5.69, SD = 1.07). Examples of items on the Physical Assessment scale include “My muscles feel loose” and “I feel a sense of tranquility throughout my body” (α = .99, M = 4.47, SD = 1.50). All 20 items on the Cognitive Tension scale were reverse coded. Examples of items on the Cognitive Tension scale include “I am thinking about my problems” and “I feel like I am in a state of mental strain” (α = .91, M = 4.61, SD = 1.26). Items on each subscale were averaged, giving each participant a subscale score. Each participant’s composite score for the relaxation scale was determined by averaging all 45 items on the scale.
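The scoring rules described for the Relaxation Inventory can be sketched as follows. This is a minimal illustration, not the authors' scoring script: it assumes items are coded 1 to 7, infers that the Physical Assessment subscale has 10 items (45 total minus the 15 and 20 reverse-coded items), and uses made-up responses for a hypothetical participant.

```python
from statistics import mean

SCALE_MAX = 7  # assumption: 7-point Likert-type items are coded 1 through 7


def reverse_code(score: int, scale_max: int = SCALE_MAX) -> int:
    """Flip an item so that higher values indicate more relaxation."""
    return scale_max + 1 - score


def score_relaxation(physiological, physical, cognitive):
    """Return subscale means and the 45-item composite for one participant.

    Physiological Tension and Cognitive Tension items are reverse coded;
    Physical Assessment items are used as given.
    """
    physiological_rev = [reverse_code(s) for s in physiological]
    cognitive_rev = [reverse_code(s) for s in cognitive]
    all_items = physiological_rev + list(physical) + cognitive_rev
    return {
        "physiological": mean(physiological_rev),
        "physical": mean(physical),
        "cognitive": mean(cognitive_rev),
        "composite": mean(all_items),
    }


# Hypothetical participant: low tension ratings, moderately high comfort ratings.
example = score_relaxation(
    physiological=[2] * 15,  # 15 Physiological Tension items
    physical=[5] * 10,       # 10 Physical Assessment items (inferred count)
    cognitive=[3] * 20,      # 20 Cognitive Tension items
)
print(example)
```

Because the composite averages all 45 items rather than the three subscale means, the longer Cognitive Tension subscale carries proportionally more weight in the overall score.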
Usefulness measures were adapted from those used by Huang et al. (2000). Participants were asked to rate their agreement with four 7-point Likert-type items. Examples of items included “The voice in the meditation was useful” and “The meditation was easy to use” (α = .75, M = 5.13, SD = 1.22). Item scores for each participant were averaged to determine a final score for the scale.
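Each scale above is reported with Cronbach's α as its reliability estimate. As a hedged illustration of how that coefficient is computed from item-level responses, the sketch below applies the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The responses are invented and the function name `cronbach_alpha` is hypothetical; the study's own data are available at the OSF link given earlier.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of participant rows of k item scores."""
    k = len(responses[0])
    # Transpose rows (participants) into columns (items).
    item_columns = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Invented responses from five participants on a four-item, 7-point scale.
hypothetical_responses = [
    [5, 6, 5, 6],
    [3, 3, 4, 3],
    [6, 7, 6, 6],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
]
alpha = cronbach_alpha(hypothetical_responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Reverse-coded items, such as “The voice was annoying,” would need to be flipped before being passed to a function like this one; otherwise they lower the estimate.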
Within the opening page of the Qualtrics online survey, participants were told that the present study explored the influence of different voice qualities during a guided meditation exercise. They were told that if they chose to participate they would be asked to listen to a mindfulness meditation audio and then answer some questions about their experience and thoughts on the exercise. They were then provided with instructions about the experimental procedure and asked to provide their consent to participate. Participants were randomly assigned to one of four audio meditation conditions. On the first page of the online instructions, participants were informed that they would listen to and complete an approximately 3-min guided mindfulness meditation exercise upon proceeding to the next page. Participants were then brought to a blank page with an unlabeled embedded audio file that automatically started playing upon page onset. The beginning of the audio included instructions asking the participant to get comfortable and focus on completing the mindfulness exercise. After completing the exercise, participants then answered questions related to their enjoyment of the voice they interacted with, enjoyment of the meditation, current relaxation state, and perceived usefulness of the meditation exercise, as well as a set of demographic questions.
To investigate the main and interaction effects noted in the hypotheses and research questions posed, a series of two-way factorial analyses of variance (ANOVAs) were conducted. In each case, Bonferroni corrections were made for multiple comparisons and estimated marginal means were compared.
The first hypothesis predicted that participants completing a guided meditation that includes a human voice would report higher levels of enjoyment than those completing a guided meditation with a synthetic voice, with respect to both the voice heard (Hypothesis 1a) and the meditation experience as a whole (Hypothesis 1b). Participants who listened to a human voice reported a higher level of enjoyment for the voice (M = 4.59, SE = .20) than those who listened to a synthetic voice (M = 2.73, SE = .20); F(1, 99) = 42.90, p < .001, ηp² = .302. Using Cohen’s (1988) benchmarks for effect size, this suggests the relative humanness of the voices had a large effect on the extent to which participants enjoyed the voice guiding the meditation. In investigating H1b, the factorial ANOVA also revealed a significant difference in the level of enjoyment of the meditation as a whole, with those who listened to a human voice (M = 5.00, SE = .22) reporting greater enjoyment than those who listened to a synthetic voice (M = 3.70, SE = .22); F(1, 99) = 17.00, p < .001, ηp² = .147. Thus, the human-likeness of the voices had a relatively large effect on users’ enjoyment of the exercise itself. Together, these results provide support for both Hypothesis 1a and Hypothesis 1b.
Hypothesis 2 anticipated that participants completing a guided meditation with a human voice would report higher levels of relaxation (and thus lower levels of tension) following the exercise than would those who completed the same exercise with a synthetic voice guide. The factorial ANOVA revealed that people who listened to a human voice experienced significantly greater overall relaxation (M = 5.19, SE = .14) than those who listened to a synthetic voice (M = 4.62, SE = .14); F(1, 99) = 7.99, p = .006, ηp² = .075, a medium-sized effect providing direct support for Hypothesis 2. With respect to each of the Relaxation Inventory’s three subscales, participants who listened to a human voice reported significantly lower levels of physiological tension (M = 1.10, SE = .15) than those who listened to a synthetic voice (M = 1.52, SE = .15); F(1, 99) = 4.16, p = .04, ηp² = .040. Similarly, participants who were guided by a human voice reported higher levels of physical comfort (M = 4.85, SE = .21) than those who encountered a synthetic voice (M = 4.10, SE = .21); F(1, 99) = 6.70, p = .011, ηp² = .063. Thus, human-likeness of voices had significant small- and medium-sized effects on physiological tension and physical comfort, providing additional support for Hypothesis 2. However, while cognitive tension scores for participants in the human voice conditions (M = 2.18, SE = .18) were lower than those in the synthetic voice conditions (M = 2.60, SE = .17), this difference was not significant, F(1, 99) = 2.76, p = .100, ηp² = .027.
This study’s final hypothesis was that participants following a guided meditation including a human voice would rate the experience as being more useful than those completing the same exercise guided by a synthetic voice. The factorial ANOVA revealed significantly greater perceived usefulness of meditations that included human voices (M = 5.73, SE = .15) compared to that for meditations including synthetic voices (M = 4.57, SE = .15); F(1, 99) = 30.98, p < .001, ηp² = .238. This significant, relatively large effect provided direct support for Hypothesis 3.
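The partial eta squared values reported for these tests can be recovered from the F statistics and their degrees of freedom, since partial eta squared equals (F × df_effect) / (F × df_effect + df_error). The sketch below checks several of the reported values. The size labels use the commonly cited cutoffs of .01, .06, and .14 attributed to Cohen (1988), which is an assumption about the benchmarks applied here, and both function names are hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohen_label(eta_p2: float) -> str:
    """Label an effect using assumed cutoffs of .01, .06, and .14."""
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"


# (F statistic, reported partial eta squared) for human-likeness effects, df = (1, 99).
reported = {
    "voice enjoyment": (42.90, 0.302),
    "exercise enjoyment": (17.00, 0.147),
    "overall relaxation": (7.99, 0.075),
    "perceived usefulness": (30.98, 0.238),
}

for outcome, (f_value, reported_value) in reported.items():
    computed = partial_eta_squared(f_value, 1, 99)
    print(f"{outcome}: computed {computed:.3f}, reported {reported_value:.3f}, "
          f"{cohen_label(computed)}")
```

Because the relationship depends only on the magnitude of F, which is a ratio of variances and cannot be negative, the same check applies to the subscale tests.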
RQ1a-b considered whether the gender of the voice used in a meditation exercise may influence users’ enjoyment of the guiding voice and of the experience as a whole, respectively. The factorial ANOVAs revealed that voice gender had no significant direct effects on voice enjoyment, female voices: M = 3.65, SE = .20; male voices: M = 3.67, SE = .20; F(1, 99) = 0.00, p = .947, ηp² = .000, or enjoyment of the exercise as a whole, female voices: M = 4.33, SE = .22; male voices: M = 4.38, SE = .22; F(1, 99) = 0.00, p = .888, ηp² = .000. They also revealed no significant moderation of the main effect of human-likeness of the voice on either enjoyment of the voice, F(1, 99) = 2.20, p = .141, or enjoyment of the meditation exercise as a whole, F(1, 99) = 0.41, p = .524.
RQ2 similarly asked if the gender of the meditation exercise’s guiding voice influences users’ perception of the exercise’s usefulness. The factorial ANOVA found no significant direct effect for voice gender, female voices: M = 5.04, SE = .15; male voices: M = 5.25, SE = .15; F(1, 99) = 1.04, p = .311, on perceived usefulness. However, it did reveal a significant Human-likeness of voice × Voice gender interaction, F(1, 99) = 5.12, p = .026, ηp² = .049, independent of the main effect for the human-likeness of the voice. As illustrated in Figure 1, while participants generally rated sessions guided by synthetic voices as less useful than those guided by human voices, a post hoc Tukey test found that this effect was more extreme for sessions guided by female voices (p < .001, see Table 1); indeed, the difference in perceived usefulness of human male and synthetic male voices was not itself significant (p = .094).
Table 1. Perceived Usefulness of Meditation Exercise by Voice Condition
Condition | Female | Male |
---|---|---|
Human, M | 5.85a | 5.60ab |
Human, SD | 0.67 | 0.89 |
Synthetic, M | 4.23c | 4.91bc |
Synthetic, SD | 1.22ab | 1.29ab |
Note. F(1, 99) = 5.12, p = .026. Means with no subscripts in common are different at p < .05.
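The cell means in Table 1 tie back to the estimated marginal means reported in the text: in a two-way factorial model without covariates, each estimated marginal mean is the unweighted average of the relevant cell means. The sketch below, using only the four means from the table, recovers the marginal means for perceived usefulness and shows the size of the human-versus-synthetic gap within each voice gender. The variable and function names are hypothetical.

```python
# Cell means for perceived usefulness, taken from Table 1.
cell_means = {
    ("human", "female"): 5.85,
    ("human", "male"): 5.60,
    ("synthetic", "female"): 4.23,
    ("synthetic", "male"): 4.91,
}


def marginal_mean(factor_index: int, level: str) -> float:
    """Unweighted average of the cell means at one level of one factor.

    factor_index 0 selects human-likeness; factor_index 1 selects voice gender.
    """
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)


for level in ("human", "synthetic"):
    print(f"{level} voices: {marginal_mean(0, level):.3f}")
for level in ("female", "male"):
    print(f"{level} voices: {marginal_mean(1, level):.3f}")

# Human-minus-synthetic gap within each voice gender, and their difference,
# which is the contrast tested by the interaction term.
gap_female = cell_means[("human", "female")] - cell_means[("synthetic", "female")]
gap_male = cell_means[("human", "male")] - cell_means[("synthetic", "male")]
print(f"gap (female) = {gap_female:.2f}, gap (male) = {gap_male:.2f}, "
      f"difference = {gap_female - gap_male:.2f}")
```

The gap is considerably larger for female voices than for male voices, which is the pattern the significant interaction describes.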
Mindfulness meditation apps have been described as self-help tools to aid users in regulating their stress levels. These apps have become some of the most popular in the mobile health section of the app store for both Apple and Samsung (Statista, 2018). There are many claims about the benefits of these apps, such as better focus, lower stress levels, and a more positive outlook (Mani et al., 2015; Yang et al., 2018). However, there has not been much empirical research into how specific interface factors lead to this effect.
The purpose of this study was to determine whether voice characteristics like human-likeness or gender influence users’ enjoyment of the voice, enjoyment of the experience as a whole, relaxation levels, and perceived usefulness of listening to a guided meditation. Based on previous literature investigating enjoyment of human and synthetic voices (Gong & Nass, 2007; Huang et al., 2000; Nass & Brave, 2005), we predicted that participants would find meditations including human voices more enjoyable, in terms of the voice (Hypothesis 1a) and the meditation experience overall (Hypothesis 1b). As expected, and similar to the findings in the literature, the present study found higher levels of enjoyment for the voice and meditation in conditions in which users listened to a human voice. Additionally, informed by previous literature on decoding synthetic speech (Edwards et al., 2019) and the allocation of cognitive focus toward effortful listening (Fraser et al., 2010; Hick & Tharpe, 2002), two additional hypotheses (Hypothesis 2, Hypothesis 3) correctly predicted that participants guided by human voices would experience higher levels of relaxation and rate the exercise as more useful. Altogether, these findings demonstrate that in guiding a meditation, human voices, compared to synthetic voices, may offer advantages with respect to health-relevant outcomes as well as the overall user experience.
Finally, based on mixed findings from past research on preferences of therapy clients and users of voice-based interfaces, open-ended research questions investigated whether female and male voices would be enjoyed differently (Research Question 1a), impact levels of enjoyment of the guided meditation overall (Research Question 1b), or influence levels of perceived usefulness of the meditation (Research Question 2). Our analyses found that while voice gender had no effect on enjoyment, it significantly interacted with the human-likeness of the voice in shaping usefulness perceptions.
This study’s findings raise important questions about factors influencing the success of mediated mindfulness meditation exercises. These results could help inform our understanding of the relative impact and value of human voices when designing such apps, as well as other new speech-based communication technologies. As speech is one of the essential mediums of human communication, voice interfaces are being integrated into an increasing number of digital technologies (Edwards et al., 2019). A recorded voice system requires the creator to find someone to record and potentially go through multiple takes, as well as to edit the audio. In contrast, most computers now have a built-in text-to-speech feature that allows users to save audio files of a computer-generated synthetic voice reading a given text. Yet, as the current findings indicate, taking this convenient route could actually result in a product that users find relatively less pleasing and experience as comparably less useful.
There are several challenges to conducting natural conversations between a human and a computer: natural language is hard to understand, natural behavior is tricky to model, latency expectations require fast processing, and generating natural-sounding speech, with the appropriate intonations, is difficult (Leviathan & Matias, 2018). As such, the synthetic voices used in this experiment, representing modern mainstream voice applications, were clearly nonhuman and artificially produced, lacking the rhythmic and phonetic qualities that a human voice would have (Delogu et al., 1998). However, if a synthetic voice could better mimic such qualities of a human voice, it is possible that the enjoyment and perceived usefulness of interactions with these voices would be enhanced, perhaps equaling the levels ascribed to human-voiced exchanges.
Some technology firms, such as Apple and Microsoft, have perhaps implicitly reached this same conclusion, as suggested by their using voice actors for synthetic speech generated by their newer voice assistant technologies (Perez, 2018; Verger, 2017). These voice actors record a wide range of words and sounds which are then used by the software to synthesize speech. Personal assistants like Apple’s Siri and Microsoft’s Cortana thus present a potentially effective hybrid between recorded human speech and computer-generated speech. Although the process of creating a program to generate synthetic speech in this manner may be relatively laborious compared to more rudimentary text-to-speech systems, the resulting speech—which is relatively human-sounding in comparison—may be preferable to most users. Synthetic speech generated in this way might foreseeably be leveraged for an app that generates meditations allowing for dynamic interaction or real-time adaptations to fit the needs of the user. A vast interpersonal improvement over their relaxation tape progenitors, such guided meditation apps would permit an experience much more akin to an interactive exercise with an actual human therapist or meditation coach.
Indeed, as technology continues to develop, it is possible that interactive synthetic speech will become ever more human-like and perhaps nearly indistinguishable from human voices. Recent years have witnessed major strides in the ability of computers to understand and to generate natural speech. Advances in artificial intelligence (AI) such as Google Duplex—which uses an AI assistant with a human-sounding voice—have evidenced that synthetic speech can be afforded the same level of interactivity as actual human-to-human exchanges. In the case of Duplex, the AI can understand the voice of the human on the other end of a phone call and respond with correct answers to real-time inquiries, all the while employing common speech disfluencies like “um” and “hmm” and pause breaks (Leviathan & Matias, 2018). Duplex highlights the potential for more advanced synthetic voice systems to eventually rival or equal human voice interfaces in other settings, including meditation guidance and relaxation training.
Notably, participants in the present study reported no difference in enjoyment across differently gendered voice conditions. Past research on preferences of therapy clients and users of voice-based interfaces has produced mixed results regarding relative enjoyment of male and female voices. With respect to therapy clients in particular, a preference for female voices might be related to gender stereotypes in which women are often associated with caretaker roles. In the present study, the lack of higher ratings for a female voice could indicate a breakdown in this stereotype, such that people may not necessarily hold the same gender stereotype for therapists and meditation guides nearly 30 years later. On the other hand, voice gender did influence users’ ratings of the usefulness of the meditation exercise, moderating the main effect of human-likeness: Though synthetic voices were generally perceived as less useful than human voices, this effect was much more pronounced for female voices than for male voices. In terms of practical implications, this would suggest that mindfulness meditation apps limited to synthetic voices may do well to employ male-sounding voices.
With respect to theory, this observed interaction presents some interesting implications for the CASA perspective on users’ experiences with interactive media technologies. The difference in usefulness scores for meditation sessions guided by synthetic female versus human female voices generally aligns with the previous finding that female-voiced computers in a tutoring role are perceived as less competent (Nass et al., 1997); however, while Nass and colleagues found this gender effect to occur with both synthetic and human-recorded voices, the effect was only observed for synthetic voices in the present study. One possible interpretation of this discrepancy is that today’s users may not hold the same gender stereotypes about human actors, including those recorded for the human-voiced meditations. Indeed, recent reflections on the paradigm suggest that CASA-related findings regarding the social treatment of technology may not be as enduring as once considered, as schematic processing of mediated interactants may change in light of the emergence of new technologies, new modes of interaction with them, and new schemas about social interactions (Gambino et al., 2020). Even if they do not hold dated stereotypes about gender in human actors, participants in the present study may have schemas related to how a synthetic, machine-like voice is “supposed” to sound, with which female-sounding voices may be inconsistent. Given this possibility, as well as the fact that no systematic preference for female or male voice was observed here, the current findings suggest that app designers may do well to instead allow voice gender to be a custom setting for users in guided meditation apps, particularly as there is some evidence that other elements of therapy, such as openness with emotions, may depend on both therapist and client gender (Hill, 1975).
The possibility of evolving schemas about machine gender, violations of which may negatively influence perceived usefulness or other aspects of user experience, is in line with recent reflections on the CASA paradigm (Gambino et al., 2020) and should be empirically explored by future research. Notably, the interaction effect described above may alternatively be due to certain aspects of the specific stimuli used in this study. Like many media psychology experiments, the present study relied on a single stimulus operationalization per condition; in turn, the generalizability of the current findings may be limited (Reeves et al., 2016). In particular, the nonsignificant gender difference in perceived usefulness for sessions guided by the two human voices used here may not generalize to a wider sample of male and female human voices. Moreover, if a variety of male and female voices had been used (for instance, with lower and higher pitches to create a more fully representative variety of gendered voices), it is possible that gender effects for enjoyment, suggested by previous counseling literature, may have been observed. With voices of higher and lower registers, gender stereotypes that specifically involve pitch could potentially play a bigger role in the way participants respond to the voices than was observed here.
By that same token, a larger number of stimuli may have also offered greater insight into the significant effect of voice humanness that was observed. Notably, this study sought to compare the effects of voices that differed in terms of relatively broad, higher order voice variables (gender and humanness), doing so with only four voice stimuli. As such, it was not possible to examine how component attributes of a given voice (such as pitch, tone, diction, dialect, or accent) may impact the observed effects. If multiple voices were employed per experimental condition, it is possible that outcome measures might have varied across voices within a given level of gender or human-likeness. Given the present study’s findings, future work should systematically examine these specific attributes through a wider assortment of stimuli per experimental condition; doing so will allow investigation into the constituent characteristics driving the relative effects of the broader voice categories compared here. Further, beyond voice qualities, additional research might also examine different aspects of speech (for instance, pacing, degree of repetition, or other speech patterns that may conceivably differ between informal recordings of text and those that would be provided by trained professionals) to examine how voice and speech conventions together shape user perceptions of guided meditation exercises.
Additionally, the measurement of relaxation levels for participants in this experiment was limited by the use of self-report measures. The Relaxation Inventory (Crist et al., 1989) contains 45 items on which participants rate their relaxed state in various ways. The number of items in this measure could lead to some fatigue, which is a concern for self-report measures in general but especially problematic when the aim is to have participants self-reflectively report their level of relaxation. To help address this possible limitation, future research should emphasize the use of physiological measures that may correspond with mental and physical relaxation. Galvanic skin response (GSR), particularly tonic levels, would provide an appropriate measure of overall state relaxation. Similarly, cognitive effort measures could include evaluations using electroencephalography (EEG). Such physiological indices, in complement to self-report, could help triangulate measures of relaxation both during and after the guided meditation.
While GSR and EEG are increasingly common physiological measures employed in research on media processing and effects (Potter & Bolls, 2012), body muscle tension is another biometric that could be useful in future research on the effects of meditation apps. Muscle tension can be measured through electromyography, which uses sensors attached to the skin to detect electrical signals from nerves controlling the muscles (Blackett, n.d.). Measuring levels of muscle tension could help differentiate between individuals’ baseline levels of trait mindfulness, which is the general tendency to be mindful in daily life, and state relaxation, a present state of relaxation determined by a stimulus (Keng et al., 2011). Change scores for muscle tension before and after the guided meditation could be combined with ratings on the Physical Assessment scale used here to jointly gauge the relative effectiveness of different voices in guiding a meditation exercise intended to elicit higher levels of relaxation.
Relatedly, future research could also account for relevant individual differences—such as trait mindfulness and trait anxiety (Sedlmeier et al., 2012)—that may interact with message features to determine the relative effectiveness of different voice-guided meditations. People with high trait mindfulness would be more likely to have higher base relaxation levels and respond positively to the meditation, whereas those with high trait anxiety—the habitual feeling of anxiety, worry, and stress—would be more likely to have lower base relaxation scores and experience greater difficulty relaxing. Accounting for these traits can help provide a more nuanced understanding of the effects of different voice characteristics of meditation apps, not just generally but also with respect to a variety of potential user types. Additionally, while the present study relied on convenience sampling through relevant online forums, the relative distribution of theoretically relevant traits might be investigated through a purposeful sampling of a wide diversity of users, in terms of race, gender, socioeconomic status, age, and access.
Finally, beyond incorporating alternate operationalizations and accounting for relevant individual differences, future work might investigate how additional social characteristics of the meditation guide’s voice, such as having a name or expressing certain personality traits, could also potentially influence user satisfaction or perceived usefulness. People are attracted to others with similar personality types and have been shown to like a voice that expresses a personality matching their own (Nass & Gong, 2000). As such, perceived homophily with the user might help elicit more positive opinions of the technology, including higher ratings of enjoyability or usefulness. Moreover, as people also rate computers more positively when interdependently associated with them, such as being on the same team as indicated by a team name or color (Nass et al., 1996), a mindfulness meditation app that includes a human voice that expressly contextualizes the user experience as social interaction with a compassionate therapist or supportive coach could encourage trust and elicit higher enjoyment from users.
During this breathing meditation, you will focus on your breath. This will calm your mind and relax your body.
There is no right or wrong way to meditate. Whatever you experience during this breathing meditation is right for you. Don’t try to make anything happen, just observe.
All you need to do is sit back, relax, and allow the body and mind to unwind.
So just take a moment to get comfortable. It doesn’t matter how you’re sitting, just do whatever feels best.
[Pause for 3 Seconds]
I’d like you to begin with your eyes open, not staring too intently, just aware of the space around you. And just maintain that soft focus with the eyes.
Starting with a couple of deep breaths, in through the nose, and out through the mouth.
As you breathe in, feel the lungs expand as they fill with air. And as you breathe out, notice how the muscles soften as the body exhales.
Just one more time, and as you breathe out this time, gently close your eyes. And allowing the breath to return to its natural rhythm in and out through the nose.
[Pause for 3 Seconds]
So just take a moment, just to feel the weight of the body pressing down against the seat beneath you. The feet on the floor, the arms and the hands resting on the legs.
Starting to notice the space around you, maybe sounds. Just allowing those sounds to come and go. And then just bringing the attention back to the body.
Just noticing how the body feels right now. Just to help with this, starting at the top of the head, gently scanning down through the body. Noticing what feels comfortable, what feels uncomfortable.
Smooth, even, steady, from head to toe. And as you scan down your body, starting to notice the movement of breath in the body.
How your breath creates a rising and falling sensation. For some people that’s in the stomach, for others that’s in the chest. Sometimes the diaphragm. If you can’t feel anything, just gently place your hand on the stomach. Notice the movement.
Don’t worry if your mind wanders off, that’s perfectly normal. As soon as you notice it has wandered just gently bring the attention back to the breath again.
Starting to notice where the breaths are long, short, deep or shallow.
And then just for a few seconds letting that focus of the breath go and letting the mind do as it wants to do. So, if it’s been wanting to think, just letting it think now.
[Pause for 3 Seconds]
And then bringing the attention back to the body again. Coming back to that feeling of weight pressing down again. Perhaps noticing the sounds around you again.
And you can just open your eyes in your own time. And ask yourself how that feels having taken a few minutes out of the day to slow down. Whether there’s a greater sense of calm, perhaps clarity. Don’t worry if there are still lots of thoughts running around; that’s normal.
Sit for a few moments more, enjoying how relaxed you feel, and experiencing your body reawaken and your mind returning to its usual level of alertness.
https://doi.org/10.1037/tmb0000089.supp