
Procedural Learning in Virtual Reality: The Role of Immersion, Interactivity, and Spatial Ability

Volume 3, Issue 4: Winter 2022. Special Collection: Learning in Immersive Virtual Reality. DOI: 10.1037/tmb0000087

Published on Nov 09, 2022


Although there is much enthusiasm for using virtual reality (VR) in training and education, research findings on the benefits of VR over traditional learning environments are mixed. This disconnect in the literature may be because the cognitive mechanisms that underlie learning in VR have not been addressed systematically. This research explored immersion and interactivity, two key features of VR that set it apart from traditional training approaches. The impact of spatial ability (SA) on learning outcomes was also assessed. In this experiment, 83 college students learned a maintenance procedure in a virtual environment (VE). Students were assigned randomly to one of three between-subjects training groups. In the Desktop group, students used a desktop-based environment and interacted with it using a mouse and keyboard. In the VR-voice group, students used a VR-based environment and interacted using voice-based commands, while those in the VR-gesture group interacted using gesture-based commands. To measure learning, students completed a written recall test. There were no significant differences in learning outcomes across the three groups, but SA was a significant moderator. Overall, those with high SA outperformed those with low SA in the Desktop condition; however, this effect was not significant in the VR conditions. In fact, those with low SA performed better in the VR-gesture condition than in the VR-voice or Desktop conditions. These results suggest that VR environments that incorporate gesture-based interactions may help individuals with low SA better understand the content of the lesson.

Keywords: virtual reality, spatial ability, immersion, gestures, learning

Acknowledgments: The authors gratefully acknowledge Ms. Courtney McNamara and Mr. Matthew Proetsch who developed the testbed and Ms. Emily Gonzalez-Holland for her assistance with data collection.

Funding: This work was funded by the Section 219/Naval Innovative Science and Engineering (NISE) Basic and Applied Research (BAR) program.

Disclosures: The authors have no conflict of interest to disclose.

Data Availability: In accordance with 32 CFR Part 219 and DoD Instruction 3216.02, data availability is limited. Additional use of the data for this study may be granted to nongovernment agencies or individuals by the Navy Surgeon General following the provisions of the Freedom of Information Act or contracts and agreements.

Correspondence concerning this article should be addressed to Cheryl I. Johnson, Naval Air Warfare Center Training Systems Division, 12211 Science Drive (ATTN: GT551), Orlando, FL 32826, United States. Email: [email protected]

There is excitement for virtual reality (VR) technology in research and education because it provides the opportunity for immersive, hands-on experiences in diverse learning situations (Vélaz et al., 2014). VR has been used for instruction in a variety of domains, including science concepts (Moreno & Mayer, 2002; Parong & Mayer, 2018), visual search (Ragan et al., 2015), assembly tasks (Hamblin, 2005), and health care (Kyaw et al., 2019), but the research findings on the benefits of VR over traditional learning environments are mixed (Abich et al., 2021). This disconnect in the literature may be because the cognitive mechanisms that underlie learning in VR have not been systematically addressed (Barrett & Hegarty, 2016). Determining which features of VR impact specific learning mechanisms for which types of learners is an important consideration for developing VR educational systems that implement effective instructional techniques. To these ends, the goals of this research were to determine if VR-based training is more effective than traditional desktop-based training, examine whether the type of interaction with the VE affects learning, and understand the impacts of individual differences on learning within VR.

The first goal of this research was to evaluate whether training with VR technology was more effective than desktop-based training. Early researchers in the field of VR made the distinction between immersive VR and desktop computer VEs, stating that desktop systems are less immersive because they do not surround the user in the virtual space (Psotka, 1995). On the other hand, VR systems that use head-mounted displays (HMDs) create an immersive experience by following the four-factor model of immersion proposed by Slater and Wilbur (1997), in that these displays occlude the physical world (inclusive), involve the sensory system extensively (extensive), surround the viewer with the virtual world (surrounding), and are vivid in content and resolution (vivid). To better understand the link between immersive VR and learning, the current experiment investigated whether training a procedural task in immersive VR using a HMD led to better learning than if the training took place using a nonimmersive desktop system.

The next goal of this research was to parse out why VR may differ from desktop-based training by investigating the two main features that distinguish these systems: immersion and interactivity (Huang et al., 2016). Immersion is the feature in which the senses are engaged in the VE, resulting in a feeling of presence in the virtual space (Riva & Waterworth, 2014). For example, viewing a VE through a 3-D HMD may be perceived as more immersive than viewing the same environment on a 2-D desktop screen (Cummings & Bailenson, 2016; Psotka, 1995). Interactivity is the ability to manipulate the VE, such as by using handheld controllers or gesture-based commands in VR to interact with virtual objects. In contrast, desktop-based systems typically use a mouse and keyboard for interaction. Interactivity may influence the sense of immersion in the VE and produce combined effects on learning (Johnson-Glenberg et al., 2014). While immersion and interactivity are often discussed in conjunction, they may actually represent two distinct characteristics of VR technology that affect the learner’s experience with the system. The present experiment investigated if VR benefits learning compared to desktop training, and if so, whether the effect from VR was a result of viewing a 3-D environment (immersion) or hands-on practicing of the procedure (interactivity).

The final goal of this research was to determine whether the differences between VR and desktop training were related to individual differences between the learners. An important part of determining optimal features of VR for learning is investigating for whom these features are beneficial. For example, it may be that immersion and interactivity affect learners differently depending on one’s spatial ability (SA). Testing features of VR (e.g., immersion and interactivity) in the context of individual differences (e.g., SA) can elucidate the cognitive mechanisms by which VR is effective. If learning from the VR lesson differs depending on an individual’s SA, this could indicate that the cognitive mechanism that causes learning gains is related to spatial processing, with practical implications for deploying specific VR features where learners will most benefit.

Cognitive Theory of Multimedia Learning

To understand how immersion and interactivity affect learning, the current experiment is framed in the context of the cognitive theory of multimedia learning (CTML; Mayer, 2020). CTML provides a cognitive framework to explain how learning occurs. One assumption of CTML is that cognition is limited by how much sensory information can be processed at once. During the process of learning, incoming sensory information from a learning environment is processed in two separate channels as either verbal information (e.g., spoken instructions) or visual information (e.g., on-screen text), and either of these channels can become overloaded by too much information. As information from the lesson comes into these two channels, the learner actively selects relevant pieces of the learning material, organizes these pieces into a mental representation, and integrates this information into long-term memory. The overall purpose of instructional design is to create a learning environment that supports the learning processes of selecting, organizing, and integrating learning material, while avoiding overloading learners with more mental demands than can be efficiently processed.

According to the CTML (Mayer, 2020), there are different types of cognitive demands that must be considered to avoid overwhelming the sensory channels during each part of the learning process. These cognitive demands include extraneous, essential, and generative processing. Extraneous processing is a mental demand that results from poor instruction that includes sensory information that is not relevant to the learning material and can overwhelm the sensory channels without contributing to the learning process. An example of extraneous processing is using complex instructions for interacting with the VR system, such that the learner is required to think about which interaction to use to manipulate an object instead of processing the learning material. The poor instruction takes away cognitive resources that could have been devoted instead to selecting, organizing, or integrating the learning material into memory. Essential processing is the mental demand associated with the learning material itself, or the complexity of the to-be-learned information. If the learning material is complex, learners must hold more information in their working memory to mentally represent it. Finally, generative processing is the mental demand associated with making sense of the learning material and consolidating it within long-term memory. In summary, essential and generative processing are productive toward the learning goal and necessary to achieve meaningful learning, while extraneous processing is unproductive and should be minimized to the greatest extent possible. In this article, using CTML as the framework, we discuss how immersion and interactivity assist with selecting, organizing, and integrating educational information and why an individual’s SA may play a role in learning outcomes in VEs.

Factors That May Impact Learning in VR

Immersion and Presence

Immersion can be considered an objective property of a technology that describes the degree to which a system delivers a virtual experience that is equivalent to a real-world experience (Slater, 2003, 2018; Slater & Wilbur, 1997). For instance, a tracked HMD that allows for physical interaction within the virtual environment (VE) is more immersive than a desktop interface that uses a mouse and keyboard. Closely related to immersion are the subjective feelings of presence induced by highly immersive environments. Viewing a VE through a 3-D immersive display, such as a HMD, can create the sense of presence or being there that may impact learning (Cummings & Bailenson, 2016; Johnson-Glenberg et al., 2014), and researchers have proposed several reasons why this may be the case. First, feeling present in a virtual space makes the learning experience more meaningful and motivating (Cummings & Bailenson, 2016; de Back et al., 2018; Johnson-Glenberg et al., 2014). For example, researchers have found that displaying a VE using a HMD resulted in a greater feeling of presence than presenting the same scene on a desktop monitor (Buttussi & Chittaro, 2017). In a meta-analysis of immersive technology, Cummings and Bailenson (2016) found that immersion was strongly related to participant-reported presence. Furthermore, in a study investigating the mediating role of presence during immersive training for spatial tasks, Parong et al. (2020) proposed that CTML may explain why presence is important for learning in immersive environments. The authors suggested that a greater feeling of presence in a VE reduces awareness of the user interface, which reduces the extraneous load of focusing on the interface instead of the learning material. Additionally, feeling present in a virtual space may be associated with more salient mental imagery that can foster generative processing.

However, issues in the literature make it hard to establish a link between immersion, presence, and learning, such as not comparing informationally equivalent systems (Johnson-Glenberg et al., 2014; Li et al., 2017). If the first assumption that presence is higher in VR systems than desktop is met, a second assumption is that higher presence leads to better learning. In fact, several studies have not found a link between presence and learning from VR for procedural tasks (Bliss et al., 1997; Buttussi & Chittaro, 2017) or conceptual learning (Makransky et al., 2019; Moreno & Mayer, 2002). Finally, this link may be tenuous because the mechanism by which an immersive technology leads to a psychological state of presence and whether that state contributes to better learning has not been established from a cognitive framework. For example, Moreno and Mayer (2002) hypothesized that immersive learning environments that induce presence may be able to direct a learner’s attention toward relevant learning material that would otherwise be focused on the instructional system’s interface. This hypothesis is in line with CTML, because the immersive environment may reduce extraneous processing associated with nonimmersive technologies while facilitating generative processing of the learning material. Moreno and Mayer’s results supported this link between more immersion and feelings of presence, but they did not find that a more immersive learning environment led to better learning outcomes. In the present study, we examine participants’ learning outcomes and feelings of presence between desktop instruction and informationally equivalent, more immersive, VR-based instruction.


Interactivity

Although immersion may stand out as a key feature of VR, interactivity may be a more effective feature for learning outcomes if interactions better assist in the processing of information than merely being in an immersive environment (e.g., Jang et al., 2017). Interactions with the VR system could affect learning by influencing how information is encoded. According to CTML, interacting with VR using gesture-based commands that correspond with the learning material may support generative processing by aiding in selecting, organizing, and integrating learning information while not overwhelming the learner’s sensory system (Mayer, 2020). First, gesturing the relevant learning information helps the learner select which information is important to encode by highlighting the critical action to be performed in a given step. Second, using gestures to perform the task encodes the procedure in an organized way as each part is performed in succession. Lastly, the gestures create an integrated link between pieces of the learning material that can be cued for recall by reenacting the gestures. This theory is supported by research showing that physically performing an action helps one remember it later (dubbed the enactment effect; Engelkamp et al., 1994; Engelkamp & Jahn, 2003). Interactivity in VR can therefore provide an avenue to support learning a procedural task by enacting the procedure with gesture interactions.

Gesture-based interactions with a VR system may also manage the cognitive processing associated with learning so that the learner is not overwhelmed by information. Interactivity may serve to reduce extraneous processing of irrelevant information if the interactions directly relate to the material to be learned (Bailey, 2017). If the interactivity is not related to the learning material, however, the interactions may increase extraneous processing, because the learner must focus on the interactions themselves in addition to the learning material. Furthermore, interactivity could help generative processing by focusing the learner’s attention on relevant pieces of information and providing physical links to each part of the task, which can help in the understanding of new learning material. The present study sought to parse out the effects of interactivity and immersion by comparing two VR conditions with two different means of interacting with the system: a less interactive condition that uses voice commands and a more interactive condition that uses arm and hand gestures. To interact with the task as closely as possible to a real-world maintenance procedure, we created a gesture-based interaction condition that is mapped to the physical actions of performing the procedure. We considered voice-based interactions to be less closely mapped to the real-world task, because one would not perform this maintenance procedure through voice commands.

Spatial Ability

In the current experiment, we investigated recall of a procedural maintenance task, and we posited that the learner’s SA may affect the extent of learning from VR, because the procedure involved spatial understanding and the ability to mentally animate the components of the task. The learners were trained on a step-by-step maintenance procedure for removing and replacing an alternator for a large machine, which involved learning where machine parts were located and recalling the next step in the procedure. Recalling this procedure involves mechanical reasoning as described by Hegarty (2004), because discrete events of the procedure are mentally animated into a causal chain of events. Hegarty and Sims (1994) asserted that mental animation in mechanical reasoning tasks is a piecemeal process in which a person infers motion in a chain of events by breaking down the chain into steps that are causally linked and then animating those links by inferring motion between steps. They further argued that SA was key for mentally animating such causal processes, because better SA helps in the inference of motion between components as well as the ability to maintain spatial information in working memory. Those with higher SA are able to recall more steps later in a causal chain than those with lower SA because later steps require more spatial working memory to process while retaining earlier steps in a procedure.

Based on these previous findings, we anticipated that learners with lower SA would struggle with the current maintenance procedure if they had difficulty in recalling either the location of the part or the next step in the procedure, because of the need to understand spatial information and causal links between steps in the procedure (e.g., a bolt must be removed before a part can be removed). Learners with higher SA, however, would be better able to infer the causal links connecting steps of the procedure and would be better able to recall the procedure via mental animation.

Training in VR may help close this gap in procedural recall between learners with higher and lower SA. Using gesture interactions to enact the procedure steps physically in VR may benefit learners with lower SA. Gestures should mitigate the limitations of lower spatial working memory by physically encoding the causal links of the procedure, providing an additional cue for recall of each step (Hegarty et al., 2005). Lending support to our hypothesis, there is evidence that lower spatial individuals perform to the level of higher spatial individuals after manipulating objects in a 3-D VE compared to those who were passively immersed in the VR (Jang et al., 2017).

Measuring Different Aspects of Spatial Ability

For the present study, we considered that SA is a multidimensional construct (Carroll, 1993), and that different aspects of SA may be relevant for our procedural maintenance task. Similarly, different aspects of SA may influence how a learner interacts with, perceives, and encodes information while learning a procedure in VEs. Lohman et al. (1987) suggested that two of the most important factors of SA were spatial visualization and spatial orientation. French’s (1951) description of spatial visualization can be summarized as the imagined manipulations of spatial stimuli in two- and three-dimensional space. For spatial orientation, Carroll (1993) as well as Kozhevnikov and Hegarty (2001) offer similar definitions: imagining how a stimulus will appear from a different perspective. We believe that both of these factors are relevant for learning in VR.

We also acknowledge that SA measures can involve testing from an allocentric (stationary viewpoint on stimulus) or egocentric (moving viewpoint around stimulus) perspective (e.g., Klatzky, 1998). The present study’s VR testbed incorporates both of these perspectives, as the user moves around in the VE (egocentric) while also manipulating the simulated objects (allocentric) to perform the maintenance task. We describe our rationale and selection of SA measures in the following paragraphs.

Hegarty and Sims (1994) asserted that high spatial visualization ability was influential for mental animation, which we consider relevant for our procedural maintenance task. To assess spatial visualization ability and understand its influence on learning a maintenance task in VR, we selected the paper folding test (PFT; Ekstrom et al., 1976; described in this article’s Materials section). Ekstrom and colleagues described the PFT within the construct of spatial visualization, and researchers have used the PFT to assess this construct (e.g., Glass et al., 2013; Kozhevnikov et al., 2010; Salthouse et al., 1990). This test can be considered an allocentric test of SA, because the stimuli are imagined from a single point of reference (Kozhevnikov et al., 2013).

For spatial orientation, some researchers have supported a link with learning in VR. Arthur et al. (1993) suggested that individuals who view objects in VR form similar spatial representations to those who view the same objects in the real world, indicating a spatial orientation-enhancing effect for VR. Later research supported this claim, suggesting that training in VR can improve performance on perspective-taking tests of spatial orientation (Chang et al., 2017). Although claims that VR can improve SA are bold, it is clear that spatial orientation is relevant in VR. Guay’s visualization of viewpoints test (VV; Guay & McDaniels, 1976; described in this article’s Materials section) has been used to assess the perspective-taking aspect of spatial orientation in a variety of domains (e.g., Berkowitz et al., 2021; Chang et al., 2017; Cohen & Hegarty, 2012; Hegarty et al., 2009). We selected the VV to assess this subconstruct of SA and its relationship with learning in VR. This test can be considered an egocentric test of SA, because it involves imagining moving one’s viewpoint in space (Kozhevnikov et al., 2013).


In this experiment, students learned a maintenance procedure in a VE presented either in VR or on a desktop computer with different methods of interaction. There were three different training groups. In the Desktop group, students learned the procedure using a desktop-based learning environment and interacted with it using a mouse. In the VR-voice condition, students learned the procedure in a VR-based environment and interacted with it using voice-based commands. In the VR-gesture condition, students learned the procedure in a VR-based environment and interacted with it using gesture-based commands.

There were three goals in this experiment. We sought to determine whether VR was more effective than traditional desktop-based training methods (immersion hypothesis). We examined whether different methods of interaction within VR would impact learning (interactivity hypothesis). Finally, we considered that SA may influence performance on this task (spatial ability hypothesis). Our hypotheses are described in Table 1.

Table 1

Experimental Hypotheses




Hypothesis | Rationale | Prediction
Immersion hypothesis | Higher immersion will increase feelings of presence during training and will reduce extraneous processing, allowing the learner to focus on productive processing. | VR conditions will have greater learning scores than the Desktop condition.
Interactivity hypothesis | Gesturing as an interaction method should be beneficial in VR, because gestures relate to learning material and assist with selection, organization, and integration of new information. | VR-gesture will have greater learning scores than the VR-voice condition.
Spatial ability hypothesis | Due to the inherent spatial nature of this maintenance task, those with high spatial ability should perform well on the task regardless of condition. Additionally, those with lower spatial ability should benefit most from the VR-gesture condition, because gesture-based interactions should foster productive cognitive processing and assist these learners with selecting, organizing, and integrating information. | Spatial ability will be correlated with learning scores. Those with lower spatial ability will have the greatest learning scores in the VR-gesture condition.

Note. VR = virtual reality.


Participants and Design

Participants were trained on a maintenance procedure task in one of three randomly assigned between-subjects training conditions: VR training with gesture (VR-gesture), VR training without gesture (VR-voice), or desktop training. A total of 101 participants were recruited from a university in the United States. Fifty-one participants self-identified as male, 46 as female, and four participants chose not to identify. The average age was 21.52 years (SD = 3.50 years). Participants received $15 an hour for up to 3 hr of participation. Because the study involved the use of VR simulations, we did not permit individuals with a history of seizures or severe motion sickness to enroll in the study. Additionally, because the testbed was designed to recognize gestures originating from a person’s right hand, we excluded left-handed people. People who were colorblind were also excluded from participating.
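The text states only that participants were randomly assigned to the three conditions; it does not say how the randomization was carried out. As one possible illustration, the Python sketch below uses blocked randomization, which keeps group sizes within one participant of each other. The function name `assign_conditions`, the participant IDs, and the seed are all hypothetical.

```python
import random
from collections import Counter

# Condition labels as used in the article.
CONDITIONS = ["VR-gesture", "VR-voice", "Desktop"]


def assign_conditions(participant_ids, seed=None):
    """Assign each participant to one condition using shuffled blocks of three.

    Assumption: blocked randomization. The article does not specify the
    randomization scheme that was actually used.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    assignment = {}
    for start in range(0, len(ids), len(CONDITIONS)):
        block = CONDITIONS[:]      # fresh copy of the three conditions
        rng.shuffle(block)         # random order within this block
        for pid, condition in zip(ids[start:start + len(CONDITIONS)], block):
            assignment[pid] = condition
    return assignment


# Hypothetical usage with 101 recruited participants.
groups = assign_conditions(range(1, 102), seed=2022)
print(Counter(groups.values()))
```

Simple (unblocked) randomization would also satisfy the description in the text, at the cost of potentially less balanced group sizes.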


The testbed for this experiment was a procedural lesson presented on either a desktop computer or VR system that was informationally equivalent between groups. The lesson taught a maintenance procedure for an E-28 arresting gear, which is a large piece of machinery used to stop tailhook-equipped aircraft on a shore-based runway. The maintenance task consisted of 22 steps and was developed using the Unity 3-D game engine. Figure 1 shows a screenshot from the testbed.

Figure 1

VR and Desktop Versions of Testbed

Note. In this step, participants verified the engine was off. The VR conditions (Left) included toolboxes for part and tool selection, and participants could walk and move around the environment to change their view. The Desktop condition (Right) used buttons for tool and part selection and changing the camera perspective. VR = virtual reality.

The training environment was the same for each condition, but the conditions differed in whether participants used a 2-D desktop monitor (Desktop condition) or a 3-D stereoscopic HMD (VR groups used the Oculus Rift V2) and how participants interacted within the VR (i.e., gesture or voice commands). For each step, participants had to select the correct component of the machine on which to perform the action, and several steps required that they also select the proper tool to perform the step. For example, to remove the lead from the negative battery terminal, participants would equip the terminal puller tool and select the negative battery terminal. In the Desktop condition, participants used a mouse to make these selections and a keyboard/mouse to navigate around the space. For the VR conditions, participants selected the appropriate action to perform the step using either voice or gesture commands; the five actions (e.g., select, open, close, remove, and replace object) are listed in Table 2. Participants in the VR-voice condition spoke the words aloud to complete the procedure. For example, in the VR-voice condition, participants used their hand to hover over an object and would say aloud the action to perform, such as “Remove.” In the VR-gesture condition, participants used gestures designed to represent the action being performed. First, participants would use their hand to hover over an object, close their fist to select it, and then make the appropriate “Remove” gesture to perform the action (see Table 2). The gesture-based commands were recognized by the Microsoft Kinect V1 infrared motion-tracker; however, the experimenter was able to advance the participant to the next step manually if the Kinect did not recognize the correct gesture being performed.
In both VR conditions, participants could walk around the room to move around in the VE, but each step was located in the same general area of the machine so not much walking around was required to complete the task. The headset was calibrated to each individual’s height prior to the training.

Table 2

Commands for Interacting With the VR System

Intended action | Voice command | Gesture command description
Select (grab) a tool or part | | Place right hand over object and close fist
Open a box | | Move right fist out to right side, making a 90-degree angle at the elbow
Close a box | | Move right fist from chest height to below right hip
Remove a part | | Move right fist out to right side, keeping arm parallel to the ground
Equip a part | | Move right fist from right hip to left shoulder

Note. Both groups used five interaction commands that represented the same five actions. The VR-voice group spoke the voice commands aloud, and the VR-gesture group performed the described gesture commands. VR = virtual reality.


Demographic Questionnaire

The demographic questionnaire consisted of 25 questions on topics such as age, gender, experience as a mechanic and working on engines, and experience using computers, VR, and video games.

Paper Folding Test

The paper folding test (PFT; Ekstrom et al., 1976) consisted of 10 items measuring spatial visualization to be completed in 3 min. Each item contains a series of images depicting a piece of paper being folded with a hole punched through it. Participants selected one out of five possible images depicting how the piece of paper would look after it is unfolded. The PFT is scored by taking the number of correct responses minus one fifth of the number of incorrect responses.

Visualization of Viewpoints

The visualization of viewpoints task (VV; Guay & McDaniels, 1976) consisted of 24 questions in which participants must identify the position from which a three-dimensional object was viewed. In this task, the stimuli were depictions of three-dimensional shapes suspended inside a box. Participants were instructed to select the corner of the box that corresponded to the viewpoint from which the rotated object was being viewed. Before beginning the task, participants received two example problems to familiarize them with the task. Participants had 8 min to complete as many items as possible. The VV task provides an estimate of a participant’s SA and is scored by taking the number of correct responses minus one sixth of the number of incorrect responses.
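Both the PFT and the VV are described as scored by subtracting a fraction of the incorrect responses from the number of correct responses (one fifth for the PFT, one sixth for the VV). A minimal Python sketch of that rule follows. The function name `penalized_score` and the example response counts are hypothetical, and the sketch assumes that unanswered items count as neither correct nor incorrect, which the text does not state.

```python
from fractions import Fraction


def penalized_score(n_correct: int, n_incorrect: int, penalty: Fraction) -> Fraction:
    """Return correct responses minus `penalty` times incorrect responses.

    Assumption: items left blank are not counted as incorrect.
    """
    if n_correct < 0 or n_incorrect < 0:
        raise ValueError("response counts must be non-negative")
    return n_correct - penalty * n_incorrect


# Penalty fractions as stated in the measure descriptions.
PFT_PENALTY = Fraction(1, 5)  # paper folding test
VV_PENALTY = Fraction(1, 6)   # visualization of viewpoints

# Hypothetical participant: 7 correct and 2 incorrect on the PFT,
# 15 correct and 6 incorrect on the VV.
pft = penalized_score(7, 2, PFT_PENALTY)
vv = penalized_score(15, 6, VV_PENALTY)
print(float(pft), float(vv))  # prints: 6.6 14.0
```

Using exact fractions avoids floating-point rounding until the final conversion for display.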

Nomenclature Quiz

After participants viewed an instructional tutorial that described the task and named parts of the machinery, they completed a short nomenclature quiz to confirm that they comprehended the tutorial and could name key parts of the machinery. Participants were asked to label machine parts and answer questions regarding the purpose of the machinery.

Mental Effort (ME) Rating Scale

Participants were asked to rate their level of mental effort after each training phase on a scale of 1–9, with higher ratings indicating more mental effort (Paas, 1992).

Motion Sickness Susceptibility Questionnaire

The Motion Sickness Susceptibility Questionnaire (MSSQ; Golding, 1998) consists of two subsections of self-reported motion sickness symptoms in childhood and as an adult, which were used to index participants’ motion sickness susceptibility.

Simulator Sickness Questionnaire

The Simulator Sickness Questionnaire (SSQ; Kennedy et al., 1993) consists of 16 items that assess three different dimensions of simulator sickness symptoms: nausea, oculomotor, and disorientation. Participants responded to each item by indicating the extent to which they were currently experiencing a symptom (e.g., fatigue, eyestrain, and dizziness) on a scale of 0 (not at all) to 3 (extremely).

Presence Questionnaire

The Presence Questionnaire (PQ; Witmer & Singer, 1998) asked participants to respond to 19 items on the extent to which they felt they were in the VE. These items were divided into four subscales measuring different aspects of presence: involvement, sensory fidelity, adaptation/immersion, and interface quality. For each item, participants were asked to rate their agreement on a scale of 1 (strongly disagree) to 5 (strongly agree).

System Usability Scale

The System Usability Scale (SUS; Brooke, 1996) included 10 items that assessed how usable participants felt the system was (either desktop or VR), rating each item on a scale from 1 to 5, with higher numbers indicating better usability.
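The text does not state how SUS responses were aggregated. For orientation, the conventional SUS scoring procedure, in which odd-numbered items are positively worded and even-numbered items negatively worded, can be sketched as below. Treat this as an assumption about common practice rather than a description of the scoring used in this study; the function name is hypothetical.

```python
def sus_score(responses: list[int]) -> float:
    """Conventional SUS score (0-100) from ten responses on a 1-5 scale."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected ten responses, each between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items: higher agreement is better
        else:
            total += 5 - response   # even items: lower agreement is better
    return total * 2.5


print(sus_score([3] * 10))                        # 50.0
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```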

Recall Measure

Participants were asked to write down the steps in order for the maintenance procedure of removing and replacing the alternator and list all tools and parts needed for each step. Participants were given 5 min to list as many of the 22 steps to the procedure as they could recall. To score this measure, one point was given for each correct step, and a bonus point was given if the response also listed the correct tool to be used for that step.
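The recall scoring rule can be sketched as follows. The class and function names are hypothetical, and the sketch assumes that a tool bonus is awarded only when the step itself is correct. The percentage is based on correctly listed steps out of 22, which is how recall is expressed in the figures.

```python
from dataclasses import dataclass

TOTAL_STEPS = 22  # number of steps in the maintenance procedure


@dataclass
class ScoredStep:
    step_correct: bool  # rater judged the written step correct
    tool_correct: bool  # rater judged the listed tool correct


def score_recall(responses: list[ScoredStep]) -> tuple[int, float]:
    """Return (total points, percentage of the 22 steps recalled)."""
    steps = sum(1 for r in responses if r.step_correct)
    bonus = sum(1 for r in responses if r.step_correct and r.tool_correct)
    return steps + bonus, 100.0 * steps / TOTAL_STEPS


# Hypothetical participant: 11 correct steps, 4 of them with the correct tool,
# plus one incorrect step that happened to name a valid tool (no credit).
example = (
    [ScoredStep(True, True)] * 4
    + [ScoredStep(True, False)] * 7
    + [ScoredStep(False, True)]
)
print(score_recall(example))  # (15, 50.0)
```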


Participants completed the experiment individually in a lab setting. The experimenter provided an informed consent document and a brief description of the experiment. After consenting to participate, participants completed the demographics, MSSQ, PFT, and VV. Participants were then directed to read a slideshow tutorial on a computer. The tutorial explained the maintenance task they would be learning, the nomenclature for the machine parts, and showed them how to interact with the VR system or desktop computer system, depending on their condition. The tutorials were identical among conditions, except for the description for method of interaction. The VR-gesture and VR-voice conditions read the description of the five gestures or voice commands they would be using to interact with the VR system (Table 2). The interaction instructions for the Desktop condition were to use the mouse and arrow keys to move around the desktop trainer and to click the mouse to interact. Following the tutorial, participants completed a short nomenclature quiz to ensure they understood the tutorial and could recall the names of the parts they would be interacting with in the maintenance trainer. Participants then completed the SSQ for the first time to establish a baseline prior to performing in the VE.

All conditions were then given a short practice phase in their respective testbeds to become comfortable with the interactions. For the VR testbed, the VR-gesture and VR-voice conditions practiced the gesture and voice commands with the experimenter. Then, they were fitted with the HMD and calibrated in the VR system. For the Desktop condition, the participants were instructed to use the mouse and keyboard to interact with the system. All conditions completed a practice phase of removing and replacing the engine cage of the arresting gear. During the practice phase, participants learned to interact with the interface and could ask questions to ensure they understood the interaction methods.

After the practice phase, participants completed the first training phase to learn the task of removing and replacing the alternator. During this first training phase, instructional scaffolding was provided: the instructions for each step were narrated, short text instructions were provided onscreen for each step, and the next part was highlighted. Immediately following the first training phase, participants were asked to rate their level of mental effort on the task in the first ME scale. The second training phase was identical to the first, except less instructional scaffolding was provided: the instructions were narrated and step text was provided, but the relevant parts were no longer highlighted. This necessitated that the participants recall where each part was located from the first training phase. Participants then rated their mental effort again in the second ME scale. Finally, participants completed the last training phase in which they performed the task of removing and replacing the alternator without any instructional scaffolding. Participants completed the task from memory without the help of narrated instructions, step text, or highlighting. Then, participants responded to a final ME scale.

Following the training phases, participants completed a second measure of the SSQ, the PQ, and the SUS. Finally, participants completed the procedural recall measure and upon completion, were thanked and debriefed.

In accordance with 32 CFR Part 219 and DoD Instruction 3216.02, data availability is limited. Additional use of the data for this study may be granted to nongovernment agencies or individuals by the Navy Surgeon General following the provisions of the Freedom of Information Act or contracts and agreements. All measures and statistical tests pertinent to the three stated research goals are reported herein.


Data Prescreening

We excluded a total of 18 participants from our analyses for the following reasons: eight participants experienced technical issues, five had previous VR experience, five reported relevant experience with car or mechanical repairs (e.g., rebuilding an engine), three ended their participation early, and one experienced moderate simulator sickness. Our rationale for these exclusion criteria was to ensure that previous knowledge was not responsible for a participant’s learning (i.e., previous VR experience or maintenance experience) and that extraneous factors that might impact the learning experience were excluded (such as technical issues or motion sickness). Some of the participants who were excluded met multiple exclusion criteria. Therefore, the analyses below include data from 83 participants.

Do the Groups Differ on Learning Outcomes?

To examine differences among the three learning conditions, we conducted a one-way analysis of variance (ANOVA) on total procedural steps recalled on the written recall test. The analysis was not statistically significant, F(2, 80) = 0.60, p = .551 (descriptives provided in Table 3). This result suggests that, in general, the three learning conditions did not differ in the number of procedural steps recalled. These results do not support the immersion or interactivity hypotheses that VR technology would lead to better learning outcomes than desktop training.
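For readers less familiar with the test, the F ratio in a one-way ANOVA is the between-group mean square divided by the within-group mean square. The sketch below computes it from scratch on made-up scores (not study data); with 83 participants in three groups, the same formulas give the degrees of freedom of 2 and 80 reported above.

```python
from statistics import mean


def one_way_anova(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical recall scores for three small groups.
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(2, 6) = 3.00
```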

Table 3

Percentage of Procedural Steps Recalled by Condition



Columns: M (%); SD (%)

Note. VR = virtual reality.

What Other Factors May Explain the Lack of Differences in Learning?

To examine why these hypotheses were not supported, we considered four self-report measures: the PQ (for presence), the SUS (for usability), the SSQ, and ME scale. We also examined the time participants spent during training. If there were differences among our conditions on these variables, they could elucidate why these hypotheses were not supported or provide justification for further analysis. Results from these analyses are reported in Table 4. Overall, none of these analyses suggest these variables confounded our results, but we do consider time in training in subsequent analyses.

Table 4

Follow-Up Analyses of Potential Explanatory Variables





PQ—Immersion subscale

ANOVA on posttraining self-reports

F (2, 80) = 0.02, p = .98

No condition was perceived as inducing more presence than any other.

SUS

ANOVA on posttraining self-reports

F (2, 80) = 0.78, p = .46

No condition was perceived more usable than any other.

SSQ: Nausea, oculomotor, and disorientation subscales

ANOVA on posttraining self-reports of sickness symptoms
Symptoms are rated from 0 (none) to 3 (severe)

F (2, 80) = 3.97, p = .023 for oculomotor (all others not significant)
VR-voice (M = 0.30, SD = 0.35) had higher reported symptoms than Desktop (M = 0.13, SD = 0.18; HSD p = .038)
Correlation of oculomotor symptoms and recall performance: r(81) = −.04, p = .74

Differences in oculomotor symptoms exist between groups, but they are low considering the response scale and are unrelated to performance.

Mental effort

ANOVA for each of three measurement points

Time 1: F (2, 80) = 0.99, p = .38
Time 2: F (2, 80) = 0.15, p = .86
Time 3: F (2, 80) = 0.31, p = .74

No condition was perceived to require more mental effort than any other.

Time in training

ANOVA for total time (in minutes) spent in training

F (2, 80) = 20.07, p < .001
Desktop (M = 24.42, SD = 7.40) was faster than VR-voice (M = 32.55, SD = 8.26; HSD p = .001) which was faster than VR-gesture (M = 38.18, SD = 9.41; HSD p = .047)

The Desktop condition was faster than both VR conditions, and VR-voice was faster than VR-gesture. Time in training is therefore a potential explanation for learning and is examined in a follow-up ANCOVA.

Learning recall adjusted for time in training

ANCOVA with time in training as covariate

Overall: F (5, 77) = 5.53, p < .001
Condition: F (2, 77) = 0.26, p = .78
Time: F (1, 77) = 23.66, p < .001
Condition × Time: F (2, 77) = 1.95, p = .15

Unlikely explanation for learning. Less time in training was a general predictor of performance, which did not differ by condition.

Note. PQ = Presence Questionnaire; ANOVA = analysis of variance; SUS = System Usability Scale; SSQ = Simulator Sickness Questionnaire; VR = virtual reality; ANCOVA = analysis of covariance; HSD = honest significant difference.

Does Spatial Ability Predict Learning Outcomes?

To examine whether SA was related to performance, we correlated both of our SA measures with recall performance. We identified significant zero-order correlations between SA (as measured by the VV and PFT) and procedural recall: for VV, r(81) = .45, p < .001; for PFT, r(81) = .49, p < .001. These indicate moderately strong relationships, such that those with higher SA tended to recall more steps of the maintenance procedure. The two measures of SA were also moderately correlated with one another, r(81) = .60, p < .001, suggesting that the measures are related but that each may explain unique variance in SA. Next, we examined whether this relationship differed depending on the training condition participants received.

How Does Spatial Ability Affect Learning by Condition?

In examining our SA hypotheses, we considered that both of our measures of SA (PFT and VV) could influence written recall performance, and that one or both measures could interact with the learning condition as predicted by the SA hypothesis. We also considered that time spent in training could explain additional variance in performance. We regressed written recall performance on these variables using Hayes’s (2017) PROCESS macro Version 3.5.3 for SPSS. The analysis revealed a significant condition by VV interaction, but not a condition by PFT interaction; therefore, only the product terms for condition and VV are retained in the model, and PFT is retained as a covariate.

The first model regressed recall performance on condition, both SA measures, and time spent in training, and was significant, F(5, 77) = 7.43, p < .001, R² = .33, adjusted R² = .28, accounting for 33% of the variance in recall performance. In the next model, we added the product terms for condition and VV scores. This overall model was significant, F(7, 75) = 6.94, p < .001, R² = .39, adjusted R² = .34, accounting for 39% of the variance in recall performance. The addition of the product terms significantly improved the variance explained, indicating that VV scores moderated the effect of condition, F(2, 75) = 4.19, p = .019, ΔR² = .07 (see Table 5 for regression coefficients for both models). After inclusion of these product terms, time spent in training and PFT scores still explained unique variance in performance.
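The degrees of freedom and the logic of the model-comparison test can be illustrated with the standard formulas for adjusted R-squared and the F test for a change in R-squared. The sketch below plugs in the rounded R-squared values reported above (.33 and .39), so the recomputed statistics only approximate the reported ones; the function names are hypothetical.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for a model with k predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_change(r2_reduced: float, r2_full: float, n: int,
             k_reduced: int, k_full: int) -> tuple[float, int, int]:
    """F test for the increase in R-squared when predictors are added."""
    df_num = k_full - k_reduced   # number of added predictors
    df_den = n - k_full - 1       # residual df of the larger model
    f_stat = ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)
    return f_stat, df_num, df_den


N = 83  # participants retained for analysis
f_stat, df_num, df_den = f_change(0.33, 0.39, N, k_reduced=5, k_full=7)

# The inputs are rounded to two decimals, so F here is only approximate.
print(f"F({df_num}, {df_den}) is roughly {f_stat:.2f}")
print(f"adjusted R-squared, Model 1: about {adjusted_r2(0.33, N, 5):.2f}")
print(f"adjusted R-squared, Model 2: about {adjusted_r2(0.39, N, 7):.2f}")
```

The degrees of freedom, 2 and 75, match those reported for the test of the product terms.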

Table 5

Regression Coefficients Predicting Percentage Procedural Steps Recalled


Columns: B (%); SEB (%); 95% CIB

Model 1

Model 2

VR-gesture × VV

VR-voice × VV






Note. Condition was dummy coded for regression entry, with the Desktop condition set as the reference. Time was cumulative time spent in training in minutes. Unstandardized coefficients are provided due to categorical data present in model. CIB = confidence interval; SEB = standard error; VR = virtual reality; PFT = paper folding test; VV = visualization of viewpoints score.

* p < .05. ** p < .01.

To understand the nature of this moderating effect of VV scores, the PROCESS macro probes the moderator at the mean and ±1 SD (as recommended by Aiken & West, 1991; Holmbeck, 2002). These probing values are used to compare recall performance at low (−1 SD), mean, and high (+1 SD) levels of the moderator, VV score (probing results provided in Table 6, and plotted in Figure 2). The results of the probing analysis revealed that VR-gesture had greater recall than VR-voice and desktop at lower levels of SA.
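Probing a moderator in this way amounts to evaluating the conditional effect of a condition dummy at chosen moderator values: the dummy's coefficient plus the product-term coefficient multiplied by the moderator value. The sketch below illustrates the arithmetic with made-up coefficients and VV descriptives; none of the numbers come from Table 5 or Table 6, and the function names are hypothetical.

```python
def conditional_effect(b_condition: float, b_interaction: float,
                       moderator_value: float) -> float:
    """Effect of a condition (vs. the reference group) at a given moderator value."""
    return b_condition + b_interaction * moderator_value


def probe(b_condition: float, b_interaction: float,
          mod_mean: float, mod_sd: float) -> dict[str, float]:
    """Conditional effects at -1 SD, the mean, and +1 SD of the moderator."""
    return {
        label: conditional_effect(b_condition, b_interaction, mod_mean + k * mod_sd)
        for label, k in (("-1 SD", -1), ("mean", 0), ("+1 SD", 1))
    }


# Hypothetical values: a condition-vs-Desktop coefficient of 30.0 points,
# a condition x VV product-term coefficient of -1.5, VV mean 12, VV SD 5.
effects = probe(b_condition=30.0, b_interaction=-1.5, mod_mean=12.0, mod_sd=5.0)
print(effects)  # {'-1 SD': 19.5, 'mean': 12.0, '+1 SD': 4.5}
```

With a negative product term, as in this made-up example, the advantage over the reference group shrinks as the moderator increases, which is the general pattern the probing analysis is designed to reveal.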

Table 6

Moderator Probing Results for Percentage Procedural Steps Recalled

Columns: B (%); SEB (%); 95% CIB

−1 SD a

+1 SD














Note. Probing results were compared against the Desktop condition, which was set as the reference. Unstandardized coefficients are provided due to categorical data present in model. CIB = confidence interval; VV = visualization of viewpoints score; SEB = standard error; VR = virtual reality. a Test of equality of conditional means was significant at this level of moderator; F (2, 75) = 4.65, p = .013. VR-gesture was greater than VR-voice; B = 12.19%, SEB = 5.78%, t(75) = 2.11, p = .038, 95% CI (0.67%, 23.71%).

** p < .01.

Figure 2

Moderating Effect of Spatial Ability on Condition Predicting Procedural Recall Performance
Note. Recall performance is depicted in percent correct, calculated from the number of correctly listed steps divided by the total number of steps (22). ±1 SD bands are displayed around the simple slopes for each condition. Participants with lower visualization of viewpoints (VV) scores (x-axis) performed better when receiving VR-gesture training, as compared to VR-voice and desktop training. Participants with higher VV scores performed better than participants with lower spatial ability in the Desktop group. Paper folding test (PFT) scores are included as a covariate in this model, which was a general predictor of performance in all conditions. VV = visualization of viewpoints; VR = virtual reality; PFT = paper folding test.

Our analysis also identified a significant positive effect for VV scores within the Desktop group, B = 1.5%, t(75) = 3.59, p = .001, such that those with higher VV scores performed better in the Desktop condition than those with lower VV scores. There were no significant differences in recall performance among conditions at mean or high levels of VV. To aid in understanding and visualization of this effect, we have presented the absolute scores of our instructional conditions, grouped by the probing values used to test our regression analysis (mean and ±1 SD VV scores; see Figure 3).

Figure 3

Performance by Probed Spatial Ability Category
Note. The data in this figure represent the absolute scores achieved for participants in each instructional condition, categorized by their scores on the visualization of viewpoints (VV) test. VV Low represents those who had lower than 1 SD below the mean score on VV, VV Med represents those who were between 1 SD above and below the mean score on VV, and VV High represents those who were greater than 1 SD above the mean VV score. Error bars are ±1 SE of the mean. These data are presented to aid in visualization of recall performance; for inferential statistics examining group differences at each level of spatial ability, refer to Table 6 and its associated regression analysis. VV = visualization of viewpoints.
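The grouping used for this figure can be sketched as a simple cut at 1 SD below and above the mean VV score. The scores below are made up, the function name is hypothetical, and the sketch assumes the sample standard deviation (n − 1 denominator) and that scores falling exactly on a cutoff are placed in the middle group; the text does not specify either detail.

```python
from statistics import mean, stdev


def vv_category(score: float, vv_mean: float, vv_sd: float) -> str:
    """Assign a VV score to the Low / Med / High group used for plotting."""
    if score < vv_mean - vv_sd:
        return "VV Low"
    if score > vv_mean + vv_sd:
        return "VV High"
    return "VV Med"


# Hypothetical VV scores for seven participants.
scores = [2, 8, 10, 12, 14, 16, 22]
vv_mean, vv_sd = mean(scores), stdev(scores)

categories = [vv_category(s, vv_mean, vv_sd) for s in scores]
print(categories)
```

As the figure note states, this categorization is for display only; the inferential tests treat VV as a continuous variable.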


The purpose of this experiment was threefold. First, we tested the immersion hypothesis by comparing VR- and desktop-based training on students’ learning outcomes. Second, we examined the interactivity hypothesis by comparing two methods of interacting within a VR training environment, gesturing or voice commands. Third, we explored the impact of SA on learning in VEs. Overall, the results did not support the immersion hypothesis, such that there were no significant differences in learning between the VR and Desktop conditions. Furthermore, we did not find straightforward evidence for the interactivity hypothesis, because the VR-gesture and VR-voice conditions did not differ significantly on the recall test. However, the results were qualified by SA, such that individuals with low SA benefited from the VR-gesture condition but did not experience similar benefits in the other two conditions.

Considering the immersion hypothesis, we predicted that procedural recall would be greater in the two immersive VR conditions relative to a nonimmersive Desktop condition. Per CTML, immersion may reduce extraneous processing by inducing feelings of presence and occluding distractions outside of the VR headset, allowing the learner to focus on processing the learning material (Parong et al., 2020). However, the data did not support this hypothesis. Considering that the Desktop condition was viewed through a computer monitor, and the VR conditions were viewed with an HMD, it seems less tenable that the Desktop condition was similarly immersive to the VR conditions. One possibility is that the immersive cues in the VR conditions had no influence on performance for this particular task (Barrett & Hegarty, 2016). Another possibility is that the immersive properties of the VR were a source of extraneous processing. For instance, Makransky et al. (2019) found that learning outcomes were lower in VR than desktop when learning a laboratory task, and the authors posited that the perceptual realism in highly immersive environments can distract learners and interfere with generative processing (see also Moreno & Mayer, 2002). According to theory, immersion can induce a feeling of presence that should lead to a more meaningful and motivating learning experience (e.g., Lombard & Ditton, 1997), but in this experiment, we found no differences in measures of presence across the conditions. In spite of being more immersive, it is possible that the VR training conditions did not induce greater levels of presence than the Desktop condition, calling into question the assumption that higher immersion necessarily leads to higher feelings of presence. Alternatively, it could be that the PQ was not sensitive enough to discriminate among our conditions (see Slater, 1999, for a criticism of the PQ).

To address the interactivity hypothesis, we separated interactivity from immersion by comparing two VR conditions with different interaction methods: VR-gesture and VR-voice. The VR-voice condition was designed to have the same immersive qualities as the VR-gesture condition, while minimizing the amount of interaction by replacing gestures with pointing and voice commands. Although we predicted based on CTML that VR gestures would support generative processing by helping participants select, organize, and integrate learning information, the data did not directly support the interactivity hypothesis because the VR-gesture group did not recall more steps in the procedure than the VR-voice group. One explanation could be that pointing in the VR-voice condition was perceived as a meaningful gesture, because the pointing action was paired with a voice command that provided context to the motion, helping in the selection of relevant information. Participants may have related it to the learning material so that the advantage of the VR-gesture condition was reduced (Bailey, 2017; Novack et al., 2016). Similarly, participants in both VR conditions were able to walk within the VE, which may have increased the interactivity present in the VR-voice condition. However, the task was concentrated on the same general area of the machine, so not much walking was required to complete the task. Nonetheless, it is important to consider how SA plays a role in moderating learning outcomes. In particular, our data support the interactivity hypothesis among participants with lower SA.

Spatial Ability Affects Learning in VR

This experiment investigated how SA plays a role in affecting learning outcomes within different learning environments. Specifically, the SA hypothesis stated that individuals with higher SA would perform better on the procedural recall test than individuals with lower SA. We also predicted that SA would moderate performance, such that individuals with low SA would perform the best in the VR-gesture condition due to the additional encoding cues and enactment that support learning. Based on the results, we found support for the SA hypothesis. Both measures of SA showed moderately strong, positive correlations with procedural recall performance across conditions. This is consistent with the argument that individuals with higher SA devote fewer cognitive resources to process spatial relationships associated with causal links in the procedure, freeing working memory for encoding the steps to memory (Mayer & Sims, 1994).

Consistent with our hypothesis, individuals with lower VV scores showed greater recall in VR-gesture relative to the VR-voice and Desktop conditions, and their recall performance in VR-gesture was similar to individuals with higher VV scores. Individuals with higher VV scores fared similarly in all conditions, suggesting that SA compensated for less immersive and less interactive instruction (i.e., desktop training). These outcomes suggest that the interactive elements in the VR-gesture condition helped support individuals with lower SA. Interactivity, in the form of enactment, may have mitigated the disadvantages of low SA by offering a way to encode information that did not require mentally animating the causal process of the procedure. From a CTML perspective, enactment in the VR-gesture condition may have managed essential processing for individuals with low SA by offloading working memory demands to the motor system, freeing up resources to encode the procedure to memory (Engelkamp & Jahn, 2003; Hegarty et al., 2005). Additionally, gestures may have helped individuals with low SA encode spatial information more efficiently by offering a mechanism individuals can use to help infer motion from one part to the next, reducing the cognitive resources required for mental animation (Hegarty, 2004). This finding that gestures help later recall of a procedure is not specific to the VR technology. Gestures could be advantageous for learning from other technologies that employ gestures without an immersive display (i.e., natural user interfaces); however, this result does indicate which feature of VR is most beneficial during VR training. Overall, the study provides evidence that interactivity within VR can produce a high level of performance across individuals with different levels of SA.

Alternatively, these results may be explained by participants’ overall working memory capacity as opposed to their SA. The measures of SA that we used during this task were chosen because they measure a participant’s spatial orientation (VV) and spatial visualization (PFT), which are dimensions of SA that are likely to be used by learners during a VR procedural training task. These include an individual’s ability to imagine stimuli from different perspectives (VV) and to imagine manipulations of spatial stimuli (PFT), which both necessitate holding spatial information in working memory. Working memory capacity was not measured during this experiment, but future research on VR training should include participants’ working memory to understand the role of working memory during VR training in conjunction with SA.

Practical Implications

When looking at the data only from a group-level perspective, there were no differences in learning outcomes between the VR- and desktop-based training conditions on the procedural recall test. However, the results of this experiment demonstrate that individual differences, specifically SA, are important considerations when determining the effectiveness of immersive learning environments. The results suggest that instructional designers should consider including gestures as an interactive element within VR-based learning environments for procedural tasks, because gesture-based interactions in VR benefited individuals with lower SA without causing a negative impact to those with higher SA. Gestures offer an additional encoding mechanism that supports selecting, organizing, and integrating the information presented, leading to higher recall of the procedure for those with lower SA. Certainly, careful consideration is needed when designing these gestures to ensure they are as intuitive as possible so as not to induce extraneous processing (Bailey, 2017).

Limitations and Future Directions

One limitation of this experiment was that the gestures students used to interact with the system had to be learned since they symbolically represented the intended action. Although the gestures were designed to be intuitive, the gestures required students to make large movements with their arms in order to be recognized by the tracking system. As a result, students had to recall which gesture to perform from memory for a given action, which may have required more cognitive resources than more natural gestures would. In spite of evidence suggesting that symbolic gestures that reflect the action being performed are more usable and induce higher levels of presence than arbitrary gestures, utilizing more realistic gestures may produce stronger effects in support of the interactivity hypothesis (Bailey et al., 2018). As gesture recognition technology continuously improves, more research is needed to explore the impact of using more natural, intuitive gestures that may not require upfront training on learning outcomes. Another limitation concerns how memory of the procedure was assessed. An ideal measure of learning would have been performing the maintenance procedure on a real-world arresting gear, which unfortunately was not feasible in this case. Instead, we used a written measure of recall, which may be a less sensitive measure of participants’ procedural learning. Certainly, more research is needed to determine how well learning in immersive environments, such as VR, impacts real-world performance (Abich et al., 2021; Kaplan et al., 2021; Whitmer et al., 2019).

The present study assessed SA using an egocentric measure, which involves the ability to imagine spatial stimuli from different perspectives (see Klatzky, 1998, for a discussion of egocentric and allocentric SA). We considered this particularly relevant for the egocentric viewpoints presented in typical VR environments. Our results suggest that egocentric tests of SA may be a discriminating factor for identifying who learns best in VR. Individuals with poor egocentric spatial abilities may benefit from egocentric spatial cues or interactivity in VR. These results are consistent with other research examining egocentric spatial abilities and learning in VEs (Kozhevnikov et al., 2013). Future research is needed to determine how egocentric and allocentric aspects of SA impact learning in VEs.

In the present study, we assessed SA using two measures and analyzed it as a continuous variable. This permits analyzing SA without loss of statistical power due to categorization (see Pedhazur, 1997). One limitation is that any claims associated with continuous variables using this approach may be biased based on the range of performance observed. Therefore, “lower,” “medium,” and “higher” SA are not absolute, but relative to the SA range of the sample we observed. However, in the present study, both the PFT and VV measures spanned the full range of possible scores. Therefore, it is unlikely that our analyses involve a restricted range of performance in SA.

In addition, if these variables are normally distributed, there necessarily will be more scores around the mean and fewer scores in the lower and higher range. With the analysis we employed, it is conventional to probe using 1 SD above and below the mean (e.g., Aiken & West, 1991; Holmbeck, 2002) when examining the nature of the difference in slopes revealed by the regression analysis, but due to the nature of normal distributions, fewer scores will be estimated on the lower and higher end of a continuous measure than at the mean. In the present study, both SA measures were normally distributed (Kolmogorov–Smirnov normality statistics were .08 and .09 for PFT and VV, respectively, ps > .19), so there is the possibility that the analysis we employed could have over- or under-estimated the true effect of lower or higher SA. It is worth noting that this is a bigger problem for ANOVA-type analyses which require categorized groups (violating the assumption for equal sample sizes), since regression is natively suited to analyze continuous variables.

With these limitations in mind, an alternative approach is to treat SA as a quasi-experimental variable and assign participants to SA groups (e.g., “low” or “high” SA) prior to an experiment. This is a conventional approach when testing research questions about SA (e.g., Kozhevnikov et al., 2002; Mayer & Sims, 1994). For analyses involving multiple group comparisons, this approach would yield more statistical power, and an experimenter would have greater ability to represent those with lower and higher levels of SA. A limitation of this approach is that it does not avoid the statistical issues of categorization. However, as an alternative approach to regression, it could provide converging evidence for research questions about SA. Future research should consider employing this approach with SA in the context of learning in VR.


We investigated two features afforded by VR systematically (immersion and interactivity) that were predicted to benefit learning a procedural task. Based on the results of the study, we did not find evidence to suggest that immersion uniquely contributes to increasing learning outcomes of a procedural task, nor did we identify differences in perceived presence between the more immersive VR conditions and the Desktop condition. However, the interactivity afforded by VR, through the use of gestures, benefited individuals with lower SA on learning a procedural task without causing a negative impact to individuals with higher SA. As VR technologies continue to advance and become more accessible, continued research is needed to understand how, why, and for whom VR is effective for learning.
