
Conversational User Interfaces: Explanations and Interactivity Positively Influence Advice Taking From Generative Artificial Intelligence

Volume 5, Issue 4, https://doi.org/10.1037/tmb0000136

Published on Dec 09, 2024

Abstract

Generative artificial intelligence (GenAI) has surged in popularity with the implementation of conversational user interfaces. The possibility to engage in natural conversations with GenAI thus constitutes a promising countermeasure to algorithm aversion, or the general preference for humans over algorithms, which hinders the effective utilization of often superior algorithmic output. In this study, we experimentally test the influence of explanations and interactivity on advice taking from GenAIs. In a judge–advisor system, 472 participants (313 females, 154 males, five diverse; median age = 23 years) solved a series of 10 estimation tasks with access to pregenerated output from ChatGPT. In the control condition, only the numerical output was provided as advice. Participants in the treatment conditions were additionally provided with, or could request, a detailed explanation of the rationale underlying ChatGPT’s judgments. The weight of advice was positively influenced by both the opportunity to interact with the GenAI and the receipt of an explanation. Moreover, actively requesting an explanation significantly enhanced the positive effect of interactivity compared to trials in which this opportunity was forgone. However, there was no evidence that the weighting of algorithmic advice was influenced by whether an explanation was provided or requested. The inherent explanatory capabilities of GenAI and the opportunity to interactively engage with it independently increase users’ advice taking. This finding underscores the potential of conversational user interfaces to large language models such as ChatGPT to improve individuals’ augmented judgment and decision making, but it also poses a threat to human autonomy in interactions with conversational GenAI systems.

Keywords: generative artificial intelligence, large language model, conversational user interface, advice taking, algorithm aversion

Funding: This research was funded by the Deutsche Forschungsgemeinschaft, Grant 2277, Research Training Group “Statistical Modeling in Psychology.”

Disclosures: The authors have no known conflicts of interest to disclose.

Author Contributions: Tobias R. Rebholz played a lead role in conceptualization, data curation, formal analysis, investigation, methodology, software, supervision, validation, visualization, writing–original draft, writing–review and editing and a supporting role in project administration and resources. Alena Koop played a lead role in resources and a supporting role in conceptualization, formal analysis, investigation, methodology, software, and writing–review and editing. Mandy Hütter played a supporting role in conceptualization, funding acquisition, and writing–review and editing.

Data Availability: Preregistration documents, materials, surveys, data, and analysis scripts are publicly available at the Open Science Framework repository (https://osf.io/v9je8).

Open Science Disclosures: The data are available at https://osf.io/4532v. The experimental materials are available at https://osf.io/v9je8.
The preregistration plus analysis design is available at https://aspredicted.org/4XF_CVT.

Open Access License: This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0; http://creativecommons.org/licenses/by/4.0). This license permits copying and redistributing the work in any medium or format, as well as adapting the material for any purpose, even commercially.

Correspondence concerning this article should be addressed to Tobias R. Rebholz, Department of Psychology, University of Tübingen, Schleichstraße 4, 72076 Tübingen, Germany. Email: [email protected]


In 2016, the European Union enacted the General Data Protection Regulation. A key aspect of this legislation is the implicit establishment of a “right to explanation,” which mandates that people affected by augmented (i.e., algorithmically assisted) or fully automated judgment and decision-making processes have a right to be informed about the rationale behind them (Selbst & Powles, 2017). This legislative move aimed to increase transparency and accountability in the deployment of artificial intelligence (AI) technologies, addressing growing concerns about their opacity and impact on individual rights and freedoms. Crucially, the advent of chatbots represents a notable evolution in this context. Chatbots such as OpenAI’s ChatGPT, Google’s Gemini, Meta’s LLaMA, and Anthropic’s Claude realize this right to explanation by design. Their conversational user interfaces allow users to directly and interactively inquire about the reasoning behind the algorithmic output generated (Bubeck et al., 2023), underscoring the potential of generative AI (GenAI) systems to comply with regulatory requirements.

It is important to note that although these systems can generate responses that may seem like explanations, they are not explanations in the traditional sense. The generated responses often seem to align well with the underlying decision-making process, but they can also be completely unrelated and thus misleading. In fact, GenAI systems consistently make certain types of errors in certain domains, such as complex arithmetic operations (Tuncer et al., 2023). Therefore, the ability of GenAI to provide explanations for its output is also often criticized for the risk of communicating falsities with high confidence (e.g., in education; Johnson, 2023). While the inherent capabilities of chatbots to generate explanations for their behavior underscore the potential of GenAI systems to engage users in a dialogue about their outputs, they should not be mistaken for reliable explanations. Nevertheless, these systems can serve as a starting point for promoting a culture of transparency and informed user engagement in the digital ecosystem, provided their limitations are clearly communicated and understood.

Advice Taking From Algorithms in the Age of Conversational GenAI

The limitations of traditional AI systems, such as their inability to adapt to changes in the decision-making environment (Dawes, 1979) or to account for individual differences (Grove & Meehl, 1996), were originally identified as the main cause of “algorithm aversion” (Dietvorst et al., 2015). This aversion is characterized by the tendency of individuals to prefer interacting with humans rather than algorithms, which reduces the utilization of often superior algorithmic output (Mahmud et al., 2022; Prahl & van Swol, 2017). Essentially, the conversational style of contemporary large language models (LLMs) is almost indistinguishable from a natural human conversation (e.g., Lee et al., 2020). Therefore, there should be no reason to expect aversion against advanced chatbots—unless they can be easily debunked as such, despite their advanced communication capabilities. Mahmud et al. (2022) defined it as “general aversion” when, for instance, people have an innate distrust of algorithms regardless of how well they perform (Önkal et al., 2009; Prahl & van Swol, 2017). Accordingly, the mere knowledge that one is interacting with an algorithm rather than another human can trigger aversive behavior, such as a reduced weighting of algorithmic advice (i.e., judgments made by an algorithmic agent).

For traditional AI systems, Logg et al. (2019) found that users were more willing to rely on algorithmic advice than human judgments, including their own, when the algorithm’s prior performance was not as transparent as in studies showing algorithm aversion. This “algorithm appreciation” was found across many judgment domains—subjective (e.g., music and dating) and objective (e.g., weight estimation)—for which advanced chatbots can be used as advisors. Moreover, conversational user interfaces to GenAI have considerably increased the popularity of LLMs (Xi et al., 2023). Indeed, familiarity with a decision support system has been identified as an important reason for people’s increased willingness to rely on it (Komiak & Benbasat, 2006; Mahmud et al., 2024). Therefore, research on the appreciation of LLM-generated advice is rapidly gaining importance.

In summary, we argue that conversational user interfaces constitute a promising tool for counteracting algorithm aversion and promoting algorithm appreciation, thereby enabling the utilization of often superior algorithmic output. Thus, our aim here was to investigate the tension between algorithm aversion and algorithm appreciation in interactions with GenAI that uses a natural style of conversation. To this end, by experimentally manipulating participants’ interactions with ChatGPT as the most popular LLM, we test the influence of two critical design features of conversational user interfaces on advice taking from algorithms. Specifically, we systematically vary the levels of explanation and interactivity with a GenAI capable of providing advice, along with explanations of the underlying rationale, to examine their effects on the utilization of its output.

The Interplay of Explanations and Interactivity

We argue that it is the combination of explanations and interactivity that underpins the widespread popularity and impact of contemporary GenAI systems. Providing explanations increases the transparency of the algorithmic judgment and decision-making process (Papamichail, 2003). Furthermore, the disclosure of intermediate steps makes algorithmic reasoning more understandable and cognitively accessible (van Dongen & van Maanen, 2013). As a consequence, explanatory algorithms are perceived as more trustworthy, leading to higher levels of stated trust in Goodwin et al. (2013), and their advice is weighted more strongly, especially when the accompanying explanations convey higher informational value (Gönül et al., 2006). Therefore, we expect that advice from a GenAI provided along with an explanation is weighted more strongly than the same advice presented without additional details about the underlying rationale (Hypothesis 1).

Recommender systems that are designed to appear more humanlike (e.g., voice, visual appearance) have been found to increase user trust and usage intention (Qiu & Benbasat, 2009). The opportunity to interactively engage with an algorithm is an anthropomorphic feature of a system that enhances users’ trust calibration and satisfies their desire for control (van Dongen & van Maanen, 2013; see also Westphal et al., 2023, on control). Essentially, the responsibility associated with higher levels of controllability of an algorithm’s behavior has been found to increase users’ willingness to rely on its output (Dietvorst et al., 2018). Therefore, we expect that algorithmic advice is weighted more strongly when users have the opportunity to actively interact with the GenAI providing the advice, compared to situations where interactivity is not possible (Hypothesis 2).

In the case of an opportunity to interact that allows users to request further information, interactivity can also be understood as a cue to reasoning capabilities. In this sense, the ability to provide explanations can be considered equivalent to the ability to provide arguments for one’s position, which is a common feature of conversation across cultures (Mercier & Sperber, 2017). Therefore, we expect that actively requested optional explanations for algorithmic advice are most effective in enhancing its informational influence on users and thus are associated with the highest weight of algorithmic advice relative to explanations that are provided by default, optional explanations that are not requested, or neither mandatory nor optional explanations (Hypothesis 3). There are two reasons for our expectation, depending on the explanation versus interactivity perspective, as discussed in the following.

Information asymmetry occurs when one party has more or better information than the other party (Hütter & Ache, 2016; van Dongen & van Maanen, 2013; Yaniv, 2004; Yaniv & Kleinberger, 2000). By providing information upon request, parties can reduce information asymmetry, signaling their willingness to be open and transparent. Therefore, the algorithmic judgment and decision-making process should be perceived as more transparent if the algorithm is willing—in the sense of the deliberately programmed capability—to reveal additional details upon request, rather than providing them by default. Consequently, explanations that are provided upon the user’s individual request should have a stronger effect on advice weighting than explanations that are provided by default (Hypothesis 3a).

Although the mere opportunity to request an explanation might satisfy users’ core desire for control, the salience of influencing an algorithm’s behavior is greater when an explanation is actively requested than when it is provided by default. In Dietvorst et al. (2018), marginal amounts of control over an algorithm’s behavior were shown to reduce algorithm aversion. Furthermore, the opportunity to interact is of little consequence in terms of building trust through explanation if it is not used to request an explanation from the algorithm (Goodwin et al., 2013). Consequently, the opportunity to request an explanation should have a stronger effect on advice weighting when it is used than when it is forgone (Hypothesis 3b).

Method

In a judge–advisor system (JAS; Sniezek & Buckley, 1995), 472 participants solved a series of 10 estimation tasks with access to pregenerated output from ChatGPT as advice. We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures. The experiment was preregistered and unless stated otherwise, the sample size, manipulations, measures, data exclusions, and analyses adhere to the preregistration. A significance level of 5% was used for statistical testing. Preregistration documents, materials, surveys, data, and analysis scripts are publicly available at the Open Science Framework repository (https://osf.io/v9je8; Koop et al., 2024).

Design

Our experiment implemented three between-subject conditions with repeated measures. Participants in the control condition (n = 119) did not have the opportunity to interactively engage with ChatGPT and received no explanation of the assumptions and calculations it made. They only received the numerical output generated by the GenAI as advice for a given stimulus item. In the two treatment conditions, participants also received this numerical output as advice. However, in the mandatory explanation condition (n = 118), participants additionally received a detailed explanation of how ChatGPT formed its judgments by default. That is, participants in this condition also did not have the opportunity to interactively engage with ChatGPT, but they were provided with an explanation of the assumptions and calculations it made. In the optional explanation condition (n = 235), by contrast, participants could voluntarily request this explanation on a trial-by-trial basis. As preregistered, we recruited about twice as many participants in this condition to achieve a roughly balanced number of participants in each of the four resulting groups. In fact, participants in the optional explanation condition self-selected into the optional explanation requested group on 1,730 out of 2,350 (73.62%) trials by making use of their opportunity to interactively engage with ChatGPT to request additional explanations on a given trial, whereas the remaining trials (i.e., 26.38%) formed the optional explanation forgone group. Thus, the amount of interactive engagement with ChatGPT was relatively high compared to requests for additional explanations from rule-based or expert systems (see Gregor & Benbasat, 1999, for a review).

Participants

In another data set from our lab, detailed explanations of the specific workings of an algorithm in a numerical judgment and advice-taking task significantly increased participants’ weighting of algorithmic advice relative to more abstract descriptions of the same algorithm in the other three conditions (Rebholz, 2024). Treating this as a proxy for advice taken from a GenAI that provides explanations of its underlying rationale, we used the effect size from this study (Cohen’s d = 0.29; Judd et al., 2017) for our power simulation using the R package simr (Green & MacLeod, 2016). Based on 1,000 iterations, sufficient power (95% confidence that 1 − β ≥ 0.80) required the collection of data from at least 452 participants. Recruitment via the general mailing list of the University of Tübingen over seven full days in July 2023, as preregistered, resulted in a final sample of size N = 472 (313 females, 154 males, five diverse). The median age of our participants was 23 years (interquartile range [IQR] = 4.25).
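The power simulation can be reproduced conceptually with a sketch like the following; the pilot model structure, the variable names, and the raw-scale effect injected for the treatment contrast are illustrative assumptions, not the exact preregistered script (which is available in the online repository).

```r
# Hedged sketch of a simulation-based power analysis with simr.
library(lme4)
library(simr)

# Hypothetical pilot model: WOA regressed on a treatment contrast with
# random intercepts for participants and items (column names are assumed).
pilot <- lmer(woa ~ treatment + (1 | participant) + (1 | item), data = pilot_data)

# Inject the assumed effect on the raw WOA scale and extend the design to a
# candidate sample size before simulating power across 1,000 iterations.
fixef(pilot)["treatment"] <- 0.05
pilot_452 <- extend(pilot, along = "participant", n = 452)
powerSim(pilot_452, test = fixed("treatment"), nsim = 1000)
```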

Materials

Participants solved a series of 10 Fermi problems, which are simple numerical judgment or order-of-magnitude estimation tasks. For instance, they were asked to estimate the total number of hours of schooling completed by German high school graduates or how many days it would take to walk around the equator. The key to solving Fermi problems is to break the question down into simpler, more tangible parts, make reasonable assumptions, and then often perform simple calculations to arrive at plausible estimates (Ärlebäck, 2009; Edge & Dirks, 1983). For the second example above, one would need to know or guess the length of the equator and make assumptions about the average walking speed, the maximum daily walking time, and the frequency of full rest days to arrive at a plausible estimate of the total walking time along the equator.
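To illustrate, a minimal back-of-the-envelope sketch with assumed round numbers (not the exact rationale shown to participants): taking an equator of roughly 40,000 km, a walking speed of 5 km/hr, and 8 hr of walking per day gives

$$\frac{40{,}000\ \text{km}}{5\ \text{km/hr} \times 8\ \text{hr/day}} \approx 1{,}000\ \text{walking days},$$

which, allowing for one rest day per week, yields a total duration on the order of 1,150–1,200 days.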

Advice and explanations for each estimation task were pregenerated using OpenAI’s GPT-3.5-Turbo model. We used their development platform to be able to control parameters of the model that were critical to our application. In particular, the temperature parameter controls the degree of randomness in the LLM’s output. A higher temperature results in a more varied output, in the sense that the model generates a different response each time it is prompted in the same way, whereas a lower value results in a more deterministic output. By setting the temperature to zero, our goal was to minimize randomness and thereby ensure more realistic estimates for the Fermi problems presented to participants as advice. In addition, we used a standardized script for prompting the LLM to solve each Fermi problem. The model was asked to solve a particular problem step by step and to produce a concrete value as an estimate at the end. This is because chain-of-thought prompts have been shown to improve the reasoning capabilities of LLMs (Wei et al., 2022; P. Zhang, 2023). More details on how the stimulus items were created can be found in the Materials folder of the online repository.
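For illustration, the pregeneration could be scripted along the following lines; the prompt wording and response handling are assumptions, whereas the endpoint and parameters follow OpenAI’s chat completions API (the authors’ actual script is in the Materials folder of the online repository).

```r
# Hedged sketch of pregenerating advice with GPT-3.5-Turbo at temperature 0.
library(httr)

prompt <- paste(
  "How many days would it take to walk around the equator?",
  "Solve the problem step by step and state a single concrete number at the end."
)

resp <- POST(
  "https://api.openai.com/v1/chat/completions",
  add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
  content_type_json(),
  body = list(
    model = "gpt-3.5-turbo",
    temperature = 0,  # minimize randomness so repeated prompts yield stable output
    messages = list(list(role = "user", content = prompt))
  ),
  encode = "json"
)

advice_text <- content(resp)$choices[[1]]$message$content
```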

Procedure

The experiment, with a median duration of 10.27 min (IQR = 6.70), was conducted online using SoSci Survey (Leiner, 2021). After entering the study and giving their informed consent to participate, participants were first asked to rate their experience with numerical estimation tasks (M = 1.77, SD = 0.91) and with using ChatGPT (M = 1.61, SD = 1.24) on 5-point Likert scales ranging from 0 (no experience) to 4 (much experience). They were then informed of their task, which was to solve a series of 10 estimation tasks following a typical JAS procedure. That is, participants were instructed to make an independent initial judgment in the first estimation phase of a trial, and that they would have access to advice from ChatGPT in the second estimation phase. We tried to closely simulate real interactions with ChatGPT by presenting user prompts, advice, and explanations as screenshots in the original layout (see the Survey folder of the online repository for an example). Participants in the optional explanation condition were also informed that they could voluntarily request additional explanations from ChatGPT for its advice. On trials in which they pressed a button to “request an explanation for this estimate,” they also received ChatGPT’s rationale for its advice. After providing two integer estimates for all 10 Fermi problems, participants were asked for their demographic information and could provide their contact information to voluntarily enter a raffle for 20 vouchers from a German bookstore chain, worth 10 € each. At the end of the experiment, participants were fully debriefed and thanked.

Measures

Our main dependent variable was Harvey and Fischer’s (1997) weight of advice (WOA) index, which is calculated as the ratio of the distance between participants’ final and initial judgments to the distance between ChatGPT’s advice and participants’ initial judgment. Accordingly, a WOA of 0 indicates complete disregard of ChatGPT’s advice, a value of 1 indicates complete adoption of the advice, and everything in between and outside this interval represents a corresponding weighted linear combination of ChatGPT’s advice and the participants’ own initial estimates. Outliers of WOA were excluded based on Tukey’s (1977) fences. That is, we removed trials on which the WOA fell below the 25th percentile or exceeded the 75th percentile by more than 1.5 times the IQR of the entire WOA distribution.
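In code, the WOA computation and the Tukey-fence exclusion amount to the following minimal sketch; the trial-level column names are assumptions.

```r
# Minimal sketch of the WOA index and Tukey-fence outlier exclusion.
# Assumed columns: initial, final, advice (one row per trial).
trials$woa <- (trials$final - trials$initial) / (trials$advice - trials$initial)

q   <- quantile(trials$woa, c(.25, .75), na.rm = TRUE)
iqr <- q[2] - q[1]
keep <- !is.na(trials$woa) &
  trials$woa >= q[1] - 1.5 * iqr &
  trials$woa <= q[2] + 1.5 * iqr
trials <- trials[keep, ]
```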

Data Analysis

As preregistered, we conducted a multilevel regression analysis using R Version 4.3.1 (R Core Team, 2023) and the packages lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017). WOA was used as the dependent variable, and random intercepts for participants and stimulus items were included to account for repeated measures, specifically variation in advice taking across participants and differences in item difficulty, respectively. To experimentally investigate the effects of explanation and interactivity on participants’ utilization of ChatGPT advice, the model included fixed effects of explanation (contrast coded as −0.5 for not provided and 0.5 for provided) and interactivity (contrast coded as −0.5 for not possible and 0.5 for possible), as well as the interaction term of the two treatment factors. Thus, the intercept of the regression model provided an estimate of the unweighted grand mean across conditions, the first fixed effect measured the main effect of explanations, the second fixed effect measured the main effect of the opportunity to interactively engage with ChatGPT, and the interaction term measured the additional contribution of interactively requested explanations over and above the sum of the two main effects.
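A minimal sketch of this preregistered model, assuming a trial-level data frame with the contrast-coded factors and identifiers named as below:

```r
# Preregistered multilevel model (column names are assumptions).
library(lme4)
library(lmerTest)  # Satterthwaite degrees of freedom and p values

dat$explanation   <- ifelse(dat$explanation_provided, 0.5, -0.5)
dat$interactivity <- ifelse(dat$interaction_possible, 0.5, -0.5)

fit <- lmer(
  woa ~ explanation * interactivity +   # two main effects plus their interaction
    (1 | participant) + (1 | item),     # random intercepts for persons and items
  data = dat
)
summary(fit)
```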

Results

As shown in the bar chart of mean WOA (93 out of 4,720, or 1.97%, outliers excluded) in Figure 1, the mean weighting of algorithmic advice was descriptively highest on trials where optional explanations were requested (M = 0.68, SD = 0.41). Advice taking was reduced on trials where optional explanations were forgone (M = 0.66, SD = 0.42), as well as in the mandatory explanations group (M = 0.66, SD = 0.43). However, the lowest WOA was observed in the control group (M = 0.58, SD = 0.40). Thus, there was descriptive evidence for all three hypotheses.

Figure 1

Mean WOA per Explanation and Interactivity Group
Note. Error bars show the 95% confidence interval. Outliers of WOA are excluded based on Tukey’s (1977) fences. WOA = weight of advice.

Figure 2 shows the characteristic distributions of WOA, which, reminiscent of a W-shape (Soll & Larrick, 2009), have three modes at complete disregard of ChatGPT’s advice (WOA = 0), unweighted averaging of the advice and the participants’ own initial judgment (WOA = 0.5), and complete adoption of the advice (WOA = 1). However, the distributions were slightly more left-skewed than in traditional JAS studies (overall pooled meta-analytic WOA = 0.39; Bailey et al., 2023), indicating relatively high advice taking from ChatGPT. Notably, the proportions of complete disregard and unweighted averaging were substantially higher in the control group than in all other conditions. Whereas there were no striking differences between the remaining three conditions for the two modes with lower levels of advice taking, the proportions of complete adoption appear to be the main driver of the quantitative differences at the aggregate level reported above. In particular, advice provided with mandatory explanations and without optional explanations was completely adopted more frequently than advice in the control condition, but the proportions of complete adoption were also substantially lower than for the interaction opportunity used to request a detailed explanation of the underlying rationale.

Figure 2

Distributions of WOA per Explanation and Interactivity Group
Note. Gaussian kernel density plots with the bandwidth chosen according to Silverman’s (1986) rule of thumb. Outliers of WOA are excluded based on Tukey’s (1977) fences. WOA = weight of advice.

Confirmatory Multilevel Modeling

The full multilevel regression model with WOA as the dependent variable and fixed effects of explanation, interactivity, and their interaction is summarized in Table 1. As indicated by the estimate of the intercept, the weighting of ChatGPT’s advice was relatively high on average across all conditions in our study (d = 1.55; Judd et al., 2017). More importantly, consistent with Hypothesis 1, we found that participants weighted algorithmic advice significantly more strongly when it was accompanied by an explanation (d = 0.15), regardless of whether these explanations were provided by default or actively requested. Consistent with Hypothesis 2, interactivity significantly increased participants’ WOA (d = 0.11), regardless of whether these opportunities to interactively engage with ChatGPT were actually used or not. For Hypothesis 3, there was no evidence for an interaction effect of explanation and interactivity (d = −0.07). That is, actual interactive engagement to request an explanation did not additionally increase WOA beyond the additive effects of explanation and interactivity individually.

Table 1

Full Multilevel Regression of Weight of Advice on Explanation, Interactivity, and Their Interaction

| Variable | Estimate | 95% CI | SE | t | df | p |
|---|---|---|---|---|---|---|
| b_0 | 0.6421*** | [0.5780, 0.7061] | 0.0327 | 19.66 | 10.61 | <.001 |
| b_explanation | 0.0643*** | [0.0284, 0.1001] | 0.0183 | 3.52 | 870.61 | <.001 |
| b_interactivity | 0.0464* | [0.0049, 0.0879] | 0.0211 | 2.19 | 513.54 | .028 |
| b_explanation × interactivity | −0.0303 | [−0.1020, 0.0414] | 0.0366 | −0.83 | 870.72 | .407 |
| τ_0,S | 0.1912 | [0.1731, 0.2071] | | | | |
| τ_0,T | 0.0977 | [0.0488, 0.1444] | | | | |
| σ | 0.3554 | [0.3480, 0.3633] | | | | |
| ICC | 0.27 | | | | | |
| Marginal R² | 0.01 | | | | | |
| Conditional R² | 0.28 | | | | | |

Note. Wald 95% CI for fixed effects and bootstrap 95% CI (with 1,000 iterations) for random effects are shown. Sample sizes of N = 472 participants S and M = 10 stimulus items T resulted in a total number of 4,627 observations after excluding outliers based on Tukey’s (1977) fences. CI = confidence interval; SE = standard error; df = degrees of freedom; ICC = intraclass correlation coefficient.

* p < .05, *** p < .001, two-sided.

Exploratory Analyses

As the negative interaction term is descriptively smaller in absolute terms than either of the two main effects, it suggests a small increase in the weighting of advice provided along with actively requested optional explanations. In other words, the net effect of the combination of the two treatment factors was indeed positive as expected (see also Figure 1). According to post hoc linear hypothesis testing using the R package car (Fox & Weisberg, 2019), there was no evidence for Hypothesis 3a, or an increase in WOA for actively requested explanations over explanations provided by default, binteractivity + 0.5 × bexplanation × interactivity = 0.03, χ2(1) = 1.47, p = .226. That is, there was no evidence for a differential weighting of advice provided with optional versus mandatory explanations. In contrast, there was evidence for Hypothesis 3b, that is, for a significant increase in WOA for actually making use of the interaction opportunity compared to not making use of it, bexplanation + 0.5 × bexplanation × interactivity = 0.05, χ2(1) = 4.79, p = .029. Apparently, the mere possibility of interactive engagement was not as valuable in terms of higher WOA as actually making use of it to receive additional explanations.
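These post hoc contrasts can be expressed as linear combinations of the fitted coefficients, for example as sketched below (coefficient names assumed to follow the model terms above).

```r
# Hedged sketch of the post hoc linear hypothesis tests with car.
library(car)

# Hypothesis 3a: actively requested versus default explanations
linearHypothesis(fit, "interactivity + 0.5 * explanation:interactivity = 0")

# Hypothesis 3b: interaction opportunity used versus forgone
linearHypothesis(fit, "explanation + 0.5 * explanation:interactivity = 0")
```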

The random intercept for the stimulus items indicated considerable variation in participants’ advice taking for different Fermi problems. Therefore, we also included random slopes of explanation to examine whether the observed item-wise variance in WOA could be attributed to the corresponding explanations provided (Appendix). According to likelihood ratio testing, our data were significantly better explained by this extended model than by the preregistered one without random slopes, χ2(4) = 37.44, p < .001. Moreover, the conditional modes of the random item slopes revealed systematic variations in the item-specific effects of explanation on WOA. Post hoc examination of the stimulus items revealed that the largest and, according to the 95% CI of the dot plots shown in Figure 3, only significantly negative effects on WOA were observed for the two items for which ChatGPT’s explanations indicated a deficient quality of the generated advice (i.e., unrealistic assumptions, such as for the space occupied by a person in Item Number 10, or incorrect calculations, such as a missing trailing zero in Item Number 5; see additional online Table S1 of the online repository at https://osf.io/qjbmt). This finding suggests that participants were also appropriately sensitive to inaccuracies and falsities in the generated explanations. Nevertheless, the mean WOA for these two items was still relatively high and significantly positive, M10 = 0.67, SD10 = 0.39, t(465) = 37.42, p < .001, and M5 = 0.52, SD5 = 0.48, t(456) = 23.32, p < .001.
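The exploratory extension adds random slopes of explanation for participants and items and compares the two models with a likelihood ratio test; as above, the variable names are assumptions.

```r
# Extended model with random slopes of explanation for participants and items.
fit_slopes <- lmer(
  woa ~ explanation * interactivity +
    (1 + explanation | participant) + (1 + explanation | item),
  data = dat
)

# Likelihood ratio test against the preregistered model (4 additional parameters).
anova(fit, fit_slopes)

# Conditional modes of the item random slopes, as plotted in Figure 3.
ranef(fit_slopes)$item
```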

Figure 3

Conditional Modes of Item Random Slopes of Explanation in the Extended Model of Appendix
Note. Error bars show the 95% confidence interval.

We also conducted post hoc multilevel modeling with random intercepts for participants and items as well as fixed effects of the demographic variables. The results indicate that male participants (M = 0.60, SD = 0.44) take significantly less advice from ChatGPT than female participants (M = 0.67, SD = 0.40), t(466.51) = −3.28, p = .001. In addition, the weighting of algorithmic advice decreases significantly with participant age, b = −0.01, t(471.99) = −4.33, p < .001. Finally, controlling for (centered) experience in the preregistered multilevel model for the main analysis did not affect the main conclusions with respect to the manipulated variables and revealed no significant effects of any type of prior experience on advice weighting, b = −0.01, t(468.11) = −0.94, p = .350, for numerical estimation experience, and b = 0.00, t(468.73) = 0.02, p = .986, for ChatGPT experience.

Discussion

For more objective tasks, the “perfection schema” (Dzindolet et al., 2001; Prahl & van Swol, 2017) and the “machine heuristic” (Sundar, 2008) posit that users attribute superior performance to algorithms over humans. This may explain why the weighting of ChatGPT’s advice was relatively high on average across all conditions in our study using Fermi problems as estimation tasks. For instance, estimating the total number of hours of schooling completed by German high school graduates or the number of days it would take to walk around the equator can be considered fairly objective tasks that the GenAI should thus be able to solve well. More importantly, however, both the explanation provided and the opportunity to interact with the chatbot exert positive effects on participants’ advice taking from ChatGPT (see also Figure 1), with corresponding theoretical implications as discussed in the following.

Making use of the opportunity to request an optional explanation significantly enhances the positive effect of interactivity. That is, simply providing the infrastructure to interactively engage with the GenAI is beneficial, but does not lead users to weight the algorithmic advice as much as in situations where they actively use this opportunity to request more information. Indeed, it is less surprising that people would prefer advice that is accompanied by some form of reasoning, reflecting their inherent desire to understand and make sense of the information they receive (Mercier & Sperber, 2017). When no additional explanation is requested, participants are left to rely solely on the advice given, without any context or rationale, leading to a lower weighting of the advice due to the lack of transparency and understanding. In contrast, when a rationale is provided, the informational asymmetry is reduced, making the advice more understandable and thus more likely to be taken into consideration.

There is no evidence for a differential weighting of advice provided with mandatory versus optional explanations of the underlying rationale generated by ChatGPT. However, our findings suggest that advice taking from GenAIs could potentially be increased by the provision of an explanation of the underlying rationale, even when not explicitly requested. Note that this benefit of explanation does not contradict the so-called “verbosity bias” of LLMs, which have been criticized for their general tendency to favor long and wordy texts over short and concise texts of similar or even better quality, leading to the generation of unnecessarily long responses (Saito et al., 2023). According to the systematic item-wise random effects of explanation (see Figure 3), participants in our study weighted lower quality advice less than higher quality advice. In other words, our data highlight the value of explanations in assessing the quality of GenAI-generated advice.

Practical Implications

The distributional patterns of WOA (Figure 2) provide a more detailed picture of qualitative differences between the explanation and interactivity conditions. In particular, the provision of explanations and the opportunity to interact reduce the propensity of users to completely neglect algorithmic advice. Furthermore, the hypothesized interaction between these two factors is clearly evident at higher levels of advice weighting, as the density of informational influence is highest for actively requested explanations. Note that the definition of aversion against algorithms relative to humans includes users’ overweighting of their own (e.g., initial) judgments (Mahmud et al., 2022). Accordingly, the distributional observations suggest that conversational user interfaces primarily empower ChatGPT to counteract algorithm aversion and promote algorithm appreciation relative to one’s own estimate, respectively. In contrast to the findings of Bansal et al. (2021), who studied joint decision making in human–AI teams, GenAI-generated explanations did not lead users to blindly trust the algorithmic advice regardless of its quality but instead enabled them to calibrate their trust appropriately (Y. Zhang et al., 2020).

As we did not manipulate the output of ChatGPT, some of the explanations indicated advice of deficient quality. In addition to increasing the ecological validity of our study, the inclusion of these items allowed us to assess the effects of falsities in algorithmic explanations on advice taking in our data. Indeed, the weighting of advice of deficient quality was significantly below average for Item Numbers 5 and 10 (Figure 3). Although this suggests that participants were sensitive to falsities in ChatGPT’s reasoning, the mean WOA for these two items was still relatively high and significantly positive. This may be due in part to the unavailability of cues to the quality of the advice provided in the control group and on trials where participants did not make use of the opportunity to interact (Papamichail, 2003), again supporting the relevance of explanations. A recent study provides evidence that ultimately “hallucination is inevitable” for LLMs (Xu et al., 2024), with negative consequences for user engagement and advice taking, as demonstrated in our study. The unintentional generation of falsities and hallucinations, but also the intentional misuse of GenAI to efficiently produce convincing-sounding misinformation, are relevant and nonnegligible societal risks of further innovations in this technology (Bubeck et al., 2023). By varying the quality of explanations, negative consequences should thus be addressed more systematically in future research.

Our results suggest that the mere presence of features that enhance interactivity increases users’ advice taking from conversational GenAI, potentially leading to an overreliance on low-quality and misleading advice. In other words, while the anthropomorphic features of interactivity may enhance users’ trust calibration (van Dongen & van Maanen, 2013), useless or inappropriately designed conversational user interfaces also run the risk of undermining the calibration of trust. For instance, organizations could increase the utilization of their models’ output by implementing mock-up interfaces for interactive engagement. We also found that actual interactive engagement to request explanations further increases the positive effect of interactivity. In most languages, the same message can be phrased in multiple ways. Therefore, it may be even more problematic to provide only a seemingly modified output that actually conveys the same message as the original advice. Deliberately programming chatbots to simply rephrase the virtually identical information using a different tone of voice and/or wording (e.g., using synonyms) to respond to user requests for truly updated information would, according to our results, run the risk of overreliance on such algorithmic advice. This insight has the potential to contribute to the debate about the possible dangers of increasing implementations of GenAI in digitalized societies. Seemingly updated algorithmic advice in response to user requests for genuine updates poses threats beyond hallucinations and misinformation to human autonomy in augmented judgment and decision making.

Limitations and Future Directions

Our study focused primarily on the behavioral consequences of explanations and interactivity, which we deem most pertinent to our research question about the factors influencing the utilization of GenAI-generated advice. Therefore, we did not directly measure other theoretical constructs such as controllability, perceived transparency, and salience of interactivity. These constructs were inferred from existing literature (e.g., controllability: Dietvorst et al., 2018; perceived transparency: Papamichail, 2003) and served as the theoretical foundation for our hypotheses regarding behavioral consequences. We assume that these constructs do not differ significantly in interactions with GenAI as compared to traditional decision support systems, so that their effects would be implicitly reflected in the measured WOA. However, we recognize that direct measurement of these constructs in future research could enrich our understanding of the phenomena under investigation.

Our convenience sample of Western, educated, industrialized, rich, and democratic university students is not representative of the general population. In addition, other unidentified factors may influence individuals’ engagement with LLMs. For instance, user characteristics such as personality (e.g., extroversion, self-esteem) or their attitudes toward technology have been found to influence perception, trust, and utilization of algorithms (see Mahmud et al., 2022, for an overview). By including random intercepts for participants in our models, we at least statistically accounted for potential individual differences in the propensity to rely on algorithmic advice. Moreover, post hoc multilevel modeling suggests that advice taking from ChatGPT depends on demographic variables such as participants’ age and gender. Accordingly, a more systematic investigation of relevant user characteristics and collection of data from non-Western, educated, industrialized, rich, and democratic populations is an important avenue for future research on the effect of explanation and interactivity on advice taking from algorithms.

In addition to user characteristics, human–GenAI interactions are also shaped by specific design features and functionalities of LLMs. In our experiment, interactivity was limited to pressing a button to request more information. We tried to closely simulate real interactions with ChatGPT by presenting user prompts, advice, and explanations in the original layout. However, real conversational user interfaces allow for much more flexible and dynamic engagement with the GenAI, affecting user engagement by providing a more conversational experience. This includes elements of anthropomorphism, which could lead users to perceive the system as more humanlike and thus be more inclined to trust it (Prahl & van Swol, 2017; Qiu & Benbasat, 2009). By tightly controlling interactivity and using pregenerated algorithmic advice and explanations, our intention was to avoid overly individualized advice interactions. The reason for this was that, technically, LLMs generate text based on probabilistic token forecasts given the current context (Vaswani et al., 2017). This functionality implies that the advice and explanations generated are highly attuned to the intentions and beliefs of the specific users providing the relevant context (Bubeck et al., 2023). We hypothesize that confirmation bias—the tendency to favor information that is consistent with one’s prior beliefs (Fiedler, 2000; Wason, 1960)—would increase users’ willingness to accept highly individualized algorithmic advice resulting from open conversations. Therefore, in future research, we plan to investigate the impact of free and unconstrained (multishot) prompting on users’ willingness to take advice from self-explanatory GenAI.

One of the major advantages of using pregenerated advice would have been to systematically vary certain properties of the GenAI-generated explanations. Important variables include explanatory depth, or how detailed and comprehensive certain explanations are (Kizilcec, 2016; Sovrano & Vitali, 2024); explanatory focus, or how relevant the information contained in an explanation is to its target stimulus (Shimony, 1993); and the degree of personalization, or the extent to which explanations are tailored to individual problem-solving strategies (Ribera & Lapedriza, 2019; Westphal et al., 2023). More advanced LLMs, such as Google’s Gemini or OpenAI’s GPT-4, are capable of multimodal reasoning, such as interpreting and/or generating images. Research on explainable AI shows that visual (Cheng et al., 2019) or—depending on user expertise—hybrid (i.e., textual and visual; Szymanski et al., 2021) explanations improve users’ objective understanding of complex algorithms, such as classification algorithms used in classical profiling tasks like university admissions. Thus, similar to other properties of explanations, the utilization of algorithmic advice is likely to be influenced by the modality of the explanations provided along with it. The procedure introduced in the present research can be extended for investigating these issues in future research.

In our study, there is no evidence for an effect of prior experience with numerical estimation tasks or using ChatGPT on the weighting of its advice. In general, however, contextual and environmental factors, such as users’ familiarity with the algorithm and task (Mahmud et al., 2024) or their experience and expertise (Ribera & Lapedriza, 2019), also affect the processing of explanations and the conditional utilization of algorithmic advice. For instance, the effectiveness of explanations of an AI’s behavior was found to be negatively affected by users’ cognitive load (Westphal et al., 2023). Put differently, there is a natural trade-off between increased comprehensibility of the underlying rationale and increased cognitive load due to the complexity added by explanations of the GenAI’s numerical judgment. In contrast to Westphal et al. (2023), we found a positive effect of general, non-user-centered explanations on advice taking from algorithms. However, in our online study, we could not directly control for cognitive load during participation from any location at any time. In general, understanding how the decision-making ecology affects the processing of advice provided along with explanations from interactive GenAI is essential for optimizing the design of conversational user interfaces.

Conclusion

The inherent explanatory capabilities of GenAI and the opportunity to interactively engage with it independently increase users’ advice taking. In fact, the mere existence of conversational user interfaces promotes the influence of GenAI on humans without the need for users to engage extensively with these interfaces. On the one hand, this feature can facilitate the efficient use of information. On the other hand, it may be considered a threat to human autonomy in interactions with conversational GenAI systems. However, our data also suggest that GenAI-generated explanations did not lead users to blindly trust the algorithmic advice regardless of its quality but instead enabled them to calibrate their trust appropriately. This finding underscores the potential of conversational user interfaces to LLMs, as implemented in ChatGPT and the like, to improve individuals’ augmented judgment and decision making. Our study extends our knowledge of advice taking by highlighting both the benefits and potential risks of integrating GenAI into these processes, suggesting a nuanced impact on user autonomy and trust calibration.

Supplemental Materials

https://doi.org/10.1037/tmb0000136.supp

Appendix

Full Multilevel Regression of Weight of Advice on Explanation, Interactivity, and Their Interaction

| Variable | Estimate | 95% CI | SE | t | df | p |
|---|---|---|---|---|---|---|
| b_0 | 0.6418*** | [0.5793, 0.7043] | 0.0319 | 20.13 | 10.71 | <.001 |
| b_explanation | 0.0641* | [0.0008, 0.1274] | 0.0323 | 1.98 | 15.01 | .047 |
| b_interactivity | 0.0460* | [0.0042, 0.0877] | 0.0213 | 2.16 | 464.84 | .031 |
| b_explanation × interactivity | −0.0307 | [−0.1043, 0.0429] | 0.0375 | −0.82 | 421.39 | .414 |
| τ_0,S | 0.1877 | [0.1684, 0.2068] | | | | |
| τ_0,T | 0.0950 | [0.0513, 0.1409] | | | | |
| τ_explanation,S | 0.0967 | [0.0038, 0.1705] | | | | |
| τ_explanation,T | 0.0831 | [0.0374, 0.1288] | | | | |
| ρ_0,explanation,T | −0.1548 | [−1.0000, 1.0000] | | | | |
| ρ_0,explanation,S | 0.2638 | [−0.4824, 0.8725] | | | | |
| σ | 0.3526 | [0.3453, 0.3602] | | | | |
| ICC | 0.28 | | | | | |
| Marginal R² | 0.01 | | | | | |
| Conditional R² | 0.29 | | | | | |

Note. Wald 95% CI for fixed effects and bootstrap 95% CI (with 1,000 iterations) for random effects are shown. Sample sizes of N = 472 participants S and M = 10 stimulus items T resulted in a total number of 4,627 observations after excluding outliers. CI = confidence interval; SE = standard error; df = degrees of freedom; ICC = intraclass correlation coefficient.

* p < .05, *** p < .001, two-sided.


Received March 28, 2024
Revision received June 25, 2024
Accepted June 28, 2024