Volume 3, Issue 2: Summer 2022. DOI: 10.1037/tmb0000054
Our first impressions of the people we meet are the subject of considerable interest, both academic and nonacademic. Such initial estimates of another’s personality (e.g., their sociality or agreeableness) are vital, since they enable us to predict the outcomes of interactions (e.g., can we trust them?). Nonverbal behaviors are a key medium through which personality is expressed and detected. The character and reliability of these expression and detection processes have been investigated within two major fields: psychological research on the accuracy of personality judgments and artificial intelligence research on personality computing. Communication between these fields has, however, been infrequent. In the present perspective, we summarize the contributions and open questions of both fields and propose an integrative approach to combine their strengths and overcome their limitations. The integrated framework will enable novel research programs, such as (a) identifying which detection tasks better suit humans or computers, (b) harmonizing the nonverbal features extracted by humans and computers, and (c) integrating human and artificial agents in hybrid systems.
Keywords: personality judgments, personality computing, nonverbal behavior, first impressions, social computing
Action Editor: Danielle S. McNamara was the action editor for this article.
Funding: This research is funded by an Irish Research Council scholarship, within the framework of the Employment-Based Postgraduate Programme, to the first author, in collaboration with AON Assessment Solutions (EBPPG/2018/99).
Disclosures: Neither the authors nor the funding institutions have any conflict of interest related to the content of this article. All authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, writing, and revision of the article. Furthermore, each author certifies that this material or similar material has not been and will not be submitted to or published in any other publication before its appearance in Technology, Mind, and Behavior.
Correspondence concerning this article should be addressed to Davide Cannata, School of Psychology, National University of Ireland Galway, College Road, Galway H91 TK33, Ireland. Email: email@example.com
When we meet a new person, whether in a private or professional context, we quickly develop a first impression of who that person might be. This makes sense, since having a good estimate of the characteristics of another person (the “target”) will be useful, especially if you (the “judge”) expect to deal with them repeatedly (Palese & Schmid Mast, 2020). In some cases (e.g., a job interview), this impression can have important and far-reaching consequences (Harris & Garris, 2008). Nonverbal behaviors (NVBs; e.g., a confident posture during a job interview) are a key medium through which individual differences are expressed and detected (Hall et al., 2019). In fact, NVBs are used to form first impressions before any relevant verbal information is exchanged (Hall et al., 2019). Also, compared to verbal behaviors, NVBs are harder to deliberately regulate (DePaulo, 1992) and more difficult to suppress (Bonaccio et al., 2016; Roche & Arnold, 2018). They can, therefore, provide a means to detect personality in situations in which verbal behaviors are unavailable or not informative.
Detecting individual differences based on NVBs has been a major topic of scientific investigation within two main fields of research: psychological research on the accuracy of personality judgments (PJ; Back & Nestler, 2016) and artificial intelligence research on personality computing (PC; Vinciarelli & Mohammadi, 2014). The two disciplines differ in the main goal they pursue. The main goal of PJ research is to understand the causes and implications of human PJ as they happen in nature, whereas PC scholars prioritize the creation of technological solutions to replicate or enhance human judgment. However, both fields share key research questions. These relate to (a) the conditions that facilitate accurate nonverbal personality detection, (b) the strength and character of the relationships between personality and NVBs and between NVBs and PJ, (c) how these relationships differ across groups (e.g., ethnic or cultural groups) and situations (e.g., professional, casual), and (d) the most appropriate measures of NVBs, target personality, and PJ.
Despite this convergence of research interests, there has, to date, been little communication between these disciplines. In this article, we provide an overview of both fields, their contributions and open questions. We then argue for an integrative approach to the science of nonverbal personality detection. Such an integrated approach will help to answer old research questions in both fields and to formulate novel questions.
In this section, we provide an overview of the state of the art in psychological research on nonverbal personality detection. The discipline, which focuses on how unacquainted human judges can make accurate inferences about others’ personality, has strongly influenced later studies in PC. We will summarize how the field has operationalized the concepts of “actual” personality and accuracy, and how it has addressed the challenge of measuring and classifying NVBs. Furthermore, we will explain the lens model, the major framework for personality detection research, and we will report the major moderators that explain fluctuations in the accuracy of human PJ.
Within psychology, research on the accuracy of PJ has investigated how independent judges rate multiple targets on their personality based on only a little information (e.g., pictures, videos, short interactions; Funder, 1995, 2012). PJ are more accurate than chance, even after short exposure to targets’ NVB (Ambady & Rosenthal, 1992; Ambady et al., 1995; Borkenau et al., 2004; Borkenau & Liebler, 1992; Hirschmüller et al., 2015), with correlations ranging between .20 and .40 for aggregate ratings. Accuracy is defined as the level of correspondence between ratings of judges on personality traits and the actual personality of targets (Back & Nestler, 2016). Information about personality is usually collected through three types of measures: self-reports (e.g., personality scales), informant reports (e.g., scales completed by one’s friends), and behavioral assessments (e.g., observation of the target in a situation specifically created to elicit the activation of a personality trait). Each of these methods captures different aspects of personality (self-concept, reputation, behavioral tendencies) and has its own limitations. Therefore, the combination of the three types of measures is regarded as the gold standard (Back & Nestler, 2016; Baumeister et al., 2007; Human et al., 2014). That is, actual personality (e.g., how extraverted I am) is a construct best estimated by combining information from self-reports, informants’ reports, and behavioral assessment. The correlation of judgments with outcomes known to be associated with personality (e.g., Conscientiousness and job performance) can provide further insights into the accuracy of PJ. Typically, research in this field consists of three different steps (Ambady et al., 1995; Back & Nestler, 2016; Connelly & Ones, 2010). First, target participants have their personality measured and they are photographed, or video recorded while briefly acting in front of a camera or a real audience. 
The acts that targets perform often consist of self-presentation, but other tasks such as reading texts, solving puzzles, or performing scripted actions are not uncommon (e.g., Borkenau et al., 2004). Second, a sample of judges rates the targets’ personality after being exposed either directly to the targets’ acts or to their video recordings. Third, two or more trained raters carefully examine the videos and annotate NVBs.
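The accuracy definition above (correspondence between judges' ratings and a criterion measure of the targets' personality) can be illustrated with a minimal Python sketch. All ratings and criterion scores below are invented for illustration, and the use of a Pearson correlation on the mean rating per target is an assumption about one common way to compute aggregate accuracy, not a description of any specific study's procedure.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: each row is one judge, each column one target (1-5 scale).
judge_ratings = [
    [4, 2, 5, 3, 1],
    [5, 3, 4, 2, 2],
    [3, 2, 5, 4, 1],
]
# Hypothetical criterion scores ("actual" personality) for the same five targets.
criterion = [3.8, 2.9, 4.1, 3.5, 2.0]

# Aggregate rating: mean across judges for each target.
aggregate = [mean(column) for column in zip(*judge_ratings)]

# Accuracy of the aggregate rating and of each judge taken alone.
aggregate_accuracy = pearson(aggregate, criterion)
single_judge_accuracy = [pearson(row, criterion) for row in judge_ratings]

print(f"aggregate accuracy: {aggregate_accuracy:.2f}")
print("single-judge accuracy:", [round(r, 2) for r in single_judge_accuracy])
```

With real data, the criterion would itself be an estimate built from self-reports, informant reports, and behavioral assessments, as described above.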
The assessment of NVBs has been the focus of wide-ranging research within the psychological tradition. Traditionally, NVBs are divided into four domains: (a) the face, (b) the body, (c) the voice, and (d) static cues (Elfenbein & Eisenkraft, 2010; Hall & Andrzejewski, 2008; Hall et al., 2016). Facial NVBs refer to movements executed by the facial muscles (Ekman & Rosenberg, 2005), which include facial expressions (e.g., smiling, raising one’s eyebrows) and gaze (e.g., eye contact). The body domain includes static body posture, interpersonal distance, gait, head movement, and gestures (Breil et al., 2021). The domain of voice, also called paralanguage, concerns all the nonverbal elements of speech. These include speech rate, pauses, utterances, the dominant pitch, volume, variations of pitch and volume, elements related to the quality of the voice such as timbre or pleasantness, and markers of socialization, such as accent (Degroot & Gooty, 2009; Hall et al., 2019). Finally, static cues refer to nonverbal expressions that are relatively stable features of the target, such as their dressing style, choice of jewelry, grooming, and personal hygiene (Degroot & Gooty, 2009; Gifford et al., 1985). Furthermore, measures of NVBs can focus on different levels of abstraction: a microlevel (specific behaviors like smiling), a mesolevel (circumscribed expressions such as showing surprise), or a macrolevel (more global characteristics such as seeming approachable, loud, or shy).
Numerous behavioral codes have been designed for the recording of NVBs based on the domain and the level of abstraction. The Facial Action Coding System (Ekman & Friesen, 1978; Sayette et al., 2001), for example, focuses on the classification of facial expressions at the microlevel, by identifying the smallest Facial Action Units. The Riverside Behavioral Q-Sort (Funder et al., 2000) records behaviors at a mesolevel (e.g., shows enthusiasm) and macrolevel (e.g., a high energy level). The Münster Behavior Coding System (Grünberg et al., 2019) allows for assessment of various NVBs at the micro-, meso-, and macrolevels while focusing on their interpersonal meaning. The assessment of NVBs at different levels helps to balance between specific behaviors that are easy to measure, but might not be very informative (e.g., blinking), and higher order behavioral constructs (e.g., showing anxiety) that are more difficult to measure objectively, but that are more strongly related to personality characteristics.
Brunswik’s lens model (Brunswik, 1956) has been used as a conceptual framework to explain how judges form accurate impressions of a target’s personality characteristics that are not directly observable. In this model, represented in Figure 1, observable cues, such as NVBs, serve as links between actual personality on one side and PJ on the other side. Accurate detection is achieved when those cues are valid (related to a target’s actual personality; cue validity) and utilized by the judges (related to judges’ impressions; cue utilization). For example, a judge might correctly identify a target’s Extraversion by utilizing cues such as broad gestures, loud voice, stylish clothes, that are related to the target’s actual (e.g., self- and informant-reported) Extraversion.
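To make the two sides of the lens model concrete, the following Python sketch computes, for each cue, a cue validity coefficient (cue against actual personality) and a cue utilization coefficient (cue against judged personality). The cue names, the scores, and the choice of simple Pearson correlations are all illustrative assumptions; they are not data or procedures taken from the studies cited here.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for six targets.
actual_extraversion = [2.1, 3.4, 4.0, 1.8, 4.6, 3.0]  # criterion measure
judged_extraversion = [2.5, 3.0, 4.2, 2.0, 4.4, 3.5]  # mean judge rating

# Hypothetical coded cues for the same six targets.
cues = {
    "smiling_rate": [1, 3, 4, 1, 5, 3],
    "voice_loudness": [2, 3, 5, 2, 4, 3],
    "blink_rate": [3, 2, 3, 3, 2, 3],
}

lens = {}
for name, values in cues.items():
    validity = pearson(values, actual_extraversion)     # expression side
    utilization = pearson(values, judged_extraversion)  # reception side
    lens[name] = (validity, utilization)
    print(f"{name}: validity={validity:.2f}, utilization={utilization:.2f}")

# Overall accuracy: judged against actual personality.
accuracy = pearson(judged_extraversion, actual_extraversion)
print(f"accuracy={accuracy:.2f}")
```

In this reading, a cue contributes to accurate detection only when both of its coefficients are substantial and share the same sign.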
On the expression side of the lens model, significant associations between traits and nonverbal expression (cue validity) have been reliably observed (Back et al., 2011; Gawda, 2007; Gifford, 1994; Hirschmüller et al., 2013; Iizuka, 1992; R. E. Riggio et al., 1990; R. E. Riggio & Friedman, 1986; Simpson et al., 1993; Wiens et al., 1980). For example, extraverts smile more, speak louder and faster, and have more expressive tones of voice, gestures, and facial expressions (Breil et al., 2021). On the reception side of the lens model, judges utilize a variety of NVBs to make PJ (Back et al., 2011; Degroot & Gooty, 2009; Gifford, 1994; Hall et al., 2011; Hirschmüller et al., 2013; Kenny et al., 1992; McAleer & Belin, 2018; Page & Balloun, 1978; Tucker & Friedman, 1993), even after very short exposures to the targets (Olivola et al., 2014). Estimating someone’s personality is a complex task that requires judges to hold (possibly subconscious) beliefs about associations between NVBs and personality traits. Cues other than NVBs, such as gender or ethnic background, are also utilized by judges, and excessive reliance on those can induce bias due to stereotyping. For example, right-wing conservatives are more influenced by religious identity as a cue for a range of personality traits when evaluating Middle Eastern refugees (Stecker et al., 2021).
One of the main challenges in PJ is to understand why there is so much variation in accuracy depending on the context. The Realistic Accuracy Model (RAM; Funder, 1995, 2012) expands the Brunswik lens model to include four moderators that explain fluctuations in achieved accuracy. The moderators are “good trait” (the property of some traits to be more easily detectable by judges), “good target” (differences between targets in expression of personality), “good judge” (differences between judges in evoking, detecting, and utilizing cues), and “good information” (quantity and quality of available information). The interactions between the domains (complex moderators) can also be relevant. Here, we provide examples of some highly impactful moderators in these categories.
1. Examples of “good target” moderators
Consistent with the lens model, people who produce more NVBs, and therefore provide more cues, are easier targets to judge. Extraverts, for example, are generally easier to evaluate, even on personality traits other than Extraversion (Ambady et al., 1995; Colvin, 1993; Funder et al., 2000). They are judged more accurately since they are more expressive in their NVBs (Hirschmüller et al., 2015; Human & Biesanz, 2013). Also, extraverts have been demonstrated to be better at encoding their emotional states through recognizable NVBs (Gross & John, 2003; H. R. Riggio & Riggio, 2002), fitting their NVBs more closely to social expectations. Similarly, women and more feminine individuals are better nonverbal encoders of emotions (Zuckerman et al., 1982), leading to higher accuracy when their traits are judged. A good target can also influence the detection of nonverbal cues from judges. For example, attractive targets elicit more attention from others (Langlois et al., 2000) and therefore are rated more accurately (Human & Biesanz, 2013).
2. Examples of “good judge” moderators
Some judges of personality are more accurate than others (Letzring, 2008; Rogers & Biesanz, 2019). A good judge is likely to utilize more personality-related cues, to apply cue information more consistently across judgments of different targets, and to make stronger use of valid than invalid cue information when forming personality impressions of others (Hall et al., 2016). Furthermore, in interactive settings, good judges might be better at evoking valid cues from their interlocutors (Rogers & Biesanz, 2019). A multitude of characteristics, such as cognitive ability, attentiveness, social skills, motivation (Christiansen et al., 2005; Mero et al., 2003) have been proposed as influencing judgment accuracy. However, there remains little consensus on the most reliable and stable predictors of the good judge (Letzring, 2008) due to the lack of standardized tests for personality judgment accuracy. Furthermore, the effect of “good judges” is more visible when the target offers many valid cues (Rogers & Biesanz, 2019).
3. Examples of “good trait” moderators
Some traits are easier to evaluate across situations, judges, and targets. The accuracy of first impressions on Extraversion is the highest among Big Five dimensions, while judgments related to Neuroticism are the least accurate (Connelly & Ones, 2010). One characteristic that affects accurate judgments of traits is visibility (John & Robins, 1993). Visibility refers to how much a trait is expressed through observable behavioral cues. A comparatively higher number of nonverbal cues are valid and utilized in relation to Extraversion (Breil et al., 2021). Evaluativeness, which refers to the desirability of a trait, also affects judgment accuracy. Highly desirable or undesirable traits, such as Neuroticism, are judged less accurately, due to distortion of responses from both targets and judges (Human & Biesanz, 2013; Nederström & Salmela-Aro, 2014).
4. Examples of “good information” moderators
The setting in which the target is observed will influence the accuracy of PJ, since, in some situations, targets provide judges with a greater number, or more distinctive patterns, of cues (good information). Conversely, targets are less likely to produce good information when situations provide strong behavioral guidance on the best course of action (Letzring et al., 2006). Situational “strength” has been defined as the degree to which a situation imposes constraints on the expression of behaviors (Caspi & Moffitt, 1993). Unstructured situations, with lower situational strength, allow targets to express their personality, yielding better information (Funder, 2012) and more accurate judgments. A contextual element that is particularly important for interactions in virtual environments is the richness of the information available, which often depends on the media channel used for the communication (Wall et al., 2013). Traditionally, “richer” media (which pass more information) have been considered advantageous for PJ (Letzring et al., 2006). However, later research demonstrated that the relation between contextual richness and accuracy is more complex. It depends on the measured trait, and it is enhanced when valid cues are available and invalid cues are suppressed (Wall et al., 2013). Furthermore, accuracy can be higher when parties are familiar with the media (Ishii et al., 2019).
The four moderators do not necessarily operate independently, and their interaction can enhance or suppress accuracy. For example, certain traits can be better judged in certain situations (e.g., Neuroticism in socially stressful situations; Hirschmüller et al., 2015). Also, judges are more accurate when rating the personality of targets similar to them in terms of gender and ethnicity (Letzring, 2010).
PC is a subfield of Artificial Intelligence that aims to study personality by mining data from a variety of sources such as texts, social networks, mobile phones, location devices, cameras, and microphones.
The field consists of two subdisciplines that are closely aligned and intertwined (Vinciarelli & Mohammadi, 2014): personality recognition and personality perception. Personality Recognition aims to predict self-reported personality traits by detecting valid cues, addressing issues on the expression side of Brunswik’s model, whereas Personality Perception aims to predict judgments of personality in zero acquaintance situations by detecting utilized cues, the reception side of Brunswik’s model.
The first work in PC that focused specifically on NVB was published in 2008 by Pianesi, Mana, Cappelletti, Lepri, and Zancanaro. The authors measured vocal characteristics and the movements of different regions of the body to detect Locus of Control and Extraversion of targets interacting in small groups. In this section, we introduce PC research on nonverbal personality detection. Since a comprehensive review is beyond the scope of the current article, we focus on three areas of particular relevance to our project of integration: feature extraction, prediction algorithms, and deep learning. Furthermore, we provide definitions for discipline-specific words in Table 1. Finally, we summarize some achievements in PC in accurately recognizing personality and predicting human judgments.
Table 1. Glossary of Personality Computing (PC) Terminology
Term: Definition
Weighted motion energy: The average amount of movement of the pixels in a specific region of a video (e.g., a face), corrected for the overall movement of that region in the space.
Facial action units: Classified actions of facial muscles corresponding to the smallest independent movements.
Pitch: A measure of the frequency of a soundwave. It indicates whether a voice is deep or high-pitched.
Energy: A measure of the volume of speech.
Speech rate: The speed of speech production, usually calculated as words per minute.
Formants: A formant is a concentration of acoustic energy around a frequency. The first formant is concentrated around the lowest frequency, and the next ones are concentrated around higher frequencies. They are related to acoustic characteristics of speech such as, for example, the openness of vowels.
Cepstral: A measure of the rate of change of the soundwave. It summarizes information about the changes in pitch and formants.
Jitter and shimmer: Measures of disturbance of the speech soundwave. Higher jitter corresponds to a rougher voice, while higher shimmer refers to breathiness and noise emission.
Support vector machine: A classification technique that identifies, in the multidimensional space of the predictor variables, regions delimited by hyperplanes in which the target variable is likely to take a certain value.
Logistic regression: In machine learning, a family of techniques that use a linear combination of predictors to build a sigmoidal probability function for the value of a categorical criterion.
Bayesian networks: A family of probabilistic machine learning techniques that calculate the conditional probability of each possible value of the criterion, given the values of the predictors. Compared to simpler Bayesian approaches, Bayesian networks include a representation of the relations between predictors.
Neural networks: A family of machine learning techniques inspired by the human brain. A neural network is made of layers of neurons and connections between the predictors, the neuron layers, and the outcome variables. The algorithm is tuned through backpropagation during the training phase to optimize the strength of the connections.
Extreme learning machine: A method similar to a neural network, in which the connections between neurons are not learned through backpropagation but are calculated directly, as in linear regression.
Kernel regression: A nonparametric regression technique. It aims at calculating the parameter of the probability function of the value of the criterion given the predictors.
Random decision forest: A technique consisting of the construction of multiple decision trees and the aggregation of their outputs (e.g., mean or median value) as a final output.
In PC, nonverbal features refer to numerical representations of information contained in audiovisual media and other sources (e.g., wearable devices, smartphones). Traditionally, features have mostly been engineered by human researchers inspired by the psychological literature on NVBs (alternative approaches are discussed below in the subsection dedicated to deep learning methods). It is to these features that we refer when, further in the article, we use the term “handcrafted automatically extracted features.” More specifically, feature engineering begins with the extraction of low-level properties of the material (e.g., contour pixel position per frame) that are combined to calculate higher level properties of the video (e.g., the movement of the upper lip). When analyzing a single target, PC researchers can extract visual, vocal, and linguistic features that can be identified in individuals. When analyzing interactions among multiple targets, the foregoing can be isolated for each target, and social features (related to the interaction between different targets) can also be isolated. Linguistic features are outside the scope of the current article, but the interested reader is directed to Boyd and Pennebaker (2017) and Mairesse et al. (2007). Furthermore, NVB is sometimes measured in less traditional ways. For example, features have been extracted from smartphone usage data (Masud et al., 2020; Stachl et al., 2020) or wristband accelerometer records (Hung et al., 2013). An example of a typical visual feature is fidgeting, defined as the amount of movement within a certain region of the space (e.g., face, hands, body; Pianesi et al., 2008). Weighted Motion Energy Images (wMEI) is another popular measure of the energy of movements (J. I. Biel et al., 2012). Computer vision algorithms can also be trained to detect specific movements such as head tilting (Nguyen & Gatica-Perez, 2016). 
A recent tool is OpenFace (Baltrušaitis et al., 2018), a multipurpose face and expression recognition software package, which can recognize different aspects such as general head movements, eye gaze direction, and facial action units (Baltrušaitis et al., 2015).
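As a rough illustration of how a movement feature such as fidgeting can be derived from low-level pixel information, the Python sketch below counts how many pixels in a region change between consecutive frames. It is a simplified frame-differencing proxy written for this example: it is not the published wMEI computation, it applies no correction for the overall movement of the region, and the function name, the region format, and the threshold value are all assumptions.

```python
def motion_energy(frames, region, threshold=10):
    """Mean fraction of pixels in `region` that change between consecutive frames.

    frames: list of 2D lists of grayscale intensities (0-255).
    region: (top, bottom, left, right) bounds; bottom and right are exclusive.
    threshold: minimum absolute intensity change counted as movement (assumed value).
    """
    top, bottom, left, right = region
    n_pixels = (bottom - top) * (right - left)
    fractions = []
    for prev, curr in zip(frames, frames[1:]):
        changed = sum(
            1
            for r in range(top, bottom)
            for c in range(left, right)
            if abs(curr[r][c] - prev[r][c]) > threshold
        )
        fractions.append(changed / n_pixels)
    return sum(fractions) / len(fractions)


# Toy 4x4 "video": a bright 2x2 patch appears in the top-left corner, then stays.
still = [[0] * 4 for _ in range(4)]
shifted = [[200 if r < 2 and c < 2 else 0 for c in range(4)] for r in range(4)]
frames = [still, shifted, shifted]

whole_frame = motion_energy(frames, (0, 4, 0, 4))
top_left = motion_energy(frames, (0, 2, 0, 2))
bottom_right = motion_energy(frames, (2, 4, 2, 4))
print(whole_frame, top_left, bottom_right)
```

In practice, the region would come from a face or body detector, and frames would be read from video with a computer vision library rather than written by hand.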
Among the most popular vocal features, we find pitch, energy, speech rate, first and second formant, cepstral, jitter, and shimmer (An et al., 2016; J. Biel et al., 2011; Kwon et al., 2013; Zhao et al., 2015). Some earlier works used a mixture of handcrafted automatically extracted audio features and manually annotated visual features (Nguyen et al., 2013).
Finally, social features include measures of interpersonal events, such as social gaze (Lepri, Subramanian, et al., 2012), speaking time (Lepri, Staiano, et al., 2012), or spatial proximity (Zen et al., 2010). In some cases, wearable devices have been employed to measure peoples’ movements and proximities in space in group interactions (Lepri, Staiano, et al., 2012).
Once features are extracted, their relationships with self-reported personality and PJ are modeled through different machine learning techniques. Early research mostly used classification algorithms (e.g., Support Vector Machines, Logistic Regression, Bayesian Networks) to discriminate trait scores higher or lower than average (Audhkhasi et al., 2012; Batrinca et al., 2011; Kwon et al., 2013; Mohammadi et al., 2010; Pianesi et al., 2008). However, binary classification tasks have started to decline in popularity following criticisms that average scores (which are the most common) were forcibly classified as high or low (Mariooryad & Busso, 2017; Phan & Rauthmann, 2021). The most popular approaches now use a continuous personality score for regression tasks, using techniques such as Neural Networks, Extreme Learning Machine, Kernel Regression, and Random Decision Forest (Aydin et al., 2016; Celiktutan & Gunes, 2017; Escalante, Kaya, et al., 2018; Kaya et al., 2016).
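The criticism of binary classification can be seen in a small Python sketch. The trait scores are invented, and splitting at the sample mean is an assumed labeling rule chosen to mirror the "higher or lower than average" tasks described above.

```python
from statistics import mean

# Hypothetical trait scores on a 0-100 scale for four targets.
scores = [49, 51, 5, 95]

# Binary labeling: "high" if above the sample average, otherwise "low".
cutoff = mean(scores)
labels = ["high" if s > cutoff else "low" for s in scores]

for score, label in zip(scores, labels):
    print(score, label)

# Two near-average targets (49 and 51) receive opposite labels, while two very
# different targets (51 and 95) receive the same label.
gap_across_labels = abs(scores[0] - scores[1])
gap_within_label = abs(scores[1] - scores[3])
print(gap_across_labels, gap_within_label)
```

A regression task that predicts the continuous score keeps these distinctions instead of discarding them at the labeling step.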
The foregoing research on handcrafted feature extraction and prediction algorithms typically retains explicit theoretical models of personality and of the relationships between features and such constructs. Deep learning approaches (An & Levitan, 2018; Güçlütürk et al., 2016), in contrast, do not use theoretical models for feature extraction. Instead, features are automatically engineered through the operation of complex unsupervised algorithms on video data analyzed by the machine at the pixel level to maximize accuracy (for an overview see Mehta et al., 2020). Deep learning is helping to build highly accurate personality perception models (Escalante, Kaya, et al., 2018). However, in a field like PC, in which transparency and explainability are important, deep learning creates challenges in obtaining a human-interpretable understanding of results and in replicating theoretical expectations (Escalante, Escalera, et al., 2018; Murdoch et al., 2019).
Overall, PC research has demonstrated the possibility of using handcrafted automatically extracted features to simulate human judgments and to detect personality. Some automatic features have been used successfully multiple times to simulate human ratings (e.g., wMEI and smiles; Junior et al., 2019), although no quantitative evaluation of the stability of these results across studies and settings has been conducted.
An important role in the development of the discipline has been played by competitive challenges (Celiktutan & Eyben, 2014; Celli et al., 2014; Escalera et al., 2017; Ponce-Lopez et al., 2016; Schuller et al., 2015), in which different teams worked on the same public databases and compared their methodologies. This encouraged rapid progress in terms of technology and accuracy, with the participants of the most recent challenge (Escalante, Kaya, et al., 2018) all having accuracy close to or above 90% when predicting which of two targets would be judged higher on Big Five traits. Challenges have used data from the laboratory (Celli et al., 2014), as well as data about radio speakers (Schuller et al., 2015), YouTube videos (Ponce-Lopez et al., 2016), and job applicant video curricula (Escalera et al., 2017), with the aim of covering naturalistic and diverse settings of personality expression.
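An accuracy figure for "predicting which of two targets would be judged higher" can be read as a pairwise ranking score. The Python sketch below shows one way such a score could be computed; the function, the handling of ties, and the numbers are assumptions made for illustration, and the exact evaluation protocol used in the cited challenges is not reproduced here.

```python
from itertools import combinations


def pairwise_accuracy(predicted, judged):
    """Share of target pairs ordered the same way by predicted and judged scores.

    Pairs tied on the judged score are skipped, since neither target was
    judged higher (an assumed convention).
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(judged)), 2):
        if judged[i] == judged[j]:
            continue
        total += 1
        if (predicted[i] - predicted[j]) * (judged[i] - judged[j]) > 0:
            correct += 1
    return correct / total


# Hypothetical scores for four targets on one trait.
judged = [0.62, 0.35, 0.80, 0.50]     # mean human ratings
predicted = [0.58, 0.40, 0.71, 0.60]  # model outputs

score = pairwise_accuracy(predicted, judged)
print(f"pairwise accuracy: {score:.2f}")
```

Here the model orders five of the six pairs correctly; it errs only on the first and fourth targets, whose predicted scores are reversed relative to the human ratings.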
PC technologies have achieved agreements above 90% with average ratings from human judges (Junior et al., 2019). This demonstrates that algorithms are able to exploit similar cues to those used by humans. Agreement with self-reports, however, is lower, but nontrivial, with accuracy values in the range of 60% (Junior et al., 2019). Just like in studies with human judges, the ability of PC approaches to accurately detect personality depends on the context in which the target is observed (Celiktutan & Gunes, 2017); that is, it is moderated by the quality of the information provided (“good information”). Deep learning approaches have recently obtained better results in personality recognition tasks than traditional handcrafted feature-based approaches (Escalante, Kaya, et al., 2018) suggesting avenues for further improvements in accuracy.
We believe that research on both PJ and PC would benefit from collaboration with the other discipline in order to answer key open questions and overcome current limitations. In the following section, we outline how an integrative approach could help address the important challenges for each of the disciplines (Figure 2).
We identified three major instances in which challenges in PJ can be addressed by integration with PC. First, existing NVB coding systems in psychology require manual observation and annotation of segments of behaviors, a process that requires considerable input in terms of money, time, and effort (Furr & Funder, 2009) that increases with the size of the data set. Automatic extraction of handcrafted features enables cheaper and more scalable methods for measuring cues, making large-scale research more practical.
Second, NVB coding in psychology cannot objectively record certain visual signals produced by targets alone or in interactions. PC offers a range of new objective measures of intensity, velocity, and amplitude of NVB. An example is fidgeting, which represents the average amount of movement in an area of the body (Batrinca et al., 2011). Another example of a novel measure is obtained through wearable devices, which can offer an accurate and sensitive measure of interpersonal distances.
Third, PJ research typically employs linear regression models for data analysis. This method is not sensitive to the more complex relations that are likely to exist between personality and NVBs and between NVBs and judgments (Brooks et al., 2018; Stolier et al., 2020). At least four elements are usually excluded: the combination of multiple NVB cues (e.g., a target who only smiles might express Extraversion, whereas a target who smiles and shakes might express Neuroticism); their temporal sequence (e.g., smiling after an expression of anger could indicate more Agreeableness than expressing anger after smiling); nonlinear relationships (e.g., curvilinear or exponential) between NVB and true or judged personality (e.g., there might be a “sweet spot” in the rate of smiling that denotes the highest levels of Extraversion); and correlations among personality traits (e.g., extraverts might be considered as more agreeable). PC has a stronger tradition in using machine learning algorithms to capture more complex relations between cues and judged or actual personality scores. Methods such as random regression forests or support vector machines do not assume linear or independent effects among predictors. For further reflections on the limitations of the lens model in dealing with nonlinear or combinatory effects and on the role of computer science in addressing these limitations, we direct the interested reader to the perspective by Hinds and Joinson (2019).
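The "sweet spot" case can be illustrated with a short Python sketch in which a cue has an inverted-U relationship with a trait. The data are synthetic, and the peak at three smiles per minute is an invented value. Instead of a random forest or support vector machine, the sketch swaps in a hand-chosen transformation (squared distance from the peak), simply to show what a purely linear coefficient misses.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Synthetic data: Extraversion peaks at a hypothetical rate of 3 smiles per minute.
smiles_per_min = [0, 1, 2, 3, 4, 5, 6]
extraversion = [5 - 0.4 * (s - 3) ** 2 for s in smiles_per_min]

# A linear coefficient sees almost no relationship.
linear_r = pearson(smiles_per_min, extraversion)

# A curvilinear predictor (squared distance from the peak) captures it fully.
distance_from_peak = [(s - 3) ** 2 for s in smiles_per_min]
curvilinear_r = pearson(distance_from_peak, extraversion)

print(f"linear r = {linear_r:.2f}, curvilinear r = {curvilinear_r:.2f}")
```

With real data the location of the peak would not be known in advance, which is one reason flexible models that learn such shapes from the data are attractive.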
Research in computer science on nonverbal personality detection also presents challenges that can be addressed by integration with PJ. PC studies mostly rely either on self-descriptions or on judges’ ratings, although there are a few notable exceptions (Celiktutan et al., 2019; Finnerty et al., 2016). A majority of the studies in PC collect only judges’ ratings (Junior et al., 2019), and the second most commonly used paradigm includes self-reports only. Far fewer studies have included informant reports or behavioral measures of personality. Psychological studies demonstrate that employing multiple criteria provides a more accurate ground truth of “actual personality,” with the inclusion of self-reports, informant reports, and behavioral measures (Funder, 1999). Artificial judgments informed by personality derived from multiple measures would be more valid.
Another challenge for PC concerns the generalization and explanation of results across different situations, judges, and targets. A PC model that operates very effectively in one situation might not do so when exposed to data from another one, with the risk of generating meaningless comparisons (Demasi et al., 2017). The potential for such overfitting to occur is greater when employing relatively opaque machine learning models. Such generalization failures may not only lead to lower accuracy in another field, but they can inadvertently result in biased evaluations of underrepresented and vulnerable groups. Psychological models, such as the RAM model that was mentioned previously, provide a theoretical framework in which the effects of moderators, such as targets’ gender, situational setting, or judge motivation, can be interpreted. In doing so, it becomes possible to develop algorithms that are sensitive to these moderators and incorporate that information in predictions.
The final PC challenge we will mention relates to the nature of the handcrafted automatically extracted NVB features. Currently, the features automatically extracted from video by PC algorithms mostly measure low-level, specific motor behaviors, such as head tilting, blinking, or smiling. A dependency on such microlevel cues can lead to an overweighting of such cues. In addition, microlevel cues sometimes vary in meaning depending on context. For example, high levels of blinking during a high-anxiety-inducing task (such as an exam) have a different meaning than in a low anxiety-inducing task (such as playing a noncompetitive game). Thus, an over-reliance on microlevel cues might obscure the broader context in which those NVBs are produced and misconstrue their communicative intents (Huang & Kovashka, 2016). Psychology offers comprehensive models to record NVB within the micro-, meso- and macrolevels. The occurrence of microlevel features can be situated within macrolevel behaviors to enable the identification of communicative intents more accurately.
An integrative approach to nonverbal personality detection requires more than the simple transfer of knowledge or technologies among the component disciplines. In proposing an integrated science, we aim to understand, reproduce, and improve the ability of human and artificial agents to detect personality. Human and artificial judgments of personality will be studied within a single framework (Figure 3) and novel research questions will become possible as new theories emerge.
Personality and its nonverbal expression are the starting point of the framework. As mentioned earlier, the measure of actual personality should be carefully selected, and the limitations of candidate measurement methods and the possibility for their combination should be considered when applying an integrative approach. A fundamental assumption of nonverbal personality detection is that certain traits (actual personality) are expressed via NVBs (Back & Nestler, 2016). Within the integrated framework, we generally assume that the expression of personality occurs independently of the observer’s judgment (although the presence of a direct observer might influence personality expression). Thus, as long as contextual elements are the same, nonverbal expression will not be affected by the nature (i.e., human or artificial) of the judge.
Both human and artificial judges make their inferences based on the behaviors expressed by the targets. In the case of artificial judgments, NVBs are first operationalized (when using a handcrafted approach) as automatic features, whose values can be interpreted by the machine and used for calculation, while, in human judgments, NVBs are coded as observed NVBs. Nonverbal expression can, therefore, be measured both as the correlation between personality and automatic features or between personality and observed NVBs. The vertical line in the center of the model represents the influence of NVB on both human and artificial judgments. Human cues utilization refers to the relation between NVBs and human judgments, while artificial cues utilization refers to the relation between automatic features and artificial judgments.
Automatic features, however, can also be used in PJ, to scale up the assessment of NVBs. From this perspective, we advocate for calculating and reporting feature accuracy, a measure of the association between engineered features and human ratings on associated observed NVBs (e.g., number of smiles recorded by a trained human judge and the number of smiles recorded through a computer vision software). This procedure can also help uncover the meaning of self-extracted features in deep learning designs. Research on feature accuracy can also help to achieve reliable automatic measurements of macro NVBs. Human annotations of macro NVBs can be used as ground truth to train machine learning algorithms to identify more complex nonverbal characteristics of targets, such as their displayed nervousness or friendliness. This approach is already well established for recognizing emotions (Valstar et al., 2012).
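As a minimal sketch of how feature accuracy could be computed, the Python snippet below correlates the number of smiles counted by a trained human coder with the number counted by software for the same videos. The counts are invented, and the use of a Pearson correlation is our assumption; other measures of association or agreement could be equally appropriate.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical smile counts for six videos (one value per video).
human_smiles = [3, 7, 2, 9, 5, 4]       # coded by a trained human judge
automatic_smiles = [4, 6, 2, 10, 5, 3]  # extracted by computer vision software

# Feature accuracy: association between the engineered feature
# and the human rating of the corresponding observed NVB.
feature_accuracy = pearson(human_smiles, automatic_smiles)
print(f"feature accuracy (r) = {feature_accuracy:.2f}")
```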
On the left side of Figure 3, human detection accuracy and artificial detection accuracy represent the accuracy of humans and algorithms when judging personality by using NVBs. Consequently, the accuracy of both natural and artificial agents can be compared across different situations and scopes (see, e.g., the work of Kosinski on personality prediction from social media data; Youyou et al., 2015). The comparison of human and artificial judgments is important for understanding the biases and fallacies of both. It can also guide practitioners in the selection of ideal personality detection methods.
On the right side of Figure 3, we outline some of the opportunities available when human and artificial judgments are obtained from the same targets. Agreement between human and artificial judges can be measured as human-artificial consensus. Artificial judgments can be optimized either for detection accuracy or for consensus with humans, and the comparison of the two approaches can offer new tools to understand human processes. Finally, the judgments produced by human raters and algorithms can be blended, opening the way to new studies aimed at evaluating collaborations between human and artificial judges. There is virtually no research to date on the utility of combining human and artificial judgments, despite the potential relevance of the topic. We identified at least five different approaches to blending: (a) humans can supervise and override artificial judgments; (b) software can point humans toward the most valid cues; (c) algorithms can suppress nonvalid cues (e.g., racial features); (d) algorithms can flag potentially inaccurate judgments; (e) artificial and human judgments can be aggregated. PC scholars have developed many techniques for aggregating data from multiple sources (e.g., audio and video), which could also be applied to the aggregation of human and artificial judgments beyond a simple regression (e.g., Junior et al., 2019). The choice of the best aggregation technique is open to future research.
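Approach (e) can be illustrated with a deliberately simple Python sketch. It standardizes human ratings and artificial scores, which may be on different scales, and takes a weighted average. The function names, the `weight_human` parameter, and the data are hypothetical, and equal weighting is only one of many possible aggregation rules.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize values to mean 0 and SD 1 (assumes they are not all equal)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def blend(human, artificial, weight_human=0.5):
    """Weighted average of standardized human and artificial judgments."""
    zh, za = zscores(human), zscores(artificial)
    return [weight_human * h + (1 - weight_human) * a for h, a in zip(zh, za)]

# Invented Extraversion judgments for five targets.
human_ratings = [2.0, 3.5, 4.0, 1.5, 3.0]  # e.g., mean rating on a 1-5 scale
artificial_scores = [40, 65, 70, 35, 50]   # e.g., algorithm output on a 0-100 scale

blended = blend(human_ratings, artificial_scores)
print([round(b, 2) for b in blended])
```

Whether such a blend improves on either source alone is an empirical question that would need to be tested against a criterion measure of personality.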
The integrated framework can be used to compare the impact of moderators on automatic and human judgments. The use of moderators in integrated approaches opens up important research questions and can contribute to the development of more efficient and reliable machines. Furthermore, it helps to understand under which conditions human or artificial detection is more effective. In this section, we provide relevant examples of the impact of moderators in integrated research.
Demographic characteristics of the targets, such as gender, ethnicity, or age, that are easily accessible to human judges have been shown to influence overall PJ in three ways. First, such characteristics are used directly as cues. The utilization of nonverbal cues (e.g., visible demographic characteristics of targets) can lead to greater accuracy (e.g., age is a valid and utilized cue for Openness to Experience, Chan et al., 2012), but also to unfair bias and discrimination (e.g., Arabic faces receive less favorable ratings from German right-wingers, Stecker et al., 2021). Second, demographic characteristics moderate the detection of nonverbal cues, as cues that fit with stereotypes about ethnicity or gender are detected more easily (see Stolier et al., 2020). Finally, demographic characteristics moderate the utilization of nonverbal cues (e.g., women are in general expected to smile more than men, Briton & Hall, 1995).
Although providing human judges with standardized scales and extensive training can help to reduce unfairness, a significant proportion of human bias is due to unconscious processes that are hard to control and that can be influenced by external factors such as mood and fatigue of the judge (Tversky & Kahneman, 1974).
In PC, the risks of inappropriate judgments, due to biases in the data or to inaccuracies of the learning algorithms, are considerable, since the impact of discriminatory artificial judges is magnified by their speed, potentially exposing a high volume of targets to unfair practices and companies to considerable legal sanctions (Buolamwini & Gebru, 2018). On the other hand, machine biases in judgments are relatively easier to detect than human ones, and several fixes have been proposed by researchers working on fairness in machine learning, a recent research area that investigates and evaluates methods to ensure that biases and inaccuracies do not lead to outcomes that discriminate against some individuals on the basis of gender, race, and other characteristics (Barocas & Selbst, 2016; Barocas et al., 2018; Dwork et al., 2012; Kleinberg et al., 2018; Lepri et al., 2021). In this regard, the integrated framework can be used to identify and address two problems relating to the effects of demographics on judgments.
First, demographics can influence the capability of computer vision software to recognize cues due to differences in face morphology and skin color. For example, a picture enhancement software designed to recognize unintentional blinks produced many false positives with Asian study participants (Zou & Schiebinger, 2018). Thus, we suggest measuring feature accuracy within groups that differ in their demographics before using the features in high-stakes contexts.
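One way to check a feature within demographic groups is sketched below in Python. It computes the false positive rate of a hypothetical blink detector separately for each group, treating human coding as the reference. The group labels, the record layout, and the data are illustrative assumptions.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, human_coded_blink, software_detected_blink).

    Returns, per group, the share of frames without a human-coded blink
    in which the software nevertheless reported a blink.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, human, software in records:
        if not human:              # the reference coding says "no blink"
            negatives[group] += 1
            if software:           # the software reported a blink anyway
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}

# Invented frame-level annotations for two hypothetical groups, A and B.
records = [
    ("A", False, False), ("A", False, False), ("A", False, True),
    ("A", False, False), ("A", True, True),
    ("B", False, True), ("B", False, True), ("B", False, False),
    ("B", False, False), ("B", True, True),
]

rates = false_positive_rate_by_group(records)
print(rates)
```

A marked difference between groups, as in this toy example, would suggest that the feature should not yet be relied upon in consequential settings.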
Second, controlling for bias in judgments due to demographics raises considerable technical and conceptual issues. An apparently simple solution is to program artificial judges to suppress, include as cues, or control for demographic information. If the judge is programmed to suppress demographic features, all the cues that could lead to recognition of demographic features are removed, to restrict the artificial agent to processing NVBs (e.g., Cai et al., 2019). However, this solution has strong technical limitations, as demographic characteristics can creep into the algorithm indirectly (Lepri et al., 2021; Tomasev et al., 2021). For example, people from a certain country might combine two specific NVBs much more often than expected, allowing the algorithm to “indirectly” recognize ethnicity. If the judge is programmed to include demographic features explicitly, demographic information can be treated as a cue and/or as a moderator of other cues, and the role that demographic features play can be identified and managed appropriately. However, this could introduce explicit discrimination. Finally, controlling for demographics consists of using mathematical rules to ensure that, on average, protected groups receive the same judgments as the majority (Dwork et al., 2012; Lepri et al., 2021). This approach risks a loss in accuracy when comparing diverse groups of people (Dwork et al., 2012).
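The third option can be illustrated with a crude Python sketch. It measures the gap between group means of artificial judgments and then shifts each group's scores so that all groups share the overall mean. This mean-shift rule, the function names, and the data are our own simplifications for illustration; they are not the procedures proposed in the cited fairness literature.

```python
from statistics import mean

def group_means(judgments, groups):
    """Mean judgment for each demographic group."""
    by_group = {}
    for score, group in zip(judgments, groups):
        by_group.setdefault(group, []).append(score)
    return {g: mean(scores) for g, scores in by_group.items()}

def parity_gap(judgments, groups):
    """Difference between the highest and lowest group mean."""
    means = group_means(judgments, groups)
    return max(means.values()) - min(means.values())

def equalize_means(judgments, groups):
    """Shift each group's scores so every group shares the overall mean."""
    overall = mean(judgments)
    means = group_means(judgments, groups)
    return [s - means[g] + overall for s, g in zip(judgments, groups)]

# Invented artificial judgments for a majority and a protected group.
judgments = [3.0, 4.0, 5.0, 2.0, 3.0, 4.0]
groups = ["majority"] * 3 + ["protected"] * 3

gap_before = parity_gap(judgments, groups)
adjusted = equalize_means(judgments, groups)
gap_after = parity_gap(adjusted, groups)
print(gap_before, gap_after)
```

If the groups genuinely differ on the trait, an adjustment of this kind distorts individual scores, which is one way to see the accuracy cost mentioned above.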
The foregoing problems in the treatment of demographics in nonverbal personality detection will benefit from the potential to engage in parallel (human vs. machine) and collaborative (human plus machine) research. Such a research program might begin with “simple” questions such as whether human or machine judgments are more appropriate, and the training and limits required to optimize these approaches. However, once we consider the potential of blended judgments to correct bias, we open novel deeper conceptual questions relating to the processing of nonverbal information by humans and machines.
In PC, human judgments are generally used as ground truth. However, we know from PJ research that human judges’ demographics act as moderators of judgments and have an impact on the fairness and bias of the outcome. For example, an algorithm trained to replicate human PJ of people from different ethnicities could replicate human bias when trained only on members of the majority. The question of judges’ diversity is important both for classical psychology and for PC, making it an ideal research question to be answered in synergy. For example, algorithms trained on diverse judges could be used to flag biased human judgments.
We mentioned earlier that, in PJ, certain traits are more accurately judged than others, but this pattern of judgment accuracy due to traits is not replicated in PC. While human judges recognize Extraversion more easily (Connelly & Ones, 2010), PC algorithms have performed better with Openness and Conscientiousness (Junior et al., 2019). This might be due, for example, to different impacts of trait visibility. Visibility might impact human and artificial judgments in different ways. Artificial judges, indeed, might be able to capture subliminal NVBs that are not easily accessible to human perceivers. Humans, on the other hand, might be better able to include contextual or holistic elements in their evaluations. Further investigation of these topics might therefore help to identify new cues invisible to humans and to improve algorithms for the identification of more holistic cues.
Algorithms have been employed to detect personality in highly standardized settings (e.g., an automated structured clinical interview; Stratou et al., 2013), in constrained, but unstandardized environments (e.g., a customer assistant agent; Gilpin et al., 2018) or in unconstrained environments (e.g., visual surveillance; Zen et al., 2010). Unconstrained and unstandardized environments represent a current challenge for PC (Junior et al., 2019), since contextual elements can strongly influence targets’ observable behaviors. Integrative studies that include the manipulation of settings can help to better understand how human judges manage contextual variability and how machines can replicate such understanding. As individuals frequently form and develop relationships that alternate between online and offline environments, another research question for future studies concerns if and how behavioral expressions remain stable across digital versus nondigital environments.
The examples discussed are representative of the many new research questions opened by an integrative approach that can help understand the strengths and weaknesses of artificial and human evaluation of personality in various settings. In the next section, we expand on how the framework could be applied to a real-world problem. We have chosen the context of candidates’ screening, as it is a situation where automatic PJ are already used.
One situation in which effective detection of personality has important consequences is the employment interview (Nederström & Salmela-Aro, 2014; Swider et al., 2016). There are therefore practical reasons to compare human and artificial agents’ accuracy. In this example (Figure 4), the study will focus on how Extraversion, as expressed through NVBs (nonverbal expression) in a video interview, is judged by human and artificial judges. First, a large sample of candidates completes a personality questionnaire and a video interview. Out of the large sample (approximately 1,000), 90% of the videos are included in a training set and 10% as a testing set. Using the training set, an algorithm is trained to detect Extraversion using a set of features from previous research. The ground truth for the algorithm is the candidates’ scores on their personality questionnaires, a typical training method in PC. The algorithm trained on the training set will constitute the artificial judge.
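The 90%/10% partition described above can be sketched in a few lines of Python. A random split with a fixed seed is our assumption, since the example does not specify how candidates are allocated, and the function and parameter names are hypothetical.

```python
import random

def split_candidates(candidate_ids, test_fraction=0.10, seed=0):
    """Randomly partition candidate IDs into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(candidate_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Approximately 1,000 candidates, identified here by hypothetical integer IDs.
train_ids, test_ids = split_candidates(range(1000))
print(len(train_ids), len(test_ids))
```

Keeping the testing set strictly separate matters here, because only those candidates are later judged by humans and assessed at the assessment center.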
Following training, the candidates in the testing set are evaluated by the human and artificial judges. In the human (upper) branch of Figure 4, the NVBs of the testing set candidates observed in the video interview are recorded manually by trained assessors, and their Extraversion is judged by multiple human raters. In the artificial (lower) branch of Figure 4, features are automatically extracted from the video interview and Extraversion judgments are generated by the algorithm. The candidates in the testing set are eventually invited to an assessment center, which consists of three group interactions (each with different people). The candidates’ behaviors throughout the assessment center exercises are recorded to create a composite behavioral measure of Extraversion.
This design allows the researcher to explore most of the relations depicted in Figure 3. Feature accuracy can be explored because NVBs are recorded both manually and automatically. Human and artificial judgments can be compared (human-artificial consensus), and differences in cue utilization can be explored. Finally, it is possible to compare the accuracy of the Extraversion judgments produced by single humans (human detection accuracy), aggregated human scores, the algorithm (artificial detection accuracy), and a blend of human and artificial judgments. The effect of moderators can also be easily included in the design. For example, the researcher could compare human and artificial accuracy, feature accuracy, and human-artificial consensus achieved for male and female candidates.
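The comparison of accuracies could be organized as in the following Python sketch, which correlates each source of judgment with the composite behavioral measure from the assessment center. All values are invented, the variable names are hypothetical, and the blend is a simple average that assumes human and artificial judgments are on the same scale.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented scores for five testing-set candidates.
criterion = [1, 2, 3, 4, 5]            # composite behavioral Extraversion
rater_a = [2, 1, 3, 5, 4]              # single human judge
rater_b = [1, 3, 2, 4, 5]              # second human judge
algorithm = [1.5, 2.5, 2.0, 4.5, 4.0]  # artificial judge

# Aggregated human score: mean across raters for each candidate.
aggregated = [mean(pair) for pair in zip(rater_a, rater_b)]
# Blended judgment: mean of aggregated human and artificial scores.
blended = [mean(pair) for pair in zip(aggregated, algorithm)]

accuracy = {
    "rater_a": pearson(rater_a, criterion),
    "rater_b": pearson(rater_b, criterion),
    "aggregated_human": pearson(aggregated, criterion),
    "algorithm": pearson(algorithm, criterion),
    "blended": pearson(blended, criterion),
}
for source, r in accuracy.items():
    print(f"{source}: r = {r:.2f}")
```

The same table could be computed separately for subgroups of candidates (e.g., by gender) to examine moderators.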
PC and PJ have different traditions, methods, and goals, and this represents the major barrier that keeps the two disciplines separated (Mahmoodi et al., 2017). We can foresee four major challenges. First, while psychology is mostly concerned with explanation, computer science is mostly focused on prediction. There are valid arguments, however, to include both approaches in human behavior research, which have been discussed elsewhere (Mõttus et al., 2020; Yarkoni & Westfall, 2017). Second, researchers in psychology are used to traditional data sets, rather than “big data,” which are characterized by lack of structure and high volume (Chen & Wojcik, 2016; e.g., the video recording of a target). In this regard, PC can provide tools for researchers in psychology to transform big data into more manageable forms. Third, the two disciplines have different traditions in reporting results, with PJ focusing on the individual correlations between NVBs and personality traits, and PC on aggregate accuracy. Both aspects should be reported in an integrative approach. Fourth, articles on PC and PJ are published in different outlets, which might hamper cross-dissemination. Cross-disciplinary conferences or journal special issues could help to address this problem. Academic institutions have an important role in overcoming these challenges, as doing so will require fostering cross-disciplinary learning and setting up cross-disciplinary teams. Also, researchers from both fields will need to come up with a common terminology or at least to learn the major terminology of each other’s field (Blandford et al., 2018). Despite the difficulties, however, cross-disciplinary research can lead to unexpected results and even paradigm shifts, as was the case in the 1950s with the cognitive revolution, in which the collaboration between psychology, linguistics, philosophy, neuroscience, anthropology, and computer science enormously benefitted each of the disciplines (Baddeley, 2000).
Artificial personality assessments can integrate or substitute traditional personality assessments in industrial, judicial, clinical, or research settings. The most important commercial domain is represented by employment testing, especially in combination with remote video interviews (see e.g., Chiang & Berkoff, 2017). Main drivers for using artificial agents for personality assessment rather than traditional methods include cost reduction, bias attenuation, and control of impression management (Ihsan & Furnham, 2018). The main disadvantage is that psychometric properties of artificial personality are not as well researched and established as for traditional methods. The integrative framework and the accuracy moderators can be used for the evaluation of commercial applications in different settings, helping producers to demonstrate the validity of the algorithm they build and users to make informed decisions on if and how to use them. Artificial judgment accuracy, feature accuracy, human-machine consensus, and, where relevant, blended judgments accuracy, are all important constructs to evaluate when deciding to use artificial PJ over other methods. Target, judge, trait, and situation moderators should also be taken into account to understand under which circumstances one method is more suited.
Artificial personality detection is also used in more novel domains where traditional assessments are not a viable option. Detection of personality can be one of the features of artificial agents such as customer assistant chatbots (Siddique et al., 2017), conversational agents such as Amazon’s Alexa (Ram et al., 2018), trust-building virtual agents (Zhou et al., 2019), or interactive social robots (Celiktutan et al., 2019). In this case, too, the integrated framework can help to improve and understand the performance of the models and to increase generalizability.
This research intends to help connect the two major disciplines that study personality detection. Researchers from both fields have been calling for more collaboration. For example, Schmid Mast and colleagues argue that “The development of algorithms to extract relevant behavioral cues […] provides an opportunity for the emergence of a new, interdisciplinary field in which behavioral scientists collaborate with computer scientists” (Mast et al., 2015). In their PC survey, Junior and colleagues state: “Recent and promising results on personality computing may encourage psychologists to get interested in machine learning approaches and to contribute to the field. […] A very promising venue for research has to do with the incorporation of prior knowledge in personality analysis models” (Junior et al., 2019).
We highlighted how some of the gaps in each field can be addressed by the other, with PC providing valuable tools for the analysis of big data and psychology providing theories and models. The resulting integrative approach provides a working framework and a common vocabulary for researchers in both fields. It offers the opportunity for researchers in psychology and computer science interested in nonverbal personality detection to compare their results, to apply new technologies and models, and to explore novel hypotheses regarding the comparison and combination of human and artificial judgments.
As the integrated field matures, how researchers measure and conceptualize the construct of personality could transform. For example, humans of the 21st century continuously leave a digital footprint of “nonverbal” behaviors (e.g., likes in social media, step counts from mobile phone, or heart rate measured by smartwatches; Montag & Elhai, 2019). Access to such granular behavioral data as well as machine learning methods to analyze such data could lead to a more behaviorally focused conceptualization of personality that is less based upon Likert scales and self-concepts (Phan & Rauthmann, 2021). With this in mind, an important avenue for the integrated discipline will be to promote inductive exploration, in which the data coming from nonverbal sensors are measured and extensively described across subjects (Kubovy, 2020).
We have focused on personality as the variable to detect and NVBs as cues, but the integrated approach is potentially illustrative of any integration between interpersonal perception psychology and social computing (for an overview, see Kedar & Bormane, 2015, or Parameswaran & Whinston, 2007). Other target variables for which the presented framework could be adapted include emotions (Dzedzickis et al., 2020), social relations (Liu et al., 2019; Sun et al., 2017), clinical conditions (e.g., Yang et al., 2019), or deception (An et al., 2018; Cohen et al., 2010). Other cues that can be perceived by humans and measured by machines include verbal (An et al., 2016; O’Sullivan et al., 1985) and written content (Tucaković et al., 2020; Xue et al., 2018) or social media activity (Youyou et al., 2015). It is our hope that the proposed framework will provide a fruitful guideline for such research.