Abstract
There is growing evidence that addressees in interaction integrate the semantic information conveyed by speakers’ gestures. Little is known, however, about whether and how addressees’ attention to gestures and the integration of gestural information can be modulated. This study examines the influence of a social factor (speakers’ gaze to their own gestures), and two physical factors (the gesture’s location in gesture space and gestural holds) on addressees’ overt visual attention to gestures (direct fixations of gestures) and their uptake of gestural information. It also examines the relationship between gaze and uptake. The results indicate that addressees’ overt visual attention to gestures is affected both by speakers’ gaze and holds but for different reasons, whereas location in space plays no role. Addressees’ uptake of gesture information is only influenced by speakers’ gaze. There is little evidence of a direct relationship between addressees’ direct fixations of gestures and their uptake.
Introduction
Typically, when we talk, we also gesture. That is, we perform manual movements as part of the expressive effort (Kendon 2004; McNeill 1992). Such speech-accompanying gestures typically convey meaning (e.g., size, shape, direction of movement), which is related to the ongoing talk. The communicative role of these gestures is somewhat controversial. It is debated both whether speakers actually intend gestural information for their addressees (e.g., Holler and Beattie 2003; Melinger and Levelt 2004), and whether addressees attend to and integrate the gestural information. This paper focuses on the latter issue.
There is growing evidence that speech and speech-accompanying gestures are processed and comprehended together, forming an ‘integrated’ system or a ‘composite signal’ (e.g., Clark 1996; Kendon 2004; McNeill 1992). Gestural information is integrated with speech in comprehension and influences the interpretation and memory of speech (e.g., Beattie and Shovelton 1999a, 2005; Kelly et al. 1999; Langton and Bruce 2000; Langton et al. 1996). For instance, information expressed only in gestures re-surfaces in retellings, either as speech, as gesture, or both (Cassell et al. 1999; McNeill et al. 1994). Further, neurocognitive studies show that incongruencies between information in speech and gesture yield electrophysiological markers of integration difficulties such as the N400 (e.g., Özyürek et al. 2007; Wu and Coulson 2005). However, surprisingly few studies have attempted to examine directly whether attention to gestures and uptake of gestural information is deterministic and unavoidable or whether such attention is modulated in human interaction, and if so by what factors. Furthermore, surprisingly little is known about the role of gaze in this context. This study therefore aims to examine what factors influence overt, direct visual attention to gestures and uptake of gestural information, focusing on one social factor, namely speakers’ gaze at their own gestures, and two physical properties of gestures, namely their place in gesture space and the effect of gestural holds. The study also examines the relationship between addressees’ gaze and uptake.
Visual Attention to Gestures
Gestures are visuo-spatial phenomena, and so the role of vision and gaze for attention is important. However, addressees seem to gaze directly at speakers’ gestures relatively rarely. Addressees mainly look at the speaker’s face during interaction (Argyle and Cook 1976; Argyle and Graham 1976; Bavelas et al. 2002; Fehr and Exline 1987; Kendon 1990; Kleinke 1986). Studies using eye-tracking techniques in face-to-face interaction have further demonstrated that addressees spend as much as 90–95% of the total viewing time fixating the speaker’s face and thus fixate only a minority of gestures (Gullberg and Holmqvist 1999, 2006).
However, the likelihood of an addressee directly fixating a gesture increases under the following three circumstances (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000). The first is when speakers first look at their own gestures (speaker-fixation) (Gullberg and Holmqvist 1999, 2006). This tendency is stronger in live face-to-face interaction than when observing speakers on video (Gullberg and Holmqvist 2006). This suggests that the overt shift of visual attention to the target of a speaker’s gaze is essentially social in nature rather than an automatic response. The second circumstance is when a gesture is produced in the periphery of gesture space in front of the speaker’s body (cf. McNeill 1992). The third is when a gestural movement is suspended momentarily in mid-air and goes into a hold before moving on (cf. Kendon 1980; Kita et al. 1998; Seyfeddinipur 2006). Holds are often found between the dynamic movement phase of a gesture, the stroke, and the so-called retraction phase, which marks the end of a gesture. It is currently not clear whether these three factors—speaker-fixation, peripheral articulation, and holds—all contribute independently to the increased likelihood of the addressee’s fixation on gesture. The evidence for the influence of these three factors mostly comes from observational studies of naturalistic conversations, in which the three factors often co-occur (Gullberg and Holmqvist 1999, 2006; see also Nobe et al. 1998, 2000). Therefore, one of the goals of this study is to experimentally manipulate these factors and assess their relative contributions to the likelihood of addressees’ fixations of gesture.
The three factors may draw the addressee’s attention either for bottom-up, stimulus-related reasons or for top-down, social-cognitive reasons. Gestures in peripheral gesture space or with a hold may elicit the addressee’s fixation for bottom-up reasons, namely, because these gestures challenge peripheral vision. Firstly, the acuity of peripheral vision decreases the further away from the fovea the image is projected, and secondly, peripheral vision, which is good at motion detection, cannot process information about a static hand in a hold efficiently. In contrast, gestures with speaker-fixations may elicit the addressee’s fixation for top-down social reasons, namely to manifest social alignment or joint attention. The difference between bottom-up and top-down processes should be reflected in different onset-latencies of fixations to gestures (cf. Gullberg and Holmqvist 2006). Fixation onsets that are bottom-up driven should be short, whereas fixations driven by top-down concerns should have longer onsets (e.g., Yantis 1998, 2000). Thus, another goal of the study is to compare the onset-latency for fixations on gestures triggered by the three factors to further elucidate the reasons for fixation.
Uptake of Gestural Information
Only a few studies have attempted to directly examine whether attention to and uptake of information from gestures is unavoidable or whether it is ever modulated and if so by what factors. Rogers (1978) manipulated noise levels showing that addressees pick up more information from gestures the less comprehensible the speech signal. Beattie and Shovelton (1999a, b) demonstrated that addressees decode information about relative position and size better when presented with speech and gesture combined than with either gesture or speech alone. Interestingly, this study also indicated that not all gestural information was equally decodable. Addressees reliably picked up location and size information pertaining to objects, but did worse with information such as direction. These studies indicate that the comprehensibility of speech affects addressees’ attention to gestures and also that the type of gestural information matters.
Other factors may also modulate addressees’ attention to gestures. Speakers’ gaze to their own gestures, a factor of a social nature, is a likely candidate. It is well-known that humans are extremely sensitive to the gaze direction of others (e.g., Gibson and Pick 1963), and that gaze plays a role in the establishment of joint attention (e.g., Langton et al. 2000; Moore et al. 1995; Tomasello 1999; Tomasello and Todd 1983). It has been suggested that speakers look at their own gestures as a means to draw addressees’ attention to them in face-to-face interaction (e.g., Goodwin 1981; Streeck 1993, 1994). Such behavior could increase the likelihood of addressees’ uptake of gestural information, although this has not been tested with naturalistic, dynamic gestures that are not pointing gestures.
Physical properties of gestures may also affect addressees’ uptake of gestural information. First, the location of the gesture in gesture space may matter (cf. McNeill 1992). Speakers often bring gestures up into central gesture space, that is, to chest height and closer to the face, when they want to highlight the relevance of gestures in interaction (e.g., Goodwin 1981; Gullberg 1998; Streeck 1993, 1994). The information expressed by such a gesture seems more likely to be integrated than that of a gesture articulated for instance on the speaker’s lap in lower, peripheral gesture space.
A second potentially important physical property is the gestural hold. The functional role of holds is somewhat debated, but holds have been implicated in turn taking and floor holding in interaction. Transitions between speaker turns in interaction are more likely once a gesture is terminated or when a tensed hand position is relaxed (e.g., Duncan 1973; Fornel 1992; Goodwin 1981; Heath 1986). If holds are a first indication that speakers are about to give up their turn, it would be communicatively useful for addressees to attend to them. This in turn may increase the likelihood of information uptake from a gesture with a hold. A further goal of this study, then, is to examine the impact of these three factors on addressees’ uptake of gesture information.
The Relationship Between Fixations and Information Uptake
As indicated above, most gestures are perceived through peripheral vision. Although peripheral vision is powerful, optimal image quality with detailed texture and color information is achieved only in direct fixations, that is, if the image falls directly on the small central fovea. Outside of the fovea, parafoveal or peripheral vision gives much less detailed information (Bruce and Green 1985; Latham and Whitaker 1996). Consequently, it is generally assumed that an overt fixation indicates attention in the sense of information uptake. If addressees shift their gaze from the speaker’s face to a gesture in interaction, this might indicate that they are attempting to integrate the gestural information (e.g., Goodwin 1981; Streeck 1993, 1994).
However, addressees’ tendency to gaze directly at an information source is modulated in face-to-face interaction by culture-specific norms for maintained or mutual gaze to indicate continued attention (e.g., Rossano et al. 2009; Watson 1970). In cultures where mutual gaze is socially important, face-to-face interaction may emphasize the reliance on peripheral vision for gesture processing and dissociation between overt and covert attention. Addressees can fixate a visual target without attending to it (“looking without seeing”), and conversely, attend to something without directly fixating it (“seeing without looking”). If the speaker’s face is the default location of visual attention in interaction, then most gestures must be attended to covertly. It is therefore not entirely clear what the relationship between overt fixation and information uptake might be in interaction from information sources like gestures. A final goal of this study is therefore to examine the relationship between overt fixation of and uptake of information from gestures.
The Current Research
This study aims to examine what factors modulate addressees’ visual attention to and information uptake from gestures in interaction by asking the following questions:
1. Do social and physical factors influence addressees’ fixations on speakers’ gestures? Furthermore, do different factors trigger qualitatively different fixations, reflecting the difference between top-down vs. bottom-up processes? We expect top-down driven fixations to have longer onset latencies than bottom-up driven fixations.
2. Do social and physical factors influence addressees’ uptake of gesture information?
3. Are addressees’ fixations a good index of information uptake from gestures?
To examine these questions we present participants (‘addressees’) with video recordings of naturally occurring gestures embedded in narratives. We examine the effect of a social factor, namely the presence/absence of speakers’ fixations of their own gestures (Study 1), and the effect of two physical properties of gestures, namely gestures’ location in gesture space (central/peripheral) and the presence/absence of holds (Study 2). In Studies 1 and 2, we manipulate the independent variables by selecting gestures with the relevant properties from a corpus of video recorded gestures. In a second set of control experiments, we present participants with digitally manipulated versions of the gesture stimuli used in Studies 1 and 2, examining the effect of presence/absence of speakers’ artificial fixations of their own gestures (Study 3) and the presence/absence of artificial holds (Study 4). These studies are undertaken to control for any other unknown variables that may have differed between the stimulus gestures used in the conditions in Studies 1 and 2.
In all studies, participants were presented with brief narratives that included a range of gestures, but our analyses focus on one “target gesture” in each narrative. Each target gesture conveyed information about the direction of a movement. This information was only encoded in the target gesture, and not in other gestures or in speech. Overt visual attention to gestures was operationalized as direct fixations of gestures. Participants’ eye movements were recorded during the presentation of the narratives using a head-mounted eye-tracker. Further, information uptake was operationalized as the extent to which participants could reproduce the information conveyed in the target gesture in a drawing task following stimulus presentation. Participants were asked to draw an event in the story that crucially involved the movement depicted by the target gesture. The match between the directionality of the movement in the drawing and in the target gesture was taken as indicative of information uptake.
Study 1: Speaker-fixations
The first study examines the effect of a social factor on addressees’ overt visual attention to and uptake of information from gestures, namely the presence/absence of speakers’ fixations of their own gestures.
Methods
Participants
Thirty Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 22, SD = 3), 23 women and 7 men. They were paid 5 euros for their participation.
Materials
The stimuli were taken from a corpus of videotaped face-to-face story retellings in Dutch (Kita 1996). The video clips showed speakers facing an addressee or viewer retelling short stories. The video clips did not show the original live addressee, but only the speaker seated en face. Each video clip contained a whole, unedited story retelling. Each clip therefore contained multiple gestures, only one of which was treated as a target gesture. Consequently, the target gesture appeared within sequences of other gestures so as not to draw attention as a singleton. The stimulus videos were selected from the corpus because they contained one target gesture displaying the appropriate properties. For Study 1, each target gesture displayed either presence or absence of speaker-fixation, that is, the speakers either looked at their own gestures or not. The target gestures were otherwise similar, and performed in central gesture space without holds. All target gestures were representational gestures encoding the movement of a protagonist in the story from an observer viewpoint (McNeill 1992), meaning that the speaker’s hand represented a protagonist in the story as seen from outside. The target gestures, typically expressing a key event in the story lines, encoded the direction of the protagonist’s motion left or right. Although the movement itself was an important part of the storyline, the direction of the movement was not. The directional information was only present in the target gesture and not in co-occurring speech. Further, the directional information could not be inferred from other surrounding gestures. Care was taken to ensure that the gestural information was not highlighted in any other way. Co-occurring speech did not contain any deictic expressions referring to and therefore drawing attention to the gesture (e.g., ‘that way’). 
Moreover, the target gesture did not co-occur with hesitations in speech, with the story punch line or with first mention of a protagonist, as all of these features might have lent extra prominence to a co-occurring gesture. Descriptions of the animated cartoons used to elicit the narratives and the target scenes therein are provided in Appendix 1. Outlines of the spatio-temporal properties of the target gestures across conditions (and all studies) are provided in Appendix 2, and speech co-occurring with target gestures is listed in Appendix 3.
In Study 1, the target gestures consisted of gestures that were either fixated or not by the speaker in the video (speaker-fixation vs. no-speaker-fixation). Location in gesture space and presence/absence of hold were held constant (central space, no hold). There were 4 items in each condition. The mean durations of the target gestures in each condition in Study 1 are summarized in Table 1.
Apparatus
We used a head-mounted SMI iView© eye-tracker, which is a monocular 50 Hz pupil and corneal reflex video imaging system. The eye-tracker records the participant’s eye movements with the corneal reflex camera. The eye-tracker also has a scene-camera on the headband, which records the field of vision. The output data from the eye-tracker consist of a merged video recording showing the addressee’s field of vision (i.e., the speaker on the video), and an overlaid video recording of the addressee’s fixations as a circle overlay. Since the scene-camera moves with the head, the eye-in-head signal indicates the gaze point with respect to the world. Head movements therefore appear on the video as full-field image motion. The fixation marker represents the foveal fixation and covers a visual angle of 2°. The output video data allow us to analyze both gesture and eye movements with a temporal accuracy of 40 ms.
Procedure
Participants were randomly assigned to one of the two conditions: Speaker-fixation (central space, no hold, speaker-fixation) and No-speaker-fixation (central space, no hold, no speaker-fixation). The participants were seated 250 cm from the wall and fitted with the SMI iView© headset. A projector placed immediately behind the subject projected a nine-point matrix calibration screen on the wall of the same size as the subsequent stimulus videos. After calibration, four stimulus video clips were projected against the wall. The speakers appearing in the videos were thus life-sized, and their heads were level with the participants’ heads. Life-sized projections have been shown to yield fixation behavior towards gestures that is similar to behavior in live interaction (Gullberg and Holmqvist 2006). A black screen appeared between each video clip for a duration of 10 s. Participants were instructed to watch the videos carefully to be able to answer questions about them subsequently. The instructions did not mention gestures or the direction of the movements in the story. Participants’ eye movements were recorded as they watched the video clips. After watching all four videos, participants answered questions about the target events of each video by drawing pictures of the protagonists in the story. An example question is “De muis heeft moeite met roeien. Hoe komt hij toch vooruit?” (“The mouse has trouble rowing. How does it make progress?”) (see Appendix 4 for the complete set of questions).
The participants did not know the contents of the questions until they had finished watching all four videos. A drawing task was chosen because it allows directionality to be probed implicitly: The participant must apply a perspective on the event and the protagonist in order to draw them, a perspective which in turn will reveal the direction of the protagonist (see Fig. 1). The drawing task thus avoids the well-known difficulties involved in overt labeling of left-right directionality (e.g., Maki et al. 1979). A post-test-questionnaire ensured that gesture was not identified as the target of study.
Coding
The eye movement data were retrieved from the digitized video output from the eye-tracker. The merged video data of the participants’ gaze positions on the scene image were analyzed frame-by-frame and coded for fixation of target gesture (Yes or No) and for matched reply (Yes or No). A target gesture was coded as fixated if the fixation marker was immobile on the gesture, i.e., moved no more than 1 degree, for a minimum of 120 ms (equal to 3 video frames) (cf. Melcher and Kowler 2001). Note that fixations on gestures were spatially unambiguous. Either a gesture was clearly fixated, or the fixation marker stayed on the speaker’s face (cf. Gullberg and Holmqvist 1999, 2006). A drawing was coded as a matched reply if the direction of the motion in the drawing matched the direction of the target gesture on the video as seen from the addressee’s perspective (see Fig. 1). Only responses that could be coded as matched or non-matched were included in the analysis. When drawings did not depict a lateral direction of any kind, the data point was discarded. Chance performance therefore equals 50%.
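The fixation criterion above (gaze immobile within 1 degree for at least 120 ms, i.e., 3 video frames of 40 ms each) can be sketched as a simple dispersion-based check over per-frame gaze positions. The function and data representation below are illustrative only; they are not the coding tool actually used in the study.

```python
# Illustrative dispersion-based fixation check over 25 fps video frames
# (40 ms per frame). A gesture counts as fixated if the gaze marker stays
# within 1 degree of visual angle for at least 3 consecutive frames (120 ms).
# The per-frame (x, y) gaze representation is a hypothetical simplification.

def is_fixation(gaze_deg, min_frames=3, max_disp=1.0):
    """gaze_deg: list of (x, y) gaze positions in degrees, one per frame.

    Returns True if any run of min_frames consecutive frames stays within
    max_disp degrees in both x and y (a minimal dispersion criterion).
    """
    for start in range(len(gaze_deg) - min_frames + 1):
        window = gaze_deg[start:start + min_frames]
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        if (max(xs) - min(xs)) <= max_disp and (max(ys) - min(ys)) <= max_disp:
            return True
    return False
```

For example, three near-identical gaze samples on the gesture followed by a jump back to the face would count as one fixation, whereas gaze alternating between face and gesture on every frame would not.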
Analysis
The dependent variables were (a) the proportion of trials with fixations on target gestures, and (b) the proportion of matched responses as defined above. We employed non-parametric Mann–Whitney tests to analyze the fixation data because the dependent variable, proportions of trials with fixation on gesture, had a skewed distribution with clustering of data at zero. We analyzed the information uptake data using parametric, independent samples analyses of variance and single sample t-tests. Throughout, the alpha level for statistical significance is p = .05.
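The two test families used here can be sketched in a few lines: a rank-based Mann–Whitney U for the skewed fixation proportions, and a one-sample t statistic against the chance level of .50 for the matched-response proportions. This is a minimal illustration (raw statistics only, no tie correction or p-values), not the actual analysis code.

```python
from math import sqrt

def mann_whitney_u(a, b):
    """U statistic for sample a vs. sample b (two independent samples).

    Counts, for every pair (x, y), how often x > y, with ties counted
    as 0.5. No tie correction or normal approximation is applied.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def one_sample_t(xs, mu=0.5):
    """t statistic testing whether the mean of xs differs from mu
    (here, chance-level direction matching at .50)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return (mean - mu) / sqrt(var / n)
```

In practice one would use a statistics package that also returns exact or approximate p-values; the sketch only makes the logic of the two comparisons explicit.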
Results and Discussion
The proportion of trials in which the addressees fixated gestures was significantly higher in the speaker-fixation condition (M = .08, SD = .12) than in the no-speaker-fixation condition (M = 0, SD = 0), Mann–Whitney, Z = −2.41, p = .016 (see Fig. 2a). The proportion of trials in which the addressees’ drawn direction and the gesture direction matched (an index of information uptake) was higher in the speaker-fixation condition (M = .86, SD = .19) than in the no-speaker-fixation condition (M = .63, SD = .32), F(1, 28) = 5.59, p = .025, η2 = .17 (see Fig. 2b). Furthermore, the proportion of trials in which addressees’ drawings and gestures matched was above chance level (.50) in the speaker-fixation condition, one-sample t-test, t(14) = 7.33, p < .001, but not in the no-speaker-fixation condition, t(14) = 1.61, p = .13.
The results show that speakers’ fixation of their own gestures increases the likelihood of addressees fixating the same gestures. Furthermore, speaker-fixations also increase the likelihood of addressees’ uptake of gestural information, even when that information is of little narrative significance and embedded in other directional information. Overall, the combined fixation and uptake findings suggest that speakers’ gaze at their own gestures constitutes a very powerful attention-directing device for addressees, influencing both their overt visual attention and their uptake.
Study 2: Location in Space and Holds
The second study examines the effect of two physical gestural properties on addressees’ overt visual attention to and uptake of information from gestures, namely gestures’ location in space (central vs. peripheral) and the presence vs. absence of holds.
Methods
Participants
Forty-five new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 2), 41 women and 4 men. They were paid 5 euros for their participation.
Materials
Three new sets of stimulus videos were selected from the aforementioned corpus using the same criteria as previously, targeting narratives containing different target gestures. For Study 2, the target gestures consisted of gestures performed in central vs. peripheral gesture space with presence vs. absence of hold, with four items in each condition. We used McNeill’s (1992) schema to code gesture space. McNeill divides the speaker’s gesture space into central and peripheral gesture space, where central space refers to the space in front of the speaker’s body, delimited by the elbows, the shoulders, and the lower abdomen, and peripheral gesture space is everything outside this area. Although McNeill makes more fine-grained distinctions within central and peripheral space, we collapsed all cases of center-center and center space, and all cases of peripheral space, leaving two broad categories: central and peripheral. To code for holds (the momentary cessation of a gestural movement), we considered post-stroke holds, that is, cessations of movement after the hand has reached the endpoint of a trajectory of a gesture stroke (Kita et al. 1998; Seyfeddinipur 2006). The speakers never fixated the target gestures. The mean durations of the target gestures in each condition are summarized in Table 2. As before, descriptions of the animated cartoons used to elicit narratives and the target scenes are provided in Appendix 1, outlines of the spatio-temporal properties of the target gestures across conditions in Appendix 2, and speech co-occurring with target gestures in Appendix 3.
Apparatus, Procedure, Coding, and Analysis
Participants were randomly assigned to one of the three conditions (15 participants in each condition): central hold, peripheral no hold, peripheral hold. The data from the no-speaker-fixation condition from Study 1 were used as the fourth condition, central no hold, in the analysis. The apparatus, procedure, coding, and analyses were otherwise identical to Study 1.
Results and Discussion
We examined the effect of location and hold on fixations in separate Mann–Whitney tests. The proportion of trials in which the addressees fixated gestures was significantly higher for gestures with hold (M = .11, SD = .16) than for gestures with no hold (M = 0, SD = 0), Mann–Whitney, Z = −3.63, p < .001 (see Fig. 3a). In contrast, there was no significant difference in fixation rate between central gestures (M = .07, SD = .13) and peripheral gestures (M = .04, SD = .12), Z = −.957, p = .339 (see Fig. 3a). The proportion of trials in which the addressees’ drawn directions matched the gesture directions was significantly higher for central gestures (M = .65, SD = .28) than for peripheral gestures (M = .50, SD = .26), F(3, 56) = 4.32, p = .042, η2 = .072 (see Fig. 3b). There was no significant effect of hold, F < 1, and no significant interaction, F < 1. Moreover, the proportion of trials where the drawings and the gestures matched was only above chance in the central hold condition, one-sample t-test, t(14) = 2.54, p = .023.
The results show that, when location in gesture space and holds were teased apart, only holds increased the likelihood of addressees fixating gestures, whereas the location in gesture space where gestures were produced did not influence addressees’ fixations. Moreover, surprisingly, only information conveyed by gestures performed in central, neutral gesture space was taken up and integrated by addressees. However, this result seems to be due to properties of a single item in the central hold condition, viz. the “trashcan” item (cf. Appendix 2). Eighty percent of the participants (12/15) had a matched response on this item. Closer inspection of the stimulus showed that the speaker in this stimulus item had looked at another gesture immediately preceding the target gesture. The item therefore inadvertently became similar to the items in the speaker-fixation condition. When this item was removed from the analysis, uptake for the central hold condition dropped to chance level (M = .59, SD = .32), t(14) = 1.17, p = .262. Therefore, we conclude that location in gesture space and holds do not modulate the likelihood of information uptake from gestures.
Post-hoc Analysis of Fixation Onset Latencies from Studies 1 and 2
To examine whether different gestures are fixated for different reasons, we analyzed the fixation onset latencies for those gestures that drew fixations, that is, gestures with speaker-fixations, and gestures with holds (collapsing central and peripheral hold gestures). We measured the time difference between the onset of the relevant cue (speaker-fixation or gestural hold) and the onset of the addressees’ fixations of the gestures. Fixation onset latencies for gestures with speaker-fixations were significantly longer (M = 800 ms, SD = 400 ms) than onset latencies for gestures with holds (M = 102 ms, SD = 88 ms), Mann–Whitney, Z = −3.14, p = .01.
These differences suggest that addressees’ fixations of gestures are driven by different mechanisms. Onset latencies in the realm of 800 ms indicate that top-down concerns involving higher cognitive mechanisms are driving the fixation behavior. Onset latencies around 100 ms instead suggest that fixations of gestural holds may be bottom-up responses driven by the inner workings of the visual system (cf. Yantis 2000).
Study 3: Artificial Speaker-Fixations
The unexpected effect of an individual stimulus item in Study 2 raises a general concern that the independent variables may have been confounded with other unknown variables, given that the stimulus gestures differed across the conditions. For instance, the target gesture in the “plank” item had a more complex trajectory than the other items, and the gesture in the “pit” item was performed closer to the face than other target gestures (cf. Appendix 2). Although it is a strength of these studies that they draw on ecologically valid stimuli where the target gestures are naturally produced, dynamic gestures embedded in discourse and among other gestures, it is important to ascertain that the fixation and uptake findings were not caused by other factors. To test whether speaker-fixations and holds do account for the fixation and uptake data, we therefore created minimal pairs of the most neutral, baseline test items, the centrally produced gestures with no hold or speaker-fixation, by artificially introducing speaker-fixation (Study 3) and holds (Study 4) on these neutral gestures through video editing.
The third study examines the effect of artificially induced speaker-fixations on addressees’ overt visual attention to and uptake of information from gestures.
Methods
Participants
Fifteen new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 3), 11 women and 4 men. They were paid 5 euros for their participation.
Materials
Four stimulus items from Study 1, characterized as central, no hold, no speaker-fixation, were used to create four new test items. Each of these was digitally manipulated in Adobe® After Effects® to create minimal pairs of gestures with or without an artificial speaker-fixation. A section in the video was identified where the speaker’s eyes seemed to be directed towards her hands. This set of eyes was cut out and pasted over the real eyes starting at the onset of the stroke of the target gesture and maintained for a duration of 7 frames or 480 ms to form an artificial speaker-fixation (see Fig. 4). The speech stream and the synchronization between the auditory and the visual parts of the stimulus videos were not manipulated. This procedure allowed for a speaker-fixation to be imposed on a gesture while keeping the gesture, mouth movements, etc., constant. Although the mean duration of the real speaker-fixations in the original speaker-fixation condition in Study 1 was 980 ms, the artificial speaker-fixations had to be shorter (i.e., 480 ms) for the manipulation to align with the shorter gesture strokes of the original central, no hold, no-speaker-fixated gestures. However, the artificial speaker-fixations were still within the range of the naturally occurring speaker-fixations. The four digitally manipulated items constitute the artificial speaker-fixation condition.
Apparatus, Procedure, Coding, and Analysis
These were identical to Study 1. The data from the artificial speaker-fixation condition were compared to the data from the original no-speaker-fixation condition reported in Study 1, henceforth referred to as the control condition (Fig. 5a, b).
Results and Discussion
There was no significant difference between the proportion of fixated trials in the artificial speaker-fixation condition (M = .03, SD = .09) and the control condition (M = 0, SD = 0), Mann–Whitney, Z = −1.44, p = .15. Furthermore, there was no significant difference in the proportion of trials with uptake in the artificial speaker-fixation condition (M = .71, SD = .31) and the control condition (M = .63, SD = .32), F(1, 28) < 1, p = .536. However, the proportion of trials with uptake was reliably above chance (.50) in the artificial speaker-fixation condition, one-sample t-test, t(14) = 2.58, p = .022, but not in the control condition, t(14) = 1.61, p = .13.
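The comparison against chance reported above is a one-sample t-test of the mean uptake proportion against .50. A minimal sketch of the statistic, computed from the rounded summary values reported in the text (the result differs slightly from the reported t = 2.58, which was computed from unrounded data):

```python
import math

def one_sample_t(mean: float, sd: float, n: int, mu: float = 0.50):
    """One-sample t statistic testing a sample mean against mu; returns (t, df)."""
    se = sd / math.sqrt(n)  # standard error of the mean
    return (mean - mu) / se, n - 1

# Artificial speaker-fixation condition: M = .71, SD = .31, n = 15 participants.
t, df = one_sample_t(0.71, 0.31, 15)
print(round(t, 2), df)  # approx. 2.62 on 14 df (paper reports 2.58 from unrounded data)
```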
Both for fixation and uptake, the differences between the artificial speaker-fixation and control condition went in the same direction as predicted by the results from Study 1, but neither difference reached statistical significance. The comparison against chance nevertheless indicated uptake above chance from gestures in the artificial speaker-fixation, in line with the effect of natural speaker-fixations on uptake found in Study 1.
There are two possible explanations for the weaker fixation results in this study than in Study 1. First, for practical reasons the duration of the artificial speaker-fixations was significantly shorter (480 ms) than the average authentic duration (M = 980 ms, SD = 414 ms), Mann–Whitney, Z = −2.46, p = .014. It is likely that the longer a speaker’s gaze rests on a gesture, the more likely the addressee is to look at it too. A closer inspection of the results from Study 1 indeed revealed a tendency for longer speaker-fixations to yield more addressee-fixations than shorter ones. Second, the duration of the gesture stroke itself may also have played a role. Again, the average stroke duration of the authentic gestures with speaker-fixations was significantly longer (M = 2,410 ms, SD = 437 ms) than that of the control gestures on which we imposed the artificial speaker-fixation (M = 1,310 ms, SD = 305 ms), Mann–Whitney, Z = −2.31, p = .021. However, the influence of stroke duration is debatable because peripheral gestures, which by virtue of their spatial expanse also have longer durations than centrally produced gestures, did not draw fixations. Indirectly, then, these findings suggest that speakers’ fixations of their own gestures increase the likelihood of addressees shifting overt visual attention to gestures, and that this effect is enhanced the longer the speaker’s fixation lasts.
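The duration comparisons in this section use the Mann–Whitney test. A minimal pure-Python sketch of the underlying U statistic, run on illustrative durations (not the actual stimulus values):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for two independent samples (midranks for ties)."""
    combined = sorted(x + y)

    def midrank(v):
        # Rank of v in the combined sample, averaging over tied values.
        less = sum(1 for c in combined if c < v)
        equal = sum(1 for c in combined if c == v)
        return less + (equal + 1) / 2

    n_x, n_y = len(x), len(y)
    rank_sum_x = sum(midrank(v) for v in x)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    return min(u_x, n_x * n_y - u_x)  # report the smaller of the two U values

# Illustrative stroke durations in ms (hypothetical, not the stimulus data):
speaker_fixated = [2410, 2000, 2900, 2300]
control = [1310, 1100, 1500, 1300]
print(mann_whitney_u(speaker_fixated, control))  # 0: complete separation
```

A U of 0 indicates that every value in one sample exceeds every value in the other; in practice one would convert U to the Z score and p value reported in the text via its normal approximation or exact tables.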
Study 4: Artificial Holds
The fourth study examines the effect of artificially induced gestural holds on addressees’ overt visual attention to and uptake of information from gestures.
Methods
Participants
Fifteen new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 2), 11 women and 4 men. They were paid 5 euros for their participation.
Materials
As in Study 3, the four items characterized as central, no hold, no speaker-fixation from Study 1 were digitally manipulated in Adobe® After Effects® to create minimal pairs of gestures with or without an artificial hold. The hand shape of the last frame of the original target gesture stroke was isolated and then pasted and maintained over the original retraction phase of the gesture for 5 frames or 200 ms, using the same procedure as illustrated in Fig. 4. The pasted hand shape was then moved spatially for a number of transition frames to fit onto the original, underlying location of the hand without creating a jerky movement. As before, speech and the synchronization between the auditory and the visual parts of the stimulus videos were not manipulated. The procedure allowed head and lip movements to remain synchronized with speech. Note that the original mean duration of natural holds (central and peripheral) was 575 ms. As in Study 3, a shorter hold duration (i.e., 200 ms), although still within the range of naturally occurring holds, was chosen to avoid too large a spatial discrepancy between the location of the artificially held hand, and the underlying retracted gesture. Such a discrepancy would have made the manipulation impossible to conceal. The four digitally manipulated items constitute the artificial hold condition.
Apparatus, Procedure, Coding, and Analysis
These were identical to Study 1. The data from the artificial hold condition were compared to the data from the original no-speaker-fixation condition reported in Study 1, henceforth referred to as the control condition (Fig. 6a, b).
Results and Discussion
The proportion of fixated trials was significantly higher in the artificial-hold condition (M = .08, SD = .12) than in the control condition (M = 0, SD = 0), Mann–Whitney, Z = −2.41, p = .016. There was no significant difference in uptake between the artificial hold (M = .59, SD = .35) and the control conditions (M = .63, SD = .32), F(1, 28) < 1, p = .75. Moreover, the proportion of matched trials was at chance both in the artificial hold condition, one-sample t-test, t(14) = 1.03, p = .319, and in the control condition, t(14) = 1.61, p = .13.
To summarize, both the fixation and the uptake findings from Study 2 were replicated. Holds made addressees more likely to fixate speakers’ gestures, but they did not seem to contribute to uptake of gestural information.
Post-hoc Analysis of Fixation Onset Latencies from Studies 3 and 4
As in Studies 1 and 2, we measured the time difference between the onset of the relevant cue (artificial speaker-fixation or artificial hold) and the onset of the addressees’ fixations of the gestures. Fixation onset latencies for artificial speaker-fixations were generally longer (M = 100 ms, SD = 85 ms) than onset latencies for gestures with artificial holds (M = 40 ms, SD = 0 ms), although there were too few data points to undertake a statistical analysis. These differences in fixation onset latencies nevertheless display the same trends as for natural speaker-fixations and holds.
Post-hoc Analyses of the Relationship Between Addressees’ Fixations and Uptake
One of the research questions concerned the relationship between fixations and uptake of gestural information. To address this issue, we examined whether information uptake differed between fixated versus non-fixated gestures.
All trials from Studies 1 through 4 were combined for this analysis to compare the likelihood of uptake in a within-subject comparison for those 20 participants who had codable trials with and without addressee-fixation (n = 15 from the hold conditions, n = 5 from the speaker-fixation condition). The proportion of matched responses was not significantly different between trials with addressee-fixation (M = .70, SD = .47) and without addressee-fixation (M = .62, SD = .42), F(1, 19) < 1, p = .576.
When the data were broken down according to the two cue types (speaker-fixation and holds), the proportion of matched responses in the two types of trials were still not significantly different from each other: uptake from speaker-fixated trials with addressee-fixation (M = .60, SD = .55) did not differ from speaker-fixated trials without addressee-fixations (M = .40, SD = .55), F(1, 4) < 1, p = .621. Similarly, uptake from hold-trials with addressee-fixation (M = .73, SD = .46) did not significantly differ from hold-trials without addressee-fixations (M = .69, SD = .36), F(1, 14) < 1, p = .783. Thus, there is little evidence that addressees’ fixations of gestures are associated with uptake of the gestural information.
General Discussion
This study investigated what factors influence addressees’ overt visual attention to (direct fixation of) gestures and their uptake of gestural information, focusing on one social factor, namely speakers’ gaze at their own gestures, and two physical properties of gestures, namely their place in gesture space and the effect of gestural holds. We also examined the relationship between addressees’ fixations of gesture and their uptake of gestural information. We explored these issues drawing on examples of natural gestures expressing directional information left or right, embedded in narratives.
The results concerning fixations of gestures can be summarized in four points. First, in line with previous studies (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000), addressees looked directly at very few gestures. Second, they were more likely to fixate gestures which speakers themselves had first fixated (speaker-fixation) than gestures speakers had not looked at. This tendency also held for gestures with artificially introduced speaker-fixations, although it did not reach statistical significance there. Addressees were likewise more likely to fixate gestures with a post-stroke hold than gestures without, for both natural and artificial holds. Third, contrary to expectation, the location of gestures in gesture space (central vs. peripheral) did not affect addressees’ tendency to fixate gestures. Fourth, the onset latency of fixations differed across gesture types: fixations of gestures with post-stroke holds had shorter onset latencies than those of speaker-fixated gestures, suggesting that addressees look at different gestures for different reasons, holds being fixated for bottom-up reasons and speaker-fixated gestures for top-down reasons.
There were three main findings concerning uptake of gestural information. First, addressees did not generally process and retain directional gestural information uniformly in all situations. Second, addressees were more likely to retain the directional information in gesture when speakers themselves had first fixated the gesture than when they had not. Third, there was no evidence that the presence or absence of post-stroke holds or the location in gesture space affected information uptake when an item with inadvertent speaker-fixation on a previous gesture was removed.
Finally, regarding the relationship between addressees’ fixations and their information uptake, a post-hoc analysis based on the pooled data from all the studies showed no evidence that addressees’ information uptake from gestures was associated with their fixations of gestures.
In previous studies of fixation behavior towards gestures (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000), the three factors investigated here have been conflated. The current study demonstrates the individual contributions of two of these factors: the social factor, speaker-fixation, and one of the physical factors, namely post-stroke holds. It also shows that the other physical property, location in gesture space, does not matter. Moreover, the data suggest that addressees fixate different gestures for different reasons. The effect of speaker-fixations on addressees’ gaze behavior is compatible with suggestions that humans automatically orient to the target of an interlocutor’s gaze (e.g., Driver et al. 1999). Note, however, that speaker-fixations led to overt gaze-following or addressee-fixations only 8% of the time (Study 1; this rate is similar to that reported in Gullberg and Holmqvist 2006). This suggests that overt gaze-following is not an automatic process but rather a socially mediated one, where the social norm of maintaining mutual gaze is the default, and overt gaze-following to a gesture signals social alignment (Gullberg and Holmqvist 2006). The longer onset latencies of addressee-fixations following speaker-fixations support this notion, as longer onset latencies are likely to reflect top-down processes such as social alignment. In contrast, addressees’ tendency to fixate gestures with holds may result from holds constituting a sudden change in the visual field, or from holds challenging peripheral vision, which is best at motion detection: with no motion to detect, an addressee needs to shift gaze and fixate the gesture in order to extract any information at all. Both accounts assume that fixations to holds are driven by low-level, bottom-up processes. The fixation onset latency data support this account, as the very short fixation onset latencies to gestural holds suggest a stimulus-driven response by the visual system.
The uptake results strongly suggest that not all gestural information is uniformly processed and integrated. That is, it is not the case that addressees cannot help but integrate gesture information (e.g., Cassell et al. 1999). The findings indicate that directional gesture information is not well integrated in the absence of any further highlighting, which is in line with Beattie and Shovelton’s (1999a, b) results showing that directional gesture information is less well retained than information about size and location. However, the social factor (speaker-fixation) modulated uptake of such information: addressees retained gestural information about direction when speakers had looked at gestures first. The physical properties of gestures played no role for uptake.
The comparison of fixation behavior and uptake showed that uptake from gestures was greatest in the condition where gestures were first fixated by the speaker (86%), although addressees only fixated these gestures 8% of the time (Study 1). Addressees’ attention to gestures was therefore mostly covert. It seems that addressees’ uptake of gestural information may be independent of whether they fixate the target gesture or not, provided that speakers have highlighted the gesture with their gaze first. Although this finding must be consolidated in further studies, it suggests that although overt gaze-following is not automatic, covert attention shifts to the target of a speaker’s gaze may well be, allowing fine-grained information extraction in human interaction.
An important implication of these findings for face-to-face communication is that addressees’ gaze is multifunctional and not necessarily a reliable index of attention locus, information uptake, or comprehension. Addressees clearly look at different things for different reasons, and one cannot assume that overt visual attention to something, like a gesture with a post-stroke hold, necessarily implies that the target is processed for information. This is primarily a caveat to studies of face-to-face interaction, where a mono-functional view of gaze is often in evidence. In interaction, addressees will typically maintain their gaze on the speaker’s face as a default. Addressees’ overt gaze shifts may be acts of social alignment, showing speakers that they are attending to their focus of attention (e.g., their gestures), rather than acts of information seeking, which is often possible through peripheral vision. Conversely, the fact that addressees’ attention to gestures is not uniform means that speakers can manipulate it, strategically highlighting gestures as a relevant channel of information in various ways. For instance, speakers can use spoken deictic expressions such as ‘like this’ to draw direct attention to gestures, or use their own gaze (speaker-fixation) to do the same thing visually. Other possibilities include distributing information across the modalities in complementary fashion, such as saying ‘this big’ and indicating size in gesture (also an example of a deictic expression).
This study has raised a number of further issues to explore. An important question is what other factors might affect addressees’ attention to gestures. Other physical properties of gestures are likely candidates, such as their size and duration, or the difference between simple and complex movement trajectories. A social factor that is likely to play a role is the knowledge shared by participants, also known as common ground (e.g., Clark and Brennan 1991; Clark et al. 1983). The more common ground interlocutors share, the more reduced gestures tend to be in form and the less likely information is to be expressed in gesture at all (e.g., Gerwing and Bavelas 2004; Holler and Stevens 2007; Holler and Wilkin 2009). This raises the possibility that attention to gestures is modulated by discourse factors, with heightened attention to gestures when information is new and first introduced, and reduced attention as information becomes given. Another discourse effect concerns the relevance of information. The information probed in this study was deliberately chosen to be unimportant to the gist of the narratives. It is important to test whether these findings generalize to discursively vital information.
To conclude, this study has taken a first step towards a more fine-grained understanding of how and when addressees take gestural information into account and of the factors that govern attention allocation—both overt and covert—to such gestural information.
Notes
There is no evidence that addressees reversed the directions in the drawings in order to represent the direction as expressed from the speaker’s viewpoint. Had addressees been reversing the viewpoints, we would have expected within-subject consistency of such reversals. There is no such consistency in the data, however.
References
Argyle, M., & Cook, M. (1976). Gaze and mutual gaze. Cambridge: Cambridge University Press.
Argyle, M., & Graham, J. A. (1976). The Central Europe experiment: Looking at persons and looking at things. Journal of Environmental Psychology and Nonverbal Behavior, 1, 6–16.
Bavelas, J. B., Coates, L., & Johnson, T. (2002). Listener responses as a collaborative process: The role of gaze. Journal of Communication, 52, 566–580.
Beattie, G., & Shovelton, H. (1999a). Do iconic hand gestures really contribute anything to the semantic information conveyed by speech? Semiotica, 123, 1–30.
Beattie, G., & Shovelton, H. (1999b). Mapping the range of information contained in the iconic hand gestures that accompany spontaneous speech. Journal of Language and Social Psychology, 18, 438–462.
Beattie, G., & Shovelton, H. (2005). Why the spontaneous images created by the hands during talk can help make TV advertisements more effective. British Journal of Psychology, 96, 21–37.
Bruce, V., & Green, P. (1985). Visual perception. Physiology, psychology and ecology (2nd ed.). Hillsdale, NJ: Erlbaum.
Cassell, J., McNeill, D., & McCullough, K.-E. (1999). Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics & Cognition, 7, 1–33.
Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
Clark, H. H., & Brennan, S. A. (1991). Grounding in communication. In L. B. Resnick, J. M. Levine, & S. D. Teasley (Eds.), Perspectives on socially shared cognition. Washington: APA Books.
Clark, H. H., Schreuder, R., & Buttrick, S. (1983). Common ground and the understanding of demonstrative reference. Journal of Verbal Learning and Verbal Behavior, 22, 245–258.
Driver, J., Davis, G., Ricciardelli, P., Kidd, P., Maxwell, E., & Baron-Cohen, S. (1999). Gaze perception triggers reflexive visuospatial orienting. Visual Cognition, 6, 509–540.
Duncan, S. J. (1973). Toward a grammar for dyadic conversation. Semiotica, 9, 29–47.
Fehr, B. J., & Exline, R. V. (1987). Social visual interaction: A conceptual and literature review. In A. W. Siegman & S. Feldstein (Eds.), Nonverbal behavior and communication (pp. 225–326). Hillsdale, NJ: Erlbaum.
Fornel, M. (1992). The return gesture: Some remarks on context, inference, and iconic gesture. In P. Auer & A. di Luzio (Eds.), The contextualization of language (pp. 159–176). Amsterdam: Benjamins.
Gerwing, J., & Bavelas, J. B. (2004). Linguistic influences on gesture’s form. Gesture, 4, 157–195.
Gibson, J. J., & Pick, A. D. (1963). Perception of another person’s looking behavior. American Journal of Psychology, 76, 386–394.
Goodwin, C. (1981). Conversational organization: Interaction between speakers and hearers. New York: Academic Press.
Gullberg, M. (1998). Gesture as a communication strategy in second language discourse. A study of learners of French and Swedish. Lund: Lund University Press.
Gullberg, M., & Holmqvist, K. (1999). Keeping an eye on gestures: Visual perception of gestures in face-to-face communication. Pragmatics & Cognition, 7, 35–63.
Gullberg, M., & Holmqvist, K. (2006). What speakers do and what listeners look at. Visual attention to gestures in human interaction live and on video. Pragmatics & Cognition, 14, 53–82.
Heath, C. (1986). Body movement and speech in medical interaction. Cambridge: Cambridge University Press.
Holler, J., & Beattie, G. (2003). How iconic gestures and speech interact in the representation of meaning: Are both aspects really integral to the process? Semiotica, 146, 81–116.
Holler, J., & Stevens, R. (2007). The effect of common ground on how speakers use gesture and speech to represent size information. Journal of Language and Social Psychology, 26, 4–27.
Holler, J., & Wilkin, K. (2009). Communicating common ground: How mutually shared knowledge influences speech and gesture in a narrative task. Language and Cognitive Processes, 24, 267–289.
Kelly, S. D., Barr, D. J., Breckinridge Church, R., & Lynch, K. (1999). Offering a hand to pragmatic understanding: The role of speech and gesture in comprehension and memory. Journal of Memory and Language, 40, 577–592.
Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In M. R. Key (Ed.), The relationship of verbal and nonverbal communication (pp. 207–227). The Hague: Mouton.
Kendon, A. (1990). Conducting interaction. Cambridge: Cambridge University Press.
Kendon, A. (2004). Gesture. Visible action as utterance. Cambridge: Cambridge University Press.
Kita, S. (1996). Listeners’ up-take of gestural information. MPI Annual Report, 1996, 78.
Kita, S., Van Gijn, I., & Van der Hulst, H. (1998). Movement phases in signs and co-speech gestures, and their transcription by human coders. In I. Wachsmuth & M. Fröhlich (Eds.), Gesture and sign language in human-computer interaction (pp. 23–35). Berlin: Springer.
Kleinke, C. L. (1986). Gaze and eye contact: A research review. Psychological Bulletin, 100, 78–100.
Langton, S. R. H., & Bruce, V. (2000). You must see the point: Automatic processing of cues to the direction of social attention. Journal of Experimental Psychology: Human Perception and Performance, 26, 747–757.
Langton, S. R. H., O’Malley, C., & Bruce, V. (1996). Actions speak no louder than words: Symmetrical cross-modal interference effects in the processing of verbal and gestural information. Journal of Experimental Psychology: Human Perception and Performance, 22, 1357–1375.
Langton, S. R. H., Watt, R. J., & Bruce, V. (2000). Do the eyes have it? Cues to the direction of social attention. Trends in Cognitive Sciences, 4, 50–59.
Latham, K., & Whitaker, D. (1996). A comparison of word recognition and reading performance in foveal and peripheral vision. Vision Research, 37, 2665–2674.
Maki, R., Grandy, C. A., & Hauge, G. (1979). Why is telling right from left more difficult than telling above from below? Journal of Experimental Psychology: Human Perception and Performance, 5, 52–67.
McNeill, D. (1992). Hand and mind. What the hands reveal about thought. Chicago: University of Chicago Press.
McNeill, D., Cassell, J., & McCullough, K.-E. (1994). Communicative effects of speech mismatched gestures. Research on Language and Social Interaction, 27, 223–237.
Melcher, D., & Kowler, E. (2001). Visual scene memory and the guidance of saccadic eye movements. Vision Research, 41, 3597–3611.
Melinger, A., & Levelt, W. J. M. (2004). Gesture and the communicative intention of the speaker. Gesture, 4, 119–141.
Moore, C., & Dunham, P. J. (Eds.). (1995). Joint attention. Hillsdale, NJ: Erlbaum.
Nobe, S., Hayamizu, S., Hasegawa, O., & Takahashi, H. (1998). Are listeners paying attention to the hand gestures of an anthropomorphic agent? An evaluation using a gaze tracking method. In I. Wachsmuth & M. Fröhlich (Eds.), Gesture and sign language in human-computer interaction (pp. 49–59). Berlin: Springer.
Nobe, S., Hayamizu, S., Hasegawa, O., & Takahashi, H. (2000). Hand gestures of an anthropomorphic agent: Listeners’ eye fixation and comprehension. Cognitive Studies. Bulletin of the Japanese Cognitive Science Society, 7, 86–92.
Özyürek, A., Willems, R. M., Kita, S., & Hagoort, P. (2007). On-line integration of semantic information from speech and gesture: Insights from event-related brain potentials. Journal of Cognitive Neuroscience, 19, 605–616.
Rogers, W. T. (1978). The contribution of kinesic illustrators toward the comprehension of verbal behavior within utterances. Human Communication Research, 5, 54–62.
Rossano, F., Brown, P., & Levinson, S. C. (2009). Gaze, questioning and culture. In J. Sidnell (Ed.), Conversation analysis: Comparative perspectives (pp. 187–249). Cambridge: Cambridge University Press.
Seyfeddinipur, M. (2006). Disfluency: Interrupting speech and gesture. Unpublished doctoral dissertation, Radboud University, Nijmegen.
Streeck, J. (1993). Gesture as communication I: Its coordination with gaze and speech. Communication Monographs, 60, 275–299.
Streeck, J. (1994). Gesture as communication II: The audience as co-author. Research on Language and Social Interaction, 27, 239–267.
Tomasello, M. (1999). The cultural origins of human cognition. Cambridge, MA: Harvard University Press.
Tomasello, M., & Todd, J. (1983). Joint attention and lexical acquisition style. First Language, 4, 197–211.
Watson, O. M. (1970). Proxemic behavior: A cross-cultural study. The Hague: Mouton.
Wu, Y. C., & Coulson, S. (2005). Meaningful gestures: Electrophysiological indices of iconic gesture comprehension. Psychophysiology, 42, 654–667.
Yantis, S. (1998). Control of visual attention. In H. Pashler (Ed.), Attention (pp. 223–256). Hove: Psychology Press Ltd.
Yantis, S. (2000). Goal-directed and stimulus-driven determinants of attentional control. In S. Monsell & J. Driver (Eds.), Attention and performance XVIII (pp. 73–103). Cambridge, MA: MIT Press.
Acknowledgments
We gratefully acknowledge financial and technical support from the Max Planck Institute for Psycholinguistics. We also thank Wilma Jongejan for help with the video manipulations, and Martha Alibali, Kenneth Holmqvist, and members of the Max Planck Institute’s Gesture project for useful discussions.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Appendices
Appendix 1: Scenes Described in the Stimulus Items
Scene 1: Wave
A mouse in a rowing boat at sea tries to row, but a wave is preventing it from making any progress. The mouse makes two holes in the bottom of the boat and sticks its feet through these holes. It moves by walking on the bottom of the sea.
Scene 2: Garbage
A cat is in one building, a bird in another. The cat looks at the bird through binoculars, then runs down out of the building, crosses the street and runs into the bird’s building. The cat is thrown out of the building and lands on a pile of garbage.
Scene 3: Plank
There is a large log with a plank on top of it, forming a springboard. A cat stands on one end of the plank and throws a weight onto the other end. The cat is launched upward, reaching a bird at the top of the building. The cat catches the bird and comes down landing on the plank again. The weight shoots up, and as the cat is running away, the weight drops on his head.
Scene 4: Trashcan
A mouse is eating a banana and throws the skin away in the trashcan. The skin comes out again and lands on the mouse’s face. He throws it in the trashcan again, but the same thing happens again. The mouse looks in the trashcan, throws the banana skin in again, and turns the trashcan upside down. As the mouse walks away the trashcan follows him. The feet of an elephant are sticking out from under the trashcan.
Scene 5: Bowling ball
A cat is pacing outside a building spying on a bird in one of the top windows. The cat climbs up inside a drainpipe to catch the bird. When the bird sees the cat, he throws a bowling ball into the drainpipe. The cat swallows the bowling ball, and then comes shooting out of the drainpipe, rolling onto the street.
Scene 6: Pole
A mouse jumps over a bar using a long pole (pole vault). He lands on his face and his body moves up and down like a spring.
Scene 7: Pit
A mouse is walking towards the edge of a large pit. He tries to jump across, but fails and falls into the pit.
Scene 8: Carpet
A mouse and an elephant are walking along on a carpet. The elephant stumbles over some folds in the carpet. The mouse shows him how to flatten the folds by stamping on them but the elephant cannot do it. The mouse ‘winds up’ the elephant by turning his trunk, and then the elephant stamps on the carpet to flatten the folds.
Appendix 2
The Four Target Gestures in Each Condition (Speakers’ Outlines Superimposed on Each Other)
Each dot on the spatial trajectory represents a video frame, i.e., 40 ms. The spatial distance between dots therefore also indicates speed of the gestural movement. The labels “plank”, etc., refer to the scenes described (cf. Appendix 1). Note that the gestures in the “Central, no hold” condition were also used as stimuli for the No-speaker-fixation condition in Study 1 and for both experimental and control conditions in Studies 3 and 4 (except that artificial holds were digitally introduced after the gesture stroke in the stimuli for the experimental condition in Study 4).
Appendix 3
Spoken Descriptions Co-Occurring with the Target Gestures in Each Condition
1. Speaker-fixation (Study 1)

(a) carpet: [de] olifant loopt nog voor hem uit en strijkt alle [kreukels] ‘[the] elephant still walks ahead of him and irons out all [folds]’

(b) bowling ball: volgens komt Sylvester aan de onderkant ‘then Sylvester comes [out] at the bottom’

(c) plank: en dan loopt die gewoon verder ‘and then he just walks on’

(d) pit: komt die muis aanlopen ‘the mouse comes walking along’

2. No-speaker-fixation (Study 1), Central, No Hold (Study 2), and both experimental and control conditions (Studies 3 and 4)

(a) plank: en dan rent die weg ‘and then he runs away’

(b) wave: [een] hele hoge golve ‘[a] very high wave’

(c) trashcan: en een beetje zo vooruit loopt ‘and walks a bit ahead like that’

(d) garbage: en sprint die het gebouw in ‘and he runs into the building’

3. Central, plus Hold (Study 2)

(a) bowling ball: [Sylvester die wordt] uitgeschoten die terras beneden ‘[Sylvester is] shot out down onto the terrace’

(b) pole: [en dan] valt die vlak voorover ‘[and then] he falls down straight ahead’

(c) trashcan: en dan loopt die mand achter hem aan ‘and then the trashcan follows him’

(d) pit: en hij springt ‘and he jumps’

4. Peripheral, no Hold (Study 2)

(a) trashcan: die loopt natuurlijk met hem mee ‘it walks with him of course’

(b) bowling ball: en dan rolt die maar door ‘and then he just keeps rolling’

(c) carpet: en dan loopt die voor de muis ‘and then he walks ahead of the mouse’

(d) wave: en dan loopt die zo verder ‘and then he walks on like that’

5. Peripheral, plus Hold (Study 2)

(a) plank: loopt die weg ‘he walks away’

(b) wave: en dan [krijgt hij het dat water over zich heen] ‘and then [he gets the water all over himself]’

(c) trashcan: en dan loopt die weg ‘and then he walks away’

(d) bowling ball: wordt die richting een bowling centrum [gestuurt] ‘he is [sent] in the direction of a bowling center’
Appendix 4
Drawing Task Questions for Each Clip in Each Condition
The original Dutch question in italics is followed by a translation into English.
1. Speaker-fixation (Study 1)
(a) Wat doet de olifant nadat de muis hem heeft 'opgepompt'? What does the elephant do after the mouse has 'pumped him up'?
(b) Wat gebeurt er met de kat nadat hij de bowlingbal heeft ingeslikt? What happens to the cat after it swallows the bowling ball?
(c) Wat is de muis aan het doen voordat hij in de kuil valt? What is the mouse doing before it falls into the pit?
(d) De kat lanceert zichzelf omhoog met een springplank. Hij landt weer op de plank met de vogel in zijn hand. Wat gebeurt er voordat hij geraakt wordt door het gewicht? The cat launches itself using a springboard. It lands on the board with the bird in its hand. What happens before the cat is hit by the weight?
2. No-speaker-fixation (Study 1), Central, No Hold (Study 2), and both experimental and control conditions (Studies 3 and 4)
(a) De muis heeft moeite met roeien. Waarom? The mouse has trouble rowing. Why?
(b) De kat ziet de vogel in het andere gebouw door zijn verrekijker. Wat doet hij daarna? The cat sees the bird in the other building through his binoculars. What does the cat do next?
(c) De kat lanceert zichzelf omhoog met een springplank. Hij landt weer op de plank met de vogel in zijn hand. Wat gebeurt er voordat hij geraakt wordt door het gewicht? The cat launches itself using a springboard. It lands on the board with the bird in its hand. What happens before the cat is hit by the weight?
(d) De muis gooit een bananenschil in de prullenmand, zet 'm op z'n kop en loopt weg. Wat gebeurt er dan met de prullenmand? The mouse throws a banana skin in the trashcan, turns it upside down, and walks away. What happens to the trashcan next?
3. Central, plus Hold (Study 2)
(a) De muis gebruikte een lange stok om mee te springen. Hoe sprong hij? The mouse used a long pole to jump with. How did it jump?
(b) De muis zit op de bodem van een ravijn. Hoe kwam hij daar terecht? The mouse sits at the bottom of the pit. How did it get there?
(c) De kat slikt een bowlingbal in en hij valt naar beneden in de regenpijp. Wat gebeurt er met hem wanneer hij eruit komt? The cat swallows a bowling ball and falls down inside the drainpipe. What happens to the cat after it comes out?
(d) De muis gooit een bananenschil in de prullenmand, zet 'm op z'n kop en loopt weg. Wat gebeurt er dan met de prullenmand? The mouse throws a banana skin in the trashcan, turns it upside down, and walks away. What happens to the trashcan next?
4. Peripheral, no Hold (Study 2)
(a) De muis heeft moeite met roeien. Hoe komt hij toch vooruit? The mouse has trouble rowing. How does it still make progress?
(b) De muis gooit een bananenschil in de prullenmand, zet 'm op z'n kop en loopt weg. Wat gebeurt er dan met de prullenmand? The mouse throws a banana skin in the trashcan, turns it upside down, and walks away. What happens to the trashcan next?
(c) De kat slikt een bowlingbal in en rolt de regenpijp uit. Wat gebeurt er daarna met hem op de straat? The cat swallows a bowling ball and rolls out of the drainpipe. What happens to the cat next on the street?
(d) De muis kan niet zo goed over het tapijt lopen. Wat gebeurt er telkens met hem? The mouse has some trouble walking over the carpet. What keeps happening to it?
5. Peripheral, plus Hold (Study 2)
(a) De kat lanceert zichzelf omhoog met een springplank. Hij landt weer op de plank met de vogel in zijn hand. Wat gebeurt er voordat de kat geraakt wordt? The cat launches itself using a springboard. It lands on the board with the bird in its hand. What happens before the cat is hit?
(b) De kat slikt een bowlingbal in en rolt de regenpijp uit. Wat gebeurt er daarna met hem? The cat swallows a bowling ball and rolls out of the drainpipe. What happens to the cat next?
(c) De muis gooit een bananenschil in de prullenmand, zet 'm op z'n kop en loopt weg. Wat gebeurt er dan met de prullenmand? The mouse throws a banana skin in the trashcan, turns it upside down, and walks away. What happens to the trashcan next?
(d) De muis heeft moeite met roeien. Wat deed de golf met zijn bootje? The mouse has trouble rowing. What did the wave do to its boat?
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Cite this article
Gullberg, M., Kita, S. Attention to Speech-Accompanying Gestures: Eye Movements and Information Uptake. J Nonverbal Behav 33, 251–277 (2009). https://doi.org/10.1007/s10919-009-0073-2