
Lang Speech. Author manuscript; available in PMC 2009 Nov 4.


Abstract

We report four experiments designed to determine whether visual information affects judgments of acoustically-specified nonspeech events as well as speech events [the “McGurk effect”]. Previous findings have shown only weak McGurk effects for nonspeech stimuli, whereas strong effects are found for consonants. We used click sounds that serve as consonants in some African languages, but that are perceived as nonspeech by American English listeners. We found a significant McGurk effect for clicks presented in isolation that was much smaller than that found for stop-consonant-vowel syllables. In subsequent experiments, we found strong McGurk effects, comparable to those found for English syllables, for click-vowel syllables, and weak effects, comparable to those found for isolated clicks, for excised release bursts of stop consonants presented in isolation. We interpret these findings as evidence that the potential contributions of speech-specific processes to the McGurk effect are limited, and discuss the results in relation to current explanations for the McGurk effect.

Keywords: audiovisual speech perception, clicks, McGurk effect, nonspeech

1 Introduction

In 1976, McGurk and MacDonald reported an effect on speech perception of dubbing an acoustic signal corresponding to one consonant-vowel sequence [e.g., /ba/-/ba/] onto a videotaped event in which a speaker mouthed a different sequence [e.g., /ga/-/ga/]. Listeners reported hearing a consonant reflecting the integration of phonetic information acquired optically and acoustically [e.g., /da/-/da/]. Subsequent research [e.g., MacDonald & McGurk, 1978] demonstrated that certain dubbings will result in perception of the optically-specified consonant [e.g., /da/ when auditory /ba/ is dubbed with visual /da/] or a consonant that incorporates features from each modality [e.g., /na/, an alveolar nasal, when an auditory bilabial nasal /ma/ is dubbed with a visual alveolar plosive /da/]. This phenomenon, which we will refer to as the “McGurk effect,” following popular practice [e.g., Green, Kuhl, Meltzoff, & Stevens, 1991; Rosenblum & Saldaña, 1992], has excited considerable interest and has stimulated much research. A reason for this interest in the McGurk effect, aside from its phenomenal vividness, is its demonstration of cross-modal contributions to a unified perceptual experience. In particular, it provides compelling evidence that speech perception is not only an auditory process, but involves the extraction of a phonetic message across modalities.

The McGurk effect has been reported for a variety of audio-visually incongruent speech stimuli and is often phenomenally quite compelling [Manuel, Repp, Liberman, & Studdert-Kennedy, 1983]. However, it varies in magnitude and even fails to occur in some dubbings. For example, an auditory syllable /da/ presented with a visual /ga/ is perceived as /da/ despite the audiovisual incongruency [MacDonald & McGurk, 1978], presumably because the visual distinction between /da/ and /ga/ is subtle. McGurk effects for vowels, although reliable, are small [Massaro & Cohen, 1993; Summerfield & McGrath, 1984]. Such findings indicate that there are constraints on the range of audiovisual dubbings for which there is a visual influence on perception.

To date, the limiting conditions for the McGurk effect have not been clearly established. One question regarding the scope of the McGurk effect concerns whether it is restricted to phonologically relevant stimuli [e.g., speech sounds], or whether it might occur for appropriately constructed nonspeech stimuli. This question highlights a long-running debate among speech researchers, namely whether speech perception is accomplished by specialized brain mechanisms that are solely devoted to the task of perceiving speech, or whether it is accomplished by means of general perceptual systems.

The most prominent theory associated with the specialized speech mechanism view is the motor theory [Liberman & Mattingly, 1985]. According to motor theorists, speech perception is achieved by a specialized phonetic “module” [see Fodor, 1983] that retrieves gestures of the vocal tract, whereas nonspeech acoustic inputs undergo general auditory system analysis. By this view, the McGurk effect is a consequence of processing gestural information in both the acoustic and optical signals by the phonetic module. Thus, McGurk effects should be limited to stimuli that engage the phonetic module or any other brain module that makes use both of optical and of acoustic information of a specific type [such as, perhaps, one responsible for sound localization, a domain in which visual effects are well-known; Driver, 1996; Radeau, 1994]. Possibly, McGurk-like effects would occur for other stimuli as well by means of some higher-level cognitive process, but these effects would “not portray the striking power of the McGurk effect” [Saldaña & Rosenblum, 1993, p. 407].

In contrast, other theories do not maintain a distinction between the perception of speech and nonspeech stimuli. For example, proponents of the direct-realist theory of speech perception [Best, 1984, 1995; Fowler, 1986, 1994; Rosenblum, 1987], while agreeing with motor theorists that the objects of speech perception are gestures, propose that perception of speech is fundamentally like perception of other events: both nonspeech and speech events create structure in acoustic and optical [and haptic; Fowler & Dekle, 1991] signals that specify their source, and perceptual systems detect this structure as information about the event [Gibson, 1966; 1979]. According to this view, the McGurk effect occurs because information in the dubbed audio-visual signal corresponds to an event that contrasts with the event specified by the acoustic signal alone, regardless of whether the stimuli are speech. To date, proponents of this view have not explored what the critical factors for a strong McGurk effect might be, but from a direct-realist perspective, the critical factors should involve aspects of the underlying kinematic properties of the dubbed events and how they are specified by the acoustic and optical signals [see Rosenblum & Saldaña, 1996, for a discussion of the role of kinematic primitives in audio-visual integration].

Another theory that rejects the view that the McGurk effect is specific to speech is the Fuzzy Logical Model of Perception [or FLMP, Massaro, 1987, 1998]. According to the FLMP, categorization in both speech and nonspeech involves extracting features from input signals and using them to evaluate stored prototypes. The McGurk effect arises for audio-visually incongruent stimuli because of the contributions of visual features in the selection of a prototype. By this view, the McGurk effect can occur for both speech and nonspeech stimuli, provided that the relevant prototypes include both auditory and visual features. However, because prototype descriptions develop from experience, the effect will occur only if there are sufficient built-up associations of auditory and visual features [see also Diehl & Kluender, 1989; but see Fowler & Dekle, 1991, for findings challenging this assumption]. Additionally, according to the theory, the magnitude of the McGurk effect will be related to the relative ambiguity of the auditory and visual cues.

To date, few studies have offered evidence for a strong nonspeech McGurk effect involving visual influences on auditory event identification [excluding visual influences on auditory location; cf. Rosenblum, 1994]. Rosenblum and Fowler [1991] presented a model clapping his hands at different levels of visible effort and auditory loudness, with the auditory and visual levels of effort cross-dubbed. When listeners were instructed to rate the loudness of the claps, based only on what they heard, the effects of video effort on loudness judgments were small and reached significance at only one of the four levels of auditory effort. Saldaña and Rosenblum [1993] tested for a McGurk effect using dubbed cello bow and pluck sounds: they created a continuum of cello sounds ranging from a bow sound to a pluck, which were dubbed onto video presentations of a person bowing or plucking a cello string. There were significant effects of the video display on ratings of the sounds along a pluck-bow continuum, but the effect lacked the phenomenal vividness of the McGurk effect for speech; notably, the dubbed videos failed to turn a clear pluck sound into a bow sound or vice versa, unlike the effect with dubbed consonants. More recently, de Gelder and Vroomen [2000; see also Massaro & Egan, 1996] found a visual influence of static faces with happy and fearful expressions on judgments of vocally expressed emotion, but as with the findings for hand-claps and for plucks and bows, this effect was weaker than and qualitatively different from the effect found for dubbed consonants. Thus, the visual influences on auditory judgments of nonspeech events observed to date have been considerably weaker than the effect found for consonants.

These weak McGurk effects for nonspeech events appear to be consistent with the view that phonological significance is required for a strong McGurk effect. However, this conclusion is premature, because these stimuli differed from consonants in ways other than the speech/nonspeech distinction. For example, as pointed out by Saldaña and Rosenblum [1993], the McGurk effect might be stronger for categorically perceived stimuli [such as stop consonants] than for continuously perceived stimuli [such as vowels and plucks and bows]. The effect might also be stronger for stimuli with less robust auditory cues, as predicted by the FLMP [Massaro, 1987]; auditory cues for stop consonants are arguably more transient than the ones that distinguish, for example, plucks from bows. Stop consonants also differ along very different acoustic dimensions from the nonspeech stimuli that have been examined thus far: acoustically, stop consonants differ primarily in frequency at their onsets [e.g., Fant, 1973]; in contrast, Saldaña and Rosenblum’s plucks and bows and Rosenblum and Fowler’s loud and soft hand-claps differed from one another only in amplitude, and sentences varying in emotion differ primarily in F0. It is possible that these distinctions in physical [or psychophysical] properties, rather than the presence versus absence of phonological significance, underlie the variation in the magnitude of the McGurk effect between stop consonants and the nonspeech stimuli tested to date. In other words, perhaps a strong McGurk effect occurs for stimuli that have certain physical properties, and stop consonants happen to possess those properties.

The purpose of the present research was to investigate the importance of phonological status in relation to nonlinguistic stimulus properties in fostering a strong McGurk effect. In doing so, we attempted to address the question of whether speech-specific mechanisms are responsible for the McGurk effect.

Because there is a broad range of potentially critical physical or event properties, we opted to take an incremental approach to the problem: We tested the McGurk effect for stimuli that share many properties of stop consonants, but that would not be identified by our listeners as speech. Specifically, we used variants of the consonantal clicks that serve as phones in some languages of Africa. These have kinematic properties that are similar to those of consonants of spoken English: they are produced by making complete constrictions somewhere in the oral cavity and then releasing the constrictions, resulting in a distinct pattern of change in the acoustic frequencies at release, as in English stop production. The clicks we selected for use are also visibly distinct from one another, thus providing an appropriate environment for the McGurk effect to emerge. Moreover, previous work has found that for native speakers of English, clicks are perceived as nonspeech [Best, McRoberts, & Sithole, 1988]; notably, Best and Avery [1999] found that native speakers of English do not exhibit a right-ear advantage when discriminating clicks although they do for native English consonant contrasts, whereas native speakers of a click language [Zulu] do show a right-ear advantage for the same click stimuli.

In using these clicks, we narrowed our scope from the general question of whether certain nonlinguistic properties are required for a strong McGurk effect, to the more specific question of whether the McGurk effect is always strong for a particular type of event, namely those which involve a rapid release of a vocal tract constriction. In doing so, we followed the recommendation of Saldaña and Rosenblum [1993] that “future research should be designed to implement nonspeech sounds that have characteristics of consonants … in order to demonstrate a more phenomenally striking nonspeech McGurk effect” [p. 415].

In the following experiments, we examined the magnitude of the McGurk effect for voiceless stop consonants in consonant-vowel syllables [in Experiment 1] and for isolated clicks [in Experiment 2]. A finding of McGurk effects of comparable magnitudes in the two experiments would indicate that the McGurk effect is strong for stimuli that involve a rapid release of vocal tract constrictions, regardless of whether or not they are perceived as speech. In contrast, a finding of a weaker McGurk effect for clicks could be attributed either to their lack of phonological significance or to the physical differences that exist between the stop consonant syllables and the isolated clicks. Experiments 3 and 4 provided a systematic examination of the relative contributions of these physical differences; they tested the McGurk effect for clicks coarticulated with a following vowel and for stop bursts presented without a following vowel, respectively. Thus, these experiments were designed to examine the relative contributions of both physical factors and phonological significance, the latter of which we expected to differ across the stimuli of the four experiments.

2 Experiment 1

The purpose of the first experiment was to establish the McGurk effect using English voiceless stop consonant syllables, namely, /pa/, /ta/, and /ka/. We used voiceless stops, as opposed to the more commonly used voiced stops, because in a later experiment we only presented the release bursts of the stops, and the bursts of voiced consonants would not have been appropriate for this use. Typically, McGurk experiments are conducted by instructing participants to identify a critical phoneme in a syllable or word. However, when the stimuli are unfamiliar, such as the clicks we used in subsequent experiments, participants may not have consistent labels for them, and an identification task may provide unreliable results. Therefore, we used an AXB categorization task in which participants were required to compare two stimuli [A and B] to an anchor stimulus [X] and to choose the one that was the better match. Although an AXB task has been used before to test for a McGurk effect [Rosenblum & Saldaña, 1992], the format we adopted was unique.

In our test, X was presented only auditorily, whereas A and B [henceforth the ‘test tokens’] were presented audio-visually. In this way, participants could not base their matching decisions on the visual similarity of the A and B tokens to the X token. The auditory component of either A or B matched X in place of articulation, while the other had a different auditory place of articulation. We had three kinds of trials. First, in “Audio-Alone” trials, the auditory tokens were presented without videos, in order to establish the overall categorizability of the auditory stimuli. In the “Match” trials, A and B were both audio-visually congruent, whereas in the “Mismatch” trials, the visual displays were switched so that both were audio-visually incongruent. Table 1 shows examples of the Match and Mismatch trials. The effect of incongruent dubbings on Mismatch trials was determined by comparing performance on these trials to otherwise identical trials with congruent dubbings [Match trials]. If, on a Mismatch trial, the test token that acoustically differs from X in place of articulation [e.g., B in Table 1] sounds like a better category match to X than does the test token that shares auditory place of articulation with X [A in Table 1], then participants will choose B as the better match to X. Thus, the test requires a visually-induced change in the perceived category of at least one of the test tokens. If there is no visual effect, then the audio-visually incongruent token that is an auditory match to X should be selected consistently in both the Match and Mismatch conditions.

Table 1

Examples of the Match and Mismatch AXB trial types, using stimuli from Experiment 1

Stimulus            Match                     Mismatch
Modality        A      X      B           A      X      B
Auditory       /pa/   /pa/   /ta/        /pa/   /pa/   /ta/
Visual         /pa/    —     /ta/        /ta/    —     /pa/

[X is presented auditorily only; its visual channel is a black screen.]
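
To make the logic of the AXB comparison concrete, the following sketch [hypothetical Python, not part of the original study] encodes the Match and Mismatch trials of Table 1 and the response predicted if only the auditory signal drives the choice; a visually induced change in the perceived category of a test token would pull responses away from that prediction on Mismatch trials.

```python
# Hypothetical sketch of the AXB logic in Table 1 [not part of the original study].
# Each test token is an (auditory, visual) pair; X is presented auditorily only.

match_trial = {
    "A": ("pa", "pa"),    # audio-visually congruent
    "X": ("pa", None),    # auditory-only anchor
    "B": ("ta", "ta"),
}

mismatch_trial = {
    "A": ("pa", "ta"),    # the visual components of A and B are switched
    "X": ("pa", None),
    "B": ("ta", "pa"),
}

def auditory_only_choice(trial):
    """Response predicted if only the auditory signal matters:
    pick the test token whose auditory consonant matches X."""
    x_auditory = trial["X"][0]
    return "A" if trial["A"][0] == x_auditory else "B"

print(auditory_only_choice(match_trial))     # 'A'
print(auditory_only_choice(mismatch_trial))  # 'A'; a visual influence would shift responses toward 'B'
```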

2.1 Method

2.1.1 Participants

The participants were 25 undergraduates at the University of Connecticut. All received course credit for their participation. All reported normal hearing and normal or corrected-to-normal eyesight, and all were native speakers of English. Data from one participant were dropped from the analyses for failure to meet our performance criteria [see Results below].

2.1.2 Materials

2.1.2.1 Visual stimuli

Visual stimuli were recorded in a room with bright lighting and a plain backdrop. An adult female [the second author] was videotaped producing the syllables /ba/, /da/, and /ga/ in randomized sequences. The tokens were digitized at 30 frames/second at 320-by-240 pixels using a Radius board mounted in a Macintosh computer, and edited using Adobe Premiere [San Jose, CA]. One token of each syllable was selected from these productions, based on the clarity of the articulatory movements and visual similarity [in duration, head position, etc.] across the tokens of different consonants. Movies of each syllable were truncated so that they were the same duration [1630ms], had at least two frames preceding the onset of the consonant gesture, and included the full duration of the vowel opening and closing gestures.

2.1.2.2 Auditory stimuli

To obtain noise-free recordings, the auditory tokens were recorded in a separate session, in a soundproof booth. The same talker produced multiple tokens of the syllables /pa/, /ta/, and /ka/ in randomized sequences, which were recorded onto a digital audio tape recorder [DAT] and then input to the Haskins VAX computer system in pulse code modulation [PCM] format, where acoustic analyses were performed. The duration and fundamental frequency [F0] of each syllable were measured in the acoustic analysis program HADES [Rubin, 1995], and three tokens of each syllable were selected that were roughly matched in duration and F0. The stimuli were then digitally amplified so that they were matched in peak RMS amplitude as well. Of the three tokens of each category, one was selected to be dubbed [for A and B tokens of the AXB test], while the other two were dubbed to a continuous black video screen [for X]. The acoustic A and B tokens were selected because they were the most similar to one another in duration and F0 across the three categories. In that way, we made the acoustic differences between the test [dubbed] tokens and the comparison [X] tokens as uniform as possible.
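
The acoustic measurements and amplitude matching were carried out in HADES on the Haskins VAX; purely as an illustration of the amplitude-matching step, a present-day equivalent might look like the following sketch [assuming mono WAV files and the numpy and soundfile libraries; the file names are hypothetical, and overall RMS is used because the windowing behind the reported “peak RMS” measure is not specified].

```python
# Illustrative re-creation of the amplitude-matching step [not the original HADES procedure].
import numpy as np
import soundfile as sf  # assumed WAV I/O library

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def match_rms(paths):
    """Scale each token to the RMS level of the loudest token.
    Overall RMS is used here; the exact 'peak RMS' windowing is not specified in the text."""
    tokens = [sf.read(p) for p in paths]               # list of (samples, samplerate) pairs
    target = max(rms(x) for x, _ in tokens)
    for path, (x, fs) in zip(paths, tokens):
        sf.write(path.replace(".wav", "_matched.wav"), x * (target / rms(x)), fs)

# match_rms(["pa1.wav", "ta1.wav", "ka1.wav"])         # hypothetical file names
```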

2.1.2.3 Audio-visual dubbing

The selected auditory tokens were converted into AIFF format, and then imported into Adobe Premiere, where those selected to be dubbed were paired with the digitized visual token. Dubbing was accomplished by aligning the acoustic release burst with the video frame in which the consonant release was first visible. Auditory /pa/ was paired with visual /ba/, /da/, and /ga/ [which, because they are visually indistinguishable from their voiceless counterparts, will henceforth be called /pa/, /ta/, and /ka/, respectively]; auditory /ta/ was paired with visual /pa/ and /ta/; and auditory /ka/ was paired with visual /pa/ and /ka/. [The /ta/-/ka/ and /ka/-/ta/ combinations were not used because the alveolar and velar places of articulation are not sufficiently distinct to give rise to a McGurk effect; e.g., MacDonald & McGurk, 1978]. Each of the dubbed audiovisual tokens was saved as an individual movie file. Additionally, movies were made of each auditory token paired with a black screen.
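
The alignment criterion can be illustrated with a rough sketch [hypothetical; the actual dubbing was carried out in Adobe Premiere]. It locates the release burst with a simple short-time-energy threshold and converts that time to a frame index at the 30 frames/second video rate; the window size and threshold are arbitrary choices, not values from the study.

```python
# Rough illustration of locating a release burst and the 30 fps frame it falls in.
# Hypothetical sketch; the actual alignment was done in Adobe Premiere.
import numpy as np
import soundfile as sf  # assumed WAV I/O library

def burst_onset_frame(wav_path, fps=30, win_ms=5, threshold_factor=10.0):
    x, fs = sf.read(wav_path)                      # assumes a mono recording
    win = int(fs * win_ms / 1000)
    # short-time energy in consecutive non-overlapping windows
    energy = np.array([np.sum(x[i:i + win] ** 2)
                       for i in range(0, len(x) - win, win)])
    noise_floor = np.median(energy[:10]) + 1e-12   # assumes the token starts with silence
    onset_win = int(np.argmax(energy > threshold_factor * noise_floor))
    onset_sec = onset_win * win / fs
    return onset_sec, int(onset_sec * fps)         # burst time and the video frame index

# t, frame = burst_onset_frame("pa_token1.wav")    # hypothetical file name
```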

2.1.2.4 Test sequences

AXB trials were constructed by concatenating a dubbed movie, an X token movie [a black screen], and another dubbed movie, followed by a 3.5s long silent movie of a black screen. The interstimulus intervals [ISI], measured from the acoustic offset of one token to the acoustic onset of the next token within a trial, were approximately 1050ms. Sixteen AXB trials were constructed for each of the three trial types [Audio-Alone, Match and Mismatch], along with two other trial types that will not be reported here, for a total of 80 trials.

In the Audio-Alone, Match, and Mismatch trials, the consonant contrast between the A and B stimuli was either /p/-/t/ or /p/-/k/; the X token was either /pa/ or the alternative consonant in the contrast [/ta/ or /ka/]; and the assignment of /pa/ and /ta/ or /ka/ to either the A or B position of the triad was balanced. Additionally, two X tokens of each syllable were used. The three classes of trial differed in their visual components: in Audio-Alone trials, the visual component was a black screen for A, X, and B; in Match trials, A and B had congruent visual tokens [e.g., auditory /pa/-visual /pa/; auditory /ta/-visual /ta/]; in the Mismatch trials, the visual components of the A and B stimuli in the Match trials were switched.

We created a randomized sequence of the 16 Audio-Alone AXB trials, and a separate randomized sequence of the 64 Audio-Visual AXB trials [with a full randomization of all trial types]. Additionally, practice sets were created for the Audio-Alone and Audio-Visual sequences: each contained five AXB trials, and in the Audio-Visual practice sequence they were all trials from the Match condition. The sequences were output directly to videotape using Adobe Premiere. All participants received the same Audio-Alone sequence and the same Audio-Visual sequence.
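
As an illustration of the counterbalancing described above [contrast x X identity x A/B assignment x two X recordings, then randomized], a hypothetical trial-list generator might look like the following sketch; the actual sequences were assembled and output to videotape in Adobe Premiere.

```python
# Hypothetical trial-list construction mirroring the design of Experiment 1
# [not the authors' scripts; the actual sequences were assembled in Adobe Premiere].
import random

CONTRASTS = [("pa", "ta"), ("pa", "ka")]

def build_trials(trial_type):
    trials = []
    for c1, c2 in CONTRASTS:
        for x in (c1, c2):                       # X is either member of the contrast
            for a, b in ((c1, c2), (c2, c1)):    # balance assignment of syllables to A and B
                for x_recording in (1, 2):       # two different recordings of each X token
                    trials.append({"type": trial_type, "A": a, "X": x, "B": b,
                                   "X_recording": x_recording})
    return trials                                # 2 x 2 x 2 x 2 = 16 trials per trial type

audio_visual = build_trials("Match") + build_trials("Mismatch")
random.shuffle(audio_visual)   # the two additional, unreported trial types are omitted here
print(len(audio_visual))       # 16 Match + 16 Mismatch = 32 trials
```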

2.1.3 Procedure

Participants were given answer sheets that offered ‘A’ and ‘B’ as response options for each trial. In the Audio-Alone portion, they were instructed that they would hear sets of three English syllables and that they were to indicate on their answer sheets whether the first or the third syllable sounded more like the second. In the Audio-Visual portion, they were told that the first and third syllables in the sequence, but not the second, would be accompanied by a visual presentation of a face saying a syllable. Participants were informed that the syllables were dubbed, and that the visual presentation would not necessarily match the acoustic signal. Participants were instructed to base their decisions only on how similar the syllables sounded. It was also stressed that participants should, nonetheless, watch the video screen at all times except when they were marking their answer sheets. The participants were shown the practice sequence for each portion of the experiment immediately preceding the respective test sequence, and were given feedback if requested.

In this and all of the other experiments, the order of the presentation of the Audio-Alone and Audio-Visual tests was counterbalanced so that half of the participants performed the Audio-Alone test first, and half the Audio-Visual test.

The sequences were presented visually to the participants on a color TV monitor [20 inch screen] with full-screen video, and played through a VCR. The sound was fed from the VCR output through an amplifier, and played through an eight-inch speaker mounted on top of the monitor. Participants were seated approximately eight feet from the monitor, and were run in groups of one to four.

2.2 Results

There were two screening measures. Participants had to make correct responses in at least 70% of trials in the Audio-Alone condition and in at least 70% of trials in the Audio-Visual Match condition. The data of one participant failed to meet either of these criteria and were excluded from all analyses. The remaining participants averaged 98.2% correct responses on the Audio-Alone trials.

On Match and Mismatch trials, we computed the percentage of trials on which participants selected the audio-visual syllable with the same auditory place of articulation as X as the better match to X. This percentage should be low on Mismatch trials to the extent that the incongruent dubbings alter the perception of the A and B tokens, such that the syllable with an auditory match to X sounds less like X, and the other syllable, whose visual [but not auditory] component matches auditory X in place of articulation, sounds more like X. Match trials, on which the percentage should be high, provided a baseline for assessment of performance on Mismatch trials. Thus, our dependent measure was the Match minus Mismatch difference in the percentage selection of syllables with an auditory match to X. The consonant contrast on each trial was either /pa/-/ta/ or /pa/-/ka/. The X token was either /pa/ or the nonlabial alternative [/ta/ or /ka/].
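
Expressed as a computation, the dependent measure amounts to the following sketch [hypothetical Python/pandas; the file and column names are assumptions, not the authors'].

```python
# Sketch of the dependent measure [hypothetical file and column names].
import pandas as pd

# one row per trial: participant, condition ('Match' or 'Mismatch'), contrast, x_identity,
# and chose_auditory_match [1 if the token sharing auditory place with X was selected, else 0]
df = pd.read_csv("axb_responses.csv")

pct = (df.groupby(["participant", "contrast", "x_identity", "condition"])
         ["chose_auditory_match"].mean() * 100).unstack("condition")

pct["difference"] = pct["Match"] - pct["Mismatch"]   # Match-minus-Mismatch difference score
print(pct["difference"].mean())                      # overall mean difference across all cells
```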

The overall difference between the Match and Mismatch conditions was 27.9%, a value significantly greater than zero, t[95] = 8.88, p < .0001. Analogous one-sample t-tests performed on each cell mean revealed that each one differed from zero at p < .005 or less, smallest t[23] = 3.12. An ANOVA on the difference scores revealed significant main effects of Consonant Contrast, F[1, 23] = 6.59, p < .05, and of X token, F[1, 23] = 5.38, p < .05, as well as a significant interaction, F[1, 23] = 6.34, p < .05, shown in Figure 1. The Match-Mismatch difference score was nearly identical in three of the cells; it was much smaller when X was /ka/ in /pa/-/ka/ contrasts than in the other conditions.
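
The reported tests could be reproduced along these lines [a sketch assuming scipy and statsmodels and a table of per-cell difference scores; not the authors' original analysis code].

```python
# Sketch of the reported tests: one-sample t-tests on the difference scores and a
# 2 [Consonant Contrast] x 2 [X token] repeated-measures ANOVA. Hypothetical data layout.
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# assumed layout: one row per participant x contrast x X-identity cell,
# with the Match-minus-Mismatch difference score precomputed
diff = pd.read_csv("difference_scores.csv")

print(stats.ttest_1samp(diff["difference"], 0.0))            # overall test against zero

for (contrast, x), cell in diff.groupby(["contrast", "x_identity"]):
    print(contrast, x, stats.ttest_1samp(cell["difference"], 0.0))  # per-cell tests

print(AnovaRM(diff, depvar="difference", subject="participant",
              within=["contrast", "x_identity"]).fit())      # 2 x 2 repeated-measures ANOVA
```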

Figure 1. Difference between Match and Mismatch trials in the percentage of selections of the audiovisual token with an auditory place-of-articulation match to X, for English stop consonant syllables in Experiment 1. Means are grouped according to the Consonant Contrast [/p/-/t/ or /p/-/k/] and the identity of the X token. Asterisks specify difference scores that are significantly greater than zero.

2.3 Discussion

The results demonstrate a visual influence on speech perception and further reveal differences in the strength of the effect depending on the phonological contrast and the particular audiovisual pairing; these differences are broadly consistent with typical phonetic classifications of these audiovisual pairings. Although we will not do so for the later experiments, where our primary focus is the overall magnitude of the effect, here we discuss the results in some depth in order to show how results in the AXB task correspond to typical phonetic classification results.

We found significant mismatch effects for all of our conditions, but the effect was much weaker in one condition, namely when the contrast was between /pa/ and /ka/, and X was /ka/, than in the others. The pattern of results can be interpreted in light of the typical responses for the incongruent stimuli. Table 2 presents, for the four test conditions in the Mismatch condition, the typical visually-influenced percepts for the incongruent tokens in the AXB trials. Across the four conditions, the typical McGurk percept for the incongruent token with an auditory match to X is either a different consonant than X [rows 1 and 3 of Table 2] or a combination of consonants that includes X [rows 2 and 4]. Likewise, the typical McGurk percept for the incongruent token with a different auditory place of articulation than X is usually either the same consonant as X [row 2] or a combination that includes X [rows 1 and 3]. The exception to this is the condition [row 4] in which X is /ka/ and the incongruent token is A/pa/-V/ka/, which is typically perceived as /ta/. Table 2 demonstrates that in the first three conditions the expected McGurk percept for the incongruent token with a different auditory place of articulation than X provides a better match to X. In contrast, in the fourth condition the McGurk percept for the incongruent token with the auditory match to X provides a better match, consistent with the smaller mismatch effect.

Table 2

Design of Mismatch trials in Experiment 1, along with typical visually-influenced responses to the incongruent A and B stimuli, and expected responses if there is no visual influence [‘Auditory Choice’] and if there is a McGurk effect [‘McGurk Choice’]

Consonant    X       Sample AXB Trial                          Auditory   Expected McGurk       McGurk
Contrast     Token   A              X        B                 Choice     Percepts [A, X, B]    Choice
/pa/-/ta/    /pa/    A/pa/-V/ta/    A/pa/    A/ta/-V/pa/       A          /ta/, /pa/, /pta/     B
/pa/-/ta/    /ta/    A/ta/-V/pa/    A/ta/    A/pa/-V/ta/       A          /pta/, /ta/, /ta/     B
/pa/-/ka/    /pa/    A/pa/-V/ka/    A/pa/    A/ka/-V/pa/       A          /ta/, /pa/, /pka/     B
/pa/-/ka/    /ka/    A/ka/-V/pa/    A/ka/    A/pa/-V/ka/       A          /pka/, /ka/, /ta/     A

Overall, Experiment 1 showed that our AXB task captures the typically-reported perceptual effects of audiovisual incongruities in speech. We turn now to our test for a McGurk effect on stimuli that lack phonological significance for our participants, namely vocal-tract clicks.

3 Experiment 2

In our attempt to find a nonspeech McGurk effect, we used a bilabial click [which is rare among click languages, occurring only in southern Khoisan languages such as !Xõò; Ladefoged & Maddieson, 1996; Ladefoged & Traill, 1994], along with a dental click and a lateral click, which are used in a number of other African languages [e.g., Zulu, Xhosa]. Clicks are produced with two locations of complete closure in the oral cavity, permitting a suction to be formed between them when the primary articulator is drawn down for release of the anterior closure. The posterior closure [velar, except for bilabial clicks, in which case it is dental] is released only after the anterior constriction release. Thus the initial release produces a suction-release noise. Informally, a bilabial click resembles a “kissing” sound; dentals can be described as a “tsk” sound; and laterals [which have a visual asymmetry of the jaw and tongue during the click production] produce a sharp “cluck” sound [similar to a “giddyap”]. The bilabial and dental clicks are relatively similar acoustically, both in frequency and duration, and both are fairly distinct from the lateral [Ladefoged & Traill, 1994]. Velar clicks do not exist in any language, and would in fact be articulatorily impossible given the requirement that clicks have two releases, normally with a velar secondary release. The stimuli for Experiment 2 were clicks produced in isolation; that is, they were produced without a following vowel.

These clicks are similar to English stop-vowel syllables of Experiment 1 in that they involve a full closure of the vocal tract followed by an abrupt release. Additionally, the places of articulation of two of the clicks we used, namely the bilabial and dental, are similar to those of /p/ [a bilabial] and /t/ [an alveolar], two of the stimuli in Experiment 1. However, the clicks differ from the stimuli of Experiment 1 in phonological status, as they are not native speech sounds for English-speaking listeners, who have been found to hear clicks as nonspeech sounds [Best et al., 1988]. The clicks also differ from the stimuli of Experiment 1 in physical properties; they lack a following vowel and vocalic transitions from the consonant release to the vowel steady-state portion, and the click releases differ from English stop consonant bursts both aerodynamically and in their acoustic makeup [Ladefoged & Traill, 1994]. We will return to these physical differences in the discussion.

We tested for effects of incongruent dubbings on perception of the clicks, using the procedures of Experiment 1. This enabled us to test the hypothesis that stimuli that are similar to stop consonants in that they involve a rapid release of a vocal tract constriction will exhibit a strong McGurk effect, regardless of their phonological status.

3.1 Method

3.1.1 Participants

Participants were 25 undergraduates at the University of Connecticut. All received course credit for their participation. They all reported normal hearing and normal or corrected-to-normal eyesight, and all were native speakers of English. An additional 10 undergraduates served as participants in a follow-up test of the perceived phonological significance of the stimuli.

3.1.2 Materials

3.1.2.1 Visual stimuli

The visual stimuli were recorded in the same session, by the same speaker, as in Experiment 1. The speaker is a native speaker of English, but has phonetic training and can produce clicks, both in isolation and coarticulated with a vowel. The talker produced six repetitions of bilabial, dental, and lateral clicks in isolation, intermixed with other productions in randomized sequences. Of these, one bilabial, one dental, and one lateral production were selected to provide videos for the experiment. All digitizing, selecting, and editing procedures were the same as in Experiment 1. The visual stimuli were all 1400ms in duration.

3.1.2.2 Auditory stimuli

The auditory tokens were recorded separately, in the same session [with the same recording conditions] as in Experiment 1. The speaker produced up to 20 tokens each of the bilabial, dental, and lateral clicks in isolation [in randomized sequences]; these were recorded to DAT and input into the Haskins VAX system in PCM format, where acoustic measurements were performed. We measured click durations and centroid frequency at the click midpoint. Three tokens from each click category were selected on the basis of similarity among these measurements. The stimuli were digitally amplified, where necessary, so that the clicks were matched in RMS amplitude. As in Experiment 1, one of the three tokens of each category served as the dubbed token; the other two served as X tokens.
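
The centroid-frequency measurement can be illustrated as follows [a hypothetical sketch; the original measurements were made in HADES]. It computes the spectral centroid of a short window centred on the click midpoint; the window length and file name are assumptions.

```python
# Illustration of measuring centroid frequency at the click midpoint [not the HADES code].
import numpy as np
import soundfile as sf  # assumed WAV I/O library

def midpoint_centroid(wav_path, win_ms=10):
    x, fs = sf.read(wav_path)                        # assumes a mono recording
    mid = len(x) // 2
    half = int(fs * win_ms / 1000) // 2
    frame = x[mid - half:mid + half] * np.hanning(2 * half)
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))   # spectral centroid in Hz

# print(midpoint_centroid("dental_click_03.wav"))    # hypothetical file name
```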

3.1.2.3 Audiovisual dubbing

The procedure for dubbing was identical to that used in Experiment 1, except that here all possible dubbing pairs were used [i.e., bilabial-dental, bilabial-lateral, and dental-lateral].1

3.1.2.4 Test sequences

The procedure was identical to that of Experiment 1. However, because more contrasts were used in this experiment, the Audio-Alone test sequence had 24 trials [instead of 16], and the Audio-Visual test sequence had 96 trials [rather than 64]. The acoustic ISIs were approximately 1375ms. The difference in ISI from Experiment 1 arises primarily because the acoustic signals used here [isolated clicks] are much shorter than the syllables in Experiment 1, whereas the visual stimuli in the two experiments were similar in duration.

3.1.3 Procedure

The instructions were similar to those used in Experiment 1, except that the clicks were described as “sounds.” Thus, in both the Audio-Alone and Audio-Visual portions of the test, participants were told that they would hear a sequence of three short sounds and that they were to indicate on their answer sheets whether the first or the third sounded more like the middle one; in the Audio-Visual portion, they were told that the first and third sounds would be accompanied by a video display of a woman opening her mouth. No reference was made to the fact that the sounds themselves were produced by a vocal tract. In other regards, the experiment was conducted in exactly the same manner as Experiment 1.

3.1.3.1 Phonological significance test

We conducted a follow-up test to confirm that the click stimuli were not perceived as speech. Participants in this phonological significance test received the same instructions as the participants in the regular experiment. However, they were only presented with the Audio-Visual portion of the experiment, and only the first 24 trials were shown. They were then given a questionnaire in which they were asked to describe the stimuli. The first item of the questionnaire asked the participants, in an open-ended manner, to describe the qualities of the sounds that had been presented. Subsequent questions asked specifically whether the sounds resembled [1] any particular environmental events, [2] any nonspeech mouth sounds, and [3] any speech sounds. In each case that participants responded “yes,” they were asked to describe what the stimuli resembled.

3.2 Results

The data of one participant failed to meet our performance criteria [see Experiment 1] and were excluded from all analyses. In the Audio-Alone part of the experiment, the remaining participants overall selected the test token that matched X 87.5% of the time.

As in Experiment 1, we examined the difference between Match and Mismatch conditions in the percentage of selections of the token that matched X in auditory place of articulation. The click contrast was either bilabial-dental or bilabial-lateral, and the X token was either bilabial or the alternative [dental or lateral].

Overall, participants selected the X token’s auditory match 15.4% less often in the Mismatch than in the Match condition; this overall difference score is significantly greater than zero, t[95] = 4.39, p < .0001. One-sample t-tests performed on each cell mean of the Click Contrast by X-token crossing [see Fig. 2] found a significant difference from zero at the p
