Last week in discussing Daniel Everett’s latest book, I touched briefly on some controversial issues in speech perception. I claimed, a bit glibly, that approaches to the problem in which it is assumed that the goal of speech perception is to recover abstract symbolic units for processing at higher levels of representation had not been particularly fruitful. That was a bit rash, and a bit opaque for people outside the field, so I wanted to double back this week to fill in some gaps, and discuss some reasons why I’m sympathetic to non-symbolic approaches to studying language.
The dominant view of language has — for at least a century — assumed that it has an underlying structure that is rational and hierarchical, with sentences made out of words and words made out of phonemes and phonemes made out of features, etc. It’s easy for people who read alphabetic languages to imagine speech as a sequence of symbols because that’s what text is like for us: a few basic shapes (these would be visual features) organized into letters (which correspond roughly to “phonemes”) which are in turn strung together to make words, etc.
The problem of speech perception, from this view, has been characterized as one of recovering pristine symbolic units from a messy, biological substance. Consider the influential analogy of Hockett’s Easter eggs. Hockett compares the symbolic structure of speech to a series of finely painted and decorated — but uncooked — Easter eggs that, in the process of producing speech, are subjected to a process that acts like a wringer, mashing them up into an undifferentiated mess. The listener’s job is to inspect this output and identify the series of painted eggs that must have produced it.
Hockett is arguing here that speech is conceived in the mind of the speaker as a string of symbols, and that the mechanics of speech production makes a mess out of this systematic and hierarchically organized representation. This is because speech production is carried out by a body made of muscle, bone, and nervous tissue, rather than, say, a finely tuned Cartesian automaton. (Although that’s an odd example because one of the most famous automata was Vaucanson’s duck, whose function was to simulate both eating and the other end of the digestive process.)
It’s interesting to consider why Hockett felt the need to specify that his eggs were uncooked. Presumably it would be just as hard to identify them if they were hard-boiled. But whereas cooked eggs, when similarly treated, would be at least food-like, if inedible due to all the shell fragments and paint, crushed raw eggs would be more reminiscent of some kind of bodily fluid that we would generally not like to encounter outside of a body. Not to get all Freudian here, but I think this detail reveals a kind of distaste for the frankly biological and unruly nature of the signal under consideration.
The idea that language is essentially abstract, symbolic, rational, and otherwise generally unlike anything else our bodies do has motivated a decades-long Easter egg hunt for just the acoustic and/or motor parameters that can reliably deliver up crisply defined, discrete phonetic units from the speech signal. If words are strings of phonemes, and we identify words by serially identifying their parts, then there must be some set of necessary and sufficient conditions that the perceptual system can use to identify those parts, despite the lack-of-invariance problem: the same phoneme is realized differently from one utterance to the next.
Unfortunately for this view, categorization of speech sounds does not seem to be at all about necessary and sufficient conditions. There is no fixed criterion by which we can identify with certainty, say, the sound /p/ in naturally produced speech. There are rules of thumb that allow us to guess with varying degrees of confidence, but it could always turn out to have been a /t/, or a /b/, or even an /f/, on closer inspection. This becomes clear when you try to teach yourself how to read spectrograms (visual representations of acoustics). If you follow that last link, and read Rob Hagiwara’s segment-by-segment description of how he solves each month’s puzzle, you will be struck by the awesome computational power that would be necessary for a system to do this in real time, dealing with hundreds of these puzzles per minute. Or else you will begin to wonder whether that could possibly be the way speech perception works.
In fact, when computers are programmed to produce speech in a way that preserves the whole “eggs” (so that phonemes are produced the same way every time, with none of the problematic variance characteristic of natural speech), the result is harder, not easier, to understand than when some of the messiness of natural speech is incorporated. Similarly, engineers trying to build systems that do automatic speech recognition gave up on trying to identify phonemes one at a time pretty early in the game. Instead, advances in speech recognition, in systems such as Siri, have generally been achieved by using larger and larger data sets and algorithms that treat the variability across speakers and contexts, which seemed to be such a problem, as part of their solution.
Importantly, these approaches are also generally ecumenical about what level of representation they try to reconstruct, generally favoring whole words and phrases over individual phonemes.
Another major problem for the notion that we perceive speech by doing anything like reconstructing a string of Easter eggs is that humans are very good at retrieving meaning from speech, even under conditions in which the usual features that would allow us to identify phonemes are destroyed. Two nice examples of how crazy you can get with this are sine wave speech and noise vocoded speech.
One way to understand the results of those studies is to assert that just the information preserved in those manipulations is necessary and sufficient for speech perception, and if we understood how to map from that information to meaning, we would have solved all of the important mysteries of speech perception. But this ignores the fact that sine wave speech and noise vocoded speech are far less intelligible than natural speech. If the phase transitions that are spared in those stimuli were really the essence of speech, wouldn’t they be easier to understand with all of those other distracting surface features stripped away?
Alternatively, these findings may stir us to consider whether the concept of “necessary and sufficient” is just as useless in adjudicating between theories of speech perception as it is in categorizing phonemes. Competing theories have generated a lot of contradictory data about what aspects of the speech signal people use under different conditions. Produce speech with just a few of the necessary cues to phonemic identity and people can, with some effort, understand it. Remove those cues and leave just phase transitions and people can, with some effort, understand it. Tweak the signal in a way that targets a single phoneme and people adapt to that in remarkable ways, which suggests that they are keeping track of how that particular phoneme is produced in that particular context. But you can also influence how people identify phonemes by filtering a piece of music that they hear before the to-be-identified stimulus.
It seems like all of these things could only be true if the brain actually had multiple, somewhat redundant processes for dealing with speech. That is, there may be no single “way” in which speech is perceived, but instead there are a wide variety of mechanisms available to the brain that allow it to meet different goals depending on the context.
To understand why this does not seem to be a very popular view, you have to consider that much of the work on the topic of speech perception has been published in journals that could well be titled the “Journal of Pretending that there are Exactly Two Theories About Anything and that These Experiments will Tell Us Which One is Correct.” This encourages an approach in which evidence for one mechanism — say, that motor representations are engaged in some unexpected way by speech — is taken as evidence against the existence of another mechanism — say, a dependence on general-purpose auditory categorization abilities — as if understanding speech perception were a zero-sum game.
Quite possibly it is theories of speech perception that are like finely wrought but uncooked Easter eggs, and once you run them through the wringer of What Actually Happens in the World, you get the spectacular mess that we observe in the literature. At any rate, it seems unlikely to me that discrete, symbolic representations are really the coin of the realm when we are talking about another gooey, frankly biological mess: the brain.
Image credit: Lubing industrial farming supplies.