In the New Yorker’s News Desk blog last week, Gary Marcus expressed his skepticism of “deep learning,” an approach to artificial intelligence pioneered by Geoffrey Hinton that received some unusually high-profile coverage in the Times. I honestly don’t know enough about deep learning models to evaluate whether some of his criticisms hit their mark. Some of them are familiar from his criticisms of earlier work on neural networks, and some of those are more robust than others. Overall it seems clear that Marcus is trying to be fair in his appraisal, and yet he apparently couldn’t help introducing a little bit of self-serving historiography about the role of neural network models in the study of language:
Even the new models had serious problems… They learned slowly and inefficiently, and as Steven Pinker and I showed, couldn’t master even some of the basic things that children do, like learning the past tense of regular verbs. By the late nineteen-nineties, neural networks had again begun to fall out of favor.
This gives the impression that the neural network models were a bit of a fad, and now that that’s all over we have all moved on, having quietly tucked our gradient descent learning algorithms and three-layer perceptrons away in a storage closet with our CDs of the Batman Forever soundtrack, swing dancing outfits and paintings of Elvis on black velvet. James, man, remember the time we put Brighten the Corners on repeat, and tried to model second language speech perception with a continuous recurrent backprop network? Those were the days…
I dunno. There may have been a time, late in the last century, when we neural network modelers could imagine ourselves as a kind of Mongolian horde, laying siege to the vast territory commanded by Good Old Fashioned Artificial Intelligence and the Computational Theory of Mind, chanting “It is not enough that we succeed, all others must fail.” (A sentiment better attributed, as it turns out, to Somerset Maugham than Genghis Khan, but this epic time would definitely have been before you could look these things up on the internet.) The reasons behind the sputtering of these imperial aspirations are complicated, and a topic for another time, but I don’t believe it was because of a historic defeat on the battlefield of the past tense.
In fact I’m not sure that there was any such defeat.
Pinker may have written that book on the subject, but insights from the neural network models he and Marcus pooh-pooh still have a lot of currency among researchers trying to understand the relationship among language impairment, dyslexia, and difficulties with particular aspects of speech processing, for example. Further, like Genghis Khan’s DNA, the ideas behind neural network models more generally have been absorbed into the mainstream, and their descendants have been fruitful (see, for example, the burgeoning literature on “statistical learning” in developmental psychology). Also — and I realize this analogy is starting to get a little weird — as in modern Mongolia, where most of the land is still publicly owned so that nomadic families can tend to their herds according to traditional techniques that are plausibly more sustainable food systems than, say, the rest of the world’s, some of us are actually still using neural network models in our labs.
Here’s the thing. You can’t work with neural networks for very long without being forced to acknowledge their weaknesses. For those of us using the models to study language, the ethos was a lot less “I am the punishment of God” and a lot more “Hey, these dumb, general purpose learning algorithms actually go pretty far toward simulating a lot of phenomena that people think of as evidence for much more specialized processes — let’s see how far we can push this before it breaks!” Critics like Marcus contributed to this process by providing examples of things they thought the models could never do, and modelers often rose to the occasion by building models that could do just those things. Where such models fail (and in fact when they succeed) they open up interesting questions. Did they fail because they can’t solve the problem in principle or because of some more mundane detail of how we decided to frame the problem, or because there just wasn’t enough data or computing power to do what we wanted to do? Did they succeed because we somehow built assumptions into them that are part of our critics’ theories?
The intensity of these exchanges has waned substantially since the turn of the century, but for an example of how neural network models have continued to influence language research, Marcus might have headed down the hall to chat with his collaborator Liina Pylkkänen about her work on brain responses to syntactic anomalies. (Two authors on this work, Suzanne Dikker and Thomas Farmer, are collaborators of mine.)
This work is based on electrophysiological studies of the ELAN, an effect observed for grammatical errors like “The boys heard Joe’s about stories Africa.” In this case, “about” is a grammatical error because it is from the wrong word category (it’s a preposition, whereas there ought to be a noun there). ELAN stands for Early Left Anterior Negativity: Early because it is observed within about 100 milliseconds after the error is presented, Left Anterior because it is visible in electrophysiological measures over the front left side of the scalp, and Negativity because electrical voltages are signed and the effect involves stronger responses to errors during a negative-going portion of the electrical wave elicited by the word. It was initially taken as evidence that there are incredibly fast, automatic, syntactic processes that operate autonomously. How else to explain such a rapid response to a purely syntactic error?
Pylkkänen’s group collected data on ELAN-producing* stimuli using a different technique with higher spatial resolution. They found that these rapid syntactic processing signals were coming not from classical language areas, but from early visual cortex, and depended on visually salient category markings. It then became clear that a lot would turn on how “visually salient” should be defined in this instance. There could still be a dedicated word-category-recognizing process that operated on the basis of elements like affixes, for example the -ed ending that indicates the regular past tense. But an alternative possibility was emerging from work inspired by neural network models. Working with Morten Christiansen and Padraic Monaghan, Thomas Farmer was looking for probabilistic cues to syntactic category using simple Euclidean distance over surface representations of words — just the sort of thing one would expect a neural network to pick up on. They found that, overall, nouns tend to sound more like other nouns than like verbs and vice versa.
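The distance computation itself is simple to sketch. Here is a minimal toy version of the idea, using my own made-up word lists and a crude letter-slot encoding rather than the slot-based phonological feature vectors Farmer and colleagues actually used, so the numbers it prints illustrate the computation, not their results:

```python
import math

# Each word is padded to a fixed number of slots and each slot is coded
# as a letter code. (Toy encoding only; the real work used phonological
# feature vectors per slot.)
MAX_LEN = 8

def word_vector(word):
    padded = word.ljust(MAX_LEN, "_")
    return [ord(c) for c in padded[:MAX_LEN]]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_distance(words_a, words_b):
    """Mean pairwise Euclidean distance between two word sets, skipping self-pairs."""
    pairs = [(a, b) for a in words_a for b in words_b if a != b]
    return sum(euclidean(word_vector(a), word_vector(b)) for a, b in pairs) / len(pairs)

# Hypothetical example items, not Farmer et al.'s stimuli.
nouns = ["marble", "insect", "tunnel", "ballad"]
verbs = ["amuse", "entail", "devour", "evoke"]

print(f"mean noun-noun distance: {mean_distance(nouns, nouns):.2f}")
print(f"mean noun-verb distance: {mean_distance(nouns, verbs):.2f}")
```

The claim being tested is just a comparison of these two averages: if nouns sound more like other nouns, the within-noun mean should come out smaller than the noun-to-verb mean across a large lexicon.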
It would be very difficult to write a set of rules or if-then statements that capture all of these regularities. They lack the crisp clarity of closed-class morphemes. Further, they are only probabilistic. It is not hard to find items that sound very “verby,” but turn out to be nouns, like “insect.” This is because there is nothing about the surface form of a word that essentially defines it as being a noun or a verb. In short, these statistical regularities are not at all the kind of thing that the logical, symbol-processing models of cognition Marcus favors would want to include. They tell us nothing about causality, necessity, sufficiency, etc. Of course by now you have guessed that these regularities impact a range of behavioral tasks — for example, readers slow down slightly when they encounter a verby noun relative to nouny nouns in sentences.
In a collaborative effort, the two groups showed that these probabilistic cues also influenced the early brain response to syntactic anomalies. Even when there was no syntactic anomaly, seeing a verby noun elicited a response much like what is observed for actual errors. This example shows that at least some of the processes involved in language are easily (and rapidly) influenced by overall surface similarity and weakly contrastive probabilistic patterns — in essence, all of the things that are left out of models that start from a priori assumptions about what the parts of the system are, and what their essential properties should be.
Does this mean that there is no such thing as an autonomous syntactic process in the brain? Does it mean that the affixes are not special in some way, but are just emergent properties of quasiregularity in the relationship between form and meaning? Interestingly, the authors of this paper don’t agree among themselves on those larger issues. The NYU group’s other work is deeply enmeshed in formal theories of syntax. The Cornell group’s other work takes a decidedly functionalist approach to all of language processing, including syntax. This work is an example of what can happen when people acknowledge the strengths and weaknesses of different approaches and sit down to try and figure out how things actually work. It’s also an example of why you need a wide range of approaches to complex scientific problems like the ones we face in the cognitive neuroscience of language. It’s unlikely that functionalists would have discovered these early “syntactic” effects on their own, but the initial formalist explanation turned out to be wrong, or at least incomplete.
No one is trying to build a ladder to the moon, as Marcus puts it, certainly not those of us who are focused on the basic science issues at stake. Our position is more like that of Martians, who, having recently witnessed the landing of the Curiosity rover, are now faced with the task of figuring out what it is, what it does, and how it works. We’re not going to get there by dismissing productive theories because we disagree with them.
Image credit: screen grab of Germany’s 1979 Eurovision entry, Dschinghis Khan, performing their hit number, “Genghis Khan.” Maybe not my best image pick to date, but the other option I came up with included Mario Balotelli’s nipple, and we’ve all seen enough of that.
*Actually there are two differences to note here: Pylkkänen’s group uses a magnetoencephalographic measure, so the terminology is different (it’s called an M100 effect, because it’s observed in the Magnetic field at around 100ms post-stimulus), and, second, they used printed, rather than spoken, words.