The commentaries on my paper “Generative Linguistics and Neural Networks at 60: Foundation, Friction and Fusion” are all now posted online at the authors’ websites at the links below. The linked versions of my paper and – I presume – of the commentaries are the non-copyedited but otherwise final versions that will appear in the March 2019 issue of Language in the Perspectives section.
Update March 2019: The final published versions can now be found at this link.
I decided not to write a reply to the commentaries, since they nicely illustrate a range of possible responses to the target article, and because most of what I would have written in a reply would have been to repeat or elaborate on points that are already in my paper. But there is of course lots more to talk about, so I thought I’d set up this blog post with open comments to allow further relatively well-archived discussion to continue.
Iris Berent and Gary Marcus. No integration without structured representations: reply to Pater.
Ewan Dunbar. Generative grammar, neural networks, and the implementational mapping problem.
Tal Linzen. What can linguistics and deep learning contribute to each other?
Lisa Pearl. Fusion is great, and interpretable fusion could be exciting for theory generation.
Chris Potts. A case for deep learning in semantics.
Jonathan Rawski and Jeff Heinz. No Free Lunch in Linguistics or Machine Learning.
Re. Berent and Marcus: Is there a public write-up of “How we reason about innateness”? It’s cited as evidence for a fairly bold claim (“resistance to innate ideas could well be grounded in core cognition itself”), but all I can find is a talk with no associated paper.
Hopefully, soon. It’s under review (revise and resubmit).
Thanks! I’d be curious to see it if you’re comfortable sharing privately. (No worries if not.)
(Porting over from Twitter)
Either I don’t get the spirit of the Rawski & Heinz response, or it misses an easy opportunity to draw parallels between symbolic and representation learning-based approaches to language. They claim that “any serious scientific application of neural architectures within linguistics must always strive to make the learner’s biases transparent,” with the strong implication that this isn’t being done.
Specifying the model architecture and learning algorithm used for an NN model specifies the model’s bias, no? Nearly every paper in this literature gives that key information, and many, many CL papers discuss the consequences of the specifications for what is learnable easily and what is learnable at all. Very little of this discussion is couched in the language of learning theory and the Chomsky Hierarchy, but the discussion is absolutely happening. I see a clear opportunity for bridge-building here, but I don’t see a clear failing in current work unless you take that body of theory as a particularly privileged approach to the science of language.
(Of course, the original article could have commented on this too, but I don’t think it was clearly called for.)
Reposted from Twitter:
The Berent & @Marcus response summarizes familiar arguments against 1980s-style connectionism, but it would have been useful to see a discussion of more recent work: for example, recent neural architectures with structured inductive biases (syntactic, relational, compositional, etc.), which I assume would not be taken to follow the “associationist hypothesis” (from Chris Dyer, Jacob Andreas, Richard Socher, Sam Bowman), or Kirov and Cotterell’s experimental work showing that modern seq2seq networks (without explicit algebraic representations) can in fact learn the English past tense (https://arxiv.org/abs/1807.04783), etc. The only recent papers that do get mentioned are ones that support the authors’ argument – @LakeBrenden & Baroni’s (very cool) experimental work demonstrating a lack of systematicity in standard seq2seq networks (https://arxiv.org/abs/1711.00350 and https://arxiv.org/abs/1807.07545).
In any case I agree with Berent & Marcus that (1) the goal is to create a model that generalizes like humans, and (2) to get there we need to run experiments on both models and humans, and if necessary add different/stronger inductive biases (nothing controversial here).
Thanks Tal! Just in case someone reading your post hasn’t read the paper on which Berent and Marcus were commenting, I should point out that I tried to have a balanced discussion there on the issue of whether explicit linguistic representations, including symbols, are needed in neural net models of language, and included Kirov and Cotterell as an example of an interesting recent result that suggests that current architectures can do more without variables than earlier ones could. It seems like from Berent and Marcus’ perspective, I was leaning too far in the direction of endorsing symbol-free models. The only major thing missing from the commentaries, I think, is someone from the other side, arguing that I was being too optimistic about the need for explicit linguistic representations.
To summarize my main points: (1) current practice in the neural network world has moved beyond what Gary has termed eliminative connectionism, and many “deep learning” systems have components that could qualify as symbols, variables and compositional representations; (2) discussion of the abilities and limitations of neural networks should make reference to specific experimental results obtained on specific neural network architectures.
Pater presented a narrative about language-learning over the past sixty years, focusing on neural networks and generative grammar. Our point is that any such narrative which excludes, as Pater’s does, computational learning theory and mathematical theories of string (and tree and graph) languages misses the forest for the trees. This is especially true for the fusion Pater dreams of. The only way that will occur, as Dunbar argues so cogently, is by bridging the mapping gap, which requires the ability to reason about NNs at the right level of abstraction. This is exactly what computational and mathematical theories of languages and language-learning provide.
Bowman asks “Specifying the model architecture and learning algorithm used for an NN model specifies the model’s bias, no?” The answer is no. That a program can be replicated does not entail that its bias is analyzable or transparent. Such a specification is a necessary, but not a sufficient, condition. No one would use a sorting algorithm in a software library if there were no proof of its correctness. The proof of correctness may be derived from a specification of the program, but the program itself is not sufficient to make clear what problem it is solving. Computer science is about problems and the algorithms that solve them reliably, correctly, and within given resource bounds. Again, echoing Dunbar: what problem is the specified NN solving, regardless of the task?
We agree there is an opportunity for bridge building, and we pointed to recent work that makes these connections. We encourage more work of this sort.
Finally, Pater’s article is about *generative* linguistics and neural networks. Page 4 of our commentary shows that this body of theory is central to both, so it is privileged in this context. One can reject generative grammar, but then one is no longer talking about Pater’s paper or ours. Also, the body of theory we discuss is a mathematical theory of language and language-learning, so it is generally privileged in the science of language and language-learning in the same way that mathematical analysis is privileged in any other scientific endeavor. Many of the topics it encompasses, such as logic, automata theory, and the learning of grammars expressed with logic and automata, will be with us for centuries to come.
Jon & Jeff
One thought related to some of the above. I am very sympathetic to the desire to have our neural nets be interpretable, and also to have analytic results about representability and learnability that could hook up with other results in mathematical and computational theory. But it doesn’t follow that we should dismiss, or ignore, research that does not meet those desires.
Say, for example, that we want a model of how humans learn and represent semi-regular morphophonology. I think it would be a mistake to ignore Kirov and Cotterell’s research, just because it doesn’t meet the above desires.
Coming in here way late (nearly a year after the previous comments), but I just saw your paper.
I found the discussion of the hierarchical nature of phonology (pg e49 in the version I’m looking at) confusing. It seems to conflate two distinct points: the question of whether phonological representations are hierarchical, and the depth of derivation (in rule-based phonology). And further, both points seem to point away from any similarity with syntax.
Wrt the hierarchical nature of representations, the paper refers to Selkirk 1981 and Yu 2017. Both appear to refer to syllables and feet. But neither syllables nor feet are hierarchical in the sense that syntactic phrase structure is; on the assumption that syllables can’t embed other syllables, and feet can’t embed other feet, both are finite state. Indeed, Yu’s paper uses a finite state transducer tool (xfst) to represent both syllables and feet. Am I missing something?
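To make the finite-state point concrete, here is a toy sketch (the segment inventory and the flat (C)V(C) template are invented for illustration, not taken from Yu’s paper): if syllables never embed inside other syllables, the set of well-formed words is a regular language, recognizable with an ordinary regular expression rather than anything phrase-structure-like.

```python
import re

# Toy illustration (invented inventory and template, not from Yu 2017):
# a flat (C)V(C) syllable with no self-embedding makes the word language
# regular, so a plain regex / finite state machine suffices to recognize it.
C = "[ptkbdgmnsl]"                # toy consonant inventory
V = "[aeiou]"                     # toy vowel inventory
SYLL = f"{C}?{V}{C}?"             # one (C)V(C) syllable
WORD = re.compile(f"({SYLL})+$")  # a word is one or more syllables

for w in ["pa", "pat", "patka", "ptk"]:
    print(w, bool(WORD.match(w)))
# pa True, pat True, patka True, ptk False
```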
Wrt the depth of derivation, it has been known since C. Douglas Johnson’s 1972 “Formal Aspects of Phonological Description” that despite appearances, phonological rules of the type then common (less so today…) can be represented by finite state transducers. In constructing an FST, the application of a phonological rule is represented by what starts out as a three-level representation, but is then “compiled” down into a two-level representation. Iterating over a sequence of rules gives as final output a two-level finite state representation consisting of lexical forms on one level and surface forms on the other. The many implementations of FSTs for use in modeling rule-based phonology and morphology (and to a lesser extent, constraint-based systems) are based on this fact. (The only exception is full word reduplication, which cannot be derived by an FST. Technically this is only true if input words have unlimited length, but in practice the representations become unmanageable beyond a few characters/phonemes/segments.)
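Here is a minimal sketch of the two-level idea, with the caveat that I’m simulating ordered rules with regex substitutions rather than compiling real transducers, and the rules and forms are invented; an actual implementation would compile and compose FSTs with a tool like xfst, foma, or Pynini.

```python
import re

# Toy sketch (rules and forms invented for illustration). Each ordered rewrite
# rule defines a regular (finite-state) relation between strings, so an ordered
# rule sequence can in principle be compiled and composed into a single
# two-level transducer mapping lexical forms to surface forms. The regex
# substitutions below just simulate that rule-by-rule derivation.

RULES = [
    (r"n(?=[pb])", "m"),  # nasal place assimilation: n -> m before labials
    (r"t$", "d"),         # toy rule: word-final t -> d
]

def apply_rules(lexical: str) -> str:
    """Apply the ordered rules to a lexical form, yielding a surface form."""
    surface = lexical
    for pattern, replacement in RULES:
        surface = re.sub(pattern, replacement, surface)
    return surface

for form in ["inpat", "tanba", "pat"]:
    print(form, "->", apply_rules(form))
# inpat -> impad, tanba -> tamba, pat -> pad
```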
Not mentioned in the paper is the issue of whether phonology is hierarchical if you take into account cyclicity (assuming of course you believe in cyclic word structure). But here again, the cyclicity is bounded; there’s no self embedding. (Ok, there is if you want to treat anti-missile missile, anti-anti-missile missile missile and so forth as morphology…)
In sum, I think phonology is different: it is not hierarchical, at least not in the same sense that syntax is. But perhaps I’m missing something, or misunderstanding?
Thank you Mike for these comments. I’ll have to give them some thought before I reply. And sorry that I’ve just approved the comment now, three years after it was posted – this blog has been on hiatus!
Hi again Mike,
Check out Kristine Yu’s recent paper:
https://revistes.uab.cat/catJL/article/view/v20-yu
This seems relevant.