Please leave comments on readings and classes for 9/4 and 9/9 here.
4 thoughts on “Types of phonological knowledge”
Ben Zobel
Question about maximum entropy model:
I was a bit confused about how the weights were established for the maximum entropy model. I found it a bit difficult to follow in Kager & Pater (2012), but from what I could gather, the weights were calculated based on a fit between the analysis of the lexicon and the behavioral data. Am I correct in saying that the maximum entropy and n-gram models are both based on statistical analysis of the lexicon, and that the superiority of the maximum entropy model over the n-gram model is due to the fact that the maximum entropy model can represent higher-level features of the lexicon compared to the rather “dumb” n-gram model? If this is the case, the explanatory value that the maximum entropy model provides (compared to the n-gram model) is in highlighting the features of the lexicon that are important for predicting behavior. However, it does not appear to be able to explain why a lexicon is shaped a specific way to begin with. Should we assume that there are underlying reasons for the statistical shape of a lexicon, or should we assume that the statistical shape is rather arbitrary? Why would you choose one assumption over the other?
I was hoping someone could clear up a possible confusion or misinterpretation on my part regarding Kager and Pater (2012).
On p. 98, the authors say, “When we add TP to the model presented in Table VIII, the result is as shown in Table XI. The AIC score is 8933, the best that we found in our model exploration.” I’m confused about whether the data that this model was run on is the same data as from the results in Table VIII. If it is, I’m curious why TP was not considered as a factor earlier. If it isn’t, it seemed to me that perhaps some capitalizing on chance was occurring; in other words, using one set of data to determine a significant effect/coefficient, and then going back to include that factor in a different model (to yield the overall best result) seems strange to me.
Thanks! Any help would be appreciated.
@Ben – Yes, the advantage of the MaxEnt model over the n-gram model is in terms of its richer representation of the structure of words in the lexicon. And right, neither model says why the words should have the shape they do.
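To make that contrast concrete, here is a toy sketch. The mini-lexicon, the constraints, and the weights below are all invented for illustration; they are not the actual model or constraints from Kager & Pater (2012). The point is just that the n-gram model only sees adjacent-segment counts, while the MaxEnt model scores words against weighted, higher-level constraints:

```python
import math

# Toy lexicon of "words" (hypothetical data, for illustration only).
lexicon = ["blik", "brik", "slik", "snik"]

def bigram_logprob(word, lexicon):
    """Score a word by summing crudely smoothed log relative frequencies
    of its bigrams in the lexicon: the 'dumb' n-gram baseline."""
    counts, total = {}, 0
    for w in lexicon:
        for a, b in zip(w, w[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
            total += 1
    return sum(math.log((counts.get((a, b), 0) + 1) / (total + 1))
               for a, b in zip(word, word[1:]))

# MaxEnt: each constraint has a weight; a word's score ("harmony") is the
# negative weighted sum of its constraint violations. These constraints and
# weights are invented stand-ins.
constraints = {
    "*bn-onset": lambda w: w.startswith("bn"),               # ban bn- onsets
    "*sC-onset": lambda w: w[0] == "s" and w[1] not in "aeiou",
}
weights = {"*bn-onset": 4.0, "*sC-onset": 0.5}               # illustrative

def maxent_harmony(word):
    return -sum(weights[c] for c, viol in constraints.items() if viol(word))

print(bigram_logprob("bnik", lexicon))  # low: the bn bigram is unattested
print(maxent_harmony("bnik"))           # -4.0: hit by *bn-onset
print(maxent_harmony("snik"))           # -0.5: mild *sC-onset penalty
```

In the papers, the weights are not hand-picked like this; they are fit so that the predicted scores match the data (lexical statistics and/or behavioral responses), but the division of labor between counts and weighted constraints is as above.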
You ask: “Should we assume that there are underlying reasons as to the statistical shape of a lexicon, or should we assume that the statistical shape is rather arbitrary? Why would you choose one assumption over the other?”
This is a great question. One way in which the structure of the lexica of languages is not arbitrary is that we can often draw the sorts of implicational generalizations that I was mentioning in class – if a language has sequence type X, it will have sequence type Y. Some phonological theories are set up so that a language that would violate such a universal would be unlearnable: if you tried to learn a language with only X, you’d automatically set up a grammar/model that allows Y too. The fact that the MaxEnt grammars don’t do this is why Daland et al. have the discussion of Evolutionary Phonology in their paper. We’ll come back to this when we talk about artificial language learning.
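For concreteness, an implicational universal of the "if X then Y" sort can be checked mechanically over inventories. The toy languages and the particular universal below (if a language allows obstruent+nasal onsets, X, it also allows obstruent+liquid onsets, Y) are invented for illustration:

```python
# Hypothetical onset inventories, invented for illustration (not real languages).
languages = {
    "Lang-A": {"bl", "br", "bn"},  # has X (bn) and Y (bl, br): consistent
    "Lang-B": {"bl", "br"},        # has Y only: consistent
    "Lang-C": {"bn"},              # has X but not Y: violates X -> Y
}

def has_X(onsets):  # any obstruent+nasal onset
    return any(o[1] in "mn" for o in onsets)

def has_Y(onsets):  # any obstruent+liquid onset
    return any(o[1] in "lr" for o in onsets)

for name, onsets in languages.items():
    consistent = (not has_X(onsets)) or has_Y(onsets)
    print(name, "consistent" if consistent else "VIOLATES X -> Y")
```

The theoretical claim in the comment is stronger than this after-the-fact check: in those theories the learner could not even acquire something like Lang-C, because positing X forces the grammar to admit Y.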
@Lisa You comment: “Using one set of data to determine a significant effect/coefficient, and then going back to include that factor in a different model (to yield the overall best result) seems strange to me.”
I guess this is a little strange! I did these stats, so I can tell you why I proceeded the way I did. We controlled our items for TP (Transitional Probability), and the first regression just included the factors we manipulated in the experimental design. We thought, though, that we should test for whether residual effects of TP could be at play – this was the motivation for the second regression that you mention. Crucially, the main effects and interactions that were significant in the first model were also significant in the second one (in fact, the p-value for the interaction is lower). We probably could have gone straight to the second regression, but I had some worries about including in the regression a variable that we had controlled for in the design, so I wanted to do, and present, that first one.
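For what it's worth, the logic of the two regressions can be sketched in a few lines. This uses simulated data and plain OLS, standing in for the paper's actual analyses: fit the design factors alone, then add TP and check that the design effects survive and whether AIC improves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated trial-level data (hypothetical, for illustration): a manipulated
# design factor, plus TP, which was controlled in the item design but may
# still carry residual variance.
n = 200
factor = rng.integers(0, 2, n).astype(float)            # manipulated factor
tp = rng.normal(0.0, 1.0, n)                            # residual TP variation
y = 1.5 * factor + 0.5 * tp + rng.normal(0.0, 1.0, n)   # simulated response

def ols_aic(predictors, y):
    """Fit OLS with an intercept; return (coefficients, AIC). AIC here is
    n*log(RSS/n) + 2k from the Gaussian log-likelihood, up to a constant."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1  # coefficients + error variance
    return beta, len(y) * np.log(rss / len(y)) + 2 * k

# Model 1: design factors only (mirrors the first regression).
beta1, aic1 = ols_aic([factor], y)
# Model 2: add TP as a predictor (mirrors the second regression).
beta2, aic2 = ols_aic([factor, tp], y)

# The factor's effect should survive the addition of TP, and if TP carries
# real residual variance, model 2's AIC should be lower (better).
print(aic1, aic2)
```

Since the same response data are used in both fits, the second model is not picking its factor from one dataset and applying it to another; it just asks whether TP explains variance the design factors left behind.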