Post comments on the readings for Thurs. Oct. 27th here.
5 thoughts on “Subregularities in alternations”
Presley
At first I thought Zuraw’s constraint family against stem-initial nasals in (19) was supposed to say “or a backer nasal” instead of “or a fronter nasal” and that it was just a typo, but she later (p. 451, just after the graph) says that one of those constraints is against not just stem-initial n but also stem-initial m, implying that she really did mean for the constraints to be *[nasal, *[n/m, *[m. Yet she definitely wants engma to be the most marked stem-initially. Am I experiencing temporary insanity in believing that these constraints will make m the most marked stem-initial nasal?
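Spelling out the arithmetic behind that worry (my own toy tally, with placeholder constraint names rather than Zuraw’s exact labels):

```python
# Violation counts for each stem-initial nasal under the two readings of (19).
# "ng" stands in for engma; the constraint names are shorthand, not Zuraw's.
scope = {
    "*[nasal":        {"m", "n", "ng"},
    "*[n-or-fronter": {"m", "n"},       # the family as literally written
    "*[m":            {"m"},
    "*[n-or-backer":  {"n", "ng"},      # the family she presumably intends
    "*[ng":           {"ng"},
}
families = {
    "as written":  ["*[nasal", "*[n-or-fronter", "*[m"],
    "as intended": ["*[nasal", "*[n-or-backer", "*[ng"],
}
for label, constraints in families.items():
    counts = {nas: sum(nas in scope[c] for c in constraints) for nas in ("m", "n", "ng")}
    print(label, counts)
# as written:  m gets 3 marks, n 2, ng 1  -> m comes out most marked
# as intended: ng gets 3 marks, n 2, m 1  -> ng comes out most marked
```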
As for something more substantive: do I understand correctly in thinking that even though the GLA gives real numbers, they work as rankings rather than as weights when it comes to evaluating harmony? So that we’re just doing OT with variably ranked constraints? In that case, it looks like she’s making a choice about whether to use OT or HG based on something that seems separable from that choice – the way of keeping weights from getting too high. How crazy would it be to tweak that part of MaxEnt? Is there a reason to believe that any algorithm with gang effects would still have the problem that she reported with MaxEnt?
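To make the ranking-versus-weight question concrete, here is a minimal sketch of the two evaluation modes being contrasted; the constraint names, numbers, and candidates are invented for illustration, not taken from Zuraw’s grammar.

```python
import random

ranking_values = {"C1": 100.0, "C2": 98.0}            # GLA-style real-valued constraints
violations = {"cand_a": {"C1": 0, "C2": 1},            # hypothetical candidates
              "cand_b": {"C1": 1, "C2": 0}}

def stochastic_ot_winner(noise=2.0):
    """Stochastic OT: the real numbers (plus noise) only fix a strict ranking;
    evaluation then proceeds constraint by constraint, with no summing."""
    noisy = {c: v + random.gauss(0, noise) for c, v in ranking_values.items()}
    remaining = list(violations)
    for c in sorted(noisy, key=noisy.get, reverse=True):
        best = min(violations[cand][c] for cand in remaining)
        remaining = [cand for cand in remaining if violations[cand][c] == best]
        if len(remaining) == 1:
            break
    return remaining[0]

def weighted_harmony(cand):
    """HG/MaxEnt-style: the same numbers act as weights in a sum, so many
    violations of lower-valued constraints can gang up on a higher-valued one."""
    return -sum(ranking_values[c] * v for c, v in violations[cand].items())
```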
Ernestus & Baayen (2003):
1. On page 29, the authors state that incorporating token frequency of existing verbs into predictive models degraded performance on predicting the “production” task data. I found this very interesting, because it fits with the fact that underlying forms are never actually used as speech tokens – they “underlie” speech tokens. In the realm of “analogical” models, could this result argue for a model in which there are nodes corresponding to actual speech tokens but also nodes corresponding to generalizations such as morphophonological underlying forms? Given that frequency effects do exist, such effects would be predicted to arise only when actual speech tokens play a role in the task at hand; but when the target of a task is specifically a generalization (because “underlying voicing” is never a property of a speech token – it is a relation between groups of speech tokens), frequency effects will not play a role, because generalizations cannot have a frequency.
(For instance, cars with wheels are very frequent in the “real” world, but the generalization “if X is a car, X will have wheels” can’t be said to be frequent or not – it can only be said to be true of a large number of Xs.)
2. Related to point 1, I think it’s slightly misleading to treat analogical models as preferable to OT on Occam’s-razor grounds. While analogical models may offer a simpler, and sometimes even more predictive, account of the facts, they don’t seem to have a well-worked-out story for the concepts “underlying form” and “alternation”, whereas OT and its variants were built to handle these concepts. Because the data used crucially refer to underlying forms and alternations, it seems at best premature to conclude that analogical models are simpler than OT and therefore “better”, given that their predictive power is at least equal.
3. I’m wondering why the authors used the past tense -te/-de alternation and not the singular/plural alternation. All regular plural verb forms in Dutch end in the suffix -@(n), which gives the underlying voicing of a stem-final consonant the opportunity to surface.
The -te/-de alternation seems problematic because it is extensively taught in school, which suggests that people can’t reliably get it right without instruction. The rule that is taught is basically that -te comes after stems that end in a voiceless consonant letter once you chop off the infinitive/plural suffix (infinitives and present plurals are homophonous), and that -de is used elsewhere.
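For concreteness, here is a minimal sketch of that schoolbook rule as just described; the code, the verb examples, and the exact letter set (the one usually taught with the ’t kofschip mnemonic) are my own illustration, not anything from the paper.

```python
# Toy implementation of the taught -te/-de rule as described above.
# Treating <ch> as a single voiceless "letter" is my assumption.
VOICELESS_LETTERS = {"t", "k", "f", "s", "p", "x"}

def past_tense_suffix(infinitive: str) -> str:
    """Return '-te' or '-de' for a regular Dutch verb, per the schoolbook rule."""
    stem = infinitive[:-2] if infinitive.endswith("en") else infinitive  # chop off -en
    if stem.endswith("ch") or stem[-1] in VOICELESS_LETTERS:
        return "-te"
    return "-de"

print(past_tense_suffix("werken"))  # stem "werk" ends in k -> -te (werkte)
print(past_tense_suffix("horen"))   # stem "hor" ends in r  -> -de
```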
The results of the Ernestus & Baayen (2001) study, in which the shape of the past tense suffix was partly determined by the underlying form and partly by lexical neighborhood, appear to confirm that the pattern is not as simple as it seems. I remember how my classmates and I, back in Leiden, tried to figure out what we actually said for some of the forms that are supposed to be spelled with -de, and I think we concluded that, at least in our dialect (which is pretty close to the kind of Dutch spoken on radio and TV), we would often say [t@] in contexts where we were supposed to write -de.
This is all suggestive rather than conclusive, of course, but I’m just wondering why they didn’t use the singular/plural alternation and let subjects say their versions of the nonce words, rather than using this strange, prescriptivism-affected alternation with spelling as the medium for a “production” study.
4. Slightly off-topic: one of the nonce word stems used in this study, [tif], was actually a slang verb stem used at least by teenagers at my secondary school; the basic meaning of the word was “to fall”, but it could also be used with the particle op- to mean “fuck off”. I’m sure, however, that this fact didn’t affect the study – in fact, the experimental results for this stem were 50-50.
RE Zuraw:
So, I’m interested in the “loose boundary” effect, which is mentioned at the beginning of the paper, then quietly pushed to the side and never spoken of again.
Specifically, Zuraw states that “Nasal substitution is less likely to occur at “looser” morpheme boundaries – that is, with semantically transparent prefixes and low-frequency words” (432).
If the “loose boundary effect” is real, I’m wondering how we would expect this to interact with the nonce-word tests Zuraw administered to Tagalog speakers. The nonce words she constructed seem to fit the “loose boundary” definition perfectly: being nonce-words, they are obviously low-frequency, and since Zuraw uses the same prefix with the same meaning throughout, the prefix’s meaning should become more transparent as the experiment progresses (if it wasn’t transparent to begin with).
Both of Zuraw’s nonce-word experiments show significant effects of voicing. But given that all of the nonce words Zuraw uses have “loose boundaries,” shouldn’t we expect nasal substitution to apply infrequently, regardless of phonological context? Or does this depend on the ranking of MORPHEME COHESION (the constraint that penalizes substitution across a loose boundary)? Zuraw is never really clear about where this constraint is ranked relative to the constraints that motivate nasal substitution, but she doesn’t mention any attested interaction between boundary looseness and place/voicing, which suggests that MORPHEME COHESION should rank above all of the constraints that motivate nasal substitution. In that case, it is a surprise to see significant nasal substitution rates in the nonce-word task.
If the discrepancy between the predictions of the “loose boundary” condition and the results of the nonce-word tasks really is unexpected under Zuraw’s analysis, how do we account for it? Is the “loose boundary” condition poorly defined? Is there something wrong with Zuraw’s analysis? Or does this indicate that participants don’t treat nonce-word tasks as “real” language-use situations?
Zuraw:
In the comparison between Stochastic OT and MaxEnt, Zuraw points out that the voicing difference is small (6%) and the place difference is smaller (.01%). Both of these are smaller than what the constraint set gives *prior to learning* in Stochastic OT (Fig. 11). This seems worrisome. What it suggests to me is that there’s a scaling problem: the default equal ranking values for Stochastic OT do well, but we’d need higher starting weights in MaxEnt. Maybe the small differences observed *with* learning would then be big enough in MaxEnt.
I’d like to hear how this stuff jibes (or not) with your experimentation with regularization and generalization in MaxEnt grammars…
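To make the scaling worry above concrete, here is a toy illustration of how the overall magnitude of MaxEnt weights controls how large the predicted differences come out; the numbers and violation vectors are invented, not Zuraw’s.

```python
import math

def maxent_prob(weights, violations_by_cand, target):
    """Probability of `target` under a MaxEnt grammar: exp(harmony) / Z."""
    harmonies = {cand: -sum(w * v for w, v in zip(weights, viols))
                 for cand, viols in violations_by_cand.items()}
    z = sum(math.exp(h) for h in harmonies.values())
    return math.exp(harmonies[target]) / z

cands = {"substituted": [1, 0], "unsubstituted": [0, 1]}   # hypothetical violation vectors
for scale in (1, 2, 4):
    weights = [scale * 1.0, scale * 1.1]                   # same ratio, different scale
    print(scale, round(maxent_prob(weights, cands, "substituted"), 3))
# 1 -> 0.525, 2 -> 0.550, 4 -> 0.599: the predicted gap between candidates grows
# as all the weights are scaled up, even though their relative sizes are unchanged.
```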
So I don’t know where to post this, but this is about Rumelhart and McClelland.
I wonder about how they establish distinct epochs for training/testing in which words of a given frequency are simply present or absent. Would we expect the same effects if we actually sampled according to the words’ distributions? Was there any reason (besides perhaps something practical about computational power) to do things this way?
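Here is a small sketch of the contrast I have in mind: a present-or-absent staging of the vocabulary versus sampling every epoch in proportion to token frequency. The word lists, counts, and epoch numbers are all invented for illustration, not Rumelhart & McClelland’s actual training set.

```python
import random

high_freq = ["go", "take", "come"]         # hypothetical high-frequency verbs
low_freq = ["glean", "strive", "wend"]     # hypothetical low-frequency verbs
token_counts = {"go": 900, "take": 700, "come": 600,
                "glean": 3, "strive": 2, "wend": 1}

def staged_epochs(n_epochs=20, switch_at=10):
    """Present-or-absent regime: low-frequency verbs simply do not occur
    in training until a later epoch."""
    for epoch in range(n_epochs):
        vocab = high_freq if epoch < switch_at else high_freq + low_freq
        yield epoch, vocab

def frequency_sampled_epochs(n_epochs=20, tokens_per_epoch=50):
    """Alternative regime: every epoch draws tokens in proportion to the
    verbs' (hypothetical) token frequencies, so rare verbs trickle in."""
    verbs = list(token_counts)
    weights = [token_counts[v] for v in verbs]
    for epoch in range(n_epochs):
        yield epoch, random.choices(verbs, weights=weights, k=tokens_per_epoch)
```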