In our 2012 SIGMORPHON paper, we proposed the following measure of categorical success in MaxEnt learning with hidden structure, in this case Underlying Representations (URs) given only observed Surface Representations (SRs) (pp. 67-68):
Our objective function is stated in terms of maximizing the summed probability of all (UR, SR) pairs that have the correct SR, and an appropriate criterion is therefore to require that the summed probability over full structures be greater for the correct SR than for any other SR. We thus term this simulation successful. We further note that given a MaxEnt grammar that meets this criterion, one can make the probabilities of the correct forms arbitrarily close to 1 by scaling the weights (multiplying them by some constant).
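For concreteness, here is a minimal sketch (my own, not code from the paper) of the quantities that passage refers to: a candidate is a full structure pairing an overt form with a hidden parse, its probability is proportional to the exponential of its weighted harmony, and the probability of an overt form is the sum over all of its parses. The helper names (`maxent_probs`, `overt_prob`) and the representation of violations are assumptions for illustration.

```python
import math

def maxent_probs(candidates, weights):
    """MaxEnt probability of each full structure (overt form + hidden parse).

    candidates: list of (overt_form, parse, violations) triples, where
    violations maps constraint names to violation counts.
    weights: dict of constraint weights (assumed non-negative).
    """
    harmonies = [-sum(weights.get(c, 0.0) * v for c, v in viols.items())
                 for _, _, viols in candidates]
    z = sum(math.exp(h) for h in harmonies)          # normalizing constant
    return [math.exp(h) / z for h in harmonies]

def overt_prob(candidates, weights, overt):
    """Summed probability of all full structures sharing one overt form."""
    probs = maxent_probs(candidates, weights)
    return sum(p for (form, _, _), p in zip(candidates, probs) if form == overt)
```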
Unfortunately, the claim in the final sentence of that passage is false, and our success criterion does not seem stringent enough, since a grammar that meets it is not necessarily correct in the sense we would like.
Here’s a simple counter-example to that claim, involving metrical structure rather than URs. We have a trisyllable with two parses that generate medial stress, and a single parse each for initial and final stress. Stress is marked with a capital A, and footing is shown in parentheses. The probabilities below come from zero weights on all constraints except “Iamb”, which wants the foot to be right-headed and thus penalizes candidates 2 and 3; here Iamb has weight 0.1.
1. batAma (batA)ma 0.2624896
2. batAma ba(tAma) 0.2375104
3. bAtama (bAta)ma 0.2375104
4. batamA ba(tamA) 0.2624896
The summed probability of rows 1 and 2 is 0.50, and thus this grammar meets our definition of success if the target language has medial stress. But no matter how high we make the weight of Iamb, that sum will never exceed 0.50: raising Iamb’s weight moves probability from candidates 2 and 3 to candidates 1 and 4 symmetrically, so the total for batAma (rows 1 and 2) stays at exactly 0.50. (Another demonstration would have been simply to leave all weights at zero, since scaling then has no effect, and batAma again has probability 0.50.) A correct grammar in the sense we would like also needs non-zero weight on a constraint that prefers candidate 1 over candidate 4 (e.g. Align-Left).
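Continuing the sketch above (the hypothetical `overt_prob` helper, with my own guesses at the Align-Left violation counts), we can check both halves of the argument: with only Iamb weighted, the summed probability of the two batAma parses is pinned at exactly 0.50, whereas adding and then scaling weight on Align-Left pushes it toward 1.

```python
# (overt form, parse, violations) -- Align-Left counts are my assumption:
# one violation per syllable separating the foot from the left word edge.
candidates = [
    ("batAma", "(batA)ma", {"Iamb": 0, "Align-Left": 0}),
    ("batAma", "ba(tAma)", {"Iamb": 1, "Align-Left": 1}),
    ("bAtama", "(bAta)ma", {"Iamb": 1, "Align-Left": 0}),
    ("batamA", "ba(tamA)", {"Iamb": 0, "Align-Left": 1}),
]

# Scaling Iamb alone: probability just moves from parse 2 to parse 1
# (and from 3 to 4), so the batAma total never leaves 0.50.
for w in (0.1, 1.0, 10.0, 100.0):
    print(w, round(overt_prob(candidates, {"Iamb": w}, "batAma"), 4))  # always 0.5

# With non-zero weight on Align-Left as well, scaling both weights by a
# constant drives the probability of batAma arbitrarily close to 1.
for scale in (1, 5, 20, 100):
    weights = {"Iamb": 0.1 * scale, "Align-Left": 0.1 * scale}
    print(scale, round(overt_prob(candidates, weights, "batAma"), 4))
```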
So what’s the right definition? One obvious possibility would be to require a single correct candidate to have the highest probability, which corresponds to a categorical version of HG (see this paper for some discussion of the relationship between categorical HG and MaxEnt), but that seems wrong given our objective function, which doesn’t have that structure (though see my comment on this post for more on this). Another would be to require some arbitrary amount of probability on the correct form, but we could construct another counter-example simply by making the set of parses corresponding to one overt form sufficiently large w.r.t. the others. It seems the right answer would involve knowing the conditions under which it is in fact true that scaling will bring probabilities arbitrarily close to 1, but I don’t know what they are when hidden structure is involved.
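To make that second problem concrete: under an all-zero grammar the probability of an overt form is just its share of the parses, so any fixed probability threshold can be met purely by parse-counting. A toy calculation (my own, not from the paper):

```python
# If the target overt form has n parses and its competitors have m parses in
# total, then with all weights at zero its probability is n / (n + m).
# That exceeds any fixed threshold below 1 once n is large enough, even though
# scaling the (zero) weights does nothing and the grammar is not correct in
# the intended sense.
def zero_weight_prob(n_target_parses, n_other_parses):
    return n_target_parses / (n_target_parses + n_other_parses)

for n in (1, 3, 9, 99):
    print(n, zero_weight_prob(n, 1))   # 0.5, 0.75, 0.9, 0.99
```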