January | 2016 | UMass Computational Phonology

From an e-mail from Paul Smolensky, March 28, 2015. Even though he wasn’t doing phonology in the mid-1980s when he coined the term “Harmony Theory”, Paul had apparently taken a course on phonology with Jorge Hankamer and found vowel harmony fascinating.

“Harmony” in “Harmony Theory” arises from the fact that the Harmony function is a measure of *compatibility*; the particular word was inspired by vowel harmony, and by the letter ‘H’ which is used in physics for the Hamiltonian or energy function, which plays in statistical mechanics the same mathematical role that the Harmony function plays in Harmony Theory: i.e., the function F such that prob(x) = k*exp(F(x)/T).

(Although I took the liberty of changing the sign of the function; in physics, it’s p(x) = k*exp(-H(x)/T), in Harmony Theory, it’s p(x) = k*exp(H(x)/T). That’s because it drove me crazy working in statistical mechanics that that minus sign kept coming and going and coming and going from equation to equation, leading to countless errors; I just dispensed with it at the outset and cut all that nonsense off at the pass.)
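The two conventions describe the same distribution once the sign is moved into the function. Here is a minimal sketch of that equivalence, assuming a made-up table of four energies and the names `boltzmann` and `harmony_dist`, none of which come from the e-mails:

```python
import math

def boltzmann(energies, T=1.0):
    """Physics convention: p(x) = k * exp(-E(x)/T)."""
    weights = [math.exp(-e / T) for e in energies]
    k = 1.0 / sum(weights)  # normalizing constant
    return [k * w for w in weights]

def harmony_dist(harmonies, T=1.0):
    """Harmony Theory convention: p(x) = k * exp(H(x)/T)."""
    weights = [math.exp(h / T) for h in harmonies]
    k = 1.0 / sum(weights)  # normalizing constant
    return [k * w for w in weights]

# Hypothetical energies for four states (illustrative values only).
energies = [0.0, 1.0, 2.5, 4.0]
harmonies = [-e for e in energies]  # sign pushed into the function: H = -E

p_energy = boltzmann(energies, T=2.0)
p_harmony = harmony_dist(harmonies, T=2.0)
```

Both lists hold the same probabilities: the lowest-energy state is the highest-Harmony state, and it is the most probable one under either convention.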

From an e-mail from Mark Johnson Jan. 16th, 2016:

I always thought the reason why the physicists had a minus sign in the exponential was that otherwise temperatures would have to be negative. But I guess you can push the negation into the Hamiltonian, which is perhaps what Paul did.

From an e-mail from Paul Smolensky, Feb. 10th, 2016:

Yes, that’s just what I did. Instead of minimizing badness I switched to maximizing goodness. I’m just that kind of guy.

From an e-mail from Mark Johnson, Feb. 10th, 2016:

Probabilities are never greater than one, so log probabilities are always less than or equal to zero. So a negative log likelihood is always a non-negative quantity, and smaller negative log likelihood values are associated with more likely outcomes. So one way to understand the minus sign in the Gibbs-Boltzmann distribution is that it makes H(x) correspond to a negative log likelihood.

But I think one can give a more detailed explanation.

In a Gibbs-Boltzmann distribution p(x) = k*exp(-H(x)/T), H(x) is the energy of a configuration x.

Because energies H(x) are non-negative (which follows from the definition of energy?), and given a couple of other assumptions (e.g., that there are an infinite number of configurations and energies are unbounded — maybe other assumptions will do?), it follows that probability must decrease with energy, otherwise the inverse partition function k would not exist (i.e., the probability distribution p(x) would not sum to 1).

So if the minus sign were not there, the temperature T (which relates energy and probability) would need to be negative. There’s no mathematical reason why we couldn’t allow negative temperatures, but the minus sign makes the factor T in the formula correspond much more closely to our conventional understanding of temperature.
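The normalization argument can be checked numerically. The sketch below assumes a toy system that is not part of the e-mail: countably many states n = 0, 1, 2, ... with energy E(n) = n and T = 1. With the minus sign the partial sums of the weights settle on a finite value, so k exists; without it they grow without bound.

```python
import math

def partial_sum(sign, T, n_states):
    """Sum of exp(sign * E(n) / T) over the first n_states states, with E(n) = n."""
    return sum(math.exp(sign * n / T) for n in range(n_states))

T = 1.0
sizes = (10, 100, 500)

# With the minus sign: a convergent geometric series.
with_minus = [partial_sum(-1, T, n) for n in sizes]
limit = 1.0 / (1.0 - math.exp(-1.0 / T))  # closed-form limit of the series

# Without the minus sign (and T > 0): the sum keeps growing.
with_plus = [partial_sum(+1, T, n) for n in sizes]
```

Here 1/`limit` plays the role of the inverse partition function k. No such constant exists for `with_plus`, unless T is allowed to be negative.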

In fact, I think it is amazing that the constant T in the Gibbs-Boltzmann formula denotes exactly the pre-statistical mechanics concept of temperature (well, absolute temperature in Kelvin). In many other domains there’s a complex relationship between a physical quantity and our perception of it; what is the chance of a simple linear relationship like this for temperature?

But perhaps it’s not a huge coincidence. Often our perceptual quantities are logarithmically related to physical quantities, so perhaps it’s no accident that T is inside the exp() rather than outside (where it would show up as an “exponential temperature” term). And the concept of temperature we had before Gibbs and Boltzmann wasn’t just a naive perception of warmth; there had been several centuries of careful empirical work on properties of gases, heat engines, etc., which presumably led scientists to the right notion of temperature well before the Gibbs-Boltzmann relationship was discovered.

From an e-mail from Paul Smolensky March 27, 2016:

Here are some quick thoughts.

0. Energy E in physics is positive. That’s what forces the minus sign in p(x) \propto exp(-E(x)/T), as Mark observes.

Assuming x ranges over an infinite state space, the probability distribution can only be normalized to sum to one if the exponential approaches zero as x -> infinity, and if E(x) > 0 and T > 0, this can only happen if E(x) -> infinity as x -> infinity and we have the minus sign in the exponent.

1. Why is physical E > 0?

2. Perhaps the most fundamental property of E is that it is conserved: E(x(t)) = constant, as the state of an isolated physical system x(t) evolves in time t. From that point of view there’s no reason that E > 0; any constant value would do.

3. For a mechanical system, E = K + V, the sum of the kinetic energy K derived from the motion of the massive bodies in the system and the potential energy V. Given Newton’s second law, F = ma = m dv/dt, E is conserved when F = -grad V and K = mv^2/2, because then

dE/dt = d(mv(t)^2/2)/dt + dV(x(t))/dt = mv dv/dt + dx/dt . grad V = v(ma) + v(-F) = 0; that’s where the minus sign in -grad V comes from.

Everything in the equation E = K + V could be inverted, multiplied by -1, without change in the conservation law. But the commonsense meaning of “energy” is something that should increase with v, hence K = mv^2/2 rather than -mv^2/2.
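The role of the minus sign in F = -grad V can be seen in a short simulation. The sketch below makes several assumptions that are not in the e-mail: a one-dimensional harmonic potential V(x) = k x^2/2, unit mass, a velocity Verlet integrator, and a `force_sign` switch. It tracks E = K + V along the trajectory for both choices of sign.

```python
def simulate(force_sign=-1.0, x0=1.0, v0=0.0, m=1.0, k=1.0, dt=1e-3, steps=10_000):
    """Integrate m*dv/dt = force_sign * grad V with V(x) = k*x^2/2 (velocity Verlet).

    Returns the list of E = K + V values along the trajectory.
    """
    def grad_V(x):
        return k * x

    def energy(x, v):
        return 0.5 * m * v * v + 0.5 * k * x * x  # E = K + V

    x, v = x0, v0
    a = force_sign * grad_V(x) / m
    energies = [energy(x, v)]
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt
        a_new = force_sign * grad_V(x) / m
        v += 0.5 * (a + a_new) * dt
        a = a_new
        energies.append(energy(x, v))
    return energies

def drift(energies):
    """Spread between the largest and smallest energy seen."""
    return max(energies) - min(energies)

conserved = drift(simulate(force_sign=-1.0))      # F = -grad V
not_conserved = drift(simulate(force_sign=+1.0))  # F = +grad V
```

With F = -grad V the energy stays constant up to a small integration error. With the sign flipped, K + V grows rapidly. Multiplying every stored value by -1 would leave `conserved` unchanged, which is the sense in which the overall sign of E is a convention.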

4. Although K = mv^2/2 > 0, V is often negative.

E.g., for the gravitational force centered at x = 0, F(x) = -GmM x/|x|^3 = -grad V if V(x) = -GmM/|x| < 0

(any constant c can be added to this definition of V without consequence; but even so, for sufficiently small x, V(x) < 0)

Qualitatively: gravitational force is attractive, directed to the origin in this case, and this force is -grad V, so grad V must point away from the origin, so V must increase as x increases, i.e., must decrease as x decreases. V must fall as 1/|x| in order for F to fall as 1/|x|^2, so the decrease in V as x -> 0 must take V to minus infinity.
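A finite-difference check of the relation F = -grad V for this potential is sketched below. The units (G = m = M = 1), the test point, and the function names are assumptions chosen for illustration only.

```python
import math

G, m, M = 1.0, 1.0, 1.0  # illustrative units, not physical values

def V(x):
    """Gravitational potential energy V(x) = -G*m*M/|x|."""
    r = math.sqrt(sum(c * c for c in x))
    return -G * m * M / r

def force(x):
    """Analytic force F(x) = -G*m*M * x/|x|^3."""
    r = math.sqrt(sum(c * c for c in x))
    return [-G * m * M * c / r**3 for c in x]

def neg_grad_V(x, h=1e-6):
    """Central-difference estimate of -grad V at x."""
    result = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += h
        x_minus = list(x)
        x_minus[i] -= h
        result.append(-(V(x_plus) - V(x_minus)) / (2 * h))
    return result

x = [1.0, 2.0, 2.0]      # a point with |x| = 3
analytic = force(x)
numeric = neg_grad_V(x)
closer = [0.1, 0.2, 0.2]  # same direction, ten times nearer the origin
```

The two force vectors agree to within the finite-difference error and point toward the origin, and `V(closer)` is lower (more negative) than `V(x)`, matching the qualitative argument.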

5. In the cognitive context, it’s not clear there’s anything corresponding to the kinetic energy of massive bodies. So it’s not clear there’s anything to fix a part of E to be positive; flipping E by multiplication by -1 doesn’t seem to violate any intuitions. Then, assuming we keep T > 0, we can (must) drop the minus sign in p(x) \propto exp(-E(x)/T) = exp(H(x)/T) where we define Harmony as H = -E. Now the probability of x increases with H(x); lower H is avoided, hence higher H is “better”, hence the commonsense meaning of “Harmony” has the right polarity.

E-mail from Mark Johnson March 27, 2016

Very nice! I was thinking about kinetic energy, but yes, potential energy (such as gravitational energy) is typically conceived as negative (I remember my high school physics class, where we thought of gravitational fields as “wells”). I never thought about how this is forced once kinetic energy is positive.

Continuing in this vein, there are a couple of other obvious questions once one thinks about the relationship between Harmony theory and exponential models in physics.

For example, does the temperature T have any cognitive interpretation? That is, is there some macroscopic property of a cognitive system that T represents?

More generally, in statistical mechanics the number (or more precisely, the density) of possible states or configurations varies as a function of their energy, and there are so many more higher energy states than lower energy ones that the typical or expected value of a physical quantity like pressure is not that of the more probable low energy states, but instead determined by the more numerous, less probable higher energy states.
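A toy version of this effect can be worked out exactly. The sketch below assumes a hypothetical system that is not in the e-mail: 100 independent binary units, with the energy of a configuration equal to the number of active units, and T = 1. Each single configuration with energy k has weight exp(-k/T), but there are C(100, k) of them.

```python
import math

def energy_level_distribution(n_units, T):
    """P(E = k) for n_units independent binary units, where E counts active units.

    Each single state with energy k has weight exp(-k/T); there are
    C(n_units, k) such states (the density of states).
    """
    weights = [math.comb(n_units, k) * math.exp(-k / T) for k in range(n_units + 1)]
    Z = sum(weights)
    return [w / Z for w in weights]

N, T = 100, 1.0
p_level = energy_level_distribution(N, T)

# The single most probable configuration is the all-off state, with energy 0.
p_lowest_state = p_level[0]

# The most probable energy level and the expected energy lie far above 0,
# because higher levels contain vastly more configurations.
most_probable_level = max(range(N + 1), key=lambda k: p_level[k])
expected_energy = sum(k * p for k, p in enumerate(p_level))
```

In this toy case the expected energy is N/(1 + e), about 27, even though the lowest-energy configuration is individually the most probable one; its total probability is negligible.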

I’d be extremely interested to hear if Paul knows of any cases where this or something like it occurs in cognitive science. I’ve been looking for convincing cases ever since I got interested in Bayesian learning! The one case I know of has to do with “sparse Dirichlet priors”, and it’s not exactly overwhelming.

E-mail from Paul Smolensky, March 27, 2016

The absolute magnitude of T has no significance unless the absolute magnitude of H does, which I doubt. So I’d take Mark’s question about T to come down to something like: what’s the cognitive significance of T -> 0 or T -> infinity or T ~ O(1)?

And I’d look for answers in terms of the cognitive role of different types of inference. T -> 0 gives maximum-likelihood inference; T -> infinity gives uniform sampling; T ~ O(1) gives sampling from the distribution exp(H(x)). Mark, you’re in a better position to interpret the cognitive significance of such inference patterns.
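The three regimes can be illustrated with a small computation. The sketch below assumes four candidates with made-up Harmony values, and uses finite stand-ins for the limits (T = 0.01 and T = 10^6).

```python
import math

def harmony_probs(harmonies, T):
    """p(x) proportional to exp(H(x)/T), computed stably by subtracting max H."""
    h_max = max(harmonies)
    weights = [math.exp((h - h_max) / T) for h in harmonies]
    Z = sum(weights)
    return [w / Z for w in weights]

harmonies = [2.0, 1.0, 0.0, -1.0]  # hypothetical Harmony values for four candidates

cold = harmony_probs(harmonies, T=0.01)  # T -> 0: mass concentrates on the max-Harmony candidate
warm = harmony_probs(harmonies, T=1.0)   # T ~ O(1): p(x) proportional to exp(H(x))
hot = harmony_probs(harmonies, T=1e6)    # T -> infinity: approaches the uniform distribution

# Rescaling H and T together leaves the distribution unchanged,
# so only the ratio H/T matters.
rescaled = harmony_probs([10.0 * h for h in harmonies], T=10.0)
```

At low T nearly all probability sits on the highest-Harmony candidate, at high T the four candidates are nearly equiprobable, and `rescaled` matches `warm`, which is one way to see why the absolute magnitude of T is meaningful only relative to that of H.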

As for the question of density of states of different Harmony/energy, the (log) density of states is essentially the entropy, so any cognitive significance entropy may have — e.g., entropy reduction as predictor of incremental sentence processing difficulty à la Hale — qualifies as cognitive relevance of density of states. As for the average value of a quantity reflecting less-probable-but-more-numerous states more than more-probable states, I’m not sure what the cognitive significance of average values is in general.


What’s Harmony?