“On the expressive power of programming languages”

In response to a Keanu moment I recently had, Arjun recommended I read On the expressive power of programming languages. The paper is divided into roughly two parts: in the first part, Felleisen sets up the formal framework for describing the expressiveness of programming languages. In the second part, he illustrates the repercussions with Scheme.

Let $$\mathbb{F}_1,\ldots,\mathbb{F}_n,\ldots$$ be a set of constructs we wish to add to a language $$\mathcal{L}_0$$. Let this new language be denoted by $$\mathcal{L}_1$$. These languages have equivalent expressiveness if (a) adding the constructs does not change the semantics of the subset of $$\mathcal{L}_1$$ that can be expressed with just the constructs in $$\mathcal{L}_0$$ and (b) the newly added constructs can be expressed using local changes in the program. An important enforcement of the locality requirement is that the mapping between the original language and the extension must be the identity on the formulae in the language and a homomorphism on the local connectors. The formalism that he defines enforces these requirements. The main conclusion of the paper is that “…an increase in expressive power comes at the expense of less ‘intuitive’ semantic equivalence relations.”

The proofs are quite nice and clear. It’s a journal article, so it’s long, but thorough. There’s an interesting discussion in the second half about how introducing abort and call/cc mess up the semantics of the program. I had had difficulty imagining concrete examples to illustrate some of the issues that showed up as abstractions in the first half, so the Scheme section was helpful.

It isn’t clear to me whether there was a big, definitive result that came out of this work. I’ll have to investigate more.

Participation and Contribution in Crowdsourced Surveys

Participation and Contribution in Crowdsourced Surveys, a recent PLOSOne article, discusses some interesting approaches to crowdsourced surveys. Not only are the answers crowdsourced, but the questions themselves are also crowdsourced. The surveys are seeded with a small number of questions and later augmented with questions supplied by respondents. These questions are curated by hand and presented to respondents in a random order.

Approach and comparison with Quizz

The authors accomplished this by setting up three separate websites to collect the data. The only one that is still active is for Energy Minder which is an ongoing research project. The other two surveys were about BMI and personal finance.

The motivation for this work is very similar to Quizz. The authors state:

The crowdsourcing method employed in this paper was motivated by the hypothesis that there are many questions of scientific and societal interest, for which the general (non-expert) public has substantial insight. For example, patients who suffer from a disease, such as diabetes, are likely to have substantial insight into their condition that is complementary to the knowledge of scientifically-trained experts. The question is, how does one collect and combine this non-expert knowledge to provide scientifically-valid insight into the outcome of interest.

Like Quizz, their system eschews financial incentives for survey completion. Unlike Quizz, new questions are added by the users themselves, rather than by a system. In Quizz, the objective is to complete a knowledge base — responses to questions are point estimates. In this system, the questions serve as features designed to predict a variable of interest, whether it be energy consumption, BMI, or the amount an individual has in their savings. The paper does not explicitly state the measurement level of the outcome variable; it isn’t clear if, for example, energy consumption is a binary variable (high/low), a categorical variable defined by buckets, or a real-valued prediction of kWh.

Questions, Observations, Insights

  • Are there any baseline/ground-truth studies? All three surveys ask questions that could be influenced by a variety of biases (e.g., Vermonters are hippies who desire a low energy footprint, biases against overweight and poor people, etc.).
  • One big advantage of using crowdsourced questions is that they can give insights for how to get around social desirability bias. This isn’t discussed in the paper, but would of interest to social scientists.
  • Early in the paper they state, “…a real-time machine learning module continuously builds models from this growing store of data to predict the outcome variable, and the resulting predictions are shown to the users.” The machine learning module they refer to is Hod Lipson’s symbolic regression package. It’s not clear to me when the predictions are shown. Aren’t there methodological issues with telling the respondent what you’re trying to predict? Although this can sometimes be the case, social desirability and other biases may have a significant impact on the outcome variable.
  • Related work: Program Boosting uses GP and crowdsourcing.
  • “If the user decides to submit their own question, they are then required to respond to it. We have found that being forced to immediately respond to the question helps users to recognize when their questions are unclear or confusing.” Do they have revision data for questions? Or do respondents just re-enter the question if it isn’t clear? Is there feedback on question clarity, or is this something that the human curator determines? It’s not clear to me how this works, but this data might be an interesting feature to use in quality control.
  • Beyond surveys, this is an interesting way to collect features for some other task. The questions are basically features here.
  • Problems of propagation of error could be connected to issues we’re looking at in Automan vis a vis multiple comparisons.
  • The learning module: could we use these techniques to build up blocks dynamically? Learn blocks?

Criticisms

  • The validation checks (e.g. valid ranges) are a very weak adversarial model. Since there are no financial incentives for this survey, a greater threat to validity are inattentive respondents.
  • I’d like to see a stronger comparison with the active learning literature. There are issues of compounded error when using stepwise regression and the kind of user-generated question dogfooding fu that’s happening here. I suspect the active learning literature addresses some of these issues and would give insight into how to have greater statistical validity.
  • Testing for a correlation coefficient different from 0 is too sensitive. This hardly ever happens. To guard against this, or at least establish a kind of prior on false correlations, the authors could inject seemingly unrelated questions into the survey. Of course, there is some probe bias here that could cause unintended consequences, so it would have to be thought out carefully. I’m just not satisfied with, “The lack of correlation between participation and contribution falsifies the hypothesis that a higher level of participation is indicative of interest in and knowledge of the subject area.”
  • Also asked above: what’s the baseline? What’s to stop the system from predicting the most common answer, given the class? How does this perform against a naive Bayes or decision tree classifier?
  • I would like to see some regularlization in the modelling. Symbolic regression can be very sensitive to outliers. I’m not sure what’s in this implementation, though. The paper would benefit from a discussion of regularization.

Rage Against the Machine Learning

Woah, it’s been a while since I actually made a blog post. I’m going to to try to make a blog post every day until I head off for my Facebook internship in June (super psyched!). I’m putting it in writing, so that’s how you know I’m definitely going to break my word…

I owe the above title to Bryan Ford, who listened to me raging against Machine Learning this afternoon. Bryan is visiting for our first systems lunch in a while. It’s been a long hiring season…

Anyway, back to the complaint at hand. Today was the final machine learning class of the semester and people were presenting their projects. One of the groups wasn’t really a group, so much as two students working loosely on the same approach to different problems. At the end of their presentation, I asked what commonalities there were between the two projects, other than both using CRFs to model the problem. They said that they were both trying to show that some kind of regularization would cause improvement on the state of the art of their respective data sets. I then asked if they were using the same base code. The students responded that they were not: one was implemented in Matlab and the other in Scala. They had reasons for this (mainly, the Scala student uses Scala in his lab, and he seemed to have scared off the Matlab student from using Scala because of its “learning curve” — see, Scala, you seriously need some better PR). I was troubled that they weren’t using comparable backends, since this meant they would definitely not be able to generalize without modeling the impact of measurement on their results. I didn’t ask if they were implementing the state of the art in their systems, so as to make them comparable (although I should have). In the end, I thought maybe it wouldn’t have that much of an impact.

At the end of the class, I picked up my final “mini-project.” We were given some sample output to test against and for my implementation of a one-vs-all multiclass classifier using a polynomial kernel, I was only able to get a test accuracy of 78.70%, versus the expected test accuracy of 87.3%. The TA (whom I don’t mean to criticize — he’s done a great job in this class, and I really appreciate it!) had written “too low, but I couldn’t find errors in your code. May be some small bugs.” Now, I had originally implemented other versions of other parts of this project; they too were also too low. Those older versions of the code were written using for-loops. After porting my loops over to matrix multiplications, I was able to improve the accuracy of my test set to equal that of the provided checkpoints. All this is to say that I strongly suspect the offending code (then one-vs-all multiclass classifier using a polynomial kernel) is semantically equivalent to a matrix-operation-only version, but will produce “better” results. This (of course!) is an exercise in dealing with numeric stability.

The futility of comparisons?

So of course looking at my homework and reflecting on my question for my colleagues, I felt confirmation that since we can’t even compare semantically equivalent computations when written in the same language, how can we compare across languages? Part of the reason why people use libraries like BLAS is because they are very efficient, but it also has the effect of determinism when comparing against other implementations — we don’t have faithful models of computation to provide meaningful comparisons between implementations when people use different floating point operations on different architectures. These issues are clearly tied to problems of reproducibility that have been getting a lot of buzz in our community lately.

Topical: Measurement

Immediately before class I had seen Andrew Gelman’s post on how we undervalue measurement in statistics. Although numerical analysis has been tackled by a lot of people who are way smarter than I am, it still causes problems for practitioners. With the rise of deep learning, big data, and data science, more and more people are using tools whose shortcomings are non-obvious. As a computer scientist, I am aware that floating point numbers are problematic, and that compounded floating point error can be a problem. I was still shocked by how much of a problem it was. I expected a small amount of variation, but was surprised that my result deviated as much as is did. Of course, the problem could still be in the code itself — after all, it isn’t statically verified. However, I still noticed differences in test accuracy for the other code that I converted from loops to matrix operations. These differences were on the order of a few percentage points. Although an order of magnitude smaller than the difference in the kernelized one-vs-all classifier, the difference of two percentage points or so actually makes a result publishable (at least in some ML/NLP conferences I’ve seen).

I wonder if we would get some variation in the test accuracy using loops just from randomizing the layout of some junk under the surface. Lisps have been partially addressing numerical stability by having ratio types. There must be some languages that do some semi-symbolic computations, simplifying and deferring floating point operations until a value is needed. If there were true, they would be able to return some bound on the error of that computation.

Rise of the Machines

So is there a way to convert loops to matrix operations automatically? I asked John if he had ever heard of such a thing and he said that there’s “vectorization” (a term I must have heard before, but have forgotten due to my advanced age), but he isn’t aware of any compiler optimizations for as high a level as I was suggesting. Since the Julia folks seem to be at the forefront of numerical computing right now (or are at least getting a lot of inflated press about it right now), I thought I’d look into what they have to say, but it doesn’t look like it. Are there constrained cases where we could do this? Maybe there’s a FFTW-style meta-programming approach someone could employ.