Category Archives: Research

Posts about research.

Notes from Charles Sutton’s Talk

About a month ago, Charles Sutton stopped by UMass to give a talk called “Statistical Analysis Of Computer Programs.” Here are my lightly-edited, cleaned-up (ish) notes on the talk (this approach inspired by ezyang's amazing note-taking abilities):

conceit — text as language
(bad metaphor)
computer program = precise set of instructions
people — more aspects (social interactions)
language a good metaphor as a means of human communication

when writing code, others may want to use later
next person might be you!
research:
— lots of open source code online
— lots of implicit knowledge in how to write software — library APIs, avoiding bugs
— easy to read, easy to maintain
take implicit knowledge and make explicit
writing code in a new library, look at what people did in the past
suggest patterns
key insight: means of communication
regularities of natural language (NL) may be found in programming languages (PL)
no new machine learning
apply existing techniques to statistical NLP, new patterns
coding conventions
— names, formatting
summarization
— get a compressed representation of long verbose source files
mining idioms
— small syntactics patterns in code
describe how we use them

NL coding conventions
local->global movement of abstraction
what kinds of coding conventions?
examples:
junit example
create input stream in java
create an identifier
name of input stream
maybe know the type of name someone would use who contributed to the junit project
formatting — e.g. braces
coding convention is a syntactic constraint beyond that imposed by the language's grammar
programmers themselves decide to impose on top of what the compiler requires them to do
developers care a lot
style checkers, style guides

“small amount” of research in software engineering
how they use these in software engineering…

go through conventions, find out what’s important — big commercial (microsoft) projects

code review threads — different aspects of code committed

38 percent related to conventions rather than functionality
(why i hate code review!)

why not just run a formatter over the code?
corner cases that it doesn't handle for you
renaming variables to be more consistent ? (jsnice)
review time — can talk more about the functionality
where do conventions come from?
– implicit from code base
one programmer starts, others pick it up
emergent quality
mores, rather than laws
large number of software constraints, well modeled with statistical machine learning
even with a lot of programming experience, won't know about how things are named
coding convention inference problem –> why not use machine translation to take my conventions and change them to your conventions?

eclipse plugin called devstyle

click on your identifier, will give you a list of other names
renaming suggestions
how should class objects be named
some disagreement with conventions
we can suggest a name
go through a region and rank names
block of code someone wants to add to the project

how large of a corpus would you need?

what kind of technology do you use inside the scoring function?
n-gram language model

smoothing, taking into account the constraints of the compiler/language conventions
constraint is library of changes
only renaming, not generating code (generated code would not be guaranteed syntactically correct)

pull together all uses of the identifier

look at the set of all other names that have been used anywhere in previous or succeeding context
ask the n-gram language model for the joint prob. of the entire file — sounds really expensive
— actually uses only the n-grams centered on the thing we want to rename
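
a rough sketch of how i understood the scoring (mine, not the actual tool; the lm object and its logprob method are invented):

# hypothetical trigram scorer: assumes a trained model `lm` with a
# logprob(token, context) method
def score_candidate(lm, occurrences, candidate):
    # occurrences: list of (left_context, right_context) token lists,
    # one pair per use of the identifier
    total = 0.0
    for left, right in occurrences:
        window = left[-2:] + [candidate] + right[:2]
        for i in range(2, len(window)):
            total += lm.logprob(window[i], tuple(window[i - 2:i]))
    return total

def suggest(lm, occurrences, current, candidates, threshold):
    # only surface suggestions that beat the current name by a margin,
    # so the user never sees terrible suggestions
    base = score_candidate(lm, occurrences, current)
    scored = [(score_candidate(lm, occurrences, c), c) for c in candidates]
    return [c for s, c in sorted(scored, reverse=True) if s - base > threshold]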

laura: coreference analysis on the code
— knowing that i and i are the same gives you a nonlinear language model
— can you get something more robust
— tapping the compiler for name resolution
— how do you incorporate it into the model (only incorporated into the suggestions — done post hoc)

score by ngram model, threshold so user doesn’t see terrible suggestions
side effect of architecture:
don’t want names to be very common
system does not choose really common names
sympathetic uniqueness principle
have variable program entity, give unusual name
have a very domain specific thing, such as smoothing, choose an appropriate name, considered appropriate by its statistical properties

in training set, if we have an id that occurs infrequently, give ? for unknown (deals with tails of the distribution)
in the suggestion process for alternatives, use known tokens
whatever the context, if it uses a rare word, don't suggest a change
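
my sketch of the rare-name rule (the counting scheme and threshold are made up):

from collections import Counter

# identifiers below a count threshold are mapped to an unknown token at
# training time; if the current name is unknown, leave it alone rather
# than suggest a rename (the sympathetic-uniqueness behavior)
def build_vocab(training_tokens, min_count=5):
    counts = Counter(training_tokens)
    return {tok for tok, c in counts.items() if c >= min_count}

def maybe_suggest(name, vocab, suggest_fn):
    if name not in vocab:
        return []  # rare, domain-specific name: keep it
    return suggest_fn(name)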

what happens if you have a common word, but there should be an unusual one — don't know if there is an answer to this question

like adding a new table in a Chinese restaurant process (CRP)

formatting conventions
encoding spacing decisions as tokens (indexed by location, foo)

use the same framework for suggestions on this basis
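
a tiny sketch of what encoding spacing decisions as tokens might look like (my guess at the encoding, not theirs):

import re

# emit explicit whitespace tokens so the same n-gram machinery can score
# formatting decisions like anything else
def encode_spacing(code):
    out = []
    for tok in re.findall(r'\s+|\w+|[^\w\s]', code):
        if tok.isspace():
            out.append('<NL>' if '\n' in tok else '<SP>')
        else:
            out.append(tok)
    return out

print(encode_spacing("if (x) {"))  # ['if', '<SP>', '(', 'x', ')', '<SP>', '{']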

does this thing work?
evaluation methodology
automatic evaluation:
— doesn't say what this is (is it the generative thing, idempotency — that's how you would explain it)
— should really be doing a case study or user study or human evaluation or whatever

don't want low precision in this tool because people won't use it
95% accuracy — basically, what google does
can do this for other types of identifiers, but it's harder
sympathetic uniqueness — do we rename everything as i?
x axis — how often do we make revisions
y axis — element of surprise — things that were rare, what percentage did we incorrectly try to say were something else
set threshold high, no suggestions, no new names
set threshold low, rename everything

methods, variables, and types are very different
variables and types back off
methods are much more surprising all of the time

naturalize tools

final thing — test on github, submit patches
look at top suggestions on top 5 suggestions
submitted 18 patches
14 accepted
do programmers really care about this kind of name
suggest that their exceptions be renamed from e to t?
t was okay
throwable (t) to e
people accept t
evidence that programmers think about and care about naming
paper on arxiv — fowkes, ranca, allamanis, lapata, sutton

question: exists bad users?
– AI complete question

question: hard metrics for successful renaming — most people like it better?
– programmers are picky
– does it actually make programs easier to maintain

question: fancier language model better?
– yes, think so
– types are names recovering — very conventional (very GP)
– java — names are in 1:1 correspondence with classes

question: run on dynamic languages?
— no results
— corpus of java, c, python
don't know if it's run on python — i don't think it would work as well on python because it is not as redundant as java

new topic:
autofolding to summarize code

summarize, compress out java boilerplate
use code folding (which obviously uses the fact that we know blocks are denoted by braces)

is the summary just folding?

different audiences
task based vs non task based
experience versus novice
expert in project
first look problem — opening single file first time, get overview

TASSAL — tree-based autofolding software summarization algorithm
start with file, parse, say we want to compress certain types — block statements, comments

foldable tree — subset of the AST that contains nodes we could consider folding

file → bag of words — split identifiers by camel case
some of these are going to be generic java stuff
some are concepts used throughout the project

topic models — find characterizing words for files and packages; tried using methods as leaves, but this was too sparse

for each node, pick a mixing distribution

single topic per file — packages or other levels of abstraction <– wonder if you could use this for refactoring

— fit the model
this will give us, for each token in the source file, an indicator variable for whether it is generated from java, the package, or the file (how characteristic it is)
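
my sketch of that indicator (the mixing weights and per-level unigram distributions here are invented):

# posterior over the indicator for one token: was it generated by the
# generic-java background, the project topic, or the file topic?
def responsibility(token, weights, topics):
    # weights: {'java': w1, 'project': w2, 'file': w3}, the mixing distribution
    # topics:  {'java': {token: prob, ...}, ...}, per-level unigram models
    joint = {lvl: weights[lvl] * topics[lvl].get(token, 1e-9) for lvl in weights}
    z = sum(joint.values())
    return {lvl: p / z for lvl, p in joint.items()}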

think of optimization problem
binary vector u
each element indicates whether node is folded
— okay, so topic models for summarization, dug
look at all tokens assigned to file via generative process
empirical distribution of nodes included in the summary

constraint: stay within a budget for the length of the summary

tree consistency
— if a node is included in the summary, must include parents

optimise via greedy algorithm
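
the greedy step as i understood it (a sketch; the node fields are invented, and every node is assumed to have positive length):

# unfold nodes greedily by score-per-token until the summary budget is
# spent; whenever we unfold a node, also unfold its ancestors (the
# tree-consistency constraint)
def autofold(nodes, budget):
    kept, used = set(), 0
    for node in sorted(nodes, key=lambda n: n.score / n.length, reverse=True):
        chain, cur = [], node
        while cur is not None and cur not in kept:
            chain.append(cur)
            cur = cur.parent
        cost = sum(n.length for n in chain)
        if used + cost <= budget:
            kept.update(chain)
            used += cost
    return kept  # everything not in `kept` stays folded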

question: what about naming or other conventions that are drawn from different natural language distributions?
— clusters of developers that follow different conventions — cluster developers together using both formatting and naming — models don't work as well

question: what if i have multiple devs collaborating ON THE SAME FILE?
— run a topic per contributor
— where does z come from?

taking topic models and applying them to names? not done yet

look at example topics
example columns are topics
three background topics (e.g. get string value name type object i)
projects: spring, bigbluebutton
files: datasourceutils, qualsp

to evaluate:
create gold standard
folded files manually to measure precision and recall
compare with
javadocs — always include, add random
shallowest to deepest
expand nodes in order of length
heuristic, but that’s all eclipse is doing — comparing with state of the art

second:
show summaries to developers
6 developers, avg. 4 years industrial experience
— rate conciseness and usefulness
these are more concise and useful
automatic summaries from TASSAL were better than any of the other baselines

third thing:
mining idioms from code using existing NLP tools

what are code idioms?

example: reading into a buffer, iterating over an array
— are these all things that are encapsulated by other abstractions?
opening resources/context — common pattern
need meta variables

code idiom is a syntactic code fragment that recurs frequently across software projects and has a single semantic purpose
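
to make the metavariable idea concrete, a made-up example (mine, not from the talk) of a resource-handling idiom as a literal occurrence and as a template:

# a concrete occurrence...
concrete = '''
with open("data.txt") as f:
    for line in f:
        process(line)
'''

# ...and the corresponding template, with metavariables standing in for
# the parts that vary across occurrences
template = '''
with open($1) as $2:
    for $3 in $2:
        $BODY
'''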

— wondering if you could learn to match ASTs from one language to another (from one that has these higher level abstractions to one that doesn't)

idiom-related tools — intellij and eclipse

no way of identifying which idioms are useful (presumably to add them to the IDEs — how do you find new ones)

other types of code patterns
— surface level — code clones, copy-paste code fragments
— api patterns — usage patterns of methods

idiom mining problem —
can i find these templates?

use a probabilistic grammar
CFG slides, pCFG slide

use tree substitution grammar — a variant of tree adjoining grammar
non-terminal can expand into a tree instead of a list of terminals and nonterminals

can make a probabilistic version

tree substitution grammar over tree nodes and regular expansions
represents a family of idioms

this allows us to represent these idioms
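
a sketch of the representation (the types are mine): a tsg rule is a whole tree fragment whose frontier may contain unexpanded nonterminals, and reading the fragment off with metavariables at those frontier nonterminals gives the idiom:

from dataclasses import dataclass, field

@dataclass
class Fragment:
    label: str                    # token or nonterminal name
    children: list = field(default_factory=list)
    is_nonterminal: bool = False  # frontier nonterminal = metavariable

def render(frag, counter=None):
    # read the idiom off the fragment, numbering the metavariables
    if counter is None:
        counter = [0]
    if frag.is_nonterminal and not frag.children:
        counter[0] += 1
        return f"${counter[0]}"
    if not frag.children:
        return frag.label
    return " ".join(render(child, counter) for child in frag.children)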

approach: infer a probabilistic tree substitution grammar
take a corpus of ASTs
learn the grammar
every tree rule in the TSG that i learn i treat as an idiom

convert into a textual representation

build a library of idioms to show developers
how do we infer the grammar?

maximum likelihood conditioned on the pTSG rules

previous work from sharon goldwater et al
— infer what these trees are, pick list of trees that best explain the corpus
— the number of possible things i could put in theta is intractable; maximum likelihood is degenerate — picks 1:1 rules to trees

don't make tree fragments too big
put a prior on probabilistic grammars
if you're going to add another tree, this is what the idiom would look like
get a joint distribution over pCFGs and source files
dist. over dist. of parse trees

given a corpus of code, get a distribution over probabilistic grammars over the trees i've inferred
type-based MCMC from liang et al; (think this is from liang’s GP-like work)
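
the "new table" intuition from earlier, sketched as the crp predictive rule (the function name and signature are mine):

# probability of generating fragment `frag` at an expansion site: reuse
# is proportional to how often the fragment was used before, while a new
# fragment pays the base pCFG cost, which penalizes big fragments unless
# they recur often
def fragment_prob(frag, counts, total, alpha, pcfg_prob):
    return (counts.get(frag, 0) + alpha * pcfg_prob(frag)) / (total + alpha)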

some questions i didn't catch

mined idioms
iterator, loop through lines, logger for class, string constant
get patterns you would actually find
get something from actual APIs
e.g. database transaction (opening a resource and cleaning up properly)
get the distance between two points in ??
jsoup get html

lots of work in SE in API mining
no syntactically nested things

take a held out set of files
percentage of AST nodes explained by the idioms
existing method for clone detection
completely duplicated ?

copy paste phenomenon

idioms we find occur across projects much more often…?
SE perspective — dozens of papers in SE about copy-paste clones

if these things are really idioms, maybe they will occur more often in example code — actually what happens
from a data set of regular projects on github, we find 22% of idioms found are actually used in examples, higher in stack overflow

finally, can do a co-occurrence matrix — how are idioms used across different projects
eclipse snipmatch
— open source addition to eclipse — manually took 44 snippets, stuff worked or something
can't put all the idioms in the tool — so many were found
considered this as validation that the thing works
interesting that there was one idiom used that is considered bad practice

exploiting that source code is a means of human communication

maybe surprising to people from a different background that you would need to train the model

Logic and Probability

When I first started at UMass, I had effectively no background in statistics or probability. So, when I was taking the first course in the graduate stats sequence, I tried to frame what I was learning in terms of things I already understood. When I saw the conditional probability $$\mathbb{P}(Y\;\vert\; X)$$, I couldn’t help but think:

$$\begin{array}{|l} X \\ \hline \vdots \\ Y \\ \hline \end{array}\\ X \rightarrow Y$$

Assumption seems to be a close analogue of observation, and if we analyze each construct operationally, they both have a strict order (i.e., observe/assume $$X$$, then derive/calculate the probability of $$Y$$). Both hold $$X$$ fixed in some way for part of the calculation. Suppose we then say that $$X$$ implies $$Y$$ with some probability $$p$$. If we denote this as $$X \overset{p}{\rightarrow} Y$$, then we have some equivalence relation where $$X \overset{p}{\rightarrow} Y \equiv \mathbb{P}(X\rightarrow Y) = p \equiv \mathbb{P}(Y\;\vert\;X) = p$$.

Since $$X \overset{p}{\rightarrow} Y$$ is just normal logical implication, with a probability attached, we should be able to use the usual rewrite rules and identities (after all, what’s the point of modeling this as a logic if we don’t get our usual identities, axioms, and theorems for free?). In classical logic, implication is short for a particular instance of disjunction: $$X \rightarrow Y \hookrightarrow \neg X \vee Y$$. We can then rewrite our probabilistic implication as $$\neg X \overset{p}{\vee} Y$$ and say $$\mathbb{P}(\neg X \vee Y) = p \equiv \mathbb{P}(\neg X \cup Y) = p$$.

Similarly, we want to have the usual rules of probability at our disposal, so by the definition of conditional probabilities, $$\mathbb{P}(Y\;\vert\; X) = \frac{\mathbb{P}(Y\;\cap\;X)}{\mathbb{P}(X)}$$. We can apply the above rewrite rule for implication to say $$\mathbb{P}(\neg X \cup Y) = p \equiv \frac{\mathbb{P}(Y\;\cap\;X)}{\mathbb{P}(X)} = p$$. This statement must be true for all events/propositions $$X$$ and $$Y$$.

Let's take a closer look at a subset of events: those where $$X$$ is independent of $$Y$$, denoted $$X \bot Y$$. Independence is defined by the property $$\mathbb{P}(Y\;\vert\; X)=\mathbb{P}(Y)$$. From this definition, we can also derive the identities $$\mathbb{P}(X\cap Y) = \mathbb{P}(X)\mathbb{P}(Y)$$ and $$\mathbb{P}(X\cup Y) = \mathbb{P}(X) + \mathbb{P}(Y) - \mathbb{P}(X)\mathbb{P}(Y)$$, and note that if $$X \bot Y$$, then $$\neg X \bot Y$$ as well. Now we can rewrite $$\mathbb{P}(\neg X \cup Y) = p \equiv \frac{\mathbb{P}(Y\;\cap\;X)}{\mathbb{P}(X)} = p$$ as $$\mathbb{P}(\neg X) + \mathbb{P}(Y) - \mathbb{P}(\neg X)\mathbb{P}(Y) = p \equiv \mathbb{P}(Y) = p$$. Since the relations on either side are equivalent, we can substitute the right into the left and simplify to obtain $$\mathbb{P}(\neg X)(1 - p) = 0$$. Although this looks a little weird, it's still consistent with our rules: we're just saying that when the events are independent (a notion that has no correspondence in our logical framework), the equivalence can only hold if $$X$$ happens almost surely ($$\mathbb{P}(\neg X) = 0$$), so that conditioning on it changes nothing, or if $$p = 1$$, in which case both sides equal 1.

Now let’s consider how this works over events that are not independent. For this example, let’s gin up some numbers:

$$\mathbb{P}(X) = 0.1 \quad\quad \mathbb{P}(Y) = 0.4 \quad\quad \mathbb{P}(X \cap Y) = 0.09$$.

Note that $$X\not\bot\; Y$$ because $$\mathbb{P}(X\cap Y) \not = 0.04$$. Recall that because $$X$$ and $$Y$$ are each supersets of $$X\cap Y$$, neither marginal can have a lower probability than that of the intersection.

Now let’s compute values for either side of the equivalence $$\mathbb{P}(\neg X \cup Y) = p \equiv \mathbb{P}(Y\;\vert\; X) = p$$. First, the conditional probability:

$$\mathbb{P}(Y\;|\; X) = \frac{\mathbb{P}(Y\cap X)}{\mathbb{P}(X)} = \frac{0.09}{0.1} = 0.9 = p$$

Now for the left side of the equivalence, recall the definition of union:
$$\mathbb{P}(\neg X \cup Y) = \mathbb{P}(\neg X) + \mathbb{P}(Y) - \mathbb{P}(\neg X \cap Y)$$.

Since we don't have $$\mathbb{P}(\neg X \cap Y)$$ on hand, we will need to invoke the law of total probability to compute it: $$\mathbb{P}(\neg X \cap Y) = \mathbb{P}(Y) - \mathbb{P}(X\cap Y) = 0.4 - 0.09 = 0.31$$.

We can now substitute values in:
$$\mathbb{P}(\neg X \cup Y) = 0.9 + 0.4 - 0.31 = 0.99 = p$$.

Now our equivalence looks like this:
$$\mathbb{P}(\neg X \cup Y) = 0.99 \equiv \mathbb{P}(Y\;\vert\; X) = 0.9$$,
which isn’t really much of an equivalence at all.
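
For what it's worth, the mismatch is easy to check numerically (a quick sketch; the variable names are mine):

p_x, p_y, p_x_and_y = 0.1, 0.4, 0.09
cond = p_x_and_y / p_x                  # P(Y | X)
p_notx_and_y = p_y - p_x_and_y          # P(not-X and Y)
union = (1 - p_x) + p_y - p_notx_and_y  # P(not-X or Y)
print(round(cond, 2), round(union, 2))  # 0.9 0.99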

So what went wrong? Clearly things are different when our random variables are independent. Throughout the above reasoning, we assumed there was a correspondence between propositions and sets. This correspondence is flawed. Logical propositions are atomic, but sets are not. The intersection of non-independent sets illustrates this. We could have identified the source of this problem earlier, had we properly defined the support of the random variables. Instead, we proceeded with an ill-defined notion that propositions and sets are equivalent in some way.

Alternate Backends for PLASMA Crowdsourcing Tools

Although AutoMan and SurveyMan were both designed to make their backends pluggable, in practice we have yet to implement an alternate backend for either, because there simply aren't any AMT competitors out there. There are plenty of crowdsourcing websites, but none are as programmable as AMT and few are as general. That is to say, all competitors appear to offer specialized labor markets and/or be designed for specialized work.

A known problem with the labor market on Amazon is that, even if you pay your workers minimum wage based on time actually spent on a task, they spend a significant amount of time searching for tasks. There are websites set up to facilitate this process, but it's still time spent searching for work, instead of actually working. A major subfield of alternate approaches involves extracting work either voluntarily or in contexts where non-monetary compensation makes sense. Quizz uses Google's advertising system to embed knowledge-mining quizzes alongside its usual ads. Other approaches substitute consumer marketing tasks or questions for paywalls. In both cases, users are motivated by something other than payment.

I've been wondering for a while whether thefacebook would be a good platform for our software. Although the general understanding is that respondents are anonymized, we know this is not true. Researchers have also assumed that workers are independent, but recent work out of MSR found that some Indian workers actually collaborate on tasks. For these reasons, I think Facebook would be a perfectly reasonable alternate platform for crowdsourcing. In fact, I believe that Facebook is a better platform for crowdsourcing, since it overcomes one of the major shortcomings of AMT: people are already hanging out there*. Rather than appeal to a populace that is explicitly looking for work, sometimes as a primary source of income, we would like to instead use Facebook to tap into people's excess capacity**.

Since Facebook doesn’t currently have a crowdsourcing interface, could we mock up a substitute using what’s already available? Amazon currently handles listing, pool management, payment, presentation, and offers a sandboxed version of AMT for testing. A minimal implementation would involve hosting our own *Man servers and just using Facebook advertising to recruit workers. However, this diverts the user away from the Facebook ecosystem, which defeats the purpose of using Facebook in the first place (for example, we could just as easily use Google AdWords instead).

To keep users in the ecosystem, we could write a SurveyMan app***. I looked into this briefly, and while it isn't as integrated into the main Facebook experience as I'd want, it's closer than routing users to an outside website. We could use Facebook advertising to do the initial recruitment and then use wall updates to bootstrap that process. If Facebook advertising provides a way to target ads to particular demographics, we would also have an easier time managing bias in our sample.


* Admittedly, I am not a regular user of thefacebook. I've heard the "so and so spends their whole day on facebook" complaint, but I really don't know how common this is. Consequently, this post is predicated on the idea that thefacebook is a place where people spend a lot of time, not doing work. I have heard that this is less the case since mobile became ubiquitous.

** TBH, I think the cult of excess capacity is unethical, but for the sake of capitalism and this blog post, let's assume it isn't. I will save a discussion of ethics and excess capacity for later.

*** Word on the street is that no one actually uses these things anyway...