About a month ago, Charles Sutton stopped by UMass to give a talk called “Statistical Analysis Of Computer Programs.” Here are my lightly-edited, cleaned-up (ish) notes on the talk (this approach inspired by ezyang's amazing note-taking abilities):
conceit — text as language
(bad metaphor)
computer program = precise set of instructions
people — more aspects (social interactions)
language a good metaphor as a means of human communication
when writing code, others may want to use later
next person might be you!
research:
— lots of open source code online
— lots of implicit knowledge in how to write software — library apis, avoiding bugs
— easy to read, easy to maintain
take implicit knowledge and make explicit
when writing code with a new library, look at what people did in the past
suggest patterns
key insight: means of communication
regularities of natural language (NL) may be found in programming languages (PL)
no new machine learning
apply existing techniques from statistical NLP to find new patterns
coding conventions
— names, formatting
summarization
— get a compressed representation of long verbose source files
mining idioms
— small syntactic patterns in code
describe how we use them
NL coding conventions
local->global movement of abstraction
what kinds of coding conventions?
examples:
junit example
create input stream in java
create an identifier
name of input stream
maybe learn the kind of name someone who contributed to the junit project would use
formatting, e.g. braces
a coding convention is a syntactic constraint beyond those imposed by the language's grammar
something programmers themselves decide to impose on top of what the compiler requires them to do
developers care a lot
style checkers, style guides
“small amount” of research in software engineering
how they use these in software engineering…
go through code review threads on big commercial (microsoft) projects, find out what's important
the threads cover different aspects of the committed code
38 percent of comments related to conventions rather than functionality
(why i hate code review!)
why not just run a formatter over the code?
corner cases that it doesn't handle for you
renaming variables to be more consistent? (jsnice)
review time — can talk more about the functionality
where do conventions come from?
– implicit from code base
one programmer starts, others pick it up
emergent quality
mores, rather than laws
large number of software constraints, well modeled with statistical machine learning
even with a lot of programming experience, won't know how things are named
coding convention inference problem –> why not use machine translation to take my conventions and change them to your conventions?
eclipse plugin called devstyle
click on your identifier, will give you a list of other names
renaming suggestions
how should class objects be named
some disagreement with conventions
we can suggest a name
go through a region and rank names
block of code someone wants to add to the project
how large of a corpus would you need?
what kind of technology do you use inside the scoring function?
n-gram language model
smoothing, taking into account the constraints of the compiler/language conventions
constraint is library of changes
only renaming, not generating code (generated code would not be syntactically correct)
pull together all uses of the identifier
look at the set of all other names that have been used anywhere in previous or succeeding context
ask the ngram language model for the joint prob. of the entire file — sounds really expensive
— actually only uses the ngrams centered on the thing we want to rename
laura: coreference analysis on the code
— knowing that i and i are the same gives you a nonlinear language model
— can you get something more robust
— tapping the compiler for name resolution
— how do you incorporate it into the model? (only incorporated into the suggestions — done post hoc)
score by ngram model, threshold so user doesn’t see terrible suggestions
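(a sketch of what the ngram scoring could look like; the trigram model, add-one smoothing, and function names here are mine, not necessarily what the actual tool uses:)

```python
from collections import Counter
from math import log

class TrigramModel:
    """tiny add-one-smoothed trigram LM over code tokens (illustrative only)"""
    def __init__(self, token_streams):
        self.trigrams, self.bigrams = Counter(), Counter()
        self.vocab = set()
        for toks in token_streams:
            toks = ["<s>", "<s>"] + list(toks) + ["</s>"]
            self.vocab.update(toks)
            for i in range(2, len(toks)):
                self.trigrams[tuple(toks[i - 2:i + 1])] += 1
                self.bigrams[tuple(toks[i - 2:i])] += 1

    def logprob(self, w2, w1, w):
        # add-one smoothing so an unseen trigram doesn't zero out a candidate
        return log((self.trigrams[(w2, w1, w)] + 1) /
                   (self.bigrams[(w2, w1)] + len(self.vocab)))

def score_rename(model, tokens, old_name, new_name):
    """sum the log-probs of every trigram window touching the renamed
    identifier, rather than rescoring the whole file"""
    toks = [new_name if t == old_name else t for t in tokens]
    total = 0.0
    for i, t in enumerate(toks):
        if t != new_name:
            continue
        for j in range(max(i, 2), min(i + 3, len(toks))):
            total += model.logprob(toks[j - 2], toks[j - 1], toks[j])
    return total

def suggest(model, tokens, old_name, candidates, k=5, margin=0.0):
    base = score_rename(model, tokens, old_name, old_name)
    scored = sorted(((score_rename(model, tokens, old_name, c), c)
                     for c in candidates), reverse=True)
    # threshold: only surface names that beat the current one by a margin
    return [name for s, name in scored[:k] if s - base > margin]
```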
side effect of architecture:
don’t want names to be very common
system does not choose really common names
sympathetic uniqueness principle
if a program entity is unusual, give it an unusual name
if it's something very domain specific, such as smoothing, choose an appropriate name, appropriateness judged by its statistical properties
in the training set, if an identifier occurs infrequently, replace it with UNK for unknown (deals with the tail of the distribution)
in the suggestion process, only known tokens are offered as alternatives
whatever the context, if it uses a rare word, don't suggest a change (sketch below)
what happens if you have a common word where there should be an unusual one — don't know if there is an answer to this question
like adding a new table in a chinese restaurant process (CRP)
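(sketch of the rare-token handling as i understood it; the UNK cutoff and helper names are my own:)

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(token_streams, min_count=5):
    counts = Counter(t for toks in token_streams for t in toks)
    return {t for t, c in counts.items() if c >= min_count}

def unkify(tokens, vocab):
    # training time: rare identifiers collapse to UNK, modelling the tail
    return [t if t in vocab else UNK for t in tokens]

def may_rename(current_name, vocab):
    # sympathetic uniqueness: a name rare enough to map to UNK is assumed
    # deliberately unusual, so don't suggest changing it; and since UNK is
    # never a real token, the model can never suggest it as a new name
    return current_name in vocab
```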
formatting conventions
encoding spacing decisions as tokens (indexed by location, foo)
use the same framework for suggestions on this basis
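(one plausible way to encode spacing decisions as tokens so the same ngram machinery applies; the <WS_n> scheme is my guess, the talk didn't give the exact encoding:)

```python
import re

def lex_with_spacing(line):
    """tokenize a line of code, emitting explicit tokens for runs of spaces"""
    out = []
    for piece in re.findall(r"[ \t]+|\w+|[^\w\s]", line):
        if piece.strip() == "":
            out.append(f"<WS_{len(piece)}>")  # e.g. <WS_4> = four spaces
        else:
            out.append(piece)
    return out

# "if (x){" and "if (x) {" now differ by an explicit <WS_1> token,
# so an ngram model can learn which form a project prefers
print(lex_with_spacing("if (x) {"))  # ['if', '<WS_1>', '(', 'x', ')', '<WS_1>', '{']
```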
does this thing work?
evaluation methodology
automatic evaluation:
— doesn't say what this is (is it the generative thing, idempotency — that's how you would explain it)
— should really be doing a case study or user study or human evaluation or whatever
don't want low precision in this tool because people won't use it
95% accuracy — basically, what google does
can do this for other types of identifiers, but it's harder
sympathetic uniqueness — do we rename everything as i?
x axis: how often do we suggest revisions
y axis — element of surprise — of the things that were rare, what percentage did we incorrectly try to rename to something else
set threshold high, no suggestions, no new names
set threshold low, rename everything
methods, variables, and types are very different
variables and types back off
methods are much more surprising all of the time
naturalize tools
final thing — test on github, submit patches
look at the top suggestion and the top 5 suggestions
submitted 18 patches
14 accepted
do programmers really care about this kind of naming?
suggest that their exceptions be renamed from e to t?
t was okay
throwable (t) to e
people accept t
evidence that programmers think about and care about naming
paper on arxiv — fowkes, ranca, allamanis, lapata, sutton
question: exists bad users?
– AI complete question
question: hard metrics for successful renaming — most people like it better?
– programmers are picky
– does it actually make programs easier to maintain
question: fancier language model better?
– yes, think so
– type names are recoverable — very conventional (very GP)
– java — type names are in 1:1 correspondence with classes
question: run on dynamic languages?
— no results
— corpus of java, c, python
don't know if it's been run on python — i don't think it would work as well on python because python is not as redundant as java
new topic:
autofolding to summarize code
summarize, compress out java boilerplate
use code folding (which obviously uses the fact that we know blocks are denoted by braces)
is the summary just folding?
different audiences
task based vs non task based
experienced versus novice
expert in project
first look problem — opening a single file for the first time, get an overview
TASSAL — tree based autofolding software summarization algorithm
start with file, parse, say we want to compress certain types — block statements, comments
foldable tree — subset of the AST that contains the nodes we could consider folding
file → bag of words — split identifiers by camel case
some of these are going to be generic java stuff
some are concepts used throughout the project
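(the identifier splitting is easy to picture; a rough sketch, the regex is mine:)

```python
import re

def split_camel(identifier):
    """split a java identifier into lowercase subtokens for the bag of words"""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", identifier)
    return [p.lower() for p in parts]

print(split_camel("DataSourceUtils"))    # ['data', 'source', 'utils']
print(split_camel("parseHTMLResponse"))  # ['parse', 'html', 'response']
```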
topic models — find characterizing words for files and packages; tried method as leaves, but this was too sparse
for each node, pick a mixing distribution
single topic per file — packages or other levels of abstraction <- wonder if you could use this for refactoring
— fit the model
this gives us, for each token in the source file, an indicator variable for whether it was generated from the java background, the package, or the file (how characteristic it is)
think of it as an optimization problem
binary vector u
each element indicates whether a node is folded
— okay, so topic models for summarization, dug
look at all tokens assigned to file via generative process
empirical distribution of nodes included in the summary
constraint: stay within a budget for the length of the summary
tree consistency
— if a node is included in the summary, must include parents
optimise via greedy algorithm
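(my sketch of the greedy step; the weight-per-cost scoring and the Node fields are assumptions on my part, the real objective comes out of the topic model:)

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """a foldable AST region; weight stands in for how characteristic of
    the file the topic model says its tokens are"""
    name: str
    cost: int        # tokens shown if this node is unfolded
    weight: float    # how characteristic of this file
    children: list = field(default_factory=list)

def greedy_summary(root, budget):
    """greedily unfold nodes; tree consistency: a node is only eligible
    once its parent is unfolded; stop when the token budget is spent"""
    chosen, frontier, spent = [], [root], 0
    while frontier:
        frontier.sort(key=lambda n: n.weight / max(n.cost, 1), reverse=True)
        node = frontier.pop(0)
        if spent + node.cost > budget:
            continue  # too big for the remaining budget; try the next one
        chosen.append(node.name)
        spent += node.cost
        frontier.extend(node.children)  # children become eligible
    return chosen

root = Node("header comment", 5, 1.0,
            [Node("parse()", 20, 8.0), Node("boilerplate", 15, 0.5)])
print(greedy_summary(root, 30))  # ['header comment', 'parse()']
```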
question: what about naming or other conventions that are drawn from different natural language distributions?
— clusters of developers that follow different conventions — cluster developers together using both formatting and naming — models don't work as well
question: what if i have multiple devs collaborating ON THE SAME FILE?
— topic per contributor
— run a topic model per contributor
— where does z come from?
taking topic models and applying them to names? not done yet
look at example topics
example: columns are topics
three background topics (e.g. get string value name type object i)
projects: spring, bigbluebutton
files: datasourceutils, qualsp
to evaluate:
create gold standard
folded files manually to measure precision and recall
compare with
javadocs — always include javadocs, add random nodes
shallowest to deepest
expand nodes in order of length
heuristic, but that’s all eclipse is doing — comparing with state of the art
second:
show summaries to developers
6 developers, avg. 4 years industrial experience
— rate conciseness and usefulness
these are more concise and useful
automatic summaries from TASSAL were better than any of the other baselines
third thing:
mining idioms from code using existing NLP tools
what are code idioms?
example: reading into a buffer, iterating over an array
— are these all things that are encapsulated by other abstractions?
opening resources/context — common pattern
need meta variables
a code idiom is a syntactic code fragment that recurs frequently across software projects and has a single semantic purpose
— wondering if you could learn to match ASTs from one language to another (from one that has these higher level abstractions to one that doesn't)
idiom-related tools — intellij and eclipse
no way of identifying which idioms are useful (presumably to add them to the IDEs — how do you find new ones)
other types of code patterns
— surface level — code clones, copy-pasted code fragments
— api patterns — usage patterns of methods
idiom mining problem —
can i find these templates?
use a probabilistic grammar
CFG slides, pCFG slide
use tree substitution grammar — related to tree adjoining grammar
non-terminal can expand into a tree instead of a list of terminals and nonterminals
can make a probabilistic version
tree substitution grammar over tree nodes and regular expansions
represents a family of idioms
this allows us to represent these idioms
approach: idiom mining with a probabilistic tree substitution grammar
take a corpus of ASTs
learn the grammar
every tree rule in the TSG that i learn i treat as an idiom
convert into a textual representation
build a library of idioms to show developers
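(a toy picture of a TSG rule as data, and how it might be rendered as an idiom with metavariables; the Fragment class and the $NAME rendering are made up for illustration:)

```python
from dataclasses import dataclass, field

@dataclass
class Fragment:
    """a TSG rule: a nonterminal expands into a whole tree fragment; leaves
    that are still nonterminals become the idiom's metavariables"""
    label: str
    children: list = field(default_factory=list)

    def render(self):
        if self.children:
            return " ".join(c.render() for c in self.children)
        # uppercase leaf = an unexpanded nonterminal, i.e. a metavariable
        return f"${self.label}" if self.label.isupper() else self.label

# hypothetical fragment for the "iterate over a collection" idiom
loop = Fragment("ForEach", [
    Fragment("for"), Fragment("("),
    Fragment("TYPE"), Fragment("VAR"), Fragment(":"), Fragment("EXPR"),
    Fragment(")"), Fragment("{"), Fragment("BODY"), Fragment("}"),
])
print(loop.render())  # for ( $TYPE $VAR : $EXPR ) { $BODY }
```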
how do we infer the grammar?
maximum likelihood conditioned on the pTSG rules
previous work from sharon goldwater et al
— infer what these trees are, pick list of trees that best explain the corpus
— the number of possible things i could put in theta is intractable; maximum likelihood is degenerate — picks rules 1:1 with whole trees
don't make tree fragments too big
put a prior on probabilistic grammars
if you're going to add another tree, this is what the idiom would look like
get a joint distribution over pCFGs and source files
dist. over dist. of parse trees
given a corpus of code, get a distribution over probabilistic grammars over the trees i've inferred
type-based MCMC from liang et al; (think this is from liang’s GP-like work)
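(my reconstruction of the bayesian setup, following the goldwater-style DP-TSG line of work; notation mine, the talk didn't show the exact equations. each nonterminal c gets a distribution G_c over whole fragments, and the base distribution builds fragments from pCFG rules with a stopping probability, so large fragments are a priori unlikely; this is the prior that keeps maximum likelihood from memorising whole trees:)

```latex
\begin{align*}
  G_c &\sim \mathrm{DP}\bigl(\alpha_c,\; P_0(\cdot \mid c)\bigr)
      && \text{fragment distribution per nonterminal } c\\
  e &\sim G_c
      && \text{fragment chosen at each expansion}\\
  P_0(e \mid c) &= \prod_{i \in \mathrm{internal}(e)}
        (1 - p_{\mathrm{stop}})\, p_{\mathrm{CFG}}(r_i)
      \prod_{f \in \mathrm{frontier}(e)} p_{\mathrm{stop}}
      && \text{small fragments preferred}
\end{align*}
```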
some questions i didn't catch
mined idioms
iterator, loop through lines, logger for class, string constant
get patterns you would actually find
get something from actual APIs
e.g. database transaction (opening a resource and cleaning up properly)
get the distance between two points in ??
jsoup get html
lots of work in SE in API mining
no syntactically nested things
take a held out set of files
percentage of AST nodes explained by the mined idioms
existing method for clone detection
completely duplicated ?
copy paste phenomenon
idioms we find occur across projects much more often…?
SE perspective — dozens of papers in SE about copy-paste clones
if these things are really idioms, maybe they will occur more often in example code — actually what happens
from a data set of regular projects on github, we find 22% of idioms found are actually used in examples, higher in stack overflow
finally, can do a co-occurrence matrix — how are idioms used across different projects
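(the co-occurrence matrix itself is simple; a sketch with made-up idiom names:)

```python
from collections import defaultdict

def cooccurrence(usages):
    """usages: (project, idiom) pairs from the miner; returns counts of how
    many projects each pair of idioms appears in together"""
    by_idiom = defaultdict(set)
    for project, idiom in usages:
        by_idiom[idiom].add(project)
    pairs = {}
    for a in by_idiom:
        for b in by_idiom:
            if a < b:
                shared = len(by_idiom[a] & by_idiom[b])
                if shared:
                    pairs[(a, b)] = shared
    return pairs

usages = [("spring", "open-resource"), ("spring", "logger-for-class"),
          ("bigbluebutton", "logger-for-class"), ("bigbluebutton", "open-resource")]
print(cooccurrence(usages))  # {('logger-for-class', 'open-resource'): 2}
```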
eclipse snipmatch
— open source addition to eclipse — manually took 44 snippets, stuff worked or something
can't put all the idioms in the tool; too many were found
considered this as validation that the thing works
interesting that there was one idiom used that is considered bad practice
exploiting that source code is a means of human communication
maybe surprising to people who are from a different background, that you would need to train the model