Finally, a post devoted entirely to television: Scandal needs some Sandra Oh!

I have pages of notes for a blog post about correlation, but since it’s 11pm, let’s put that off for now and have a little summer fun!

I had a chat today with Myung-ha Jang about Grey’s Anatomy and Scandal. Myung-ha’s watched more Grey’s, but hasn’t tried Scandal yet. I unfortunately dove into Grey’s at the height of its ridiculousness (the icicle that stabbed Cristina, the romance of The Lady of the Lake and poor homely Sonya, and Izzie’s hallucinations that turned out to be cancer). If it was about to get worse, I didn’t want to see it.

Okay, you know what? That clip of the icicle stabbing Cristina begins with a great moment between friends:
[youtube]https://www.youtube.com/watch?v=FyePUT2pbIY[/youtube]
Yeah, it was a little judgey. Cristina isn’t known for sugar-coating what she thinks. Even if it’s not the smoothest move in real life, it’s nice to see a character say what you’re thinking. This is television fantasy land, where all your male coworkers are McSexy and you’re surrounded by awesome, interesting women who speak their minds and are really f***ing good at their jobs.

Now, I told Myung-ha I preferred Scandal because the story lines were eerily believable. However, the thing that’s been nagging me has been that Olivia Pope has no friends.

Sad Olivia

While I think Scandal’s plot lines are more believable, I find the relationships between characters sad and emotionally stunted. Of course, these characters are in a morally bankrupt universe of bad actors, whereas the Grey’s folks are trying to do good in the world. In any case, it would be great if Sandra Oh guest starred as Cristina and had a long-distance besties relationship with Olivia. Apparently Cristina is in Switzerland now, and I think Olivia might have been sent there for her internal spy training secondary schooling. I wish they would ship in someone like her, rather than yet another emotionally crippled assassin/torture artist.

Christina Yang: A+ Surgeon, A+ Friend

Defining Equality

This post follows directly from my previous post on correctness. Rather than dive directly into the details of correctness for surveys, I wanted to spend some time thinking about some of the things we might want to prove, show or derive first.

We’ve stated that one way to define what we mean by survey correctness is to say that the same survey (modulo randomization) administered to the same person, under the same set of conditions, will yield the same set of results. Let’s start by assuming that it is possible to repeat the survey for a respondent under the exact same conditions. Then there is is some underlying gold standard for the survey. While we cannot repeat the survey for the same set of respondents, under the exact same conditions in practice, we can emulate this scenario with our simulator. We choose a set of answers, according to a particular profile and let this be the gold standard. The simulated respondent then chooses a response according to the map from questions to answers. Once the survey is complete, we compare the results returned by the respondent with our generated gold standard. The question-answer sets for the gold standard and the returned values should always be equal.

At first, it may seem that this definition of equality is trivial. We have unique identifiers for each question and answer option that are created directly from their data’s location in the input csv. The response form uses these ids. One question is displayed at a time and no questions ever repeat (we’ll prove this later). We simply check the id of the question currently displayed, look up its answer in our response map, and select the option(s) that has/have matching identifier(s) to our gold standard’s responses. So long as the HTML is correct, we will get back the same id, right?

Wrong. Well, a little wrong. If your survey does not use variants, you’re fine. If it does, here’s the problem: the variant is chosen at runtime i.e. when the respondent is actually executing the program (taking the survey). Since we cannot divine which question will be shown, we need a different way of deciding the appropriate answer.

Solution 1: Specify the random choice at compile-time, rather than run-time.

This is certainly a compelling argument, and it solves our problem of not knowing what question will be displayed at runtime. When we began work on SurveyMan, we generated all of our surveys statically. All of the randomization was done in Java. We produced the full HTML to be displayed and sent this over as the payload. We didn’t have much in the way of an interpreter; we just had some jQuery running in document.ready that hid all question divs and then showed the first question. We were not playing with breakoff at this point, so a next and a previous button were always displayed (except for the first and last questions). We simply swapped which divs were visible and which weren’t.

Practical issues We found considerable issues with this approach, enumerated below:

  • AMT limitations While this may change at any time, historically there haven’t been many degrees of freedom when deciding assignments of workers to tasks on AMT. Since our primary source of respondents was AMT, we had to keep this in mind. Since each survey was static HTML, we had to send each over in its own HIT. In the past, this meant that we would post some number of HITs at a time. The easiest way to do this was to post a survey, wait for a response, validate that response, issue “reverse” qualifications, and then post some more. The “reverse” qualification thing is an Automan technique to disqualify people who had already responded to that HIT under a different randomization.

    There are some logistical issues with this approach. Posting one HIT that has one assignment at a time means that we incur significant latency in our response period. We cannot take advantage of the massive parallelism of the crowd. If we instead post the HITs in batches, we have to hedge against workers who will answer some number of HITs in a row before the reverse qualification kicks in. We end up paying for redundant collection, as well as incurring additional latency (due to the workers who completed the HIT more than once). Repeaters are a major threat to this approach — AMT groups HITs in such a way that similar HITs appear together. The researcher can get around this automated grouping by creating a new group id and putting each HIT in its own group. Now we are back to the original latency problem, because we have one HIT wonders floating around in the MTURK ether.

    Note that the AMT system is designed as follows: Requesters may either specify one HIT that has many assignments (redundancy), or they may specify many HITs that will be grouped together with the express purpose of making it easier for one person to perform many HITs. We would like to be able to take advantage of the former, and posting static HTML does not allow this.

  • Page Load Time The prototypicality survey had almost 100 questions. There was an obvious lag in loading on several browsers; the contents flashed before hiding the appropriate divs. We could have denoted these divs as hidden, but we would still have the lag of loading in a large file. The problem was even worse when we loaded in media content. It was clear from this alone that some of the content would have to be loaded on an as-needed basis.
  • Payload size AMT has a maximum payload size. I can look it up, but needless to say, we hit it with one of the linguistics surveys. I believe that survey had approximately 200 questions — they were similar in nature to the phonology questions. The true payload would have been smaller, since that survey was designed to be sampled from. However, there will be cases where survey writers are going to want to simulate an infinite stream of survey questions. We would like to be able to send over as large a survey as possible, and this is just not going to happen if we send over a static webpage. Now, one could argue that the cost of the payload was really in the redundant HTML we were sending over — we could easily have swapped out the question and option text, while still specifying the full set of questions to be seen a priori. This is a fairly compelling argument that adequately addresses both the page load time, as well as the payload size.
  • Opportunities for exploitation It’s very easy for someone to write a script that, for instance, selects every radio button answer and hits submit. In fact, if we have a submit button sitting around anywhere, we have a problem. We could dynamically generate all buttons as needed, while still sending over the statically chosen questions. The solution at this point would be to send over minimal HTML and instead swap in the data that we need, as needed. All buttons and control flow through the survey would be controlled by a small state machine. This would prevent respondents from writing a script to click on all input of a given type and submit. It’s certainly still possible to write a “bot” to submit a form when inputs are generated dynamically, but it will now require the script to simulate the survey, which is more effort. Furthermore, the user has no guarantee that the script will terminate (although we know it will).
  • Opportunities to “cheat.” Besides automated “cheating,” it’s also possible for users to see the set of questions being asked, simply by looking at the source code. If we store the questions and option text in a lookup table, they become significantly more difficult for a human to interpret — while they can still see what the questions are asking, human adversaries will not have the full picture. Even though there is still room for adversarial action, we hope that the increased indirection of dynamically generating the questions increases the complexity of the problem and deters adversaries from gaming the system.

The above arguments make a strong case for generating the survey dynamically. The first point is the strongest practical reason for performing randomization at runtime; for the latter points, randomizing at runtime gives us a nice defense against some adversarial behavior, but it isn’t particularly compelling.

A major advantage to randomizing at runtime is that it is fundamentally more principled than randomizing a priori. Our randomization mechanism can also be framed as a an issue of nondeterministic choice — in the case of variants, we have choose one of a set of these variants. In the case of the total set of questions, it is one of the set of all possible permutations. In principle, we can say that the behavior of the machine at runtime is the same for all permutations of the survey, since the machine itself is selecting which permutation. All equivalent surveys are statically equivalent, with only the runtime behavior differing. This will make certain properties easier to prove later on.

Solution 2: Mark every question in a variant block with equivalent answers

Recall that the problem we’re trying to solve is how to define equivalence between questions. The idea behind fixing the random order at compile time was so that our two parallel universe survey takers would end up with the exact same answers. Even if you don’t buy the argument to randomize at runtime, rather than randomizing at compile-time, there’s still the problem of comparing across different respondents who see different variants.

A new pass solution might be to answer every question in a variant’s block. However, if we do this, we will have answers for every question in that block sitting around in our answer set. The resulting set of question-answer pairs will not be equal to the gold standard set.

We could try to get around this shortcoming by removing the questions that weren’t seen. This doesn’t quite work either. One of the things we’re trying to show is that, not only are the answers consistent (this is the part of the correctness argument that addresses bugs that are unique to surveys), but also that the same set of questions are seen (this is the part of the correctness argument that will address the more traditional programming languages issues in our design). If we just remove the other questions, we run into a dangerous situation where we are adding and removing data in the course of our correctness argument, which will make some things harder to prove. Furthermore, we will need to wait until runtime to decide whether two answer sets are equivalent.

Solution 3: Define a set of equality relations for our objects of interest

Thus far our solutions have been computational in nature — their correctness is defined by their implementation. We’d really like a more general solution that will allow us to reason about a survey program from first principles. We would also like a solution that can be applied to a broader swatch of implementations. We’d like our definitions to be the gold standard of correctness for future survey language writers. Let’s start with some definitions.

A survey is a set of blocks. A block is a heterogenous set of questions and blocks. A question is a record with a survey-scoped unique identifier and a (possibly empty) set of answer options. An option is a record with a survey-scoped unique identifier. These all follow from our definition of the syntax.

Answer Set An answer set is a set of 3-tuples of question identifiers, option identifiers, and indices.

Equality of Answer Sets

Two sets of answers are “equal” (denoted as $$\equiv_{resp}$$, called “response-equivalent”) if and only if two conditions are satisfied:

  1. There exists a bijective mapping between two answer sets.
  2. The tuples in the mapping satisfy an equivalence relation, which we will define.
Begin Algebra Tangent

I asked Dan Stubbs if there was an algebraic structure where these properties would make sense. He suggested that perhaps I was defining a quotient group. After a nap and some perusing of Wikipedia (and a dense Algebra text), I’m not entirely sure this is what I want. Here are my unstructured thoughts on this:

  • The answer set is a subset of all possible answers. All possible answers for each question type are defined in the language as follows:
    • Exclusive (Radio Button) questions The set of individual response options.
    • Non-exclusive (Checkbox) questions The power set of response options minus one (you cannot have the empty set — that is, the respondent must check at least one box).
    • Regex questions Any string satisfying the regex, up to the size allowable in the response field (would have to look this up).
    • Freetext questions The subset of all strings up to the size allowable in the response field.
  • Since we have a clear mapping between questions and answers, we can talk about questions instead. Really, when we say that the response set is invariant under randomization, we’re saying that the question-answer pairs should be invariant under randomization.
  • According to Wikipedia

    In mathematics, specifically group theory, a quotient group (or factor group) is a group obtained by aggregating similar elements of a larger group using an equivalence relation that preserves the group structure.

    What’s the larger group here? Presumably it would be the set of all possible answers. This doesn’t seem right, thought — we don’t want the set of all answers, because this would mean that a valid response could include answering, say, two questions from the same variant block. Instead, I think it ought to be the set of unique answer sets. We’re not worrying about breakoff for now, so we don’t need to define this as the power set of some set of answers. If we have $$n$$ questions, each with $$m$$ radio-button (exclusive) options and no variants and no breakoff, then we have $$m^n$$ possible answer sets. To account for variants, we multiply by the cardinality of each variant set.

    What are the smaller groups? They’re the set of question-answer pairs we’ve collapsed. Now what’s the group structure? To be honest, I’m not entirely sure an answer set satisfies the definition of a group. For starters, what’s our group operator? I suppose a tempting first choice might be a set addition operator. We could say that if $$\lbrace (q_i, a_i, i) \rbrace$$ and $$\lbrace (q_j, a_j, j) \rbrace$$ are each in the set of possible question-answer pairs, then $$\lbrace (q_i, a_i, i), (q_j, a_j, j)\rbrace$$ is also in the set of possible question-answer pairs. Of course, for $$\lbrace (q_i, a_i, i) \rbrace$$ and $$\lbrace (q_j, a_j, j) \rbrace$$ to be floating around, we would need to allow breakoff. Now, what would happen if we tried to combine two complete answer sets? If the answer sets are equivalent, we get back the same answer set. If the answer sets are equivalent, modulo variant questions, how do we decide what’s returned? If the answer sets are radically different, then what?

    I think at this point, we need to be talking about questions, rather than answer sets. Two answer sets can differ either because the questions seen are different (due to breakoff or variants) or because the answers chosen are different, or both. If we only consider questions, we can define a question-combining operator (rather than a response-combining operator) and get the properties we want. We would need to define some unified variable to represent the set of variants, so any time a variant is combined, we represent the variant with that variable.

    And this is about where I’m stuck. It’s not clear to me that this kind of algebraic structure is what we want. Wait — why would we want it in the first place? Well, I like challenging myself to think about these problems from a variety of angles, so it’s intellectually satisfying as an exercise in its own right. But it could also prove fruitful later down the line. It might be the case that by having these algebraic properties, we get some proofs of correctness or other properties for free. Reducing a known and well-studied problem to a new domain for the win!

    End Algebra Tangent

    So basically, we need to define this bijective mapping. Here’s the short version (since this post has been sitting, incomplete and open in a tab for well over a month):

    Two questions are eqivalent if either:

    • The questions are identical. That is, they have the same ids (and should belong to the same block, have the same flags, etc. Questions are immutable entries in a spreadsheet, so you could only get two questions with equal ids but unequal other stuff if your lexer/parser/compiler screwed something up).
    • The questions are both contained in a block having type ALL. In this case, the number of options will be equivalent. The order of their entry denotes equivalence. That is, the $$i^{th}$$ option for each question variant is assumed to be equal in its meaning.

    Now here’s the punchline: since we use different paths through the blocks (and thus different containing block ids) to denote an experimental control (really, a “control path”) and to denote experimental designs such as Latin squares, we currently have no way in either our static nor our dynamic analyses to understand equivalence across these paths. Although we can express experimental surveys using SurveyMan, our system currently cannot analyse these surveys as experiments.

Metrics to answer our research questions

We had a number of excellent questions and comments from our reviewers. I’ve been working on a few entries to address them.

One of our reviewers pointed out that we didn’t provide metrics for our research questions. Another pointed out that our results were qualitative, but that this was okay because we provided case studies, rather than user studies. Since we’d like to be able to do user studies in the future (hopefully the exposure we’ll get at OOPSLA will lead to more collaborations!), it’s certainly worth thinking about.

Research Question 1: Is SurveyMan usable by survey authors and sufficiently expressive to describe their surveys?

Usable by survey authors? We think so. Certainly our colleagues in linguistics didn’t have any problems with the CSV structure and the PL-ness of it. We were warned a bit more by our colleagues in econ. The argument I’ve heard is that (a) people are accustomed to flashy interfaces and would put off by using something text-based and (2) the formatting of the data, both the input and output, is counter-intuitive. I’m not entirely surethis is true. First of all, people have been using SAS and SPSS for years. When I took my first stats course in Spring 2004, we used one of those tools (too long ago to remember which). I remember the painstaking data entry involved. The summaries spit out, full of text tables (no graphics) weren’t the easiest to interpret. This was the standard tool for statistical analyses in the social sciences at the time (I was an econ major then).

This project has been a challenge \emph{because} less-powerful, flashy alternative exist. Our colleagues in linguistics liked the csv idea because they were already storing their data in csvs (since they’re easy to load into R and since they can — and already were — programmatically generate the data). It’s unclear whether people in fields that are used to generating these things by hand will be as receptive. There’s no reason I can see why someone wouldn’t be able to write in our csv language, but drag and drop doesn’t require reading a spec, and it’s what most commercial services are using.

While I think the language is usable as-is, it would be more usable if we had a spreadsheet plugin to statically check the thing. How would we measure the usability of the tool, empirically?

I think one way of evaluating its usability would be to gather a bunch of surveys and recruit people for a user study. A traditional, offline version of this might go as follows:

  1. Generate a list of available survey tools.
  2. Recruit a representative sample of participants across the social sciences (probably other graduate students at UMass).
  3. Administer a survey to those graduate students to learn which survey technologies they’d used before, whether they were familiar with any programming languages, and if they have any experience with spreadsheets and/or databases.
  4. Ask the participants to take a survey specification and design a series of surveys with various features on one of the competing survey software packages (that they had not used before) and have them do the same tasks with SurveyMan.
  5. Perform a post-treatment survey to ascertain results.

We would measure the time it took to complete the task, the number of clarifying questions the participants ask, and the perceived ease of the task. It might be a good idea to have the post-study follow-up survey about a week later.

Some design features we might want to consider are whether we assign the SurveyMan task first, how much prior experience in either programming languages or survey software served as prior training, and how we present the survey tasks. If we give the participants csvs of questions and just ask them to manipulate the data, we will be biasing them to a more favorable view of SurveyMan. We might instead give them a variety of inputs, such as a csv, a pdf, an English description of a survey we might want to run, and a followup survey based on that English description.

Expressive enough to describe the target population’s surveys? Yes. We can certainly express all of the surveys that our clients wanted to be able to express, and in some cases we were able to express more. One interesting aspect of our collaborations is that we’ve discovered that some things we’ve been calling surveys are actually experiments. While we can describe them using our language, we currently lack the support in our runtime system to perform the kinds of analyses we’d like to see. This has caused us to pursue work in a new direction.

I think the best way to address this question would be with a list of features describing what it means to be a survey. I’ve been thinking about the difference between experiments and surveys a lot recently and a checklist like this would be illuminating.

Research Question 2: Is SurveyMan able to identify survey errors?

We think so. We use standard statistical methods to identify wording bias and question order bias.

For breakoff, we report the distribution of breakoff over questions and positions. Diagnosing whether we have an unusual amount of breakoff requires a prior over comparable surveys, so we don’t attempt to define a threshold for whether the amount of breakoff is statistically significant.

We could model a small amount of noise that says, “if this survey is well-formed, every respondent will have equal probability of breaking off at any point before the final question.” We could say that there is some small $$\epsilon$$ representing this probability, which is independent of any features of the survey. Let $$r$$ represent our sample size. Then the expected number of respondents who broke off from this survey would be $$r\epsilon$$. If the total number of respondents who broke off before the final question is unusually high, then the survey author would need to debug the survey to figure out why so many respondents were leaving early.

We’re clearly be interested in where respondents were breaking off, and not just that they were breaking off. It’s possible that the survey is uniformly poorly designed; it’s also possible that it’s too long or that there is some point in the survey (such as a transition from one block to another) that causes an unusual amount of breakoff. For a flat survey having $$n$$ questions, every position is identical insofar as the set of questions that may be seen there. In such a case, we might want to model breakoff as a series of Bernoulli trials — at each point, a respondent may break off with probability $$\frac{\epsilon}{n-1}$$. We would then expect there to be $$\frac{\epsilon}{n – 1} \times r$$ people breaking off at each position.

The problem with this model is that although the probability of breaking off at any given point is the same, the number of participants is not \emph{because a single person cannot break off twice}. First, let’s model this as a recurrence. Let $$B_i$$ be the random variable denoting expected breakoff at position $$i$$. Then we get the recurrence:

$$\mathbb{E}(B_1) = \frac{r\epsilon}{n-1}$$
$$\mathbb{E}(B_{i<n}) = \frac{\epsilon}{n-1}(r – \sum_{k=1}^{i-1}\mathbb{E}(B_k))$$

Let’s see if we can do better substituting in values:

$$\mathbb{E}(B_2) = \frac{\epsilon}{n-1}\biggl(r – \frac{r \epsilon}{n-1}\biggr) = \frac{r\epsilon}{n-1}\biggl(1 – \frac{\epsilon}{n-1}\biggr)$$
$$\mathbb{E}(B_3) = \frac{\epsilon}{n-1}\biggl(r – \bigl(\frac{r\epsilon}{n-1}\bigl(1 – \frac{\epsilon}{n-1}\bigr)\biggr) = \frac{r\epsilon}{n-1}\biggl(1 – \frac{\epsilon}{n-1} + \bigl(\frac{\epsilon}{n-1}\bigr)^2\biggr)$$
$$\mathbb{E}(B_4) = \frac{\epsilon}{n-1}\biggl(r – \frac{r\epsilon}{n-1}\bigl(1 – \frac{\epsilon}{n-1} + \bigl(\frac{\epsilon}{n-1}\bigr)^2\bigr)\biggr) = \frac{r\epsilon}{n-1}\biggl(1 – \frac{\epsilon}{n-1} + \bigl(\frac{\epsilon}{n-1}\bigl)^2 – \bigl(\frac{\epsilon}{n-1}\bigr)^3\biggr)$$
$$\mathbb{E}(B_{i<n}) = \frac{r\epsilon}{n-1}\sum_{k=0}^{i-1}\bigl(-\frac{\epsilon}{n-1}\bigr)^k$$

Recall the closed-form formula for a geometric series, shamelessly ripped off Wikipedia:

$$\sum_{k=0}^{n-1}ar^k = a\frac{1-r^n}{1-r}$$.

Since we’re overloading variable names, let’s be clear about what represents what: $$a_{wiki} \equiv \frac{r\epsilon}{n-1}$$, $$r_{wiki}\equiv \frac{-\epsilon}{n-1}$$, $$n_{wiki}\equiv i$$. That means our closed form for $$\mathbb{E}(B_{i<n})$$ is $$\frac{r\epsilon}{n-1}\times\frac{1 – \bigl(\frac{-\epsilon}{n-1}\bigr)^i}{1 – \frac{-\epsilon}{n-1}}$$, which we can make prettier as $$\frac{r\epsilon}{n-1}\times\biggl(1 – \bigl(\frac{-\epsilon}{n-1}\bigr)^i\biggr)\times\frac{n-1}{n – 1 + \epsilon}$$ and reduce to a still hideous $$\frac{r\epsilon}{n-1+\epsilon}\biggl(1 – \bigl(\frac{-\epsilon}{n-1}\bigr)^i\biggr)$$. No lie, I remember it being less formidable looking in my notes, but they’re at my desk, and I’m pretty sloppy with the algebraic manipulation, so I welcome any spot-checking.

So what does this mean? We have a formula for the expected number of respondents who break off at each index. Now we can at least use the Markov inequality to flag really egregious positional breakoff. Since the Markov inequality is very coarse, I have little doubt that researchers would be able to spot the cases it would find. That is, Markov doesn’t give us much value for small data sets (note that it’s still useful as a feature in an automated tool).

In order to get tighter bounds, we are going to need to know more about the distribution. We can use Chebyshev’s inequality if we know the variance. We can get a Chernoff-like bound by substituting into Chebyshev’s Inequality the expectation for the moment generating function, if we have it.

I think this process is similar to a Martingale, but I will have to spend more time thinking about this. Clearly the total expected breakoff is the same. The number of participants in the survey is expected to decay. It would be great to get some tight bounds on this. Once we can characterize breakoff for flat surveys, we have a prior on the role position plays and can use this information for diagnosing breakoff in more structured surveys. I haven’t considered an equivalent analysis for question-based breakoff yet.

For now, in the analyses, we just stick with reporting the top locations and questions for abandonment/breakoff.

Research Question 3: Is SURVEYMAN able to identify random or inattentive respondents?

I am working on a more in-depth analysis to help answer this question. It’s going to need a whole blog post to itself. We have some gold-standard data from Joe and company for the phonology survey, and there are some heuristics I can use for the prototypicality survey. Graphs forthcoming!