
SurveyMan is a language and runtime for designing, debugging, and deploying surveys on the web at scale. For more information, see surveyman.org

The pricing problem in SurveyMan

This is going to be a short post that I’ll expand on more, post-portfolio…

One of the nice features of AutoMan is that it manages the pricing of a task for you. The user only needs to specify a maximum amount they’re willing to pay, and AutoMan will return the result at the optimal price. It first computes the number of agreeing responses needed to have high confidence that the result is correct. Then it starts at an initial baseline assignment duration (i.e. the time expected to complete the task). The initial duration may be provided by the user; the default setting is 30 seconds. AutoMan uses this time to compute the wage, which is tied to the US federal minimum wage. The assignment is posted on Mechanical Turk for some lifetime set to one hundred times the task duration. If no results come back during that lifetime, it doubles the task time and reposts the job.
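A minimal sketch of that repricing loop, assuming a callable that posts the task and one that checks for sufficient agreement (Python; the names and constants below are illustrative, not AutoMan's actual Scala implementation):

```python
# Illustrative sketch only; AutoMan is written in Scala and its real logic
# (confidence calculation, MTurk API calls) is not reproduced here.
FED_MIN_WAGE_PER_HOUR = 7.25   # wage is tied to the US federal minimum wage
LIFETIME_MULTIPLIER = 100      # lifetime is one hundred times the task duration

def run_task(post_task, enough_agreement, max_budget, initial_duration_s=30):
    duration_s = initial_duration_s
    spent = 0.0
    while spent < max_budget:
        reward = FED_MIN_WAGE_PER_HOUR * duration_s / 3600.0
        lifetime_s = LIFETIME_MULTIPLIER * duration_s
        responses, cost = post_task(reward, duration_s, lifetime_s)
        spent += cost
        if enough_agreement(responses):
            return responses       # enough agreeing responses: stop
        duration_s *= 2            # nothing (or not enough) came back: repost
    return None                    # budget exhausted without agreement
```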

Of course, there’s one caveat: since AutoMan relies on there being a single correct answer, if the population is in disagreement, the budget will be used up and no results will be returned.

One of the original motivations for SurveyMan was the idea of returning distributions of results, rather than point estimates. We were also interested in computing end-to-end confidence intervals for chained AutoMan computations. We realized that the underlying structure of chained functions returning distributions exactly modeled surveys. Thus, SurveyMan.

SurveyMan today is quite far from this original motivation. Since we began collaborating with social scientists, we have veered into experimental design and discovery, rather than static analysis or speculative execution of programs whose functions return probabilistic results.

Automatically determining the optimal price is a feature of AutoMan that I’d like to see in SurveyMan. AutoMan’s automated pricing scheme was possible because the system could calculate the number of responses needed to determine whether the answer had been found. This is significantly more challenging for SurveyMan. If we treat a survey as the joint probability distribution over its questions, then determining the sample size of “good” respondents boils down to power analysis. However, the techniques there are somewhat different: power analysis was designed either for cases where the probabilities of certain conditions can be computed directly, or for post hoc analysis, once the data have already been collected. A first pass at what we’re really looking for is an online algorithm that determines when the joint distribution over the questions has converged. I don’t think this is a bad start, but I worry about a decision procedure that relies on such complex data. For example, if we consider a survey that’s flat, we treat each question as exchangeable. This does not, however, mean that the questions are independent. Suppose for a moment that they were; then we could say that the survey is a random variable defined to be the sum of the random variables representing the questions:

$$ S = Q_{1} + Q_{2} + \ldots + Q_{n} $$

This survey has $$n$$ questions and each of the random variables $$Q_i, 1 \leq i \leq n$$ corresponds to the distribution of the answer texts — that is, it has no notion of its own position in the survey.

We could then decide that our stopping condition is the point at which our expectation converges. Since expectation is linear, we can look at the convergence of each question’s distribution and make our stopping decision question by question.
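Under that independence assumption, one rough online stopping rule is to track each question’s empirical answer distribution and stop once none of them is still moving. A sketch, assuming flat multiple-choice questions (Python; the window and tolerance are arbitrary illustrative choices, not anything SurveyMan implements):

```python
from collections import Counter

def empirical_dist(responses):
    """Empirical frequency of each answer option."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {option: c / total for option, c in counts.items()}

def has_converged(responses_by_question, window=50, tol=0.02):
    """Declare convergence if, for every question, dropping the most recent
    `window` responses changes no option's frequency by more than `tol`."""
    for responses in responses_by_question.values():
        if len(responses) <= window:
            return False
        before = empirical_dist(responses[:-window])
        after = empirical_dist(responses)
        options = set(before) | set(after)
        if any(abs(before.get(o, 0.0) - after.get(o, 0.0)) > tol
               for o in options):
            return False
    return True
```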

Okay, so if the questions were actually independent, I don’t see this being such a bad approach. I guess if we assume that the underlying population can be represented as the sum of some unknown number of independent normal distributions, we can say that the mean is a sufficient statistic and call it a day.
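Concretely, the hypothetical reduction here rests on the standard fact that a sum of independent normals is itself normal:

$$ Q_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \text{ independent} \implies S \sim \mathcal{N}\!\left(\sum_{i=1}^{n}\mu_i,\ \sum_{i=1}^{n}\sigma_i^2\right) $$

so under that assumption the sample mean would be the natural statistic to monitor.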

Of course, we have little reason to believe that the questions are independent. While randomizing the order of the questions simplifies our identification of bugs, the situation is quite different when we consider convergence of the distribution. We would need to consider each instance of the survey as a Bayesian network, in which the preceding questions are parents of the questions that follow. We already perform pairwise correlation tests; we could use some of this information to test for independence. If we could simplify the model sufficiently, we might be able to converge on an optimal number of samples. We could also use the independence assumption to calculate a lower bound on the number of responses needed.
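For illustration, here is what such a lower bound might look like under the (admittedly strong) independence assumption: in a flat survey without breakoff every respondent answers every question, so the survey needs at least as many “good” respondents as its hardest-to-estimate question. A sketch using the usual worst-case normal-approximation bound for a proportion (Python; the margin and confidence level are arbitrary choices, not SurveyMan defaults):

```python
import math
from statistics import NormalDist

def min_responses_per_question(margin=0.05, confidence=0.95):
    """Worst-case (p = 0.5) sample size needed to estimate any one answer
    option's frequency to within `margin` at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * 0.25 / margin ** 2)

# In a flat survey with no breakoff, every respondent answers every question,
# so this per-question figure is also a lower bound for the whole survey.
print(min_responses_per_question())   # 385 with the defaults above
```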

Anyway, this was meant to be a short post — the idea is to present some of the difficulties of automatically determining pricing in SurveyMan. The point is that the pricing mechanism itself depends on knowing how many “good” responses we need and answering that is hard. Even if we could answer that question (and we should — it’s certainly been on the minds of our colleagues in linguistics), we would then need to consider the effects of allowing breakoff, which further complicates things.

Adversaries

Bad actors are a key threat to validity that cannot be controlled directly through better survey design. That is, unlike bias in wording or ordering, we cannot eliminate the problem through the design of the survey itself. What we can do is use the design to make it easier to identify these adversaries.

Bots

Bots are computer programs that fill out surveys automatically. We assume that bots have a policy for choosing answers that is either completely independent of the question, or is based upon some positional preference.

No positional preference: A bot that chooses responses randomly is an example of one that answers questions independently of their content.

Positional preference: A bot that always chooses the first option, or always chooses the last option, or alternates positions on the basis of the number of available choices: for example, “Christmas tree-ing” a multiple-choice survey.
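To make the two bot types concrete, here is a toy version of each answer policy (Python; purely illustrative):

```python
import itertools
import random

def random_bot(options):
    """No positional preference: pick an option uniformly at random,
    ignoring both question and answer content."""
    return random.randrange(len(options))

def first_option_bot(options):
    """Positional preference: always the first option (always-last is
    symmetric)."""
    return 0

def christmas_tree_bot():
    """Positional preference that alternates: zig-zag between the first and
    last option on successive questions."""
    side = itertools.cycle([0, -1])
    def answer(options):
        return 0 if next(side) == 0 else len(options) - 1
    return answer
```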

Lazy Respondents

We define a lazy respondent as a human who is behaving in a bot-like way. In the literature these individuals are called spammers; according to a study from 2010, almost 40% of the population sampled failed a screening task that only required basic reading comprehension. There are two key differences between human adversaries and software adversaries: (1) we hypothesize that individual human adversaries are less likely to choose responses randomly, and (2) that when human adversaries have a positional preference, they are more likely to make small variations in their otherwise consistent responses. Regarding (1), while there is no end to the number of studies and amount of press devoted to humans’ inability to identify randomness, there has been some debate over whether humans can actually generate sequences of random numbers. Regarding (2), while a bot can be programmed to make small variations in positional preference, we believe that humans will make much more strategic deviations in their positional preferences.

Both humans and bots may have policies that depend on the surface text of a question and/or its answer options. A policy that chooses answers on the basis of surface text might, for example, prefer the lexicographically first option, or always choose options whose surface strings match some pattern (e.g., contain “agree”). These adversaries are significantly stronger than the ones mentioned above.
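For comparison with the positional bots above, two toy surface-text policies (Python; again illustrative only):

```python
def lexicographic_bot(options):
    """Choose the option whose surface text sorts first lexicographically."""
    return min(range(len(options)), key=lambda i: options[i])

def agree_bot(options):
    """Choose the first option whose surface text mentions 'agree'
    (note this also matches 'disagree'); otherwise fall back to the first."""
    for i, text in enumerate(options):
        if "agree" in text.lower():
            return i
    return 0
```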

It’s possible that some could see directly modeling a set of adversaries as overkill; after all, services such as AMT rely on reputable respondents for their systems to attract users (or not?). While AMT has provided means for requesters to filter the population, this system can easily be gamed. This tutorial from 2010 describes best practices for maximizing the quality of AMT jobs. Unfortunately, injecting “attention check” or gold standard questions is insufficient to ward off bad actors. Surveys are a prime target for bad actors because the basic assumption is that the person posting the survey doesn’t know what the underlying distribution of answers ought to look like — otherwise, why would they post a survey? Sara Kingsley recently pointed us to an article from All Things Considered. Emery found the following comment:

I’ve been doing Mechanical Turk jobs for about 4 months now.

I think the quality of the survey responses are correlated to the amount of money that the requester is paying. If the requester is paying very little, I will go as fast as I can through the survey making sure to pass their attention checks, so that I’m compensated fairly.

Conversely, if the requester wants to pay a fair wage, I will take my time and give a more thought out and non random response.

A key problem that the above quote illustrates is that modeling individual users is fruitless. MACE is a seemingly promising tool that uses post hoc generative models of annotator behavior to “learn whom to trust and when.” This work notably does not cite prior work by Panos Ipeirotis, which modeled users with EM and considered variability in workers’ annotations.

The problem with directly modeling individual users is that it cannot account for the myriad latent variables that lead a worker to behave badly. In order to do so, we would need to explicitly model every individual’s utility function. This function would incorporate not only the expected payment for the task, but also the workers’ subjective assessment of the ease of the task, the aesthetics of the task, or their judgement of the worthiness of the task. Not all workers behave consistently across tasks of the same type (e.g. annotations), let alone across tasks of differing types. Are workers who accept HITs that cause them dissatisfaction more likely to return the HIT, or to complete the minimum amount of work required to convince the requester to accept their work?

On Keeping the Survey a DAG

A topic that came up during my SurveyMan lab talk in October was our lack of support for looping questions. Yuriy had raised the objection that there will be cases where we want to repeat a question, such as providing information on employment. We argued that, since we were emulating paper surveys (at the time), the user could provide an upper bound on the number of entries and ask the respondent whether they wanted to add another entry for a category. A concern I had was that, since we’re interested in the role of survey length in the quality of responses, and since we allow breakoff, a loop in a question makes it much more difficult to tell whether the question or the length of the survey is the problem. Where previously we treated each question as a random variable, we would now need to model a repeating question as a sum of an unknown number of random variables.

The probability model of a survey with a loop differs from the model of a survey without one. Note that while both random variables corresponding to the responses to question Q2 may be modeled by the same distribution, they will have different parameters.
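In the notation of the earlier post, one way to write the model for a flat survey in which question $$Q_2$$ may repeat is as a random sum:

$$ S = Q_{1} + \sum_{i=1}^{N} Q_{2}^{(i)} + Q_{3} + \ldots + Q_{n} $$

where $$N$$, the number of repetitions, is itself a random variable, and the $$Q_{2}^{(i)}$$ may share a distributional family while having different parameters, as noted in the caption above.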

This issue came up again during the OBT talk. The expanded version of Topsl that appeared in the PLT Redex book described a semantics for a survey that was allowed to have these kinds of repeated questions.

We do not think it is appropriate to model such questions as loops. Unbounded loops are what make it possible to express arbitrary computable functions; since the questions these loops would model are more accurately described as sequences of finite but unknown length, we do not want to encode the ability to loop forever.

Aside from this semantic difference, we see another problem with the potentially perpetual loop. Consider the use case for such a question: in the lab talk, it was Yuriy’s suggestion that we allow people to enter an employment history of unknown length. In the case of Topsl, it was self-reporting relationship history. If a respondent’s employment or relationship history is very long, they may be tempted to under-report the number of instances. This might be curtailed if the respondent is required to first answer* a question that asks for the number of jobs or relationships they** have had. Then responses in the loop could be correlated with the previous question, or the length of the loop could be bounded. In our setting, where we do not allow respondents to skip questions, the former would need to be implemented if we were to allow loops at all.

Alternatively, instead of presenting each response to what is semantically the same question as if it were a separate question, we could first ask for the number of jobs or relationships, and then ask a follow-up question on a page that takes the response to the previous question and displays that number of text boxes. We would still bound the total number of responses, but instead of presenting each entry as a separate question, we would present them all as a single question.
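A sketch of this alternative, where the answer to the count question parameterizes a single follow-up question with that many text boxes (Python; the dictionary format below is made up for illustration and is not SurveyMan’s actual input language):

```python
MAX_ENTRIES = 10   # the upper bound the survey author still has to supply

def followup_question(count_response):
    """Build one follow-up question containing `count_response` text boxes
    (capped at MAX_ENTRIES), instead of a loop of separate questions."""
    n = min(int(count_response), MAX_ENTRIES)
    return {
        "text": "Please list each job you have held.",
        "fields": [{"type": "freetext", "label": f"Job {i + 1}"}
                   for i in range(n)],
    }
```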

In the analysis of a survey we ran, we found statistically significant breakoff at the freetext question. We’d like to test whether freetext questions in general are correlated with high breakoff. If this is the case, we believe it provides further evidence that “loop questions” are better implemented with the approach described here.

* I just wanted to note that I love splitting infinitives.
** While I’m at it, I also support gender-neutral pronouns. Political grammar FTW!