Get money, get paid.

When we allow breakoff, we typically tell respondents that they will be paid a bonus commensurate with the amount and quality of work they do at the end of the study. We don’t typically inform them of when the study will end, since we may need to re-run surveys and we’d like to keep respondents from trying to game our algorithms. The experiments we did in the fall that awarded bonuses used a tiered system. The ones listed here are using respondent’s scores to identify the top 95% of scores and award those respondents two cents per question. I’d like to investigate more sophisticated pricing schemata with Sara in the future.

Wage Survey

It was my intention to only have one wage survey running on AMT. However, as I’ve been porting the Python analyses into Clojure, I found that I actually had three instances running. Given the expiration dates on the latter two, I’m pretty sure they were posted accidentally. I should probably consider asking the user if their sure they want to continue, or have a safe mode that asks a million times “are you sure you want to do X?” so future users don’t make this mistake. There’s also small possibility that when I extended the original HIT, it somehow spawned two new HITs instead. This isn’t documented anywhere, but it’s something I probably want to double check on sandbox.

So we had three surveys running. At the time of our OOPSLA submission, we had the wage survey running for about four days and only accrued 69 responses. I extended that HIT twice. It expired Mar 26 2014, 04:42 PM PDT. The other two HITs had expiration dates of Mar 28 2014, 04:44 PM PDT and Mar 30 2014, 07:22 PM PDT. Each HIT requested 150 assignments, and paid $0.10 base wage per survey. Between the three surveys, we collected 154 responses. Under normal circumstances, I wouldn’t have three of the same HIT running concurrently — a feature I might consider adding to SurveyMan is a check for whether a survey with similar parameters has been posted before. If I implement a “safe mode” version of SurveyMan, I could ask then ask the user if they really want to post this survey.

Anyway, the point is that because I had three versions running, I had repeaters. We had only 132 unique respondents. I typically exclude repeaters from the analyses, since we tell them to return the surveys if they haven’t taken them before. After running our new dynamic analyses report, I found that 98 respondents were classified as bad actors. I had a similarly high percentage in the Python analysis and wasn’t confident that it was correct. Since we hadn’t tested the effects of breakoff on bot classification in simulation, I was hesitant to make any strong assertions about these classifications, without investigating further. Furthermore, since we had so few data points at the time of the OOPSLA submission, we decided to simply report qualitative results.

Examining the larger data set, we found that the maximum number of questions answered by any one respondent was still 26. I’ll leave a more thorough analysis of our quality control to another blog post, but the results were quite interesting and corroborated some of my suspicions about the the original set of data. In any case, the number of questions bad actors answered had a high of 18 and a low of 2 (interestingly, those who only answered one question were confined to repeaters. Inspecting manually, I saw 10 repeaters in total, of whom only one appeared to be a legitimate bad actor.).

Using the federal minimum wage of $7.25 an hour and an estimated 10 seconds per question, we should award $0.02 per question. Since we already awarded a base pay of $0.10, we subtracted this quantity off the total payment calculated for honest respondents. The static analyses gave us an average path length of 41 questions and an expected payment of $0.825694 for each respondent who answered the survey to completion. Since we requested 150 responses, if every respondent were categorized as honest and answered the survey to completion, it would cost us $123.8541, not counting the AMT commission. AMT charges a 10% commission on both the base pay and the bonuses. This gives us a total expected cost of $136.23951.

Our actual costs were much lower (although the quality of the data was presumably also lower). Our 154 respondents cost us $22.53 in base pay, including AMT commission. We calculated $7.78 in bonuses to be awarded, costing us $8.558 for the commission. In total, the experiment under these conditions cost us $31.088.

Prototypicality

The next survey we’ll look at is Presley’s prototypicality survey. We had 149 responses total. The survey had an average path length of 17 questions. The estimated base price for honest respondents who answer the survey in full is $0.342361 per survey. Our expected cost for all honest respondents, plus AMT commission is $56.489565. We classified 65 respondents as bad actors, and 84 as valid responses. These results differ from our initial reported results due to how we calculate the frequencies for questions that are variants. In our Python code, we only compared questions that were exactly the same — that is, we didn’t unify the distributions of the variants. In the Clojure implementation, we first remove bots, and assume that there is no statistical difference between the questions, unifying all the variants. We’ll leave a discussion of the pros and cons of this approach to a future blog post.

The result of unifying variants is that significantly more responses are classified as bots, but none of the variants are flagged as being drawn from different distributions! I’ll have to double check to make sure that the variant code is running as expected, but since my unit test for flagging variants seems to work, I’m going to assume that the differences we detected were due to outliers. We can discuss what might be going on here in a later blog post, since this one is mostly about pricing. The bonuses to be paid amount to $20.32. With AMT commission this is $22.35. In total, this survey costs $38.742.

Choose Randomly

I ran a survey that wasn’t featured in our OOPSLA paper because it was one I made up entirely. The idea was to post a survey with two floating blocks, where both asked respondents to choose one of the responses randomly, but the one block had identical options, whereas the other had arbitrary categories of things. I wanted to see how random people could actually be. I also wanted to track timing information. One of the things I noticed about this survey, which I posted on a Wednesday morning, was that I was able to collect all of my responses within 90 minutes. The time per question here is clearly less than 10s. If it takes about 2 seconds per question, then the expected cost for a completed survey would be about $0.08. As a result, I decided to not award bonuses. The total cost was $16.50.

Phonology

Finally we have our classic phonology survey. With an average path length of 99 questions, we would expect to pay $1.993750 per survey. With a target of 150 responses, our total cost, including commission, would be $328.96875.

We ran this survey 3 times (not counting our preliminary run). The details of each run can be found in an earlier post. We collected 395 responses total and had 311 unique respondents. 22 respondents accounted for the 84 duplicate responses. This initial run cost us $43.45. 182 responses were classified as valid responses. The total bonuses to be paid were calculated to be $327.96. Factoring in AMT commission, this comes to $360.756. In all, this survey will have cost us $404.206.

Notes on timing

Some of the the timing data we returned was flawed, so we couldn’t use that information to improve our payment scheme. One possible use of this information would be to use it as a proxy for the difficulty the question and vary the payment in accordance with this information.

On calculating survey entropy

I’ve been spending the past two weeks converting analyses that were implemented in Python and Julia into Clojure. The OOPSLA Artifact Evaluation deadline is June 1 and moving these into Clojure means that the whole shebang runs on the JVM (and just one jar!).

One of the changes I really wanted to make to the artifact we submit was a lower upper bound on survey entropy. Upper bounds on entropy can be useful in a variety of ways: in these initial runs we did for the paper, I found them useful for comparing across different surveys. The intuition is that surveys with similar max entropies have similar complexity, similar runtimes, similar costs, and similar tolerance to bad behavior. Furthermore, if the end-user were to use the simulator in a design/debug/test loop, they could use max entropy to guide their survey design.

We’ve iterated our calculation of the max entropy. Each improvement has lowered the upper bound for some class of surveys.

Max option cardinality Our first method for calculating maximum entropy of a survey was the one featured in the paper: we find the question with the largest number of options and say that the entropy of the survey must be less than the entropy of a survey having equal number of questions, where every question has this maximum number of answer options. Each option has equal probability of being chosen. For some $$survey$$ having $$n$$ questions, the maximum entropy would then be $$\lceil n \log_2 (\max ( \lbrace \lvert \lbrace o : o \in options(q) \rbrace \rvert : q \in questions(survey) \rbrace ) ) \rceil$$.

The above gives a fairly tight bound on surveys such as the phonology survey. For surveys that have more variance in the number of options proffered to the respondent, it would be better to have a tighter bound.

Total survey question max entropy We’ve had a calculation for total survey question max entropy implemented in Clojure for a few weeks now. For any question having at least one answer option, we calculate the entropy of that question, and sum up all those bits. For some $$survey$$ having $$n$$ questions, where each question $$q_i$$ has $$m_i$$ options, the maximum entropy would then be $$\lceil \sum_{i=1}^n \mathbf{1}_{\mathbb{N}^+}(m_i)\log_2(m_i)\rceil$$

While the total survey question max entropy gives a tighter bound on surveys with higher variance, it is still a bit too high for surveys with branching. Consider the wage survey. In Sara’s initial formulation of the survey (i.e. not the one we ran), the question with the greatest number of answer options was one asking for the respondent’s date of birth. The responses were dates ranging from 1900 to 1996. Most of the remaining questions have about 4 options each:

#/Options#/Questions
#/Options#/Questions
2
872
3581
41691
52101
62971

Clearly in this case, using max option cardinality would not give much information about the entropy of the survey. The max cardinality maximum entropy calculation gives 258 bits, whereas the total survey question max entropy gives 80 bits.

This lower upper bound still has shortcomings, though — it doesn’t consider surveys with branching. For many surveys, branching is used to ask one additional question, to help refine answers. For these surveys, many respondents answer every question in the survey. However, there are some survey that are designed so that no respondent answers every question in the survey. Branching may be used to re-route respondents along a particular path. We used branching in this way when we actually deployed Sara’s wage survey. The translated version of Sara’s survey has two 39-question paths, with a 2-option branch question to start the survey and zero option instructional question to end the survey. This version of the survey has a max cardinality maximum entropy calculation of $$80 * \log_2 97 = 528$$ and a total survey question max entropy of 160 bits (without the ceiling operator, this is approximately equal two two times the entropy of the previous version, plus one bit for the introductory branch question).

The maximum number of bits needed to represent this survey approximately doubled from one version to the next. This isn’t quite right — we know that the maximum path through the survey is 41 questions, not 80. In this case, branching makes a significant difference in the lower bound.

Max path maximum entropy Let’s instead compute the maximum entropy over distinct paths through the survey. We’ve previously discussed the computational complexity of computing distinct paths through surveys. In short, randomization significantly increases the number of possible paths through the survey; if we focus on path through blocks instead, we have more tractable results. Rather than thinking about paths through the survey as distinct lists of questions, where equivalent paths have equivalent lengths and orderings, we can instead think about them as unique sets of questions. This perspective aligns nicely with the invariants we preserve.

Our new maximum entropy calculation will compute the entropy over unique sets of questions and select the maximum entropy computed over this set. Some questions to consider are:

  1. Are joined paths the same path?
  2. If we are computing empirical entropy, should we also consider breakoff? That is, do we need the probability of answering a particular question?

We consider paths that join distinct from each other; the probability of answering that question will sum up to one, if we don’t consider breakoff. As for breakoff, for now let’s ignore it. If we need to compute the empirical entropy over the survey (as opposed to the maximum entropy), then we will use the subset relation to determine which questions belong to which path. That is, if we have a survey with paths $$q_1 \rightarrow q_2 \rightarrow q_4$$ and $$q_1 \rightarrow q_3 \rightarrow q_4$$, then a survey response with only $$q_1$$ answered will be used to compute the path frequencies and answer option frequencies for both paths. The maximum entropy is then computed as $$\lceil max(\lbrace -\sum_{q\in survey} \sum_{o \in ans(q)} \mathbb{P}(o \cap p) \log_2 \mathbb{P}(o \cap p) : p \in paths \rbrace) \rceil$$.

There are two pieces of information we need to calculate before actually computing the maximum entropy path. First, we need the set of paths. Since paths are unique over blocks, we can define a function to return the set of blocks over the paths. The key insight here is that for blocks that have the NONE or ONE branch paradigm, every question in that block is answered. For the branch ALL paradigm, every question is supposed to be “the same,” so they will all have the same number of answer options. Furthermore, since the ordering of floating (randomizable) top level blocks doesn’t matter, and since we prohibit branching from or to these blocks, we can compute the DAG on the totally ordered blocks and then just concatenate the floating blocks onto the unique paths through those ordered blocks.

The second thing we need to compute is $$\mathbb{P}(o \cap p)$$. The easiest way to do this is to take a survey response and determine which unique path(s) it belongs to. If we count the number of times we see option $$o$$ on path $$p$$, the probability we’re estimating is $$\mathbb{P}(o | p)$$. We can compute $$\mathbb{P}(o \cap p)$$ from $$\mathbb{P}(o | p)$$ by noting that $$\mathbb{P}(o \cap p) = \mathbb{P}(o | p)\mathbb{P}(p)$$. This quantity is computed by $$\frac{\# \text{ of } o \text{ on path } p}{\#\text{ of responses on path } p}\times\frac{\#\text{ of responses on path } p}{\text{total responses}}$$, which we can reduce to $$\frac{\# \text{ of } o \text{ on path } p}{\text{total responses}}$$. It should be clear from this derivation that even if two paths join, the entropy for the joined sub path is equal to the case where we treat paths separately.

The maximum entropy for the max path in the wage survey, computed using the current implementation of SurveyMan’s static analyses, is 81 bits — equivalent to the original version of the survey, plus one extra bit for the branching.

Hack the system

The Java/Clojure QC of SurveyMan has a RandomRespondent built in. This RandomRespondent class generates answers to the surveys on the basis of some policy. The policies currently available are uniform random, first option, last option, and Gaussian. I’ve been thinking about some other adversary models I could add to the mix :

  1. Christmas Tree : This is a variant of the uniform random respondent, where the respondent zigzags down the survey in the form of a “Christmas Tree.” zigzag christmas tree
  2. Memory Bot : This is more like an augmentation to one of the existing policies, where the questions and answers are cached, and for each question, the bot checks whether it has answered something like it before. We know that sometimes researchers repeat questions, or have similarly worded questions with the same answers (e.g. year of birth). The goal of this bot would be to identify similarly worded questions and try to give consistent answers.
  3. IR Bot : Alternately, we could search through Google for answers and use those answers as solutions.

It’s fairly trivial to write some Javascript to answer our kind of surveys. Since we now have automated browser testing set up, we should also be able to test collusion in the context of the full pipeline. W00t!