When we allow breakoff, we typically tell respondents that they will be paid a bonus commensurate with the amount and quality of work they do at the end of the study. We don’t typically inform them of when the study will end, since we may need to re-run surveys and we’d like to keep respondents from trying to game our algorithms. The experiments we did in the fall that awarded bonuses used a tiered system. The ones listed here are using respondent’s scores to identify the top 95% of scores and award those respondents two cents per question. I’d like to investigate more sophisticated pricing schemata with Sara in the future.
Wage Survey
It was my intention to only have one wage survey running on AMT. However, as I’ve been porting the Python analyses into Clojure, I found that I actually had three instances running. Given the expiration dates on the latter two, I’m pretty sure they were posted accidentally. I should probably consider asking the user if their sure they want to continue, or have a safe mode that asks a million times “are you sure you want to do X?” so future users don’t make this mistake. There’s also small possibility that when I extended the original HIT, it somehow spawned two new HITs instead. This isn’t documented anywhere, but it’s something I probably want to double check on sandbox.
So we had three surveys running. At the time of our OOPSLA submission, we had the wage survey running for about four days and only accrued 69 responses. I extended that HIT twice. It expired Mar 26 2014, 04:42 PM PDT. The other two HITs had expiration dates of Mar 28 2014, 04:44 PM PDT and Mar 30 2014, 07:22 PM PDT. Each HIT requested 150 assignments, and paid $0.10 base wage per survey. Between the three surveys, we collected 154 responses. Under normal circumstances, I wouldn’t have three of the same HIT running concurrently — a feature I might consider adding to SurveyMan is a check for whether a survey with similar parameters has been posted before. If I implement a “safe mode” version of SurveyMan, I could ask then ask the user if they really want to post this survey.
Anyway, the point is that because I had three versions running, I had repeaters. We had only 132 unique respondents. I typically exclude repeaters from the analyses, since we tell them to return the surveys if they haven’t taken them before. After running our new dynamic analyses report, I found that 98 respondents were classified as bad actors. I had a similarly high percentage in the Python analysis and wasn’t confident that it was correct. Since we hadn’t tested the effects of breakoff on bot classification in simulation, I was hesitant to make any strong assertions about these classifications, without investigating further. Furthermore, since we had so few data points at the time of the OOPSLA submission, we decided to simply report qualitative results.
Examining the larger data set, we found that the maximum number of questions answered by any one respondent was still 26. I’ll leave a more thorough analysis of our quality control to another blog post, but the results were quite interesting and corroborated some of my suspicions about the the original set of data. In any case, the number of questions bad actors answered had a high of 18 and a low of 2 (interestingly, those who only answered one question were confined to repeaters. Inspecting manually, I saw 10 repeaters in total, of whom only one appeared to be a legitimate bad actor.).
Using the federal minimum wage of $7.25 an hour and an estimated 10 seconds per question, we should award $0.02 per question. Since we already awarded a base pay of $0.10, we subtracted this quantity off the total payment calculated for honest respondents. The static analyses gave us an average path length of 41 questions and an expected payment of $0.825694 for each respondent who answered the survey to completion. Since we requested 150 responses, if every respondent were categorized as honest and answered the survey to completion, it would cost us $123.8541, not counting the AMT commission. AMT charges a 10% commission on both the base pay and the bonuses. This gives us a total expected cost of $136.23951.
Our actual costs were much lower (although the quality of the data was presumably also lower). Our 154 respondents cost us $22.53 in base pay, including AMT commission. We calculated $7.78 in bonuses to be awarded, costing us $8.558 for the commission. In total, the experiment under these conditions cost us $31.088.
Prototypicality
The next survey we’ll look at is Presley’s prototypicality survey. We had 149 responses total. The survey had an average path length of 17 questions. The estimated base price for honest respondents who answer the survey in full is $0.342361 per survey. Our expected cost for all honest respondents, plus AMT commission is $56.489565. We classified 65 respondents as bad actors, and 84 as valid responses. These results differ from our initial reported results due to how we calculate the frequencies for questions that are variants. In our Python code, we only compared questions that were exactly the same — that is, we didn’t unify the distributions of the variants. In the Clojure implementation, we first remove bots, and assume that there is no statistical difference between the questions, unifying all the variants. We’ll leave a discussion of the pros and cons of this approach to a future blog post.
The result of unifying variants is that significantly more responses are classified as bots, but none of the variants are flagged as being drawn from different distributions! I’ll have to double check to make sure that the variant code is running as expected, but since my unit test for flagging variants seems to work, I’m going to assume that the differences we detected were due to outliers. We can discuss what might be going on here in a later blog post, since this one is mostly about pricing. The bonuses to be paid amount to $20.32. With AMT commission this is $22.35. In total, this survey costs $38.742.
Choose Randomly
I ran a survey that wasn’t featured in our OOPSLA paper because it was one I made up entirely. The idea was to post a survey with two floating blocks, where both asked respondents to choose one of the responses randomly, but the one block had identical options, whereas the other had arbitrary categories of things. I wanted to see how random people could actually be. I also wanted to track timing information. One of the things I noticed about this survey, which I posted on a Wednesday morning, was that I was able to collect all of my responses within 90 minutes. The time per question here is clearly less than 10s. If it takes about 2 seconds per question, then the expected cost for a completed survey would be about $0.08. As a result, I decided to not award bonuses. The total cost was $16.50.
Phonology
Finally we have our classic phonology survey. With an average path length of 99 questions, we would expect to pay $1.993750 per survey. With a target of 150 responses, our total cost, including commission, would be $328.96875.
We ran this survey 3 times (not counting our preliminary run). The details of each run can be found in an earlier post. We collected 395 responses total and had 311 unique respondents. 22 respondents accounted for the 84 duplicate responses. This initial run cost us $43.45. 182 responses were classified as valid responses. The total bonuses to be paid were calculated to be $327.96. Factoring in AMT commission, this comes to $360.756. In all, this survey will have cost us $404.206.
Notes on timing
Some of the the timing data we returned was flawed, so we couldn’t use that information to improve our payment scheme. One possible use of this information would be to use it as a proxy for the difficulty the question and vary the payment in accordance with this information.