Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk

Tanushree Mitra, C.J. Hutto, and Eric Gilbert. 2015. Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, New York, NY, USA, 1345-1354. DOI=10.1145/2702123.2702553 http://doi.acm.org/10.1145/2702123.2702553

Discussion Leader: Adam

Summary

Much of what we have read so far regarding quality control has focused on post hoc methods of cleaning data, where you either take multiple responses and find the best one, or iterate on a response to continually improve it. This paper takes a different approach, using person-centric a priori methods to improve data quality. What the authors are essentially trying to find out is whether non-expert crowdworkers can be screened, trained, or incentivized, prior to starting the task, to improve the quality of their responses.

To do this, the authors used four different subjective qualitative coding tasks to examine the effects of various interventions or incentives on data quality. People in Pictures asks workers to identify the number of people in a picture, giving five different ranges for workers to choose from. Sentiment Analysis had workers rate the positive or negative sentiment of tweets on a five-point scale. Word Intrusion had workers select, from a list of five words, the one that doesn’t belong with the rest. Finally, Credibility Assessment tasked workers with rating the credibility of a tweet about some world event.

The authors used three different means to intervene with or incentivize the selected workers. Targeted screening gave workers a reading comprehension qualification test. Training showed workers some examples of qualitative coding annotations and required them to correctly complete a few practice annotations before beginning the actual tasks. Finally, a bonus rewarded workers with double the pay if their response matched the majority of workers’ responses. A second experiment varied the ways in which workers qualified for the bonus.
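As a rough illustration of how a majority-agreement bonus might be computed (a sketch only, not the authors’ actual implementation; the function name and payment values are hypothetical):

    from collections import Counter

    def award_majority_bonuses(responses, base_pay):
        # responses: dict mapping worker_id -> chosen label for one item
        # base_pay: payment per item before any bonus
        counts = Counter(responses.values())
        majority_label, _ = counts.most_common(1)[0]
        # Workers whose label matches the majority earn double pay.
        payouts = {worker: base_pay * 2 if label == majority_label else base_pay
                   for worker, label in responses.items()}
        return majority_label, payouts

    # Example: five workers rate one tweet on a five-point sentiment scale.
    labels = {"w1": "positive", "w2": "positive", "w3": "neutral",
              "w4": "positive", "w5": "negative"}
    print(award_majority_bonuses(labels, base_pay=0.05))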

In general, the authors found that these a priori strategies effectively increased the quality of worker data, with the financial incentives having the least effect on quality. For the first two tasks, nearly all methods provided statistically significant improvements in quality over both the control-with-financial-bonus condition and the baseline, with the combination of screening, training, and bonuses providing the highest quality for each task. Additionally, these a priori methods provided higher-quality data than iteration in the Credibility Assessment task, though not statistically significantly so.

Reflections

This paper provides many interesting results, some of which the authors did not really discuss. The most important takeaway from this paper is that a priori intervention methods can provide data of just as high quality as, if not higher than, process-centric methods such as iteration. This is significant because of how scalable a priori methods are: you need only screen or train someone once, and then they will provide high-quality data for as long as they work on that task. With process-centric methods, you must run the process for each piece of data you collect, increasing overhead.
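To make the scalability point concrete, here is a back-of-the-envelope sketch with invented numbers (none of these figures come from the paper): a one-time screening or training cost per worker versus a per-item cost for a process such as iteration.

    def person_centric_cost(n_items, n_workers, screening_cost, cost_per_label):
        # Screening/training is paid once per worker; each item is then labeled once.
        return n_workers * screening_cost + n_items * cost_per_label

    def process_centric_cost(n_items, cost_per_label, passes_per_item):
        # Each item is labeled or iterated on multiple times to reach quality.
        return n_items * cost_per_label * passes_per_item

    # Hypothetical numbers: 10,000 items, 50 screened workers, $0.25 per screening,
    # $0.05 per label, and 3 iterative passes per item in the process-centric case.
    print(person_centric_cost(10_000, 50, 0.25, 0.05))   # 512.5
    print(process_centric_cost(10_000, 0.05, 3))         # 1500.0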

However, there are many other results worth discussing. One is that the authors found that control-condition quality has increased significantly over the past several years, indicating that AMT workers are generally providing higher-quality results than before. A few years ago, accuracies for control conditions with a 20% chance of being randomly correct peaked at about 40%, while in this paper the control accuracies were between 55% and 80%. The authors attribute this to better person-centric quality-control measures enacted by Amazon, such as stricter residency requirements and CAPTCHA use, but I wonder if that is truly the case.

One interesting result that the authors do not really discuss is that, in all three tasks from experiment 1, the control condition with the bonus incentive performed worse than the control group without the financial bonus. Additionally, the baseline group, which screened workers based on the standard 95% approval rating and 100-HIT experience requirement, performed worse than the unrestricted control group for each of the three tasks. Maybe new workers tend to provide high-quality data because they are excited about trying something new? This seems like an important issue to look into, as many tasks on AMT use these basic screening methods.

Finally, I find it interesting that financial incentives produced no statistically significant improvement in quality beyond the screening or training interventions. I suppose this fits with some of our previous discussions: increasing pay will attract more workers more quickly, but once someone decides to do a HIT, the amount of money offered does not affect the quality of their work.

Questions

  • Why has the general quality of workers on AMT improved over the past few years?
  • Can you think of any other intervention or incentive methods that fit this person-centric approach?
  • While these tasks were subjective, they still had a finite number of possible responses (5). Do you think these methods would improve the quality of free-response tasks? And how would you judge that?
  • Do you think these methods can replace process-centric quality control altogether, or will we always need some form of data verification process?
