A Critique of: “To Play or not to Play: Interactions between Response Quality and Task Complexity in Games and Paid Crowdsourcing”

M. Krause and R. F. Kizilcec, “To Play or not to Play: Interactions between Response Quality and Task Complexity in Games and Paid Crowdsourcing,” 2015.

Devil’s advocate: Will Ellis

Summary

In this paper, Krause and Kizilcec ask the research questions, “Is the response quality higher in games or in paid crowdsourcing?” and, “How does task complexity influence the difference in response quality between games and paid crowdsourcing?” To answer these questions, the authors devise and carry out an experiment testing four experimental treatments among 1,262 study participants. Each experimental group performs either a simple or a complex task set, either as a web browser game or as paid crowdwork. Because participants self-selected into each treatment and were sourced from online platforms (Newgrounds and Kongregate for players, CrowdFlower for workers) rather than recruited from a general population and assigned an experimental treatment, the number of participants in each group varies widely. However, for each group, 50 participants were selected at random for analysis.

The authors employed human judges to assess the quality of the selected participants’ responses and used these judgments to form their conclusions. The simple task consisted of labeling images. The authors employed the ESP game as the gamified version of this task, having participants earn points by guessing the most-submitted labels for a particular image. Paid crowdworkers were simply instructed to label each image and were given feedback on their performance. The complex task consisted of participants generating “questions” for given text excerpts, in the manner of the game show Jeopardy. Indeed, the authors employed a Jeopardy-like interface in the gamified version of the task. Players selected text excerpts of a particular category and difficulty from a table and attempted to generate questions, which were automatically graded for quality (though not against “ground truth”). Paid crowdworkers, on the other hand, were given each text in turn and asked to create a question for it. Answers were evaluated in the same automated way as in the gamified task, and workers were given feedback with the opportunity to revise their answers.

In their analysis of the data, the authors found that while there was no statistically significant difference in quality between players and workers for the simple task, there was a statistically significant 18% increase in response quality for players over workers for the complex task. The authors posit that the reason for this difference is that, since players choose to play the game, they are interested in the task itself for its entertainment value. Workers, on the other hand, choose to do the task for monetary reward and are less invested in the quality of their answers. While workers can produce quality work on simple tasks with little engagement, higher quality work on complex tasks can be achieved by gamifying those tasks and recruiting interested players.

Critique

The authors’ conclusions rest in large part on data gathered from the two complex task experiments, which ask participants to form Jeopardy-style “questions” as “answers” to short article excerpts. This is meant to contrast with the simple task experiments using the ESP game, which was developed as a method for doing the useful work of labeling pictures. However, the authors do not justify that the Jeopardy game, serving as the complex task experimental condition, is an appropriate counterpart to the ESP game.

The ESP game employs as its central mechanic an adaptation of Family Feud-style word guessing. It is a tried-and-true game mechanic with the benefit that it can be harnessed for the useful work of labeling images with keywords, as discussed in [von Ahn and Dabbish, 2004]. On the surface, the authors’ use of the Jeopardy game mechanic seems similar, but I believe they have failed to use it appropriately in two ways that ultimately weaken their conclusions. First, the mechanic itself seems poorly adapted to the work. A text excerpt from an article is not a Jeopardy-style “answer”, and one need only read the examples in the paper to see that the “questions” participants produce from those answers make no sense in the “Jeopardy” context. Such gameplay did induce engagement in self-selected players, producing quality answers in the process, but it should not be surprising that, in the absence of the game, this tortured mechanic failed to engage workers and thus failed to produce answers of quality equal to those from the entertainment-incentive experimental condition.

This leads into what I believe is the second shortcoming of the experiment: the complex task, as paid work, is unclear and produces nothing of obvious value, both of which likely erode worker engagement. Put yourself in the position of someone playing the game version of this task, and assume that, after a few questions, you find it fun enough to keep playing. You figure out the strategies that earn higher scores, you perform better, and your engagement is reinforced. Now put yourself in the position of a worker. You’re asked, in the style of Jeopardy, to “Please write down a question for which you think the shown article is a good response for.” From the paper, it’s clear you’re not then presented with a “Jeopardy”-style answer but instead the first sentence of a news article. This is not analogous to answering a Jeopardy question, and what you write has no clear or even deducible purpose. It is little wonder that, in an effort to complete the task, bewildered workers would do only what is necessary to get their work approved. Compare this to coming up with a keyword for an image, as in the simple paid experimental condition. In that task, what is expected is much clearer, and even a modestly computer-literate worker could surmise that the benefit of their work is improved image labeling. In short, while it may indeed be the simplicity of a task that induces paid workers to produce higher quality work and the complexity of a task that causes them to produce lower quality work, this experiment may only show that workers produce lower quality work on confusing and seemingly pointless tasks. A better approach may be, as with the ESP game, to turn complex work into a game instead of trying to turn a game into complex work.
