A Critique of: “To Play or not to Play: Interactions between Response Quality and Task Complexity in Games and Paid Crowdsourcing”

M. Krause and R. Kizilcec, “To Play or not to Play: Interactions between Response Quality and Task Complexity in Games and Paid Crowdsourcing,” 2015.

Devil’s advocate: Will Ellis

Summary

In this paper, Krause and Kizilcec ask the research questions, “Is the response quality higher in games or in paid crowdsourcing?” and, “How does task complexity influence the difference in response quality between games and paid crowdsourcing?” To answer these questions, the authors devise and carry out an experiment in which they test four experimental treatments across 1,262 study participants. Each experimental group performs either a simple or a complex task set, and does so either as a web browser game or as paid crowdwork. Because participants self-selected into each treatment and were sourced from online platforms (Newgrounds and Kongregate in the case of players, Crowdflower in the case of workers) rather than recruited from a general population and assigned a treatment, the number of participants in each group varies widely. However, for each group, 50 participants were selected at random for analysis.

The authors employed human judges to assess the quality of the selected participants’ responses and used these judgments to form their conclusions. The simple task consisted of labeling images. The authors employed the ESP game as the gamified version of this task, having participants earn points by guessing the most-submitted labels for a particular image. Paid crowdworkers were simply given instructions to label each image and received feedback on their performance. The complex task consisted of participants generating “questions” in response to given text excerpts, in the manner of the game show Jeopardy. Accordingly, the authors employed a Jeopardy-like interface in the gamified version of the task. Players selected text excerpts of a particular category and difficulty from a table and attempted to generate questions, which were automatically graded for quality (though not against “ground truth”). Paid crowdworkers, on the other hand, were given each text in turn and asked to create a question for it. Answers were evaluated in the same automated way as in the gamified task, and workers were given feedback with the opportunity to revise their answers.

In their analysis of the data, the authors found that while there was no statistically significant difference in quality between players and workers for the simple task, players showed a statistically significant 18% increase in response quality over workers for the complex task. The authors posit that the reason for this difference is that, because players choose to play the game, they are interested in the task itself for its entertainment value, whereas workers choose to do the task for monetary reward and are less invested in the quality of their answers. While it is easy to produce quality work on simple tasks with little engagement, higher quality work on complex tasks can be achieved by gamifying those tasks and recruiting interested players.

Critique

The authors’ conclusions rest in large part on data gathered from the two complex task experiments, which ask participants to form Jeopardy-style “questions” in response to short article excerpts serving as “answers.” This is meant to contrast with the simple task experiments using the ESP game, which was developed as a method for doing the useful work of labeling pictures. However, the authors offer no justification that the Jeopardy game, serving as the complex task experimental condition, is an appropriate contrast to the ESP game.

The ESP game employs as its central mechanic an adaptation of Family Feud-style word guessing. It is a tried-and-true game mechanic with the benefit that it can be harnessed for the useful work of labeling images with keywords, as discussed in [von Ahn and Dabbish, 2004]. On the surface, the authors’ use of the Jeopardy game mechanic seems similar, but I believe they have failed to use it appropriately in two ways that ultimately weaken their conclusions. First, the mechanic itself seems poorly adapted to the work. A text excerpt from an article is not a Jeopardy-style “answer,” and one need only read the examples in the paper to see that the “questions” participants produce from those answers make no sense in the Jeopardy context. Such gameplay did induce engagement in self-selected players, producing quality answers in the process, but it should not be surprising that, in the absence of the game, this tortured mechanic failed to induce engagement in workers and thus failed to produce answers of quality equal to those from the entertainment-incentive condition.

This leads into what I believe is the second shortcoming of the experiment: the complex task, as paid work, is unclear and produces nothing of obvious value, both of which likely erode worker engagement. Put yourself in the position of someone playing the game version of this task, and assume that, after a few questions, you find it fun enough to keep playing. You figure out the strategies that allow you to achieve higher scores, you perform better, and your engagement is reinforced. Now put yourself in the position of a worker. You’re asked, in the style of Jeopardy, to “Please write down a question for which you think the shown article is a good response for.” From the paper, it’s clear you’re not then presented with a Jeopardy-style answer but instead the first sentence of a news article. This is not analogous to answering a Jeopardy clue, and what you write has no clear or even deducible purpose. It is little wonder that bewildered workers, in an effort to complete the task, would do only what is necessary to get their work approved. Compare this to coming up with a keyword for an image, as in the simple paid experimental condition. In that task, what is expected is much clearer, and even a modestly computer-literate worker could suppose that the benefit of their work is improved image labeling. In short, while it may indeed be the simplicity of a task that induces paid workers to produce higher quality work and the difficulty of a task that causes them to produce lower quality work, this experiment may only show that workers produce lower quality work for confusing and seemingly pointless tasks. A better approach may be, as with the ESP game, to turn complex work into a game instead of trying to turn a game into complex work.


Structuring, Aggregating, and Evaluating Crowdsourced Design Critique

K. Luther, J.-L. Tolentino, W. Wu, A. Pavel, B. P. Bailey, M. Agrawala, B. Hartmann, and S. P. Dow, “Structuring, Aggregating, and Evaluating Crowdsourced Design Critique,” in Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, New York, NY, USA, 2015, pp. 473–485.

Discussion leader: Will Ellis

Summary

In this paper, Luther et al. describe CrowdCrit, a system for eliciting and aggregating design feedback from crowdworkers. The authors motivate their work by describing the peer and instructor feedback process often employed in a design classroom setting. By exposing their designs to high-quality constructive criticism, designers can improve both their designs and their craft. However, the authors point out that such essential feedback is difficult to find outside of the classroom. Online design communities may provide some critique, but it is often too sparse and too shallow. To address this problem, the authors built CrowdCrit and tested it in three studies, which attempted to answer three questions: How similar are crowd critiques to expert critiques? How do designers react to crowd critiques? And how does crowd critique impact the design process and results?

In Study 1, the authors had a group of 14 CrowdCrit-sourced workers and a group of 3 design experts evaluate 3 poster designs using the CrowdCrit interface. They found that while individual crowdworkers’ critiques matched the experts’ poorly, the crowd in aggregate matched 45% to 60% of the experts’ design critiques. The results suggest that adding even more workers would produce critiques that match the experts’ more closely.

In Study 2, the authors tested designers’ reactions to crowd critiques by way of a poster contest for a real concert. Designers spent an hour designing a poster according to client criteria. Both crowdworkers and the client then generated feedback using CrowdCrit. Designers then had the chance to review the feedback and make changes to their designs. Finally, the client chose a winning poster design. In interviews, the authors found that designers felt the majority of the crowd critiques were helpful and that they appreciated many of CrowdCrit’s features, including abundant feedback, decision affirmation, scaffolded responses, and anonymity.

In Study 3, the authors evaluated the impact of crowd critique on the design process using another design contest, this time hosted on 99designs.com. After the initial design stage, half of the design participants were given crowd feedback through CrowdCrit, and the other half were given generic, unhelpful feedback. The final designs were evaluated by both the client and a group of crowdworkers meeting a certain design expertise threshold. While the designers appreciated the crowd feedback more than the generic feedback, the results showed no significant difference in quality between the treatment and control groups.

The authors conclude with implications of their work. They suggest that crowd feedback may make designers feel as though they are making major revisions when in fact they’re only making minor improvements. Indeed, the nature of CrowdCrit seems to ensure that designers will receive long lists of small changes that do not prompt substantive design changes but, if implemented, contribute to busier, less simple designs.

Reflection

CrowdCrit is implemented on top of Amazon Mechanical Turk and thus has the benefit of being able to pull feedback from a large pool of design novices. This paper makes the case that such feedback, in aggregate, can approximate the feedback of design experts. I am very concerned about the amount of noise introduced by the aggregation approach discussed in Study 1. Yes, with enough crowdworkers, you will eventually have enough people clicking enough critique checkboxes that every critique an expert selected is also selected by some crowdworker. However, those same crowdworkers also select many critiques an expert would not have made; if we assume the critiques an expert would have made are the most salient, the designer is left unable to separate the salient from the inconsequential. I would hope that the most-selected critiques made by an army of crowdworkers would better approximate those of an actual expert, but the authors do not explore this strategy. I would also explore a weighting system that favors critiques from CrowdCrit’s design-experienced crowdworkers, not just by coloring them more boldly, but also by hiding novice critiques that have low replication, as sketched below.
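To make that suggestion concrete, here is a minimal Python sketch of the kind of weighting and filtering I have in mind. The weights, threshold, and data layout are my own assumptions for illustration and do not reflect anything in the CrowdCrit implementation.

```python
from collections import defaultdict

# Hypothetical sketch: each selection is a (critique_id, worker_is_experienced) pair.
def visible_critiques(selections, experienced_weight=3.0, novice_weight=1.0, threshold=2.0):
    """Score each distinct critique by a weighted count of the workers who
    selected it, favoring design-experienced workers, and hide critiques
    whose total weight falls below the threshold (low-replication novice picks)."""
    scores = defaultdict(float)
    for critique_id, is_experienced in selections:
        scores[critique_id] += experienced_weight if is_experienced else novice_weight
    return {cid: score for cid, score in scores.items() if score >= threshold}

# Example: one experienced worker flags "cluttered layout"; two novices flag
# "low contrast"; a single novice flags "wrong font", which is hidden.
picks = [("cluttered layout", True), ("low contrast", False),
         ("low contrast", False), ("wrong font", False)]
print(visible_critiques(picks))  # {'cluttered layout': 3.0, 'low contrast': 2.0}
```

Under a scheme like this, a lone experienced-worker critique survives while an isolated novice critique does not, which is exactly the filtering behavior I would want to evaluate against the expert baseline from Study 1.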

I am impressed by the effort and technique employed by the authors to distill their seven design principles, which they came to by analyzing important works in design teaching. I think the scaffolding approach to teaching design to crowdworkers was novel, and I appreciated the explanation of the pilot studies they performed to arrive at their strategy. I wonder if those who would use a system like CrowdCrit, the designers themselves, would not benefit from participating as workers in the system. Much like a design classroom, they could benefit from the scaffolded learning and application of design principles, which they may only know in part.

In Study 3, I’m sure the authors were disappointed to find no statistically significant improvement in design outcomes using crowd feedback. However, I think the most important goal of peer and expert critique, at least in the classroom, is not to improve the design, but to improve the designer. With that in mind, it would be interesting to see a longitudinal study evaluating the output of designers who use CrowdCrit over a significant period of time.

Questions

  • Study 1 shows that adding more workers produces more data, but also more “false positives”. The authors conjecture that these may not be false positives at all, but could in fact be critiques that the experts missed. Are the authors correct, or is this just more noise? Is the designer impaired by so many extra critiques?
  • CrowdCrit is designed to work with any kind of crowd, not just the Mechanical Turk community. Based on other papers we’ve read, how could we restructure CrowdCrit to fit within a community of practice like graphic design?
  • Study 3 seems to show that, for a single design, critique does not improve a design so much as simple iteration does. Is feedback actually an important part of the design process? If so, how do we know? If we accept that feedback is an important part of the design process, how might we design a study that evaluates CrowdCrit’s contribution?
  • The results of Study 2 show a lot of positive feedback from designers for CrowdCrit’s features and interface. Implied in the designers’ comments is their enthusiasm for mediated engagement with clients and users (crowdworker stand-ins in this case) over their designs. What are CrowdCrit’s most important contributions in this regard?

Something Cool

Moneyball, but for Mario—the data behind Super Mario Maker popularity


A Comparison of Social, Learning, and Financial Strategies on Crowd Engagement and Output Quality

L. Yu, P. André, A. Kittur, and R. Kraut, “A Comparison of Social, Learning, and Financial Strategies on Crowd Engagement and Output Quality,” in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, New York, NY, USA, 2014, pp. 967–978.

Discussion leader: Will Ellis

Summary

In this paper, Yu et al. describe three experiments they ran to test whether accepted human resource management strategies can be employed individually or in combination to improve crowdworker engagement and output quality. In motivating their research, they describe how crowd platform features aimed at lowering transaction costs work at cross-purposes with worker retention (which they view as equivalent to engagement). These “features” include simplified work histories, de-identification of workers, and lack of long-term contracts. The strategies the authors employ to mitigate the shortcomings of these features are social (through teambuilding and worker interaction), learning (through performance feedback), and financial (through long-term rewards based on quality).

The broad arc of each experiment is to 1) recruit workers for an article summarization task, 2) attempt to retain workers for a similar follow-up task through recruitment messages employing zero (control), one, two, or three strategies, 3) measure worker retention and output quality, and 4) repeat steps 2 and 3. The first experiment tested all three strategies together versus a control. The results showed that using the strategies together improved retention and quality by a statistically significant amount. The second experiment tested each strategy individually, as well as pairs of strategies, versus a control. The results showed that only the social strategy significantly improved worker retention, while each of the three individual strategies improved output quality. However, no two-strategy combination significantly improved retention or quality. The authors view this as a negative interaction between the paired strategies and offer a few possible explanations, one of which is that they needed to put more effort into integrating the strategies. This led them to develop the integrated strategies tested in Experiment 3, where they again tested each strategy individually, along with their improved strategy pairs and a more integrated three-strategy treatment. Again, only the social strategy by itself showed a significant improvement in retention. Whereas Experiment 1 showed a significant improvement in retention when employing all three strategies, the results of this experiment suggest otherwise. In addition, only the learning strategy by itself and the three-strategy treatment showed improved output quality.

The authors conclude that the social strategy is the most effective way of increasing retention in crowdworkers and that the learning strategy is the most effective way of increasing output quality. The authors say these results suggest that multiple strategies undermine each other when employed together and that careful design is needed when devising crowdwork systems that try to do so.

Reflection

Yu et al. have taken an interesting tack in trying to apply more traditional human resource strategies to crowdworking in an attempt to improve worker engagement and output quality. I think they’re correct in identifying the defining qualities of systems like Amazon Mechanical Turk (extremely short-term contracts, pseudonymous identities, simplified worker histories) as what makes it difficult to employ such strategies. I appreciate their data-driven approach to measuring engagement, that is, measuring it through worker retention. However, I can’t help but question their equating of worker retention with worker engagement. Engagement implies a mental investment in the work, whereas retention only measures who was motivated to come back for more tasks, regardless of their investment in the work.

A fascinating aspect of their experimental setup is that in Experiments 1 and 2, the social strategy claimed team involvement in the follow-up recruitment materials but did not actually implement it. Despite this, retention benefits were clearly realized with the social strategy in Experiment 2 and likely contributed to the improved retention in Experiment 1. Further, even though actual social collaboration was implemented in Experiment 3, no further retention improvements were realized. It seems the idea of camaraderie is just as motivating as actual collaboration. The authors suggest that experiencing conflict with real teammates may offset the retention benefits of teammate interaction. Indeed, this may be the place where “retention” as a substitute for “engagement” breaks down. That is, in traditional workplaces, workers engage not just with their work but also with each other, and I imagine it’s much more difficult to feel engaged with pseudonymous teammates over the Internet than with teammates in person.

Disappointingly, the authors cannot claim much about combined strategies. While the three-strategy approach is consistently better in terms of quality across Experiments 1 and 3, none of the strategy pairs improve retention or quality significantly. The authors can only recommend that designers of crowdwork systems combine strategies carefully. I would hope that future work explores in more depth which factors undermine strategy combinations.

Questions

  • Do you think worker retention is a good measure of engagement?
  • In experiment 1, the authors did not actually operationalize the learning strategy. What, if anything, do their results say about learning strategy in this experiment?
  • What do you think of the reasons the authors give for why strategy combinations perform worse than individual strategies? What changes to their experimental setup could you suggest to improve outcomes?
  • This paper is very focused on improving outcomes for work requesters using HR management strategies. Is there any benefit to crowdworkers in recreating traditional HR management structures on top of crowdwork systems?


Micro Perceptual Human Computation for Visual Tasks

Gingold, Yotam, Ariel Shamir, and Daniel Cohen-Or. “Micro Perceptual Human Computation for Visual Tasks.” ACM Trans. Graph. 31, no. 5 (September 2012): 119:1–119:12. doi:10.1145/2231816.2231817.

Discussion leader: Will Ellis

Summary

Human computation (HC) involves using people to solve problems for which pure computational algorithms are unsuited. While previous human computation processes have used people to operate on large batches of problem data or solve problems using complex interactions, this paper describes a paradigm of decomposing complex, human-solvable problems into very simple, independent, parallelizable micro-tasks. Through this, the authors devise algorithms that break down large problems, farm out their constituent micro-tasks to unskilled “human processors” (Mechanical Turkers in this case), and reassemble the results within a timeframe suitable for interactive software use. The paper describes applying this approach to three separate image-processing problems considered to be difficult: finding depth layers in an image, figuring out the surface normal map of a 3-dimensional object in an image, and detecting the bilateral symmetry of an object in an image. The paper further describes various quality control strategies and which combinations thereof produce the best results and economic value.
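As a rough illustration of the decompose/farm-out/reassemble pattern described above, here is a minimal Python sketch. The names `post_to_crowd`, `decompose`, and `reassemble` are placeholders I have assumed; they do not correspond to any API described in the paper.

```python
import concurrent.futures
from typing import Any, Callable, List

def post_to_crowd(micro_task: Any) -> Any:
    """Placeholder for the platform-specific call (e.g., posting a task to
    Mechanical Turk) that returns a single human processor's answer."""
    raise NotImplementedError

def solve_with_humans(problem: Any,
                      decompose: Callable[[Any], List[Any]],
                      reassemble: Callable[[Any, List[Any]], Any],
                      max_workers: int = 32) -> Any:
    """Split a hard visual problem into simple, independent micro-tasks,
    collect human answers in parallel, and combine them into one result."""
    micro_tasks = decompose(problem)              # e.g., per-patch depth queries
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(post_to_crowd, micro_tasks))
    return reassemble(problem, answers)           # e.g., stitch answers into a depth map
```

The interactive-use constraint the authors describe would hinge on how quickly something like `post_to_crowd` returns, which is why their quality-control choices (discussed below) are driven by speed as much as accuracy.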

Reflection

The paper devotes significant coverage to figuring out the best strategies for maximizing the accuracy of results. The authors employ duplication, that is, having different or even the same human processor (HP) perform the same task multiple times. They employ “sentinel operations”, or making an HP solve a problem with a known answer to verify his or her reliability. They also attempted self-refereeing, or giving the results of one HP to other HPs for approval. Ultimately, their need for speed dictated that they use a combination of duplication and sentinel operations for quality control. Of course, employing more HPs on the same problem, as in the duplication and self-refereeing strategies, costs more money per micro-task. However, the researchers found they could achieve high accuracy with the least expenditure by setting very high (100%) duplication and sentinel thresholds with only one HP. I find it interesting, though not all that surprising, that the best outcomes are achieved when highly performant “unskilled labor” is used. In other words, paying one good worker to do a task is more economical than paying multiple mediocre workers to do the same task separately and combining their results. The authors seem to agree, saying, “Identifying accurate HPs with lower sentinel and consistency overhead is an important direction for future work.” This outcome, to my mind, works against a fundamental premise of the work, which is that tasks can be made simple enough that any unskilled individual can perform them well enough to make this paradigm of human computation an economically viable alternative to pure software solutions or manual problem-solving by experts.
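For concreteness, here is a rough Python sketch of how duplication and sentinel thresholds might gate which of an HP's answers are kept. The function name, inputs, and default thresholds are assumptions of mine, not the authors' implementation.

```python
def accept_hp_answers(answers, sentinel_passes, duplicate_pairs,
                      sentinel_threshold=1.0, consistency_threshold=1.0):
    """Keep a human processor's answers only if they passed the required
    fraction of sentinel (known-answer) tasks and agreed with themselves on
    the required fraction of duplicated tasks; leaving both thresholds at 1.0
    mirrors the strict 100% setting discussed above."""
    sentinel_rate = sum(sentinel_passes) / len(sentinel_passes)
    consistency_rate = sum(a == b for a, b in duplicate_pairs) / len(duplicate_pairs)
    if sentinel_rate >= sentinel_threshold and consistency_rate >= consistency_threshold:
        return answers
    return None  # reject; in practice the micro-tasks would be re-posted to another HP

# Example: the HP passed both sentinels and gave consistent duplicate answers.
print(accept_hp_answers(["left", "right"], [True, True], [("left", "left")]))
```

Raising or lowering these thresholds is exactly the cost/accuracy trade-off the authors explore: stricter gates with fewer HPs turned out to be cheaper than redundantly paying multiple HPs per task.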

Questions

  • Do the macro-tasks the authors attempted to solve in this paper seem generalizable to other problems in graphical vision or other fields? What other kinds of problems could be solved by human computation using this divide-and-conquer strategy?
  • From a usability perspective, what are the pitfalls of having other humans “in the loop” in your professional software? How could these be mitigated?
  • Does this system introduce new ethical concerns about HP workers for software end-users who may or may not be aware of their existence?
  • How do you feel about the authors’ MTurker compensation strategy? (They used less strict criteria to decide whether to pay HPs than to decide whether HPs’ answers could be used.)

Bonus!
Crowdsourcing level design in Nintendo’s Super Mario Maker
Super Mario Maker level design contest at Facebook
