To Play or not to Play: Interactions between Response Quality and Task Complexity in Games and Paid Crowdsourcing

Krause, M., Kizilcec, R.: To Play or not to Play: Interactions between Response Quality and Task Complexity in Games and Paid Crowdsourcing. Conference on Human Computation and Crowdsourcing, San Diego, USA (2015)

Pro Discussion Leader: Adam

Summary

This paper looks at how the quality of paid crowdsourcing compares to the quality of crowdsourcing games. There has been a lot of research comparing expert work to non-expert crowdsourcing quality. The choice is a trade-off between price and quality, with expert work costing more but having higher quality; you may need to pay several non-expert crowdworkers to reach a level of quality comparable to a single paid expert. The same trade-off may exist between paid and game-based crowd work. There has been research into the cost of making a crowdsourcing task gamified, but nothing comparing the quality of game work to paid crowd work.

The authors investigate this by creating two tasks, a simple one and a complex one, with a game version and a paid version of each for the crowd to complete. The simple task is image labeling. Images were found by querying Google image search with 160 common nouns. Then, for each image, the nouns on the webpage where the image was found are tallied; more frequently occurring nouns are considered more relevant to the image. The game version is modeled after Family Feud, so more relevant labels earn more points. The paid task simply asked workers to provide a keyword for an image.
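To make the scoring concrete, here is a minimal sketch of how noun-frequency relevance and Family Feud-style points could be computed under this scheme. The function names and point values are my own invention, not the authors':

    from collections import Counter

    def relevance_scores(page_nouns):
        # Tally the nouns on the page where the image was found; more
        # frequent nouns are treated as more relevant labels.
        counts = Counter(page_nouns)
        total = sum(counts.values())
        return {noun: count / total for noun, count in counts.items()}

    def award_points(label, relevance, max_points=100):
        # Family Feud-style scoring: more relevant labels earn more points;
        # labels not found on the page earn nothing.
        return round(max_points * relevance.get(label, 0.0))

    # Nouns scraped from the page where a "dog" image was found.
    nouns = ["dog", "dog", "dog", "park", "leash", "dog", "park"]
    rel = relevance_scores(nouns)
    print(award_points("dog", rel))   # most relevant label, most points
    print(award_points("park", rel))  # less relevant label, fewer points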

In addition to the simple task, the authors wanted to see how quality compared on a more complicated task. To do this, they created a second task that had participants look at a website and provide a question that could be answered by the content of that website. The game version is modeled after Jeopardy, with higher point values assigned to more complex websites. The paid version presented the tasks in a predetermined order, but in both conditions participants completed the same set of tasks.

Overall, the quality of the game tasks was significantly higher than that of the paid tasks. However, when broken down between simple and complex tasks, only the complex task showed significantly higher quality for the game version. For the complex task, quality was about 18% higher in the game condition, as rated by the selected judges. The authors suggest one reason for this is the selectiveness of game players. Paid workers will do any task as long as it pays well, but game players will only play games that actually appeal to them. So only people genuinely interested in the complex task played the game, leading to higher engagement and quality.

Reflections

Gamification has the potential to generate a lot of useful data from crowd work. While one of the benefits of gamification is that you don’t have to pay workers, it still has a cost. Creating a sufficiently interesting game is not an easy or cheap process, and the authors take that into consideration when framing game crowdsourcing. They are essentially comparing game participants to expert crowd work: it has the potential to generate higher quality work, but at a higher cost than paid non-expert crowd work. The difference, however, is that with the game the cost is more of a fixed cost. So if you need to collect massive amounts of data, the cost of creating the game may be amortized to the point where it’s cheaper than paid non-expert work, with at least as high, if not higher, quality.

I really like how the authors try to account for any possible confounding variables between game and paid crowdsourcing tasks. We’ve seen from some of our previous papers and discussions that feedback can play a large role in the quality of crowd work. It was very important that the authors considered this by incorporating feedback into the paid tasks. This provides much more legitimacy to the results found in the paper.

There is also a simplicity to this paper that makes it very easy to understand and buy into. The authors don’t create too many different variables to study. They use a between-subjects design to avoid any crossover effects, and their analysis is definitive. There were enough participants to give them statistically significant results and meaningful findings. The paper wasn’t weighed down with statistical calculations like some papers are. They keep the most complicated statistical discussion to two short paragraphs to appease any statisticians who might question their results, and their calculation for comparing quality between the two conditions is very straightforward.

Questions

  • Games have a fixed cost for creation, but are there any other costs that should be considered when deciding whether to go the route of game crowdsourcing versus paid crowdsourcing?
  • Other than instantaneous feedback, are there any other variables that could affect the quality between paid and game crowd work?
  • Was there any other analysis the authors should have performed or any other variables that should have been considered, or were the results convincing enough?


VQA: Visual Question Answering

Paper: S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual Question Answering,” arXiv preprint arXiv:1505.00468, 2015.

Summary

This paper presents a dataset to use for evaluating AI algorithms involving computer vision, natural language processing, and knowledge representation and reasoning. Many tasks that we once thought would be difficult for an AI algorithm to perform, such as image labeling, have now become fairly commonplace. Thus, the authors hope to help push the boundaries of AI algorithms by providing a data set for a more complex problem that combines multiple disciplines, which they name Visual Question Answering. Others can then use this dataset to test their algorithms that try to solve this highly complex puzzle.

The authors use around 200,000 images from the Microsoft Common Objects in Context data set, plus 50,000 abstract scene images created by the authors. For each image, they collected three open-ended questions from crowdworkers, and for each question they collected ten answers from unique crowdworkers. An accuracy measure was then applied to each answer: if at least three of the human responses to a question were identical to an answer, that answer was deemed 100% accurate. This concluded the data collection for generating the data set, but the authors then used crowdworkers to evaluate the complexity of the questions received.
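As I read it, this accuracy measure amounts to giving an answer full credit when at least three of the ten human answers match it, and proportional partial credit otherwise. A small illustrative sketch of that rule:

    def vqa_accuracy(candidate, human_answers):
        # Full credit if at least three of the ten human answers match;
        # proportional partial credit otherwise.
        matches = sum(1 for ans in human_answers if ans == candidate)
        return min(matches / 3.0, 1.0)

    humans = ["red", "red", "red", "maroon", "red", "dark red",
              "red", "crimson", "red", "red"]
    print(vqa_accuracy("red", humans))     # 1.0 (three or more matches)
    print(vqa_accuracy("maroon", humans))  # ~0.33 (only one match)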

There were three criteria examined to justify the complexity of these questions: whether the image is necessary to answer the question, whether the question requires any common sense knowledge not available in the image, and how well the question can be answered using the captions alone rather than the actual image. The supporting studies showed that the questions generated generally required the image to answer, that a fair number required some form of common sense, and that questions were answered significantly better with access to the image than with access to only the captions. Finally, the authors tested various algorithms against this data set and found that current algorithms still significantly underperform compared to humans. This means the data set can successfully test the abilities of a new, more complex class of AI algorithms.

Reflection

While the purpose of this paper is focused on artificial intelligence algorithms, a large portion of it involves crowd work. It is not specifically mentioned in the body of the paper, but from the description, the acknowledgements, and the figure captions you can tell that the question and answer data was collected on Amazon Mechanical Turk. And this isn’t surprising given the vast amount of data collected (nearly 10 million answers to questions). It would be interesting to learn more about how the tasks were set up and how workers were compensated, but the crowdsourcing aspects are not the focus of the paper.

One part of the paper that I thought was most relevant to our studies of crowd work was the discussion of how to elicit the best complex, open-ended questions about the pictures. The authors used three different prompts to try to get the best questions out of the crowdworkers: ask a question that a toddler, an alien, or a smart robot would not understand. I thought it was very interesting that the smart robot prompt produced the best questions. This prompt is actually fairly close to reality, as the smart robot could just be considered a modern AI algorithm. Good questions are ones that can stump these algorithms, or the smart robot.

I was surprised that the authors chose to go with exact text matches for all of their metrics, especially given the discussion regarding my project last week with the image comparison labeling. The paper mentions a couple of reasons for this, such as not wanting answers like “left” and “right” to be grouped together, and because current algorithms don’t do a good enough job of synonym matching for this type of task. It would be interesting to see whether the results would differ at all if synonym matching were used. Since exact matching was used in all scenarios, however, adding synonym matching would theoretically not change the relative results.
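For context, here is a rough sketch of what exact matching after light text cleaning might look like; the paper’s actual preprocessing steps may differ, and the helper names are mine:

    import re

    ARTICLES = {"a", "an", "the"}

    def normalize(answer):
        # Illustrative cleanup: lowercase, strip punctuation, drop articles.
        answer = re.sub(r"[^\w\s]", "", answer.lower())
        return " ".join(t for t in answer.split() if t not in ARTICLES)

    def exact_match(a, b):
        return normalize(a) == normalize(b)

    print(exact_match("The left.", "left"))  # True after cleaning
    print(exact_match("couch", "sofa"))      # False: synonyms never match

The last line is exactly the kind of case that synonym matching would relax.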

That being said, this was a very interesting article that aimed to find human tasks that computers still have difficulty dealing with. Every year that passes, this set of tasks gets smaller and smaller. And this paper is actually trying to help this set get smaller more quickly, by helping test new AI algorithms for effectiveness. The workers may not know it, but for the tasks in this paper they were actually working toward making their own job obsolete.

Questions

  • How would you have set up these question and answer gathering tasks, regarding the number that each worker performs per HIT? How do you find the right number of tasks per HIT before the worker should just finish the HIT and accept another one?
  • Is it just a coincidence that the “smart robot” prompt performed the best, or do you think there’s a reason that the closest to the truth version produced the best results (are crowdworkers smart enough to understand what questions are difficult for AI)?
  • What do you think about the decision to use exact text matching (after some text cleaning) instead of any kind of synonym matching?
  • How much longer are humans going to be able to come up with questions that are more difficult for computers to answer?


Labeling Images with a Computer Game

Paper: Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’04). ACM, New York, NY, USA, 319-326. DOI=http://dx.doi.org/10.1145/985692.985733

Discussion Leader: Adam Binford

Summary

The authors of this paper tackle the issue of image labeling. Image labeling has many purposes, such as enabling better text-based image search and creating training data for machine learning algorithms. The typical approaches to image labeling at the time were computer vision algorithms or manual labor. This was before crowdsourcing really became a thing, and before Amazon Mechanical Turk was even launched, so the labor required to produce these labels was likely expensive and hard to obtain quickly.

The paper presents a new way to obtain image labels, through a game called The ESP Game. The idea behind the game is that image labels can be obtained from players who don’t realize they’re actually providing this data; they just find the game fun and want to play. The game works by matching up two players and showing them a common image. Players are told to try to figure out what word the other player is thinking of; they are not told anything about describing the image presented to them. Each player then enters words until a match is found between the two players. Players also have the option to vote to skip an image if it is too difficult to come up with a word for.

The game also includes the idea of taboo words, which are words players are not allowed to guess for an image. These words come from previous iterations of the game on the same image, so that multiple labels get generated for each image instead of the same obvious one over and over again. When an image starts to get skipped frequently, it is removed from the pool of possible images. The authors estimate that with 5,000 players playing the game constantly, all of the roughly 425,000,000 images indexed by Google could be labeled in about a month, with each image reaching the threshold of six labels within six months.
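A simplified sketch of a single round under these rules (the player inputs are hypothetical, and the real game also handles timing, scoring, and cheating countermeasures):

    def play_round(taboo_words, guesses_a, guesses_b):
        # Players type words independently until they agree on one;
        # taboo words from earlier rounds on this image are rejected.
        taboo = {w.lower() for w in taboo_words}
        seen_a, seen_b = set(), set()
        for a, b in zip(guesses_a, guesses_b):
            if a.lower() not in taboo:
                seen_a.add(a.lower())
            if b.lower() not in taboo:
                seen_b.add(b.lower())
            agreed = seen_a & seen_b
            if agreed:
                # The agreed word becomes a new label (and a future taboo word).
                return agreed.pop()
        return None  # no match; the players might vote to skip instead

    # "dog" is already taboo for this image, so the players must agree on
    # something new.
    print(play_round(["dog"], ["dog", "puppy", "leash"],
                     ["canine", "puppy", "pet"]))  # -> "puppy"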

The authors were able to show that their game was indeed fun and that quality labels were generated by its players. They supported the claim that the game is fun with usage statistics, indicating that over 80% of the 13,670 players who played the game played it on multiple days, and that 33 people played for more than 50 hours total. These statistics indicate that the game provides sufficient enjoyment for players to keep coming back.

Reflection

This paper is one of the first we’ve read this year that looks at alternatives to paid crowd work. What’s even more impressive is that it was published before crowdsourcing really became a thing, and a year before Amazon Mechanical Turk was even launched. The ESP Game really started the idea of gamifying meaningful work, which many people have tried to emulate since, including Google, which basically made its own version of this game. While not specifically mentioned by the authors, I imagine few, if any, of the game’s players knew that it was intended to be used for image labeling. This means the players truly just played it for fun, and not for any other reason.

Crowdsourcing through games has many advantages over what we would consider traditional crowdsourcing through AMT. First, and most obviously, it provides free labor: you attract workers through the fun of the game, not through any monetary incentive. This provides additional benefits. With paid work, you have to worry about workers trying to perform the least amount of work for the most amount of money, which can result in poor quality. With a game, there is less incentive to try to game the system, albeit still some. With paid work, there isn’t much satisfaction lost by cheating your way to more money, but with a game, getting a high score by cheating or gaming the system is much less satisfying than earning one legitimately, at least for most people. And the authors of this paper found some good ways to combat possible cheating or collusion between players. While they discussed their strategies for this, it would be interesting to hear whether they had to use them at all and how rampant, if at all, cheating became in the game.

Obviously the issue with this approach is making your game fun. The authors were able to achieve this, but not every task that can benefit from crowdsourcing can easily be turned into a fun game. Image labeling just happens to have many possible ways of being turned into an interesting game. All of the Metadata games linked to on the course schedule involve image (or audio) labeling, and they don’t hide the true purpose of the work nearly as well: their game descriptions specifically mention tagging images, unlike The ESP Game, which said nothing about describing the images presented. The fact that Mechanical Turk has become so popular, and the variety of tasks available on it, goes to show how difficult it is to turn these problems into an interesting game.

I do wonder how useful this game would be today. One of the things mentioned several times by the authors is that with 5,000 people playing the game constantly, they could label all images indexed by Google within a month. But that was back in 2004, when they said there were about 425,000,000 images indexed by Google. In the past ten years, the internet has been expanding at an incredible rate. I was unable to find any specific numbers on images, but Google has indexed over 40 billion web pages, and I would imagine the number of images indexed by Google could be nearly as high. This leads to some questions…

Questions

  • Do you think the ESP Game would be as useful today as it was 11 years ago, with respect to the number of images on the internet? What about with respect to the improvements in computer vision?
  • What might be some other benefits of game crowd work over paid crowd work that I didn’t mention? Are there any possible downsides?
  • Can you think of any other kinds of work that might be gamifiable other than the Fold It style games? Do you know of any examples?
  • Do you think it’s ok to hide the fact that your game is providing useful data for someone, or should the game have to disclose that fact up front?


Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk

Tanushree Mitra, C.J. Hutto, and Eric Gilbert. 2015. Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, New York, NY, USA, 1345-1354. DOI=10.1145/2702123.2702553 http://doi.acm.org/10.1145/2702123.2702553

Discussion Leader: Adam

Summary

Much of what we have read so far regarding quality control has focused on post hoc methods of cleaning data, where you either take multiple responses and find the best one, or iterate on a response to continually improve it. This paper takes a different approach, using person-centric a priori methods to improve data quality. What the authors are essentially trying to find out is whether non-expert crowdworkers can be screened, trained, or incentivized, prior to starting the task, to improve the quality of their responses.

To do this, the authors used four different subjective qualitative coding tasks to examine the effects of various interventions or incentives on data quality. People in Pictures asks workers to identify the number of people in a picture, giving five different ranges to choose from. Sentiment Analysis had workers rate the positive or negative sentiment of tweets on a five-point scale. Word Intrusion had workers select, from a list of five words, the one that doesn’t belong with the rest. Finally, Credibility Assessment tasked workers with rating the credibility of a tweet about some world event.

The authors used three different means to intervene with or incentivize the selected workers. Targeted screening gave workers a reading comprehension qualification test. Training gave workers some examples of qualitative coding annotations and required them to pass some example annotations before beginning the actual tasks. Finally, a financial bonus rewarded workers with double the pay if their response matched the majority of workers’ responses. A second experiment varied the ways in which workers qualified for the bonus.
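A minimal sketch of the majority-agreement bonus logic as I understand it (the payment amount and worker IDs are made up for illustration):

    from collections import Counter

    def award_bonuses(responses, base_pay=0.50):
        # Double a worker's pay when their response matches the most
        # common response for the item; otherwise pay the base rate.
        majority, _ = Counter(responses.values()).most_common(1)[0]
        return {worker: base_pay * 2 if resp == majority else base_pay
                for worker, resp in responses.items()}

    responses = {"w1": "positive", "w2": "positive",
                 "w3": "neutral", "w4": "positive"}
    print(award_bonuses(responses))  # w1, w2, w4 earn double; w3 does not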

In general, the authors found that these a priori strategies were able to effectively increase the quality of worker data, with the financial incentives having the least effect on quality. For the first two tasks, nearly all methods provided statistically significant improvements in quality over both the control-with-bonus and baseline conditions, with the combination of screening, training, and bonuses providing the highest quality for each task. Additionally, these a priori methods provided higher quality data than iteration did in the Credibility Assessment task, though not statistically significantly so.

Reflections

This paper provides many interesting results, some of which the authors did not really discuss. The most important takeaway is that a priori intervention methods can provide data of just as high quality as, if not higher than, process-centric methods such as iteration. And this is significant because of how scalable a priori methods are: you need only screen or train someone once, and then they will provide high quality data for as long as they work on that task. With process-centric methods, you must run the process for each piece of data you collect, increasing overhead.

However, there are many other results worth discussing. One is that the authors found that control-condition quality has increased significantly over the past several years, indicating that AMT workers are generally providing higher quality results than before. A few years ago, accuracies for control conditions with a 20% chance rate peaked at about 40%, while in this paper the control accuracies were between 55% and 80%. The authors attribute this to better person-centric quality control measures enacted by Amazon, such as stricter residency requirements and CAPTCHA use, but I wonder if that is truly the case.

One interesting result that the authors do not really discuss is that in all three tasks from experiment 1, the control group with the bonus incentive performed worse than the control group without the financial bonus. Additionally, the baseline group, which screened workers using the standard 95% approval rating and 100-HIT experience requirements, performed worse than the unrestricted control group on each of the three tasks. Maybe new workers tend to provide high quality data because they are excited about trying something new? This seems like an important issue to look into, as many tasks on AMT use these basic screening methods.

Finally, I find it interesting that adding financial incentives produced no statistically significant improvement in quality beyond the screening or training interventions. I guess this goes along with some of our previous discussions: increasing pay will attract more workers more quickly, but once someone decides to do a HIT, the amount of money offered does not affect the quality of their work.

Questions

  • Why has the general quality of workers on AMT improved over the past few years?
  • Can you think of any other intervention or incentive methods that fit this person-centered approach?
  • While these tasks were subjective, they still had a finite number of possible responses (5). Do you think these methods would improve the quality of free-response types of tasks? And how would you judge that?
  • Do you think these methods can replace process-centric quality control altogether, or will we always need some form of data verification process?
