Rzeszotarski, Jeffrey, and Aniket Kittur. “CrowdScape: Interactively Visualizing User Behavior and Output.” Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology. ACM, 2012.
Discussion Leader: Mauricio
Summary
This paper presents CrowdScape, a system that supports the evaluation of complex crowd work through mixed-initiative machine learning and interactive visualization. The system aims to address the quality-control challenges that arise in crowdsourcing platforms. Researchers have previously developed quality-control approaches based either on worker outputs or on worker behavior, but each of these on its own has limitations for evaluating complex work. Subjective tasks such as writing or drawing may have no single “right” answer, and no two answers may be identical. With regard to behavior, two workers might complete a task in different ways yet both provide valid output. CrowdScape combines worker behavior with worker output in its visualizations to address these limitations in the evaluation of complex crowd work. Its features allow users to form hypotheses about their crowd, test them, and refine their selections based on machine learning and visual feedback; its interface supports interactive exploration of worker results and the development of insights about worker performance.

CrowdScape is built on top of Amazon Mechanical Turk and draws on two data sources: the Mechanical Turk API, which provides the products of work, and Rzeszotarski and Kittur’s Task Fingerprinting system, which captures worker behavioral traces (such as time spent on the task, key presses, clicks, browser focus shifts, and scrolling). It uses these two sources to create an interactive data visualization of workers. To illustrate the system’s different use cases, the authors posted four varieties of tasks on Mechanical Turk and solicited submissions: translating text from Japanese to English, picking a color from an HSV color picker and naming it, describing a favorite place, and tagging science tutorial videos. They conclude that linking behavioral information about workers with data about their output is beneficial both for reinforcing or contradicting our initial conception of the cognitive process workers use when completing tasks and for developing and testing a mental model of the behavior of workers who produce good (or bad) outputs.
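To give a rough sense of what capturing these behavioral traces might involve, below is a minimal TypeScript sketch of an in-page event logger, assuming a browser context with JavaScript enabled. The TraceEvent shape and the record/serializeTrace helpers are illustrative names of my own, not the actual Task Fingerprinting API.

```typescript
// Hypothetical sketch of client-side behavioral trace logging,
// not the actual Task Fingerprinting implementation.

interface TraceEvent {
  kind: "keypress" | "click" | "scroll" | "focus" | "blur";
  timeMs: number; // milliseconds since the task page loaded
}

const trace: TraceEvent[] = [];
const start = Date.now();

function record(kind: TraceEvent["kind"]): void {
  trace.push({ kind, timeMs: Date.now() - start });
}

// Capture the kinds of low-level events the paper mentions:
// key presses, clicks, scrolling, and browser focus shifts.
window.addEventListener("keydown", () => record("keypress"));
window.addEventListener("click", () => record("click"));
window.addEventListener("scroll", () => record("scroll"));
window.addEventListener("focus", () => record("focus"));
window.addEventListener("blur", () => record("blur"));

// On submission, the trace would be sent along with the worker's output
// so that behavior and output can be joined in the visualization.
function serializeTrace(): string {
  return JSON.stringify({ totalTimeMs: Date.now() - start, events: trace });
}
```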
Reflections
I think CrowdScape presents a very interesting hybrid approach to addressing low-quality crowdsourcing work, which according to the authors comprises about one third of all submissions. When I started reading the paper, I got the impression that logging behavioral traces of crowd workers as they complete tasks would be a somewhat intrusive way to address this issue. But the explanation they give of why this approach is better suited to assessing the quality of creative tasks (such as writing) than post-hoc output evaluations (such as gold-standard questions) was really convincing.
I liked how self-critical they were about CrowdScape’s many limitations, such as its requirement that workers have JavaScript enabled, or the cases in which behavioral traces aren’t indicative of the work done, for example when users complete a task in an external text editor and then paste the result into Mechanical Turk. I would like to see how further research addresses these issues.
I found it curious that in the first task (translation), even though the workers were told that their behavior would be captured, they still went ahead and used machine translators. I would have liked to see what wording the authors used when giving this warning, and also when describing compensation. For instance, if the authors told workers that their actions would be logged but that they would be paid regardless, that gives workers no incentive to do the translation themselves, which might explain why the majority (all but one) ended up using Google Translate or another translator for the task. On the other hand, if the authors simply told workers that their actions would be recorded, I would expect the workers to assume that their behavior, and not just their output, would be evaluated, which would push them to do a better job. The wording used to tell workers that their behavioral traces are being logged is therefore important, because it might skew the results one way or the other.
Questions
- What wording would you use to tell the workers that their behavioral traces would be captured when completing a task?
- What do you think about looking at a worker’s behavior to determine the quality of their work? Do you think it might be ineffective or intrusive in some cases?
- The authors combine worker behavior and worker output to control quality. What other measure(s) could they have integrated in CrowdScape?
- How could CrowdScape address cases in which behavioral traces aren’t indicative of the work done (e.g. writing the task’s text in another text editor and pasting it in)?