CrowdScape: interactively visualizing user behavior and output

Rzeszotarski, Jeffrey, and Aniket Kittur. “CrowdScape: interactively visualizing user behavior and output.” Proceedings of the 25th annual ACM symposium on User interface software and technology. ACM, 2012.

Discussion Leader: Mauricio

Summary

This paper presents CrowdScape, a system that supports the evaluation of complex crowd work through mixed-initiative machine learning and interactive visualization. The system aims to address the quality control challenges that arise on crowdsourcing platforms. Researchers have previously developed quality control approaches based either on worker outputs or on worker behavior, but each has limitations on its own for evaluating complex work. Subjective tasks such as writing or drawing may have no single "right" answer, and no two answers may be identical. As for behavior, two workers might complete a task in entirely different ways yet both provide valid output. CrowdScape combines worker behavior with worker output in its visualizations to address these limitations in the evaluation of complex crowd work.

CrowdScape's features allow users to form hypotheses about their crowd, test them, and refine their selections based on machine learning and visual feedback. Its interface supports interactive exploration of worker results and the development of insights about worker performance. CrowdScape is built on top of Amazon Mechanical Turk and captures data from two sources: the Mechanical Turk API, to obtain the products of work, and Rzeszotarski and Kittur's Task Fingerprinting system, to capture worker behavioral traces (such as time spent on the task, key presses, clicks, browser focus shifts, and scrolling). It uses these two information sources to create an interactive visualization of workers.

To illustrate different use cases of the system, the authors posted four varieties of tasks on Mechanical Turk and solicited submissions: translating text from Japanese to English, picking a color from an HSV color picker and writing its name, describing a favorite place, and tagging science tutorial videos. They conclude that linking behavioral information about workers with data about their output helps reinforce or contradict one's initial conception of the cognitive process workers use when completing tasks, and helps in developing and testing a mental model of the behavior of workers who produce good (or bad) output.
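
To make the behavioral-trace capture concrete, here is a minimal browser-side logging sketch in TypeScript, in the spirit of Task Fingerprinting. It is not the authors' implementation; the event set, payload shape, and the logging endpoint are assumptions for illustration.

```typescript
// Minimal sketch of browser-side behavioral trace logging in the spirit of
// Task Fingerprinting (illustrative only; the event set, payload shape, and
// the logging endpoint are assumptions, not the authors' implementation).

interface TraceEvent {
  type: "keypress" | "click" | "scroll" | "focus" | "blur";
  timestamp: number; // ms since the task page loaded
}

const taskStart = Date.now();
const trace: TraceEvent[] = [];

function record(type: TraceEvent["type"]): void {
  trace.push({ type, timestamp: Date.now() - taskStart });
}

// Capture the behavioral signals CrowdScape visualizes: typing, clicking,
// scrolling, and shifts of browser focus, plus total time on task.
window.addEventListener("keydown", () => record("keypress"));
window.addEventListener("click", () => record("click"));
window.addEventListener("scroll", () => record("scroll"));
window.addEventListener("focus", () => record("focus"));
window.addEventListener("blur", () => record("blur"));

// On submit, ship the trace alongside the HIT output so that behavior and
// output can later be joined per worker (the endpoint is hypothetical).
document.querySelector("form")?.addEventListener("submit", () => {
  const payload = JSON.stringify({ timeOnTask: Date.now() - taskStart, trace });
  navigator.sendBeacon("https://example-logger.org/traces", payload);
});
```

Joining a trace like this with the corresponding HIT output is what gives CrowdScape its combined behavior-and-output view of each worker.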

Reflections

I think CrowdScape presents a very interesting hybrid approach to addressing low-quality crowdsourcing work, which according to the authors comprises about one third of all submissions. When I started reading the paper, I got the impression that logging behavioral traces of crowd workers while they complete tasks would be a rather intrusive way to address this issue. But their explanation of why this approach is more appropriate for assessing the quality of creative tasks (such as writing) than post-hoc output evaluations (such as gold standard questions) was really convincing.

I liked how self-critical they were about CrowdScape's many limitations, such as its requirement that workers have JavaScript enabled, or the cases in which behavioral traces aren't indicative of the work done, such as when users complete a task in a text editor and then paste it into Mechanical Turk. I would like to see how further research addresses these issues.

I found it curious that in the first task (translation), even though the workers were told that their behavior would be captured, they still went ahead and used machine translators. I would have liked to see what wording the authors used in their tasks when giving this warning, and also when describing compensation. For instance, if the authors told workers that their moves would be logged but that they would be paid regardless, that gives workers no incentive to do the translation correctly, which may be why the majority (all but one) of the workers ended up using Google Translate or another translator for the task. On the other hand, if the authors just told workers that their moves were going to be recorded, I would imagine the workers would assume that not only their output but also their behavior would be evaluated, which would push them to do a better job. I think the wording used to tell workers that their behavioral traces are being logged is important, because it might skew the results one way or the other.

Questions

  • What wording would you use to tell the workers that their behavioral traces would be captured when completing a task?
  • What do you think about looking at a worker’s behavior to determine the quality of their work? Do you think it might be ineffective or intrusive in some cases?
  • The authors combine worker behavior and worker output to control quality. What other measure(s) could they have integrated in CrowdScape?
  • How can CrowdScape address the issue of cases in which behavioral traces aren’t indicative of the work done (e.g. writing the task’s text in another text editor)?

Read More

Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk

Tanushree Mitra, C.J. Hutto, and Eric Gilbert. 2015. Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, New York, NY, USA, 1345-1354. DOI=10.1145/2702123.2702553 http://doi.acm.org/10.1145/2702123.2702553

Discussion Leader: Adam

Summary

Much of what we have read so far regarding quality control has focused on post hoc methods of cleaning data, where you either take multiple responses and find the best one, or iterate on a response to continually improve it. This paper takes a different approach, using person-centric a priori methods to improve data quality. What the authors are essentially trying to find out is whether non-expert crowdworkers can be screened, trained, or incentivized, prior to starting the task, to improve the quality of their responses.

To do this, the authors used four different subjective qualitative coding tasks to examine the effects of various interventions and incentives on data quality. People in Pictures asked workers to identify the number of people in a picture, giving five different ranges to choose from. Sentiment Analysis had workers rate the positive or negative sentiment of tweets on a five-point scale. Word Intrusion had workers select, from a list of five words, the one that doesn't belong with the rest. Finally, Credibility Assessment tasked workers with rating the credibility of a tweet about some world event.

The authors used three different means to intervene with or incentivize the selected workers. Targeted screening gave workers a reading comprehension qualification test. Training gave workers some examples of qualitative coding annotations and required them to pass some example annotations before beginning the actual tasks. The final intervention, a bonus, rewarded workers with double the pay if their response matched the majority of workers' responses. A second experiment varied the ways in which workers qualified for the bonus.

In general, the authors found that these a priori strategies effectively increased the quality of worker data, with the financial incentives having the least effect on quality. For the first two tasks, nearly all methods provided statistically significant improvements in quality over both the control-with-bonus and baseline conditions, with the combination of screening, training, and bonuses providing the highest quality for each task. Additionally, these a priori methods provided higher quality data than iteration did in the Credibility Assessment task, though not statistically significantly so.

Reflections

This paper provides many interesting results, some of which the authors did not really discuss. The most important takeaway is that a priori intervention methods can provide data of just as high quality as process-centric methods such as iteration, if not higher. This is significant because of how scalable a priori methods are: you need only screen or train someone once, and then they will provide high quality data for as long as they work on that task. With process-centric methods, you must run the process for each piece of data you collect, increasing overhead.

However, there are many other results worth discussing. One is that the authors found the quality of the control condition has increased significantly over the past several years, indicating AMT workers are generally providing higher quality results than before. A few years ago, accuracies for control conditions with a 20% chance of being randomly correct peaked at about 40%, while in this paper the control qualities were between 55% and 80%. The authors credit better person-centric quality control measures enacted by Amazon, such as stricter residency requirements and CAPTCHA use, but I wonder if that is truly the case.

One interesting result that the authors do not really discuss is that in all three tasks from experiment 1, the control group with the bonus incentive performed worse than the control group without the financial bonus. Additionally, the baseline group, which screened workers based on the standard 95% approval rating and 100-HIT experience, performed worse than the unrestricted control group on each of the three tasks. Maybe new workers tend to provide high quality data because they are excited about trying something new? This seems like an important issue to look into, as many tasks on AMT use these basic screening methods.

Finally, I find it interesting that financial incentives produced no statistically significant improvement in quality beyond the screening or training interventions alone. I guess it goes along with some of our previous discussions: increasing pay will attract more workers more quickly, but once someone decides to do a HIT, the amount of money offered does not affect the quality of their work.

Questions

  • Why has the general quality of workers on AMT improved over the past few years?
  • Can you think of any other intervention or incentive methods that fit this person-centered approach?
  • While these tasks were subjective, they still had a finite number of possible responses (5). Do you think these methods would improve the quality of free-response types of tasks? And how would you judge that?
  • Do you think these methods can replace process-centric quality control altogether, or will we always need some form of data verification process?

Read More

A Comparison of Social, Learning, and Financial Strategies on Crowd Engagement and Output Quality

L. Yu, P. André, A. Kittur, and R. Kraut, “A Comparison of Social, Learning, and Financial Strategies on Crowd Engagement and Output Quality,” in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, New York, NY, USA, 2014, pp. 967–978.

Discussion leader: Will Ellis

Summary

In this paper, Yu et al. describe three experiments they ran to test whether accepted human resource management strategies can be employed individually or in combination to improve crowdworker engagement and output quality. In motivating their research, they describe how crowd platform features aimed at lowering transaction costs work at cross-purposes with worker retention (which they view as equivalent to engagement). These “features” include simplified work histories, de-identification of workers, and lack of long-term contracts. The strategies the authors employ to mitigate the shortcomings of these features are social (through teambuilding and worker interaction), learning (through performance feedback), and financial (through long-term rewards based on quality).

The broad arc of each experiment is 1) recruit workers for an article summarization task, 2) attempt to retain workers for a similar follow-up task through recruitment messages employing zero (control), one, two, or three strategies, 3) measure worker retention and output quality, and 4) repeat steps 2 and 3.

The first experiment tested employing all three strategies versus a control. The results showed that using these strategies together improved retention and quality by a statistically significant amount. The second experiment tested each strategy individually as well as pairs of strategies versus control. The results showed that only the social strategy significantly improved worker retention, while all three individual strategies improved output quality. However, no two-strategy combination significantly improved retention or quality. The authors view this as a negative interaction between the pairs of strategies and offer a few possible explanations for these outcomes, one of which is that they needed to put more effort into integrating the strategies. This led them to develop the integrated strategies that they tested in Experiment 3.

In Experiment 3, the authors again tested each strategy individually. They also tested their improved strategy pair treatments, as well as a more integrated 3-strategy treatment. Again, only the social strategy by itself showed significant improvement in retention. Whereas Experiment 1 showed significant improvement in retention when employing all 3 strategies, the results of this experiment suggest otherwise. In addition, only the learning strategy by itself and the 3-strategy treatment showed improved output quality.

The authors conclude that the social strategy is the most effective way of increasing retention in crowdworkers and that the learning strategy is the most effective way of increasing output quality. The authors say these results suggest that multiple strategies undermine each other when employed together and that careful design is needed when devising crowdwork systems that try to do so.

Reflection

Yu et al. have taken an interesting tack in trying to apply more traditional human resource strategies to crowdworking in an attempt to improve worker engagement and output quality. I think they're correct in identifying the very qualities of systems like Amazon Mechanical Turk (extremely short-term contracts, pseudonymous identities, simplified worker histories) as what makes it difficult to employ such strategies. I appreciate their data-driven approach to measuring engagement, that is, measuring it through worker retention. However, I can't help but question their equating of worker retention with worker engagement. Engagement implies a mental investment in the work, whereas worker retention only measures who was motivated to come back for more tasks, regardless of their investment in the work.

A fascinating aspect of their experimental setup is that in experiments 1 and 2, the social strategy claimed team involvement in the follow-up recruitment materials but did not actually implement it. Despite this, retention benefits were clearly realized with the social strategy in experiment 2 and likely contributed to improved retention in experiment 1. Further, even though actual social collaboration was implemented in experiment 3, no further retention improvements were realized. It seems the idea of camaraderie is just as motivating as actual collaboration. The authors suggest that experiencing conflict with real teammates may mitigate the retention benefits of teammate interaction. Indeed, this may be where "retention" as a substitute for "engagement" breaks down. That is, in traditional workplaces, workers engage not just with their work but also with each other. I imagine it's much more difficult to feel engaged with pseudonymous teammates over the Internet than with teammates in person.

Disappointingly, the authors cannot claim much about combined strategies. While the 3-strategy approach is consistently better in terms of quality across experiments 1 and 3, none of the strategy pairs improve retention or quality significantly. They can only recommend that, when combining strategies, designers of crowdwork systems do so carefully. I hope future work explores in more depth what factors are at play that undermine strategy combinations.

Questions

  • Do you think worker retention is a good measure of engagement?
  • In experiment 1, the authors did not actually operationalize the learning strategy. What, if anything, do their results say about learning strategy in this experiment?
  • What do you think of the reasons the authors give for why strategy combinations perform worse than individual strategies? What changes to their experimental setup could you suggest to improve outcomes?
  • This paper is very focused on improving outcomes for work requesters using HR management strategies. Is there any benefit to crowdworkers in recreating traditional HR management structures on top of crowdwork systems?

Read More

The Economies of Online Cooperation: Gifts and Public Goods in Cyberspace

Paper: Kollock, The Economies of Online Cooperation: Gifts and Public Goods in Cyberspace

Discussion leader: Anamary Leal

Summary

This paper discusses the features of online communities that support cooperation and gift giving (sometimes of very expensive things, like hundred-dollar consultations). Kollock compares gift and commodity economies: receiving a commodity does not obligate you to reciprocate, whereas receiving a gift creates a felt obligation to reciprocate. Gifts are "the thing that so-and-so gave me," while commodities are just "a" thing. On the internet, if you give the gift of free advice, the gift goes to some huge group and no individual feels obligated to reciprocate, but there may be a sense of reciprocity within the group as a whole.

Online goods are public goods: they are indivisible (one person viewing an answer does not hinder another), non-excludable (you can't exclude others from the good), and can be duplicated. Everyone benefits from a public good, but that doesn't mean it will actually be produced, and the temptation for online users to contribute little while still reaping the benefits, known as free-riding, arises. In a privileged group, a single contributor gains enough benefit to justify paying the whole cost of producing the good. So how do you motivate people to produce the good and to coordinate with others?

One motivation is anticipated reciprocity, that is, the expectation of help from the group itself in the future. A good contributor to a forum may feel entitled to receive help from the forum later on, and one study found that active contributors do indeed get help more quickly than others. Another motivation is maintaining an online reputation (which implies a persistent identity attached to contributions, so that they can be tracked). Self-efficacy is also a well-studied motivator: the logic is that one helps the group in order to make one's own impact on the world feel larger.

The paper discusses two case studies in online cooperation. The first is the making of Linux, which, despite showing many markers of a project likely to fail, succeeded because one person did a large amount of the work up front to get it usable and make it compelling to contributors. Programmers then contributed drivers to get Linux to work on their own devices.

The second is NetDay, which connected elementary schools to internet access by using an online rally to recruit and coordinate volunteers and accomplish the wiring in a single day. A committee also did much of the work, holding face-to-face meetings with school officials and the like. The online system allowed people to sign up based on schools' needs.

Kollock cautions that while online communities can rally together to do great things, it is interest, not necessarily importance, that rallies people. A massive plumbing repair effort, for instance, would likely be less successful than the massive internet wiring effort, even if it were just as important. Additionally, many digital goods are produced and managed by a small group or even a single person, at least initially.

Reflection:

The paper shows a few hints of its age, such as stressing the benefits of instant online communication compared to running a mail, TV, or newspaper campaign. But it remains compelling in how it starts outlining the features that shape how these communities interact (ongoing interaction, identity persistence, and knowledge of prior interactions).

In discussing motivation, it was an interesting choice to first leave altruism and group attachment out of the equation, assuming that everyone is in it for themselves, and then ease into more altruistic motivations like group needs. To keep the discussion focused, it was a good idea. But while the paper mentions that such motivations are rare, I wonder how much altruism, group need, or attachment affect how much people contribute.

The author stresses that many of these efforts are started by a small group or a single person. In the Linux example, Linus put an enormous amount of work into getting Linux to a usable state, and then released it so that programmers could contribute and check each other's contributions. There was no SVN, Git, or similar version control system back then to support this (or at least, not that I could find). I can only imagine how hard it was to keep and manage the code repository.

Additionally, how big was the core committee that managed NetDay? It moved 20,000 volunteers, but how many people built the online site and held the face-to-face meetings? I wouldn't be surprised if it was one person or a handful of people who met regularly and coordinated. I also surmise that this project took a large chunk of their time, compared to the regular volunteer who spent a day wiring.

Fast-forward to now: we have systems that make such endeavors much easier to facilitate. Yet I do not see multiple reasonable OSes or multiple reasonable alternatives to common software. Most commonly used software, at least among the majority of people I have seen (not just technologists), is made by Apple, Microsoft, or Google. I wonder how much quality still remains a factor. One would think that the earlier crowdsourcing efforts would be the most mature and the most successful now, instead of potentially less interesting efforts like those on Amazon Mechanical Turk.

Discussion:

  1. The discussion of reciprocity is set in terms of accountability and credit, in 1999. What kinds of mechanisms have you seen online that are designed to keep track of a user’s contributions to a community? How well do they work or not work?
  2. One would assume that the earliest crowdsourcing efforts would have had the most time to mature and be the most successful (public events to benefit others, and making software). But Turk, with its boring tasks, is the most successful, and may not be widely motivating or interesting. Why aren’t these older online communities the most successful? Are there still unsolved challenges?
  3. What’s the relationship between efforts made by one individual or a small group and the efforts of the crowd? Torvalds built an OS, and surely some core set of people met and worked on NetDay for countless hours. In my experience, the most successful massive efforts are led by a core dedicated group meeting live. In other words, how much effort does an individual or group need to put in to get these online communities to successfully complete such projects?
  4. Could these individuals, in the present day, delegate some of the core tasks (developing an OS, organizing a NetDay of 20,000 volunteers) to others? If so, how, and which parts could be crowdsourced? Are there any technologies or techniques that come to mind? If not, why not?

Read More

CrowdForge: Crowdsourcing Complex Work

Aniket Kittur, Boris Smus, Susheel Khamkar, and Robert E. Kraut. CrowdForge: crowdsourcing complex work. Proceedings of the 24th annual ACM symposium on User interface software and technology, October 16-19, 2011, Santa Barbara, California, USA.

Discussion Leader: Shiwani Dewal

Summary

CrowdForge is a framework which enables the creation and completion of complex, inter-dependent tasks using crowd workers. At the time the paper was written (and even today), platforms like Amazon Mechanical Turk facilitated access to micro-workers who complete simple, independent tasks requiring little or no cognitive effort. Complex tasks, by contrast, traditionally require more coordination, time, and cognitive effort, especially from the person managing or overseeing the work. These challenges become even more acute when crowd workers are involved.

To address this issue, the authors present their framework, CrowdForge, along with case studies carried out through a web-based prototype. The CrowdForge framework is drawn from distributed computing (MapReduce) and consists of three steps: partition, map, and reduce. The partitioning step breaks a higher-level task into single units of work. The mapping step assigns the units of work to workers; the same task may be assigned to several workers to allow for improvements and quality control. The final step is reduction, in which the units of work are combined into a single output, which is essentially the solution to the higher-level task.
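
As a rough illustration of that flow, here is a minimal TypeScript sketch of the partition/map/reduce pattern applied to an article-writing task like the case study described below. The helper names and prompts are my own stand-ins for posting HITs and collecting results, not CrowdForge's actual API.

```typescript
// Minimal sketch of CrowdForge's partition/map/reduce pattern for crowd work
// (illustrative only; CrowdTask is a hypothetical stand-in for posting a HIT
// and waiting for worker responses, not CrowdForge's real interface).

type CrowdTask<I, O> = (input: I) => Promise<O>;

// Partition: a HIT asks workers to break the high-level task into subtasks,
// e.g., proposing an outline of sections for an article topic.
async function partition(topic: string, askCrowd: CrowdTask<string, string[]>): Promise<string[]> {
  return askCrowd(`Propose an outline of sections for an article about "${topic}".`);
}

// Map: each subtask is sent to one or more workers in parallel.
async function mapSections(sections: string[], askCrowd: CrowdTask<string, string>): Promise<string[]> {
  return Promise.all(
    sections.map((section) => askCrowd(`Write a short paragraph for the section "${section}".`))
  );
}

// Reduce: another HIT combines the mapped outputs into a single result.
async function reduce(paragraphs: string[], askCrowd: CrowdTask<string, string>): Promise<string> {
  return askCrowd(`Combine these paragraphs into one coherent article:\n\n${paragraphs.join("\n\n")}`);
}

// Overall flow: partition a topic, map sections to workers, then reduce
// their paragraphs into a single article.
async function writeArticle(
  topic: string,
  askForList: CrowdTask<string, string[]>,
  askForText: CrowdTask<string, string>
): Promise<string> {
  const sections = await partition(topic, askForList);
  const paragraphs = await mapSections(sections, askForText);
  return reduce(paragraphs, askForText);
}
```

In CrowdForge itself, each of these steps is carried out by crowd workers through HITs; the sketch only captures the data flow between them.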

The framework was tested through several case studies. The first case study involved writing a Wikipedia article about New York City. Surprisingly, the articles produced by groups of workers across HITs were rated, on average, as highly as the Simple English Wikipedia article on New York City, and higher than full articles written by individuals as part of a higher-paying HIT. Quality control was tested both through further map and reduce steps to merge results and through voting, with merging deemed more effective. The second case study involved collating information for researching purchase decisions; the authors do not provide any information about the quality of the resulting information. The last case study dealt with the complex flow of turning an academic paper into a newspaper article for the general public. The paper discusses the steps used to generate news leads (the hook for the article) and a summary of the researchers’ work, as well as the quality of the resulting output.

The CrowdForge approach looked very promising, as exemplified by the case studies. It also had a few disadvantages, such as not supporting iterative flows, assuming that a task can in fact be broken down into single units of work, and possible overlap between the results of a task due to the lack of communication between workers. The authors conclude by encouraging researchers and task designers to consider crowdsourcing for complex tasks and to push the limits of what can be accomplished through this market.

Reflections

The authors have identified an interesting gap in the crowdsourcing market: the ability to get complex tasks completed. Although requesters may have broken their tasks down into HITs in the past and handled the combining of results on their own, CrowdForge’s partition-map-reduce framework seems like it could alleviate that challenge and streamline the process, to some extent.

I like the way the partition-map-reduce framework is conceptualized. It seems fairly intuitive and seems to have worked well for the case-studies. I am a little surprised (and maybe skeptical?) that the authors did not include the results of the second case study or more details for the rest of the third case study.

The other aspect I really liked about the paper was the effort to identify and test alternative or creative ways to solve common crowd sourcing problems. For example, the authors came up with the idea of using further map-and-reduce steps in the form of merging as an alternative to voting on solutions. Additionally, they came up with the consolidate and exemplar patterns for the academic paper case study, to alleviate the problems of the high complexity of the paper and the effort workers expected to put in.

The paper mentions in its section on limitations that there are tasks which cannot be decomposed, and that for these, another market with skilled or motivated workers should be considered. This brings me back to the notion that perhaps crowdsourcing in the future will look more like crowdsourcing for a skill set, a kind of skill-based consulting.

In conclusion, I think that the work presented in the paper looks very promising, and it would be quite interesting to see the framework being applied to other use-cases.

Discussion

1. The paper mentions that using further map and reduce steps to increase the quality of the output, as opposed to voting, generated better results. Why do you think that happened?

2. There may be tasks which are too complex to be decomposed, or decomposed tasks which require a particular skill set. Some crowd sourcing platforms accomplish this through having an “Elite Taskforce”. Do you think this is against the principles of crowd sourcing, that is, that a task should ideally be available to every crowd worker or is skill-based crowd sourcing essential?

3. CrowdForge breaks tasks up, whereas TurKit allowed iterative work-flows, and the authors talk about their vision of merging the approaches. What do you think would be some key benefits of such a merged approach?

4. The authors advocate for pushing the envelope when it comes to the kind of tasks which can be crowd sourced. Thoughts?

Read More

Beyond the Turk: An empirical comparison of alternative platforms for crowdsourcing online behavioral research

Eyal Peer, Sonam Samat, Laura Brandimarte, and Alessandro Acquisti

Discussion Leader: Divit Singh

Summary

This paper focused on finding alternatives to MTurk and evaluating their results.  MTurk is considered the front-runner among similar crowdsourcing platforms since it tends to produce high quality data.  However, since MTurk’s worker growth is starting to stagnate, the workers who use MTurk have become extremely efficient at completing the tasks that are typically published there, because those tasks tend to be very similar (surveys, transcription, etc.).  This familiarity with tasks has been shown to reduce the effect sizes of research findings (completing the same survey multiple times with the same answers skews the data being collected).  For this reason, the paper explores other crowdsourcing platforms and evaluates their performance, results, similarities, and differences in an effort to find a viable alternative for researchers to publish their tasks.

In order to evaluate the performance of these various crowdsourcing platforms, the authors tried to post the same survey on all six platforms being tested.  Of these six, only three successfully published the survey.  Some platforms simply rejected the survey without a reason; others either required a considerable amount of money or had errors that prevented the study from exploring them.  Of the platforms that were able to publish the survey, the only viable alternative to MTurk turned out to be CrowdFlower.  The surveys included attention-check questions, decision-making questions, and a question that measured the honesty of the worker.  This paper provides an excellent overview of the various properties of each platform and includes many tables that outline when one platform may be more effective than another.

Reflection

This paper does present a considerable amount of information about the various platforms described.  However, reading through it really revealed the lack of any actual competition to MTurk.  Sure, it discusses CrowdFlower as a good alternative for reaching a different population of workers, but CrowdFlower is still considered less than equal to MTurk in many instances.  The main rationale for using these other platforms is that MTurk workers have become extremely efficient at completing tasks, which may skew results.  I believe it is only a matter of time before workers on these other platforms lose their “naivety” as the platforms mature.

The results of this paper may be invaluable to a researcher who wants to really target their audience.  For example, the paper reveals that CBDR is managed by CMU and is composed of students and non-students.  Although not guaranteed, it might be the most appealing option for a researcher who wants to target college students, since it may contain a considerable university student population.  Another excellent bit of information they provide is the failure rate of the attention-check questions posted in their survey.  This highlights two things: how inattentive workers can be during surveys, and how experienced the workers of MTurk really are (they have most likely seen questions like these in the past, which prevents them from making the same mistake again).  However, keep in mind that these results are a snapshot at a given time.  There is nothing preventing the workers of CrowdFlower (who are apparently disjoint from MTurk’s workers and form a massive worker base of their own) from learning from these surveys and becoming smarter workers.

Questions

  1. Is there any other test that you believe the study missed?
  2. Based on the tables and information provided, how would you rank the different crowdsourcing platforms?  Why?
  3. This paper outlined the different approaches these platforms take (e.g., a review committee that determines whether a survey is valid).  Which method do you agree with, or how would you design your own platform in order to optimize quality and reduce noise?


Read More

TurKit: human computation algorithms on mechanical turk

Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 2010. TurKit: human computation algorithms on mechanical turk. In Proceedings of the 23rd annual ACM symposium on User interface software and technology (UIST ’10). ACM, New York, NY, USA, 57-66. DOI=10.1145/1866029.1866040 http://doi.acm.org/10.1145/1866029.1866040

Discussion Leader: Adam Binford

Summary

TurKit is a framework developed by the authors to develop algorithms that incorporate human computation through Amazon’s Mechanical Turk. Mechanical Turk has its own API for requesters to use to integrate with their algorithms, but it’s mostly only conducive to highly independent and parallelizable tasks. TurKit provides a way to develop sequential and iterative algorithms that make use of multiple steps of human computation. It does this through some JavaScript extensions and a new crash-and-rerun programming model.

The crash-and-rerun model allows a TurKit script to be re-executed without having to rerun the Mechanical Turk calls, which are costly in both time and money. During the initial run of a script, calls to the Mechanical Turk API go through and their results, whenever they are returned, are saved, so that subsequent executions of the script simply reuse the saved results. This allows you to develop, test, and modify your TurKit scripts without having to pay someone to do the HIT each time. This saving of state is achieved through a few keywords introduced on top of regular JavaScript syntax. The keyword once means the indicated function should only be called on the first run; in subsequent executions the saved result is used. Additionally, the functions fork and join facilitate parallelism, letting multiple HITs run at the same time and their results be combined in later logic. The authors discuss some iterative applications that can make use of this paradigm, such as iterative writing or iterative blurry text recognition. Finally, they discuss two experiments by other researchers that made use of TurKit, and the performance cost of the crash-and-rerun paradigm.
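
To make the crash-and-rerun idea concrete, here is a minimal sketch of a once-style memoizing wrapper, written in TypeScript rather than TurKit's own JavaScript runtime. Expensive, nondeterministic calls (standing in for posting a HIT and waiting on workers) are journaled to disk so that a re-run replays recorded results instead of re-posting work. The file name, helper names, and the stand-in postHitAndWait function are assumptions for illustration, not TurKit's API.

```typescript
// Minimal sketch of the crash-and-rerun idea: expensive, nondeterministic
// calls are journaled to disk, so re-running the script replays recorded
// results instead of repeating them. Illustrative only, not TurKit's code.
import * as fs from "fs";

const JOURNAL_PATH = "journal.json"; // hypothetical journal file
const journal: unknown[] = fs.existsSync(JOURNAL_PATH)
  ? JSON.parse(fs.readFileSync(JOURNAL_PATH, "utf8"))
  : [];
let cursor = 0; // position in the journal for this run

// Run `fn` only if no recorded result exists; otherwise replay the record.
async function once<T>(fn: () => Promise<T>): Promise<T> {
  const index = cursor++;
  if (index < journal.length) {
    return journal[index] as T; // replay a previously recorded result
  }
  const result = await fn();                                // first run: do the work
  journal.push(result);                                     // record it...
  fs.writeFileSync(JOURNAL_PATH, JSON.stringify(journal));  // ...and persist it
  return result;
}

// Hypothetical stand-in for a real Mechanical Turk call; in practice this
// would create a HIT and block until a worker submits a result.
async function postHitAndWait(prompt: string): Promise<string> {
  return `worker response to: ${prompt}`;
}

// Example: TurKit-style iterative text improvement.
async function main() {
  let text = await once(() => postHitAndWait("Describe this image."));
  for (let i = 0; i < 3; i++) {
    text = await once(() => postHitAndWait(`Improve this text: ${text}`));
  }
  console.log(text);
}

main().catch(console.error);
```

Because results are replayed by call order, the script has to be deterministic between crowd calls, which is essentially the constraint the crash-and-rerun model places on TurKit programs.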

Reflection

The key contribution of this paper I believe is the crash-and-rerun programming model. It is what enables the key benefit of TurKit, developing and modifying a human computation enabled algorithm without having to pay someone to complete a HIT each time you run it. I think the crash-and-rerun style to achieve this is really clever, and I wonder if it could have any use in non-human computation development as well. The idea of retroactive print-line-debugging would be very useful for many different scenarios. Clearly the usability of this model is dependent on the amount of non-deterministic code in your algorithm, and many real world programs would be too large or have too much concurrency to make it feasible.

Crowdsourcing does have some unique properties that make it especially suited for this model. As stated in the paper, compared to the time it takes to complete a HIT, code execution time is almost negligible. So while some overhead with re-executing the program is acceptable, it may not be for other areas. Additionally, most human computation tasks tend to be fairly straightforward. If it’s not completely parallelizable, it is a simple iterative process. It will be interesting to see if toolkits like these enable people to come up with more complex human computation algorithms that stretch the limits of the crash-and-rerun model.

Questions

– Do you think most human computation implementations could make use of TurKit, or are the majority based on single parallelizable tasks?
– What human computation algorithms might be too complex for TurKit?
– What other applications would there be for the crash-and-rerun model?
– Do you think the example of writing an article would be possible through a pool of generally unskilled and unknowledgeable workers?

Read More

Conducting behavioral research on Amazon’s Mechanical Turk

Winter Mason and Siddharth Suri. 2012. Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods 44, 1: 1–23. http://doi.org/10.3758/s13428-011-0124-6

Discussion Leader: Anamary Leal

Summary:

This guide both motivates researchers to conduct research on Amazon Mechanical Turk, citing numerous benefits (a faster theory-to-experiment turnaround time, a large pool of diverse and qualified participants at low pay, suitability for tasks that are easier for humans than for computers, data quality comparable to laboratory testing, and other benefits we covered in class), and serves as a how-to guide on strategies for conducting behavioral research on Mechanical Turk.

They’ve collected data from multiple studies looking at worker demographics, requester information, and what a HIT looks like. A HIT can be hosted internally or externally and is made up of multiple assignments, the parts that individual workers actually complete; one worker can only work on one assignment in a given HIT. The paper also covers how to create a HIT, collect data, and close a HIT. Additionally, the paper offers insight into worker and requester relationships, such as how workers search for HITs, and recommendations for requesters on making their HITs successful and engaging with workers.

The paper also covers in depth how to implement various kinds of experiments, such as synchronous experiments, which require first collecting a group of reliable, high-quality workers by running preliminary studies. It also touches on informative practical details such as how to perform random assignment.


Reflection:

In addition to being a how-to guide, this paper serves as a good literature survey, covering prior work alongside its practical advice. There were all kinds of useful data scattered throughout the document, which was especially helpful.

I just finished using AT for the homework before doing this reflection, and according to the review, finding the most recently created HITs was the most popular way workers searched for HITs, with the second most popular being sorting by the number of HITs available, so that a worker can learn how to perform a task well and then do similar tasks faster. I wonder what these strategies reveal about how workers use the system. We know that now a small group of professional workers completes the majority of HITs. Do workers look for new HITs in order to snatch up the best-paying ones? Is that a hint of how they use the system?

In my short experience doing HITs on AT, including IRB-approved research, I still felt like the pay was atrocious compared to the work done, but I also find reasonable the authors’ arguments about the convenience for workers of not needing to schedule a time to come into a laboratory, along with quality not being affected by pay (though one of the studies compares quality between paying a penny and 10 cents, which isn’t very much).

The paper goes into detail about pay on AT, from practical considerations to quality issues to ethics. There is a negative relationship between the highest-wage tasks and the probability of a HIT being taken; I found that the highest-wage tasks call for such a ridiculous amount of time and effort that they were not worth it.

This paper has been cited at least 808 times (according to Google Scholar) and advocates for positive, professional relationships between workers and requesters. It can still inform requester-researchers now, in 2015.

Questions:

  1. The paper cites work showing that workers find HITs by looking at the most recently created HITs first. What does this strategy say about Turk and its workers?
    1. What were your strategies for finding appropriate HITs to take?
  2. How did this paper affect how you completed the assignment? Or, if you did the homework first and then read the paper, how would you have done your homework (as a worker and requester) differently?
  3. The authors address one of the biggest questions potential researchers may have: compensation. They present a case that quality for certain tasks (not judgment or decision type tasks) generally remains the same regardless of pay, and that in-lab participants should be paid more. They recommend starting below the reservation wage ($1.38/hr) and increasing with reception. Given our discussions about workers and pay, what do you think of this rate?
  4. The writers encourage new requesters to “introduce” themselves to the Mechanical Turk community by first posting to Turker Nation before putting up HITs, and to keep a professional rapport with their workers as if they were company employees. This paper was published in 2012 and has been cited widely. How do you see the influence of this attitude (or its absence) among requesters and workers?
  5. How applicable are some of these techniques on MTurk with respect to some of the issues we discussed before, such as Turkopticon’s rating system, the Workers’ Bill of Rights, and other ethical and quality issues?


Read More

Turkopticon: Interrupting Worker Invisibility in Amazon Mechanical Turk 

Lilly C. Irani and M. Six Silberman. 2013. Turkopticon: interrupting worker invisibility in Amazon Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13). ACM, New York, NY, USA, 611-620. DOI=10.1145/2470654.2470742 http://dl.acm.org/citation.cfm?id=2470742

Discussion leader: Mauricio De La Barra

Summary

This paper provides an analysis of Amazon Mechanical Turk, a human computation system, as a site of technically mediated worker-employer relationships. The authors argue that human computation currently relies on worker invisibility, which in turn leads HCI researchers to pay less attention to crowdwork’s ethics and values. In order to bring to light the relations between requesters and Turkers, they conducted a case study in which they asked 67 Turkers what they would desire in a “Workers’ Bill of Rights.” The points of agreement in the survey became the basis of the design of Turkopticon, an activist system that allows workers to do two main things: publicize and evaluate their relations with employers, and engage one another in mutual aid. Turkopticon was developed as a browser extension for Chrome and Firefox. When a worker browses Mechanical Turk for HITs, the extension displays a button next to each requester’s name. On mouse-over, the worker can see the requester’s ratings on four qualities (Communicativity, Generosity, Fairness, and Promptness) along with a link to a website with all the written reviews for that requester. Workers can also leave their own reviews. The paper goes on to explain that the design of Mechanical Turk favors requesters over workers, which leads to the unjust treatment of workers. Turkopticon attempts to bring more fairness to the relationship between requesters and Turkers by holding requesters accountable and enabling help among workers. This piece of software has become an essential tool for many Turkers: it has been installed over 7,000 times, and the Turkopticon website receives 100,000 page views a month. The authors conclude the paper by highlighting the lessons they have learned from intervening in large-scale socio-technical systems such as Mechanical Turk.
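
As a rough illustration of how such an overlay can work, here is a minimal content-script sketch in TypeScript. It is not Turkopticon's actual code: the DOM selector, the review-service endpoint, and the data shape are all assumptions for illustration.

```typescript
// Minimal content-script sketch of a Turkopticon-style overlay (illustrative
// only; the selector, endpoint, and data shape are assumptions, not the real ones).

interface RequesterRatings {
  communicativity: number;
  generosity: number;
  fairness: number;
  promptness: number;
  reviewUrl: string;
}

// Hypothetical review-service lookup keyed by requester ID.
async function fetchRatings(requesterId: string): Promise<RequesterRatings | null> {
  const resp = await fetch(`https://example-reviews.org/api/requesters/${requesterId}`);
  return resp.ok ? (resp.json() as Promise<RequesterRatings>) : null;
}

// Attach a small rating badge next to every requester name on the page.
async function annotateRequesters(): Promise<void> {
  const links = document.querySelectorAll<HTMLAnchorElement>("a[data-requester-id]");
  for (const link of Array.from(links)) {
    const ratings = await fetchRatings(link.dataset.requesterId!);
    if (!ratings) continue;

    const badge = document.createElement("span");
    badge.textContent = " [reviews]";
    badge.title =
      `Communicativity: ${ratings.communicativity}/5\n` +
      `Generosity: ${ratings.generosity}/5\n` +
      `Fairness: ${ratings.fairness}/5\n` +
      `Promptness: ${ratings.promptness}/5`;
    badge.style.cursor = "pointer";
    badge.onclick = () => window.open(ratings.reviewUrl, "_blank");
    link.insertAdjacentElement("afterend", badge); // show ratings on hover, reviews on click
  }
}

annotateRequesters();
```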

Reflection

As someone who doesn’t use Mechanical Turk often, I find this paper to be a great overview of some of the ethical issues around this platform. While some of them are somewhat obvious (such as whether or not to have a minimum wage for HITs), I hadn’t really internalized these issues while using the system. One of the most interesting points the paper makes is that, in designing Mechanical Turk, Amazon has prioritized the needs of employers over workers. For example, by hiding workers behind APIs, Mechanical Turk lets employers see themselves as builders rather than as employers unconcerned with Turkers’ working conditions. Also, because Mechanical Turk’s participation agreement gives requesters intellectual property rights over submissions regardless of rejection, workers have no legal recourse against employers who reject work and then use it. One would think that a system such as Mechanical Turk, which is built on relations between requesters and workers, would be designed so that both parties have their concerns addressed. But since Amazon’s goal (like any other company’s) is to make money, it treats workers as interchangeable, and because there are so many workers, Mechanical Turk can sustain the loss of those who won’t abide by its terms; since Amazon collects money on task volume, it has little reason to prioritize worker needs. I feel that systems such as Turkopticon are a step in the right direction toward making workers’ relationships with employers visible to other workers on Mechanical Turk, but I feel that change also needs to happen at the infrastructure level: Amazon should consider the ethical issues that arise through the use of Mechanical Turk as well.

Questions

  • Do you think that if the Turkopticon extension gets widely adopted among workers on Mechanical Turk, requesters would move on to a different human computation/crowdsourcing platform? Why?
  • Do you agree or disagree with the criteria used to review requesters (do they represent an accurate framing of the interaction with the requester)? What other criteria could have been used instead?
  • Turkopticon’s developers hope that Amazon will change its system design to include worker safeguards in Mechanical Turk. This has not happened yet. If Amazon becomes aware of Turkopticon and how useful it is for workers, do you think it might consider changing its design? In what ways?
  • What are some policies that Mechanical Turk should adopt in order to show that it cares not just about the needs of requesters but also about the needs of workers (e.g., requiring requesters to justify rejections, having a minimum wage for HITs, etc.)? Or do you think that the system is fine as it is?

Read More

We Are Dynamo: Overcoming Stalling and Friction in Collective Action for Crowd Workers

Niloufar Salehi, Lilly C. Irani, Michael S. Bernstein, Ali Alkhatib, Eva Ogbe, Kristy Milland, and Clickhappier. 2015. We Are Dynamo: Overcoming Stalling and Friction in Collective Action for Crowd Workers. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, New York, NY, USA, 1621-1630. DOI=10.1145/2702123.2702508 http://doi.acm.org/10.1145/2702123.2702508

Discussion leader: Nai-Ching Wang

Summary

Drawing on a year-long ethnographic field study, this paper proposes structured labor to overcome two obstacles impeding collective action among crowd workers. By participating deeply in the ecology of human computation rather than observing from the outside, the authors find that although forums already existed to help improve working conditions, several challenges still prevented further collective action, such as trust, privacy, the risk of losing jobs, and workers’ diverse purposes for working. To support collective action in the Amazon Mechanical Turk community, Dynamo is designed to provide three affordances: trust and privacy, assembling a public, and mobilizing. Through several campaigns on the Dynamo platform, the authors identify two intertwined threats to collective action: stalling and friction. To overcome these two threats, the paper proposes structured labor, including “Debates with deadlines”, “Act and undo”, “Produce Hope”, and “Reflect and Propose”, and demonstrates how this structured labor can be used in real cases.

Reflection

I see several connections between this paper and our previous topics and in-class discussions. Based on definitions from Quinn and Bederson’s paper, this paper falls into the categories of social computing and crowdsourcing, and thus collective intelligence, in the sense that the crowd workers on Amazon Mechanical Turk form an online community and (ideas for) social movements are crowdsourced to the community rather than to designated leaders or consultants. In the discussion of the question “Can we foresee a future crowd workplace in which we would want our children to participate?”, this paper offers one possibility: crowd workers taking collective action to strive for a fair labor marketplace. The paper also seems to be a good example of human-computer collaboration. Interestingly, Dynamo (the computer side) is designed around the concept of affordances, naturally in line with the affordance-based framework discussed previously, while Turkers and admins provide human affordances such as sociocultural awareness, creativity, and domain knowledge. The examples in this paper also demonstrate that crowd workers can complete not only micro-tasks, as seen in last week’s discussion, but also brainstorming and quality writing. In addition, it seems that traditional management works better than algorithmic management for this kind of effort, at least for now.

Questions

  • Do you think the campaigns in the paper were successful? Why or why not? What else do you think is important for the success of collective action?
  • For the labor of action (debates with deadlines, act and undo, the production of hope, reflect and propose),
    • Which parts do you think might be addressed automatically by computer algorithms? In other words, which parts are the ones that really need human computation (at least for now)?
    • Can these tasks be further divided into smaller tasks?
    • How might that be done?
  • How do we find a trustworthy person to be the “moderator”? Or rather, how do we decide whether a person is trustworthy enough to be the “moderator”?
  • Can we delegate some moderation to computers? Which parts? And how?

Read More