VizWiz: Nearly Real-time Answers to Visual Questions

Bigham, Jeffrey P., et al. “VizWiz: Nearly Real-time Answers to Visual Questions.” Proceedings of the 23rd annual ACM symposium on User interface software and technology. ACM, 2010.

Discussion Leader: Sanchit

Crowdsourcing Example: Couch Surfing

YouTube video for a quick overview: VizWiz Tutorial

Summary

VizWiz is a mobile application designed to answer visual questions for blind people in near real time by taking advantage of existing crowdsourcing platforms such as Amazon’s Mechanical Turk. Existing software and hardware that help blind people solve visual problems are either too costly or too cumbersome to use: OCR is not yet advanced or reliable enough to solve vision-based problems completely, and existing text-to-speech software only addresses the single issue of reading text back to the user. The application interface takes advantage of Apple’s accessibility service VoiceOver, which lets the operating system speak to the user and describe the currently selected option or view on the screen. Touch-based gestures are used to navigate the application so that users can easily take a picture, ask a question, and receive answers from remote workers in near real time.

The authors also present an abstraction layer on top of the Mechanical Turk API called quikTurkit. It allows requesters to create their own website on which Mechanical Turk workers are recruited and can answer questions posed by users of the VizWiz application. A constant stream of HITs is posted on Mechanical Turk so that a pool of workers is already available as soon as a new question is posed. While the user is taking a picture and recording their question, VizWiz notifies quikTurkit, which begins recruiting workers immediately and thereby reduces the overall latency in waiting for an answer to come back.
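The paper describes quikTurkit’s behavior rather than its code, but a minimal sketch of the pooling idea, with made-up names and no real Mechanical Turk API calls, might look like this:

```python
import queue


class WorkerPool:
    """Minimal sketch of the quikTurkit idea: keep a pool of recruited workers
    busy on old or placeholder questions so a new question can be answered
    almost immediately. Names and structure are illustrative, not the actual
    quikTurkit API."""

    def __init__(self, target_pool_size=5):
        self.target_pool_size = target_pool_size
        self.active_workers = 0
        self.pending_questions = queue.Queue()

    def notify_question_in_progress(self):
        # Called as soon as the user starts taking a picture / recording,
        # so recruitment overlaps with question entry and hides latency.
        while self.active_workers < self.target_pool_size:
            self.post_hit()

    def post_hit(self):
        # In the real system this would post a HIT on Mechanical Turk that
        # points workers at the requester's answering website.
        self.active_workers += 1

    def ask(self, image, question_audio):
        # Queue the real question for the already-recruited pool.
        self.pending_questions.put((image, question_audio))

    def next_task_for_worker(self):
        # Workers in the pool poll for work; if no new question is waiting,
        # they are given an older (already answered) question to stay warm.
        try:
            return self.pending_questions.get_nowait()
        except queue.Empty:
            return ("recent_image.jpg", "recent_question.wav")
```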

VizWiz also featured a second version that detected blurry or dark images and asked users to retake them in order to get more accurate results. The authors also developed a use case, VizWiz:LocateIt, which helps blind users locate an object in 3D space. The user takes a picture of the area where the desired object is located and then asks for the location of a specific object. Crowd workers highlight the object, and the application combines the camera properties, the user’s location, and the highlighted region to determine how much the user should turn and how far they should walk to reach the general vicinity of the object. The post-study surveys generated many favorable responses, showing that this technology is in demand among blind users and may set up future research to automate the answering process without human involvement.
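The paper computes the turn direction from the camera parameters and the crowd-highlighted region; as a rough illustration (my own simplification under a pinhole-camera assumption, not the authors’ actual algorithm), the turn angle could be estimated like this:

```python
import math


def turn_angle_degrees(highlight_x, image_width, horizontal_fov_deg=60.0):
    """Approximate how far the user should turn so that the crowd-highlighted
    object is centered. Assumes a simple pinhole camera with the given
    horizontal field of view; the real LocateIt pipeline is more involved."""
    # Offset of the highlighted object from the image center, in [-0.5, 0.5].
    offset = (highlight_x / image_width) - 0.5
    # Map the offset onto the camera's angular field of view.
    half_fov = math.radians(horizontal_fov_deg) / 2.0
    return math.degrees(math.atan(2.0 * offset * math.tan(half_fov)))


# Example: an object highlighted at pixel 800 in a 1024-pixel-wide photo
# suggests turning roughly 18 degrees to the right.
print(round(turn_angle_degrees(800, 1024), 1))
```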

Reflections

I thought the concept in itself was brilliant. It is a problem that not many people think about in their daily lives, but when you sit down and really start to ponder how trivial tasks, such as finding the right object in a space, can be nearly impossible for blind people, you realize the potential of such an application. The application design was very solid. Apple designed the VoiceOver API for vision-impaired people to begin with, so using it in such an application was the best choice. Employing large gestures for UI navigation is also smart, because it can be very difficult or impossible for vision-impaired people to tap a specific button or option on a touch-based screen or device.

QuikTurkit was, in my opinion, a good foundation for the backend of this application. It could definitely be improved by not placing so much focus on speech recognition and by not bombarding Mechanical Turk with too many HITs. Finding the right balance between the number of active workers in the pool and the number of HITs to be posted would benefit both the system load and the cost the user has to incur in the long run.

A minor inconsistency I noticed was that the study initially reported 11 blind users including 5 females, but later in the paper there were 3 females. Probably a typo, but thoughts? Speaking of their study, I think the heuristics made a lot of sense and the survey results were generally favorable for the application. A latency of 2-3 minutes on average is not too bad considering the otherwise helpless situation of a vision-impaired person; any additional information or answered question can only help. I honestly did not see the point of making speech recognition a focus of the product. If workers can simply listen to the question, that should be sufficient to answer it; there is no need to introduce errors from failed speech recognition attempts.

In my opinion, VizWiz:LocateIt was too complicated a system, with too many external variables to worry about, for a visually impaired user to successfully find an object. The location detection and mapping are based only on the picture taken by the user, which more often than not is far from perfect. Although the authors have several algorithms and techniques to compensate for ineffective pictures, I still think there are potential hazards and accidents waiting to happen based on the direction cues provided by the application. I am not entirely convinced by this use case.

Overall it was a solid concept and execution in terms of the mobile application. It looks like the software is public and is being used by over 5,000 blind people right now, which is pretty impressive.

Questions:

  1. One aspect of quikTurkit that confused me was who actually deployed the server or built the website that Mechanical Turk workers use for this service. Was it the VizWiz team who created the server, or can requesters build their own websites using this service as well? And who would the requesters be? Blind people?
  2. Besides human compassion and empathy, what is stopping workers from giving wrong answers? Also, who determines whether an answer was correct or not?
  3. If a handheld barcode scanner works fairly well to locate a specific product in an area, why couldn’t the authors just use a barcode scanning API on the iPhone along with the existing VoiceOver technology to help locate a specific product? Do you foresee any problems with this approach?

 


Crowds in two seconds: enabling realtime crowd-powered interfaces

Bernstein, Michael S., et al. “Crowds in two seconds: Enabling realtime crowd-powered interfaces.” Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.

Discussion Leader: Shiwani

YouTube video for a quick overview: https://www.youtube.com/watch?v=9IICXFUP6MM

Summary

Crowdsourcing has been successfully used in a variety of avenues, including interactive applications such as word processors, image searches, etc. However, a major challenge is the latency in returning a result to the user. If an interface takes more than 10 seconds to react, the user is likely to lose focus and/or abandon the interface. Near real-time techniques at the time required at least 56 seconds for simple tasks and 22 minutes or longer for more complex workflows.
In this paper, the authors propose techniques for recruiting and effectively using synchronous crowds in order to provide real-time, crowd-powered interfaces. The first technique, called the retainer model, involves hiring workers in advance and placing them on hold by paying them a small amount; when a task is ready, the workers are alerted and paid an additional amount on completion. The paper also discusses empirical guidelines for this technique. The second technique introduced in the paper is rapid refinement, a design pattern for algorithmically recognizing crowd agreement early on and rapidly reducing the search space to identify a single result.
The authors created a system called Adrenaline to validate the retainer model and rapid refinement. Adrenaline is a smart photo shooter, designed to find the most photogenic moment by capturing a short (10-second) video clip and using the crowd to identify the best moment.
Additionally, the authors were interested in other applications of real-time crowd-powered interfaces. For this, they created two systems, Puppeteer and A|B. Puppeteer is intended for creative content generation tasks and allows the designer to interact with the crowd as they work. A|B is a simple platform for asking A-or-B questions, with the user providing two options and asking the crowd to choose one based on pre-specified criteria.
The results of the experiments suggest that the retainer model can assemble a crowd about two seconds after a request is made, and that a small reward for quickness remediated the longer reaction times caused by longer retainer periods. It was also found that rapid refinement enabled small groups to select the best photograph faster than the single fastest member, although forcing agreement too quickly sometimes affected quality. For Puppeteer, there was a small latency due to the complexity of the task, but throughput rates were constant. For A|B, responses were received in near real time.
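As a rough sketch of the rapid refinement pattern described above (my own simplification, with a hypothetical get_worker_picks callback standing in for real worker responses), the early-agreement loop might look like:

```python
def rapid_refinement(num_frames, get_worker_picks, agreement_fraction=0.5,
                     max_rounds=5):
    """Sketch of rapid refinement: repeatedly show the current frame range to
    several workers, detect early agreement on a sub-range, and zoom in.
    Simplified from the paper; assumes each worker returns the index of their
    currently preferred frame within the shown range."""
    lo, hi = 0, num_frames - 1
    for _ in range(max_rounds):
        picks = get_worker_picks(lo, hi)      # e.g. [42, 47, 45, 90, 44]
        window = max(1, (hi - lo) // 4)       # width of the candidate zoom region
        # Find the window containing the largest share of worker picks.
        best_start, best_count = lo, 0
        for start in range(lo, hi - window + 2):
            count = sum(1 for p in picks if start <= p < start + window)
            if count > best_count:
                best_start, best_count = start, count
        # Zoom only once enough workers agree; otherwise ask again.
        if best_count >= agreement_fraction * len(picks):
            lo, hi = best_start, min(hi, best_start + window - 1)
        if hi - lo <= 1:
            break
    return (lo + hi) // 2
```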

Reflections

This paper ventures into an interesting space and brings me back to the idea of “Wizard of Turk”, where users interact with a system whose responses are generated through human intelligence. The reality of machine learning at the moment is that there are still areas which are subjective and require a human in the loop. Tasks such as identifying the most photogenic photo, or voting on whether a red sweater looks better than a black one, are classic examples of such subjective judgments. This is demonstrated, in part, through the Adrenaline experiment, where the quality of the photo chosen through crowdsourcing was better than the computer-vision-selected photo. For subjective voting (A-or-B), users might even prefer to get input from other people, as opposed to a machine; research in media equations and affect would indicate that this is likely.
The authors talk about their vision of the future, where crowdsourcing markets are designed for quick requests. Although they have demonstrated that it is possible to assemble synchronous crowds and use them to perform real-time tasks with quick turnaround times, a lot more thought needs to go into the design of such systems. For example, if many requesters wanted to have workers on “retainer”, workers could easily accept retainers from multiple requesters and simply make some money for being on hold. The key idea of a retainer is not to prevent the worker from accepting other tasks while they wait, so these two ideas seem at loggerheads with each other. Additionally, this might introduce higher latency, which perhaps could be remediated with competitive quickness bonuses. The authors do not explicitly state how much workers were paid for completing a task, and I wonder how these amounts compared to the retainer rates they offered.
For the Adrenaline experiment, the results compared the best photo identified from a short clip through a variety of techniques: generate-and-vote, generate-one, computer vision, rapid refinement, and the photographer. It would have been interesting to see two additional conditions: a single photograph taken by an expert photographer, and a set of photographs taken by a photographer used as input to the techniques.

Questions:

1. The Adrenaline system allows users to capture the best moment, and the cost per image is about $0.44. The authors envision this cost going down to about $0.10. Do you think users would be willing to pay for such an application, especially given that Android phones such as the Samsung Galaxy have a “capture best photo” mode, whereby multiple images are taken at short intervals and the user can select the best one to save?

2. Do you think that using the crowd for real-time responses makes sense?

3. For the rapid refinement model, one of the issues mentioned was that it might stifle individual expression, and that a talented worker’s input might get disregarded as compared to that of 3-4 other workers. Voting has the same issue. Can you think of ways to mitigate this?

4. Do we feel comfortable outsourcing such tasks to crowd workers? It is one thing when it is a machine…


Frenzy: Collaborative Data Organization for Creating Conference Sessions

Lydia Chilton, Juho Kim, Paul André, Felicia Cordeiro, James A. Landay, Daniel S. Weld, Steven P. Dow, Robert C. Miller, Haoqi Zhang

Discussion Leader: Divit Singh

Summary

In a conference, similar papers are usually organized into sessions so that attendees can see related talks in the same time block.  The process of organizing papers into these sessions is nontrivial, and this paper offers a different approach to aid it.  The paper presents Frenzy: a web application designed to leverage the distributed knowledge of the program committee to rapidly group papers into sessions.  The application breaks session-making into two sub-problems: meta-data elicitation and global constraint satisfaction.  In the meta-data elicitation stage, users search for papers via queries on their abstracts, authors, etc. and group them into categories they believe make sense.  They also have the ability to “+1” categories suggested by other users to show support for those categories.  In the global constraint satisfaction stage, users must assign each paper to a session and make sure that every session contains at least two papers.  The authors tested the application at the CSCW 2014 PC meeting, and the schedule for CSCW 2014 was generated with the aid of Frenzy.
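A minimal sketch of the global constraint check, as I read it from the paper (not Frenzy’s actual implementation), could look like this:

```python
def unsatisfied_constraints(papers, sessions, min_papers_per_session=2):
    """Check Frenzy-style global constraints: every paper sits in exactly one
    session, and every session has at least two papers. Illustrative only."""
    problems = []
    assignments = {p: [s for s, members in sessions.items() if p in members]
                   for p in papers}
    for paper, assigned in assignments.items():
        if len(assigned) == 0:
            problems.append(f"paper '{paper}' is not in any session")
        elif len(assigned) > 1:
            problems.append(f"paper '{paper}' is in multiple sessions: {assigned}")
    for session, members in sessions.items():
        if len(members) < min_papers_per_session:
            problems.append(f"session '{session}' has only {len(members)} paper(s)")
    return problems


# Example with made-up paper titles.
papers = ["crowd QA", "realtime crowds", "session making"]
sessions = {"Crowdsourcing Systems": ["crowd QA", "realtime crowds"],
            "Meta": ["session making"]}
print(unsatisfied_constraints(papers, sessions))
# -> ["session 'Meta' has only 1 paper(s)"]
```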

Reflection

The idea of leveraging parallelism to create sessions for a conference is a brilliant one.  The paper mentions that this process used to take an entire day and that, even then, the schedulers were often pigeonholed into deciding which session a paper belonged to in order to satisfy constraints.  By creating a web application that gives all users access to the data, I believe the authors created an extremely useful and efficient system.  The only downside I see is that the application may give users too much power; for example, users may delete categories.  I am not sure that giving users this kind of power is wise.  For the purposes of a conference, where all users are educated and have a clear sense of the goal, it may be okay.  However, if this system were opened up to a wider audience, it could backfire.

I really liked how they divided the process into sub-problems.  From my understanding, the first stage is meant to get a general sense of where each paper belongs and to gather feedback on where the majority of users believe a paper’s appropriate category lies.  This stage is open to the entire audience so that everyone may contribute and have a say.  The second stage is more of a “clean-up” stage: a special subset of the committee members makes the final choices about which papers go into which sessions, now informed by the thoughts of the overall group.  In my head, I viewed this approach as a map-reduce job.  The metaphor may be a stretch, but in the first stage workers are just trying to “map” each paper to the best possible category; this happens in parallel and generates a growing set of results.  The second stage “reduces” these sets and delegates papers to their appropriate sessions.  For those reasons, it was pretty interesting to read how they were able to pull this off.  Apart from the information-dense UI of their web application, they did an excellent job of simplifying the tasks enough to produce valid results.

Questions

  • The Frenzy interface contains a lot of jam-packed information.  Do you think that, as a user of this system, you would understand everything that was going on?
  • The approach used by Frenzy breaks the problem of conference session-making into two problems: meta-data elicitation and session constraint satisfaction.  Do you think these two problems are too broad and could be broken down into further sub-problems?
  • The system gives users the power to delete categories.  How do you make sure that a valid category is not deleted by a user?  Can this system be used with a group larger than a conference committee, e.g. MTurk?


Crowd synthesis: extracting categories and clusters from complex data

Paul André, Aniket Kittur, and Steven P. Dow. 2014. Crowd synthesis: extracting categories and clusters from complex data. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (CSCW ’14). ACM, New York, NY, USA, 989-998.

Discussion Leader: Nai-Ching Wang

Summary

This paper proposes a two-stage approach to guide crowd workers to produce accurate and useful categories from unorganized, ill-structured text data. Although automatic techniques are already available to group text data by topic, manual labeling is still required for analysts to infer meaningful concepts. Assuming crowd workers are transient, inexpert, and conflicting, this work is motivated by two major challenges of harnessing the crowd for complex synthesis tasks. One is producing expert-quality work without requiring domain knowledge. The other is enforcing global constraints when crowd workers have only local views. The proposed approach deals with the former challenge through a re-representation stage, which consists of different combinations of classification, context, and comparison: raw text, classification (Label 1), classification plus context (Label 10), and comparison/grouping. The latter challenge is addressed through an iterative clustering stage, which shows existing categories to subsequent crowd workers in order to enforce global constraints. The results show that the classification-with-context (Label 10) approach produces the most accurate categories at the most useful level of abstraction.
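A toy sketch of the iterative clustering stage, with a simulated worker standing in for real HITs (the names and behavior below are illustrative, not the authors’ implementation):

```python
def iterative_clustering(items, ask_worker):
    """Sketch of the iterative clustering stage: each worker sees the
    categories created so far (a partial global view) and either files the
    item under one of them or proposes a new category. `ask_worker` stands in
    for posting a HIT and collecting the response."""
    categories = {}                     # category name -> list of items
    for item in items:
        existing = sorted(categories)   # shown to the worker for context
        chosen = ask_worker(item, existing)
        categories.setdefault(chosen, []).append(item)
    return categories


def fake_worker(item, existing):
    # Simulated worker: reuses a shown category if the item mentions it,
    # otherwise invents a new one (purely illustrative behavior).
    for cat in existing:
        if cat.lower() in item.lower():
            return cat
    return item.split()[0].capitalize()


labels = ["privacy concerns in telehealth", "Privacy and mobile sensing",
          "education games for kids"]
print(iterative_clustering(labels, fake_worker))
```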

Reflections

This paper resonates with our discussion about human and algorithmic computation pretty well, because it points out why humans are required in the synthesis process, and algorithmic computation was actually used to demonstrate this point. The paper also mentions potential conflicts among crowd workers, but as we can see, there are also conflicts between the professionals (the two raters), which makes me wonder whether there really are right answers. Unfortunately, the paper does not compare crowd workers’ work against each other to show how conflicting their answers are. It would also be interesting to compare the consistency of experts and crowd workers. Another interesting result is that the raw text condition is almost as good as the classification-plus-context condition except for the quality of abstraction; it seems that by combining previously discussed person-centric strategies, the raw text condition might perform as well as the classification-plus-context condition or even outperform it. In addition, the choice of 10 items for context and grouping in Stage A seems arbitrary. Based on the results, it seems that more context leads to better results, but is that true? Is there a best or minimal amount of context? For grouping, the paper also mentions that the selection of groups might greatly affect the results, so it would be interesting to see how different selections affect them. As for coarse-grained recall, it seems strange that the paper does not disclose the original values even though the authors think that result is valuable.

Questions

  • The global constraints are enforced by showing existing categories to subsequent workers. What do you think about this idea? What issues might this approach have, and what is your solution?
  • The paper seems to hint that the number of characters in a label can be used to measure its level of abstraction. Do you agree? Why? What other measures would you suggest for defining levels of concepts?
  • How would you expect quality control to be conducted?


CrowdScape: interactively visualizing user behavior and output

Rzeszotarski, Jeffrey, and Aniket Kittur. “CrowdScape: interactively visualizing user behavior and output.” Proceedings of the 25th annual ACM symposium on User interface software and technology. ACM, 2012.

Discussion Leader: Mauricio

Summary

This paper presents CrowdScape, a system that supports the evaluation of complex crowd work through mixed-initiative machine learning and interactive visualization. The system aims to address the quality-control challenges that arise on crowdsourcing platforms. Researchers have previously developed quality-control approaches based either on worker outputs or on worker behavior, but each alone has limitations for evaluating complex work: subjective tasks such as writing or drawing may have no single “right” answer and no two answers may be identical, while two workers might complete a task in different ways yet both provide valid output. CrowdScape combines worker behavior with worker output in its visualizations to address these limitations. Its features allow users to form hypotheses about their crowd, test them, and refine their selections based on machine learning and visual feedback; its interface supports interactive exploration of worker results and the development of insights about worker performance. CrowdScape is built on top of Amazon Mechanical Turk and captures data both from the Mechanical Turk API, to obtain the products of work, and from Rzeszotarski and Kittur’s Task Fingerprinting system, to capture worker behavioral traces (such as time spent on tasks, key presses, clicks, browser focus shifts, and scrolling). It uses these two information sources to create an interactive data visualization of workers. To illustrate the system’s use cases, the authors posted four kinds of tasks on Mechanical Turk and solicited submissions: translating text from Japanese to English, picking a color from an HSV color picker and writing its name, describing a favorite place, and tagging science tutorial videos. At the end of the paper, they conclude that linking behavioral information about workers with data about their output helps reinforce or contradict one’s initial conception of the cognitive processes workers use when completing tasks, and helps develop and test a mental model of the behavior of workers who produce good (or bad) outputs.
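As a toy illustration of the core idea of pairing behavioral traces with output (the features and thresholds below are invented; the real system relies on interactive visualization and machine learning rather than fixed rules):

```python
from dataclasses import dataclass


@dataclass
class Submission:
    worker_id: str
    answer: str
    seconds_on_task: float
    key_presses: int
    focus_changes: int      # how often the browser tab lost focus


def suspicious(sub, min_seconds=30, min_keys=20, max_focus_changes=5):
    """Flag a submission whose behavioral trace does not match its output.
    Hypothetical heuristics, only meant to show why combining the two
    signals is informative."""
    reasons = []
    if sub.seconds_on_task < min_seconds:
        reasons.append("finished implausibly fast")
    if sub.key_presses < min_keys and len(sub.answer) > 100:
        reasons.append("long answer but few key presses (possible paste)")
    if sub.focus_changes > max_focus_changes:
        reasons.append("frequent tab switching (possible external tool)")
    return reasons


sub = Submission("A1", "x" * 300, seconds_on_task=25, key_presses=4,
                 focus_changes=9)
print(suspicious(sub))
```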

Reflections

I think CrowdScape presents a very interesting hybrid approach to address low quality in crowdsourcing work, which according to the authors comprises about one third of all submissions. When starting to read the paper, I got the impression that logging behavioral traces of crowd workers when completing tasks would be a bit of an intrusive way to address this issue. But the explanations they give as to why this approach is more appropriate for assessing the quality of creative tasks (such as writing) than post-hoc output evaluations (such as gold standard questions) was really convincing.

I liked how they were self-critical about the many limitations that CrowdScape has, such as its need for workers to have JavaScript enabled, or how there are cases in which behavioral traces aren’t indicative of the work done, such as if users complete a task in a text editor and then paste it on Mechanical Turk. I would like to see how further research addresses these issues.

I found it curious that in the first task (translation), even though the workers were told that their behavior would be captured, they still went ahead and used translators. I would have liked to see what wording the authors used in their tasks when giving this warning, and also in describing compensation. For instance, if the authors told workers that their moves would be logged but that they would be paid regardless, then workers had no incentive to do the translation honestly, which is why the majority (all but one) might have ended up using Google Translate or another translator. On the other hand, if the authors just told workers that their moves were going to be recorded, I would imagine workers would think that not only their output but also their behavior would be evaluated, which would push them to do a better job. The wording used when telling workers that their behavioral traces are being logged is important, I think, because it might skew the results one way or the other.

Questions

  • What wording would you use to tell the workers that their behavioral traces would be captured when completing a task?
  • What do you think about looking at a worker’s behavior to determine the quality of their work? Do you think it might be ineffective or intrusive in some cases?
  • The authors combine worker behavior and worker output to control quality. What other measure(s) could they have integrated in CrowdScape?
  • How can CrowdScape address the issue of cases in which behavioral traces aren’t indicative of the work done (e.g. writing the task’s text in another text editor)?


Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk

Tanushree Mitra, C.J. Hutto, and Eric Gilbert. 2015. Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, New York, NY, USA, 1345-1354. DOI=10.1145/2702123.2702553 http://doi.acm.org/10.1145/2702123.2702553

Discussion Leader: Adam

Summary

Much of what we have read so far regarding quality control has focused on post hoc methods of cleaning data, where you either take multiple responses and find the best one, or iterate on a response to continually improve it. This paper takes a different approach, using person-centric a priori methods to improve data quality. What the authors are essentially trying to find out is whether non-expert crowdworkers can be screened, trained, or incentivized, prior to starting a task, to improve the quality of their responses.

To do this, the authors used four different subjective qualitative coding tasks to examine the effects of various interventions and incentives on data quality. People in Pictures asked workers to identify the number of people in a picture, choosing from five ranges. Sentiment Analysis had workers rate the positive or negative sentiment of tweets on a five-point scale. Word Intrusion had workers select, from a list of five words, the one that does not belong with the rest. Finally, Credibility Assessment asked workers to rate the credibility of a tweet about a world event.

The authors used three different means to intervene with or incentivize the selected workers. Targeted screening gave workers a reading comprehension qualification test. Training gave workers example qualitative coding annotations and required them to pass some practice annotations before beginning the actual tasks. Finally, a bonus rewarded workers with double the pay if their response matched the majority of workers’ responses. A second experiment varied the ways in which workers qualified for the bonus.
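A toy version of the majority-match bonus rule, as I understand it from the paper (not the authors’ code), might look like:

```python
from collections import Counter


def majority_bonus(responses, base_pay=0.10):
    """Pay double to workers whose answer matches the majority answer for an
    item. Simplified reading of the financial incentive; ties are treated as
    having no majority here."""
    counts = Counter(responses.values())
    top_answer, top_count = counts.most_common(1)[0]
    has_majority = list(counts.values()).count(top_count) == 1
    return {worker: base_pay * (2 if has_majority and answer == top_answer else 1)
            for worker, answer in responses.items()}


print(majority_bonus({"w1": "positive", "w2": "positive", "w3": "neutral"}))
# -> {'w1': 0.2, 'w2': 0.2, 'w3': 0.1}
```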

In general, the authors found that these a priori strategies effectively increased the quality of worker data, with the financial incentives having the least effect. For the first two tasks, nearly all methods provided statistically significant improvements in quality over the control-with-bonus and baseline conditions, with the combination of screening, training, and bonuses providing the highest quality for each task. Additionally, these a priori methods provided higher-quality data than iteration in the Credibility Assessment task, though not statistically significantly so.

Reflections

This paper provides many interesting results, some of which the authors did not really discuss. The most important take away from this paper is that a priori intervention methods can provide just as high quality data if not more so than process-centered methods such as iteration. And this is significant because of how scalable a priori methods are. You need only screen or train someone once, and then they will provide high quality data for as long as they work on that task. With process-centered methods, you must run the processes for each piece of data you collect, increasing overhead.

However, there are many other results worth discussing. One is that the authors found that control-condition quality has increased significantly in the past several years, indicating that AMT workers are generally providing higher-quality results than before. A few years ago, accuracies for control conditions with a 20% random-guess rate peaked at about 40%, while in this paper the control accuracies were between 55-80%. The authors credit better person-centric quality-control measures enacted by Amazon, such as stricter residency requirements and CAPTCHA use, but I wonder if that is truly the case.

One interesting result that the authors do not really discuss is that, in all three tasks from experiment 1, the control condition with the bonus incentive performed worse than the control group without the financial bonus. Additionally, the baseline group, which screened workers based on the standard 95% approval rating and 100-HIT experience requirement, performed worse than the control group without these restrictions for each of the three tasks. Maybe new workers tend to provide high-quality data because they are excited about trying something new? This seems like an important issue to look into, as many tasks on AMT use these basic screening methods.

Finally, I find it interesting that financial incentives caused no statistical improvement in quality from the screening or training interventions. I guess it goes along with some of our previous discussions, in that increasing pay will attract more workers more quickly, but once someone decides to do a HIT, the amount of money offered does not affect the quality of their work.

Questions

  • Why has the general quality of workers on AMT improved over the past few years?
  • Can you think of any other intervention or incentive methods that fit this person-centered approach?
  • While these tasks were subjective, they still had a finite number of possible responses (5). Do you think these methods would improve the quality of free-response types of tasks? And how would you judge that?
  • Do you think these methods can replace process-centered quality control all together, or will we always need some form of data verification process?


A Comparison of Social, Learning, and Financial Strategies on Crowd Engagement and Output Quality

L. Yu, P. André, A. Kittur, and R. Kraut, “A Comparison of Social, Learning, and Financial Strategies on Crowd Engagement and Output Quality,” in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, New York, NY, USA, 2014, pp. 967–978.

Discussion leader: Will Ellis

Summary

In this paper, Yu et al. describe three experiments they ran to test whether accepted human resource management strategies can be employed individually or in combination to improve crowdworker engagement and output quality. In motivating their research, they describe how crowd platform features aimed at lowering transaction costs work at cross-purposes with worker retention (which they view as equivalent to engagement). These “features” include simplified work histories, de-identification of workers, and lack of long-term contracts. The strategies the authors employ to mitigate the shortcomings of these features are social (through teambuilding and worker interaction), learning (through performance feedback), and financial (through long-term rewards based on quality).

The broad arc of each experiment is 1) recruit workers for an article summarization task, 2) attempt to retain workers for a similar follow-up task through recruitment messages employing zero (control), one, two, or three strategies, 3) measure worker retention and output quality, and 4) repeat steps 2 and 3. The first experiment tested employing all three strategies versus a control. The results showed that using these strategies together improved retention and quality by a statistically significant amount. The second experiment tested each strategy individually as well as pairs of strategies versus control. The results showed that only the social strategy significantly improved worker retention, while all three individual strategies improved output quality. However, no two-strategy combination significantly improved retention or quality. The authors view this as a negative interaction between the pairs of strategies and offer a few possible explanations for these outcomes, one of which is that they needed to put more effort into integrating the strategies. This led them to develop the integrated strategies that they tested in Experiment 3. In Experiment 3, the authors again tested each strategy individually. They also tested their improved strategy pair treatments, as well as a more integrated 3-strategy treatment. Again, only the social strategy by itself showed significant improvement in retention. Whereas Experiment 1 showed significant improvement in retention when employing all 3 strategies, the results of this experiment suggest otherwise. In addition, only the learning strategy by itself and the 3-strategy treatment showed improved output quality.

The authors conclude that the social strategy is the most effective way of increasing retention in crowdworkers and that the learning strategy is the most effective way of increasing output quality. The authors say these results suggest that multiple strategies undermine each other when employed together and that careful design is needed when devising crowdwork systems that try to do so.

Reflection

Yu et al. have taken an interesting tack in trying to apply more traditional human resource strategies to crowdworking in an attempt to improve worker engagement and output quality. I think they are correct in identifying the very qualities of systems like Amazon Mechanical Turk – extremely short-term contracts, pseudonymous identities, simplified worker histories – as what makes it difficult to employ such strategies. I appreciate their data-driven approach to measuring engagement, that is, measuring it through worker retention. However, I can’t help but question their equating of worker retention with worker engagement. Engagement implies a mental investment in the work, whereas retention only measures who was motivated to come back for tasks, regardless of their investment in the work.

A fascinating aspect of their experimental setup is that in experiments 1 and 2, the social strategy claimed team involvement in the follow-up recruitment materials but did not actually implement it. Despite this, retention benefits were clearly realized with the social strategy in experiment 2 and likely contributed to improved retention in experiment 1. Further, even though actual social collaboration was implemented in experiment 3, no further retention improvements were realized. It seems the idea of camaraderie is just as motivating as actual collaboration. The authors suggest that experiencing conflict with real teammates may offset the retention benefits of teammate interaction. Indeed, this may be where “retention” as a substitute for “engagement” breaks down: in traditional workplaces, workers engage not just with their work but also with each other, and I imagine it is much more difficult to feel engaged with pseudonymous teammates over the Internet than with teammates in person.

Disappointingly, the authors cannot claim much about combined strategies. While a 3-strategy approach is consistently better in terms of quality between experiment 1 and experiment 3, none of the strategy pairs improve retention or quality significantly. They can only recommend that, when combining strategies, designers of crowdwork systems do so carefully. I would hope that future work explores in more depth what factors are at play that undermine strategy combinations.

Questions

  • Do you think worker retention is a good measure of engagement?
  • In experiment 1, the authors did not actually operationalize the learning strategy. What, if anything, do their results say about learning strategy in this experiment?
  • What do you think of the reasons the authors give for why strategy combinations perform worse than individual strategies? What changes to their experimental setup could you suggest to improve outcomes?
  • This paper is very focused on improving outcomes for work requesters using HR management strategies. Is there any benefit to crowdworkers in recreating traditional HR management structures on top of crowdwork systems?


The Economies of Online Cooperation: Gifts and Public Goods in Cyberspace

Paper: Kollock, The Economies of Online Cooperation: Gifts and Public Goods in Cyberspace

Discussion leader: Anamary Leal

Summary

This paper discusses the features of online communities that support cooperation and gift giving (sometimes of very expensive things, like hundred-dollar consultations). The author compares gift and commodity economies: getting a commodity does not obligate you to anything further, while getting a gift creates a feeling that you should reciprocate. Gifts are “the thing that so-and-so gave me”, whereas commodities are just “a” thing. On the Internet, if you give the gift of free advice, there is no individual obligation to reciprocate, because the gift is given to a huge group; but there may be a sense of reciprocity within the group.

Online goods are public goods: they are indivisible (one person viewing an answer does not hinder another), non-excludable (you cannot exclude others from the good), and easily duplicated. Everyone benefits from such a good, but that does not mean it will be produced, and the temptation for online users to contribute little while still reaping the benefits, known as free-riding, arises. In a privileged group, only one person needs to pay the cost of contributing for everyone to benefit. How do you motivate people to produce the good and to coordinate with others?

One motivation is anticipated reciprocity: help given to the group now in expectation of help from the group in the future. A good contributor to a forum may feel entitled to receive help from that forum later, and one study found that such contributors indeed get help more quickly than others. Another motivation is maintaining an online reputation (which implies a persistent identity attached to one’s contributions so that they can be tracked). Self-efficacy is also a well-studied motivator: one helps the group in order to feel that one’s own impact is wider.

The paper discusses two case studies in online cooperation. The first is the making of Linux, which, despite showing many of the markers of a failed project, succeeded because one person did a large amount of the work up front to make it usable and compelling to contributors; programmers then contributed drivers to get Linux to work on their own devices.

The second is NetDay, which connected elementary schools to Internet access by holding an online rally to recruit and coordinate volunteers and accomplish the task in one day. A committee also did much of the work, for example by holding face-to-face meetings with school officials. The online system allowed people to sign up based on the schools’ needs.

The author cautions that while online communities can rally together to do great things, it is interest, not necessarily importance, that rallies people; a massive plumbing repair job would likely be less successful than the massive Internet wiring job. Additionally, many digital goods are produced and managed by a small group or even one person, at least initially.

Reflection:

The paper shows a few hints of its age, such as stressing the benefits of instant online communication compared to running a mail, TV, or newspaper campaign. But it remains compelling in how it begins to outline the features that shape how these communities interact (ongoing interaction, identity persistence, and knowledge of prior interactions).

In discussing motivation, it was an interesting choice to first leave altruism and group attachment out of the equation, assuming that everyone is in it for themselves, and then ease into more altruistic motivations like group need. It was a good way to keep the discussion focused, but while the paper mentions that altruism is rare, I wonder how much altruism, group need, or attachment affects how much people contribute.

The author stresses that many of these efforts are started by a small group or one person. In the Linux example, Linus put an enormous amount of work into getting Linux to a usable state, and then released it so that programmers could contribute and check each other’s contributions. There was no SVN, Git, or other version control system back then to support this (or at least, from what I checked), and I can only imagine how hard it was to keep and manage the code repository.

Additionally, how big was the core committee that managed NetDay? It moved 20,000 volunteers, but how many people built the online site and held the face-to-face meetings? I would not be surprised if it was one person or a handful of people who met regularly and coordinated. I also surmise that this project took a large chunk of their time, compared to the regular volunteer who spent a day wiring.

Fast-forward to now: we have systems that make facilitating such endeavors much easier. Yet I do not see multiple reasonable OSs or multiple reasonable alternatives to common software. Most commonly used software, at least among the majority of people I have seen (not just technologists), is made by a company such as Apple, Microsoft, or Google. I wonder how much quality remains a factor. One would think that the earliest crowdsourcing efforts would have had the most time to mature and would be the most successful now, rather than potentially less interesting efforts like those on Amazon Mechanical Turk.

Discussion:

  1. The discussion of reciprocity is set in terms of accountability and credit, in 1999. What kinds of mechanisms have you seen online that try to keep track of a user’s contributions to a community? How well do they work or not work?
  2. One would assume that the earliest crowdsourcing efforts would have had the most time to mature and be the most successful (public events to benefit others, and making software). But Turk, with its boring tasks, is the most successful, and may not be widely motivating or interesting. Why aren’t these older online communities the most successful? Are there still unsolved challenges?
  3. What is the relationship between efforts done by one individual or a small group and the efforts of the crowd? Torvalds built an OS, and surely a core set of people met and worked on NetDay for countless hours. In my experience, the most successful massive efforts are led by a core dedicated group meeting live. In other words, how much effort does an individual or group need to put in to get these online communities to successfully complete such projects?
  4. Could these individuals, in the present day, delegate some of the core tasks (developing an OS, organizing a NetDay of 20,000 volunteers) to others? If so, how, and which parts could be crowdsourced? Are there any technologies or techniques that come to mind? If not, why not?


CrowdForge: Crowdsourcing Complex Work

Aniket Kittur, Boris Smus, Susheel Khamkar, Robert E. Kraut, CrowdForge: crowdsourcing complex work, Proceedings of the 24th annual ACM symposium on User interface software and technology, October 16-19, 2011, Santa Barbara, California, USA

Discussion Leader: Shiwani Dewal

Summary

CrowdForge is a framework that enables the creation and completion of complex, interdependent tasks using crowd workers. At the time the paper was written (and even today), platforms like Amazon Mechanical Turk facilitated access to micro-workers who complete simple, independent tasks requiring little or no cognitive effort. Complex tasks, by contrast, require more coordination, time, and cognitive effort, especially from the person managing or overseeing the work, and these challenges become even more acute when crowd workers are involved.

To address this issue, the authors present their framework, CrowdForge, along with case studies carried out through a web-based prototype. The CrowdForge framework is drawn from distributed computing (MapReduce) and consists of three steps: partition, map, and reduce. The partitioning step breaks a higher-level task into single units of work. The mapping step assigns the units of work to workers; the same task may be assigned to several workers to allow for improvements and quality control. The final step is reduction, in which the units of work are combined into a single output, which is essentially the solution to the higher-level task.
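A minimal sketch of the partition-map-reduce flow for the article-writing case, with a hypothetical post_hit function standing in for sending a subtask to crowd workers (the real system manages HITs, redundancy, and voting, which this toy version skips):

```python
def crowdforge_article(topic, post_hit):
    """Sketch of CrowdForge's partition-map-reduce pattern for writing an
    article. `post_hit` is an assumed callback that posts a subtask to the
    crowd and returns the result."""
    # Partition: one worker breaks the topic into an outline of sections
    # (post_hit is expected to return a list of section names here).
    outline = post_hit(f"List the main sections for an article about {topic}")

    # Map: each section becomes an independent writing task that several
    # workers could complete in parallel.
    paragraphs = [post_hit(f"Write a paragraph about '{section}' for an "
                           f"article on {topic}") for section in outline]

    # Reduce: another worker merges the paragraphs into a single article.
    return post_hit("Combine these paragraphs into one coherent article:\n"
                    + "\n\n".join(paragraphs))
```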

The framework was tested through several case studies. The first was writing a Wikipedia article about New York City. Surprisingly, the articles produced by groups of workers across HITs were rated, on average, as highly as the Simple English Wikipedia article on New York City and higher than full articles written by individuals as part of a higher-paying HIT. Quality control was tested both through further map and reduce steps that merged results and through voting, and merging was deemed more effective. The second case study involved collating information for researching purchase decisions; the authors do not provide any information about the quality of the resulting data. The last case study dealt with the complex flow of turning an academic paper into a newspaper article for the general public; the paper discusses the steps used to generate news leads (the hook for the article) and a summary of the researchers’ work, as well as the quality of the result.

The CrowdForge approach looked very promising, as exemplified by the case studies. It also had a few disadvantages, such as not supporting iterative flows, assuming that a task can in fact be broken down into single units of work, and possible overlap between the results of a task due to the lack of communication between workers. The authors conclude by encouraging researchers and task designers to consider crowdsourcing for complex tasks and to push the limits of what they can accomplish through this market.

Reflections

The authors have identified an interesting gap in the crowdsourcing market: the ability to get complex tasks completed. Although requesters may have broken their tasks down into HITs in the past and combined the results on their own, CrowdForge’s partition-map-reduce framework seems like it could alleviate that burden and streamline the process, to some extent.

I like the way the partition-map-reduce framework is conceptualized. It seems fairly intuitive and seems to have worked well for the case-studies. I am a little surprised (and maybe skeptical?) that the authors did not include the results of the second case study or more details for the rest of the third case study.

The other aspect I really liked about the paper was the effort to identify and test alternative or creative ways to solve common crowd sourcing problems. For example, the authors came up with the idea of using further map-and-reduce steps in the form of merging as an alternative to voting on solutions. Additionally, they came up with the consolidate and exemplar patterns for the academic paper case study, to alleviate the problems of the high complexity of the paper and the effort workers expected to put in.

The paper mentions in its section on limitations that there are tasks which cannot be decomposed and for which another market with skilled or motivated workers should be considered. This brings me back to the notion that perhaps crowdsourcing in the future will look more like crowdsourcing for a skill set, a kind of skill-based consulting.

In conclusion, I think that the work presented in the paper looks very promising, and it would be quite interesting to see the framework being applied to other use-cases.

Discussion

1. The paper mentions that using further map and reduce steps to increase the quality of the output, as opposed to voting, generated better results. Why do you think that happened?

2. There may be tasks which are too complex to be decomposed, or decomposed tasks which require a particular skill set. Some crowd sourcing platforms accomplish this through having an “Elite Taskforce”. Do you think this is against the principles of crowd sourcing, that is, that a task should ideally be available to every crowd worker or is skill-based crowd sourcing essential?

3. CrowdForge breaks tasks up, whereas TurkIt allowed iterative work-flows and the authors talk about their vision to merge the approaches. What do you think would be some key benefits for such a merged approach?

4. The authors advocate for pushing the envelope when it comes to the kind of tasks which can be crowd sourced. Thoughts?


Beyond the Turk: An empirical comparison of alternative platforms for crowdsourcing online behavioral research

Eyal Peer, Sonam Samat, Laura Brandimarte, and Alessandro Acquisti

Discussion Leader: Divit Singh

Summary

This paper focuses on finding alternatives to MTurk and evaluating their results.  MTurk is considered the front-runner among similar crowdsourcing platforms since it tends to produce high-quality data.  However, since MTurk’s worker growth is starting to stagnate, the workers who use it have become extremely efficient at completing the tasks that are typically published there, because these tasks tend to be very similar (surveys, transcribing, etc.).  Familiarity with tasks has been shown to reduce the effect sizes of research findings (completing the same kind of survey multiple times with the same answers skews the collected data).  For this reason, the paper explores other crowdsourcing platforms and evaluates their performance, results, similarities, and differences in an effort to find a viable alternative for researchers publishing tasks.

To evaluate the performance of these crowdsourcing platforms, the authors tried to run the same survey on all six platforms being tested.  Of these six, only three successfully published the survey: some platforms simply rejected it without a reason, while others either required a considerable amount of money or had errors that prevented the study from exploring them.  Among the platforms on which the survey could be published, the only viable alternative to MTurk turned out to be CrowdFlower.  The surveys included attention-check questions, decision-making questions, and a question that measured the honesty of the worker.  The paper provides an excellent overview of the properties of each platform and includes many tables outlining when one platform may be more effective than another.

Reflection

This paper presents a considerable amount of information about the various platforms it describes.  However, reading through it really revealed the lack of any actual competition to MTurk.  It does show that CrowdFlower is a good alternative for reaching a different population of workers, but CrowdFlower is still considered inferior to MTurk in many respects.  The main reason for using these other platforms is that MTurk workers have become extremely efficient at completing tasks, which may skew results.  I believe it is only a matter of time before workers on these other platforms lose their “naivety” as those platforms mature.

The results of this paper may be invaluable to a researcher who wants to target a particular audience.  For example, the paper reveals that CBDR is managed by CMU and is composed of students and non-students.  Although not guaranteed, it might be the most appealing option for a researcher who wants to target college students, since it may contain a considerable university student population.  Another excellent piece of information is the failure rate of the attention-check questions posted in the survey, which highlights two things: how inattentive workers can be during surveys, and how experienced the workers of MTurk really are (they have most likely seen questions like these in the past, which prevents them from making the same mistake again).  Keep in mind, however, that these results are a snapshot in time.  There is nothing preventing the workers of CrowdFlower (who are apparently disjoint from the workers of MTurk), a massive worker base, from learning from these surveys and becoming smarter workers.

Questions

  1. Is there any other test that you believe that the study missed?
  2. Based on the tables and information provided, how would you rank the different crowdsourcing platforms?  Why?
  3. This paper outlined different approaches used by these platforms (e.g., a review committee that determines whether a survey is valid).  Which method do you agree with, or how would you design your own platform to optimize quality and reduce noise?

 
