Success & Scale in a Data-Producing Organization: The Socio-Technical Evolution of OpenStreetMap in Response to Humanitarian Events

Palen, Leysia, et al. “Success & Scale in a Data-Producing Organization: The Socio-Technical Evolution of OpenStreetMap in Response to Humanitarian Events.” Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015.

Discussion Leader: Ananya

Summary:

OpenStreetMap, often called ‘the Wikipedia of maps’, is a collaborative effort to create a common digital map of the world. This paper analyzes the socio-technical evolution of this large, distributed, volunteer-driven organization by examining its mapping activities during two major disaster events: the 2010 Haiti earthquake and the 2013 Typhoon Yolanda.

The Haiti earthquake was the first major event in which OSM was used extensively for relief efforts. The sudden influx of usage exposed problems that hindered coordination, subsequently giving rise to multiple socio-technical changes within the organization.

The Humanitarian OpenStreetMap Team (HOT), which assists humanitarian organizations with mapping needs, was formalized and registered as a non-profit organization just seven months later.

During the Haiti earthquake, several issues such as mapping conflicts and map duplications arose. To address this congestion, HOT created the OSM Task Manager, which helps mappers coordinate efficiently. An administrator creates a job for a large geographical area, and the Task Manager divides the job into smaller tasks, each marked with one of three states: ‘yellow’ (taken), ‘red’ (awaiting validation) and ‘green’ (completed).
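
To make the coordination model concrete, here is a minimal, hypothetical sketch in Python (the class and method names are my own, not from the paper or the actual Tasking Manager code) of how a job, its tasks, and the three states might be tracked:

```python
from enum import Enum

class TaskState(Enum):
    AVAILABLE = "available"
    YELLOW = "taken"               # a mapper has locked the task
    RED = "awaiting validation"    # mapping submitted, needs review
    GREEN = "completed"            # validated and done

class Job:
    """Toy model of one Task Manager job split into smaller tasks."""
    def __init__(self, name, num_tasks):
        self.name = name
        self.states = {i: TaskState.AVAILABLE for i in range(num_tasks)}
        self.owners = {}

    def take(self, task_id, mapper):
        # only one mapper may hold a task at a time, which prevents collisions
        if self.states[task_id] is not TaskState.AVAILABLE:
            raise ValueError(f"task {task_id} is already {self.states[task_id].value}")
        self.states[task_id] = TaskState.YELLOW
        self.owners[task_id] = mapper

    def submit(self, task_id):
        self.states[task_id] = TaskState.RED    # awaiting validation

    def validate(self, task_id):
        self.states[task_id] = TaskState.GREEN  # completed

# usage
job = Job("building footprints", num_tasks=100)
job.take(7, "mapper_a")
job.submit(7)
job.validate(7)
```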

Other changes included relicensing OSM data under the ODbL (Open Database License). To attract and retain new participants, OSM upgraded the openstreetmap.org website and editing interface, making it easier for newcomers to map. ‘Notes’, a drop-pin annotation feature that lets users point out improper tagging or suggest additional information, was also added.

Unlike Wikipedia, which laid down governance policies and strategies as its community grew, OSM’s governance structure still maintains a non-bureaucratic approach to promote and cope with growth. Furthermore, to counter low diversity among contributors, LearnOSM materials were translated into nine languages.

Typhoon Yolanda, another disaster on the scale of the Haiti earthquake, struck nearly four years later and tested OSM’s organizing efforts. The response was significant, with a threefold increase in the number of contributors. The now well-established HOT coordinated volunteers using email lists and chatrooms, and the Task Manager was widely used by mappers, which helped prevent mapping collisions.

However, since all jobs are placed in a single instance of the Task Manager, significant traffic congestion is possible as the mapping population grows. OSM is considering multiple solutions to mitigate this problem. It has also developed a range of socio-technical interventions aimed at creating a supportive environment for new contributors while managing community growth. This is in stark contrast to Wikipedia’s policy-driven strategies.

 
Reflection:

This paper presents a good description of how a community matured between two major events that shaped its growth. I really appreciate how the authors tracked changes within the OSM community after the Haiti earthquake and analyzed the effects of those changes during another major event (Typhoon Yolanda) nearly four years later. The fact that one of the authors is a founding member of HOT certainly helped.

However, I am skeptical about the comparisons made between Wikipedia’s and OSM’s ways of operating because, despite many commonalities, they work with very distinct types of data. The non-bureaucratic collaborative environment that OSM maintains may not work for Wikipedia, which has to deal with a completely different set of challenges associated with creative content, such as plagiarism and editorial disputes.

One of the problems the authors mention Wikipedia faces is with respect to diversity, which the OSM community has made notable efforts to alleviate. Still, the gender disparity that plagued Wikipedia was prevalent in OpenStreetMap: studies from 2011 showed about 7% of Wikipedia contributors were women, while in OSM the figure was even lower at only 3%. I wish the authors had said more about the new email list that OSM launched in 2013 to promote the inclusion of more women and how effective this step was in motivating a currently inactive group.

Although not extensively used, the Notes feature did show some potential for both new and experienced users. However, the authors conjectured that guests may use it for noting transient information such as a ‘temporary disaster shelter’. I wonder why this is an issue. In case of a disaster, many important landmarks such as a makeshift emergency shelter or a food distribution center will be temporary and yet still be part of a relief team’s data needs. Of course, an additional step would be needed to update the map once these temporary landmarks are gone.

Overall, this paper provides a good understanding of some of the features of OSM’s management techniques and is also one of the first papers to study OpenStreetMap in such depth.

 
Questions:
– Do you think the comparison made in the paper between Wikipedia and OSM about their governance strategy is fair? Will OSM’s collaborative governance style work for Wikipedia?
– How can gender imbalance or other diversity issues be resolved in a voluntary crowdsourcing environment?
– As the authors mention, guests can see Notes as an opportunity for personalization. How do you think OSM can reduce noise in notes? Can the Task Manager work here and label each note as a yellow, red or green task?
– I think a volunteer-driven platform like OSM is particularly useful in a disaster situation when the landscape is changing rapidly. Do you feel the same? Can you think of any other volunteer-driven application that would help with situational awareness in real time?

Read More

CommunitySourcing: engaging local crowds to perform expert work via physical kiosks

Heimerl, Kurtis, et al. “CommunitySourcing: engaging local crowds to perform expert work via physical kiosks.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2012.

Discussion Leader: Shiwani

Summary:

In this paper, the authors introduce a new mechanism, called community-sourcing, which is intended to facilitate crowdsourcing when domain experts are required. Community-sourcing differs from other platforms in that it places physical kiosks in locations likely to attract the right people and aims to engage those people when they have surplus time (e.g., while they are waiting).

The authors defined three cornerstones for the design of a community sourcing system, viz. task selection, location selection and reward selection.

To evaluate the concept, the authors created a system called Umati. Umati consisted of a vending machine interfaced with a touch screen. Users could earn “vending credit” by completing tasks on the touch screen, and once they had earned enough credit, they could exchange it for items from the vending machine. Although Umati was programmed to accept a number of different tasks, the authors selected the tasks of exam grading and filling out a survey for their evaluation (task selection). Prior research suggests that redundant peer grades correlate strongly with expert scores, which made grading an interesting task to choose, while the survey task helped the authors capture demographic information about the users. Umati was placed in front of the major lecture hall in the Computer Science building, which mainly supported computer science classes (location selection). The authors chose snacks (candies) as the reward, as food is a commonly used incentive for students to participate in campus events (reward selection).

For their evaluation, the authors generated 105 sample answers to 13 questions taken from prior mid-term exams for the second-semester undergraduate course CS2. These answers were then graded on two systems, Umati and Amazon Mechanical Turk, as well as by experts. Spam detection was implemented by adding gold standard questions.
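
As an illustration of this kind of quality control (my own sketch, not the authors’ implementation), a gold-standard filter that drops graders who fail more than one gold question might look like this:

```python
GOLD_ANSWERS = {          # hypothetical gold-standard question IDs and correct grades
    "gold_1": 3,
    "gold_2": 0,
}
FAILURES_ALLOWED = 1      # blacklist after failing more than one gold question

def filter_graders(responses):
    """responses: {grader_id: {question_id: grade}} -> set of trusted grader ids."""
    trusted = set()
    for grader, grades in responses.items():
        failures = sum(
            1 for q, correct in GOLD_ANSWERS.items()
            if q in grades and grades[q] != correct
        )
        if failures <= FAILURES_ALLOWED:
            trusted.add(grader)
    return trusted
```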

The results showed a strong correlation between Umati evaluations and expert evaluations, whereas Amazon Mechanical Turk evaluations differed greatly. Additionally, and more interestingly, Umati was able to grade exams more accurately or at a lower cost than traditional single-expert grading. The authors mention several limitations of their study, such as its duration, privacy concerns, and the restriction to a particular domain. Overall, Umati looks promising but requires more evaluation.

 

Reflection:

The title of the paper gives the perfect summary of what community-sourcing is about, viz. local crowds, expert work, and physical kiosks. It is a novel and pretty interesting approach to crowd-sourcing.

I really like the case the authors make for this new approach. They talk about the limitations and challenges of accessing experts. For example, some successful domain-driven platforms access experts by running competitions, which is not the best approach for high-volume work. Others seek to identify the “best answer” (StackOverflow), which is not great for use cases such as grading. Secondly, there are many natural locations with cognitive surplus (e.g. academic buildings, airport lounges), and the individuals in these locations can serve as “local experts” under certain conditions. Thirdly, having a physical kiosk as a reward system, and thereby giving out tangible rewards, seems like a great idea.

I also like that the authors situate community sourcing very well and state where it would be applicable, that is, specifically for “short-duration, high volume tasks that require specialized knowledge or skills which are specific to a community but widely available within the community”. This is perhaps a very niche aspect, but an interesting niche and the authors gave some examples of where they could see this being applied (grading, tech support triage, market research, etc).

The design for Umati, the evaluation system, was quite thorough and clearly based on the three chief design considerations put forward by the authors (location, reward, task). Every aspect seems to have been thought through and reviewed. An example is the fact that the survey task was worth more credit (5 credits) than the grading tasks (1 credit each), which I assume was to encourage users to provide their demographic information.

The spam detection approach was the use of gold standard questions and the exclusion of participants who failed more than one such question. Interestingly, while for Umati this meant that the user was blacklisted (based on ID), the data up to the point of blacklisting was still used in the analysis. For AMT, on the other hand, two sets of data were presented: one including all responses and one filtered using the spam detection criteria.

Another interesting point is that about 19% of users were blacklisted. The authors explain that this happened in some cases because some users forgot to log out, and in some cases, because users were merely exploring and did not realize that there would be repercussions. I wonder if the authors performed any pilot tests to catch this?

The paper presents a few more interesting ideas such as the double-edged effects of group participation, as well as the possibility that the results may not be generalizable due to the nature of the study (specific domain and tasks). I did not find any further work performed by the authors to extend this, which was a little unfortunate. There was some work related to community sourcing, but along very different lines.

Last, but not least, Umati had a hidden benefit for the users: grading tasks could potentially improve their understanding of the material, especially when the tasks were performed through discussions as a group. This opens up the potential for instructors to encourage their students to participate, perhaps in exchange for some class credit.

Discussion:

  1. The authors decided to include the data up to the failing of the second gold standard question for Umati users. Why do you think they chose to do that?
  2. Do you think community sourcing would be as effective if it was more volunteering-based, or if the reward was less tangible (virtual points, raffle ticket, etc)?
  3. 80% of the users had never participated in a crowdsourcing platform. Could this have been a reason for its popularity? Do you think interest may go down over a period of time?
  4. The paper mentions issues such as the vending machine running out of snacks, and people getting blacklisted because they did not realize there would be repercussions. Do you think having some controlled pilot tests would have remediated these issues?
  5. None of the AMT workers passed the CS qualification exam (5 MCQs on computational complexity and Java), but only 35% failed the spam detection. The pay difference between the normal HIT and the HIT with the qualification exam was $0.02. Do you think the financial incentive was not enough, or was the gold standard not as effective?
  6. The authors mentioned an alternative possibility of separating the work and reward interfaces, in order to scale the interface both in terms of users and tasks. Do you think this would still be effective?

Read More

Bringing semantics into focus using visual abstraction

Zitnick, C. Lawrence, and Devi Parikh. “Bringing semantics into focus using visual abstraction.” Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.

Discussion Leader: Nai-Ching Wang

Summary

To solve the problem of relating visual information to linguistic semantics of an image, the paper proposes studying abstract images instead of real images to avoid the complexity and low-level noise in real images. Using abstract images makes it possible to generate and reproduce the same or similar images depending on the needs of the study, which is nearly impossible with real images. This paper demonstrates this strength by recruiting different crowd users on Amazon Mechanical Turk to 1) create 1002 abstract images, 2) describe the created abstract images, and 3) generate 10 images (from different crowd users) for each description. With this process, images with similar linguistic semantic meaning are produced because they are created from the same description. Because the parameters used to create the abstract images are known (or can be detected easily), the paper is able to determine the semantic importance of visual features derived from occurrence, person attributes, co-occurrence, spatial location, and depth ordering of the objects in the images. The results also show that the proposed features have better recall than low-level image features such as GIST and SPM. The paper also shows that visual features are highly related to the text used to describe the images.
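
As a rough illustration of what “known parameters” buy you (my own sketch, not code from the paper), extracting occurrence, co-occurrence, spatial location, and depth features from a synthetic clip-art scene is essentially bookkeeping:

```python
from itertools import combinations

def scene_features(scene):
    """scene: list of dicts like {"name": "dog", "x": 120, "y": 300, "depth": 1}."""
    features = {}
    names = [obj["name"] for obj in scene]
    # occurrence: which clip-art objects appear in the scene
    for name in names:
        features[("occurs", name)] = 1
    # co-occurrence: unordered pairs of objects appearing together
    for a, b in combinations(sorted(set(names)), 2):
        features[("co-occurs", a, b)] = 1
    # spatial location and depth, known exactly because the scene is synthetic
    for obj in scene:
        features[("location", obj["name"])] = (obj["x"], obj["y"])
        features[("depth", obj["name"])] = obj["depth"]
    return features

# usage
scene = [{"name": "boy", "x": 80, "y": 250, "depth": 0},
         {"name": "dog", "x": 200, "y": 260, "depth": 1}]
print(scene_features(scene))
```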

Reflections

Even though crowdsourcing is not the main focus of the paper, it is very interesting to see how crowdsourcing can be used and be helpful in other research fields. I really like the idea of generating different images with similar linguistic semantic meaning to find important features that determine the similarity of linguistic semantic meaning. It might be interesting to see the opposite way of study, that is, generating different descriptions with similar/same images.

For the crowdsourcing part, quality control is not discussed in the paper, probably because it is not the focus, but it would be surprising if there was no quality control of the results from crowd workers during the study. As we discussed in class, maximizing compensation within a certain amount of time is an important goal for workers in crowdsourcing markets such as Amazon Mechanical Turk, and one can imagine workers pursuing that goal by submitting very short descriptions or randomly placed clip art. In addition, if multiple descriptions are required for one image, how is the final description selected?

I can also see other crowdsourcing topics related to the study in the paper. It would be interesting to see how different workflows might affect the results: for example, asking the same crowd worker to do all three stages vs. different crowd workers for different stages vs. different crowd workers working collaboratively. With such a setting, we might be able to find individual differences and/or social consensus in linguistic semantic meaning. The task in Section 6 seems somewhat similar to the ESP Game, with the words constrained to certain types based on the needs of the research.

Overall, I think this paper is a very good example of how we can leverage human computation along with algorithmic computation to understand human cognition.

Questions

  • Do you think in real images, the reader of the images will be distracted by other complex features such that the importance of some features will decrease?
  • As for the workflow, what are the strengths and drawbacks of using same crowd users to do all the 3 stages vs. using different crowd users for different stages?
  • How do you do the quality control of the produced images with descriptions? For example, how do you make sure the description is legitimate for the given image?
  • If we want to turn the crowdsourcing part into a game, how will you do it?

Read More

VQA: Visual Question Answering

Paper: S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual Question Answering,” arXiv preprint arXiv:1505.00468, 2015.

Summary

This paper presents a dataset to use for evaluating AI algorithms involving computer vision, natural language processing, and knowledge representation and reasoning. Many tasks that we once thought would be difficult for an AI algorithm to perform, such as image labeling, have now become fairly commonplace. Thus, the authors hope to help push the boundaries of AI algorithms by providing a data set for a more complex problem that combines multiple disciplines, which they name Visual Question Answering. Others can then use this dataset to test their algorithms that try to solve this highly complex puzzle.

The authors use around 200,000 images from the Microsoft Common Objects in Context data set, plus 50,000 abstract scene images created by the authors. For each of these images, they collected three open-ended questions from crowdworkers, and for each question they collected ten answers from unique crowdworkers. An accuracy measure was then applied to each answer: if at least three humans gave an identical answer, that answer was deemed 100% accurate. This concluded the data collection for the data set, but the authors then used crowdworkers to evaluate the complexity of the questions received.
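
The accuracy measure described above corresponds to the consensus formula commonly associated with the VQA dataset; a small reconstruction of that computation (mine, not the authors’ code) could look like this:

```python
def vqa_accuracy(candidate, human_answers):
    """Accuracy of one candidate answer against the ten human answers.

    min(#matches / 3, 1): an answer agreed on by at least three humans
    counts as 100% accurate; fewer matches give partial credit.
    """
    matches = sum(1 for a in human_answers if a == candidate)
    return min(matches / 3.0, 1.0)

# usage
humans = ["red", "red", "red", "dark red", "red", "maroon",
          "red", "red", "red", "red"]
print(vqa_accuracy("red", humans))      # 1.0
print(vqa_accuracy("maroon", humans))   # ~0.33
```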

Three criteria were examined to justify the complexity of these questions: whether the image is necessary to answer the question, whether the question requires any common sense knowledge not available in the image, and how well the question can be answered using the captions alone rather than the actual image. The supporting studies successfully showed that the questions generated generally required the image to answer, that a fair number required some form of common sense, and that the questions were answered significantly better with access to the image than with access to the captions alone. Finally, the authors used various algorithms to test their effectiveness against this data set, and found that current algorithms still significantly underperform compared to humans. This means that the data set can successfully test the abilities of a new, complex set of AI algorithms.

Reflection

While the purpose of this paper is focused on artificial intelligence algorithms, a large portion of it involves crowd work. It is not specifically mentioned in the body of the paper, but from the description and from the acknowledgements and figure descriptions you can tell that the question and answer data was collected on Amazon Mechanical Turk. This isn’t surprising given the vast amount of data they collected (nearly 10 million question answers). It would be interesting to learn more about how the tasks were set up and how workers were compensated, but the crowdsourcing aspects are not the focus of the paper.

One part of the paper that I thought was most relevant to our studies of crowd work was the discussion of how to get the best complex, open-ended questions relating to the pictures. The authors used three different prompts to try to get the best questions out of the crowdworkers: ask a question that either a toddler, an alien, or a smart robot would not understand. I thought it was very interesting that the smart robot prompt produced the best questions. This prompt is actually fairly close to reality, as the smart robot could just be considered modern AI algorithms. Good questions are ones that can stump these algorithms, or the smart robot.

I was surprised that the authors chose to go with exact text matches for all of their metrics, especially given the discussion regarding my project last week with the image comparison labeling. The paper mentions a couple reasons for this, such as not wanting things like “left” and “right” to be grouped together, and because current algorithms don’t do a good enough job of synonym matching for this type of task. It would be interesting to see if the results might differ at all if synonym matching were used. The exact matching was used in all scenarios, however, so adding in synonym matching would theoretically not change the relative results.

That being said, this was a very interesting article that aimed to find human tasks that computers still have difficulty dealing with. Every year that passes, this set of tasks gets smaller and smaller. And this paper is actually trying to help this set get smaller more quickly, by helping test new AI algorithms for effectiveness. The workers may not know it, but for the tasks in this paper they were actually working toward making their own job obsolete.

Questions

  • How would you have set up these question and answer gathering tasks, regarding the number that each worker performs per HIT? How do you find the right number of tasks per HIT before the worker should just finish the HIT and accept another one?
  • Is it just a coincidence that the “smart robot” prompt performed the best, or do you think there’s a reason that the closest to the truth version produced the best results (are crowdworkers smart enough to understand what questions are difficult for AI)?
  • What do you think about the decision to use exact text matching (after some text cleaning) instead of any kind of synonym matching?
  • How much longer are humans going to be able to come up with questions that are more difficult for computers to answer?

Read More

Labeling Images with a Computer Game

Paper: Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’04). ACM, New York, NY, USA, 319-326. DOI=http://dx.doi.org/10.1145/985692.985733

Discussion Leader: Adam Binford

Summary

The authors of this paper try to tackle the issue of image labeling. Image labeling has many purposes, such as enabling better text-based image search and creating training data for machine learning algorithms. The typical approaches to image labeling at the time were computer vision algorithms or manual labor. This was before crowdsourcing really took off, and before Amazon Mechanical Turk had even launched, so the labor required to produce these labels was likely expensive and hard to obtain quickly.

The paper presents a new way to obtain image labels, through a game called The ESP Game. The idea behind the game is that image labels can be obtained from players who don’t realize they’re actually providing this data; they just find the game fun and want to play. The game works by matching up two players and showing them a common image. Players are told to try to figure out what word the other player is thinking of; they are not told anything about trying to describe the image presented to them. Each player then enters words until a match is found between the two players. Players also have the option to vote to skip an image if it is too difficult to come up with a word for.

The game also includes the idea of taboo words, which are words players cannot use as guesses for an image. These words come from previous rounds of the game using the same image, so that multiple labels get generated for each image instead of the same obvious one over and over again. When an image starts to get skipped frequently, it is removed from the pool of possible images. The authors estimate that with 5,000 players playing the game constantly, all of the roughly 425,000,000 images indexed by Google could be labeled in about a month, and each image would reach their threshold of six labels within six months.
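
A minimal sketch of the agreement-plus-taboo mechanic (my own reconstruction, with hypothetical names) might look like:

```python
def check_guess(word, partner_guesses, taboo_words):
    """Return True when the guess produces a match.

    A guess only counts if it is not a taboo word for this image and
    the partner has already entered the same word.
    """
    word = word.strip().lower()
    if word in taboo_words:
        return False            # taboo words cannot become labels again
    return word in partner_guesses

# usage: "dog" is already taboo from earlier rounds on this image
taboo = {"dog"}
partner = {"puppy", "grass"}
print(check_guess("dog", partner, taboo))    # False, taboo
print(check_guess("puppy", partner, taboo))  # True, new label agreed on
```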

The authors were able to show that their game was indeed fun and that quality labels were generated by its players. They supported the level of fun of their game with usage statistics, indicating that over 80% of the 13,670 players who played the game played it on multiple days. Additionally, 33 people played for more than 50 hours total. These statistics indicate that the game provides sufficient enjoyment for players to keep coming back to play.

Reflection

This paper is one of the first we’ve read this year that looks at alternatives to paid crowd work. What’s even more impressive is that this paper was published before crowdsourcing really became a thing, and a year before Amazon Mechanical Turk was even launched. The ESP Game really started this idea of gamification of meaningful work, which many people have tried to emulate since, including Google which basically made their own version of this game. While not specifically mentioned by the authors, I imagine few of the players, if any, of this game knew that it was intended to be used for image labeling. This means the players truly just played it for fun, and not for any other reason.

Crowdsourcing through games has many advantages over what we would consider traditional crowdsourcing through AMT. First, and most obviously, it provides free labor. You attract workers through the fun of the game, not through any monetary incentive. This provides additional benefits. With paid work, you have to worry about workers trying to perform the least amount of work for the most amount of money, and this can result in poor quality. With a game, there is less incentive to try to game the system, albeit still some. With paid work, there isn’t much satisfaction lost by cheating your way to more money. But with a game, it will be much less satisfying to get a high score by cheating or gaming the system than it would be to legitimately earn a high score, at least for most people. And the authors of this paper found some good ways to combat possible cheating or collusion between players. While they discussed their strategies for this, it would be interesting to hear whether they had to use them at all and how rampant, if at all, cheating became in the game.

Obviously the issue with this approach is making your game fun. The authors were able to achieve this, but not every task that can benefit from crowdsourcing can easily be turned into a fun game. Image labeling just happens to have many possible ways of being turned into an interesting game. All of the Metadata games linked to on the course schedule involve image (or audio) labeling. And they don’t hide the true purpose of the work nearly as well: the game descriptions specifically mention tagging images, unlike The ESP Game, which said nothing about describing the images presented. The fact that Mechanical Turk has become so popular, with so many kinds of tasks available on it, goes to show how difficult it is to turn these problems into an interesting game.

I do wonder how useful this game would be today. One of the things mentioned several times by the authors is that with 5,000 people playing the game constantly, they could label all images indexed by Google within a month. But that was back in 2004, when they said there were about 425,000,000 images indexed by Google. In the past 10 years, the internet has been expanding at an incredible scale. I was unable to find any specific numbers on images, but Google has indexed over 40 billion web pages. I would imagine the number of images indexed by Google could be nearly as high. This leads to some questions…

Questions

  • Do you think the ESP Game would be as useful today as it was 11 years ago, with respect to the number of images on the internet? What about with respect to the improvements in computer vision?
  • What might be some other benefits of game crowd work over paid crowd work that I didn’t mention? Are there any possible downsides?
  • Can you think of any other kinds of work that might be gamifiable other than the Fold It style games? Do you know of any examples?
  • Do you think it’s ok to hide the fact that your game is providing useful data for someone, or should the game have to disclose that fact up front?

Read More

Algorithm discovery by protein folding game players

Authors: Firas Khatib, Seth Cooper, Michael D. Tyka, Kefan Xu, Ilya Makedon, Zoran Popović, David Baker

Discussion Leader: Divit Singh

Crowdsourcing Example: http://weathersignal.com

Summary

Foldit is an online puzzle video game. It presents a platform on which multiple players can collaborate and compete on various tasks such as protein folding. It utilizes citizen science: leveraging natural human abilities for scientific purposes. Foldit provides players with a palette of interactive tools and manipulations to aid them in structuring the protein presented to them. In addition, Foldit gives players the ability to create their own “recipes” for manipulating proteins. These recipes are sets of instructions and gameplay macros that enable the players using them to automatically manipulate the proteins presented to them. User-friendly algorithms from the Rosetta structure prediction methodology were provided as well to aid players in interacting with structures. From observing how players utilized these algorithms, it became apparent that players used them to augment rather than substitute for human strategizing. There was no single algorithm that everyone employed; at different stages of interaction, players would use multiple recipes to build their structures, which in turn led to more recipes being created.

During the time of the study, researchers created the “Fast Relax” algorithm, which achieved better results in less time. However, an algorithm was also developed by the Foldit players during this period: “Blue Fuse”. The two algorithms turned out to be very similar despite being developed completely independently. On testing them side by side, it was revealed that Blue Fuse is more effective than Fast Relax (within Foldit) on the time scales best suited to game play. This algorithm was discovered solely by Foldit players.

Reflection

This paper is about a popular crowdsourcing framework used in the bioinformatics field. It presents a unique way to utilize the brainpower of the general masses to create new and efficient algorithms by introducing a gaming aspect to protein folding. I really liked how they allowed players to build their algorithms/simulations by employing the concept of “recipes”. I believe this was a crucial feature that allowed players to build off someone else’s work rather than starting from scratch and either making their own small contribution or replicating someone else’s work. They present a clear UI with a helpful suite of tools for manipulating the structure as well. In addition, I found videos on YouTube as well as abundant information on their website that really emphasize the purpose of this software.

Figures 3 and 4 really emphasized the power of citizen science, as they show the social evolution of Foldit recipes. New recipes are essentially built on top of each other in hopes of gaining marginal efficiency with each iteration. Instead of using machine learning to approximate these recipes and simulations, real human creations were used to develop the algorithms. The fact that one of these recipes resembled an algorithm produced by researchers specifically focused on efficiency shows the power of human computation. As it stands, machine learning can only take us so far, especially in visual tasks such as these.

Questions

1. What are your opinions on gamifying problem solving/reasoning tasks such as this to attract a crowd? Do you think it takes away from the task at hand by attracting a crowd that may be too young or too old for its purpose? If so, how would you leverage gamification or other techniques to try to attract the specified target audience?

2. Assume there was no “energy” function with which to rate recipes. Based on visual aesthetics alone, how would you create a metric to measure how “clean” or “well-produced” a certain recipe is?

3. Would you rather have recipes be built on top of each other, or have individuals try to create their own from scratch? If you want them to be built on top of each other, does it not “tunnel-vision” subsequent creators?

    Read More

    Shepherding the crowd yields better work

    Paper: Steven Dow, Anand Kulkarni, Scott Klemmer, and Björn Hartmann. 2012. Shepherding the crowd yields better work. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW ’12). ACM, New York, NY, USA, 1013-1022. DOI=http://dx.doi.org/10.1145/2145204.2145355

     

    Discussion Leader: Anamary Leal

     

    Summary

    The research goal of this work is: How can crowdsourcing support learning? How do we get better, multi-faceted, creative, complex work from unskilled crowds?

    Their solution is shepherding: providing meaningful real-time feedback, with the worker iterating on their work in response to an assessor’s comments.

    For the task of writing a product review, a worker would write the review, and the shepherd would get a notification of the submission. Then, in real time, the shepherd gave structured feedback based on a rating of the quality, a checklist of things to cover in the review, and an open-ended question. The worker then had a chance to improve the work.

    The authors conducted a comparative study structured around product reviews with a control condition (no feedback), self-assessment, and this external assessment (shepherding). They measured task performance, learning, and perseverance (number of revisions and string edit distance). The workers and the shepherd were all from the crowd, though the shepherd was a single reviewer recruited from oDesk. Self-assessment did slightly better than shepherding, and both feedback conditions did significantly better than the no-feedback condition. Shepherding resulted in more worker edits, along with more and better revisions.
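
    Since perseverance was measured partly by string edit distance between drafts, here is a small illustrative implementation of that measure (a standard Levenshtein distance; my own sketch, not the authors’ code):

    ```python
    def edit_distance(a, b):
        """Levenshtein distance: minimum insertions, deletions, and substitutions
        needed to turn string a into string b; a rough proxy for how much a
        worker revised a draft."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    # usage: compare a first draft to its revision
    print(edit_distance("good product", "a good, sturdy product"))
    ```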

     

    Reflection

    This paper is a great stepping stone for more questions to explore. In addition to the attrition question and different learning strategies, I wonder how well these crowds learn in the longer term. Short bursts of learning are one thing (like cramming), but I wonder whether those same workers, through more feedback, get better at writing reviews than others. How well do these lessons stick? The role of feedback can help bring about the dream of crowdsourcing work we would want our kids to do.

    Another stepping stone is to measure with respect to iterations, even if only in the short term. How much is gained if the worker gets two or more iterations with the assessor, or even with self-assessment?

    Feedback, especially external feedback, helped motivate workers to do more and better work. I’m not well versed in educational research, but engagement in teaching the material and assessment are quite important.

    The authors took care to mention the attrition rate and what it was composed of. I wonder what can be done about that population. Most of the attrition came from workers dropping out early, but a decent portion was due to plagiarism. I wonder what those participants saw in the task that discouraged them from completing it.

    The external condition probably would not have been as successful if the feedback form had not been appropriately structured, with a helpful checklist of items to cover. I can imagine that a ton of design work went into that form to guide shepherds to provide prompt, constructive feedback that the worker could act upon.

    In their studies, it looks like workers cannot substitute for expert shepherds and provide quality feedback. But I wonder if that, too, can be learned? It’s harder to teach something than to just be good at it.

    Discussion

    1. How do we get this non-trivial group who dropped out to participate? Feedback encouraged more work, so in a general sense, would more feedback lure them in?
    2. If the task turned to assessing reviews rather than writing reviews, which one do you think would require more iterations to get better at? Which would be easier to learn: writing, or critiquing others?
    3. How much feedback do you think is needed for an unskilled worker to get better at these creative, multi-faceted, complex tasks? Are there some examples, like writing a story, which may need more cycles of writing and review to be better at it?
    4. How do you see (or not see) real-time external assessment fit into your projects, and what do you think the gains would be, after reading this paper?

    Read More

    Structuring, Aggregating, and Evaluating Crowdsourced Design Critique

    K. Luther, J.-L. Tolentino, W. Wu, A. Pavel, B. P. Bailey, M. Agrawala, B. Hartmann, and S. P. Dow, “Structuring, Aggregating, and Evaluating Crowdsourced Design Critique,” in Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, New York, NY, USA, 2015, pp. 473–485.

    Discussion leader: Will Ellis

    Summary

    In this paper, Luther et al. describe CrowdCrit, a system for eliciting and aggregating design feedback from crowdworkers. The authors motivate their work by explaining the peer and instructor feedback process that is often employed in a design classroom setting. By exposing their designs to high-quality constructive criticism, designers can improve both their designs and their craft. However, the authors point out that it is difficult to find such essential feedback outside the classroom. Online design communities may provide some critique, but it is often too little and too shallow. To solve this problem, the authors built CrowdCrit and tested it in three studies. Their studies attempted to answer the questions: How similar are crowd critiques to expert critiques? How do designers react to crowd critiques? And how does crowd critique impact the design process and results?

    In Study 1, the authors had a group of 14 CrowdCrit-sourced workers and a group of 3 design experts evaluate 3 poster designs using the CrowdCrit interface. They found that while individual crowdworkers’ critiques matched the experts’ poorly, the crowd in aggregate matched 45% to 60% of the experts’ design critiques. The results suggest that adding even more workers would produce results that match the experts’ more closely.

    In Study 2, the authors tested designers’ reactions to crowd critiques by way of a poster contest for a real concert. Designers spent an hour designing a poster according to client criteria. Both crowdworkers and the client then generated feedback using CrowdCrit. Designers then had the chance to review feedback and make changes to their designs. Finally, the client chose a poster design winner. In interviewing the designers, the authors found that they felt the majority of the crowd critiques were helpful and that they appreciated a lot of the features of the CrowdCrit system including abundant feedback, decision affirmation, scaffolded responses, and anonymity.

    In Study 3, the authors evaluated the impact of crowd critique on the design process using another design contest, this time hosted on 99designs.com. After the initial design stage, half of the design participants were given crowd feedback through CrowdCrit, and the other half were given generic, unhelpful feedback. The final designs were evaluated by both the client and a group of crowdworkers meeting a certain design expertise threshold. While the designers appreciated the crowd feedback more than the generic feedback, results showed no significant differences between the quality of the treatment and control groups.

    The authors conclude with implications for their work. They feel that crowd feedback may make designers feel as though they are making major revisions when in fact they’re only making minor improvements. Indeed, the nature of CrowdCrit seems to ensure that designers will receive large lists of small changes that do not cause them to make substantive design changes but, if implemented, contribute to busier, less simple designs.

    Reflection

    CrowdCrit is implemented on top of Amazon Mechanical Turk and, thus, has the benefit of being able to pull feedback from a lot of design novices. This paper makes the case that such feedback, in aggregate, can approximate the feedback of design experts. I am very concerned with the amount of noise introduced in the aggregation approach discussed in Study 1. Yes, with enough crowdworkers, you will eventually have enough people clicking enough critique checkboxes that all of the ones that an expert selected will also be selected by crowdworkers. However, if we assume that the critiques an expert would have made are the most salient, the designer would be unable to separate the salient from the inconsequential. I would hope that the most-selected critiques made by an army of crowdworkers would better approximate those of an actual expert, but the authors do not explore this strategy. I would also explore a weighting system that favors critiques from CrowdCrit’s design-experienced crowdworkers, not just by coloring them more boldly, but also by hiding novice critiques that have low replication.
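
    As a rough sketch of that idea (purely hypothetical, not part of CrowdCrit), one could keep only critiques selected by enough workers, optionally weighting design-experienced workers more heavily:

    ```python
    from collections import Counter

    def aggregate_critiques(selections, weights=None, threshold=3.0):
        """selections: {worker_id: set of critique ids selected}.
        weights: optional {worker_id: weight}, e.g. >1 for design-experienced workers.
        Returns critique ids whose (weighted) support meets the threshold."""
        support = Counter()
        for worker, critiques in selections.items():
            w = weights.get(worker, 1.0) if weights else 1.0
            for c in critiques:
                support[c] += w
        return {c for c, s in support.items() if s >= threshold}

    # usage
    selections = {"w1": {"low_contrast", "crowded_layout"},
                  "w2": {"low_contrast"},
                  "w3": {"low_contrast", "off_brand_font"}}
    print(aggregate_critiques(selections))  # {'low_contrast'}
    ```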

    I am impressed by the effort and technique employed by the authors to distill their seven design principles, which they came to by analyzing important works in design teaching. I think the scaffolding approach to teaching design to crowdworkers was novel, and I appreciated the explanation of the pilot studies they performed to arrive at their strategy. I wonder if those who would use a system like CrowdCrit, the designers themselves, would not benefit from participating as workers in the system. Much like a design classroom, they could benefit from the scaffolded learning and application of design principles, which they may only know in part.

    In Study 3, I’m sure the authors were disappointed to find no statistically significant improvement in design outcomes using crowd feedback. However, I think the most important goal of peer and expert critique, at least in the classroom, is not to improve the design, but to improve the designer. With that in mind, it would be interesting to see a longitudinal study evaluating the output of designers who use CrowdCrit over a significant period of time.

    Questions

    • Study 1 shows adding more workers produces more data, but also more “false positives”. Authors conjecture that these may not be false positives, but could in fact be critiques that the experts missed. Are the authors correct, or is this just more noise? Is the designer impaired by so many extra critiques?
    • CrowdCrit is designed to work with any kind of crowd, not just the Mechanical Turk community. Based on other papers we’ve read, how could we restructure CrowdCrit to fit within a community of practice like graphic design?
    • Study 3 seems to show that for a single design, critique does not improve a design so much as simple iteration. Is feedback actually an important part of the design process? If so, how do we know? If we accept that feedback is an important part of the design process, how might we design a study that evaluates CrowdCrit’s contribution?
    • The results of Study 2 show a lot of positive feedback from designers for CrowdCrit’s features and interface. Implied in the designers’ comments is their enthusiasm for mediated engagement with clients and users (crowdworker stand-ins in this case) over their designs. What are CrowdCrit’s most important contributions in this regard?

    Something Cool

    Moneyball, but for Mario—the data behind Super Mario Maker popularity

    Read More

    Ensemble: Exploring Complementary Strengths of Leaders and Crowds in Creative Collaboration

    Kim, Joy, Justin Cheng, and Michael S. Bernstein. “Ensemble: exploring complementary strengths of leaders and crowds in creative collaboration.” Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. ACM, 2014.

    Discussion Leader : Ananya

    Summary:

    Ensemble is a collaborative story writing platform where a leader maintains a high-level vision of the story and articulates creative constraints, while contributors contribute new ideas, comment on, or upvote existing ones.

    Scenes are the basic collaborative unit of each story. A scene may correspond to a turning point in the story that reveals character development, new information, and a goal for the next scene. The lead author creates a scene with a prompt and a short description that suggests what problem the lead author wants to solve in the scene. The scene directs contributors towards specific sections that the author has chosen to be completed.

    The contributors can participate via drafts, comments or votes. They can communicate with the author or discuss specific scenes using scene comments. Each scene might have multiple drafts from different contributors. The lead author maintains creative control by choosing a winning draft for each scene. He can optionally appoint a moderator to edit drafts. He can directly add the winning draft to the original story, or take inspiration from the contributions and write his own.

    The authors evaluated their platform by running a short story writing competition on it, monitoring participant activity during the competition, and conducting interviews with seven users. The results suggested that the lead authors spent a significant amount of time revising drafts, the moderators spent their time mainly editing drafts created by the lead author, and the contributors spent somewhat more of their effort on comments.

     

    Reflection:

    The idea presented in this paper is not new. Several TV series have incorporated similar techniques for many years, where the series creator defines the story outline and each episode is written by a different member of the team. To me, the novel contribution of this paper was using this concept to create an online platform for creative collaboration among people who may not know each other. In fact, one of the results analyzed in the paper was whether lead authors knew contributors previously: 4 out of 20 stories were written by teams made up of strangers. Although out of the scope of this paper, I would still like to know how these 4 stories performed qualitatively in comparison to the stories written by teams of friends.

    The authors mention that 55 Ensemble stories were started but only 20 of them were submitted as entries. Again, some analysis of why more than 50% of the stories were not completed would have been good. Team sizes for the submitted stories ranged from 1 to 7 people. Compared to any crowdsourcing platform these numbers are minuscule, which makes me wonder: can this platform successfully cater to a larger user base where hundreds of people collaborate to write a story (the authors also raise this question in the paper), as we see in crowdsourced videos these days?

    It would be interesting to see how this approach compares to traditional story writing methods, how quality varies when multiple people from different parts of the world collaborate to write a story, how their diverse backgrounds affect the flow of the story, and how lead authors maneuver through all this variety to create the perfect story.

    In the end, I feel Ensemble in its current stage is not a platform where a crowd collaborates to write a story, but rather one where the crowd collaborates to improve someone else’s story.

     

    Questions:

    • In this paper, the authors advertised the competition on several writing forums. Would this strategy work on a more generic, paid platform like MTurk? If so, do you think only MTurk workers with writing expertise should be allowed to participate? And how should they be paid?
    • How will Ensemble handle ownership issues? Can this hamper productivity in the collaboration environment?
    • The lead author has the uphill task of collecting all the drafts/comments/suggestions and incorporating them into the story. Do you think it is worth spending extra hours compiling someone else’s ideas? How would English literature (assuming only English for now), per se, benefit from a crowdsourced story?

    Read More

    Distributed Analogical Idea Generation: Inventing with Crowds

    Lixiu Yu, Aniket Kittur, and Robert E. Kraut. 2014. Distributed analogical idea generation: inventing with crowds. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, NY, USA, 1245-1254.

     

    Discussion Leader: Nai-Ching Wang

    Summary

    This paper introduces a 4-step process, distributed analogical idea generation (identify examples, generate schemas, identify new domains, and generate new ideas), to increase the chances of producing creative ideas by introducing analogical transfer. There are two issues with current ways of producing new and good ideas, which trade quantity for quality. The first is that rewards are usually given only to the best ideas, ignoring the contributions made by other participants. The other is that trading quantity for quality is usually unstable and inefficient because we do not know how many ideas are enough. This paper uses three experiments to test the effectiveness of the proposed process. The first shows that the quality of generated ideas (composed of practicality, usefulness, and novelty) is better with expert-produced schemas. The second shows that the number of similar examples increases the quality of schemas induced from the crowd, while contrasting examples are not as useful. The third shows that the different qualities of schemas produced in Experiment 2 affect the last step, idea generation. Together, the three experiments confirm that the proposed process leads to better ideas than example-based methods.

    Reflections

    This paper starts to address the “winner takes all” issue we have been discussing in class, especially for the design/creativity domain. It seems that we now have a better way to evaluate and appreciate each person’s contribution and to decrease unnecessary or inefficient effort. In general, I like the design of the three experiments, each of which deals with a specific aspect of the overall study. Experiment 3 shows that good schemas help produce better ideas. It would be interesting to see how good the experimenter-generated schemas are, especially if we could compare their quality scores to the results of Experiment 2. Unfortunately, this information is not available in the paper. The distributed process presented in the paper is very impressive because it decomposes a larger process into several smaller components that can be run separately. It would be interesting to see a comparison of idea quality between the traditional approach and the approach used in the paper. It would also be interesting to compare the quality of assembly-line and artisan processes, because the latter might provide learning opportunities and thus higher-quality results, although it is not as flexible or distributed as an assembly line.

    Questions

    • What are the benefits/shortcomings for the raters to discuss or be trained for judging?
    • Do you think design guidelines/heuristics are similar to schemas mentioned in the paper? How similar/different?
    • In Experiment 3, why do you think an example was paired with either a good or a bad schema? Why not just use the good or bad schemas alone?
    • This paper mostly focuses on average quality. For creativity tasks, do you think that is a reasonable measure?
