Labeling Images with a Computer Game

Paper: Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’04). ACM, New York, NY, USA, 319-326. DOI=http://dx.doi.org/10.1145/985692.985733

Discussion Leader: Adam Binford

Summary

The authors of this paper tackle the issue of image labeling. Image labeling has many purposes, such as enabling better text-based image search and creating training data for machine learning algorithms. At the time, the typical approaches to image labeling were computer vision algorithms or manual labor. This was before crowdsourcing really became a thing, and before Amazon Mechanical Turk had even launched, so the labor required to produce these labels was likely expensive and hard to obtain quickly.

The paper presents a new way to obtain image labels: a game called The ESP Game. The idea behind the game is that image labels can be obtained from players who don’t realize they’re actually providing this data; they just find the game fun and want to play. The game works by matching up two players and showing them a common image. Players are told to try to figure out what word the other player is thinking of; they are told nothing about describing the image presented to them. Each player then enters words until a match is found between the two players. Players also have the option to vote to skip the image if it is too difficult to come up with a word for.

The game also includes the idea of taboo words: words that cannot be matched on for a given image. These words come from previous rounds of the game using the same image, so that multiple labels get generated for each image instead of the same obvious one over and over again. When an image starts to get skipped frequently, it is removed from the pool of possible images. The authors estimate that with 5,000 players playing the game constantly, all of the roughly 425,000,000 images indexed by Google could be labeled in about a month, with each image reaching the threshold of six labels within six months.
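To make the mechanics concrete, here is a minimal sketch of the agreement-and-taboo logic described above (the function names and data structures are my own, not the authors’ implementation; the real game adds timers, scoring, and partner matching):

```python
def play_round(guesses_a, guesses_b, taboo_words):
    """Return the first word both players enter that is not taboo, else None."""
    seen_a, seen_b = set(), set()
    for a, b in zip(guesses_a, guesses_b):      # players type roughly in parallel
        for word, mine, theirs in ((a, seen_a, seen_b), (b, seen_b, seen_a)):
            if word in taboo_words:             # taboo words cannot become the match
                continue
            mine.add(word)
            if word in theirs:                  # agreement: this word becomes a label
                return word
    return None                                 # no agreement; players may vote to skip


def add_label(labels, image_id, word, threshold=6):
    """Record an agreed-upon label and report whether the image should be retired."""
    labels.setdefault(image_id, set()).add(word)
    # Agreed-upon words become taboo for later rounds on the same image, and the
    # paper retires an image once it reaches six labels (or is skipped too often).
    return len(labels[image_id]) >= threshold
```

Taken at face value, the month-long estimate implies 425,000,000 images shared among roughly 2,500 simultaneous player pairs, which works out to about 230 agreed labels per pair per hour, or roughly one agreement every 15 seconds.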

The authors were able to show that their game was indeed fun and that quality labels were generated by its players. They supported the level of fun of their game with usage statistics, indicating that over 80% of the 13,670 players who played the game played it on multiple days. Additionally, 33 people played for more than 50 hours total. These statistics indicate that the game provides sufficient enjoyment for players to keep coming back to play.

Reflection

This paper is one of the first we’ve read this year that looks at alternatives to paid crowd work. What’s even more impressive is that it was published before crowdsourcing really became a thing, and a year before Amazon Mechanical Turk was even launched. The ESP Game really started the idea of gamification of meaningful work, which many people have tried to emulate since, including Google, which basically made its own version of this game. While not specifically mentioned by the authors, I imagine few, if any, of the game’s players knew that it was intended to be used for image labeling. This means the players truly played it just for fun, and not for any other reason.

Crowdsourcing through games has many advantages over what we would consider traditional crowdsourcing through AMT. First, and most obviously, it provides free labor. You attract workers through the fun of the game, not through any monetary incentive. This provides additional benefits. With paid work, you have to worry about workers trying to perform the least amount of work for the most amount of money, which can result in poor quality. With a game, there is less incentive to game the system, albeit still some. With paid work, there isn’t much satisfaction lost by cheating your way to more money. But with a game, it will be much less satisfying to get a high score by cheating or gaming the system than to earn it legitimately, at least for most people. The authors also found some good ways to combat possible cheating or collusion between players. While they discussed their strategies for this, it would be interesting to hear whether they ever had to use them and how rampant, if at all, cheating became in the game.

Obviously the issue with this approach is making your game fun. The authors were able to achieve this, but not every task that can benefit from crowdsourcing can easily be turned into a fun game. Image labeling just happens to have many possible ways of being turned into an interesting game. All of the Metadata games linked to on the course schedule involve image (or audio) labeling, and they don’t hide the true purpose of the work nearly as well: their descriptions specifically mention tagging images, unlike The ESP Game, which said nothing about describing the images presented. The popularity of Mechanical Turk and the variety of tasks available on it go to show how difficult it is to turn these problems into an interesting game.

I do wonder how useful this game would be today. One of the things mentioned several times by the authors is that with 5,000 people playing the game constantly, they could label all images indexed by Google within a month. But that was back in 2004, when they said there were about 425,000,000 images indexed by Google. In the past 10 years, the internet has been expanding at an incredible scale. I was unable to find any specific numbers on images, but Google has indexed over 40 billion web pages. I would imagine the number of images indexed by Google could be nearly as high. This leads to some questions…

Questions

  • Do you think the ESP Game would be as useful today as it was 11 years ago, with respect to the number of images on the internet? What about with respect to the improvements in computer vision?
  • What might be some other benefits of game crowd work over paid crowd work that I didn’t mention? Are there any possible downsides?
  • Can you think of any other kinds of work that might be gamifiable, other than the Foldit-style games? Do you know of any examples?
  • Do you think it’s ok to hide the fact that your game is providing useful data for someone, or should the game have to disclose that fact up front?

Read More

Algorithm discovery by protein folding game players

Authors: Firas Khatib, Seth Cooper, Michael D. Tyka, Kefan Xu, Ilya Makedon, Zoran Popović, David Baker

Discussion Leader: Divit Singh

Crowdsourcing Example: http://weathersignal.com

Summary

Foldit is an online puzzle video game. It presents a platform on which multiple players can collaborate and compete on tasks such as protein folding. It utilizes citizen science: leveraging natural human abilities for scientific purposes. Foldit provides players with a palette of interactive tools and manipulations to aid them in structuring the protein presented to them. In addition, Foldit gives players the ability to create their own “recipes” for manipulating proteins. These recipes are sets of instructions and gameplay macros that let the players using them automatically manipulate the proteins presented to them. User-friendly algorithms from the Rosetta structure prediction methodology were provided as well to aid players in interacting with structures. From observing how players used these algorithms, it became apparent that players used them to augment rather than substitute for human strategizing. There was no single algorithm that everyone employed; at different stages of interaction, players would use multiple recipes to build their structures, which in turn led to more recipes being created.
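Recipes are essentially small scripts over the game’s move set, judged by the structure’s score. The sketch below is a loose Python analogy of what such a recipe might look like; it is not Foldit’s actual recipe language or API, and `shake`, `wiggle`, and `score` are placeholder functions:

```python
def simple_relax_recipe(pose, score, shake, wiggle, iterations=50):
    """Repeatedly perturb then locally optimize a structure, keeping only improvements."""
    best_pose, best_score = pose, score(pose)
    for _ in range(iterations):
        candidate = wiggle(shake(best_pose))   # placeholder "moves" on the structure
        candidate_score = score(candidate)
        if candidate_score < best_score:       # lower energy/score = better fold
            best_pose, best_score = candidate, candidate_score
    return best_pose, best_score
```

Because recipes like this can be shared and edited by other players, small improvements can accumulate across the community.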

During the time of the study, researchers created the “Fast Relax” algorithm, which achieved better results in less time. However, an algorithm was also developed by the Foldit players during this period: “Blue Fuse”. The two algorithms turned out to be very similar despite being developed completely independently. When they were tested side by side, it was revealed that Blue Fuse is more effective than Fast Relax (within Foldit) on the time scales best suited to game play. This algorithm was discovered solely by Foldit players.

Reflection

This paper is about a popular crowdsourcing framework used in the bioinformatics field. It presents a unique way to utilize the brainpower of the general masses to create new and efficient algorithms by introducing a gaming aspect to protein folding. I really liked how they allowed the players to build their own algorithms/simulations by employing the concept of “recipes”. I believe this was a crucial feature that allowed players to build off someone else’s work rather than starting from scratch and either coming up with their own small contribution or replicating someone else’s work. They also present a clear UI with a helpful suite of tools for manipulating the structure. In addition, I found videos on YouTube as well as abundant information on their website that really emphasize the purpose of this software.

Figures 3 and 4 really emphasize the power of citizen science, as they show the social evolution of Foldit recipes. New recipes are essentially built on top of each other in hopes of gaining marginal efficiency with each iteration. Instead of using machine learning in an attempt to approximate these recipes and simulations, real human creations were used to develop algorithms. The fact that these recipes resembled an algorithm produced by researchers specifically focused on producing an efficient algorithm shows the power of human computation. As it stands, machine learning can only take us so far, especially in visual tasks such as these.

Questions

1. What are your opinions on gamifying problem solving/reasoning tasks such as this to attract a crowd?  Do you think it takes away from the task at hand by attracting a crowd that may be too young/old for its purpose? If so, how would you leverage gamification/any other task to try to attract the specified target audience?

2. Assume there were no “energy” function with which to rate recipes. Based on visual aesthetics alone, how would you create a metric to measure how “clean” or “well-produced” a certain recipe is?

3. Would you rather have recipes be built on top of each other, or have individuals try to create their own from scratch? If you want them to be built on top of each other, does it not “tunnel-vision” subsequent creators?

    Read More

    Shepherding the crowd yields better work

    Paper: Steven Dow, Anand Kulkarni, Scott Klemmer, and Björn Hartmann. 2012. Shepherding the crowd yields better work. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW ’12). ACM, New York, NY, USA, 1013-1022. DOI=http://dx.doi.org/10.1145/2145204.2145355

     

    Discussion Leader: Anamary Leal

     

    Summary

    The research goal of this work is: How can crowdsourcing support learning? How do we get better, multi-faceted, creative, complex work from unskilled crowds?

    Their solution is shepherding: providing meaningful real-time feedback, with the worker iterating on their work in response to an assessor’s comments.

    For the task of writing a product review, a worker would write the review, and the shepherd would then get a notification of a review submission. Then, in real time, the shepherd gave structured feedback consisting of a rating of the review’s quality, a checklist of things to cover in the review, and an open-ended question. The worker then had a chance to improve the work.

    The authors conducted a comparative study structured around product reviews with a control condition (no feedback), self-assessment, and external assessment (shepherding). They measured task performance, learning, and perseverance (number of revisions and string edit distance). The workers were drawn from the crowd, while the shepherd was a single reviewer recruited from oDesk. Self-assessment did slightly better than shepherding, and both feedback conditions did significantly better than the no-feedback condition. Shepherding resulted in more worker edits, along with more and better revisions.
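Based on the workflow described in this summary, here is a minimal sketch of the shepherding loop; the field names and functions are my own reading of the process, not the authors’ system:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ShepherdFeedback:
    quality_rating: int            # the shepherd's rating of the current draft
    checklist_missing: List[str]   # checklist items the review has not yet covered
    open_comment: str              # the open-ended question / free-form guidance

@dataclass
class ReviewTask:
    drafts: List[str] = field(default_factory=list)
    feedback: List[ShepherdFeedback] = field(default_factory=list)

def shepherded_iteration(task: ReviewTask, write_draft, give_feedback, rounds: int = 1):
    """Worker writes, the shepherd responds in (near) real time, the worker revises."""
    draft = write_draft(prior_feedback=None)          # initial submission
    task.drafts.append(draft)
    for _ in range(rounds):
        fb = give_feedback(draft)                     # shepherd is alerted on submission
        task.feedback.append(fb)
        draft = write_draft(prior_feedback=fb)        # worker revises using the feedback
        task.drafts.append(draft)
    return task
```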

     

    Reflection

    This paper is a great stepping stone for more questions to explore. In addition to the attrition question and different learning strategies, I wonder how well these crowds learn in the longer term. Short bursts of learning are one thing (like cramming), but I wonder whether those same workers, through more feedback, get better at writing reviews than others. How well do these lessons stick? The role of feedback can help bring about the dream of crowdsourcing work we would want our kids to do.

    Another stepping stone is to measure gains with respect to iterations, even in the short term. How much does a worker gain from two or more iterations with the assessor, or even with self-assessment?

    Feedback, especially external feedback, helped motivate workers to do more and better work. I’m not well versed in educational research, but engagement in teaching the material and assessment are quite important.

    The authors took care to mention the attrition rate and what that rate was composed of. I wonder what can be done about that population. Most of the attrition came from workers dropping out too early, but a decent portion was due to plagiarism. I wonder what those participants saw in the task that discouraged them from completing it.

    The external condition probably would not have been as successful if the feedback form had not been appropriately structured, with a helpful checklist of items to cover. I can imagine that a ton of design work went into that form to guide shepherds to provide prompt, constructive feedback that the worker can act on.

    In their studies, it looks like crowd workers cannot substitute for expert shepherds and provide quality feedback. But I wonder if that too can be learned? It’s harder to teach something than to just be good at something.

    Discussion

    1. How do we get the non-trivial audience who dropped out to participate? Feedback encouraged more work, so in a general sense, would more feedback lure them in?
    2. If the task turned to assessing reviews rather than writing reviews, which one do you think would require more iterations to get better at? Which would be easier to learn: writing, or critiquing others?
    3. How much feedback do you think is needed for an unskilled worker to get better at these creative, multi-faceted, complex tasks? Are there some examples, like writing a story, which may need more cycles of writing and review to be better at it?
    4. How do you see (or not see) real-time external assessment fit into your projects, and what do you think the gains would be, after reading this paper?

    Read More

    Structuring, Aggregating, and Evaluating Crowdsourced Design Critique

    K. Luther, J.-L. Tolentino, W. Wu, A. Pavel, B. P. Bailey, M. Agrawala, B. Hartmann, and S. P. Dow, “Structuring, Aggregating, and Evaluating Crowdsourced Design Critique,” in Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, New York, NY, USA, 2015, pp. 473–485.

    Discussion leader: Will Ellis

    Summary

    In this paper, Luther et al. describe CrowdCrit, a system for eliciting and aggregating design feedback from crowdworkers. The authors motivate their work by explaining the peer and instructor feedback process that is often employed in a design classroom setting. By exposing their designs to high-quality constructive criticism, designers can improve both their designs and their craft. However, the authors point out that it is difficult to find such essential feedback outside of the classroom. Online design communities may provide some critique, but it is often too little and too shallow. To solve this problem, the authors built CrowdCrit and tested it in three studies. Their studies attempted to answer the questions: How similar are crowd critiques to expert critiques? How do designers react to crowd critiques? And how does crowd critique impact the design process and results?

    In Study 1, the authors had a group of 14 CrowdCrit-sourced workers and a group of 3 design experts evaluate 3 poster designs using the CrowdCrit interface. They found that while individual crowdworkers’ critiques matched experts’ poorly, the crowd in aggregate matched 45% to 60% of the experts’ design critiques. The results suggest that adding even more workers would produce critiques that match the experts’ even more closely.

    In Study 2, the authors tested designers’ reactions to crowd critiques by way of a poster contest for a real concert. Designers spent an hour designing a poster according to client criteria. Both crowdworkers and the client then generated feedback using CrowdCrit. Designers then had the chance to review feedback and make changes to their designs. Finally, the client chose a poster design winner. In interviewing the designers, the authors found that they felt the majority of the crowd critiques were helpful and that they appreciated a lot of the features of the CrowdCrit system including abundant feedback, decision affirmation, scaffolded responses, and anonymity.

    In Study 3, the authors evaluated the impact of crowd critique on the design process using another design contest, this time hosted on 99designs.com. After the initial design stage, half of the design participants were given crowd feedback through CrowdCrit, and the other half were given generic, unhelpful feedback. The final designs were evaluated by both the client and a group of crowdworkers meeting a certain design expertise threshold. While the designers appreciated the crowd feedback more than the generic feedback, the results showed no significant difference in quality between the treatment and control groups.

    The authors conclude with implications for their work. They feel that crowd feedback may make designers feel as though they are making major revisions when in fact they’re only making minor improvements. Indeed, the nature of CrowdCrit seems to ensure that designers will receive large lists of small changes that do not cause them to make substantive design changes but, if implemented, contribute to busier, less simple designs.

    Reflection

    CrowdCrit is implemented on top of Amazon Mechanical Turk and, thus, has the benefit of being able to pull feedback from a lot of design novices. This paper makes the case that such feedback, in aggregate, can approximate the feedback of design experts. I am very concerned with the amount of noise introduced in the aggregation approach discussed in Study 1. Yes, with enough crowdworkers, you will eventually have enough people clicking enough critique checkboxes that all of the ones that an expert selected will also be selected by crowdworkers. However, if we assume that the critiques an expert would have made are the most salient, the designer would be unable to separate the salient from the inconsequential. I would hope that the most-selected critiques made by an army of crowdworkers would better approximate those of an actual expert, but the authors do not explore this strategy. I would also explore a weighting system that favors critiques from CrowdCrit’s design-experienced crowdworkers, not just by coloring them more boldly, but also by hiding novice critiques that have low replication.
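To make that weighting suggestion concrete, here is a minimal sketch of how such an aggregation could work (this is my own proposal, not CrowdCrit’s actual algorithm): count how many workers selected each critique statement, give design-experienced workers extra weight, and surface only critiques with enough replication:

```python
from collections import defaultdict

def rank_critiques(selections, expert_weight=2.0, min_score=2.0):
    """
    selections: iterable of (worker_is_experienced, critique_id) pairs, one per
    critique checkbox a worker ticked for a given design.
    Returns critiques sorted by weighted support, dropping low-replication ones.
    """
    scores = defaultdict(float)
    for is_experienced, critique_id in selections:
        scores[critique_id] += expert_weight if is_experienced else 1.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(critique, score) for critique, score in ranked if score >= min_score]
```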

    I am impressed by the effort and technique employed by the authors to distill their seven design principles, which they came to by analyzing important works in design teaching. I think the scaffolding approach to teaching design to crowdworkers was novel, and I appreciated the explanation of the pilot studies they performed to arrive at their strategy. I wonder if those who would use a system like CrowdCrit, the designers themselves, would not benefit from participating as workers in the system. Much like a design classroom, they could benefit from the scaffolded learning and application of design principles, which they may only know in part.

    In Study 3, I’m sure the authors were disappointed to find no statistically significant improvement in design outcomes using crowd feedback. However, I think the most important goal of peer and expert critique, at least in the classroom, is not to improve the design, but to improve the designer. With that in mind, it would be interesting to see a longitudinal study evaluating the output of designers who use CrowdCrit over a significant period of time.

    Questions

    • Study 1 shows that adding more workers produces more data, but also more “false positives”. The authors conjecture that these may not be false positives, but could in fact be critiques that the experts missed. Are the authors correct, or is this just more noise? Is the designer impaired by so many extra critiques?
    • CrowdCrit is designed to work with any kind of crowd, not just the Mechanical Turk community. Based on other papers we’ve read, how could we restructure CrowdCrit to fit within a community of practice like graphic design?
    • Study 3 seems to show that for a single design, critique does not improve a design so much as simple iteration. Is feedback actually an important part of the design process? If so, how do we know? If we accept that feedback is an important part of the design process, how might we design a study that evaluates CrowdCrit’s contribution?
    • The results of Study 2 show a lot of positive feedback from designers for CrowdCrit’s features and interface. Implied in the designers’ comments is their enthusiasm for mediated engagement with clients and users (crowdworker stand-ins in this case) over their designs. What are CrowdCrit’s most important contributions in this regard?

    Something Cool

    Moneyball, but for Mario—the data behind Super Mario Maker popularity

    Read More

    Ensemble: Exploring Complementary Strengths of Leaders and Crowds in Creative Collaboration

    Kim, Joy, Justin Cheng, and Michael S. Bernstein. “Ensemble: exploring complementary strengths of leaders and crowds in creative collaboration.” Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. ACM, 2014.

    Discussion Leader : Ananya

    Summary:

    Ensemble is a collaborative story writing platform where the leader maintains a high-level vision of the story and articulates creative constraints, while contributors contribute new ideas, comment on, or upvote existing ones.

    Scenes are the basic collaborative unit of each story. A scene may correspond to a turning point in the story that reveals character development, new information, and a goal for the next scene. The lead author creates a scene with a prompt and a short description that suggests what problem the lead author wants to solve in that scene. The scene directs contributors towards specific sections that the author has chosen to be completed.

    The contributors can participate via drafts, comments, or votes. They can communicate with the author or discuss specific scenes using scene comments. Each scene might have multiple drafts from different contributors. The lead author maintains creative control by choosing a winning draft for each scene, and can optionally appoint a moderator to edit drafts. The lead author can add the winning draft directly to the original story, or take inspiration from the contributions and write his or her own.
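A minimal sketch of the leader/contributor structure described above, using field names of my own rather than Ensemble’s actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Draft:
    author: str
    text: str
    votes: int = 0

@dataclass
class Scene:
    prompt: str                      # the lead author's creative constraints for the scene
    description: str                 # what problem this scene should solve
    drafts: List[Draft] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)
    winner: Optional[Draft] = None   # chosen (or rewritten) by the lead author

@dataclass
class Story:
    lead_author: str
    scenes: List[Scene] = field(default_factory=list)
```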

    The authors evaluated their platform by running a short story writing competition on it, monitoring participant activity during the competition, and conducting interviews with seven users. The results suggested that lead authors spent a significant amount of time revising drafts, moderators spent their time mainly editing drafts created by the lead author, and contributors contributed somewhat more through comments.

     

    Reflection:

    The idea presented in this paper is not new. Several TV series have incorporated similar techniques for many years now, where the series creator defines the story outline and each episode is written by a different member of the team. To me, the novel contribution of this paper was using this concept to create an online platform for creative collaboration among people who may not know each other. In fact, one of the results analyzed in the paper was whether lead authors knew contributors previously: 4 out of 20 stories were written by teams made up of strangers. Although out of scope for this paper, I would still like to know how these 4 stories performed qualitatively in comparison to the stories written by a team of friends.

    The authors mentioned that 55 Ensemble stories were started, but only 20 of these stories were later submitted as entries. Again, some analysis of why more than 50% of the stories could not be completed would have been good. The team size of submitted stories ranged from 1 to 7 people. Compared to any crowdsourcing platform this number is minuscule, which makes me wonder: can this platform successfully cater to a larger user base where hundreds of people collaborate to write a story (the authors also raise this question in the paper), like we see in crowdsourced videos these days?

    It would be interesting to see how this approach compares to traditional story writing methods, how quality varies when multiple people from different parts of the world collaborate to write a story, how their diverse backgrounds affect the flow of the story, and how lead authors maneuver through all this variety to create the perfect story.

    In the end, I feel that Ensemble in its current state is not a platform where a crowd collaborates to write a story, but rather a platform where the crowd collaborates to improve someone else’s story.

     

    Questions:

    • In this paper, the authors advertised the competition on several writing forums. Would this strategy work on a more generic, paid platform like MTurk? If so, do you think only mturkers with writing expertise should be allowed to participate? And how should mturkers be paid?
    • How will Ensemble handle ownership issues? Can this hamper productivity in the collaboration environment?
    • The lead author has the uphill task of collecting all drafts/comments/suggestions and incorporating them into the story. Do you think it is worth spending extra hours compiling someone else’s ideas? How would English literature (assuming only English for now), per se, benefit from a crowdsourced story?

    Read More

    Distributed Analogical Idea Generation: Inventing with Crowds

    Lixiu Yu, Aniket Kittur, and Robert E. Kraut. 2014. Distributed analogical idea generation: inventing with crowds. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, NY, USA, 1245-1254.

     

    Discussion Leader: Nai-Ching Wang

    Summary

    This paper introduces a 4-step process, distributed analogical idea generation (identify examples, generate schemas, identify new domains, and generate new ideas), to increase the chances of producing creative ideas by introducing analogical transfer. There are two issues with current ways of producing new and good ideas, which essentially trade quantity for quality. The first issue is that rewards are usually given only to the best ideas, ignoring the contributions made by other participants. The other issue is that trading quantity for quality is usually unstable and inefficient because we do not know how many ideas are enough. This paper uses three experiments to test the effectiveness of the proposed process. The result of the first experiment shows that the quality of generated ideas (composed of practicality, usefulness, and novelty) is better with expert-produced schemas. The result of the second shows that the number of similar examples increases the quality of schemas induced from the crowd, while contrasting examples are not as useful. The result of the third shows that the different qualities of schemas produced in Experiment 2 affect the last step, idea generation. The three experiments confirm that the proposed process leads to better ideas than example-based methods.
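The four steps read naturally as a staged crowd pipeline. The sketch below is my own framing of that pipeline, with the `crowd.*` methods standing in for hypothetical microtasks rather than the authors’ implementation:

```python
def distributed_analogical_idea_generation(problem, crowd):
    # 1. Identify examples: collect existing solutions related to the problem.
    examples = crowd.collect_examples(problem)

    # 2. Generate schemas: abstract each example into the principle that makes it work.
    schemas = [crowd.abstract_schema(example) for example in examples]

    # 3. Identify new domains: find other domains where each schema could apply.
    domains_per_schema = [crowd.find_domains(schema) for schema in schemas]

    # 4. Generate new ideas: instantiate each schema in each of its new domains.
    ideas = []
    for schema, domains in zip(schemas, domains_per_schema):
        for domain in domains:
            ideas.append(crowd.generate_idea(schema, domain))
    return ideas
```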

    Reflections

    This paper starts to address the “winner takes all” issue we have been discussing in class, especially for the design/creativity domain. It seems that we now have a better way to evaluate and appreciate each person’s contribution and to decrease unnecessary/inefficient effort. In general, I like the design of the three experiments, each of which deals with a specific aspect of the overall study. Experiment 3 shows that good schemas help produce better ideas. It would be interesting to see how good the experimenter-generated schemas are, especially if we could compare their quality, in terms of scores, to the results of Experiment 2; unfortunately, this information is not available in the paper. The distributed process presented in the paper is very impressive because it decomposes a larger process into several smaller components that can be operated separately. It would be interesting to see a comparison of idea quality between the traditional way and the way used in the paper. It would also be interesting to compare the quality of assembly-line and artisan processes, because the latter might provide learning opportunities and thus higher-quality results, although it is not as flexible/distributed as the assembly line.

    Questions

    • What are the benefits/shortcomings for the raters to discuss or be trained for judging?
    • Do you think design guidelines/heuristics are similar to schemas mentioned in the paper? How similar/different?
    • In experiment 3, why do you think an example was paired with either a good or a bad schema? Why not just use good or bad schemas alone?
    • This paper mostly focuses on average quality. For creativity tasks, do you think that is a reasonable measure?

    Read More

    VizWiz: Nearly Real-time Answers to Visual Questions

    Bigham, Jeffrey P., et al. “VizWiz: Nearly Real-time Answers to Visual Questions.” Proceedings of the 23rd annual ACM symposium on User interface software and technology. ACM, 2010.

    Discussion Leader: Sanchit

    Crowdsourcing Example: Couch Surfing

    Youtube video for a quick overview: VizWiz Tutorial

    Summary

    VizWiz is a mobile application designed to answer visual questions for blind people in real time by taking advantage of existing crowdsourcing platforms such as Amazon’s Mechanical Turk. Existing software and hardware to help blind people solve visual problems are either too costly or too cumbersome to use. OCR is not advanced or reliable enough to completely solve vision-based problems, and existing text-to-speech software only addresses the single issue of reading text back to the blind user. The application interface is designed to take advantage of Apple’s accessibility service VoiceOver, which allows the operating system to speak to the user and describe the currently selected option or view on the screen. Touch-based gestures are used to navigate the application so that users may easily take a picture, ask a question, and receive answers from remote workers in real time.

    The authors also present an abstraction layer on top of the Mechanical Turk API called quikTurkit. It allows requesters to create their own website on which Mechanical Turk workers are recruited and can answer questions posed by users of the VizWiz application. A constant stream of HITs is posted on Mechanical Turk so that a pool of workers is available as soon as a new question is posed by the user. While the user is taking a picture and recording their question, VizWiz sends a notification to quikTurkit, which lets it start recruiting workers early and therefore reduces the overall latency in waiting for an answer to come back.
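A minimal sketch of the “warm pool” idea as I read it from this summary; `post_hit`, `pool_size`, and `question_expected` are placeholder callbacks, not the real quikTurkit or Mechanical Turk API:

```python
import time

TARGET_POOL = 5      # workers we would like idling on the answer site
POLL_INTERVAL = 5    # seconds between pool checks

def maintain_worker_pool(post_hit, pool_size, question_expected):
    """Keep posting HITs so workers are already waiting when a question arrives."""
    while True:                                   # runs as a background daemon loop
        # question_expected() stands in for the early notification VizWiz sends
        # while the user is still photographing/recording, so recruiting starts early.
        target = TARGET_POOL * (2 if question_expected() else 1)
        shortfall = target - pool_size()
        for _ in range(max(0, shortfall)):
            post_hit(reward_cents=1, title="Answer a quick visual question")
        time.sleep(POLL_INTERVAL)
```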

    VizWiz also featured a second version which detected blurry or dark images and asked users to retake them in order to get more accurate results. The authors also developed a use case VizWiz:LocateIt which allows blind users to locate an object in 3D space. They take a picture of an area where the desired object is located and then pose a question asking for the location of the specific object. Crowdworkers then highlight the object and the application processes the camera properties, the user’s location and the highlighted object to determine how much the user should turn and how far the user should walk in order to reach the general vicinity of the object. A lot of favorable responses were generated from the post user study surveys which showed that this technology is definitely in demand by blind users and may set up future research to automate the answering process without human interaction.

    Reflections

    I thought the concept in itself was brilliant. It is a problem that not many people think about in their daily lives, but when you sit down and really start to ponder on how trivial tasks such as finding the right object in a space can be nearly impossible for blind people, you realize the potential of such an application. The application design was very solid. Apple designed the VoiceOver API for vision-impaired people to begin with, so using it in such an application was the best choice. Employing large gestures for UI navigation is also smart because it can be very difficult or impossible for vision-impaired people to click a specific button or option on a touch based screen/device.

    QuikTurkit was, in my opinion, a good foundation and a good start as the backend model for this application. It can definitely be improved by not placing so much focus on speech recognition, and by not bombarding Mechanical Turk with too many HITs. Finding the right balance between the number of active workers in the pool and the number of HITs to be posted will really benefit both the system load and the cost the user has to incur in the long run.

    A minor discrepancy I noticed was that the study initially reported 11 blind users including 5 females, but later in the paper there were 3 females. Probably a typo, but thoughts? Speaking of their study, I think the heuristics made a lot of sense and the survey results were generally favorable for the application. A latency of 2-3 minutes on average is not too bad considering the helpless situation of a vision-impaired person. Any amount of additional information or question answering the user can get will only be helpful. I honestly didn’t see the point of making speech recognition a focus of their product. If workers can just listen to the question, that should be sufficient to answer it. There is no need to introduce errors from failed speech recognition attempts.

    In my opinion, VizWiz:LocateIt was too complicated a system, with too many external variables to worry about, for a visually impaired user to reliably find an object. The location detection and mapping are based only on the picture taken by the user, which is not guaranteed to be good (more often than not). Although they have several algorithms and techniques to combat ineffective pictures, I still think there are potential hazards and accidents waiting to happen based on the direction cues provided by the application. I am not entirely convinced by this use case.

    Overall it was a solid concept and execution in terms of the mobile application. It looks like the software is public and is being used by over 5000 blind people right now, so that is pretty impressive.

    Questions:

    1. One aspect that confused me about quikTurkit was who actually deployed the server or made the website for Mechanical Turk workers to use this service. Was it the VizWiz team who created the server, or can requesters build their own websites using this service as well? And who would the requesters be? Blind people?
    2. Besides human compassion and empathy, what is stopping users from giving wrong answers? Also, who determines whether the answer was correct or not?
    3. If a handheld barcode scanner works fairly well to locate a specific product in an area, then why couldn’t the authors just use a barcode scanning API on the iPhone along with the existing VoiceOver technology to help locate a specific product? Do you foresee any problems with this approach?

     

    Read More

    Crowds in two seconds: enabling realtime crowd-powered interfaces

    Bernstein, Michael S., et al. “Crowds in two seconds: Enabling realtime crowd-powered interfaces.” Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.

    Discussion Leader: Shiwani

    Youtube video for a quick overview: https://www.youtube.com/watch?v=9IICXFUP6MM

    Summary

    Crowdsourcing has been successfully used in a variety of avenues, including interactive applications such as word processors, image searches, etc. However, a major challenge is the latency in returning a result to the user. If an interface takes more than 10 seconds to react, the user is likely to lose focus and/or abandon the interface. Near real-time techniques at the time required at least 56 seconds for simple tasks and 22 minutes or longer for more complex workflows.
    In this paper, the authors propose techniques for recruiting and effectively using synchronous crowds in order to provide real-time, crowd-powered interfaces. The first technique is called the retainer model and involves hiring workers in advance and placing them on hold by paying them a small amount. When a task is ready, the workers are alerted and are paid an additional amount on completion of the task. The paper also provides empirical guidelines for this technique. The second technique introduced in the paper is rapid refinement. It is a design pattern for algorithmically recognizing crowd agreement early on and rapidly reducing the search space to identify a single result.
    The authors created a system called Adrenaline to validate the retainer model and rapid refinement. Adrenaline is a smart photo shooter designed to find the most photogenic moment: it captures a short (10-second) video clip and uses the crowd to identify the best frame.
    Additionally, the authors were interested in looking at other applications for real-time crowd-powered interfaces. For this, they created two systems, Puppeteer and A|B. Puppeteer is intended for creative content generation tasks and allows the designer to interact with the crowd as they work. A|B is a simple platform for asking A-or-B questions, with the user providing two options and asking the crowd to choose one based on pre-specified criteria.
    The results of the experiments suggest that the retainer model is effective in assembling a crowd about two seconds after the request is made, and that a small reward for quickness remediated the longer reaction times caused by longer retainer times. It was also found that rapid refinement enabled small groups to select the best photograph faster than the single fastest member. However, forcing agreement too quickly sometimes affected the quality. For Puppeteer, there was a small latency due to the complexity of the task, but the throughput rates were constant. For A|B, on the other hand, responses were received in near real time.
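A simplified sketch of the rapid-refinement pattern (my own simplification, not the authors’ algorithm): watch where workers’ selections cluster and repeatedly shrink the search interval around that region until a single answer remains:

```python
def refine_interval(lo, hi, worker_positions, shrink=0.5, agreement_window=0.25):
    """Return a narrower (lo, hi) if enough workers agree, else the original interval."""
    span = hi - lo
    window = span * agreement_window
    positions = sorted(worker_positions)
    # Find the densest cluster of worker positions (simple O(n^2) scan).
    best_center, best_count = None, 0
    for p in positions:
        count = sum(1 for q in positions if abs(q - p) <= window / 2)
        if count > best_count:
            best_center, best_count = p, count
    if best_count >= max(2, len(positions) // 2):   # early "agreement" detected
        new_span = span * shrink
        lo = max(lo, best_center - new_span / 2)
        hi = min(hi, best_center + new_span / 2)
    return lo, hi
```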

    Reflections

    This paper ventures into an interesting space and brings me back to the idea of “Wizard of Turk”, where users interact with the system and the responses of the system are generated through human intelligence. The reality of machine learning at the moment is that there are still areas which are subjective and require a human in the loop. Tasks such as identifying the most photogenic photo or voting on whether a red sweater looks better than a black sweater are classic examples of such subjective aspects. This is demonstrated, in part, through the Adrenaline experiment, where the quality of the photo chosen through crowdsourcing was better than the computer-vision-generated photo. For subjective voting (A-or-B), users might even prefer to get input from other people, as opposed to the machine. Research in media equations and affect would indicate that this is likely.
    The authors talk about their vision of the future, where crowdsourcing markets are designed for quick requests. Although the authors have demonstrated that it is possible to have synchronous crowds and use them to perform real-time tasks with quick turnaround times, a lot more thought needs to go into the design of such systems. For example, if many requesters wanted to have workers on “retainer”, workers could easily accept tasks from multiple requesters and simply make some money for being on hold. The key idea of a retainer is to not prevent the worker from accepting other tasks while they wait. These two ideas seem to be at loggerheads with each other. Additionally, this might introduce higher latency, which perhaps could be remediated with competitive quickness bonuses. The authors do not explicitly state how much workers were paid for completing the task, and I wonder how those amounts compared to the retainer rates they offered.
    For the Adrenaline experiment, the results compared the best photo identified from a short clip through a variety of techniques, viz. generate-and-vote, generate-one, computer vision, rapid refinement, and photographer. It would have been interesting to see two additional conditions: a single photograph taken by an expert photographer, and a set of photographs taken by a photographer used as input to the techniques.

    Questions:

    1. The Adrenaline system allows users to capture the best moment, and the cost per image is about $0.44. The authors envision this cost going down to about $0.10. Do you think users would be willing to pay for such an application, especially given that Android phones such as the Samsung Galaxy have a “capture best photo” mode whereby multiple images are taken at short intervals and the user can select the best one to save?

    2. Do you think that using the crowd for real-time responses makes sense?

    3. For the rapid refinement model, one of the issues mentioned was that it might stifle individual expression, and that a talented worker’s input might get disregarded as compared to that of 3-4 other workers. Voting has the same issue. Can you think of ways to mitigate this?

    4. Do we feel comfortable outsourcing such tasks to crowd workers? It is one thing when it is a machine…

    Read More

    Frenzy: Collaborative Data Organization for Creating Conference Sessions

    Lydia Chilton, Juho Kim, Paul André, Felicia Cordeiro, James A. Landay, Daniel S. Weld, Steven P. Dow, Robert C. Miller, Haoqi Zhang

    Discussion Leader: Divit Singh

    Summary

    In a conference, similar papers are usually organized into sessions. This is done so that conference attendees can see related talks in the same time block. The process of organizing papers into these sessions is nontrivial. This paper presents Frenzy, a web application designed to leverage the distributed knowledge of the program committee to rapidly group papers into sessions. The application breaks session-making into two sub-problems: meta-data elicitation and global constraint satisfaction. In the meta-data elicitation stage, users search for papers via queries on their abstracts, authors, etc. and group them into categories that they believe make sense. They also have the ability to “+1” categories suggested by other users to show support for them. In the global constraint satisfaction stage, users must assign every paper to a session and make sure that every session contains at least two papers. The authors tested the application at the CSCW 2014 PC meeting, and the CSCW 2014 schedule was generated with the aid of Frenzy.
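A minimal sketch of the global-constraint check in the second stage (the field names are my own): every paper must be assigned to a session, and every session must contain at least two papers:

```python
from collections import Counter

def check_session_constraints(papers, assignments, min_papers_per_session=2):
    """assignments maps paper_id -> session_id, or None if the paper is unassigned."""
    problems = []
    unassigned = [p for p in papers if assignments.get(p) is None]
    if unassigned:
        problems.append(f"unassigned papers: {unassigned}")
    session_sizes = Counter(s for s in assignments.values() if s is not None)
    for session, size in session_sizes.items():
        if size < min_papers_per_session:
            problems.append(f"session {session!r} has only {size} paper(s)")
    return problems   # an empty list means all constraints are satisfied
```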

    Reflection

    The idea of leveraging parallelism to create sessions for a conference is a brilliant one. The paper mentions that this process used to take an entire day and that, even then, the schedulers were usually pigeonholed into deciding which session a paper belonged to (in order to fulfill constraints). By creating a web application that gives all users access to the data, I believe they created an extremely useful and efficient system. The only downside I see to this application is that I fear it may give users too much power. For example, users may delete categories. I’m not sure that giving users this type of power is wise. For the purposes of a conference, where all users are educated and have a clear sense of the goal, it may be okay. However, if they were to open this system up to a wider audience, it might backfire.

    I really liked how they divided their process into sub-problems. From my understanding, the first stage is to get a general sense of where these papers belong and to gather feedback showing where the majority of users believe the appropriate category for a paper should be. This stage is open to the entire audience so that everyone may contribute and have a say. The second stage is more of a “clean-up” stage: a special subset of the committee members then makes the final choices about which session each paper goes into. By then, they are provided with the thoughts of the overall group, which greatly helps in deciding where papers go. In my head, I viewed this approach as a map-reduce job. The metaphor may be a stretch, but in the first stage they are just trying to “map” each paper to the best possible category. This task happens in parallel and generates a growing set of results. The second stage “reduces” these sets and delegates the papers into their appropriate sessions. For those reasons, it was pretty interesting to read how they were able to pull this off. Apart from the information-dense UI of their web application, they did an excellent job of simplifying the tasks enough to produce valid results.

    Questions

    • The interface that Frenzy has contains a lot of jam-packed information.  Do you think as a user of this system, you would understand everything that was going on?
    • The approach used by Frenzy breaks the problem of conference session making into two sub-problems: meta-data elicitation and session constraint satisfaction. Do you think these two problems are too broad and can be broken down into further sub-problems?
    • This system gives the power to “delete” categories.  How do you make sure that a category that is valid is not deleted by a user?  Can this system be used on a group that is larger than a conference committee? Ex: MTurk?

    Read More

    Crowd synthesis: extracting categories and clusters from complex data

    Paul André, Aniket Kittur, and Steven P. Dow. 2014. Crowd synthesis: extracting categories and clusters from complex data. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (CSCW ’14). ACM, New York, NY, USA, 989-998.

    Discussion Leader: Nai-Ching Wang

    Summary

    This paper proposes a two-stage approach to guide crowd workers to produce accurate and useful categories from unorganized and ill-structured text data. Although automatic techniques are already available to group text data by topic, manual labeling is still required for analysts to infer meaningful concepts. Assuming crowd workers to be transient, inexpert, and conflicting, this work is motivated by two major challenges of harnessing crowd members to synthesize complex data. One is to produce expert-quality work without requiring domain knowledge. The other is to enforce global constraints when crowd workers only have local views. The proposed approach deals with the former challenge through a re-representation stage, which consists of different combinations of classification, context, and comparison: raw text, classification (Label 1), classification+context (Label 10), and comparison/grouping. The latter challenge is addressed through an iterative clustering stage, which shows existing work (categories) to subsequent crowd workers to enforce global constraints. The results show that the classification-with-context (Label 10) approach produces the most accurate categories at the most useful level of abstraction.
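A minimal sketch of the iterative clustering stage as I read it (my own framing, not the authors’ implementation): each worker sees the categories created so far, which supplies the global context, and either reuses one or adds a new category for the item in front of them:

```python
def iterative_clustering(items, ask_worker):
    """
    ask_worker(item, existing_categories) is a placeholder for a microtask that
    returns either an existing category name or a brand-new one for the item.
    """
    categories = {}                                   # category name -> items in it
    for item in items:
        choice = ask_worker(item, list(categories.keys()))
        categories.setdefault(choice, []).append(item)
    return categories
```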

    Reflections

    This paper resonates with our discussion about human and algorithmic computation pretty well, because it points out why humans are required in the synthesis process, and algorithmic computation was actually used to demonstrate this point. The paper also mentions potential conflicts among crowd workers, but as we can see in the paper, there are also conflicts between the professionals (the two raters). This makes me wonder if there are really right answers. Unfortunately, the paper does not include comparisons among crowd workers’ work to understand how conflicting their answers are. It would also be interesting to see and compare the consistency of experts and crowd workers. Another interesting result is that the raw text condition is almost as good as the classification-plus-context condition except for the quality of abstraction. It feels like, by combining previously discussed person-centric strategies, the raw text condition might perform as well as the classification-plus-context condition or even outperform it. In addition, the choice of 10 items for context and grouping in Stage A seems arbitrary. Based on the results, it seems that more context leads to better results, but is that true? Or is there a best/least amount of context? Also, for grouping, the paper mentions that the selection of groups might (greatly) affect the results, so it would be interesting to see how different selections affect the results. As for the results of coarse-grained recall, it seems strange that the paper does not disclose the original values even though the authors think the coarse-grained recall result is valuable.

    Questions

    • The global constraints are enforced by showing existing categories to subsequent workers. What do you think about this idea? What issues might this approach have? What is your solution?
    • The paper seems to hint that the characters in the labels can be used to measure levels of concepts. Do you agree? Why? What other measures would you suggest for defining levels of concepts?
    • How will you expect quality control to be conducted?

    Read More