Improving Crowd Innovation with Expert Facilitation

Chan, Joel, Steven Dang, and Steven P. Dow. “Improving Crowd Innovation with Expert Facilitation.”

Discussion Leader: Shiwani

Summary:

As the title suggests, this paper studies whether crowd innovation can be improved through expert facilitation. The authors created a new system which builds on the strategies used by expert facilitators during face-to-face brainstorming and uses them to provide high-level “inspirations” to crowd workers, in order to improve their ideation process.

The first study compared the creativity levels of ideas generated by a “guided” crowd with ideas generated without any facilitation. The study showed that ideators in the condition with expert facilitators generated more ideas, produced more creative ideas, and exhibited more divergent thinking. The second study focused on the abilities of the facilitator and involved novice facilitators, keeping all other conditions the same. Surprisingly, the facilitation in this case seemed to negatively influence the ideation process.

Reflections:

This paper touches on and builds on many of the things we have been talking about this semester. One of the key ideas behind <SystemName> is feedback, and its role in improving creativity.

I really liked the paper as a whole. Their approach of adapting expert facilitation strategies from face-to-face brainstorming to a crowd-sourcing application was quite novel and interesting. They took special care to make the feedback synchronous in order to “guide” the ideation, as in real-time brainstorming.

They make a strong case for the need for something such as <SystemName>. Their first point centers on the fact that crowd workers may be hampered by inadequate feedback (as we have discussed before in the “feedback” papers). The second point is that existing systems were not built to scale. With <SystemName> the authors created a system to provide expert feedback while also ensuring it could scale, by keeping the feedback at a higher level rather than individualized.

The authors mention that a good system requires divergent thinking as well as convergence of ideas. Divergence prevents the ideation from getting stuck in local minima, and convergence allows promising solutions to grow into better ideas. This was an interesting way of looking at the creative process, and it situates their choice of using a skilled facilitator as a tool.

The study was quite well-designed with clear use-cases. On the one hand, they wished to study the effect of having a facilitator guide the ideation; on the other, a second study captured the effect of the skill level of the facilitator. The interface design was simple, both for the ideators and the facilitators. I liked the word-cloud idea for the facilitators: it is a neat way to present an overview at that scale. I also liked the “pull” model for inspiration, where the ideators were empowered to ask for inspiration whenever they felt the need for it, as opposed to at pre-determined checkpoints. This deviates somewhat from traditional brainstorming, where experts choose when to intervene, but again, given the scale of the system and the fact that the feedback was not individualized, it makes sense.
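
To make the facilitator’s word-cloud overview concrete, here is a minimal sketch of how such a view might be assembled from the ideators’ submissions. This is my own illustration rather than the authors’ implementation; the example ideas and the stop-word list are made up.

```python
from collections import Counter
import re

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "for", "in", "with", "that"}

def word_frequencies(ideas):
    """Count how often each non-trivial word appears across the submitted ideas.
    A word-cloud renderer could scale font size by these counts, giving the
    facilitator a quick overview of where the crowd's ideas are clustering."""
    counts = Counter()
    for idea in ideas:
        for word in re.findall(r"[a-z']+", idea.lower()):
            if word not in STOP_WORDS and len(word) > 2:
                counts[word] += 1
    return counts

# Hypothetical ideas from crowd ideators (not from the paper's dataset).
ideas = [
    "An app that matches neighbors who want to share tools",
    "Share surplus food with neighbors through a pickup app",
    "A tool-lending library run by the neighborhood",
]
print(word_frequencies(ideas).most_common(5))
```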

The authors do mention that their chosen use-case may limit the generalization of their findings, but the situational, social case was a good choice for an exploratory study.

As with a previous paper we read, creativity had to be explicitly operationalized by the authors, due to its subjective nature. Using novelty and value as evaluative aspects of creativity seems like a good approach, and I liked that the creativity score was the product of these two, reflecting their interactive effect.
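
A quick sketch of that multiplicative scoring, using made-up ratings on an assumed 1-7 scale (not the paper’s exact rubric):

```python
def creativity_score(novelty, value):
    """Multiply novelty and value so that an idea scoring near the bottom on
    either dimension ends up with a low overall creativity score."""
    return novelty * value

# Hypothetical ratings on an assumed 1-7 scale: a novel but useless idea
# versus a moderately novel, genuinely useful one.
print(creativity_score(novelty=7, value=1))  # 7
print(creativity_score(novelty=5, value=4))  # 20
```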

Questions:

  1. A previous paper defined creativity in terms of novelty and practicality (being of use to the user, and being practically feasible to manufacture today), whereas this paper focused only on the “of value” aspect in addition to novelty. Do you think either definition is better than the other?
  2. The paper brings forth the interesting notion that “just being watched” is not sufficient to improve worker output. Do you think this is specific to creativity, and the nature of the feedback the workers received?
  3. For the purpose of scale, the authors gave “inspirations” as opposed to individualized feedback (like Shepherd did). Do you think the more granular, personalized feedback would be helpful in addition to this? In fact, would you consider the inspirations as “feedback”?

Read More

CommunitySourcing: engaging local crowds to perform expert work via physical kiosks

Heimerl, Kurtis, et al. “CommunitySourcing: engaging local crowds to perform expert work via physical kiosks.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2012.

Discussion Leader: Shiwani

Summary:

In this paper, the authors introduce a new mechanism, called community-sourcing, which is intended to facilitate crowd-sourcing when domain experts are required. Community-sourcing is different from other platforms in the sense that it involves the placement of physical kiosks in locations which are likely to attract the right people and aims to involve these people when they have surplus time (e.g. when they are waiting).

The authors defined three cornerstones for the design of a community sourcing system, viz. task selection, location selection and reward selection.

To evaluate the concept, the authors created a system called Umati. Umati consisted of a vending machine interfaced with a touch screen. Users could earn “vending credit” by completing tasks on the touch screen and, once they had earned enough credit, exchange it for items from the vending machine. Although Umati was programmed to accept a number of different tasks, the authors selected the tasks of exam grading and filling out a survey for their evaluation (task selection). Prior research suggests that redundant peer grades correlate strongly with expert scores, which made grading an interesting task to choose, while the survey task helped the authors capture demographic information about the users. Umati was placed in front of the major lecture hall in the Computer Science building, which mainly supported computer science classes (location selection). The authors chose snacks (candies) as the reward, as food is a commonly used incentive for students to participate in campus events (reward selection).
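
As a rough illustration of the earn-and-redeem loop, here is a minimal sketch of the kind of credit accounting such a kiosk would need. The credit values (1 credit per grading task, 5 for the survey) come from the paper; the class, its method names, and the snack cost are my own assumptions.

```python
class KioskAccount:
    """Tracks vending credit earned by completing tasks at the kiosk."""

    # Credit values reported in the paper; snack costs below are assumptions.
    TASK_CREDITS = {"grade_answer": 1, "survey": 5}

    def __init__(self, user_id):
        self.user_id = user_id
        self.credits = 0

    def complete_task(self, task_type):
        """Award credit for a finished task."""
        self.credits += self.TASK_CREDITS[task_type]

    def redeem(self, item_cost):
        """Exchange credit for a vending item if the balance covers it."""
        if self.credits < item_cost:
            return False
        self.credits -= item_cost
        return True

account = KioskAccount("student-42")
for _ in range(4):
    account.complete_task("grade_answer")
account.complete_task("survey")
print(account.credits, account.redeem(item_cost=5), account.credits)  # 9 True 4
```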

For their evaluation, the authors generated 105 sample answers to 13 questions, which were taken from prior mid-term questions for the second-semester undergraduate course CS2. These answers were then evaluated on two systems, Umati and Amazon Mechanical Turk, as well as by experts. Spam detection logic was implemented by adding gold standard questions.

The results showed a strong correlation between Umati evaluations and expert evaluations, whereas Amazon Mechanical Turk evaluations differed greatly. Additionally, and more interestingly, Umati was able to grade exams more accurately, or at a lower cost, than traditional single-expert grading. The authors mention several limitations of their study, such as the duration of the study, privacy concerns, restriction to a particular domain, etc. Overall, Umati looks promising but requires more evaluation.

 

Reflection:

The title of the paper gives the perfect summary of what community-sourcing is about, viz. local crowds, expert work, and physical kiosks. It is a novel and pretty interesting approach to crowd-sourcing.

I really like the case the authors make for this new approach. They talk about the limitations and challenges in accessing experts. For example, some successful domain-driven platforms access experts by running competitions, which is not the best approach for high-volume work. Others seek to identify the “best answer” (StackOverflow), which is not great for use-cases such as grading. Secondly, there are many natural locations where there is cognitive surplus (e.g. academic buildings, airport lounges, etc.), and the individuals in these locations can serve as “local experts” under certain conditions. Thirdly, having a physical kiosk as a reward system, and thereby giving out tangible rewards, seems like a great idea.

I also like that the authors situate community sourcing very well and state where it would be applicable, that is, specifically for “short-duration, high volume tasks that require specialized knowledge or skills which are specific to a community but widely available within the community”. This is perhaps a very narrow niche, but an interesting one, and the authors give some examples of where they could see it being applied (grading, tech support triage, market research, etc.).

The design of Umati, the system built for the evaluation, was quite thorough and clearly based on the three chief design considerations put forward by the authors (location, reward, task). Every aspect seems to have been thought through and reviewed. An example is the fact that the survey task was worth more credit (5 credits) than the grading tasks (1 credit each), which I assume was to encourage users to provide their demographic information.

Spam detection relied on gold standard questions, with exclusion of participants who failed more than one such question. Interestingly, while for Umati this meant that the user was blacklisted (based on ID), the data up to the point of blacklisting was still used in the analysis. On the other hand, for AMT, two sets of data were presented: one including all responses and one filtered based on the spam detection criteria.
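
Here is a minimal sketch of that gold-standard check as I read it. The “fail more than one gold question” threshold and the keep-data-up-to-the-blacklist-point behaviour come from the paper’s description; the function and data structures are my own assumptions.

```python
def evaluate_user(responses, gold_answers, max_gold_failures=1):
    """Walk a user's responses in order, blacklisting once they fail more than
    `max_gold_failures` gold-standard questions.

    Returns the responses submitted before the blacklist point (which, per the
    paper, were still kept in the Umati analysis) and whether the user was
    ultimately blacklisted.
    """
    kept, failures = [], 0
    for question_id, answer in responses:
        if question_id in gold_answers and answer != gold_answers[question_id]:
            failures += 1
            if failures > max_gold_failures:
                return kept, True  # blacklisted from this point on
        kept.append((question_id, answer))
    return kept, False

# Hypothetical data: q2 and q5 are gold questions with known correct grades.
gold = {"q2": "B", "q5": "A"}
responses = [("q1", "C"), ("q2", "D"), ("q3", "B"), ("q5", "C"), ("q6", "A")]
print(evaluate_user(responses, gold))
# ([('q1', 'C'), ('q2', 'D'), ('q3', 'B')], True)
```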

Another interesting point is that about 19% of users were blacklisted. The authors explain that this happened in some cases because users forgot to log out, and in others because users were merely exploring and did not realize that there would be repercussions. I wonder whether the authors performed any pilot tests to catch this.

The paper presents a few more interesting ideas such as the double-edged effects of group participation, as well as the possibility that the results may not be generalizable due to the nature of the study (specific domain and tasks). I did not find any further work performed by the authors to extend this, which was a little unfortunate. There was some work related to community sourcing, but along very different lines.

Last, but not least, Umati had a hidden benefit for the users: grading tasks could potentially improve their understanding of the material, especially when the tasks were performed through discussions as a group. This opens up the potential for instructors to encourage their students to participate, perhaps in exchange for some class credit.

Discussion:

  1. The authors decided to include the data up to the failing of the second gold standard question for Umati users. Why do you think they chose to do that?
  2. Do you think community sourcing would be as effective if it was more volunteering-based, or if the reward was less tangible (virtual points, raffle ticket, etc)?
  3. 80% of the users had never participated in a crowdsourcing platform. Could this have been a reason for its popularity? Do you think interest may go down over a period of time?
  4. The paper mentions issues such as the vending machine running out of snacks, and people getting blacklisted because they did not realize there would be repercussions. Do you think having some controlled pilot tests would have mitigated these issues?
  5. None of the AMT workers passed the CS qualification exam (5 MCQs on computational complexity and Java), but only 35% failed the spam detection. The disparity in pay was $0.02 between the normal HIT and the HIT with the qualification exam. Do you think the financial incentive was not enough, or was the gold standard not as effective?
  6. The authors mentioned an alternative possibility of separating the work and reward interfaces, in order to scale the interface both in terms of users and tasks. Do you think this would still be effective?

Read More

Crowds in two seconds: enabling realtime crowd-powered interfaces

Bernstein, Michael S., et al. “Crowds in two seconds: Enabling realtime crowd-powered interfaces.” Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.

Discussion Leader: Shiwani

YouTube video for a quick overview: https://www.youtube.com/watch?v=9IICXFUP6MM

Summary

Crowd-sourcing has been successfully used in a variety of avenues, including interactive applications such as word processors, image searches, etc. However, a major challenge is the latency in returning a result to the user. If an interface takes more than 10 seconds to react, the user is likely to lose focus and/or abandon the interface. Near real-time techniques at the time required at least 56 seconds for simple tasks and 22 minutes or longer for more complex workflows.

In this paper, the authors propose techniques for recruiting and effectively using synchronous crowds in order to provide real-time, crowd-powered interfaces. The first technique, called the retainer model, involves hiring workers in advance and placing them on hold by paying them a small amount. When a task is ready, the workers are alerted and are paid an additional amount on completion of the task. The paper also presents empirical guidelines for this technique. The second technique introduced in the paper is rapid refinement. It is a design pattern for algorithmically recognizing crowd agreement early on and rapidly reducing the search space to identify a single result.

The authors created a system called Adrenaline to validate the retainer model and rapid refinement. Adrenaline is a smart photo shooter, designed to find the most photogenic moment by capturing a short (10 second) video clip and using the crowd to identify the best frame.

Additionally, the authors were interested in looking at other applications for real-time crowd-powered interfaces. For this, they created two systems, Puppeteer and A|B. Puppeteer is intended for creative content generation tasks and allows the designer to interact with the crowd as they work. A|B is a simple platform for asking A-or-B questions, with the user providing two options and asking the crowd to choose one based on pre-specified criteria.

The results of the experiments suggest that the retainer model is effective in assembling a crowd about two seconds after the request is made, and that a small reward for quickness mitigated the longer reaction times caused by longer retainer periods. It was also found that rapid refinement enabled small groups to select the best photograph faster than the single fastest member. However, forcing agreement too quickly sometimes affected the quality. For Puppeteer, there was a small latency due to the complexity of the task, but the throughput rates were constant. On the other hand, for A|B, responses were received in near real-time.
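
The rapid refinement pattern is easier to see as pseudocode than as prose. The sketch below is my own rough rendering of the idea (split the current range into regions, watch where workers’ positions cluster, and zoom into the most popular region when there is clear agreement); it is not the authors’ algorithm, thresholds, or interface, and the “workers” are simulated.

```python
import random

def rapid_refine(lo, hi, get_worker_positions, regions=3, rounds=6):
    """Sketch of the rapid refinement idea: split the current range [lo, hi)
    into a few regions, see where workers' current positions cluster, and zoom
    into the most popular region whenever a majority of workers agree on it."""
    for _ in range(rounds):
        if hi - lo <= 1:
            break
        positions = get_worker_positions(lo, hi)
        width = max(1, (hi - lo) // regions)
        counts = [0] * regions
        for p in positions:
            counts[min((p - lo) // width, regions - 1)] += 1
        best = max(range(regions), key=lambda r: counts[r])
        if counts[best] > len(positions) // 2:  # clear agreement: zoom in
            lo = lo + best * width
            hi = min(hi, lo + width)
    return (lo + hi) // 2  # the frame the crowd has converged toward

# Hypothetical workers who all gravitate toward frame ~42 of a 300-frame clip.
random.seed(0)
def workers(lo, hi):
    return [min(max(lo, int(random.gauss(42, 5))), hi - 1) for _ in range(6)]

print(rapid_refine(0, 300, workers))
```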

Reflections

This paper ventures into an interesting space and brings me back to the idea of the “Wizard of Turk”, where users interact with a system and the system’s responses are generated through human intelligence. The reality of machine learning at the moment is that there are still areas which are subjective and require a human in the loop. Tasks such as identifying the most photogenic photo, or voting on whether a red sweater looks better than a black sweater, are classic examples of such subjective aspects. This is demonstrated, in part, through the Adrenaline experiment, where the quality of the photo chosen through crowd-sourcing was better than the computer-vision-generated photo. For subjective voting (A-or-B), users might even prefer to get input from other people, as opposed to the machine. Research in media equations and affect would indicate that this is likely.

The authors talk about their vision of the future, where crowd-sourcing markets are designed for quick requests. Although the authors have demonstrated that it is possible to assemble synchronous crowds and use them to perform real-time tasks with quick turn-around times, a lot more thought needs to go into the design of such systems. For example, if many requesters wanted to have workers on “retainer”, workers could easily accept tasks from multiple requesters and simply make some money for being on hold. The key idea of a retainer is not to prevent the worker from accepting other tasks while they wait. These two ideas seem at loggerheads with each other. Additionally, this might introduce higher latency, which perhaps could be mitigated with competitive quickness bonuses. The authors do not explicitly state how much money the workers were paid for completion of the task, and I wonder how these amounts compared to the retainer rates they offered.

For the Adrenaline experiment, the results compared the best photo identified from a short clip through a variety of techniques, viz. Generate-and-vote, Generate-one, Computer Vision, Rapid Refinement, and Photographer. It would have been interesting to see two additional conditions: a single photograph taken by an expert photographer, and a set of photographs taken by a photographer used as input to the techniques.

Questions:

1. The Adrenaline system allows users to capture the best moment, and the cost per image is about $0.44. The authors envision this cost going down to about $0.10. Do you think users would be willing to pay for such an application, especially given that Android phones such as the Samsung Galaxy have a “capture best photo” mode, whereby multiple images are taken at short intervals and the user can select the best one to save?

2. Do you think that using the crowd for real-time responses makes sense?

3. For the rapid refinement model, one of the issues mentioned was that it might stifle individual expression, and that a talented worker’s input might get disregarded as compared to that of 3-4 other workers. Voting has the same issue. Can you think of ways to mitigate this?

4. Do we feel comfortable out-sourcing such tasks to crowd-workers? It is one thing when it is a machine…

Read More

CrowdForge: Crowdsourcing Complex Work

Kittur, Aniket, Boris Smus, Susheel Khamkar, and Robert E. Kraut. “CrowdForge: Crowdsourcing Complex Work.” Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.

Discussion Leader: Shiwani Dewal

Summary

CrowdForge is a framework which enables the creation and completion of complex, inter-dependent tasks using crowd workers. At the time of writing the paper (and even today), platforms like Amazon Mechanical Turk facilitated access to micro-workers who complete simple, independent tasks which require little or no cognitive effort. Complex tasks, traditionally, require more coordination, time and cognitive effort, especially for the person managing or overseeing the effort. These challenges become even more acute when crowd workers are involved.

To address this issue, the authors present their framework, CrowdForge, along with case studies which were accomplished through a web-based prototype. The CrowdForge framework is drawn from distributed computing (MapReduce) and consists of three steps, viz. partition, map and reduce. The partitioning step breaks a higher-level task into single units of work. The mapping step involves the units of work being assigned to workers; the same task may be assigned to several workers to allow for improvements and quality control. The final step is reduction, in which the units of work are combined into a single output, which is essentially the solution for the higher-level task.
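
To make the partition/map/reduce flow concrete, here is a minimal sketch of the pattern applied to something like the Wikipedia-article case study. This is my own illustration: CrowdForge posts these steps as HITs to real crowd workers, whereas the “workers” here are stand-in functions, and the section names and merge heuristic are assumptions.

```python
def partition(topic):
    """Break a high-level task into independent units of work."""
    # e.g. turn "write an article about New York City" into section-writing units.
    return [f"Write the '{section}' section of an article about {topic}"
            for section in ("History", "Geography", "Culture", "Economy")]

def map_step(unit, workers, redundancy=2):
    """Assign the same unit to several workers for quality control."""
    return [worker(unit) for worker in workers[:redundancy]]

def reduce_step(results_per_unit):
    """Combine the completed units back into a single output for the task."""
    # Naively keep the longest submission per unit and concatenate; the paper
    # also explores voting and further map/reduce merging steps instead.
    return "\n\n".join(max(results, key=len) for results in results_per_unit)

# Stand-in "workers" (in CrowdForge these would be crowd workers answering HITs).
workers = [lambda u: f"[draft A] {u}", lambda u: f"[slightly longer draft B] {u}"]

units = partition("New York City")
article = reduce_step([map_step(u, workers) for u in units])
print(article)
```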

The framework was tested through several case studies. The first case study was about writing a Wikipedia article on New York City. Surprisingly, the articles produced by groups of workers across HITs were rated, on average, as high as the Simple English Wikipedia article on New York City and higher than full articles written by individuals as part of a higher-paying HIT. Quality control was tested both through further map and reduce steps that merged results and through voting, and was deemed more effective through merged efforts. The second case study involved collating information for researching purchase decisions; the authors do not provide any information about the quality of the resulting information. The last case study dealt with the complex flow of turning an academic paper into a newspaper article for the general public. The paper discusses the steps used to generate news leads (the hook for the paper) and a summary of the researchers’ work, as well as the quality of the resulting work.

The CrowdForge approach looked very promising, as exemplified by the case studies. It also had a few disadvantages, such as not supporting iterative flows, assuming that a task can in fact be broken down into single units of work, and the possible overlap between the results of a task due to the lack of communication between workers. The authors concluded by encouraging researchers and task designers to consider crowd sourcing for complex tasks, and to push the limits of what they could accomplish through this market.

Reflections

The authors have identified an interesting gap in the crowd sourcing market: the ability to get complex tasks completed. And although requesters may well have broken their tasks down into HITs in the past and combined the results on their own end, CrowdForge’s partition-map-reduce framework seems like it could alleviate the challenge and streamline the process, to some extent.

I like the way the partition-map-reduce framework is conceptualized. It is fairly intuitive and seems to have worked well for the case studies. I am a little surprised (and maybe skeptical?) that the authors did not include the results of the second case study or more details for the rest of the third case study.

The other aspect I really liked about the paper was the effort to identify and test alternative or creative ways to solve common crowd sourcing problems. For example, the authors came up with the idea of using further map-and-reduce steps in the form of merging as an alternative to voting on solutions. Additionally, they came up with the consolidate and exemplar patterns for the academic paper case study, to alleviate the problems of the high complexity of the paper and the effort workers expected to put in.

The paper mentions in its section on limitations that there are tasks which cannot be decomposed, and that for such tasks another market with skilled or motivated workers should be considered. This also brings me back to the notion that perhaps crowd-sourcing in the future will look more like crowd-sourcing for a skill-set, a kind of skill-based consulting.

In conclusion, I think that the work presented in the paper looks very promising, and it would be quite interesting to see the framework being applied to other use-cases.

Discussion

1. The paper mentions that using further map and reduce steps to increase the quality of the output, as opposed to voting, generated better results. Why do you think that happened?

2. There may be tasks which are too complex to be decomposed, or decomposed tasks which require a particular skill set. Some crowd sourcing platforms accomplish this through having an “Elite Taskforce”. Do you think this is against the principles of crowd sourcing, that is, that a task should ideally be available to every crowd worker or is skill-based crowd sourcing essential?

3. CrowdForge breaks tasks up, whereas TurKit allowed iterative work-flows, and the authors talk about their vision to merge the two approaches. What do you think would be some key benefits of such a merged approach?

4. The authors advocate for pushing the envelope when it comes to the kind of tasks which can be crowd sourced. Thoughts?

Read More