04/07/20 – Sukrit Venkatagiri – CrowdScape: Interactively Visualizing User Behavior and Output

Paper: Jeffrey Rzeszotarski and Aniket Kittur. 2012. CrowdScape: interactively visualizing user behavior and output. In Proceedings of the 25th annual ACM symposium on User interface software and technology (UIST ’12), 55–62. https://doi.org/10.1145/2380116.2380125

Summary:

Crowdsourcing has been used to perform intelligent tasks and knowledge work online, at scale, and at a lower price. However, there are many challenges in controlling quality in crowdsourcing. The paper notes that, in prior approaches, quality control was done algorithmically, either by evaluating output against a gold standard or by looking at worker agreement and behavior. Yet these approaches have many limitations, especially for creative tasks or other tasks that are highly complex in nature. This paper presents a system, called CrowdScape, that supports manual (human) evaluation of complex crowdsourcing task results through an interactive visualization with a mixed-initiative machine learning back-end. The paper describes the features of the system as well as its uses through four very different case studies: a translation task from Japanese to English; a somewhat unusual task asking workers to pick their favorite color; a task asking workers to write about their favorite place; and, finally, a video tagging task. The paper concludes with a discussion of the findings.

Reflection:

Overall, I really liked the paper and the CrowdScape system, and I found the multiple case studies really interesting. I especially liked the fact that the case studies varied in terms of complexity, creativity, and open-endedness. However, I found the color-picker task a little off-beat and wonder why the authors chose that task. 

I also appreciate that the system is built on top of existing work, e.g., Amazon Mechanical Turk (a necessity), as well as Rzeszotarski and Kittur’s Task Fingerprinting system for capturing workers’ behavioral traces. The scenario describing the more general use case was also very clear and concise. The fact that CrowdScape utilizes two diverse data sources, as opposed to just one, is interesting. This makes it easier to triangulate the findings and to observe any discrepancies in the data. More specifically, CrowdScape looks at workers’ behavioral traces as well as their output. This allows one to differentiate between workers in terms of their “laziness/eagerness” as well as the actual quality of their output. The system also provides an aggregation of the two kinds of features, and all of these are displayed as visualizations, which makes it easy for a requester to view tasks and quickly discard or include work.

However, I wonder how useful these visualizations would be for tasks such as surveys, or for tasks that are less open-ended. Further, although the visualizations are useful, I wonder whether they should be used in conjunction with gold standard datasets, and how useful that combination would be. Although the paper demonstrates the potential uses of the system via case studies, it does not show whether real users find it useful. Thus, an evaluation with real-world users might help.

Questions:

  1. What do you think about the case study evaluation? Are there ways to improve it? How?
  2. What features of the system would you use as a requester?
  3. What are some drawbacks to the system?

04/08/2020 – Vikram Mohanty – CrowdScape: Interactively Visualizing User Behavior and Output

Authors: Jeffrey M Rzeszotarski, Aniket Kittur

Summary

This paper proposes CrowdScape, a system that supports human evaluation of crowd work through interactive visualization of behavioral traces and worker output, combined with mixed-initiative machine learning. Several case studies are discussed to showcase the utility of CrowdScape.

Reflection

The paper addresses the issue of quality control, a long-standing problem in crowdsourcing, by combining two standalone approaches that researchers currently adopt: a) inferring quality from worker behavior, and b) analyzing worker output. Combining these is advantageous as it provides a more complete picture, either by offering corroborating evidence about ideal workers or, in some cases, complementary evidence that helps identify “good” workers. Just analyzing the worker output might not be enough, as there is an underlying chance that it is no better than a random coin toss.

Even though it was only a brief parenthetical note, I really liked that the authors explicitly sought permission to record the workers’ interaction logs.

Extrapolating to other similar or dissimilar behavior using machine learning seems intuitive here, as the data and features (i.e., the building blocks) of the model are meaningful and directly relevant to the task, and the model is not a black box. As a result, it’s not surprising to see it work almost everywhere. In the one case where it didn’t work, it made up for it by showing that the complementary case works. This sets a great example for designing predictive models on top of behavioral traces that actually work.

Moreover, the whole system was built to be task-agnostic, and the evaluations justified that choice. However, I am not sure whether the system is best suited for recruiting multiple workers for a single task, or for identifying a set of good workers to subsequently retain for other tasks in the pipeline. I am guessing it is the latter, as the former seems like an expensive approach to getting high-quality responses.

On the other hand, I feel the implications of this paper go beyond just crowdsourcing quality control. CrowdScape, or a similar system, can provide assistance for studying user behavior/experience in any interface (web for now), which is important for evaluating interfaces.

Questions

  1. Does your evaluation include collecting behavioral trace logs? If so, what are some of your hypotheses regarding user behavior?
  2. How do you plan on assessing quality control?
  3. What kind of tasks do you see CrowdScape being best applicable for? (e.g. single task, multiple workers)

04/08/2020 – Vikram Mohanty – Agency plus automation: Designing artificial intelligence into interactive systems

Authors: Jeffrey Heer

Summary

The paper discusses interactive systems in three different areas, namely data wrangling, exploratory analysis, and natural language translation, to showcase the use of “shared representations” of tasks, where machines can augment human capabilities instead of replacing them. All the systems highlight balancing the complementary strengths and weaknesses of humans and machines while promoting human control.

Reflection

This paper makes the case for intelligence augmentation, i.e., augmenting human capabilities with the strengths of AI rather than striving to replace them. Developers of intelligent user interfaces can build effective collaborative systems by carefully designing the interface so that the AI component “reshapes” the shared representations that users contribute to, rather than “replacing” them. This is always a complex task, and it therefore requires scoping down from the notion that AI can automate everything and focusing instead on these editable shared representations. It has another benefit: it exploits AI in a sum-of-parts manner rather than as an end-to-end mechanism, where the AI is more likely to be erroneous. The paper discusses three case studies where a mixed-initiative deployment was successful in meeting user expectations in terms of experience and output.

It was particularly interesting to see participants complaining that the Voyager system, despite being good, spoilt them because it made them think less. This can hamper the adoption of such systems. A reasonable design implication here is to allow users to choose the features they want, or to give them the agency to adjust the degree of automation/suggestions. This also suggests the importance of conducting longitudinal studies to understand how users use the different features of an interface, i.e., whether they use one but not another.

According to some prior work, machine-suggested recommendations are known to perpetuate filter bubbles; that is, users are exposed to a similar set of items and miss out on everything else. Here, the Voyager recommendations work in contrast to that prior work by allowing users to explore the space, analyze charts and data points they wouldn’t otherwise notice, and combat confirmation bias. In other words, the system does what it claims to do, i.e., augment human capabilities in a positive sense using the strengths of the machine.

Questions

  1. In the projects you are proposing for the class, does the AI component augment human capabilities or strive to (eventually) replace them? If so, how?
  2. How do you think developers should cater to cases where users are less likely to adopt a system because it impedes their creativity?
  3. Do you think AI components should allow users to explore the space more than they normally would? Are there any possible pitfalls (information overload, unnatural tasks/interactions, etc.)?

4/8/20 – Akshita Jha – Agency plus automation: Designing artificial intelligence into interactive systems

Summary:
“Agency plus automation: Designing artificial intelligence into interactive systems” by Heer discusses the drawbacks of using artificial intelligence techniques to automate tasks, especially those considered repetitive and monotonous. However, this framing presents a monumentally optimistic point of view by completely ignoring the ghost work, or invisible labor, that goes into ‘automating’ these tasks. This gap between crowd work and machine automation highlights the need for design and engineering interventions. The author tries to make use of the complementary strengths and weaknesses of the two: the creativity, intelligence, and world knowledge of crowd workers, and the low cost and minimal cognitive overhead of automated systems. The author describes in detail case studies of interactive systems in three different areas: data wrangling, exploratory analysis, and natural language translation. These systems combine computational support with interactive interfaces. The author also talks about shared representations of tasks that include both human intelligence and automated support in the design itself. The author concludes that “neither automated suggestions nor direct manipulation plays a strictly dominant role” and that “a fluent interleaving of both modalities can enable more productive, yet flexible, work.”

Reflections:
There is a lot of invisible work that goes into automating a task. Most automated tasks require hundreds, if not thousands, of annotations. Machine learning researchers turn a blind eye to all the effort that goes into those annotations by calling their systems ‘fully automated’. This view is exclusionary and does not do justice to the vital but seemingly trivial work done by crowd workers. One area to focus on is the open question of shared representations: is it possible to integrate data representation with human intelligence? If yes, is that useful? Data representation often involves constructing a latent space to reduce the dimensionality of the input data and obtain concise, meaningful information. Such representations may or may not exist for human intelligence; borrowing from social psychology might help in such a scenario. There are other ways to approach this. For example, the author focuses on building interactive systems with ‘collaborative’ interfaces. The three interactive systems, Wrangler, Voyager, and PTM, do not distribute tasks equally between humans and automated systems. The automated methods prompt users with suggestions, which the end user reviews; the final decision-making power lies with the end user. It would be interesting to see what the results would look like if the roles were reversed and the system was turned on its head. An interesting case study could be one where the suggestions were given by the end user and the ultimate decision-making capability rested with the system. Would the system still be as collaborative? What would the drawbacks of such a system be?

Questions:

1. What are your general thoughts on the paper?
2. What did you think about the case studies? Which other case studies would you include?
3. What are your thoughts on evaluating systems with shared representations? Which evaluation criteria can we use?

4/8/20 – Akshita Jha – CrowdScape: Interactively Visualizing User Behavior and Output

Summary:
“CrowdScape: Interactively Visualizing User Behavior and Output” by Rzeszotarski and Kittur talks about crowdsourcing and the importance of interactive visualization that draws on the complementary strengths and weaknesses of crowd workers and machine intelligence. Crowdsourcing helps distribute work, but quality control approaches for it are often not scalable. Crowd-organizing algorithms like Partition-Map-Reduce, Find-Fix-Verify, and Price-Divide-Solve are used to easily distribute, merge, and check work in crowdsourcing. However, they aren’t very accurate or useful for complex, subjective tasks. CrowdScape combines worker behavior with worker output using interaction, visualization, and machine learning, supporting the human evaluation of crowd work. It enables the user to form and test hypotheses about the crowd and to refine their selections through a sensemaking loop. The paper proposes novel techniques for exploring crowd workers’ products, visualizations of crowd worker behavior, tools for classifying crowd workers, and an interface for interactively exploring these results using mixed-initiative machine learning.

Reflections:
Prior work has examined worker behavior or worker output in isolation, but combining the two is very fruitful for generating mental models of workers and building a feedback loop. Visualizing workers’ processes helps us understand their cognitive process and thus judge the end product better. However, CrowdScape can only be used on web pages that allow the injection of JavaScript; it is not useful when this is blocked or for non-web, offline interfaces. The set of aggregate features used might not always provide useful feedback. Existing quality control measures are also not very different from CrowdScape when a clear, consensus ground truth exists, such as identifying a spelling error; in such cases, the effort put into learning and using CrowdScape may not pay off. In some cases, the behavioral traces of the worker may not be very indicative, such as when they work in a different editor and finally copy and paste the result into the task. Tasks that are heavily cognitive or entirely offline are also not well suited to the general methods supported by CrowdScape. The system relies heavily on detailed behavioral traces such as mouse movements, scrolling, keypresses, focus events, and clicks, and it should be ensured that this intrusiveness, and the implied decrease in efficiency, is justified by the accuracy of the behavioral measurement. An interesting point to note here is that this tool can become privacy-intrusive if care is not taken. We should ensure that the tool keeps evolving as crowd work becomes increasingly relevant and the tool becomes vital for understanding the underlying data and crowd behavior. Apart from these reflections, I would just like to point out that the graphs the authors use in the paper convey their results really well. I feel this is one detail that is vital but easily overlooked in most papers.
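
To make the scale of this trace collection concrete, here is a minimal TypeScript sketch of the kind of instrumentation an injected script might use; the event list, payload shape, and /log-trace endpoint are my own assumptions, not code from the paper.

  // Sketch of CrowdScape-style trace capture, assuming it runs inside a task
  // page that permits injected scripts. The payload shape and endpoint below
  // are hypothetical, not the paper's actual API.
  interface TraceEvent {
    type: string;      // e.g. "mousemove", "scroll", "keydown", "focusin", "click"
    timestamp: number; // milliseconds since logging started
    target: string;    // tag name of the element that received the event
  }

  const trace: TraceEvent[] = [];
  const start = performance.now();

  function record(type: string): void {
    document.addEventListener(type, (e) => {
      trace.push({
        type,
        timestamp: performance.now() - start,
        target: (e.target as HTMLElement | null)?.tagName ?? "document",
      });
    }, { capture: true, passive: true });
  }

  ["mousemove", "scroll", "keydown", "focusin", "click"].forEach(record);

  // On submit, ship the trace alongside the worker's answer (hypothetical endpoint).
  function submitTrace(assignmentId: string): Promise<Response> {
    return fetch("/log-trace", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ assignmentId, trace }),
    });
  }

Even this toy version makes the privacy concern above tangible: every keypress and mouse movement on the page ends up in the log.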

Questions:
1. What are your general thoughts about this paper?
2. Do you agree with the methodology followed?
3. Do you approve of the interface? Would you make any changes to the interface?

04/09/2020 – Mohannad Al Ameedi – Agency plus automation: Designing artificial intelligence into interactive systems

Summary

In this paper, the author proposes multiple systems that combine the power of artificial intelligence and human computation and overcome each one’s weaknesses. The author argues that automating all tasks can lead to poor results, as a human component is needed to review and revise the output to get the best results. The author uses autocomplete and spell checkers as examples to show that artificial intelligence can offer suggestions, which humans can then review, revise, or dismiss. The author proposes different systems that use predictive interaction to partially automate users’ tasks so they can focus more on the things they care about. One of these systems, Data Wrangler, can be used by data analysts during data preprocessing to help them clean up data, saving more than 80% of that work. Users need to set up some data mappings and can accept or reject the suggestions. The author also proposes a project called Voyager that helps with data visualization for exploratory analysis by suggesting visualization elements. The author suggests using AI to automate repetitive tasks and offer the best suggestions and recommendations, letting the human decide whether to accept or reject them. This kind of interaction can improve both the machine learning results and the human’s experience.
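
To make the accept-or-reject loop above concrete, here is a toy TypeScript sketch of predictive interaction, not Wrangler’s real API: the analyst demonstrates one split, the system infers a delimiter, previews the transform on every row, and the analyst accepts or rejects the suggestion.

  // Toy sketch of predictive interaction; the delimiter list and sample rows
  // are invented for illustration.
  type Transform = (row: string) => string[];

  // Infer a delimiter from a single demonstrated example.
  function suggestSplit(exampleRow: string, exampleParts: string[]): Transform | null {
    for (const delim of [",", ";", "\t", "|"]) {
      const parts = exampleRow.split(delim);
      if (parts.length === exampleParts.length &&
          parts.every((p, i) => p === exampleParts[i])) {
        return (row) => row.split(delim);
      }
    }
    return null; // nothing matches the demonstration, so make no suggestion
  }

  // The analyst demonstrates the first row; the tool previews the rest.
  const rows = ["2020-04-08|Wrangler|accepted", "2020-04-09|Voyager|rejected"];
  const suggestion = suggestSplit(rows[0], ["2020-04-08", "Wrangler", "accepted"]);
  if (suggestion) {
    console.log(rows.map(suggestion)); // the analyst reviews, then accepts or rejects
  }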

Reflection

I found the material presented in the paper very interesting. Many of the ongoing discussions about whether machines can replace humans are addressed in this paper. The author argues that machines can do well with the help of humans, and that the human in the loop will always be necessary.

I also like the idea of the Data Wrangler system, as many data analysts and developers spend considerable time cleaning up data, and most of the steps are repeated regardless of the type of data. Automating these steps will help a lot of people do more effective work and focus on the problem they are trying to solve rather than on things that are not directly related to it.

I agree with the author that humans will always be in the loop, especially in systems that will be used by humans. Advances in AI need humans to annotate or label data in order to work effectively, and also to measure and evaluate the results.

Questions

  • The author mentioned that the Data Wrangler system can be used by data analysts to help with data preprocessing. Do you think this system can also be used by data scientists, since most machine learning and deep learning projects require data cleanup?
  • Can you give other examples of AI-infused interactive systems that can help in different domains, be deployed into production environments for use by a large number of users, and scale well with increased load and demand?

04/08/20 – Jooyoung Whang – CrowdScape: Interactively Visualizing User Behavior and Output

In this paper, the authors try to help MTurk requesters by providing them with an analysis tool called “CrowdScape.” CrowdScape is an ML-plus-visualization tool for viewing and filtering MTurk worker submissions based on the workers’ behavior. The user of the application can set thresholds on certain behavioral attributes such as time spent or typing delay. The application takes two inputs: worker behavior and results. The behavior input is time-series data of user activity; the result is what the worker submitted for the MTurk task. The authors focused on finding similarities among the answers to graph on parallel coordinates. They evaluated their approach by launching four different tasks and recording worker behavior and results, and they conclude that their approach is useful.
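
As a rough illustration of the thresholding described above, the TypeScript sketch below aggregates a raw trace into a few behavioral features and filters submissions against requester-chosen cutoffs; the feature names and thresholds are my own, not CrowdScape’s.

  // A sketch (not CrowdScape's actual code) of turning a raw event trace into
  // the kind of aggregate behavioral features a requester can threshold on.
  interface TraceEvent { type: string; timestamp: number }

  interface Submission {
    workerId: string;
    answer: string;
    trace: TraceEvent[];
  }

  interface BehaviorFeatures {
    totalTimeMs: number;     // time from first to last recorded event
    keypressCount: number;
    meanTypingGapMs: number; // average gap between consecutive keydown events
  }

  function summarize(s: Submission): BehaviorFeatures {
    const times = s.trace.map((e) => e.timestamp);
    const keys = s.trace.filter((e) => e.type === "keydown").map((e) => e.timestamp);
    const gaps = keys.slice(1).map((t, i) => t - keys[i]);
    return {
      totalTimeMs: times.length ? Math.max(...times) - Math.min(...times) : 0,
      keypressCount: keys.length,
      meanTypingGapMs: gaps.length ? gaps.reduce((a, b) => a + b, 0) / gaps.length : 0,
    };
  }

  // Keep only submissions whose behavior clears the requester's thresholds.
  function filterByThreshold(
    submissions: Submission[],
    minTimeMs: number,
    minKeypresses: number,
  ): Submission[] {
    return submissions.filter((s) => {
      const f = summarize(s);
      return f.totalTimeMs >= minTimeMs && f.keypressCount >= minKeypresses;
    });
  }

In the actual tool, cutoffs like these would come from interactively brushing the visualizations rather than from hard-coded numbers.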

This paper’s approach of integrating user behavior and results to filter good output was interesting, although I think the system needs to overcome one problem to be effective. The problem lies in the area of ethics. The authors explicitly stated that they obtained consent from their pool of workers to collect behavioral data. However, some MTurk requesters with ill intentions may decide not to do so. This could result in intrusion into private information and even lead to theft. On the other hand, once consent is obtained, the worker becomes aware of being monitored. This could result in unnatural behavior, which is undesirable for system testing.

I thought the individual visualized graphs and figures were effective for understanding and filtering by user behavior. However, the entire CrowdScape interface looked a bit overpacked with information. I think a small feature to show or hide some of the graphs would be desirable. The same problem existed in another information exploration system from a project I’ve worked on. In my experience, an effective solution was to provide a set of menus that hierarchically sorted the attributes.

These are the questions that I had while reading the paper:

1. A big purpose of CrowdScape is that it can be used to filter and retrieve a subset of the results (those thought to be high quality). In what other ways could this system be used? For example, I think it could be used for rejecting undesired results. Suppose you needed 1000 results and you launched 1000 HITs. You know you will get some low-quality results. However, since there are so many submissions, it’ll take forever to filter by eye. CrowdScape would help accelerate the process.

2. Do you think you can use CrowdScape for your project? If so, how would you use it? CrowdScape is useful if you, the researcher, are the endpoint of the MTurk task (as in, the result is ultimately used by you). My project uses the results from MTurk in a systematic way without them ever reaching me, so I don’t think I’ll use CrowdScape.

3. Do you think the graphs available in CrowdScape are enough? What other features would you want? For one, I’d love to have a box plot for the user behavior attributes.

04/08/20 – Jooyoung Whang – Agency plus automation: Designing artificial intelligence into interactive systems

This paper seeks to investigate a method for achieving AI + IA, that is, enhancing human performance using automated methods without completely replacing it. The author notes that effective automation should, first, bring significant value; second, be unobtrusive; third, not require precise user input; and finally, adapt. The author takes these points into account and introduces three interactive systems that he built. All these systems use machine computing to handle the initial or small repetitive tasks and rely on human computing to make corrections and improve quality. They are all collaborative systems where AI and humans work together to boost each other’s performance. The AI part of each system tries to predict user intentions while the human part drives the work.

This paper reminded me of Smart-Built Environments (SBEs), a term I learned in a Virtual Environments class. An SBE is an environment where computing is seamlessly integrated into the environment and interaction with it is very natural. It is capable of “smartly” providing appropriate services to humans in a non-intrusive way. For example, a system where the lights automatically turn on when a person enters a room is a smart feature. I felt that this paper was trying to build something similar in a desktop environment. One core difference is that SBEs also try to tackle immersion and presence (terms frequently used for evaluating virtual environments). I wonder if the author knows about SBEs or got his project ideas from them.

While reading the paper, I wasn’t sure the author handled the “unobtrusive” part effectively. One of the introduced systems, Wrangler, is an assistive tool for preprocessing data. It tries to predict user intention by observing certain user behavior and recommends available data transformations in a side panel. I believe this approach is meant to mimic Google’s query auto-completion feature. However, I don’t think it works as well as Google’s auto-completion. Google’s auto-complete suggestions appear right below where the user is typing, whereas Wrangler shows its suggestions off in a side panel. This requires users to avert their eyes from the point of the previous interaction, and that is obtrusive.

These are the questions that I had while reading the paper:

1. Do you know any other systems that try to seamlessly integrate AI and human tasks? Is that system effective? How so?

2. The author of this paper mostly uses AI to predict user intentions and process repetitive tasks. What other capabilities of AI would be available for naturally integrating with human tasks? What other tasks that are hard for humans but that machines excel at could be integrated?

3. Do you agree that “the best kind of system is one where the user does not even know he or she is using it”? Would there ever be a case where it is crucial that the user feels the presence of the system as a separate entity? This thought came to me because systems can (and ultimately do) fail at some point. If none of the users understand how the system works, wouldn’t that be a problem?

04/08/2020 – Ziyao Wang – CrowdScape: Interactively Visualizing User Behavior and Output

The authors present CrowdScape, a system that supports human evaluation of the increasing amount of complex crowd work. The system uses interactive visualization and mixed-initiative machine learning to combine information about worker behavior with worker outputs. It can help users better understand crowd workers and leverage their strengths. The authors developed the system around three points to meet the requirements of quality control in crowdsourcing: output evaluation, behavioral traces, and integrated quality control. They visualize workers’ behavior and the quality of their outputs, and combine findings about behavior with the outputs to evaluate the crowd workers’ work. The system has some limitations; for example, it cannot work if the worker completes the task in a separate text editor, and the behavioral traces are not always detailed enough. However, it still provides good support for quality control.

Reflections:

How should we evaluate the quality of the outputs produced by crowd workers? For complex tasks, there is no single correct answer, and we can hardly evaluate the workers’ work directly. Previously, researchers proposed methods that traced the workers’ behavior to evaluate their work. However, this kind of method is still not accurate enough, as workers may provide the same output while completing the task in different ways. The authors provide a novel approach that evaluates workers based on outputs, behavioral traces, and the combination of these two kinds of information. This combination increases the accuracy of their system and makes it possible to analyze some complex tasks.
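
A minimal sketch of what such integrated quality control might look like, assuming hypothetical per-submission scores for output quality and behavior; the names, weights, and cutoffs below are illustrative, not values from the paper.

  // Toy illustration of integrated quality control: combine an output-based
  // score with a behavior-based score before deciding which work to keep.
  interface Assessment {
    workerId: string;
    outputScore: number;   // e.g. agreement with other workers, scaled to 0..1
    behaviorScore: number; // e.g. normalized time-on-task and activity, 0..1
  }

  // Keep a submission only if both signals, and their weighted combination, look reasonable.
  function integratedFilter(assessments: Assessment[], cutoff = 0.6): Assessment[] {
    return assessments.filter((a) => {
      const combined = 0.5 * a.outputScore + 0.5 * a.behaviorScore;
      return a.outputScore > 0.2 && a.behaviorScore > 0.2 && combined >= cutoff;
    });
  }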

This system is valuable for crowdsourcing users. They can better understand the workers by building a mental model of them and, as a result, distinguish good results from poor ones. In crowdsourcing-related projects, developers will sometimes receive poor responses from inattentive workers. With this system, they can keep only the valuable results for their research, which may increase the accuracy of their models, give them a better view of their systems’ performance, and provide more detailed feedback.

Also, for system designers, the visualization tool for behavioral traces is quite useful for getting detailed user feedback and interaction data. If they analyze these data, they can learn what kinds of interactions their users need and provide a better user experience.

However, I think there may be ethical issues with this system. Using it, HIT publishers can observe workers’ behavior while they complete the HITs, collecting information about mouse movements, scrolling, keypresses, focus events, and clicks. I think this may raise privacy issues, and this kind of information could be used for crime. Workers’ computers would be at risk if their habits were collected by crackers.

Questions:

Can this system be applied to some more complex tasks other than purely generative tasks?

How can the designers use this system to design interfaces which can provide a better user experience?

How can we prevent crackers from using this system to collect users’ habits and attack their computers?

Subil Abraham – 04/08/2020 – Rzeszotarski and Kittur, “CrowdScape”

Quality control in crowd work is straightforward for straightforward tasks. Tasks like transcribing the text in an image are fairly easy to evaluate because there is only one right answer. Requesters can use things like gold standard tests to evaluate the output of crowd workers directly to determine if they have done a good job, or use task fingerprinting to determine if worker behavior indicates that they are making an effort. The authors propose CrowdScape as a way to combine both types of quality analysis, worker output and behavior, through a mix of machine learning and innovative visualization methods. CrowdScape includes a dashboard that provides a bird’s-eye view of different aspects of worker behavior in the form of graphs. These graphs showcase both the aggregate behaviors of all the crowd workers and the timeline of the individual actions a crowd worker takes on a particular task (scrolling, clicking, typing, and so on). The behavioral traces identify where a crowd worker spends their time by looking at their actions and how long they spend on each one. The authors conduct multiple case studies on different kinds of tasks to show that their visualizations are beneficial in separating out the workers who make an effort to produce quality output from those who are just phoning it in.

CrowdScape provides an interesting visual solution to the problem of how to evaluate whether workers are being sincere in the completion of complex tasks. Creative work especially, where you ask the crowd worker to write something on their own, is notoriously hard to evaluate because there is no gold standard test you can apply. So I find the behavior-tracking visualizer, where different colored lines along a timeline represent different actions, useful. Someone who makes an effort in typing will show long blocks of typing with pauses for thinking. I can see how different behavioral heuristics could be applied for different tasks to determine whether the workers are actually doing the work. I have to admit, though, that I find the scatter plots kind of obtuse and hard to parse. I’m not entirely sure how we’re supposed to read them or what information they are conveying, so I feel the interface itself could do better at communicating exactly what the graphs are showing. There is promise for releasing this as a commercial or open-source product (if it isn’t already one) once the interface is polished. One last thing is the ability for the requester to group “good” submissions, after which CrowdScape uses machine learning to find other, similar “good” submissions. However, the paper only mentions this and does not describe how it fits into the interface as a whole. I felt this was another shortcoming of the design.
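
Since the paper doesn’t spell out how that step works, one plausible reading is a simple nearest-neighbor ranking over behavioral feature vectors; the TypeScript sketch below is my assumption of that idea, not the authors’ implementation.

  // A rough sketch of "mark a few good submissions, find more like them."
  // The distance metric and feature vector contents are assumptions.
  type FeatureVector = number[]; // e.g. [timeOnTask, keypresses, scrollEvents], pre-normalized

  function euclidean(a: FeatureVector, b: FeatureVector): number {
    return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
  }

  // Rank unlabeled submissions by distance to the nearest hand-picked "good" one.
  function rankBySimilarity(
    good: FeatureVector[],
    candidates: Map<string, FeatureVector>, // workerId -> features
  ): [string, number][] {
    const scored: [string, number][] = [];
    for (const [workerId, features] of candidates) {
      const nearest = Math.min(...good.map((g) => euclidean(g, features)));
      scored.push([workerId, nearest]);
    }
    return scored.sort((x, y) => x[1] - y[1]); // smallest distance = most similar
  }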

  1. What would a good interface for the grouping of the “good” output and subsequent listing of other related “good” output look like?
  2. In what kind of crowd work would CrowdScape not be useful (assuming you were able to get all the data that CrowdScape needs)?
  3. Did you find all the elements of the interface intuitive and understandable? Were there parts of it that were hard to parse?
