04/08/2020 – Bipasha Banerjee – CrowdScape: Interactively Visualizing User Behavior and Output

Summary

The paper tackles the problem of quality control for work done by crowdworkers. The authors created a system named CrowdScape to evaluate human work through mixed-initiative machine learning and interactive visualization. They also survey existing approaches to quality control in crowdsourcing, including post-hoc output evaluation, behavioral traces, and integrated quality control. CrowdScape captures worker behavior and presents it as interactive data visualizations, incorporating various techniques to monitor user activity. This helps determine whether the work was done diligently or in a rush. The output of the work is indeed a good indicator of its quality; however, an in-depth review of user behavior is needed to understand how the worker completed the task.

Reflection

To be very honest, I found this paper fascinating and extremely important for research work in this domain. Ensuring the work submitted is of good quality not only helps legitimize the output of the experiment but also increases trust in the platform as a whole. I was astonished to read that about one-third of all submissions are of low quality. The stats suggest that we are wasting a significant amount of resources. 

The paper mentions that the tool uses two sources of data: output and worker behavior. I was intrigued by how they took the worker’s behavior into account, such as the time taken to complete the task and the way the work was completed, including scrolling, key presses, and other activities. I was curious to know whether the workers’ consent was explicitly taken. It would also be an interesting study to see if knowing that the behavior is being recorded affects performance. Additionally, dynamic feedback could be incorporated: if the worker is supposed to take “x” minutes, alerting them that the time spent on the task is too low would prompt them to take the work more seriously and avoid unnecessary rejection of the task.
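A minimal sketch of that dynamic-feedback check, assuming a simple cutoff (the 0.5 ratio and the function name are my own illustrative choices, not from the paper):

```python
def time_feedback(elapsed_s, expected_s, ratio=0.5):
    """Return a warning message if the worker finished much faster than
    the expected time, else None. The cutoff ratio is an assumption."""
    if elapsed_s < expected_s * ratio:
        return "You finished unusually quickly. Please review your answers."
    return None
```

Such a prompt would fire before submission, giving the worker a chance to revisit the task instead of facing a rejection later.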

I have a comment on the collection of YouTube video tutorials. One of the features taken into account was ‘Total Time’, which signified whether the worker had first watched the video completely and then summarized the content. However, I would like to point out that videos can be watched at an increased playback speed; I sometimes end up watching most tutorial-related videos at 1.5x speed. Hence, if the total time taken is less than expected, it might simply mean the worker watched at a different speed. A simple check could solve the problem: YouTube offers a fixed set of playback speeds, and taking those into account when calculating the expected total time might be a viable option.
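That check could be sketched as follows; the speed list mirrors YouTube’s standard playback options, while the overhead constant and function name are assumptions of mine:

```python
# YouTube's standard playback-speed options.
PLAYBACK_SPEEDS = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]

def fastest_consistent_speed(video_length_s, time_on_task_s, overhead_s=30):
    """Return the slowest playback speed at which a full watch of the video
    fits into the observed time (minus a writing-overhead allowance),
    or None if even 2x playback would not fit -- a suspicious submission."""
    watch_time = max(time_on_task_s - overhead_s, 0)
    for speed in PLAYBACK_SPEEDS:  # ascending, so slower speeds win ties
        if video_length_s / speed <= watch_time:
            return speed
    return None
```

A `None` result would flag the submission for manual review rather than reject it outright.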

Questions

  1. How are you ensuring the quality of the work completed by crowdworkers for your course project?
  2. Were the workers informed that their behavior was “watched”? Would the behavior and, subsequently, the performance change if they are aware of the situation?
  3. Workers might use different playback speeds to watch videos. How is that situation handled here?


04/08/2020 – Bipasha Banerjee – Agency plus automation: Designing artificial intelligence into interactive systems

Summary

The paper argues that computer-aided products should be considered an enhancement of human work rather than a replacement for it. It emphasizes that technology on its own is not always foolproof, yet humans at times tend to rely completely on it. In fact, AI can itself yield faulty results due to biases in the training data, lack of sufficient data, and other factors. The authors point out how human and machine efforts can be coupled successfully, citing examples such as Google search autocomplete and grammar/spelling correction. The paper aims to use AI techniques in a manner that ensures humans remain the primary controllers. The authors considered three case studies, namely data wrangling, data visualization for exploratory analysis, and natural language translation, to demonstrate how shared representations perform. In each case, the models were designed to be human-centric while having automated reasoning enabled.

Reflection

I agree with the authors’ statement about data wrangling that more time is spent cleaning and preparing the data than actually interpreting it or applying the task one specializes in. I was amused by the idea that users’ work of transforming the data is cut short and aided by a system that suggests the proper action to take. I believe this would indeed help the users of the system if the desired options are recommended to them directly; if not, the feedback will help improve the machine further. I particularly found it interesting that users preferred to maintain control. This makes sense because, as humans, we have an intense desire for control.

The paper never clearly explains who the participants of the system are. It would be essential to know exactly who the users were and how specialized they are in the field they are working in. It would also give an in-depth idea of the experience they had interacting with the system, and thus I feel the evaluation would then be complete.

The paper’s overall concept is sound. It is indeed necessary to have seamless interaction between humans and machines. The authors present three case studies; however, all of them are data-oriented. It would be interesting to see how the work can be extended to other forms, such as videos and images. Facebook picture tagging, for example, does this task to some extent: it suggests the “probable” name(s) of the person in the picture to users. This work could also be used to help detect fake vs. real images or whether a video has been tampered with.

Questions

  1. How are you incorporating the notion of intelligent augmentation in your class project?
  2. The case studies are varied but mainly data-oriented. How would this work differ if it were applied to images?
  3. The paper mentions “participants” and how they provided feedback. However, I am curious to know how they were selected, particularly the criteria used to choose users to test the system.


04/08/2020 – Palakh Mignonne Jude – CrowdScape: Interactively Visualizing User Behavior and Output

SUMMARY

There are multiple challenges in ensuring quality control of crowdworkers that are not always easily resolved by simple methods such as the use of gold standards or worker agreement. Thus, the authors of this paper propose a new technique to ensure quality control in crowdsourcing for more complex tasks. By utilizing features from worker behavioral traces as well as worker outputs, they help researchers better understand the crowd. As part of this research, the authors propose novel visualizations to illustrate user behavior, new techniques to explore crowdworker products, tools to group and classify workers, and mixed-initiative machine learning models that build on a user’s intuition about the crowd. They created CrowdScape, built on top of MTurk, which captures data from the MTurk API as well as from a Task Fingerprinting system in order to obtain worker behavioral traces. The authors discuss various case studies, such as translation, picking a favorite color, writing about a favorite place, and tagging a video, and describe the benefits of CrowdScape in each case.

REFLECTION

I found CrowdScape to be a very good system, especially considering the difficulty of ensuring quality control among crowdworkers in the case of more complex tasks. For example, in a summarization task, particularly for larger documents, there is no single gold standard that can be used, and it would be rare for the answers of multiple workers to match closely enough for majority vote to work as a quality control strategy. Thus, for applications such as this, I think it is very good that the authors proposed a methodology that combines both behavioral traces and worker output, and I agree that it provides more insight than using either alone. I found the example of the requester intending to have summaries written for YouTube physics tutorials to be an appropriate one.

I also liked the visualization design that the authors proposed. They aimed to combine multiple views and made the interface easy for requesters to use. I especially found the use of 1-D and 2-D matrix scatter plots showing distribution of features over the group of workers that also enabled dynamic exploration to be well thought out.

I found the case study on translation to be especially well thought out, given that the authors structured the study such that they included a sentence that did not parse well in computer-generated translations. I feel that such a strategy can be used in multiple translation-related activities in order to more easily discard submissions by lazy workers. I also liked the case study on ‘Writing about a Favorite Place’, as it indicated the performance of the CrowdScape system in a situation wherein no two workers would provide the same response and traditional quality control techniques would not be applicable.

QUESTIONS

  1. The CrowdScape system was built on top of Mechanical Turk. How well does it extend to other crowdsourcing platforms? Is there any difference in the performance?
  2. The authors mention that workers who may possibly work on their task in a separate text editor and paste the text in the end would have little trace information. Considering that this is a drawback of the system, what is the best way to overcome this limitation?
  3. The authors used the case study on ‘Translation’ to demonstrate the power of CrowdScape to identify outliers. Could an anomaly detection machine learning model be trained to identify such outliers and aid researchers further?


04/08/2020 – Palakh Mignonne Jude – Agency plus automation: Designing artificial intelligence into interactive systems

SUMMARY

The authors of this paper aim to demonstrate the capabilities of various interactive systems that build on the complementary strengths of humans and AI systems. These systems aim to promote human control and skillful action. The interactive systems that the authors have developed span three areas: data wrangling, exploratory analysis, and natural language translation. In the data wrangling project, the authors demonstrate a means of enabling users to create data-transformation scripts within a direct manipulation interface augmented by predictive models. For exploratory analysis, the authors developed an interactive system, ‘Voyager’, that helps analysts engage in open-ended exploration as well as targeted question answering by blending manual and automated chart specification. The predictive translation memory (PTM) project aimed to blend the automation capabilities of machines on rote tasks with the nuanced translation guidance that humans can provide. Through these projects, the authors found that various trade-offs exist in the design of such systems.

REFLECTION

The authors mention that users ‘may come to overly rely on computational suggestions’, and this statement reminded me of the paper ‘Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning’, wherein the authors discovered that the data scientists who participated in the study over-trusted the interpretability tools.

I thought that the use of visualizations as part of the Data Wrangling project was a good idea since humans often work well with visualizations and that this can speed up the task at hand. As part of previous coursework, my professor had conducted a small experiment in class wherein he made us identify a red dot among multiple blue dots and then identify a piece of text in a table. As expected, we were able to identify the red dot much quicker – attesting to the fact that visual aids often help humans to work faster. The interface of the ‘Voyager’ system reminded me of the interface of the ‘Tableau’ data visualization software. I found that, in the case of the predictive translation memory (PTM) project, it was interesting that the authors mention the trade-off between customers wanting translators that have more consistent results versus human translators that experienced a ‘short-circuiting’ of thought with the use of the PTM tool.

QUESTIONS

  1. Given that there are multiple trade-offs that need to be considered while formulating the design of such systems, what is the best way to reduce this design space? What simple tests can be performed to evaluate the feasibility of each of the systems designed?
  2. As mentioned in the case of the PTM project, customers hiring a team of translators prefer more consistent results which can be aided by MT-powered systems. However, one worker found that the MT ‘distracts from my own original translation’. Specifically in the case of natural language translation, which of the two do you find to be more important, the creativity/original translation of the worker or consistent outputs?
  3. In each of the three systems discussed, the automated methods suggest actions, while the human user is the ultimate decision maker. Are there any biases that the humans might project while making these decisions? How much would these biases affect the overall performance of the system?


04/08/20 – Lulwah AlKulaib – CrowdScape

Summary

The paper presents a system supporting the evaluation of complex crowd work through mixed-initiative machine learning and interactive visualization. The system proposes a solution to quality control challenges that occur on crowdsourcing platforms. Previous work relied on quality control concepts based on worker output or behavior alone, which was not effective in evaluating complex tasks. The suggested system combines the behavior and output of a worker to evaluate complex crowd work. Its features allow users to develop hypotheses about their crowd, test them, and refine selections based on machine learning and visual feedback. The authors use MTurk and Rzeszotarski and Kittur’s Task Fingerprinting system to create an interactive data visualization of the crowd workers. They posted four varieties of complex tasks: translating text from Japanese to English, picking a favorite color using an HSV color picker and writing its name, writing about a favorite place, and tagging science tutorial videos from YouTube. They conclude that the information gathered from crowd workers’ behavior is beneficial in reinforcing or contradicting conceptions of the cognitive process that crowd workers use to complete tasks, and in developing and testing mental models of the behavior of crowd workers who produce good or bad outputs. This helps users identify further good workers and output in a sort of positive feedback loop.
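The aggregation step that turns a raw behavioral trace into per-worker features might look roughly like this; the event format is a guess for illustration, not the Task Fingerprinting system’s actual schema:

```python
def aggregate_trace(events):
    """Reduce a time-stamped event log, given as (timestamp_s, kind) pairs,
    into aggregate behavioral features for one worker."""
    times = [t for t, _ in events]
    kinds = [k for _, k in events]
    return {
        "total_time": max(times) - min(times) if times else 0,
        "key_presses": kinds.count("key"),
        "scrolls": kinds.count("scroll"),
    }
```

Features like these are what the visualizations would then plot per worker.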

Reflection

This paper presents an interesting approach to discovering low-quality responses from crowd workers. Combining these two methods is an interesting idea, and it makes me think of our project and what limitations might arise from following their method of logging crowd workers’ behaviors. I had not been thinking of disclosing to crowdworkers that their behavior while responding is being recorded, and now I am looking at previous work to see whether such disclosure has affected crowd workers’ responses. I found it interesting that crowdworkers used machine translation in the Japanese-to-English translation task even when they knew their behavior was being recorded. I assume that since there was no requirement of speaking Japanese, or the requirements were relaxed, crowd workers were able to perform the task using tools like Google Translate; if the requirement had been there, the workers would not have been paid for the task. This has also alerted me to the importance of task requirements and explanations for crowd workers, since some Turkers could abuse the system and give us low-quality results simply because the rules were not clear.

Having the authors list their limitations was useful for me. It gave me another perspective to think about how to evaluate the responses that we get in our project and what we can do to make our feedback approach better.

Discussion

  • Would you use behavioral traces as an element in your project? If yes, would you tell the crowd workers that you are collecting that data? Why? Why not?
  • Do you think that implicit feedback and behavioral traces can help determine the quality of a crowd worker’s answer? Why? Or why not?
  • Do you think that collecting such feedback is a privacy issue? Why? Or Why not?


04/09/2020 – Mohannad Al Ameedi – Agency plus automation: Designing artificial intelligence into interactive systems

Summary

In this paper, the author proposes multiple systems that combine the power of both artificial intelligence and human computation and overcome each one’s weaknesses. The author argues that automating all tasks can lead to poor results, as a human component is needed to review and revise outputs to get the best results. The author uses the autocomplete and spell-checker examples to show that artificial intelligence can offer suggestions that a human can then review, revise, or dismiss. The author proposes different systems that use predictive interaction to help users with tasks that can be partially automated, letting users focus more on the things they care about. One of these systems, Data Wrangler, can be used by data analysts during data preprocessing to help clean up the data, saving more than 80% of their effort; users set up some data mappings and can accept or reject the suggestions. The author also proposes a project called Voyager that helps with data visualization for exploratory analysis by suggesting visualization elements. The author suggests using AI to automate repeated tasks and offer the best suggestions and recommendations, letting the human decide whether to accept or reject them. This kind of interaction can improve both machine learning results and human interaction.

Reflection

I found the material presented in the paper to be very interesting. Many discussions about whether machines can replace humans were addressed in this paper. The author argues that machines can do well with the help of humans and that the human in the loop will always be necessary.

I also like the idea of the Data Wrangler system, as many data analysts and developers spend considerable time cleaning up data, and most of the steps are repeated regardless of the type of data. Automating these steps will help a lot of people do more effective work and focus more on the problem they are trying to solve rather than spending time on things that are not directly related to it.

I agree with the author that humans will always be in the loop, especially in systems that will be used by humans. Advances in AI need humans to annotate or label the data to work effectively, and also to measure and evaluate the results.

Questions

  • The author mentioned that the Data Wrangler system can be used by data analysts to help with data preprocessing. Do you think this system can also be used by data scientists, since most machine learning and deep learning projects require data cleanup?
  • Can you give other examples of AI-infused interactive systems that help in different domains, can be deployed into production environments for a large number of users, and can scale well with increased load and demand?


04/08/2020 – Mohannad Al Ameedi – CrowdScape: Interactively Visualizing User Behavior and Output

Summary

In this paper, the authors propose a system that can evaluate complex tasks based on both workers’ output and behavior. Other available systems focus on one aspect of evaluation, either the worker’s output or behavior, which can give poor results, especially for complex or creative work. The proposed system, CrowdScape, combines both through interactive visualization and mixed-initiative machine learning. It offers visualizations that allow users to filter out poor output so they can focus on a limited number of responses, and it uses machine learning to measure the similarity of responses to the best submission; that way, the requester can get the best output and the best behavior at the same time. The system provides time-series data for user actions like mouse moves or scrolling to generate a visual timeline for tracing user behavior. The system works only with web pages and has some limitations, but the value it can give to requesters is high, and it enables users to navigate through workers’ results easily and efficiently.
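The filter-then-rank idea, i.e. ordering the remaining responses by similarity to a submission the requester already trusts, could be sketched like this (a bag-of-words cosine stands in for whatever similarity measure CrowdScape actually uses):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two texts over bag-of-words counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_against_best(submissions, best):
    """Order submissions by similarity to a hand-picked good submission."""
    return sorted(submissions, key=lambda s: cosine_sim(s, best), reverse=True)
```

The requester would then review the top of the ranking first and treat the tail as candidates for rejection.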

Reflection

I found the method used by the authors to be very interesting. Requesters receive a great deal of information about the workers, and visualizing that data can help requesters learn more about it, while machine learning can help a lot with classifying or clustering the optimal worker outputs and behaviors. Other approaches mentioned in the paper are also interesting, especially for simple tasks that don’t need complex evaluation.

I also didn’t know that we could get such detailed information about workers’ output and behavior, and I found the YouTube example mentioned in the paper to be very interesting. The example shows that everything related to the user’s actions while working on the YouTube video can be captured with the help of JavaScript, which can be used in many scenarios. I agree with the authors about the approach, which combines the best of the two methods. I think it would be interesting to know how many worker responses are filtered out in the first phase of the process, because that can tell us whether posting the request was even worthwhile. If too many responses are not considered, then the task may need to be reevaluated.

Questions

  • The authors mentioned that their proposed system can help filter out poor outputs in the first phase. Do you think that if too many responses are filtered out, it means the guidelines or the selection criteria need to be reevaluated?
  • The authors depend on JavaScript to track information about the workers’ behaviors. Do you think MTurk needs to approve that, or is it not necessary? And do you think the workers also need to be notified before accepting the task?
  • The authors mention that CrowdScape can be used to evaluate complex and creative tasks. Do you think they need to add a process to make sure that a task really needs to be evaluated by their system, or do you think the system can also work with simple tasks?


04/08/2020 – Sushmethaa Muhundan – Agency plus automation: Designing artificial intelligence into interactive systems

This work explores strategies to balance the roles of agency and automation by designing user interfaces that enable shared representations between AI and humans. The goal is to productively employ AI methods while also ensuring that humans remain in control. Three case studies are discussed: data wrangling, data visualization for exploratory analysis, and natural language translation. Across each, strategies for integrating agency and automation by incorporating predictive models and feedback into interactive applications are explored. In the first case study, an interactive system is proposed that aims to reduce human effort by recommending potential transformations, gaining feedback from the user, and performing the transformations as necessary. This enables the user to focus on tasks that require the application of their domain knowledge and expertise rather than spending time and effort manually performing transformations. A similar interactive system was developed to aid visualization efforts; the aim was to encourage more systematic consideration of the data and also reveal potential quality issues. In the case of natural language translation, a mixed-initiative translation approach was explored.

The paper has a pragmatic view of the current AI systems and makes a realistic observation that the current AI systems are not capable of completely replacing humans. There is an emphasis on leveraging the complementary strengths of both the human and the AI throughout the paper which is practical. 

Interesting observations were made in the Data Wrangler project with respect to proactive suggestions. When suggestions were presented initially, before the user had a chance to interact with the system, the feature received negative feedback and was ignored. But when the same suggestions were presented while the user was engaging with the system, even though they were not related to the user’s current task, they were received positively. Users viewed themselves as initiators in the latter scenario and hence felt that they were controlling the system. This observation was fascinating since it shows that, while designing such user interfaces, designers should ensure that their users feel in control and do not feel insecure while using AI systems.

With respect to the second case study, it was reassuring to learn that the inclusion of automated support from the interactive system was able to shift user behavior for the better and helped broaden users’ understanding of the data. Another positive effect was that the system helped humans combat confirmation bias. This shows that if the interface is designed well, the benefits of AI amplify the results gained when humans apply their domain expertise.

  • The paper deals with designing interactive systems where the complementary strengths of agents and automation systems are leveraged. What could be the potential drawbacks of such systems, if any?
  • How would the findings of this paper be translated in the context of your class project? Is there potential to develop similar interactive systems to improve the user experience of the end-users?
  • Apart from the three case studies presented, what are some other domains where such systems can be developed and deployed?


04/08/2020 – Sushmethaa Muhundan – CrowdScape: Interactively Visualizing User Behavior and Output

This work aims to address quality issues in the context of crowdsourcing and explores strategies to involve humans in the evaluation process via interactive visualizations and mixed-initiative machine learning. CrowdScape is the proposed tool that aims to ensure quality even in complex or creative settings. It leverages both the end output and workers’ behavior patterns to develop insights about performance. CrowdScape is built on top of Mechanical Turk and obtains data from two sources: the MTurk API, to obtain the products of the work done, and Rzeszotarski and Kittur’s Task Fingerprinting system, to capture worker behavioral traces. The tool utilizes these two data sources to generate an interactive data visualization platform. With respect to worker behavior, raw event logs and aggregate worker features are incorporated to provide diverse interactive visualizations. Four specific case studies were discussed, covering tasks related to translation, a color preference survey, writing, and video tagging.

In the context of creative works and complex tasks where it is extremely difficult to evaluate the task results objectively, I feel that mixed-initiative approaches like the one described in the paper can be effective to gauge the worker’s performance.

I specifically liked the feature mentioned with respect to aggregating features of worker behavioral traces wherein the user is presented with capabilities to dynamically query the visualization system to support data analysis. This gives the user control over what features are important to them and allows users to focus on those specific behavioral traces as opposed to presenting the users with static visualizations which would have limited impact.

Another interesting feature provided by the system was that it enabled users to cluster submissions based on aggregate event features, and I feel that this would definitely help save time and effort on the user’s side and thereby quicken the process.
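Clustering on aggregate event features could be sketched with a plain k-means over per-worker feature vectors (the dependency-free implementation, first-k initialization, and feature choice are simplifications of mine; CrowdScape’s actual clustering may differ):

```python
def kmeans(points, k=2, iters=20):
    """Cluster feature vectors (e.g. [total_time, key_presses]) with a
    naive k-means: first-k initialization, fixed iteration count."""
    def dist2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels, centroids
```

Groups of workers with similar behavior (e.g. fast finishers with few key presses) would then fall into the same cluster for joint review.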

In the translation case study presented, it was interesting to note that one of the signals used to detect lack of focus was copy-paste keyboard usage. This would intuitively suggest that the worker used third-party software for translation. However, this alone might not be sufficient proof, since it is possible that the worker translated the task locally and was copy-pasting his/her own work. This shows that while user behavior tracking can provide insights, it might not be sufficient to draw conclusions. Hence, coupling it with the output data and comparing and visualizing both would help draw concrete conclusions.

  • Apart from the techniques mentioned in the paper, what are some alternate techniques to gauge the quality of crowd workers in the context of complex or creative tasks?
  • Apart from the case studies presented, what are some other domains where such systems can be developed and deployed?
  • Given that the tool relies on worker’s behavior patterns and given that these may vary largely from worker to worker, are there situations in which the proposed tool would fail to produce reliable results with respect to performance and quality?


04/08/20 – Jooyoung Whang – CrowdScape: Interactively Visualizing User Behavior and Output

In this paper, the authors try to help MTurk requesters by providing them with an analysis tool called “CrowdScape.” CrowdScape is an ML-plus-visualization tool for viewing and filtering MTurk worker submissions based on the workers’ behaviors. The user of the application can set thresholds on certain behavioral attributes such as time spent or typing delay. The application takes two inputs: worker behavior and results. The behavior input is time-series data of user activity; the result is what the worker submitted for the MTurk work. The authors focused on finding similarities among the answers to graph them on parallel coordinates. The authors conducted a user study by launching four different tasks and recording user behavior and results, and they conclude that their approach is useful.
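The thresholding on behavioral attributes described above might look like this in minimal form; the field names and cutoff values are assumptions for illustration, not taken from the paper:

```python
def split_by_thresholds(submissions, min_time_s=60, max_typing_delay_s=5.0):
    """Partition submissions into those passing simple behavioral cutoffs
    (time on task, mean delay between key presses) and those flagged
    for closer review. Field names are hypothetical."""
    keep, review = [], []
    for s in submissions:
        passes = (s["time_spent"] >= min_time_s
                  and s["typing_delay"] <= max_typing_delay_s)
        (keep if passes else review).append(s)
    return keep, review
```

In the actual tool the user adjusts these cutoffs interactively and sees the selection update in the linked visualizations.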

This paper’s approach of integrating user behavior and results to filter good output was interesting, although I think the system must overcome one problem to be effective. The problem lies in the area of ethics. The authors explicitly stated that they obtained consent from their pool of workers to collect user behavior. However, some MTurk requesters may decide not to do so, with ill intentions. This could result in intrusion into private information and even lead to theft. On the other hand, upon giving consent, the MTurk worker becomes aware of being monitored. This could result in unnatural behavior, which is undesirable for system testing.

I thought the individual visualized graphs and figures were effective for better understanding and for filtering by user behavior. However, the entire CrowdScape interface looked a bit overpacked with information. I think a small feature to show or hide some of the graphs would be desirable. The same problem existed in another information exploration system from a project that I’ve worked on. In my experience, an effective solution was to provide a set of menus that hierarchically sorted the attributes.

These are the questions that I had while reading the paper:

1. A big purpose of CrowdScape is that it can be used to filter and retrieve a subset of the results (those thought to be of high quality). What other ways could this system be used? For example, I think it could be used for rejecting undesired results. Suppose you needed 1000 results and you launched 1000 HITs. You know you will get some ill-quality results, but since there are so many submissions, it would take forever to filter them by eye. CrowdScape would help accelerate the process.

2. Do you think you can use CrowdScape for your project? If so, how would you use it? CrowdScape is useful if you, the researcher, are the endpoint of the MTurk task (as in, the result is ultimately used by you). My project uses the results from MTurk in a systematic way without them ever reaching me, so I don’t think I’ll use CrowdScape.

3. Do you think the graphs available in CrowdScape are enough? What other features would you want? For one, I’d love to have a boxplot for the user behavior attributes.
