04/09/2020 – Mohannad Al Ameedi – Agency plus automation: Designing artificial intelligence into interactive systems

Summary

In this paper, the author proposes multiple systems that combine the power of artificial intelligence and human computation and overcome the weaknesses of each. The author argues that automating all tasks can lead to poor results, as a human component is needed to review and revise the output to get the best results. The author uses autocomplete and spell checkers as examples of how artificial intelligence can offer suggestions that humans can then review, revise, or dismiss. The author proposes different systems that use predictive interaction to help users with tasks that can be partially automated, so that users can focus more on the things they care about. One of these systems, Data Wrangler, can be used by data analysts during data preprocessing to help them clean up the data, saving more than 80% of their work. The users need to set up some data mappings and can accept or reject the suggestions. The author also proposes a project called Voyager that helps with data visualization for exploratory analysis by suggesting visualization elements. The author suggests using AI to automate repetitive tasks and offer the best suggestions and recommendations, and letting the human decide whether to accept or reject them. This kind of interaction can improve both the machine learning results and the human interaction.

Reflection

I found the material presented in the paper to be very interesting. Many discussions about whether machines can replace humans or not were addressed in this paper. The author mentioned that machines can do well with the help of humans, and that the human in the loop will always be necessary.

I also like the idea of the Data Wrangler system, as many data analysts and developers spend considerable time on cleaning up data, and most of the steps are repeated regardless of the type of data. Automating these steps will help a lot of people do more effective work and focus more on the problem they are trying to solve rather than spending time on things that are not directly related to it.
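
To make this concrete, below is a minimal pandas sketch of the kind of repetitive cleanup steps a Wrangler-style tool could suggest and the analyst could accept or reject. This is my own illustration, not the paper's actual tool; the file name and the specific steps are assumptions.

```python
# A minimal pandas sketch (not the paper's Wrangler tool) of repetitive
# cleanup steps a predictive-interaction tool could suggest, leaving the
# analyst to accept, edit, or dismiss each one.
import pandas as pd

def suggested_cleanup(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few commonly suggested transforms."""
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))  # normalize headers
    df = df.drop_duplicates()                                              # remove exact duplicates
    df = df.dropna(how="all")                                              # drop fully empty rows
    return df

# Hypothetical usage with a made-up file name:
# raw = pd.read_csv("survey_responses.csv")
# clean = suggested_cleanup(raw)
```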

I agree with the author that humans will always be in the loop, especially in systems that will be used by humans. Advances in AI need humans to annotate or label the data to work effectively, and also to measure and evaluate the results.

Questions

  • The author mentioned that the Data Wrangler system can be used by data analysts to help with data preprocessing. Do you think that this system can also be used by data scientists, since most machine learning and deep learning projects require data cleanup?
  • Can you give other examples of AI-infused interactive systems that can help in different domains, be deployed into production environments for use by a large number of users, and scale well with increased load and demand?

04/08/2020 – Mohannad Al Ameedi – CrowdScape: Interactively Visualizing User Behavior and Output

Summary

In this paper, the authors propose a system that can evaluate complex tasks based on both worker output and behavior. Other available systems focus on only one aspect of evaluation, either the worker output or the behavior, which can give poor results, especially with complex or creative work. The proposed system, CrowdScape, combines the two through interactive visualization and mixed-initiative machine learning. It offers visualizations that allow users to filter out poor output so they can focus on a limited number of responses, and it uses machine learning to measure the similarity of responses to the best submissions; that way, the requester can get the best output and the best behavior at the same time. The system captures time-series data of user actions, like mouse movements and scrolling, to generate a visual timeline for tracing worker behavior. The system works only with web pages and has some limitations, but the value it can give to the requester is high, and it enables users to navigate through worker results easily and efficiently.
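
As a rough illustration of what that time-series trace looks like once collected, here is a small sketch of condensing a worker's timestamped events into aggregate behavioral features. This is my own example with made-up event names and numbers, not CrowdScape's code.

```python
# A minimal sketch of turning one worker's event trace into aggregate
# behavioral features of the kind a visual timeline could be built from.
from collections import Counter

# Hypothetical trace: (seconds since task start, event type)
trace = [(0.0, "focus"), (1.2, "scroll"), (3.5, "keypress"),
         (3.9, "keypress"), (12.0, "mousemove"), (45.3, "submit")]

def behavior_features(events):
    counts = Counter(kind for _, kind in events)
    times = [t for t, _ in events]
    return {
        "time_on_task_s": max(times) - min(times),   # total dwell time
        "keypresses": counts["keypress"],            # typing activity
        "scrolls": counts["scroll"],                 # reading/skimming activity
        "mouse_moves": counts["mousemove"],
    }

print(behavior_features(trace))
```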

Reflection

I found the method used by the authors to be very interesting. Requesters receive a large amount of information about the workers, and visualizing that data can help requesters understand it, while the use of machine learning can help with classifying or clustering the optimal worker outputs and behaviors. The other approaches mentioned in the paper are also interesting, especially for simple tasks that don’t need complex evaluation.

I also didn’t know that we could get such detailed information about worker output and behavior, and I found the YouTube example mentioned in the paper to be very interesting. The example shows that, with the help of JavaScript, MTurk can return everything related to a worker’s actions while they work on the YouTube video, which can be used in many scenarios. I agree with the authors about the approach, which combines the best of the two evaluation methods. I think it would be interesting to know how many worker responses are filtered out in the first phase of the process, because that can tell us whether sending the request was even worthwhile. If too many responses are discarded, then the task may need to be reevaluated.

Questions

  • The authors mentioned that their proposed system can help to filter out poor outputs in the first phase. Do you think that if too many responses are filtered out, it means the guidelines or the selection criteria need to be reevaluated?
  • The authors depend on JavaScript to track information about workers’ behavior. Do you think MTurk needs to approve that, or is it not necessary? And do you think the workers also need to be notified before accepting the task?
  • The authors mention that CrowdScape can be used to evaluate complex and creative tasks. Do you think they need to add some process to make sure that a task really needs to be evaluated by their system, or do you think the system can also work with simple tasks?

Subil Abraham – 04/08/2020 – Heer, “Agency plus automation”

A lot of work has been done independently along the tangents of improving computers so that humans can use them better and, separately, of helping machines do work by themselves. The paper makes the case that in the quest for automation, research in augmenting humans by improving the intelligence of their tools has fallen by the wayside. This provides a rich area of exploration. The paper explores three tools in this space that work with users in a specific domain and predict what they might need or want next, based on a combination of context clues from the user. Two of the three tools, Data Wrangler and Voyager, use domain-specific languages to represent to the user the operations that are possible, thus providing a shared representation of data transformations for the user and the machine. The last tool, for language translation, does not provide a shared representation but presents suggestions directly, because there is no real way of using a DSL here outside of exposing the parse tree, which doesn’t really make sense for an ordinary end user. The paper also makes several suggestions for future work. These include better monitoring and introspection tools in these human-AI systems, allowing shared representations to be designed by AI based on the domain instead of being pre-designed by a human, and finding techniques that would help identify the right balance between human control and automation for a given domain.
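
To illustrate the shared-representation idea, here is a rough Python sketch in which transformations are plain objects that the human can read (as a description) and the machine can execute. The operation names and structure are my own invention for illustration, not Wrangler's actual DSL.

```python
# A sketch of a "shared representation": transforms are data that can be
# rendered for the human and applied by the machine.
from dataclasses import dataclass

@dataclass
class DropEmptyRows:
    def describe(self) -> str:
        return "Delete rows that are completely empty"
    def apply(self, rows):
        return [r for r in rows if any(v not in ("", None) for v in r.values())]

@dataclass
class SplitColumn:
    column: str
    delimiter: str
    def describe(self) -> str:
        return f"Split column '{self.column}' on '{self.delimiter}'"
    def apply(self, rows):
        out = []
        for r in rows:
            parts = (r.get(self.column) or "").split(self.delimiter)
            out.append({**r, **{f"{self.column}_{i}": p for i, p in enumerate(parts)}})
        return out

# The tool would rank candidate transforms and show describe();
# the user accepts one and the machine runs apply().
suggestions = [DropEmptyRows(), SplitColumn("name", ",")]
for s in suggestions:
    print(s.describe())
```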

The paper uses these three projects as a framing device to discuss the idea of developing better shared representations and their importance in human-AI collaboration. I think it’s an interesting take, especially the idea of using DSLs as a means of communicating ideas between the human user and the AI underneath. They backed away from discussing what a DSL would look like for the translation software, since anything beyond autocomplete suggestions doesn’t really make sense in that domain, but I would be interested in further exploration in that area. I also find it interesting, and it makes sense, that people might not like machine predictions being thrust upon them, either because it influences their thinking or because it is just annoying. I think the tools discussed manage to strike a good balance by staying out of the user’s way. Yes, the user will be influenced, but that is inevitable, because the other option is to not give the predictions at all, in which case you get no benefit.

Although I see the point that the article is trying to make about shared representations (at least, I think I do), I really don’t see the reason for the article existing besides just the author saying “Hey look at my research, this research is very important and I’ve done things with it including making a startup”. The article doesn’t contribute any new knowledge. I don’t mean for that to sound harsh, and I can understand how reading this article is useful from a meta perspective (saves us the trouble of reading the individual pieces of research that are summarized in this article and trying to connect the dots between them).

  1. In the translation task, why wouldn’t a parse tree work? Are there other kinds of structured representations that would aid a user in the translation task?
  2. Kind of a meta question, but do you think this paper was useful on its own? Did it provide anything outside of summarizing the three pieces of research the author was involved in?
  3. Is there any way for the kind of software discussed here, where it makes suggestions to the user, to avoid influencing the user and interfering with their thought process?

Subil Abraham – 04/08/2020 – Rzeszotarski and Kittur, “CrowdScape”

Quality control in crowdwork is straightforward for straightforward tasks. Tasks like transcribing the text in an image are fairly easy to evaluate because there is only one right answer. Requesters can use things like gold standard tests to evaluate the output of the crowdworkers directly in order to determine if they have done a good job, or use task fingerprinting to determine whether the worker’s behavior indicates that they are making an effort. The authors propose CrowdScape as a way to combine both types of quality analysis, worker output and behavior, through a mix of machine learning and innovative visualization methods. CrowdScape includes a dashboard that provides a bird’s-eye view of the different aspects of worker behavior in the form of graphs. These graphs showcase both the aggregate behaviors of all the crowdworkers and the timeline of the individual actions a crowdworker takes on a particular task (scrolling, clicking, typing, and so on). The authors conduct multiple case studies on different kinds of tasks to show that their visualizations are beneficial in separating out the workers who make an effort to produce quality output from those who are just phoning it in. Behavioral traces identify where the crowdworker spends their time by looking at their actions and how long they spend on each action.

CrowdScape provides an interesting visual solution to the problem of how to evaluate whether workers are being sincere in the completion of complex tasks. Creative work especially, where you ask the crowd worker to write something on their own, is notoriously hard to evaluate because there is no gold standard test that you can apply. So I find the behavior tracking visualizer, where different colored lines along a timeline represent different actions, to be useful. Someone who makes an effort will show long blocks of typing with pauses for thinking. I can see how different behavioral heuristics can be applied for different tasks in order to determine if the workers are actually doing the work. I have to admit, though, that I find the scatter plots kind of obtuse and hard to parse. I’m not entirely sure how we’re supposed to read them and what information they are conveying. So I feel the interface itself could do better at communicating exactly what the graphs are showing. There is promise for releasing this as a commercial or open source product (if it isn’t already one) once the interface is polished. One last thing is the ability for the requester to group “good” submissions, after which CrowdScape uses machine learning to find other similar “good” submissions. However, the paper only makes mention of it and does not describe how it fits in with the interface as a whole. I felt this was another shortcoming of the design.
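
For what that "find other similar good submissions" step could look like, here is a hedged nearest-neighbor sketch over aggregate behavior features. This is a generic illustration with made-up numbers, not the paper's actual model or feature set.

```python
# A sketch: given one submission hand-labeled "good", find behaviorally
# similar submissions via nearest neighbors on aggregate features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows: one submission each; columns: [time_on_task, keypresses, scrolls]
features = np.array([
    [300, 450, 12],   # long, active session
    [280, 410, 15],
    [40,  12,  1],    # quick, low-effort session
    [35,  8,   2],
    [310, 500, 10],
])
marked_good = [0]  # index of a submission the requester marked as good

nn = NearestNeighbors(n_neighbors=3).fit(features)
_, idx = nn.kneighbors(features[marked_good])
print("Submissions behaviorally similar to the marked-good one:", idx[0].tolist())
```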

  1. What would a good interface for the grouping of the “good” output and subsequent listing of other related “good” output look like?
  2. In what kind of crowd work would CrowdScape not be useful (assuming you were able to get all the data that CrowdScape needs)?
  3. Did you find all the elements of the interface intuitive and understandable? Were there parts of it that were hard to parse?

04/08/2020 – Ziyao Wang – CrowdScape: Interactively Visualizing User Behavior and Output

The authors presented CrowdScape, a system for supporting human evaluation of the increasing amount of complex crowd work. The system uses interactive visualization and mixed-initiative machine learning to combine information about worker behavior with worker outputs. This system can help users better understand the crowd workers and leverage their strengths. The authors developed the system around three requirements for quality control in crowdsourcing: output evaluation, behavioral traces, and integrated quality control. They visualized the workers’ behavior and the quality of the outputs, and combined the findings about user behavior with the user outputs to evaluate the crowd workers’ work. The system has some limitations; for example, it cannot work if the user completes the work in a separate text editor, and the behavior traces are not very detailed. However, the system still provides good support for quality control.

Reflections:

How do we evaluate the quality of the outputs made by crowdsourcing workers? For complex tasks, there is no single correct answer, and we can hardly evaluate the workers’ output directly. Previously, researchers proposed methods in which they traced the behavior of the workers and evaluated their work from it. However, this kind of method is still not accurate enough, as workers may provide the same output while completing tasks in different ways. The authors provide a novel approach that evaluates the workers from the outputs, the behavioral traces, and the combination of these two kinds of information. This combination increases the accuracy of their system and makes it possible to analyze some of the more complex tasks.

This system is valuable for crowdsourcing users. They can better understand the workers by building a mental model of them. As a result, they can distinguish good results from poor ones. In projects related to crowdsourcing, developers will sometimes receive poor responses from inactive workers. With this system, they can keep only the valuable results for their research, which may increase the accuracy of their models, give a better view of their systems’ performance, and provide detailed feedback.

Also, for system designers, the visualization tool for behavioral traces is quite useful if they want detailed user feedback and user interactions. If they analyze these data, they can learn what kinds of interactions their users need and provide a better user experience.

However, I think there may be ethical issues with this system. Using it, HIT publishers can observe workers’ behavior while they complete the HITs. They can collect the user’s mouse movements, scrolling, keypresses, focus events, and clicks. I think this may raise some privacy issues, and this kind of information could be used maliciously. Workers’ computers would be at risk if their habits were collected by crackers.

Questions:

Can this system be applied to some more complex tasks other than purely generative tasks?

How can the designers use this system to design interfaces which can provide a better user experience?

How can we prevent crackers from using this system to collect user habits and attack their computers?

04/08/20 – Jooyoung Whang – Agency plus automation: Designing artificial intelligence into interactive systems

This paper seeks to investigate methods to achieve AI + IA, that is, enhancing human performance using automated methods without completely replacing it. The author notes that effective automation should, first, bring significant value; second, be unobtrusive; third, not require precise user input; and finally, adapt to the user. The author takes these points into account and introduces three interactive systems that he built. All of these systems utilize machine computing to handle the initial or small repetitive tasks and rely on human computing to make corrections and improve quality. They are all collaborative systems where the AI and the human work together to boost each other’s performance. The AI part of the system tries to predict user intentions while the human part drives the work.

This paper reminded me of Smart-Built Environments (SBE), a term I learned in a Virtual Environments class. An SBE is an environment where computing is seamlessly integrated into the surroundings and interaction with it is very natural. It is capable of “smartly” providing appropriate services to humans in a non-intrusive way. For example, a system where the lights automatically turn on when a person enters a room is a smart feature. I felt that this paper was trying to build something similar in a desktop environment. One core difference is that SBEs also try to tackle immersion and presence (terms frequently used for evaluating virtual environments). I wonder if the author knows about SBEs or got his project ideas from them.

While reading the paper, I wasn’t sure if the author handled the “unobtrusive” part effectively. One of the introduced systems, Wrangler, is an assistive tool for preprocessing data. It tries to predict user intention upon observing certain user behavior and recommends available data transformations in a side panel. I believe this approach is meant to mimic Google’s query auto-completion feature. However, I don’t think it will work as well as Google’s auto-completion. Google’s auto-complete suggestions appear right below where the user is typing, whereas Wrangler shows its suggestions in a side panel. This requires users to avert their eyes from the point of the previous interaction, which is obtrusive.

These are the questions that I had while reading the paper:

1. Do you know of any other systems that try to seamlessly integrate AI and human tasks? Are those systems effective? How so?

2. The author of this paper mostly uses AI to predict user intentions and process repetitive tasks. What other capabilities of AI would be available for naturally integrating with human tasks? What other tasks that are hard for humans but that machines excel at could be integrated?

3. Do you agree that “the best kind of system is one where the user does not even know he or she is using it”? Would there ever be a case where it is crucial that the user feels the presence of the system as a separate entity? This thought came to me because systems can (and ultimately do) fail at some point. If none of the users understand how the system works, wouldn’t that be a problem?

04/08/20 – Jooyoung Whang – CrowdScape: Interactively Visualizing User Behavior and Output

In this paper, the authors try to help MTurk requesters by providing them with an analysis tool called “CrowdScape.” CrowdScape is an ML + visualization tool for viewing and filtering MTurk worker submissions based on the workers’ behaviors. The user of the application can threshold based on certain behavioral attributes such as time spent or typing delay. The application takes two inputs: worker behavior and results. The behavior input is time-series data of user activity. The result is what the worker submitted for the MTurk task. The authors focused on finding similarities among the answers to graph on parallel coordinates. The authors conducted a user study by launching four different tasks and recording user behavior and results. They conclude that their approach is useful.
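
As a small illustration of the thresholding idea, here is a sketch that keeps only submissions whose behavioral attributes pass simple cutoffs before the visual inspection step. The column names, cutoffs, and numbers are my own assumptions, not the paper's.

```python
# A sketch of filtering submissions by behavioral attributes with pandas.
import pandas as pd

submissions = pd.DataFrame({
    "worker_id":    ["A1", "A2", "A3", "A4"],
    "time_spent_s": [320,   28,  290,   45],
    "typing_delay": [0.4,  0.1,  0.6,  0.05],   # mean seconds between keypresses
    "answer":       ["...", "...", "...", "..."],
})

# Hypothetical cutoffs a requester might set interactively
kept = submissions[(submissions["time_spent_s"] >= 60) &
                   (submissions["typing_delay"] >= 0.2)]
print(kept[["worker_id", "time_spent_s"]])

# pandas can also draw a quick parallel-coordinates view of the numeric columns:
# from pandas.plotting import parallel_coordinates
# parallel_coordinates(submissions[["worker_id", "time_spent_s", "typing_delay"]],
#                      class_column="worker_id")
```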

This paper’s approach of integrating user behavior and results to filter good output was interesting. However, I think this system needs to overcome a problem for it to be effective. The problem lies in the area of ethics. The authors explicitly stated that they obtained consent from their pool of workers to collect behavioral data. However, some MTurk requesters may decide not to do so, with ill intentions. This may result in an intrusion into private information and could even lead to theft. On the other hand, upon giving consent, the worker becomes aware that he or she is being monitored. This could result in unnatural behavior, which is undesirable for system testing.

I thought the individual graphs and figures were effective for understanding and filtering by user behavior. However, the entire CrowdScape interface looked a bit overpacked with information. I think a small feature to show or hide some of the graphs would be desirable. The same problem existed in another information exploration system from a project that I’ve worked on. In my experience, an effective solution was to provide a set of menus that hierarchically sorted the attributes.

These are the questions that I had while reading the paper:

1. A big purpose of CrowdScape is that it can be used to filter and retrieve a subset of the results (those thought to be high quality). In what other ways could this system be used? For example, I think it could be used for rejecting undesired results. Suppose you needed 1000 results and you launched 1000 HITs. You know you will get some ill-quality results. However, since there are so many submissions, it will take forever to filter them by eye. CrowdScape would help accelerate the process.

2. Do you think you can use CrowdScape for your project? If so, how would you use it? CrowdScape is useful if you, the researcher, are the endpoint of the MTurk task (as in, the result is ultimately used by you). My project uses the results from MTurk in a systematic way without them ever reaching me, so I don’t think I’ll use CrowdScape.

3. Do you think the graphs available in CrowdScape are enough? What other features would you want? For one, I’d love to have a boxplot for the user behavior attributes.

04/08/2020 – Sushmethaa Muhundan – CrowdScape: Interactively Visualizing User Behavior and Output

This work aims to address quality issues in the context of crowdsourcing and explores strategies to involve humans in the evaluation process via interactive visualizations and mixed-initiative machine learning. CrowdScape is the proposed tool that aims to ensure quality even in complex or creative settings. It leverages both the end output and the workers’ behavior patterns to develop insights about performance. CrowdScape is built on top of Mechanical Turk and obtains data from two sources: the MTurk API, to obtain the products of the work done, and Rzeszotarski and Kittur’s Task Fingerprinting system, to capture worker behavioral traces. The tool combines these two data sources into an interactive data visualization platform. With respect to worker behavior, raw event logs and aggregate worker features are incorporated to provide diverse interactive visualizations. Four specific case studies were discussed, covering tasks related to translation, a color preference survey, writing, and video tagging.

In the context of creative works and complex tasks, where it is extremely difficult to evaluate the task results objectively, I feel that mixed-initiative approaches like the one described in the paper can be effective for gauging workers’ performance.

I specifically liked the feature for aggregating worker behavioral traces, where the user can dynamically query the visualization system to support data analysis. This gives users control over which features are important to them and allows them to focus on those specific behavioral traces, as opposed to being presented with static visualizations, which would have limited impact.

Another interesting feature of the system is that it enables users to cluster submissions based on aggregate event features, which I feel would save the user time and effort and thereby speed up the process.

In the translation case study presented, it was interesting to note that one of the signals used to detect a lack of focus was copy-paste keyboard usage. Intuitively, this suggests that the worker used third-party software for the translation. However, this alone might not be sufficient proof, since it is possible that the worker translated the text locally and was copy-pasting their own work. This shows that while user behavior tracking can provide insights, it might not be sufficient to draw conclusions. Hence, coupling it with the output data and comparing and visualizing them together would definitely help draw concrete conclusions.
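
A toy heuristic makes the point: a paste event only becomes suspicious when it coincides with very little typing, and even then it is a hint to inspect the output, not proof. This is my own illustration, not the paper's method.

```python
# A toy flag combining paste events with typing volume; a True result means
# "inspect this submission's output", not "reject it".
def suspicious_translation(events, min_keypresses=20):
    pastes = sum(1 for e in events if e == "paste")
    keypresses = sum(1 for e in events if e == "keypress")
    return pastes > 0 and keypresses < min_keypresses

worker_a = ["keypress"] * 80 + ["paste"]   # typed a lot, pasted once
worker_b = ["focus", "paste", "submit"]    # pasted with almost no typing
print(suspicious_translation(worker_a))    # False: likely pasting their own draft
print(suspicious_translation(worker_b))    # True: worth a closer look at the output
```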

  • Apart from the techniques mentioned in the paper, what are some alternate techniques to gauge the quality of crowd workers in the context of complex or creative tasks?
  • Apart from the case studies presented, what are some other domains where such systems can be developed and deployed?
  • Given that the tool relies on workers’ behavior patterns, and given that these may vary widely from worker to worker, are there situations in which the proposed tool would fail to produce reliable results with respect to performance and quality?

04/08/2020 – Sushmethaa Muhundan – Agency plus automation: Designing artificial intelligence into interactive systems

This work explores strategies to balance the roles of agency and automation by designing user interfaces that enable shared representations between AI and humans. The goal is to productively employ AI methods while ensuring that humans remain in control. Three case studies are discussed: data wrangling, data visualization for exploratory analysis, and natural language translation. Across each, strategies for integrating agency and automation by incorporating predictive models and feedback into interactive applications are explored. In the first case study, an interactive system is proposed that aims to reduce human effort by recommending potential transformations, getting feedback from the user, and performing the transformations as necessary. This enables the user to focus on tasks that require the application of their domain knowledge and expertise rather than spending time and effort performing transformations manually. A similar interactive system was developed to aid visualization efforts. The aim was to encourage more systematic consideration of the data and also reveal potential quality issues. For natural language translation, a mixed-initiative translation approach was explored.

The paper takes a pragmatic view of current AI systems and makes the realistic observation that they are not capable of completely replacing humans. There is an emphasis throughout the paper on leveraging the complementary strengths of both the human and the AI, which is practical.

Interesting observations were made in the Data Wrangler project with respect to proactive suggestions. When suggestions were presented up front, before the user had a chance to interact with the system, the feature received negative feedback and was ignored. But when the same suggestions were presented while the user was engaging with the system, they were received positively, even when they were not related to the user’s current task. Users viewed themselves as the initiators in the latter scenario and hence felt that they were controlling the system. This observation was fascinating, since it shows that when designing such user interfaces, designers should ensure that users feel in control and do not feel insecure while using AI systems.

With respect to the second case study, it was reassuring to learn that the inclusion of automated support from the interactive system was able to shift user behavior for the better and helped broaden users’ understanding of the data. Another positive effect was that the system helped humans combat confirmation bias. This shows that if the interface is designed well, the benefits of AI amplify the results gained when humans apply their domain expertise.

  • The paper deals with designing interactive systems where the complementary strengths of agents and automation systems are leveraged. What could be the potential drawbacks of such systems, if any?
  • How would the findings of this paper be translated in the context of your class project? Is there potential to develop similar interactive systems to improve the user experience of the end-users?
  • Apart from the three case studies presented, what are some other domains where such systems can be developed and deployed?

04/08/2020 – Dylan Finch – CrowdScape: interactively visualizing user behavior and output

Word count: 561

Summary of the Reading

This paper describes a system for dealing with crowdsourced work that needs to be evaluated by humans. For complex or especially creative tasks, it can be hard to evaluate the work of crowd workers, because there is so much of it and most of it needs to be evaluated by another human. If the evaluation takes too long, you lose the benefits of using crowd workers in the first place.

To help with these issues, the researchers have developed a system that helps an evaluator deal with all of the data from the tasks. The system leans heavily on data visualization. Its interface shows the user an array of different metrics about the crowd work and the workers to help the user determine quality. Specifically, the system helps the user see information about worker output and behavior at the same time, giving a better indication of performance.

Reflections and Connections

I think that this paper tries to tackle a very important issue of crowd work: evaluation. Evaluation of tasks is not an easy process and for complicated tasks, it can be extremely difficult and, worst of all, hard to automate. If you need humans to review and evaluate work done by crowd workers, and it takes the reviewer a non-insignificant amount of time, then you are not really saving any effort by using the crowd in the first place. 

This paper is so important because it provides a way to make it easier for people to evaluate work done by crowd workers, making the use of crowd workers much more efficient, on the whole. If evaluation can be done more quickly, the data from the tasks can be used more quickly, and the whole process of using crowd workers has been made much faster than it was before. 

I also think this paper is important because it gives reviewers a new way to look at the work done by crowds: it shows the reviewer both worker output and worker behavior. This would make it much easier for reviewers to decide if a task was completed satisfactorily or not. If we can see that a worker did not spend a lot of time on a task and that their work was significantly different from other workers assigned to the same task, we may be able to tell that that worker did a bad job, and their data should be thrown out.

From the pictures of the system, it does look a little complicated, and I would be concerned that it is hard to use. Having a system that saves time but takes a long time to fully understand can be just as bad as not having the time-saving system at all. So, I do think that some effort should be put into making the system look less intimidating and easier to use.

Questions

  1. What are some other possible applications for this type of software, besides the additional one mentioned in the paper?
  2. Do you think there is any way we could fully automate the evaluation of the creative and complex tasks focused on in this research?
  3. Do you think that the large amount of information given to users of the system might overwhelm them?
