02/25/2020 – Mohannad Al Ameedi – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

In this paper, the authors study the effect of updating an AI system on human-AI team performance. The study focuses on decision-making systems where users decide whether to accept the AI system's recommendation or make the decision manually. The authors call the experience users build up over the course of using the system a mental model. Improving the accuracy of the AI system might disturb the users' mental model and decrease the overall performance of the team. The paper gives two examples: a readmission system used by doctors to predict whether a patient will be readmitted, and another system used by judges, and it shows the negative impact that updates can have in both cases. The authors propose a platform in which users play an object-recognition game; the platform builds the users' mental model, gives them rewards, and collects feedback to improve overall system performance, which encompasses both the AI system's accuracy and its compatibility.
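
To make the compatibility idea concrete, here is a minimal sketch (my own illustration, not the authors' code) of one way an update's compatibility could be measured: the fraction of examples the old model got right that the new model still gets right. Function and variable names are hypothetical.

```python
import numpy as np

def compatibility(y_true, old_pred, new_pred):
    """Fraction of examples the old model got right that the new model still gets right.
    A low score means the update breaks users' mental models even if accuracy improves."""
    y_true, old_pred, new_pred = map(np.asarray, (y_true, old_pred, new_pred))
    old_correct = old_pred == y_true
    if old_correct.sum() == 0:
        return 1.0  # nothing to be compatible with
    return float((new_pred[old_correct] == y_true[old_correct]).mean())

# Toy example: the update keeps overall accuracy but introduces new mistakes
# on cases the old model handled correctly, so compatibility is only 0.5.
y_true   = [1, 0, 1, 1, 0, 1]
old_pred = [1, 0, 0, 1, 1, 1]   # 4/6 correct
new_pred = [1, 1, 1, 1, 0, 0]   # still 4/6 correct, but two new errors
print(compatibility(y_true, old_pred, new_pred))  # 0.5
```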

Reflection

I found the idea of compatibility very interesting. I always thought that the performance of the AI model on the validation set was the only factor that should be taken into consideration, and I never thought about the negative effect on the user experience or the user's mental model. Now I can see that the compatibility/performance tradeoff is key to deploying a successful AI agent.

At the beginning, I thought that the word compatibility was not the right term for the subject. My understanding was that compatibility in software systems refers to making sure a newer version of the system still works on different versions of the operating system, but now I think the user plays a role similar to the operating system when dealing with the AI agent.

Updating the AI system looks similar to updating the user interface of an application, where users might not like a newly added feature or the new way the system handles a task.

Questions

  • The authors mention the patient readmission and judge examples to demonstrate how an AI update might affect the users; are there any other examples?
  • The authors propose a platform that can get user feedback, but not in a real-world setting. Can we build a platform that gets feedback at run-time using reinforcement learning, where a reward is computed for each user action and the system adjusts whether to serve the current model or the previous one? (A rough sketch of this idea follows these questions.)
  • If we want to use crowdsourcing to improve the performance/compatibility of the AI system, the challenge will be building a mental model for each user, since different users will take different tasks and we have no control over choosing the same worker every time. Are there any ideas that can help in using crowdsourcing to improve the AI agent?
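
The following is a rough, hypothetical sketch of the run-time idea in the second question: treating the choice between the previous and the updated model as a two-armed bandit whose reward is whether the human-AI team's final decision was correct. The paper does not describe such a system; all names here are my own.

```python
import random

class ModelSelector:
    """Epsilon-greedy choice between the previous ("old") and updated ("new") model,
    rewarded by whether the team's final decision turned out to be correct."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {"old": 0, "new": 0}
        self.values = {"old": 0.0, "new": 0.0}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(["old", "new"])      # explore
        return max(self.values, key=self.values.get)  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]  # running mean

# Usage: after each user action, pass reward=1 if the team decision was correct, else 0.
selector = ModelSelector()
arm = selector.choose()
# ... serve the chosen model's recommendation, observe the outcome ...
selector.update(arm, reward=1)
```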


02/26/2020 – Mohannad Al Ameedi – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary

In this paper, the authors study the social roles of editing tools in Wikipedia and the way vandalism fighting is addressed. The authors focus on the effect of automated tools, like bots, and assisted editing tools on the distributed editing process used by the encyclopedia. Wikipedia allows anyone to edit the content of its articles, which makes maintaining the quality of the content a difficult task. The platform depends on a distributed social network of volunteers to approve or deny changes. Wikipedia uses a version control system to help users see the changes; it shows both versions of the edited content side by side, which allows the editor to see the change history. The authors mention that Wikipedia uses bots and automated scripts to help edit some content and fight vandalism, and they also describe different tools used by the platform to assist the editing process. A combination of humans, automated tasks, and assisted editing tools makes Wikipedia able to handle such a massive number of edits and fight vandalism attempts. Most earlier research on the editing process is outdated since it did not pay close attention to these tools, while the authors highlight the importance of these tools in improving the overall quality of the content and allowing more edits to be performed. These technological tools, like bots and assisted editing tools, changed the way humans interact with the system and have a significant social effect on the types of activities that are possible in Wikipedia.

Reflection

I found the idea of distributed editing and vandalism fighting in Wikipedia interesting. Given the massive amount of content in Wikipedia, it is very challenging to maintain high quality when anyone with access to the internet can make an edit. The internal version control and the assisted tools used to support the editing job at scale are amazing.

I also found the use of bots to automate edits for some content interesting. These automated scripts can help expedite content refresh in Wikipedia, but they can also cause errors. Some tools mentioned in the paper don't even show the bots' changes, so I am not sure whether there is some method that can measure the accuracy of these bots.

The concept of distributed editing is similar to the concept of a pull request on GitHub, where anyone can submit a change to an open-source project and only a group of owners or administrators can accept or reject the changes.

Questions

  • Since billions of people have smartphones nowadays, the amount of anonymous editing might significantly increase. Are these tools still efficient in handling such an increased volume of edits?
  • Can we use deep learning or machine learning to fight vandalism or spam? The edits performed on articles can be treated as a rich training dataset. (A toy sketch of this idea follows these questions.)
  • Why doesn't Wikipedia combine all the assisted editing tools into one tool that has the best of each? Do you think this is a good idea, or do more tools mean more innovation and more choices?
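
As a toy sketch of the machine learning idea in the second question, one could train a simple text classifier on edits labeled by whether they were reverted as vandalism. This is purely illustrative; the example data and labels are made up and not from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: text added by an edit, labeled 1 if reverted as vandalism.
edits  = ["JOHN IS STUPID!!!", "Added citation to 2019 census data",
          "buy cheap watches at ...", "Fixed typo in the second paragraph"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(edits, labels)

# Score a new edit; a high probability could flag it for assisted tools like Huggle to surface first.
print(clf.predict_proba(["this page is garbage lol"])[0][1])
```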


02/19/2020 – Ziyao Wang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

The authors introduce the fact that in a human-AI hybrid decision-making system, updates aimed at improving the accuracy of the AI system may harm the teamwork. Experienced workers who are advised by the AI system have built a mental model of it, which improves the correctness of the team's results. However, updates that improve the accuracy of the AI system may create a mismatch between the updated model and the worker's mental model. As a result, the user can no longer make appropriate decisions with the help of the AI system. In this paper, the researchers propose a platform named CAJA, which can help evaluate the compatibility between AI and humans. With the results from experiments using CAJA, developers can learn how to make updates compatible while still achieving high accuracy.

Reflection:

Before reading this paper, I thought it was always good to have an AI system with higher accuracy. However, this paper gives me a new point of view. Instead of considering only the performance of the system, we should also consider the cooperation between the system and human workers. In this paper, updates to the AI system can destroy the mental model in the human's mind. Experienced workers have built a good cooperative relationship with the AI tools: they know which pieces of advice should be taken and which ones may contain errors. If a patch makes the system more accurate overall while reducing the correctness of the part that humans trust, the accuracy of the whole hybrid system will also be reduced. Humans may not trust the updated system until they reach a new balance with it, and during this period the performance of the hybrid system may drop to a level even worse than keeping the previous, un-updated system. For this reason, developers should try to maximize the performance of the system before releasing the application to users. As a result, new updates will not make large changes to the system, and humans can become familiar with the updated system more easily.

We can learn from this that we should never ignore the interaction between humans and the AI system. A good design of the interaction can improve the performance of the whole system, while a system with poor human-AI interaction may harm it. When we implement a system that needs both human affordances and AI affordances, we should pay more attention to the cooperation between human and AI. We should leverage the affordances of both sides instead of focusing only on the AI system, and we should put ourselves in the position of the designer of the whole system, with a view of the overall situation, rather than consider ourselves just programmers focused only on the program.

Questions:

What are the criteria for deciding whether an update is compatible or not?

Would releasing instructions for each update to the users be valuable in reducing the harm of updates?

If we have a new version of the system that greatly improves accuracy, but the users' mental model is totally different from it, how do we reach a balance that maximizes the performance of the whole hybrid system?


02/19/2020 – The Work of Sustaining Order in Wikipedia – Myles Frantz

Given an extensive website such as Wikipedia, there is bound to be an abundance of actors, both good and bad. With the scale and wide ruleset of the popular site, it would be nigh impossible for human moderators to handle the workload and cross-examine each page in depth. To alleviate this, programs that use machine learning were created to help track users' activity on the site in a single repository. Once all the information is gathered there, a user acting in a malicious way can easily be caught by the system and their edits auto-reverted based on the machine learning predictions. Such was the case for the user in the case study, who attempted to slander a famous musician but was caught quickly and with ease.

I absolutely agree with all the moderation going on around Wikipedia. Given the site's domain, there are a vast number of pages that must be secured and protected (all to the same level). It is unrealistic to expect a non-profit website to be able to hire more manual workers to accomplish this task (in contrast to YouTube or Facebook). Also, the context that must be followed in order to fully track a malicious user down manually would be completely exhausting. On the security side, for malware tracking there is a vast array of decompilers, raw binary program tracers, and even a custom virtual machine and operating system (Security Onion) that contains a variety of programs out of the box, ready to track the full environment of the malware.

I disagree with one of the major issues raised, regarding the bots creating and executing their own moral agenda. This behavior is entirely learned and based on various factors (such as the rules, the training data, and correction values). Though the bots have the power to automatically revert and edit someone else's page, these actions are taken at the discretion of the person who created the rules. There will likely be some issues, but that is part of the overall learning process. False positives can also be appealed if the author chooses to follow through, so it is not a fully final decision.

  • I would expect that with such a tool suite there would be a tool acting as a combination, a "Visual Studio Code"-like interface for all these tools. Having all these tools at the ready is useful; however, since time is of the essence, a tool wrapping all the common functions would be very convenient.
  • I would like to know how many reviews from moderators are completely biased. A moderator workforce should ideally be unbiased; however, realistically that is unlikely to fully happen.
  • I would also like to see the percentage of false positives, even in a system this robust. New moderators are likely to flag or unflag something incorrectly if they are unfamiliar with the rules.


2/19 – Dylan Finch – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Word count: 517

Summary of the Reading

This paper seeks to make it faster and easier for teams to find their ideal team structure. While many services allow teams to test out different team structures to find the best one, many of those services take a lot of time and can greatly affect the people who work on the team. Oftentimes teams have to switch structures so often that it becomes hard for them to concentrate on getting work done.

The method proposed in the paper seemed to be very successful, resulting in teams that were 38-46% more effective. The system works by testing different team structures and using automatically generated feedback (like performance metrics) to figure out how effective each structure is; it then bases its future combinations on this feedback. Each time a new structure is tested, it varies along five dimensions: hierarchy, interaction patterns, norms of engagement, decision-making norms, and feedback norms.
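
To make this concrete, here is a minimal sketch of the kind of per-dimension multi-armed bandit exploration described above, where each dimension's bandit picks a value and is rewarded with a team performance score. This is my own simplified illustration under assumed names and values, not the DreamTeam implementation.

```python
import random

class DimensionBandit:
    """Epsilon-greedy bandit for one team-structure dimension (e.g. hierarchy)."""
    def __init__(self, values, epsilon=0.2):
        self.values = values
        self.epsilon = epsilon
        self.counts = {v: 0 for v in values}
        self.means = {v: 0.0 for v in values}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.values)        # explore a different structure
        return max(self.means, key=self.means.get)   # exploit the best-known structure

    def feedback(self, value, reward):
        self.counts[value] += 1
        self.means[value] += (reward - self.means[value]) / self.counts[value]

# One bandit per dimension; the reward is a team performance metric after each round.
bandits = {
    "hierarchy": DimensionBandit(["leader", "flat"]),
    "interaction": DimensionBandit(["round-robin", "open"]),
}
structure = {dim: b.choose() for dim, b in bandits.items()}
# ... the team works a round under `structure`, producing a performance score ...
score = 0.7
for dim, b in bandits.items():
    b.feedback(structure[dim], score)
```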

Reflections and Connections

I think that this paper has an excellent idea for a system that can help teams to work better together. One of the most important things about a team is how it is structured. The structure of a team can make or break its effectiveness, so getting the structure right is very important to making an effective team. A tool like this that can help a team figure out the best structure with minimal interruption will be very useful to everyone in the business world who needs to manage a team. 

I also thought that it was a great idea to integrate the system into Slack. When I worked in industry last summer, all of the teams at my company used Slack. So, it makes a lot of sense to implement this new system within a tool that people are already familiar with. The use of Slack also allows the creators to make the system more friendly. I think it is much better to get feedback from a human-like Slack bot than from some heartless computer program. It is also very cool how the team members can interact with the bot in Slack.

I also found the dimensions that they used in the team structures to be interesting. It is valuable to be able to classify teams in some concrete way based on certain dimensions of how they perform. This also has a lot of real world applications. I think that a lot of the time, one of the hardest things in any problem space is just to quantify the possible states of the system. They did this very nicely with the team dimensions and all of their values. 

Questions

  1. Would you recommend this system to your boss at your next job as a way to figure out how to organize the team?
  2. Aside from the ones listed in the paper, what do you think could be some limitations of the current system?
  3. Do you think that the possible structures had enough dimensions and values for each dimension?


2/19 – Dylan Finch – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Word count: 565

Summary of the Reading

This paper analyzes the autonomous technologies used on Wikipedia. These technologies help to keep the peace on the large platform, flagging malicious users and reverting inaccurate and spammy changes so that Wikipedia stays accurate and up to date. Many people may think that humans play the major role in policing the platform, but machines and algorithms also play a very large part, aiding the humans in dealing with the large number of edits.

Some tools are completely automated and can prevent vandalism with no human input. Other tools give human contributors tips to help them spot and fight vandalism. Humans work together with the automated systems and each other to edit the site and keep the pages vandal free. The way in which all of the editors edit together, even though they are not physically together or connected as a team, is an impressive feat of human and AI interaction.

Reflections and Connections

To start, I think that Wikipedia is such an interesting thing to examine for a paper like this. While many organizations have a similar structure, I think that Wikipedia is unique and interesting to study because it is so large, so distributed, and so widely used. It can be hard enough to get a small team of people to work together on documentation; at Wikipedia's size the complexities of making it all work must be unimaginable. It is so interesting to find out how machines and humans work together at that scale to keep the site running smoothly. The ideas and analysis seen here can easily be applied to smaller systems that are trying to accomplish the same thing.

I also think that this article serves as a great reminder of the power of AI. The fact that AI is able to do so much to help editors keep the site running smoothly, even with all of the complexities of the site, is amazing, and it shows just how much power AI can have when applied to the right situation. A lot of the work done on Wikipedia is not hard work. The article mentions some of the things that bots do, like importing data and fixing grammatical mistakes. These things are incredibly tedious for humans to do, and yet they are perfect work for machines, which can do them almost instantly while they might take a human an hour. This not only serves as a great reminder of the power of AIs and humans complementing each other's abilities, but it also shows what the internet can do. Something like this never would have been possible before in the history of human civilization. The mere fact that we can do something like this now speaks to the amazing power of the current age.

Questions

  1. Does this research have applications elsewhere? What would be the best place to apply this analysis?
  2. Could this process ever be done with no human input whatsoever? Could Wikipedia one day be completely self-sufficient?
  3. This article talks a lot about how the bots of Wikipedia are becoming more and more important, compared to the policies and social interactions between editors. Is this happening elsewhere? Are there bots other places that we might not see and might not notice, even though they are doing a larger and larger share of the work?


02/19/20 – Lulwah AlKulaib – Dream Team

Summary

The authors mention that previous HCI research focused on ideal team structures and how roles, norms, and interaction patterns are influenced by systems. That line of research directed teams towards those structures by increasing shared awareness, adding channels of communication, and convening effective collaborators. Yet organizational behavior research denies the existence of universally ideal team structures and holds that structural contingency theory has demonstrated that the best team structure depends on the task, the members, and other factors. The authors introduce DreamTeam, a system that identifies effective team structures for each team by adapting teams to different structures and evaluating each fit. DreamTeam explores over time, experimenting with values along many dimensions of team structure such as hierarchy, interaction patterns, and norms. The system uses feedback, such as team performance or satisfaction, to iteratively identify the team structures that best fit each team. It helps teams identify the structures that are most effective for them by experimenting with different structures over time using multi-armed bandits.

Reflection

The paper presented a system that focuses on virtual teams. In my opinion, the presented system is a very specific application to a very specific problem. The authors acknowledge a long list of limitations, including that they don't believe their system generalizes easily to other problems. I also believe that the way they utilize feedback in the system is complex and unclear; their description of the reward function did not explain how qualitative factors were taken into consideration. The authors mention that high-variance tasks would require more time for DreamTeam to converge.

That means more time to get a response from the system, and I don't know how useful that would be if it slows teams down. Also, looking at the snapshot of the Slack integration, it seems they measure team satisfaction based on users' responses to a task, which is not always how collaboration on Slack works; the enthusiasm of the responses just seems out of the norm. The authors did not address how their system would measure "team satisfaction" when there is little to no response. Would that be counted as a negative response, or would it be neutral? And even though their system worked well for the very specific task they chose, it was also a virtual team, which raises questions about how applicable this method would be for in-person or hybrid teams. Their controlled environment was very controlled. Even though they presented a good idea, I doubt how applicable it is to real-life situations.

Discussion

  • In your opinion, what makes a dream team?
  • Are you pro or against ideal team structures? Why?
  • What were the qualities of collaborators in the best group project/research you had?
  • What makes the “chemistry” between team members?
  • What does a successful collaborative team project look like during a cycle?
  • What tools do you use in project management? 
  • Would you use DreamTeam in your project?
  • What would you change in DreamTeam to make it work better for you?


02/19/2020 – In Search of the Dream Team – Subil Abraham

How do you identify the best way to structure your team? What kind of leadership setup should it have? How should team members collaborate and make decisions? What kind of communication norms should they follow? These are all important questions to ask when setting up a team, but answering them is hard because there is no right answer: every team is different as a function of its team members. So it is necessary to iterate on these dimensions and experiment with different choices to see which setup works best for a particular team. Earlier work in CSCW attempts this with "multi-armed bandits", where each dimension is independently experimented with by a so-called "bandit" (a computational decision maker) in order to collectively reach a configuration based on recommendations from each bandit for each dimension. However, this earlier work suffered from the problem of recommending too many changes and overwhelming the teams involved. Thus this paper proposes a version with temporal constraints that still provides the same benefits of exploration and experimentation while limiting how often changes are recommended, to avoid overwhelming the team.
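
As a rough sketch of what such a temporal constraint might look like on top of a per-dimension bandit, the bandit below holds its chosen value for a minimum number of rounds before it is allowed to reconsider, so the team is not asked to change structure too frequently. This is my own simplified illustration, not the paper's algorithm.

```python
import random

class TemporallyConstrainedBandit:
    """Epsilon-greedy bandit that keeps its chosen arm for `hold_rounds` rounds
    before reconsidering, limiting how often the team structure changes."""
    def __init__(self, arms, hold_rounds=3, epsilon=0.2):
        self.arms, self.hold_rounds, self.epsilon = arms, hold_rounds, epsilon
        self.means = {a: 0.0 for a in arms}
        self.counts = {a: 0 for a in arms}
        self.current, self.rounds_held = None, 0

    def choose(self):
        if self.current is not None and self.rounds_held < self.hold_rounds:
            self.rounds_held += 1   # keep the current structure; no disruptive switch yet
            return self.current
        if random.random() < self.epsilon:
            self.current = random.choice(self.arms)
        else:
            self.current = max(self.means, key=self.means.get)
        self.rounds_held = 1
        return self.current

    def feedback(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```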

This is my first exposure to this kind of CSCW literature, and I find it a very interesting look into how computational decision makers can help make better teams. The idea of a computational agent looking at the performance of teams and how they're functioning and making recommendations to improve the team dynamics intuitively makes sense, because the team members themselves either can't take an objective view because of their bias, or could be afraid to make recommendations or propose experimentation for fear of upsetting the team dynamic. The fact that this proposal is about incorporating temporal constraints into these systems is also a cool idea, because of course humans can't deal with frequent change; that would be very overwhelming. Having an external arbiter do that job instead is very useful. I wonder whether the failure of the human managers to experiment is because humans in general are risk-averse, or because the managers that were picked were particularly risk-averse. This ties into my next complaint, about the experiment sizes; both in the manager section and overall, I find the experiment size awfully small. I feel like you can't capture proper trends, especially sociological trends such as those discussed in this paper, with experiments of just 10 teams. I feel a larger experiment should have been done to identify larger trends before this paper was published. Assuming that the related earlier work with multi-armed bandits had similar experiment sizes, those should have been larger experiments as well before they were published.

  1. Could we expand the DreamTeam recommendations so that, in addition to recommending changes along the different dimensions, it is also able to recommend more specific things? The main thing I was thinking of was: if it is changing the hierarchy to a leader-based setup, could it also recommend a leader, or explicitly recommend that people vote on a leader, rather than just saying "hey, you guys now need to work with a leader-type setup"?
  2. Considering how limited the feedback DreamTeam can get is, what else could be added beyond just looking at the scores at different time steps?
  3. What would it take for managerial setups to be less loss-averse? Is the point of creating something like DreamTeam to help and push managers to have more confidence in instituting change, or is it just to have a robot take care of everything, without managers at all?


02/19/2020 – The Work of Sustaining Order in Wikipedia – Subil Abraham

This paper is a very interesting inside look at how the inner cogs of Wikipedia function, particularly relating to how vandalism is managed with the help of automated software tools. The tools developed unofficially by Wikipedia contributors were created out of necessity in order to a) make it easier to identify bad actors, b) automate and speed up reversions of vandalism, and c) give power to non-experts to police obvious vandalism, such as changed or deleted sections, without needing a subject matter expert to do a full review of the article. The paper uses trace ethnography to study the usage of these tools and puts forth an interesting case study of a vandal defacing various articles: through distributed actions by various volunteers, assisted by these tools, the vandal was identified, warned for their repeated offenses, and finally banned as their egregious actions continued, all within the span of 15 minutes and with no explicit coordination among the volunteers.

I find this to be a fascinating look at distributed cognition in action, wherein multiple independent actors are able to take independent actions that produce a cohesive result (in the case study, multiple volunteers and automated tools identifying a vandal and issuing warnings, ultimately resulting in their ban). I find myself thinking of the work of these tools as an equivalent to the human body's unconscious activities. For example, the act of walking is incredibly complex, involving precise coordination of hundreds of muscles all moving at the right moments. However, we do not have to think any harder than "I want to get from here to there," and our body handles the rest. That's kind of what these tools feel like: something that handles the complex busywork and leaves the big decisions to us. I am wondering, though, how things have changed since 2009. The paper mentions that the bots tend to ignore changes made by other bots, because presumably those other bots are being managed by other volunteers, but the bot configuration can be changed so that it explicitly monitors other bots. I wonder how much of that functionality is used now, because I am sure Wikipedia now has to deal with a lot more politically motivated vandalism, and much of it is being done by bots. Reddit is a big victim of this, so it is not hard to imagine Wikipedia faces the same problem. Of course, the adversarial bots would be a lot more clever than just pretending to be a friendly bot, because that might not cut it anymore. It's still an important thing to think about.

  1. How would the functionality of Huggle and its ilk fare in the space of Reddit’s automoderator, and vice versa? Are they dealing with fundamentally different things or is there overlap?
  2. How has dealing with vandalism changed on Wikipedia in the decade since this paper was published?
  3. Is there a place for a hierarchy of bots, where lower-level bots scan for vandalism and higher-level bots make the decisions about banning, all with minimal human intervention? Or will there always need to be active human participation?


02/19/20 – Fanglan Chen – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

Bansal et al.'s paper "Updates in Human-AI Teams" explores an interesting problem: the influence of updates to an AI system on overall team performance. Nowadays, AI systems have been deployed to support human decision making in high-stakes domains including criminal justice and healthcare. In the working process of a team of humans and AI systems, humans make decisions with reference to the AI's inferences. A successful partnership requires that the human develop an understanding of the AI system's performance, especially its error boundary. Updates with higher-performing algorithms can potentially increase the AI's predictive accuracy. However, they may require humans to regain interactive experience and rebuild their confidence in the AI system, and this adjustment process may actually hurt team performance. The authors introduce the concept of compatibility between an AI update and prior user experience and present methods for studying the role of compatibility in human-AI teams. Extensive experiments on three high-stakes classification tasks (recidivism, credit risk, and mortality) demonstrate that current AI systems are often not updated in a compatible way, resulting in decreased team performance after updating. To improve the compatibility of an update, the authors propose a re-training objective that penalizes new failures of the AI system. Their compatible updates achieve a good balance of the performance/compatibility trade-off across tasks.
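
As an illustration of the kind of re-training objective described above, here is a hedged sketch: a standard cross-entropy loss plus a penalty that up-weights the loss on examples the old model classified correctly, discouraging new errors there. The exact weighting and formulation are my own simplification, not the authors' precise loss.

```python
import numpy as np

def compatible_update_loss(y_true, new_probs, old_preds, lam=0.5):
    """Cross-entropy loss with an extra penalty (weight lam) on examples the old
    model got right, discouraging 'new' errors that would break users' trust.
    A simplified sketch; the paper's exact objective may differ."""
    y_true = np.asarray(y_true, dtype=float)
    new_probs = np.clip(np.asarray(new_probs, dtype=float), 1e-9, 1 - 1e-9)  # new model's P(y=1)
    old_correct = (np.asarray(old_preds) == y_true).astype(float)
    ce = -(y_true * np.log(new_probs) + (1 - y_true) * np.log(1 - new_probs))
    return float(np.mean(ce * (1.0 + lam * old_correct)))

# Errors on examples the old model handled correctly cost (1 + lam) times more.
print(compatible_update_loss([1, 0, 1], [0.9, 0.2, 0.4], [1, 0, 0]))
```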

Reflection

I think making AI and humans a team to take full advantage of the collaboration is a pretty neat idea. Humans are born with the ability to adapt in the face of an uncertain and adverse world and the capacity for logical reasoning. Machines cannot perform well in those areas, but they can compute efficiently and free people for higher-level tasks. Understanding how machines can efficiently enhance what humans perform best, and how humans can augment the work scope of machines, is the key to rethinking and redesigning current decision-making systems.

What I find interesting about the research problem discussed in this paper is that the authors focus on the idea of unifying decisions made by humans and machines, rather than merely on task performance, when recommending updates. In machine learning with no human involved, the goal is usually to achieve better and better performance as evaluated by metrics such as accuracy, precision, and recall. Compatible updates can be seen as machine learning models with similar decision boundaries but better performance, which seems to be an even more difficult task to accomplish. To get there, humans need to perform crucial roles. First, humans must train machines to achieve good performance on certain tasks. Next, humans need to understand and be able to explain the outcomes of those tasks, especially where AI systems fail; that requires an interpretability component in the system. As AI systems increasingly draw conclusions through opaque processes (the so-called black-box problem), there is a large demand for human experts in the field to explain model behavior to non-expert users. Last but not least, humans need to sustain the responsible use of AI systems by, for example, updating them for better decision making, as discussed in the paper. That would require a large body of human experts who continually work to ensure that AI systems are functioning properly, safely, and responsibly.

The above discussion is one side of the coin, focusing on how humans can extend what machines can achieve. The other side is comparatively less discussed in the current literature: beyond extending physical capabilities, how humans can learn from the interaction with AI systems and enhance their individual abilities is an interesting question to explore. I would imagine that in an advanced human-AI team, humans and AI systems communicate in a more interactive way that allows both to learn collaboratively from their own mistakes and from the rationale of the correct decisions made by the other. That leads to another question: if AI systems can exceed or rival humans in high-stakes decision making such as recidivism prediction and underwriting, how risky is it to hand these tasks to machines? How can we decide when to let humans take control?

Discussion

I think the following questions are worthy of further discussion.

  • What can humans do that machines cannot and vice versa?
  • What is the goal of decision making and what factors are stopping humans or machines from making good decisions? 
  • In the human-AI teams discussed in the paper, how can humans benefit from the interaction with the AI systems?
  • The partnership introduced by the authors is more like a human-assisting-machine approach. Can you provide some examples of machine-assisting-human approaches?
