02/25/2020 – Mohannad Al Ameedi – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

In this paper, the authors study the effect of updating an AI system on human-AI team performance. The study focuses on decision-making systems in which users decide whether to accept the AI system’s recommendation or to make the decision through a manual process. The authors call the experience that users build up over the course of using the system a mental model. Improving the accuracy of the AI system might disturb the users’ mental model and decrease the overall performance of the team. The paper mentions two examples, a readmission system that doctors use to predict whether a patient will be readmitted and another system used by judges, and shows the negative impact of updates on both. The authors propose a platform on which users recognize objects, build a mental model, and receive rewards and feedback, which can be used to improve overall system performance, encompassing both AI performance and compatibility.

Reflection

I found the idea of compatibility very interesting. I always thought that the performance of the AI model on the validation set was the most important, and indeed the only, factor that should be taken into consideration. I never thought about the negative effect on the user experience or the user’s mental model, and now I can see that the tradeoff between compatibility and performance is key to deploying a successful AI agent.

At the beginning, I thought that the word compatibility was not the right term to describe the subject. My understanding was that compatibility in software systems refers to making sure that a newer version of the system still works on different versions of the operating system, but now I think the user takes a role similar to the operating system when dealing with the AI agent.

Updating the AI system looks similar to updating the user interface of an application, where users might not like a newly added feature or the new way the system handles a task.

Questions

  • The authors mention the patient readmission and judge examples to demonstrate how an AI update might affect users. Are there any other examples?
  • The authors propose a platform that can collect user feedback, but not in a real-world setting. Can we build a platform that collects feedback at run time using reinforcement learning, where a reward is calculated for each user action and the system adjusts its choice between the current model and the previous model?
  • If we want to use crowdsourcing to improve the performance/compatibility of the AI system, the challenge will be building a mental model for the user, since different workers will take different tasks and we cannot ensure that the same worker is chosen every time. Are there any ideas that could help in using crowdsourcing to improve the AI agent?

02/19/2020 – Ziyao Wang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

The authors point out that in a human-AI hybrid decision-making system, updates aimed at improving the accuracy of the AI system may harm teamwork. Experienced workers who are advised by an AI system have built a mental model of it, which improves the correctness of the team’s results. However, an update that improves the accuracy of the AI system may create a mismatch between the updated model and the worker’s mental model. As a result, the user can no longer make appropriate decisions with the help of the AI system. In this paper, the researchers propose a platform named CAJA, which helps evaluate the compatibility between AI and humans. With the results of experiments using CAJA, developers can learn how to make updates compatible while keeping accuracy high.

Reflection:

Before reading this paper, I believed that an AI system with higher accuracy is always better. However, this paper gave me a new point of view. Instead of looking only at the performance of the system, we should also consider the cooperation between the system and human workers. In this paper, updates to the AI system disrupt the mental model in the human’s mind. Experienced workers have built a good working relationship with the AI tools. They know which advice should be taken and which may contain errors. If a patch makes the system more accurate overall while reducing its correctness on the cases that humans trust, the accuracy of the whole hybrid system will also be reduced. Humans may not trust the updated system until they reach a new balance with it. During this period, the performance of the hybrid system can drop to a level even worse than that of keeping the previous, non-updated system. For this reason, developers should also try to maximize the performance of the system before releasing the application to users. As a result, new updates will not make large changes to the system, and humans can become familiar with the updated system more easily.

We can learn from this that we should never ignore the interaction between humans and the AI system. A well-designed interaction can improve the performance of the whole system, while a system with poor human-AI interaction can harm it. When we implement a system that needs both human affordances and AI affordances, we should pay more attention to the cooperation between human and AI. We should leverage the affordances of both sides instead of focusing only on the AI system. We should put ourselves in the position of the designer of the whole system, with a view of the overall situation, rather than seeing ourselves only as programmers who focus on the program.

Questions:

What are the criteria for deciding whether an update is compatible or not?

Would releasing instructions for each update to the users be valuable in reducing the harm of updates?

If we have a new version of the system that greatly improves accuracy but is totally different from the users’ mental model, how can we reach a balance that maximizes the performance of the whole hybrid system?

02/19/20 – Fanglan Chen – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

Bansal et al.’s paper “Updates in Human-AI Teams” explores an interesting problem: the influence of updates to an AI system on overall team performance. Nowadays, AI systems are deployed to support human decision making in high-stakes domains including criminal justice and healthcare. When a team of humans and AI systems works together, humans make decisions with reference to the AI’s inferences. A successful partnership requires that the human develop an understanding of the AI system’s performance, especially its error boundary. Updates with higher-performing algorithms can potentially increase the AI’s predictive accuracy. However, they may require humans to regain interactive experience and rebuild their confidence in the AI system, and this adjustment process may actually hurt team performance. The authors introduce the concept of compatibility between an AI update and prior user experience and present methods for studying the role of compatibility in human-AI teams. Extensive experiments on three high-stakes classification tasks (recidivism, credit risk, and mortality) demonstrate that current AI systems are not provided with compatible updates, resulting in decreased performance after updating. To improve the compatibility of an update, the authors propose a re-training objective that penalizes new failures by the AI system. Their proposed compatible updates achieve a good balance in the performance/compatibility tradeoff across different tasks.

Reflection

I think having AI and humans work as a team to take full advantage of the collaboration is a pretty neat idea. Humans are born with the ability to adapt in the face of an uncertain and adverse world and with the capacity for logical reasoning. Machines cannot perform well in those areas but can achieve efficient computation and free people for higher-level tasks. Understanding how machines can efficiently enhance what humans do best and how humans can augment the work scope of machines is the key to rethinking and redesigning current decision-making systems.

What I find interesting about the research problem discussed in this paper is that, when recommending updates, the authors focus on unifying the decisions made by humans and machines rather than merely on task performance. In machine learning with no human involved, the goal is usually to achieve better and better performance, evaluated by metrics such as accuracy, precision, and recall. Compatible updates can be seen as machine learning algorithms with similar decision boundaries but better performance, which seems to be an even more difficult task to accomplish. To get there, humans need to perform crucial roles. Firstly, humans must train machines to achieve good performance on certain tasks. Next, humans need to understand and be able to explain the outcomes of those tasks, especially where AI systems fail. That requires an interpretability component in the system. As AI systems increasingly draw conclusions through opaque processes (the so-called black-box problem), there is a large demand for human experts in the field to explain model behavior to non-expert users. Last but not least, humans need to sustain the responsible use of AI systems by, for example, updating for better decision making as discussed in the paper. That would require a large body of human experts who continually work to ensure that AI systems are functioning properly, safely, and responsibly.

The above discussion is one side of the coin, focusing on how humans can extend what machines can achieve. The other side is comparatively less discussed in the current literature. Apart from extending physical capabilities, how humans can learn from interaction with AI systems and enhance their individual abilities is an interesting question to explore. I would imagine that, in an advanced human-AI team, humans and AI systems communicate in a more interactive way that allows for collaborative learning from their own mistakes and from the rationale behind the correct decisions made by each other. That leads to another question: if AI systems can rival or exceed humans in high-stakes decision making such as recidivism prediction and underwriting, how risky is it to hand the tasks over to machines? How can we decide when to let humans take control?

Discussion

I think the following questions are worthy of further discussion.

  • What can humans do that machines cannot and vice versa?
  • What is the goal of decision making and what factors are stopping humans or machines from making good decisions? 
  • In the human-AI teams discussed in the paper, how can humans benefit from the interaction with the AI systems?
  • The partnership introduced by the authors is more like a human-assisting-machine approach. Can you provide some examples of machine-assisting-human approaches?

2/19/20 – Lee Lisle – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

            Bansal et al. discuss how human-AI teams work in solving high-stakes issues such as hospital patient discharging scenarios or credit risk assessment. They point out that the humans in these teams often create a mental model of the AI suggestions, where the mental model is an understanding of when the AI is likely wrong about the outcome. The authors then show that updates to the AI can produce worse performance if they are not compatible with the already formed mental model of the human user. They go on to define types of compatibility for AI updates, as well as a few other key terms relating to human/AI teams. They develop a platform to measure how compatibility can affect team performance, and then measure the effect of AI update compatibility through a user study using 25 mTurk workers. In all, they show that incompatible updates reduce performance as compared to no update at all.

Personal Reflection

            The paper was an interesting study in the effect of pushing updates without considering the user involved in the process. I hadn’t thought of the human as an exactly equal player in the team, where the AI likely has more information and could provide a better suggestion. However, it makes sense that the human leverages other sources of information and forms a better understanding of what choice to ultimately make.

            CAJA, the human/AI simulation platform, seems like a good way to test AI updates; however, I struggle to see how it can be used to test other theories, as the authors seem to suggest. It is, essentially, a simple user-learning game, where users figure out when to trust the machine and when to deviate. While this isn’t exactly my field of expertise, I only see the chance to change information flows and the underlying AI as ways of learning new things about human/AI collaboration. This would mean that calling it a platform is a little excessive.

Questions

  1. The authors mention that, in order to defeat mTurk scammers who click through projects like these quickly, they drop the lowest quartile (in terms of performance) out of their results. Do you think this was an effective countermeasure, or could the authors be cutting good data?
  2. From other sources, such as Weapons of Math Destruction, we can read how some AI suggestions are inherently biased (even racist) due to input data. How might this change the authors’ results? Do you think this is taken into consideration at all?
  3. One suggestion near the end of the paper stated that, if pushing an incompatible update, the authors of the AI should make the change explicit so that the user could adjust accordingly. Do you think this is an acceptable tradeoff to not creating a compatible update? Why or why not?
  4. The authors note that, as the error boundary f became more complex, errors increased, so they kept to relatively simple boundaries. Is this an effective choice for this system, considering real systems are extremely complex? Why or why not?
  5. The authors state that they wanted the “compute” cost to be net 0. Does this effectively simulate real-world experiences? Is the opportunity-cost the only net negative here?

02/19/2020 – Updates in Human-AI Teams: Understanding and Addressing the Performance / Compatibility Trade off – Yuhang Liu

This paper first describes the complementarity between humans and artificial intelligence. In many cases, humans and artificial intelligence form a team in which people make decisions after checking the inferences of the AI. This cooperation model has applications in many fields and has achieved significant results. Usually, this kind of achievement requires certain prerequisites. First, people must have their own judgments about the conclusions of the artificial intelligence. At the same time, the results of the artificial intelligence must be accurate. The tacit cooperation between the two can improve efficiency. However, when the artificial intelligence system is updated and its data expands, this cooperation can be broken. The updated AI may make errors it did not make before, and because its error boundary has changed, people’s understanding of the AI no longer holds. So after the system update, the efficiency of the team is reduced instead. This paper mainly studies this situation. The authors want the updated system to remain compatible with the previous one, so they propose several methods to achieve more compatible and accurate updates.

The paper also notes that this idea is obtained by analogy. In software engineering, if the updated system can support legacy software, it is compatible after the update. I strongly agree with this kind of analogy, which is similar to bionics. We can continuously bring new ideas into the computer field through this kind of thinking. The method described in this paper is also very necessary. In the ordinary process of artificial intelligence or machine learning, we usually build data sets anew each time and lack the concept of inheritance, which is very inconvenient. Adopting the idea of compatibility will save a great deal of effort and allow the system to serve people more smoothly.

This article introduces CAJA, a platform for measuring the impact of AI performance and updates on team performance. At the same time, a practical retraining objective is introduced to improve update compatibility. The main idea is to improve update compatibility by penalizing new errors. But it can also be seen from the text that trust is the core of teamwork. Admittedly, trust is the essence of a team, but it is only the basis of the work; I think more simulations and improvements are needed to account for the human side. Considering both human problem solving and machine learning, we know that when people learn new things, their previous skills are not harmed; instead, they gain more perspectives and methods for thinking about a problem. So I think that humans and machines should be treated together, as one team, so that the results are more compatible and the human-machine interaction is more successful.

Questions:

  1. What are the implications of compatible AI updates?
  2. How can we better treat people and machines as a whole?
  3. Will compatible AI updates affect the final training results?

02/19/2020 – Nan LI – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary:

In this paper, the authors observe that it is now common for humans and AI to form a team to make decisions, in which the AI provides recommendations and humans decide whether to trust them. In these cases, successful team collaboration is mainly based on human knowledge of the previous AI system’s performance. Therefore, the authors argue that although an update of the AI system may improve its predictive accuracy, it can hurt team performance, since the updated version of the AI system is usually not compatible with the mental model that humans developed with the previous system. To solve this problem, the authors introduce the concept of the compatibility of an AI update with prior user experience. To examine the role of this compatibility in human-AI teams, the authors propose methods and design a platform called CAJA to measure the impact of updates on team performance. The outcomes show that team performance can be harmed even when the updated system’s predictive accuracy improves. Finally, the paper proposes a re-training objective that can promote the compatibility of updates. In conclusion, to avoid diminishing team performance, developers should build more compatible updates without surrendering performance.

Reflection:

In this paper, the authors discuss a specific kind of interaction, AI-advised human decision making, as in the patient readmission system presented in the paper. In these cases, an incompatible update of the AI system would indeed harm team performance. However, I think the extent of the impact largely depends on the correlation between the human and the AI system.

If the system and the user are highly interdependent, neither is a specialist in the task, and the system’s prediction accuracy and the user’s knowledge have the same impact on the decision result, then an incompatible update of the AI system will weaken team performance. Even though this effect can be eliminated later as the user and the system adjust to each other, the cost of the affected decisions in a high-stakes domain will be very large.

However, if the system interacts with users frequently, but the system’s predictions are only one of the concerns for humans to make decisions and cannot directly affect the decision, then the impact of incompatible updates on team performance will be limited.

Besides, if humans have more expertise in the task and can promptly validate the correctness of a recommendation, then neither the improvement in system performance nor the new errors caused by the update will have much impact on the results. On the other hand, if the errors caused by the update do not affect team performance, then when updating the system we do not need to consider compatibility and only need to consider the improvement in system performance. In conclusion, if there is not enough interaction between the user and the system, the degree of interdependence is not high, or the system only serves as an auxiliary or double-check, then a system update will not have a great impact on team performance.

A compatible update is indeed helpful for users to quickly adapt to the new system, but I think the impact of update largely depends on the correlation between the user and the system, or the proportion of the system’s role in teamwork.

Besides, designing a compatible update also incurs extra cost. Therefore, I think we should consider minimizing the impact of system errors on the decision-making process when designing the system and establishing the human-AI interaction.

Question:

  1. What do you think about the concept of compatibility of AI updates?
  2. Do you have any examples of human-AI systems to which the authors’ theory applies?
  3. Under what circumstances do you think the authors’ theory is most useful, and when is it not applicable?
  4. When we need to update the system frequently, do you think it is better to build a compatible update or to use an alternative method to address the human adaptation costs?
  5. In my opinion, humans’ degree of adaptation is very high, and the cost required for humans to adapt is much smaller than the cost of developing a compatible update. What do you think?

Word Count: 682

02/19/2020 – Palakh Mignonne Jude – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

SUMMARY

In this paper, the authors talk about the impact that updates made to an AI model can have on overall human-machine team performance. They describe the mental model that a human develops through the course of interacting with an AI system and how this is affected when an update is made to the AI system. They introduce the notion of ‘compatible’ AI updates and propose a new objective that penalizes new errors (errors introduced in the new model that were not present in the original model). The authors introduce terms such as ‘locally-compatible updates’, ‘compatibility score’, and ‘globally-compatible updates’. They perform experiments in high-stakes domains such as recidivism prediction, in-hospital mortality prediction, and credit risk assessment. They also developed CAJA, a web-based game platform for studying human-AI teams, which the authors claim is designed so that no human is a task expert. CAJA enables designers to vary different parameters including the number of human-visible features, AI accuracy, reward function, etc.
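
To make the ‘compatibility score’ more concrete, here is a minimal sketch of how I understand it: the fraction of examples the old model classified correctly that the new model also classifies correctly. The function name, the toy labels, and the choice to return 1.0 when the old model has no correct predictions are my own assumptions for illustration, not code from the paper.

```python
from typing import Sequence


def compatibility_score(y_true: Sequence[int],
                        old_pred: Sequence[int],
                        new_pred: Sequence[int]) -> float:
    """Share of the old model's correct predictions that the new model keeps."""
    # Indices where the old model was right; these are the cases a user
    # may have learned to trust.
    old_correct = [i for i, (y, o) in enumerate(zip(y_true, old_pred)) if y == o]
    if not old_correct:
        # Assumption: with nothing to preserve, treat the update as compatible.
        return 1.0
    still_correct = sum(1 for i in old_correct if new_pred[i] == y_true[i])
    return still_correct / len(old_correct)


# Toy data (hypothetical): the new model is more accurate (4/5 vs 3/5)
# but introduces a new error on index 1, which the old model got right.
y_true = [1, 0, 1, 1, 0]
old_pred = [1, 0, 0, 1, 1]
new_pred = [1, 1, 1, 1, 0]
score = compatibility_score(y_true, old_pred, new_pred)
print(f"compatibility score: {score:.2f}")
```

In this toy case the update raises accuracy while keeping only two of the three previously correct predictions, which is the kind of situation the paper warns about.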

REFLECTION

I think this paper was very interesting as I have never considered the impact on team performance due to updates to an AI system. The idea of a mental model, as introduced by the authors of this paper, was novel to me as I have never thought about the human aspect of utilizing such AI systems that make various recommendations. This paper reminded me of the multiple affordances mentioned in the paper ‘An Affordance-Based Framework for Human Computation and Human-Computer Collaboration’, wherein humans and machines pursue a common goal and leverage the strengths of both.

I thought that it was good that they defined the notion of compatibility to include the human’s mental model, and I agree that developers retraining AI models are prone to focusing on improving the accuracy of a model and tend to ignore the details of human-AI teaming.

I was also happy to read that the workers used as part of the study performed in this paper were paid on average $20/hour as per the ethical guidelines for requesters.

QUESTIONS

  1. The paper mentions the use of Logistic Regression and multi-layer perceptron. Would a more detailed study on the types of classifiers that are used in these systems help?
  2. Would ML models that have better interpretability for the decisions made have given better initial results and prevented the dip in team performance? In such cases, would providing a simple ‘change log’ (as is done in a case of other software applications), have aided in preventing this dip in team performance or would it have still been confusing to the humans interacting with the system?
  3. How were the workers selected for the studies performed on the CAJA platform? Were there any specific criteria used to select such workers? Would the qualifications of the workers have affected the results in any way?

2/19/20 – Jooyoung Whang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

According to the paper, most developers of classification or prediction systems focus on the quality of the predictions but not on the system’s team performance with the user. The authors of this paper introduce the problem that may arise from the current model training loss criteria and provide new methods that address it. To develop a more detailed picture of the users’ interactions with a classifier system, the authors develop a web-based game system called Caja and conduct a user study using Amazon Mechanical Turk. They conclude that an increase in the performance of the system does not necessarily mean that the team performance of the system with its users also increases. They also confirm that their proposed training method using the new loss function and a new concept called Dissonance improves team performance.

I liked the authors’ new perspective on human-AI collaboration and model training. Now that I think of it, not considering the users of the system during development is contradictory to what the system is trying to achieve. One thing I was interested in and had thoughts about was their definition of Dissonance. The term links the old model of a system with the new, updated model in terms of user expectation. I saw that the term penalizes a system when the new model misclassifies inputs that the old model used to get right. However, what if the users of the old system made predictions according to how the system was wrong? This may be a weird concern and probably an edge case, but if the user made decisions based on the thought that the system was wrong all the time, the team performance of that person with the updated model will always be worse even if the new system was trained with the suggested loss function.
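
To make my reading of the Dissonance term concrete, here is a rough sketch, assuming a binary task, that the term is the new model’s log loss counted only on examples the old model got right, and that it is added to the usual loss with a weight. The names `log_loss`, `compatible_loss`, and `lam` are my own and hypothetical; this is not the authors’ implementation.

```python
import math
from typing import Sequence


def log_loss(y: int, p: float) -> float:
    """Binary log loss for one example; p is the predicted probability of class 1."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def compatible_loss(y_true: Sequence[int],
                    old_pred: Sequence[int],
                    new_prob: Sequence[float],
                    lam: float) -> float:
    """Mean of (log loss + lam * dissonance) over all examples.

    Assumption: dissonance equals the new model's log loss when the old
    model was correct on that example, and zero otherwise.
    """
    total = 0.0
    for y, o, p in zip(y_true, old_pred, new_prob):
        base = log_loss(y, p)
        dissonance = base if o == y else 0.0
        total += base + lam * dissonance
    return total / len(y_true)


# Toy data (hypothetical): the old model was right on the first example only.
y_true = [1, 0]
old_pred = [1, 1]
new_prob = [0.5, 0.5]
plain = compatible_loss(y_true, old_pred, new_prob, lam=0.0)
weighted = compatible_loss(y_true, old_pred, new_prob, lam=1.0)
print(plain, weighted)
```

With `lam=0.0` this reduces to the ordinary mean log loss. Larger values of `lam` make mistakes on previously correct examples more expensive, which is exactly why my edge case matters: the penalty only protects what the old model got right, not what the user believed about it.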

The following are the questions that I had while reading the paper:

1. As I have written in my reflection, do you think the new proposed training method will be effective if the users made decisions based on the idea that the system will always be wrong? Or is this too extreme and absurd a thought?

2. The design of Caja ensures that the user can never arrive at the solution by him or herself because too much about the problem domain is hidden from the user. However, this is often not the case in real-world scenarios. The user of the system is often also an expert in the related field. Does this reduce the quality and trustworthiness of the results of this research? Why or why not?

3. The research started from the idea that interaction with the users must be considered when making an update to an AI system. In this case, it was particularly for human-AI collaboration. What if it was the opposite? For example, there are AIs that are built to compete with humans like AlphaGo. These types of AIs are also developed with the goal of producing the most optimal solution to a given input without considering the interaction with the user. How can training be modified to include users for competing AIs?

02/19/2020 – Updates in Human-AI teams: Understanding and Addressing the Performance/Compatibility Tradeoff – Sushmethaa Muhundan

The paper studies human-AI teams in decision-making settings, specifically focusing on updates made to the AI component and their subsequent influence on the decision-making process of the human. In an AI-advised human decision-making interaction, the AI system recommends actions to the human. Based on this recommendation, their past experience, and their domain knowledge, the human makes an informed decision. They can choose to go ahead with the action recommended by the AI or they can choose to disregard the recommendation. During the course of their interaction with AI systems, humans develop a mental model of the system. This is developed by mapping scenarios where the AI’s decision was correct versus when it was incorrect, by means of rewards and feedback provided to the humans by the system. As part of the experiment, studies were conducted to establish relationships between updates to AI systems and team performance. User behavior was monitored using a custom platform, CAJA, built to gain insights about the influence of updates to AI models on the user’s mental model and consequently team performance. Consistency metrics were introduced and several real-world domains were analyzed, including recidivism prediction, in-hospital mortality prediction, and credit risk assessment.

It was extremely surprising to note that updates that improve the AI’s performance may actually hurt team performance. My initial instinct was that with an increase in the AI’s performance, the team performance would increase proportionally, but this is not always the case. In certain cases, despite an increase in the AI’s performance, the new results might not be consistent with the human’s mental model; as a result, incorrect decisions are made based on past interactions with the AI, and the overall team performance decreases. An interesting and relatable parallel is drawn to the concept of backward compatibility in software engineering with respect to updates. The concept of compatibility is introduced using this analogy to describe the ideal scenario where updates to the AI do not introduce further errors.

The platform developed to conduct the studies, CAJA, was an innovative way to overcome the challenges of testing in real-world settings. This platform abstracts away the specifics of problem solving by presenting a range of problems that distill the essence of mental models and trust in one’s AI teammate. It was very interesting to note that these problems were designed such that no human could be a task expert, thereby maximizing the importance of mental models and their influence on decision making.

  • What are some efficient means to share the summary of AI updates in a concise, human-readable form that captures the essence of the update along with the reason for the change?
  • What are some innovative ideas that can be used to reduce the cost incurred by re-training humans in an AI-advised human decision-making ecosystem?
  • How can developers be made more aware of the consequences of the updates made to the AI model on team performance? Would increasing awareness help improve team performance?

02/19/2020 – Nurendra Choudhary – Updates in Human-AI Teams

Summary

In this paper, the authors study human-AI team performance, in contrast to the individual performance of the human or the AI, and explain why this is necessary. They explain the importance of how humans interpret AI tools. Humans develop mental models of the AI’s performance. Advances made in the AI’s algorithm are evaluated only by the improvement in prediction. However, the improvements cause behavioral changes in the AI that do not fit the human’s mental models and reduce the overall performance of the team. To alleviate this, the authors propose a new logarithmic loss that considers the compatibility between human mental models and AI models when making updates to the AI model.

The authors construct user studies to show the development of human mental models across different conditions. Additionally, they illustrate the degradation in overall team performance with improvement in the AI’s prediction. Furthermore, they show that adding the additional loss term increases the overall team performance while also increasing the AI’s prediction performance.

Reflection

Humans and AI form formidable teams in multiple environments, and I think such a study is a necessity for the further development of AI. Most state-of-the-art AI systems are not independently useful in the real world and rely on human intervention from time to time (as discussed in previous classes). As long as this situation exists, we cannot improve AI independently and have to consider the humans involved in the task. I believe the evaluation metrics currently used in AI research are completely focussed on the AI’s prediction. However, this needs to change, and the paper is a great first step in that direction. I believe we should construct more such evaluation metrics for various other AI tasks. But if we develop our evaluation metrics around human-AI teams, we take the risk of potentially making AI systems reliant on human input. Hence, there is a possibility that AI systems never independently solve our problems. I believe the solution lies in interpretability.

Current AI techniques rely on statistical spaces that are not human-interpretable. Focusing on making these spaces interpretable allows human comprehension. Interpretable AI is a rising research topic in several subareas of AI, and I believe it can solve the current dilemma. We can develop AI systems independently, all the updates will be comprehensible to humans, and they can update their mental models accordingly. But interpretability is not a trivial subject. Recent work has only shown incremental progress, and the work still trades prediction ability for interpretability. The effectiveness of AI comes from its ability to recognize patterns in dimensions incomprehensible to human beings. The current paper and interpretability both require human understanding of the model, and I am not sure if this is possible.

Questions

  1. Can we have evaluation metrics for other tasks based on this? Will it involve human evaluation? If so, how do we maintain comparative fairness across such metrics?
  2. If we continue evaluating Human-AI teams together, will we ever be able to develop completely independent AI systems?
  3. Should we focus on making the AI systems interpretable or their performance?
  4. Is interpretable AI the future for real-world systems? Imagine that, for every search query made, the user is able to see all the features that aid the system’s decision-making process.

Word Count: 545
