Summary
Bansal et al. discuss how human-AI teams perform on high-stakes tasks such as hospital patient discharge decisions or credit risk assessment. They point out that the humans on these teams form a mental model of the AI's suggestions, i.e., an understanding of when the AI is likely to be wrong. The authors then show that updates to the AI can degrade team performance if they are incompatible with the mental model the human user has already formed. They go on to define types of compatibility for AI updates, along with several other key terms relating to human/AI teams. They develop a platform to measure how compatibility affects team performance, and then evaluate the effect of update compatibility in a user study with 25 mTurk workers. In all, they show that incompatible updates reduce performance compared to no update at all.
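To ground the compatibility idea for myself, here is a minimal sketch of one way to score an update: the fraction of cases the old model got right that the new model still gets right. The function name and exact formulation are my own illustration, not necessarily the paper's precise definition.

```python
import numpy as np

def compatibility_score(y_true, old_preds, new_preds):
    """Fraction of examples the old model got right that the new model
    still gets right. A score of 1.0 means the update never breaks a
    behavior the user may already have learned to rely on."""
    old_correct = (old_preds == y_true)
    both_correct = old_correct & (new_preds == y_true)
    if old_correct.sum() == 0:
        return 1.0  # vacuously compatible: the old model was never right
    return both_correct.sum() / old_correct.sum()

# Toy illustration: the update raises raw accuracy (3/5 -> 4/5) but breaks
# one case the old model handled correctly, so compatibility < 1.
y_true    = np.array([1, 0, 1, 1, 0])
old_preds = np.array([1, 0, 1, 0, 1])   # correct on examples 0, 1, 2
new_preds = np.array([1, 1, 1, 1, 0])   # correct on examples 0, 2, 3, 4
print(compatibility_score(y_true, old_preds, new_preds))  # 2/3 ~= 0.67
```

Under this framing, an update can be "better" in aggregate and still incompatible, which is exactly the failure mode the paper highlights.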
Personal Reflection
The paper was an interesting study of the effects of pushing updates without considering the user involved in the process. I hadn't thought of the human as a roughly equal player on the team; I assumed the AI likely has more information and could provide a better suggestion. However, it makes sense that the human leverages other sources of information and forms a better understanding of which choice to ultimately make.
CAJA, the human/AI simulation platform, seems like a good way to test AI updates; however, I struggle to see how it can be used to test other theories, as the authors seem to suggest. It is, essentially, a simple user-learning game in which users figure out when to trust the machine and when to deviate. While this isn't exactly my field of expertise, the only levers I see for learning new things about human/AI collaboration are changing the information flows and the underlying AI, which makes calling it a platform feel a little excessive.
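To make concrete what I mean by "a simple user-learning game," here is a rough sketch of the interaction loop as I understand it. This is my own simplification, not the actual CAJA implementation, and the accuracy numbers and 30% "hard task" rate are made-up parameters.

```python
import random

def simulate_team(rounds=100, ai_acc_easy=0.95, ai_acc_hard=0.4, user_acc=0.7):
    """Crude stand-in for a CAJA-style loop: the AI is reliable on easy
    tasks and unreliable on hard ones; the simulated user eventually
    learns to override the AI on hard tasks."""
    score = 0
    for t in range(rounds):
        hard = random.random() < 0.3                       # 30% of tasks are hard
        ai_correct = random.random() < (ai_acc_hard if hard else ai_acc_easy)
        learned_to_override = t > 30                       # mental model formed after exposure
        override = hard and learned_to_override
        final_correct = (random.random() < user_acc) if override else ai_correct
        score += int(final_correct)
    return score / rounds

print(simulate_team())
```

My point is that the interesting variables here are limited to the error boundary, the information shown, and the user's learning; that feels more like a single experiment design than a general-purpose platform.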
Questions
- The authors mention that, in order to defeat mTurk scammers who click through tasks like these quickly, they drop the lowest quartile (in terms of performance) from their results. Do you think this was an effective countermeasure, or could the authors be cutting good data?
- From other sources, such as Weapons of Math Destruction, we can read how some AI suggestions are inherently biased (even racist) because of their input data. How might this change the authors' results? Do you think it is taken into consideration at all?
- One suggestion near the end of the paper states that, if pushing an incompatible update, the authors of the AI should make the change explicit so that the user can adjust accordingly. Do you think this is an acceptable alternative to creating a compatible update? Why or why not?
- The authors note that, as the error boundary f became more complex, user errors increased, so they kept to relatively simple boundaries. Is this an effective choice for this system, given that real systems are extremely complex? Why or why not?
- The authors state that they wanted the “compute” cost to be net zero. Does this effectively simulate real-world experience? Is the opportunity cost the only net negative here?
Hi Lisle,
Good discussion and good questions. I would like to talk about your second question. It is interesting to consider what happens if the input data is biased. Recently I read an article about a technology called predictive policing; one of the most popular applications is PredPol. Police only need to input the last three years of crime records, and the system marks a red square on the map indicating areas where crime is likely to happen. However, based on an interview with police officers who use this application, I found that they do not simply trust or distrust the prediction; they use the information judiciously, and most of the time they rely more on their own expertise. This made me think about the system's contribution and its limited influence on its human partner. It was just an interesting reading experience that I wanted to share with you.
To your second question, I am guessing the biased nature of the AI may be captured by the users' mental models with prolonged use over time. That being said, it remains to be seen. I feel a much more comprehensive real-world study would have helped us understand these aspects. I am a bit unsure how much generalization a toy example on a CAJA-like platform can aim to achieve.