Authors: Rafal Kocielnik, Saleema Amershi, Paul Bennett
Summary
This paper discusses the impact of end-user expectations on the subjective perception of AI-based systems. The authors conduct studies to better understand how different types of errors (i.e. False Positives and False Negatives) are perceived differently by users, even when overall accuracy remains the same. The paper uses the context of an AI-based scheduling assistant (in an email client) to demonstrate 3 design interventions for helping end-users adjust their expectations of the system. The studies showed that these 3 techniques were effective in preserving user satisfaction and acceptance of an imperfect AI-based system.
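To make the False Positive / False Negative distinction concrete, here is a minimal sketch with made-up confusion-matrix counts (not figures from the paper) showing how a "High Precision" and a "High Recall" scheduling assistant can report the same accuracy while making very different kinds of errors:

```python
# Illustrative only: invented counts showing identical accuracy but different error profiles.

def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# "High Precision" system: only suggests a meeting when it is sure (no false positives).
high_precision = metrics(tp=40, fp=0, fn=10, tn=50)

# "High Recall" system: never misses a meeting, at the cost of spurious suggestions.
high_recall = metrics(tp=50, fp=10, fn=0, tn=40)

print("High Precision (acc, prec, rec):", high_precision)  # (0.90, 1.00, 0.80)
print("High Recall    (acc, prec, rec):", high_recall)     # (0.90, ~0.83, 1.00)
```

Both hypothetical systems are 90% accurate, yet one errs only by missing events while the other errs only by suggesting events that do not exist, which is exactly the contrast the paper asks users to react to.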
Reflection
This paper is essentially an evaluation of the first 2 guidelines from the “Guidelines for Human-AI Interaction” paper, i.e. making clear what the system can do, and how well it can do what it does.
Even though the task in the study was artificial (i.e. using workers from an internal crowdsourcing platform instead of real users of a system, and subjecting them to an artificial task instead of a real one), the study design, the research questions, and the inferences from the data start a conversation about giving special attention to the user experience in AI-infused systems. Because the tasks were artificial, we could not assess scenarios where users actually have a stake in the outcome, e.g. missing an important event because they over-relied on the AI assistant and consequently coming to depend less on its suggestions.
The task here was scheduling events from emails, which is relatively simple in the sense that users can almost immediately assess how good or bad the system is at it. Furthermore, the authors manipulated the dataset to prepare the High Precision and High Recall versions of the system. Conducting this study in a real-world scenario would require a better understanding of user mental models with respect to AI imperfections. It becomes trickier when these AI imperfections cannot be accurately assessed in a real-world context, e.g. search engines may retrieve pages containing the keywords but fail to account for context, and thus may not always give users what they want.
The paper makes an excellent case for digging deeper into error recovery costs and correlating them with why participants in this study preferred a system with a high false positive rate. This is critical for system designers to keep in mind while dealing with uncertain agents like an AI core, and it matters even more in high-stakes scenarios.
Questions
- The paper starts off with the hypothesis that avoiding false positives is considered better for user experience, and that systems are therefore optimized for high precision. The findings, however, contradicted this. Can you think of scenarios where you’d prefer a system with a higher likelihood of false positives? Can you think of scenarios where you’d prefer a system with a higher likelihood of false negatives?
- Did you think the design interventions were exhaustive? How would you have added to the ones suggested in the paper? If you were to adopt something for your own research, what would it be?
- The paper discusses factoring in other aspects, such as workload, both mental and physical, and the criticality of consequences. How would you leverage these aspects in design interventions?
- If you used an AI-infused system every day (to the extent that it’s subconsciously a part of your life):
    - Would you be able to assess the AI’s imperfections purely on the basis of usage? How long would it take for you to assess the nature of the AI?
    - Would you be aware if the AI model suddenly changed underneath? How long would it take for you to notice the changes? Would your behavior (within the context of the system) be affected in the long term?
In answer to your 4th question, I think that I would get very familiar with an AI system if I used it every day. While perhaps not technically an AI system, I have gotten very good at using Google to get the results that I want from years of practice. I have tried to switch to a different search engine, and I can’t get any other system to produce the same quality of results. This may just be because the other search engines are worse, but it could also be because I have much more experience with crafting queries for Google. I think that we will get familiar with any system that we use often, and when that system changes, for better or worse, we will notice, because we got so used to the way it worked before.
Any system where false negatives have a high impact should be tuned to tolerate more false positives. Fraud detection is an example: flagging legitimate transactions is better than letting actual fraud slip through. Conversely, when the implications of a false positive are severe, tolerating more false negatives is preferable. For example, in the judiciary, it is considered better to let a guilty defendant go free than to punish an innocent one.
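One concrete way to make this trade-off, as a minimal sketch with invented scores and labels (not from the paper), is to move a decision threshold on a classifier’s score: lowering it produces more false positives and fewer false negatives, raising it does the opposite.

```python
# Hypothetical illustration of the comment above: tuning a decision threshold
# trades false positives against false negatives. Scores and labels are made up.

# Each pair is (model_score, true_label); 1 = fraud, 0 = legitimate.
examples = [(0.95, 1), (0.80, 1), (0.65, 0), (0.55, 1), (0.40, 0), (0.20, 0)]

def error_counts(threshold):
    """Count false positives and false negatives at a given score threshold."""
    fp = sum(1 for score, label in examples if score >= threshold and label == 0)
    fn = sum(1 for score, label in examples if score < threshold and label == 1)
    return fp, fn

# A low threshold flags more transactions: more false positives, no missed fraud.
print("threshold=0.3 ->", error_counts(0.3))  # (2, 0)
# A high threshold flags fewer transactions: no false positives, but fraud slips through.
print("threshold=0.7 ->", error_counts(0.7))  # (0, 1)
```

A fraud-detection team would lean toward the lower threshold, while the judiciary analogy corresponds to the higher one.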