02/26/20 – Fanglan Chen – Will You Accept an Imperfect AI? Exploring Designs For Adjusting End-user Expectations of AI Systems

Summary

Kocielnik et al.’s paper “Will You Accept an Imperfect AI?” explores approaches for shaping end-users’ expectations before they first use an AI system and studies how appropriate expectations affect users’ acceptance of the system. Prior work has shown that end-user expectations of AI-powered technologies are influenced by various factors, such as external information, knowledge and understanding, and first-hand experience. The researchers indicate that expectations vary among users and that users’ perception and acceptance of AI systems may suffer when their expectations are set too high. To fill the gap in understanding how end-user expectations can be directly and explicitly shaped, the researchers use a Scheduling Assistant, an AI system that automatically detects meeting requests in email, to study the impact of several expectation-shaping methods. Specifically, they explore two versions of the system whose classifiers have the same accuracy but are each tuned to avoid a different type of error (False Positives or False Negatives). Their study shows that the type of error is strongly related to users’ subjective perceptions of accuracy and to their acceptance of the system. They propose expectation adjustment techniques that make users fully aware of AI imperfections and enhance their acceptance of AI systems.
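To make the “same accuracy, different error types” setup concrete, here is a minimal Python sketch. The counts are hypothetical numbers chosen for illustration and are not taken from the paper, and the `ConfusionCounts` class is an assumed helper, not anything the authors describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of outcomes for a binary 'is this a meeting request?' classifier."""
    tp: int  # meeting requests correctly detected
    fp: int  # ordinary emails wrongly flagged as meeting requests
    fn: int  # meeting requests that were missed
    tn: int  # ordinary emails correctly ignored

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)


# Hypothetical counts over 100 emails (50 with a meeting request, 50 without).
# Both versions get 70 of the 100 emails right, but they err in different ways.
high_precision = ConfusionCounts(tp=25, fp=5, fn=25, tn=45)  # avoids False Positives
high_recall = ConfusionCounts(tp=45, fp=25, fn=5, tn=25)     # avoids False Negatives

for name, counts in [("high precision", high_precision), ("high recall", high_recall)]:
    print(
        f"{name}: accuracy={counts.accuracy:.2f}, "
        f"precision={counts.precision:.2f}, recall={counts.recall:.2f}"
    )
```

Under these assumed counts, the two versions share the same accuracy while one rarely raises false alarms and the other rarely misses a meeting request, which is the kind of contrast the study asks users to judge.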

Reflection

We need to be aware that AI-based technologies cannot be perfect, just as nobody is perfect. Hence, there is no point in setting a goal of AI systems that make no mistakes. Realistically defining what success and failure look like when working with AI-powered technologies is of great importance in adopting AI to improve on today’s imperfect solutions. That calls for accurately positioning where AI sits in the bigger picture. I feel the paper mainly focuses on how to set appropriate expectations but lacks a discussion of the different scenarios that shape users’ expectations of AI. For example, users’ expectations of the same AI system vary greatly across decision-making frameworks. In a human-centric decision-making process, expectations of the AI component are comparatively low, as the AI’s role is more like that of a counselor who is allowed to make some mistakes. In a machine-centric system, all decisions are made by algorithms, which gives users a low tolerance for errors. Simply put, some AI systems will require more attention than others because the impact of errors or the cost of failures will be higher. Expectations of AI systems vary not only among different users but also across usage scenarios.

To generate positive user experiences, AI needs to exceed expectations. One simple way to achieve this is to avoid over-promising the AI’s performance at the outset. That relates to the researchers’ intention in designing the Accuracy Indicator component of the Scheduling Assistant. In the study, they set the accuracy to 50%, which is very low for an AI-based application. I’m interested in whether the evaluation results would change for AI systems with higher performance (e.g. 70% or 90% accuracy). I think it would be worthwhile to conduct a survey of users’ general expectations of AI-based systems.

Interpretability of AI is another key component that shapes user experiences. If people cannot understand how AI works or how it comes up with its solutions, and in turn do not trust it, they will probably not choose to use it. As people accumulate more positive experiences, they build trust in AI. For this reason, easy-to-interpret models seem more likely to succeed than complex black-box models.

To sum up, by being fully aware of AI’s potential but also its limitations, and developing strategies to set appropriate expectations, users can create positive AI experiences and build trust in an algorithmic approach in decision making processes.

Discussion

I think the following questions are worthy of further discussion.

  • What is your expectation of AI systems in general? 
  • How would users’ expectations of the same AI system vary across different usage scenarios?
  • What negative impacts do inflated expectations bring? Please give some examples.
  • How can we determine which type of errors is more severe in an AI system?

02/26/20 – Lulwah AlKulaib – Explaining Models

Summary

The authors believe that ensuring fairness in machine learning systems requires a human-in-the-loop process. They argue that relying on developers, users, and the general public is an effective way to identify fairness problems and make improvements. The paper presents an empirical study with four types of programmatically generated explanations to understand how they impact people’s fairness judgments of ML systems. The authors try to answer three research questions:

  • RQ1 How do different styles of explanation impact fairness judgment of an ML system?
  • RQ2 How do individual factors in cognitive style and prior position on algorithmic fairness impact the fairness judgment with regard to different explanations?
  • RQ3 What are the benefits and drawbacks of different explanations in supporting fairness judgment of ML systems?

The authors focus on a racial discrimination case study, examining both model unfairness and case-specific disparate impact. They performed an experiment with 160 Mechanical Turk workers. They hypothesized that, because local explanations focus on justifying a particular case, they should more effectively surface fairness discrepancies between cases.

The authors show that:

  • Certain explanations are considered inherently less fair, while others can enhance people’s confidence in the fairness of the algorithm
  • Different fairness problems, such as model-wide fairness issues versus case-specific fairness discrepancies, may be more effectively exposed through different styles of explanation
  • Individual differences, including prior positions and judgment criteria of algorithmic fairness, impact how people react to different styles of explanation.

Reflection

This is a really informative paper. I like that it had a straightforward hypothesis and evaluated one existing case study. But I would have loved to see this addressed with judges instead of crowdworkers. The authors mention this in their limitations, and I hope that they find enough judges willing to work on a follow-up paper. I believe judges would have insightful knowledge to contribute, especially since they make these decisions in practice. Their involvement would give a more meaningful analysis of the case study itself from professionals in the field.

I also wonder how this might scale to different machine learning systems that exhibit similar racial biases. Having a specific case study makes it harder to generalize, even to something in the same domain. But it is definitely worth investigating, since there are so many existing case studies! I also wonder whether, if the case study were changed, we’d notice a difference in the local vs. global explanation patterns in fairness judgement, and how a mix of both would affect the judgement.

Discussion

  • What are other ways you would approach this case study?
  • What are some explanations that weren’t covered in this study?
  • How would you facilitate this study to be performed with judges?
  • What are other case studies that you could generalize this to with small changes to the hypothesis?

02/26/20 – Sukrit Venkatagiri – Will You Accept an Imperfect AI?

Paper: Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–14.

Summary: 

This paper explores people’s perceptions and expectations of an intelligent scheduling assistant. The paper specifically considers three broad research questions: how the AI’s focus on avoiding different types of errors affects user perception, how to set appropriate expectations, and how expectation setting affects user satisfaction and acceptance. The paper explores these questions through an experimental setup, whose design process is described in detail.

The authors find that the expectation adjustment designs significantly affected the desired aspects of expectations, as hypothesized. They also find that high recall resulted in significantly higher perceptions of accuracy and acceptance compared to high precision, and that expectations could be adjusted through intelligible explanations and by tuning the model to emphasize one evaluation metric over the other. The paper concludes with a discussion of the findings.

Reflection:

This paper presents some interesting findings using a relatively simple, yet powerful “technology probe.” I appreciate the thorough exploration of the design space, taking into consideration design principles and how they were modified to meet the required goals. I also appreciate the varied and nuanced research questions. However, I feel like the setup may have been too simple to explore in more depth. Certainly, this is valuable as a formative study, but more work needs to be done. 

It was interesting that people valued high recall over high precision. I wonder if the results would differ among people with varied expertise, from different countries, and from different socioeconomic backgrounds. I also wonder how this might differ based on the application scenario, e.g. an AI scheduling assistant versus a movie recommendation system. In the latter, a user would not be aware of movies they would actually like but were never recommended, while with an email scheduling assistant, it is easy to see false negatives.

I wonder how these techniques, such as expectation setting, might apply not only to users’ expectations of AI systems, but also to exploring the interpretability or explainability of more complex ML models.

At what point do explanations tend to have the opposite effect, i.e. reduced user acceptance and preference? It may be interesting to experimentally study how different levels of explanation and expectation setting affect user perceptions, rather than treating each as a binary value. I also wonder how this might change with people of different backgrounds.

In addition, this experiment was relatively short in duration. I wonder how the findings would change over time. Perhaps users would form inaccurate expectations, or their mental models might be better steered through expectation-setting. More work is needed in this regard. 

Questions:

  1. Will you accept an imperfect AI?
  2. How do you determine how much explanation is enough? How would this work for more complex models?
  3. What other evaluation metrics can be used?
  4. When is high precision valued over high recall, and vice versa?

02/26/2020 – Palakh Mignonne Jude – Interpreting Interpretability: Understanding Data Scientists’ Use Of Interpretability Tools For Machine Learning

SUMMARY

In this paper, the authors study two interpretability tools: the InterpretML implementation of GAMs and the SHAP Python package. They conducted a contextual inquiry and a survey of data scientists in order to analyze the ability of these tools to aid in uncovering common issues that arise when evaluating ML models. The results of these studies indicate that data scientists tend to over-trust these tools. The authors conducted pilot interviews with 6 participants to identify common issues faced by data scientists. The contextual inquiry included 11 participants who were allowed to explore the dataset and an ML model in a hands-on manner using a Jupyter notebook, whereas the survey comprised 197 participants and was conducted through Qualtrics. For the survey, the participants were given access to a description of the dataset and a tutorial on the interpretability tool they were to use. The authors found that the visualizations provided by the interpretability tools, as well as the fact that these tools were popular and publicly available, caused the data scientists to over-trust them.

REFLECTION

I think it is good that the authors performed a study to observe the usage of interpretability tools by data scientists. I was surprised to learn that a large number of these data scientists over-trusted the tools and that visualizations impacted their ability to judge the tools as well. However, the authors’ statement that ‘participants relied too heavily on the interpretability tools because they had not encountered such visualizations before’ makes me wonder if the authors should have created a separate pool of data scientists who had more experience with such tools and visualizations, and then presented a separate set of results for those individuals. I also found it interesting to learn that some participants used the tools to rationalize suspicious observations.

As indicated by the limitations section of this paper, I think a follow-up study that includes a richer dataset as well as interpretability techniques for deep learning would be very interesting to learn about and I wonder how data scientists would use such tools versus the ones studied in this paper.

QUESTIONS

  1. Considering the complexity of ML systems and the time it takes researchers to truly understand how to interpret ML, it is notable that both the contextual inquiry and the survey were conducted with people who had as little as 2 months of experience with ML. Would a study with experts in the field of ML (all with over 4 years of experience) have yielded different results? Perhaps these data scientists would have been able to better identify issues and would not have over-trusted the interpretability tools?
  2. Would a more extensive study comprising a number of different (commonly used as well as not-so-commonly used) interpretability tools have changed the results? If the tools were not so easily available, would that truly impact the amount of trust the users had in the tools?
  3. Does a correlation exist between the amount of experience a data scientist has and the amount of trust for a given interpretability tool? Would the replacement of visualizations with other representations of interpretations of the models impact the amount of trust the human had towards the tool?

02/26/20 – Fanglan Chen – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary

Dodge et al.’s paper “Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment” presents an empirical study on how people make fairness judgments of machine learning systems and how different styles of explanation impact their judgments. Fairness issues in ML systems have attracted research interest in recent years. Mitigating unfairness in ML systems is challenging and requires good cooperation among developers, users, and the general public. The researchers state that how explanations are constructed has an impact on users’ confidence in the systems. To further examine the potential impacts on people’s fairness judgments of ML systems, they conduct empirical experiments in which crowdsourcing workers evaluate four types of programmatically generated explanations (influence, demographic-based, sensitivity, and case-based). Their key findings include: 1) some explanations are considered more fair, while others have a negative impact on users’ trust in the fairness of the algorithm; 2) different fairness issues (model-wide fairness and case-specific fairness) can be detected more effectively through different explanation styles; 3) individual differences (prior positions and judgment criteria of algorithmic fairness) affect how users react to different styles of explanation.

Reflection

This paper sheds light on the very important fact that bias in ML systems can be detected and mitigated. There is growing attention to fairness issues in AI-powered technologies in the machine learning research community. Since ML algorithms are widely used to speed up decision making in a variety of domains, they are expected not only to achieve good performance but also to produce neutral results. There is no denying that algorithms rely on data: “garbage in, garbage out.” Hence, it is incumbent upon developers to feed unbiased data to these systems in the first place. In many real-world cases, race is not actually used as an input, but it correlates with other factors that make predictions biased. Such cases are not as easy to detect as the ones presented in the paper, but they still require effort to correct. One question is whether, in order to counteract this implicit bias, race should be considered and used to calibrate the relative importance of other factors.

Besides the bias introduced by data input, other factors need to be taken into consideration when dealing with fairness issues in ML systems. Firstly, machine bias can never be neglected. Bias matters greatly in high-stakes tasks (e.g. future criminal prediction) because a false positive decision could have a destructive impact on a person’s life. This is why, when an AI system makes decisions about human subjects (in this case, affecting human lives), the system must be highly precise and accurate and should ideally provide a reasonable explanation. Making a person’s life in society harder, or otherwise harming it, because of a flawed computer model is never acceptable. Secondly, proprietary models are another concern. It should be kept in mind that many high-stakes tasks, such as future criminal prediction, are matters of public concern and should be transparent and fair. That does not mean that the ML systems used for those tasks need to be completely public and open. However, I believe there should be a regulatory board of experts who can verify and validate the ML systems. More specifically, the experts can verify and validate the risk factors used in a system so that the factors can be widely accepted. They can also verify and validate the algorithmic techniques used in a system so that the system incorporates less bias.

Discussion

I think the following questions are worthy of further discussion.

  • Besides model unfairness and case-specific disparate impact, are there any other fairness issues?
  • What are the benefits and drawbacks of global and local explanations in supporting fairness judgment of AI systems?
  • Are there any other styles or elements of explanation that you think may impact fairness judgement?
  • If an AI system is not any better than untrained users at predicting recidivism in a fair and accurate way, why do we need the system?

02/26/2020 – Palakh Mignonne Jude – Explaining Models: An Empirical Study Of How Explanations Impact Fairness Judgment

SUMMARY

The authors of this paper study the effect that explanations of ML systems have on fairness judgement. This work attempts to include multiple aspects and heterogeneous standards in making fairness judgements that go beyond the evaluation of features. To do so, they utilize four programmatically generated explanations and conduct a study involving over 160 MTurk workers. They consider the impact of different explanation styles, both global (influence and demographic-based) and local (sensitivity and case-based); of fairness issues, including model unfairness and case-specific disparate impact; and of individual difference factors such as cognitive style and prior position. The authors utilized the publicly available COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) data set for predicting risk of recidivism, which is known to have racial bias. The authors developed a program to generate different explanation versions for a given data point and conducted an online survey-style study in which participants judged the fairness of a prediction on a 1 to 7 Likert scale and had to justify their rating.

REFLECTION

I agree that ML systems are often seen as ‘black boxes’ and that this truly does make gauging fairness issues difficult. I believe that this study was indeed very useful in highlighting the need for more well-defined fairness judgement methodologies that involve humans as well. I feel that the different explanation styles considered in this paper (influence, demographic-based, sensitivity, and case-based) were good and helped cover various aspects that could contribute to understanding the fairness of a prediction. I found it interesting to learn that the local explanations helped to better expose case-specific fairness issues, that is, discrepancies between disparately impacted cases and non-impacted cases, whereas the global explanations were more relevant to model-wide fairness issues.

I also found it interesting to learn that different regions of the feature space may have varied levels of fairness and fairness issues. Having not considered the fairness aspect of my datasets and the impact this would have on the models I build, this made me realize that it would indeed be important to have more fine-grained sampling methods and explanation designs in order to judge the fairness of ML systems.

QUESTIONS

  1. 78.8% of the participants in this study were self-identified Caucasian MTurk workers. Considering that the COMPAS dataset used in this study is known to have racial bias, would changing the percentage of African American workers involved in the study have altered the results? The study focused on workers living in the US; would it also have been interesting to study the general judgement of people of multiple races living across the world?
  2. The authors utilize a logistic regression classifier, which is known to be relatively interpretable. How would a study of this kind extend to more complex models such as deep learning systems? Could the programs used to generate explanations be used directly? Has any similar study been performed with these kinds of more complex systems?
  3. As part of the limitations of this study, the authors mention that ‘the study was performed with crowd workers, rather than judges who would be the actual users of this type of tool’. How much would the results vary if this study were conducted with judges? Has any follow-up study been conducted?

02/26/20 – Lulwah AlKulaib – Interpretability

Summary

Machine learning (ML) models are now integrated into many domains (for example, criminal justice, healthcare, and marketing). ML has moved beyond academic research and grown into an engineering discipline. Because of that, it is important to interpret ML models and understand how they work by developing interpretability tools. Machine learning engineers, practitioners, and data scientists have been using these tools. However, because there has been minimal evaluation of the extent to which these tools achieve interpretability, the authors study the use of two interpretability tools to uncover issues that arise when building and evaluating models. The interpretability tools are the InterpretML implementation of GAMs and the SHAP Python package. They conduct a contextual inquiry and survey 197 data scientists to observe how they use interpretability tools to uncover common issues that arise when building and evaluating ML models. Their results show that data scientists did utilize visualizations produced by interpretability tools to uncover issues in datasets and models. Yet the availability of these tools has led data scientists to over-trust and misuse them.

Reflection

Machine learning is now being used to address important problems like predicting crime rates in cities to help police distribute manpower, identifying cancerous cells, predicting recidivism in the judicial system, and locating buildings at risk of catching fire. Unfortunately, these models have been shown to learn biases. These biases can be subtle and hard to detect, especially for beginners in the field. I agree with the authors that it is troubling when machine learning is misused, whether intentionally or out of ignorance, in situations where ethics and fairness are paramount. A lack of model explainability can lead to biased and ill-informed decisions. In our ethics class, we went over case studies where a lack of interpretability contributed to racial bias in facial analysis systems [1], biased recidivism predictions [2], and textual gender biases learned from language [3]. Some of these systems were used in real life and have affected people’s lives. I think that an analysis similar to the one presented in this paper should be mandatory before deploying systems into practice. It would give developers a better understanding of their systems and help them catch and correct biased decisions before the systems go into public use. It is also important to inform developers about how dependable interpretability tools are and how to tell when they are over-trusting or misusing them. Interpretability is a “new” field in machine learning, and I’ve been seeing conferences add sessions about it lately. I’m interested in learning more about interpretability and how we can adapt it in different machine learning modules.

Discussion

  • Have you used any of the mentioned interpretability packages in your research? How did it help in improving your model?
  • What are case studies that you know of where machine learning bias is evident? Were these biases corrected? If so, how?
  • Do you have any interpretability related resources that you can share with the rest of the class?
  • Do you plan to use these packages in your project? 

References

  1. https://splinternews.com/predictive-policing-the-future-of-crime-fighting-or-t-1793855820
  2. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
  3. Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems (pp. 4349-4357).

02/26/2020 – Ziyao Wang – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

As machine learning models are deployed in a variety of industry domains, it is important to design interpretability tools that help model users, such as data scientists and machine learning practitioners, better understand how these models work. However, little research has focused on evaluating the performance of these tools. The authors of this paper conducted experiments and surveys to fill this gap. They interviewed 6 data scientists from a large technology company to find out the most common issues faced by data scientists. Then, based on these common issues, they conducted a contextual inquiry with 11 participants using the InterpretML implementation of GAMs and the SHAP Python package. Finally, they surveyed 197 data scientists. Through the experiments and surveys, the authors highlighted the problems of misuse and over-trust and the need for communication between members of the HCI and ML communities.

Reflection:

Before reading this paper, I held the view that interpretability tools should be able to cover most of data scientists’ needs. However, I now think that the tools for interpretation are not designed by the ML community, which results in a lack of accuracy in the tools. When data scientists or machine learning practitioners want to use these tools to learn how the models operate, they may face problems like misuse or over-trust. I do not think this is the users’ fault. Tools are designed to make tasks more convenient for users. If the tools confuse users, the developers should change the tools to give users a better experience. In this case, the authors suggest that members of the HCI and ML communities should work together when developing the tools. This requires the members to leverage their respective strengths so that the tools are user-friendly and let users understand the models easily. Meanwhile, comprehensive instructions should be written to explain how users can use the tools to understand the models accurately and easily. As a result, both the efficiency and the accuracy of the tools and of the model implementations will be improved.

From the point of view of data scientists and machine learning practitioners, they should try to avoid over-trusting the tools. The tools cannot fully explain the models, and there may be mistakes. Users should always be critical of the tools instead of fully trusting them. They should read the instructions carefully and understand how to use the tools, what the tools are for, what the models are being used for, and how to use the models. If they think carefully when using these tools and models, instead of guessing the meaning of the tools’ results, the number of misuse and over-trust cases will decrease sharply.

Questions:

  1. How should the proposed interactive interpretability tool be designed? What kinds of interactions should be included?
  2. How can we design a tool that lets users dig into the models conveniently, instead of letting them use the models without knowing how the models work?
  3. How can we design tools that make the most of the strengths of mental models?

2/26/20 – Jooyoung Whang – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

The paper provides research on fairness, Explainable Artificial Intelligence (XAI), and changes in people’s judgment. The authors introduce a preprocessing method to reduce the bias of a dataset with respect to known bias-inducing attributes. They also show four methods of explaining the classification results: Sensitivity, Input-Influence, Case, and Demographic. Using different combinations of the above configurations, AI classifications of the COMPAS data were presented to MTurk workers for feedback. As a result, the paper reports that case-based explanations were often seen as less fair than other explanation methods. The authors also found that sensitivity explanations are the most effective at addressing unfairness. Finally, the paper shows that the evaluator’s position on machine learning heavily impacts their reaction to a classifier’s output and explanations.

When I looked at the paper’s sample sensitivity explanation, it gave me a strong impression that the system was racist. I think many others would have a similar thought, especially if they do not know much about machine learning and regression. Because of this, I was concerned that some people may be pushed toward the opposite of the decision the AI made, as a reaction against it. This clearly adds another bias in the opposite direction. I believe an explanatory model should only give helpful information about the model instead of introducing bias. As a possible solution, the authors could have phrased the same information differently. For example, instead of bluntly saying that the classifier would have made a different decision, the system could have reported the probability of each label. This provides the same information but introduces less obvious bias. Another solution would be to preprocess the data so that it does not have the bias in the first place, as the authors suggested.
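As a rough sketch of the rephrasing I have in mind, the snippet below reports label probabilities before and after changing one attribute, rather than announcing that the decision flipped. Everything in it is an assumption for illustration: the feature names, the weights, and the bias of the toy logistic regression are made up and are not taken from the paper or from the COMPAS data.

```python
import math

# Hypothetical weights of a toy logistic regression (illustration only).
WEIGHTS = {"prior_offenses": 0.4, "age_under_25": 0.6, "group_a": 0.5}
BIAS = -1.5


def reoffend_probability(features):
    """Return P(label = 'will reoffend') under the toy model."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def probability_report(features, attribute, new_value):
    """Compare label probabilities with and without a change to one attribute."""
    changed_features = {**features, attribute: new_value}
    return {
        "original": reoffend_probability(features),
        "changed": reoffend_probability(changed_features),
    }


# A made-up individual; group_a stands in for a sensitive attribute.
person = {"prior_offenses": 2, "age_under_25": 1, "group_a": 1}
report = probability_report(person, "group_a", 0)

print(f"P(will reoffend)              = {report['original']:.2f}")
print(f"P(will reoffend), group_a = 0 = {report['changed']:.2f}")
```

With these made-up numbers the predicted label would flip across the 0.5 threshold, but the probabilities show that both predictions sit close to the boundary, which reads less bluntly than “the classifier would have decided differently.”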

I liked the idea of comparing the subject’s prior position on using ML with their judgment of the classifier. This relates to a reflection I made last week, where I raised the possibility that people may put more weight on cases where the model makes a wrong decision. As I expected, the paper reported that prior positions do in fact make a huge difference in a user’s judgment. Addressing this issue would require either building more trust with users or building the software to effectively serve both kinds of users.

The following are the questions I had while reading the paper:

1. Is it possible that preprocessing the data would add bias to the data instead of removing it? What if the attribute that was thought to be unneeded for the classification was actually crucial to the judgment?

2. The authors state that one of the limitations of their study is conducting it with MTurk workers and not the actual users of the software. Do you think this was really a limitation? The attributes used for the classifier and explanations in their experiment seemed general enough for non-professionals to make a meaningful judgment.

3. If you were to design a classifier with an explanation model, which explanation method would you pick out of Sensitivity, Input-Influence, Case, and Demographic? What do you like about the chosen method?


02/26/2020 – Vikram Mohanty – Will you accept an imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Authors: Rafal Kocielnik, Saleema Amershi, Paul Bennett

Summary


This paper discusses the impact of end-user expectations on the subjective perception of AI-based systems. The authors conduct studies to better understand how different types of errors (i.e. False Positives and False Negatives) are perceived differently by users, even though accuracy remains the same. The paper uses the context of an AI-based scheduling assistant (in an email client) to demonstrate 3 different design interventions for helping end-users adjust their expectations of the system. The studies in this paper showed that these 3 techniques were effective in preserving user satisfaction and acceptance of an imperfect AI-based system. 
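To make the "same accuracy, different error types" point concrete, here is a minimal sketch in Python. The counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the `Confusion` class is an assumed helper rather than anything the authors used.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for a binary 'meeting request' detector."""
    tp: int  # meeting requests correctly flagged
    fp: int  # ordinary emails wrongly flagged (False Positives)
    fn: int  # meeting requests missed (False Negatives)
    tn: int  # ordinary emails correctly ignored

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)


# Hypothetical counts over 100 emails, 50 of which contain a meeting request.
high_precision = Confusion(tp=30, fp=5, fn=20, tn=45)  # few false positives
high_recall = Confusion(tp=45, fp=20, fn=5, tn=30)     # few false negatives

for name, c in [("High Precision", high_precision), ("High Recall", high_recall)]:
    print(f"{name}: accuracy={c.accuracy:.2f}, "
          f"precision={c.precision:.2f}, recall={c.recall:.2f}")
```

Both hypothetical systems get 75 of 100 emails right, yet one mostly misses meetings while the other mostly raises false alarms, which is the kind of difference users can perceive even when accuracy is identical.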

Reflection

This paper is essentially an evaluation of the first 2 guidelines from the “Guidelines for Human-AI Interaction” paper, i.e. making clear what the system can do, and how well it can do what it does.

Even though the study setup was artificial (i.e. it used workers from an internal crowdsourcing platform instead of real users of a system and subjected them to an artificial task instead of a real one), the study design, the research questions and the inferences from the data initiate a conversation about giving special attention to the user experience in AI-infused systems. Because the tasks were artificial, we could not assess scenarios where users actually have a dog in the fight, e.g. they miss an important event by over-relying on the AI assistant and start to depend less on the AI suggestions.

The task here was scheduling events from emails, which is somewhat simple in the sense that users can almost immediately assess how good or bad the system is at it. Furthermore, the authors manipulated the dataset to prepare the High Precision and High Recall versions of the system. Conducting this study in a real-world scenario would require a better understanding of user mental models with respect to AI imperfections. It becomes slightly trickier when these AI imperfections cannot be accurately assessed in a real-world context, e.g. search engines may retrieve pages containing the keywords but may not take context into account, and thus may not always give users what they want.

The paper makes an excellent case for digging deeper into error recovery costs and correlating them with why participants in this study preferred a system with high false positive rates. This is critical for system designers to keep in mind when dealing with uncertain agents like an AI core. It matters even more in a high-stakes scenario.

Questions

  1. The paper starts off with the hypothesis that avoiding false positives is considered better for user experience, and therefore systems are optimized for high precision. The findings, however, contradicted it. Can you think of scenarios where you’d prefer a system with a higher likelihood of false positives? Can you think of scenarios where you’d prefer a system with a higher likelihood of false negatives?
  2. Did you think the design interventions were exhaustive? How would you have added on to the ones suggested in the paper? If you were to adopt something for your own research, what would it be? 
  3. The paper discusses factoring in other aspects, such as workload, both mental and physical, and the criticality of consequences. How would you leverage these aspects in design interventions? 
  4. If you used an AI-infused system every day (to the extent it’s subconsciously a part of your life):
    1. Would you be able to assess the AI imperfections purely on the basis of usage? How long would it take for you to assess the nature of the AI? 
    2. Would you be aware if the AI model suddenly changed underneath? How long would it take for you to notice the changes? Would your behavior (within the context of the system) be affected in the long term? 
