02/26/2020 – Palakh Mignonne Jude – Explaining Models: An Empirical Study Of How Explanations Impact Fairness Judgment

February 25, 2020 Palakh Mignonne Jude

SUMMARY

The authors of this paper study how explanations of ML systems affect people's fairness judgments. The work accounts for multiple aspects and heterogeneous standards in making fairness judgments, going beyond the evaluation of individual features. To do this, they use four types of programmatically generated explanations and conduct a study involving over 160 MTurk workers. They consider the impact of different explanation styles, both global (influence and demographic-based) and local (sensitivity and case-based); of fairness issues, including model unfairness and case-specific disparate impact; and of individual difference factors such as cognitive style and prior position. The authors used the publicly available COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) data set for predicting risk of recidivism, which is known to have racial bias. They developed a program to generate different explanation versions for a given data point and conducted an online survey-style study in which participants judged the fairness of a prediction on a 1 to 7 Likert scale and justified their rating.

REFLECTION

I agree that ML systems are often seen as ‘black boxes’ and that this truly does make gauging fairness issues difficult. I believe that this study was very useful in shedding light on the need for better-defined fairness judgment methodologies that involve humans as well. I feel that the different explanation styles considered in this paper (influence, demographic-based, sensitivity, and case-based) were good choices and helped cover various aspects that could contribute to understanding the fairness of a prediction. I found it interesting to learn that the local explanations helped to better expose discrepancies between disparately impacted cases and non-impacted cases, whereas the global explanations were more effective in conveying model-wide fairness.

I also found it interesting to learn that different regions of the feature space may have varied levels of fairness and different fairness issues. I had not previously considered the fairness of my datasets or its impact on the models I build, so this made me realize how important it is to have more fine-grained sampling methods and explanation designs in order to judge the fairness of ML systems.

QUESTIONS

Of the participants in this study, 78.8% were self-identified Caucasian MTurk workers. Considering that the COMPAS dataset used in this study is known to have racial bias, would changing the percentage of African American workers involved have altered the results? The study focused on workers living in the US; would it also have been interesting to study the judgments of people of multiple races living across the world?
The authors use a logistic regression classifier, which is known to be relatively interpretable. How would a study of this kind extend to more complex models such as deep learning systems? Could the programs used to generate explanations be used directly? Has any similar study been performed with these kinds of more complex systems?
As part of the limitations of this study, the authors mention that ‘the study was performed with crowd workers, rather than judges who would be the actual users of this type of tool’. How much would the results vary if this study were conducted with judges? Has any follow-up study been conducted?

02/26/20 – Lulwah AlKulaib – Interpretability

February 25, 2020 Lulwah AlKulaib

Summary

Machine learning (ML) models are now integrated into many domains (for example: criminal justice, healthcare, marketing, etc.). ML has moved beyond academic research and grown into an engineering discipline. Because of that, it is important to interpret ML models and understand how they work by developing interpretability tools. Machine learning engineers, practitioners, and data scientists have been using these tools. However, because there has been minimal evaluation of the extent to which these tools achieve interpretability, the authors study the use of two interpretability tools to uncover issues that arise when building and evaluating models. The interpretability tools are the InterpretML implementation of GAMs and the SHAP Python package. They conduct a contextual inquiry and survey 197 data scientists to observe how they use interpretability tools to uncover common issues that arise when building and evaluating ML models. Their results show that data scientists did use the visualizations produced by interpretability tools to uncover issues in datasets and models. Yet the availability of these tools has also led data scientists to over-trust and misuse them.

Reflection

Machine learning is now being used to address important problems like predicting crime rates in cities to help police distribute manpower, identifying cancerous cells, predicting recidivism in the judiciary system, and locating buildings that are at risk of catching fire. Unfortunately, these models have been shown to learn biases. These biases are subtle and hard to detect, especially for beginners in the field. I agree with the authors that it is troubling when machine learning is misused, whether intentionally or out of ignorance, in situations where ethics and fairness are paramount. A lack of model explainability can lead to biased and ill-informed decisions. In our ethics class, we went over case studies where a lack of interpretability contributed to racial bias in facial analysis systems [1], biased recidivism predictions [2], and textual gender biases learned from language [3]. Some of these systems were used in real life and have affected people’s lives. I think that performing an analysis similar to the one presented in this paper should be mandatory before deploying systems into practice. It would give developers a better understanding of their systems and help them catch and correct biased decisions before the systems go into public use. It is also important to inform developers about how dependable interpretability tools are, and how to tell when they are over-trusting or misusing them. Interpretability is a “new” field in machine learning and I’ve been seeing conferences adding sessions about it lately. I’m interested in learning more about interpretability and how we can adapt it in different machine learning modules.

Discussion

Have you used any of the mentioned interpretability packages in your research? How did they help in improving your model?
What are some case studies that you know of where machine learning bias is evident? Were these biases corrected? If so, how?
Do you have any interpretability related resources that you can share with the rest of the class?
Do you plan to use these packages in your project?

References

[1] https://splinternews.com/predictive-policing-the-future-of-crime-fighting-or-t-1793855820
[2] https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[3] Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems (pp. 4349-4357).

02/26/2020 – Ziyao Wang – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

February 25, 2020 Ziyao Wang

As machine learning models are deployed in a variety of industry domains, it is important to design interpretability tools that help model users, such as data scientists and machine learning practitioners, better understand how these models work. However, there has been little research evaluating how well these tools perform. The authors of this paper conducted experiments and surveys to fill this gap. They interviewed 6 data scientists from a large technology company to identify the most common issues data scientists face. Then, based on these common issues, they conducted a contextual inquiry with 11 participants using the InterpretML implementation of GAMs and the SHAP Python package. Finally, they surveyed 197 data scientists. Based on these experiments and surveys, the authors highlight the problems of misuse and over-trust and the need for communication between members of the HCI and ML communities.

Reflection:

Before reading this paper, I held the view that interpretability tools should be able to cover most of data scientists’ needs. However, I now think that the tools for interpretation are not designed by the ML community, which results in a lack of accuracy in the tools. When data scientists or machine learning practitioners want to use these tools to learn how the models operate, they may run into problems like misuse or over-trust. I do not think this is the users’ fault. Tools are designed to make tasks more convenient for users. If the tools confuse users, the developers should change the tools to give users a better experience. In this case, the authors suggest that members of the HCI and ML communities should work together when developing the tools. This requires both communities to leverage their strengths so that the resulting tools are user-friendly and let users understand the models easily. Meanwhile, comprehensive instructions should be written to explain how users can use the tools to understand the models accurately and easily. As a result, the efficiency and accuracy of both the tools and the implementation of models would improve.

For their part, data scientists and machine learning practitioners should try to avoid over-trusting the tools. The tools cannot fully explain the models, and they may contain mistakes. Users should always be critical of the tools instead of fully trusting them. They should read the instructions carefully and understand how to use the tools, what the tools are for, what the models are being used for, and how to use the models. If they think carefully when using these tools and models, instead of guessing the meaning of the tools’ output, the number of misuse and over-trust cases should decrease sharply.

Questions:

How should a proposed interactive interpretability tool be designed? What kinds of interactions should be included?
How can we design a tool that makes it convenient for users to dig into the models, instead of letting them use the models without knowing how the models work?
How can we design tools that make the most of users’ mental models?

2/26/20 – Jooyoung Whang – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

February 25, 2020 Jooyoung Whang

The paper presents research on fairness, Explainable Artificial Intelligence (XAI), and how people’s judgments change. The authors introduce a preprocessing method to reduce the bias of a dataset for known bias-inducing attributes. They also show four methods for explaining classification results: Sensitivity, Input-Influence, Case, and Demographic. Using different combinations of the above configurations, AI classifications of the COMPAS data were presented to MTurk workers for feedback. As a result, the paper reports that case-based explanations were often seen as less fair than other explanation methods. The authors also found that sensitivity explanations are the most effective at addressing unfairness. Finally, the paper shows that the evaluator’s position on machine learning heavily impacts his or her reaction to a classifier output and explanations.

When I looked at the paper’s sample sensitivity explanation, it gave me a strong impression that the system was racist. I think many others would have had a similar thought, especially if they do not have enough knowledge about machine learning and regression. Because of this, it concerned me that some people may be pushed towards making the opposite decision from the one the AI made, simply as a reaction of aversion. This clearly adds another bias in the opposite direction. I believe an explanatory model should only give helpful information about the model instead of introducing bias. As a possible solution, the authors could have rephrased the same information in a different way. For example, instead of bluntly saying that the classifier would have made a different decision, the system could have reported the probability for each label. This provides the same information but adds less obvious bias. Another solution would be preprocessing the data to not have the bias in the first place, as the authors suggested.
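To make the rephrasing idea concrete, here is a minimal sketch that contrasts the two framings on a toy logistic regression. The feature names, weights, and wording are hypothetical and chosen only for illustration; they are not the paper's classifier or its explanation text.

```python
import math

# Hypothetical coefficients for a toy logistic regression; these are NOT
# the weights of the classifier used in the paper.
WEIGHTS = {"prior_convictions": 0.35, "age_under_25": 0.6, "sensitive_attr": 0.5}
BIAS = -1.2


def reoffend_probability(features):
    """Return P(re-offend) under the toy logistic model."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def sensitivity_statement(features, attribute, new_value):
    """Blunt framing: report only whether the predicted label would flip."""
    changed = dict(features, **{attribute: new_value})
    before = reoffend_probability(features) >= 0.5
    after = reoffend_probability(changed) >= 0.5
    verb = "change" if before != after else "stay the same"
    return f"If {attribute} were {new_value}, the prediction would {verb}."


def probability_statement(features, attribute, new_value):
    """Softer framing: report the probability before and after the change."""
    changed = dict(features, **{attribute: new_value})
    p_now = reoffend_probability(features)
    p_changed = reoffend_probability(changed)
    return (f"P(re-offend) is {p_now:.2f} as given and "
            f"{p_changed:.2f} if {attribute} were {new_value}.")


# A made-up profile, not a record from the COMPAS data.
person = {"prior_convictions": 3, "age_under_25": 0, "sensitive_attr": 1}
print(sensitivity_statement(person, "sensitive_attr", 0))
print(probability_statement(person, "sensitive_attr", 0))
```

Both statements are derived from the same two probabilities; the second one simply shows how close to the decision threshold the case is, rather than only reporting the flip.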

I liked the idea of comparing the subject’s prior position on using ML with their judgment of the classifier. This relates to a reflection I made last week, where I raised the possibility that people may put more weight on the cases where the model makes a wrong decision. As I had expected, the paper reported that prior positions do in fact make a huge difference in a user’s judgment. Either building more trust with the users or building the software to effectively address both kinds of users would be needed to address this issue.

The following are the questions I had while reading the paper:

1. Would there be a possibility where preprocessing the data would add bias to the data instead of removing it? What if the attribute that was thought to be unneeded for the classification was actually crucial to the judgment?

2. The authors state that one of the limitations of their study is conducting it with MTurk workers and not the actual users of the software. Do you think this was really a limitation? The attributes used for the classifier and explanations in their experiment seemed general enough for non-professionals to make a meaningful judgment.

3. If you were to design a classifier with an explanation model, which explanation method would you pick? (Out of Sensitivity, Input-Influence, Case, and Demographic) What do you like about the chosen method?

02/26/2020 – Vikram Mohanty – Will you accept an imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

February 25, 2020 Vikram Mohanty

Authors: Rafal Kocielnik, Saleema Amershi, Paul Bennett

Summary

This paper discusses the impact of end-user expectations on the subjective perception of AI-based systems. The authors conduct studies to better understand how different types of errors (i.e. False Positives and False Negatives) are perceived differently by users, even though accuracy remains the same. The paper uses the context of an AI-based scheduling assistant (in an email client) to demonstrate 3 different design interventions for helping end-users adjust their expectations of the system. The studies in this paper showed that these 3 techniques were effective in preserving user satisfaction and acceptance of an imperfect AI-based system.

Reflection

This paper is basically an evaluation of the first 2 guidelines from the “Guidelines for Human-AI Interaction” paper, i.e. making clear what the system can do, and how well it can do what it does.

Even though the task in the study was artificial (i.e. using workers from an internal crowdsourcing platform instead of real users of a system, and subjecting them to an artificial task instead of a real one), the study design, the research questions, and the inferences from the data initiate a conversation about giving special attention to the user experience in AI-infused systems. Because the tasks were artificial, we could not assess scenarios where users actually have a dog in the fight, e.g. they miss an important event by over-relying on the AI assistant and start to depend less on the AI suggestions.

The task here was scheduling events from emails, which is somewhat simple in the sense that users can almost immediately assess how good or bad the system is. Furthermore, the authors manipulated the dataset to prepare the High Precision and High Recall versions of the system. Conducting this study in a real-world scenario would require a better understanding of user mental models with respect to AI imperfections. It becomes slightly trickier when these AI imperfections cannot be accurately assessed in a real-world context, e.g. search engines may retrieve pages containing the keywords but may not take context into account, and thus may not always give users what they want.
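As a small illustration of how two versions of a system can share the same accuracy while making different kinds of errors, the sketch below computes accuracy, precision, and recall from confusion-matrix counts. The counts are made up for illustration and are not the numbers used in the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # meeting requests correctly detected
    fp: int  # non-requests flagged as requests (false positives)
    fn: int  # meeting requests missed (false negatives)
    tn: int  # non-requests correctly ignored

    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self):
        return self.tp / (self.tp + self.fp)

    def recall(self):
        return self.tp / (self.tp + self.fn)


# Hypothetical counts over the same 100 emails (60 true requests, 40 not).
high_precision = ConfusionCounts(tp=40, fp=5, fn=20, tn=35)
high_recall = ConfusionCounts(tp=55, fp=20, fn=5, tn=20)

for name, counts in [("High Precision", high_precision), ("High Recall", high_recall)]:
    print(f"{name}: accuracy={counts.accuracy():.2f} "
          f"precision={counts.precision():.2f} recall={counts.recall():.2f}")
```

In this toy setup both versions get 75 of 100 emails right, but the first mostly errs by missing requests and the second mostly errs by flagging emails that are not requests.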

The paper makes an excellent case for digging deeper into error recovery costs and correlating that with why participants in this study preferred a system with high false positive rates. This is critical for system designers to keep in mind while dealing with uncertain agents like an AI core. It becomes even more important in a high-stakes scenario.

Questions

The paper starts off with the hypothesis that avoiding false positives is considered better for user experience, and therefore systems are optimized for high precision. The findings however contradicted it. Can you think about scenarios where you’d prefer a system with a higher likelihood of false positives? Can you think about scenarios where you’d prefer a system with a higher likelihood of false negatives?
Did you think the design interventions were exhaustive? How would you have added on to the ones suggested in the paper? If you were to adopt something for your own research, what would it be?
The paper discusses factoring in other aspects, such as workload, both mental and physical, and the criticality of consequences. How would you leverage these aspects in design interventions?
If you used an AI-infused system every day (to the extent it’s subconsciously a part of your life):
1. Would you be able to assess the AI imperfections purely on the basis of usage? How long would it take for you to assess the nature of the AI?
2. Would you be aware if the AI model suddenly changed underneath? How long would it take for you to notice the changes? Would your behavior (within the context of the system) be affected in the long term?

02/26/2020 – Sushmethaa Muhundan – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

February 25, 2020 Sushmethaa Muhundan

The paper explores how people make fairness judgments of ML systems and the impact that different explanations can have on these fairness judgments. The paper also explores how providing personalized and adaptive explanations can support such fairness judgments of ML systems. It is extremely important to ensure algorithmic fairness, and there is a need to consciously work towards avoiding the risk of amplifying existing biases. In this context, providing explanations can be beneficial in two respects: not only do they help in providing implementation details which would otherwise be a “black box” to a user, but they also facilitate better human-in-the-loop experiences by enabling people to identify fairness issues. The COMPAS recidivism data was used for the study, and four different explanation styles were examined: input-influence based, demographic-based, sensitivity-based, and case-based. The study highlights that there is no one-size-fits-all solution for an effective explanation. The dataset, context, kinds of fairness issues, and user profiles vary and need to be addressed individually. The paper proposes hybrid explanations as a solution to this problem, providing both an overview of the ML model and information about specific cases to aid accurate fairness judgment.

While there has been a lot of research focus on developing non-discriminatory ML algorithms, this paper specifically deals with the human aspect which is necessary to identify and remedy fairness issues. I feel that this is equally important and is often overlooked. It was interesting to note that they auto-generated the explanations, unlike previous studies.

With respect to the different explanation styles used, I found the sensitivity-based explanation particularly interesting since it clearly shows how the prediction result would differ if certain attributes were modified. In my opinion, this form of explanation, out of the four proposed, is extremely effective in bringing out any bias that may be present in the ML system.

I felt that the input-influence based explanation was also effective since it had the +/- markers corresponding to features that match the particular case. This gives users a clearer picture of which attributes specifically influenced the result, thereby providing the implementation details to a certain extent.
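A rough sketch of how such +/- markers could be generated from a linear model's signed feature weights is shown below. The feature names, weights, and the rule mapping weight magnitude to the number of marks are all assumptions made for illustration; the paper's actual generation program may work differently.

```python
# Hypothetical signed weights (positive = pushes towards "likely to re-offend").
# These are illustrative values, not coefficients from the paper's model.
FEATURE_WEIGHTS = {
    "prior convictions: more than 3": 0.6,
    "age: under 25": 0.35,
    "charge degree: misdemeanor": -0.4,
    "age: over 45": -0.1,
}


def influence_markers(weight, step=0.2, max_marks=5):
    """Map a signed weight to a run of '+' or '-' marks (assumed rule)."""
    if weight == 0:
        return ""
    count = min(max_marks, max(1, round(abs(weight) / step)))
    return ("+" if weight > 0 else "-") * count


def explain(case_features):
    """List every feature with its markers and flag those matching the case."""
    lines = []
    for feature, weight in FEATURE_WEIGHTS.items():
        pointer = "<- matches this case" if feature in case_features else ""
        lines.append(f"{influence_markers(weight):<6}{feature} {pointer}".rstrip())
    return "\n".join(lines)


print(explain({"age: under 25", "charge degree: misdemeanor"}))
```

The explanation is global in that it lists every feature's influence, and the pointer column ties it back to the individual case being judged.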

The study results document various insights from participants, and I found some of them to be extremely fascinating. While some believed that certain predictions were biased, others found it normal for that verdict to be predicted. It truly captured the diversity in opinions and perspectives of the same ML system based on the different explanations provided.

Through this study, it is revealed that the perception of bias is not uniform and is extremely subjective. Given this lack of agreement on the definition of moral concepts, how can a truly unbiased ML system be achieved?
What are some practices that can be followed by ML model developers to ensure that the bias in the input dataset is identified and removed?
Apart from gender-bias and ethnic-bias, what are some other prevalent biases in existing ML systems that need to be eradicated?

02/26/20 – Vikram Mohanty – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

February 25, 2020 Vikram Mohanty

Authors: Jonathan Dodge, Q. Vera Liao, Yunfeng Zhang, Rachel K. E. Bellamy, Casey Dugan

Summary

This paper discusses how different types of programmatically generated explanations can impact people’s fairness judgments of ML systems. The authors conduct studies with Mechanical Turk workers by showing them profiles from a recidivism dataset and the explanations for a classifier’s decision. Findings from the paper show that certain explanations can enhance people’s confidence in the fairness of the algorithm, and individual differences, including prior positions and judgment criteria of algorithmic fairness, impact how people react to different styles of explanation.

Reflection

For the sake of the study, the participants were shown only one type of explanation. While that worked for the purpose of this study, there is value in seeing the global and local explanations together. For example, the input-influence explanations can highlight the features that make a person more or less likely to be predicted to re-offend, and allowing the user to dig deeper into those features through a local explanation can give them more clarity. There is some scope for building interactive platforms with the “overview first, details on demand” philosophy. It is, therefore, interesting to see the paper discuss the potential of a human-in-the-loop workflow.

I agree with the paper that a focus on data-oriented explanations has the unintended consequence of shifting blame away from the algorithms, which can slow down the “healing process” from the biases we interact with when we use these systems. Re-assessing the “how” explanations, i.e. how the decisions were made, is the right approach. The Effect of Population and “Structural” Biases on Social Media-based Algorithms – A Case Study in Geolocation Inference Across the Urban-Rural Spectrum by Johnson et al. illustrates how bias can be attributed to the design of algorithms themselves rather than population biases in the underlying data sources.

The paper makes an interesting contribution regarding the participants’ prior beliefs and positions and how they impact the way participants perceive these judgments. In my opinion, as a system developer, it seems like a good option to take a position (an informed one, of course, and depending on the task) and advocate for normative explanations, rather than appeasing everyone and reinforcing meaningless biases which could have been avoided otherwise.

Questions

Based on Figure 1, what other explanations would you suggest? If you were to pick 2 explanations, which 2 would you pick and why?
If you were to design a human-in-the-loop workflow, what sort of input would you seek from the user? Can you outline some high-level feedback data points for a dummy case?
Would normative explanations frustrate you if your beliefs didn’t align with the explanations (even though the explanations make perfect sense)? Would you adapt to the explanations? (PS Read about the backfire effect here: https://youarenotsosmart.com/2011/06/10/the-backfire-effect/)

02/26/2020 – Sukrit Venkatagiri – Interpreting Interpretability

February 25, 2020 Sukrit Venkatagiri

Paper: Harmanpreet Kaur, Harsha Nori, Samuel Jenkins, Rich Caruana, Hanna Wallach, and Jennifer Wortman Vaughan. 2020. Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. In CHI 2020, 13.

Summary: There have been a number of tools developed to aid in increasing the interpretability of ML models, which are used in a wide variety of applications today. However, very few of these tools have been studied with a consideration of the context of use and evaluated by actual users. This paper presents a user-centered evaluation of two ML interpretability tools using a combination of interviews, contextual inquiry, and a large-scale survey with data scientists.

From the interviews, they found six key themes: missing values, temporal changes in the data, duplicate data masked as unique, correlated features, ad hoc categorization, and difficulty of trying to debug or identify potential improvements. From the contextual inquiry with a glass-box model (GAM) and a post-hoc explanation technique (SHAP), they found a misalignment between data scientists’ understanding of the tools and their intended use. And finally, from the surveys, they found that participants’ mental models differed greatly, and that their interpretations of these interpretability tools also varied on multiple axes. The paper concludes with a discussion on bridging HCI and ML communities and designing more interactive interpretability tools.

Reflection:

Overall, I really liked the paper, and it provided a nuanced as well as broad overview of data scientists’ expectations and interpretations of interpretability tools. I especially appreciate the multi-stage, mixed-methods approach that is used in the paper. In addition, I commend the authors for providing access to their semi-structured interview guide, as well as other study materials, and for pre-registering their study. I believe other researchers should strive to be this transparent in their research as well.

More specifically, it is interesting that the paper first leveraged a small pilot study to inform the design of a more in-depth “contextual inquiry” and a large-scale study. However, I do not believe the methods used for the “contextual inquiry” amount to a true contextual inquiry; rather, it is more like a user study involving a semi-structured interview. This is especially true since many of the participants were not familiar with the interpretability tools used in the study, which means that it was not their actual context of use/work.

I am also unsure how realistic the survey is, in terms of mimicking what someone would actually do, and appreciate that the authors acknowledge the same in the limitations section. A minor concern is also the 7-point scale that is used in the survey that ranges from “not at all” to “extremely,” which does not follow standard survey science practices.

I wonder what would happen if the participants were a) nudged to not take the visualizations at face value or to employ “system 2”-type thinking, and/or b) asked to use the tool for a longer period. Indeed, they do notice some emergent behavior in the findings, such as a participant questioning whether the tool was actually an interpretability tool. I also wonder what would have happened if two people had used the tool side-by-side, as a “pair programming” exercise.

It’s also interesting how varied participants’ backgrounds, skills, baseline expectations, and interpretations were. Certainly, this problem has been studied elsewhere, and I wonder whether the findings in this paper are a result of not only the failure of these tools to be designed in a user-centered manner, but also the broad range in technical skills of the users themselves. What would it mean to develop a tool for users with such a range in skillsets, especially statistical and mathematical skills? This certainly calls for increased certification within the ML software industry, given the increased demand for data scientists.

I appreciate the point surrounding Kahneman’s system 1 and system 2 work in the discussion, but I believe this section is possibly too short. I acknowledge that there are page restrictions, which meant that the results could not have been discussed in as much depth as is warranted for such a formative study.

Overall, this was a very valuable study that was conducted in a methodical manner and I believe the findings to be interesting to present and future developers of ML interpretability tools, as well as the HCI community that is increasingly interested in improving the process of designing such tools.

Questions:

Is interpretability only something to be checked off a list, and not inspected at depth?
How do you inspect the interpretability of your models, if at all? When do you know you’ve succeeded?
Why is there a disconnect between the way these tools are intended to be used and how they are actually used? How can this be fixed?
Do you think there need to be greater requirements in terms of certification/baseline understanding and skills for ML engineers?

02/26/2020 – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment – Yuhang Liu

February 25, 2020 Yuhang Liu

This paper mainly explores unfairness in the results of machine learning. This unfairness usually shows up along the lines of gender and race, so in order to make the results of machine learning better serve people, the authors of the paper conducted an empirical study with four types of programmatically generated explanations to understand how they impact people’s fairness judgments of ML systems. In the experiment, these four explanation types have different characteristics, and after the experiment, the authors report the following findings:

Some explanations are inherently considered less fair, while others can increase people’s confidence in the fairness of the algorithm.
Different explanations can more effectively expose different fairness issues, such as model-wide fairness issues and fairness discrepancies in specific cases.
People differ from one another: their prior positions and the perspective from which they understand things affect how they respond to different explanation styles.

In the end, the authors conclude that in order to make the results of machine learning generally fair, different corrections are needed in different situations, and differences between people must be taken into account.

Reflection:

In another class this semester, the teacher assigned three readings on how the results of machine learning can increase discrimination. In the discussion of those three articles, I remember that most students thought the cause of the discrimination was not the inaccuracy of the algorithm or model. I even thought that machine learning simply analyzes things objectively and displays the results, and that the main reason people feel uncomfortable with the results, or even find them immoral, is that they are unwilling to face them. It is often difficult for people to see the whole picture, and when these unnoticed aspects are brought to the table, people are shocked or even condemn others, but rarely think hard about the underlying cause. After reading this paper, however, I think my previous understanding was narrow. First, the results of the algorithm and the explanation of the results must sometimes be wrong and discriminatory, so only if we resolve this discrimination can the results of machine learning better serve people. I also agree with the ideas and conclusions in the article: different explanation methods and different emphases do affect how fair an explanation appears. The prerequisite for eliminating these injustices is understanding their causes. I think the main responsibility for eliminating injustice still lies with the researcher. The reason I find computers fascinating is that they can always deal with problems rationally and objectively. People’s responses to different results, and the influence of different people on different model predictions, are the key to eliminating this injustice. Of course, I think part of the cause of the injustice is also the injustice of our own society.
When people think that the results of machine learning carry discrimination based on race, sex, religion, etc., should we think about this discrimination itself? Should we pay more attention to gender equality, ethnic equality, and how to make the results look better?

Question:

Do you think this unfairness arises more because the results of machine learning mislead people, or because it has existed in society for a long time?
The article proposes that, in order to get fairer results, more people need to be considered. What changes should users make?
How can the points of different machine learning explanations be combined to create a fairer explanation?

2/26/20 – Jooyoung Whang – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

February 25, 2020 Jooyoung Whang 1 Comment

This paper studies what an AI system can do to gain more user approval even when it is not perfect. The paper focuses on the concept of “Expectation” and the discrepancy between an AI’s ability and a user’s expectation for the system. To explore this problem, the authors implemented an AI-powered scheduling assistant that mimics the look of MS Outlook. The agent detects whether an E-mail contains an appointment request and asks the user if he or she wants to add the appointment to the calendar. The system was intentionally made to perform worse than the originally trained model in order to explore mitigation techniques that boost user satisfaction with an imperfect system. After trying out various methods, the authors conclude that users prefer AI systems that focus on high precision, and that users like systems that give direct information about the system, show explanations, and support a certain measure of control.
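To make the idea of "focusing on high precision" concrete, here is a minimal sketch of how a detection threshold trades precision against recall. It is not the authors' implementation: the scores, labels, and the `precision_recall` helper are all made up for illustration, and the assumption that the assistant flags an E-mail when a confidence score passes a threshold is mine.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when E-mails scoring >= threshold are flagged.

    scores: hypothetical model confidences that an E-mail holds a meeting request
    labels: True if the E-mail really holds a meeting request
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)       # correct detections
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)   # false alarms
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)   # missed requests
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up data for six E-mails (assumption, not from the paper).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.20]
labels = [True, True, False, True, False, False]

strict = precision_recall(scores, labels, 0.7)   # fewer false alarms, misses one request
lenient = precision_recall(scores, labels, 0.5)  # catches every request, one false alarm
print("threshold 0.7:", strict)
print("threshold 0.5:", lenient)
```

With this toy data, raising the threshold removes the false alarm but misses one real request, while lowering it does the opposite. A user-facing control like the slider could, under this assumption, be thought of as moving such a threshold.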

This paper was a fresh approach that appropriately addresses the limitations that AI systems would likely have. While many researchers have looked into methods of maximizing the system accuracy, the authors of this paper studied ways to improve user satisfaction even without a high performing AI model.

I did get the feeling that the designs for adjusting end-user expectations were a bit too static. Aside from the controllable slider, the other two designs were basically texts and images with either an indication of the accuracy or a step-by-step guide on how the system works. I wonder if having a more dynamic version where the system reports for a specific instance would be more useful. For example, for every new E-mail, the system could additionally report to the user how confident it is or why it thought that the E-mail included a meeting request.

This research reminded me of one of the UX design techniques: think-aloud testing. In all of their designs, the authors’ common approach was to close the gap between user expectation and system performance. Think-aloud testing is also used to close that gap by analyzing how a user would interact with a system and adjusting from the results. I think this research approached it in the opposite way. Instead of adjusting the system, the authors’ designs try to adjust the user’s mental model.

The following are the questions that I had while reading the paper:

1. As I’ve written in my reflection portion, do you think the system will be approved more if it reported some information about the system for each instance (E-mail)? Do you think the system may appear to be making excuses for when it is wrong? In what way would this dynamic version be more helpful than the static design from the paper?

2. In the generalizability section, the authors state that they think some parts of their study are scalable to other kinds of AI systems. What other types of AI could benefit from this study? Which one would benefit the most?

3. Many AI applications today are deployed after satisfying a certain accuracy threshold which is pretty high. This can lead to more funds and time needed for development. Do you think this research will allow the stakeholders to lower the threshold? In the end, the stakeholders just want to achieve high user satisfaction.