02/26/2020 – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment – Yuhang Liu

This paper explores unfairness in machine learning outcomes, which most often surfaces along lines of gender and race. To help machine learning results better serve people, the authors conducted an empirical study with four types of programmatically generated explanations to understand how they impact people's fairness judgments of ML systems. The four explanation styles have different characteristics, and after the experiment the authors report the following findings:

  1. Certain explanations are considered inherently less fair, while others can increase people's confidence in the fairness of the algorithm;
  2. Different explanations expose different fairness issues more effectively, such as model-wide fairness problems versus case-specific fairness discrepancies;
  3. People differ: their prior positions and the perspectives from which they understand things affect how they respond to different explanation styles.

In the end, the authors conclude that making machine learning outcomes generally fair requires different corrective measures in different situations, and that differences between people must be taken into account.

Reflection:

In another class this semester, the instructor assigned three readings on how machine learning results can amplify discrimination. In the discussion of those three articles, I remember that most students felt the discrimination should not be blamed on inaccuracy in the algorithm or model, and I even argued that machine learning simply analyzes things objectively and displays the results: the main reason people feel uncomfortable, or even find the results immoral, is that they are unwilling to face them. It is often difficult to see the whole picture of a situation, and when these overlooked aspects are brought to the table, people are shocked and quick to condemn others, but rarely think carefully about the underlying causes. After reading this paper, however, I think my previous understanding was narrow. First, the results of an algorithm and the explanations of those results can indeed be wrong and discriminatory in some cases, so only by addressing this discrimination can machine learning results better serve people. I also agree with the ideas and conclusions in the article: different explanation methods and different emphases do affect how fair an explanation appears, and the prerequisite for eliminating injustice is understanding its causes. At the same time, I think the main responsibility for eliminating injustice still falls on researchers. The reason I find computers fascinating is that they can deal with problems rationally and objectively; people's responses to different results, and the influence different people have on different model predictions, are the key to eliminating this injustice. Of course, part of the cause of unfairness is the unfairness of our own society. When people find that machine learning results carry discrimination based on race, sex, religion, and so on, we should also reflect on that discrimination itself and ask whether we should pay more attention to gender and ethnic equality rather than merely making the results look better.

Question:

  1. Do you think this unfairness stems more from machine learning results misleading people, or from biases that have existed in society for a long time?
  2. The paper suggests that to obtain fairer results, more people need to be considered; what changes should users themselves make?
  3. How could the strengths of different machine learning explanation styles be combined to create a fairer explanation?

02/26/2020 – Subil Abraham – Explaining models

A big concern with the usage of current ML systems is the issue of fairness and bias in their decisions. Bias can creep into ML decisions either through the design of the algorithm or through training datasets that are labeled in ways that disadvantage certain groups. The example used in this paper is the bias against African Americans in an ML system used by judges to predict the probability of a person re-offending after committing a crime. Fairness is hard to judge when ML systems are black boxes, so this paper proposes that if ML systems expose the reasons behind their decisions (i.e., the idea of explainable AI), the user can make a better judgement of the fairness of those decisions. To this end, the paper examines the effect of four different kinds of explanations of ML decisions on people's judgements of the fairness of those decisions.

I believe this is a very timely and necessary paper, with ML systems being used more and more for sensitive, life-changing decisions. It is probably impossible to stop people from adopting these systems, so the next best thing is making explainability of ML decisions mandatory, so people can see and judge whether there was potential bias in the system's decisions. It is interesting that people were mostly able to perceive fairness issues in the raw data. You would think that would be hard, but the generated explanations may have worked well enough to help with that (though I do wish they had shown an example comparing a raw data point and a processed data point to illustrate how their pre-processing cleaned things). I did wonder why they didn't show confidence levels to the users in the evaluation, but their explanation that it was something they could not control for makes sense. People could have different reactions to confidence levels: some might think anything less than 100% is insufficient, while others might think 51% is good enough. So keeping it out is limiting, but logical.

  1. What other kinds of generated explanations could be beneficial, outside of the ones used in the paper?
  2. Checking for racial bias is an important case for fair AI. In what other areas are fairness and bias correction in AI critical?
  3. What would be ways that you could mitigate any inherent racial bias of the users who are using explainable AI, when they are making their decisions?

02/26/2020 – Subil Abraham – Will you accept an imperfect AI?

Reading: Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–14. https://doi.org/10.1145/3290605.3300641

Different parts of our lives are being infused with AI magic. With this infusion, however, come problems, because the AI systems deployed aren't always accurate. Users are used to software systems being precise and doing exactly the right thing, but unfortunately they can't extend that expectation to AI systems, which are often inaccurate and make mistakes. Thus it is necessary for developers to set users' expectations ahead of time so that the users are not disappointed. This paper proposes three different visual methods of setting the user's expectations of how well an AI system will work: an indicator depicting accuracy, a set of examples demonstrating how the system works, and a slider that controls how aggressively the system should act. The system under evaluation is a detector that identifies and suggests potential meetings based on the language in an email. The goal of the paper isn't to improve the AI system itself, but rather to evaluate how well the different expectation-setting methods work given an imprecise AI system.
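To make the slider idea concrete, here is a minimal sketch (my own, not from the paper) of how an "aggressiveness" control could simply move the decision threshold over the detector's confidence score; the function name and values are hypothetical:

```python
# Sketch: a slider-style "aggressiveness" control for a meeting detector.
# Assumes the detector outputs a confidence score in [0, 1] per email; the
# slider only moves the decision threshold, so a more aggressive setting
# surfaces more suggestions (more false positives, fewer missed meetings).

def suggest_meeting(confidence: float, aggressiveness: float) -> bool:
    """Return True if a meeting suggestion should be shown.

    aggressiveness: slider value in [0, 1]; 0 = conservative (high
    precision), 1 = aggressive (high recall).
    """
    threshold = 1.0 - aggressiveness  # aggressive slider -> lower threshold
    return confidence >= threshold

# The same borderline email (confidence 0.4) is suppressed at a conservative
# setting but suggested at an aggressive one.
print(suggest_meeting(0.4, aggressiveness=0.3))  # False
print(suggest_meeting(0.4, aggressiveness=0.8))  # True
```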

I want to note that I really wanted to see an evaluation of the effects of mixed techniques. I hope that it will be covered in possible future work, but I am also afraid that such work might never get published because it would be classified as incremental (unless they come up with more expectation-setting methods beyond the three mentioned in this paper and do a larger evaluation). It is useful to see that we now have numbers to back up that high-recall applications are perceived as more accurate under certain scenarios. It makes intuitive sense that it would be more convenient to deal with false positives (just close the dialog box) than false negatives (having to manually create a calendar event). Also, the control slider brings to mind the trick that some offices play where they have the climate control box within easy reach of the employees but it actually doesn't do anything; it's a placebo to make people think it got warmer or colder when nothing has changed. I realize that the slider in the paper is actually supposed to do what it advertises, but it makes me think of other places where a placebo slider could be given to users to make them think they have control when in fact the AI system remains completely unchanged.

  1. What other kinds of designs can be useful for expectation setting in AI systems?
  2. How would these designs look different for a more active AI system like medical prediction, rather than a passive AI system like the meeting detector?
  3. The paper claims that the results are generalizable for other passive AI systems, but are there examples of such systems where it is not generalizable?

02/26/2020 – Sushmethaa Muhundan – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

The perception and acceptance of AI systems are impacted by the expectations users have of the system as well as their prior experiences working with AI systems. Hence, setting expectations before users interact with the system is pivotal in avoiding inflated expectations, which could lead to disappointment if they are not met. A Scheduling Assistant system is used as the example in this paper, and expectation-adjustment techniques are discussed. The paper focuses on exploring methods to shape users' expectations before they use the system and studies the impact on their acceptance of it. Apart from this, the impact of different AI imperfections is also studied, specifically false positives versus false negatives. Accuracy indicator, example-based explanation, and performance control are the three techniques proposed and evaluated. Via the studies conducted, it is concluded that better expectation setting before using a system decreases the chances of disappointment by highlighting the flaws of the system beforehand.

The study assumes that the users are new to the environment and dedicates time to explaining the interface at the initial stage of the experiment. I felt that this was helpful, since the people involved in the survey could follow along; I found this missing in some of the earlier papers we read, where it was assumed that all readers had sufficient prior knowledge. Also, despite the fact that the system's initial performance was ninety-three percent on the test dataset, the authors decided to set the accuracy to fifty percent in order to gauge users' sentiments and evaluate their expectation-setting hypothesis. I felt that this greatly increased the scope for disappointment, thereby helping them efficiently validate their expectation-setting designs and their effects. I felt that the decision to use visualizations as well as a short summary of intent in their explanations was helpful, since this eliminated the need for users to read lengthy summaries and offered better support for user decisions. It was also good to note the authors' take on deception and marketing as a means of setting false expectations. This study went beyond such techniques and focused on shaping people's expectations by explaining the accuracy of the system. I felt that this perspective was more ethical compared to other means adopted in this area.

  1. Apart from the expectations that users have, what other factors influence the perception and acceptance of AI systems by the users?
  2. What are some other techniques, visual or otherwise, that can be adopted to set expectations of AI systems?
  3. How can the AI system developers tackle trust issues and acceptance issues? Given that perceptions and individual experiences are extremely diverse, is it possible for an AI system to be capable of completely satisfying all its users?

02/26/20 – Lee Lisle – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

Summary

            Kaur et al. cover how data scientists are now grappling with ways of explaining their algorithms' results to the public through interpretability tools. They note that machine learning algorithms are often "black boxes" that don't typically convey how they arrive at certain results, but that there are several methods of interpreting the results of these algorithms, such as GAMs, LIME, and SHAP. The authors then conduct six interviews, a contextual inquiry of data scientists, and a large-scale survey to see if these tools are being used effectively. They found that, while some tools do perform better than others, these tools are being misused by data scientists who misunderstand their intended use. The participants either over-utilized or under-utilized the tools and trusted their output and impact too deeply.
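As a quick illustration of the kind of tooling being discussed, here is a minimal sketch of a post-hoc SHAP workflow on a toy scikit-learn model; the data and model are placeholders, not those used in the study:

```python
# Minimal post-hoc interpretability workflow with SHAP on a toy model.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # 200 rows, 4 synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # label driven mostly by feature 0

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer produces per-feature attributions for individual predictions;
# these attributions are what the interpretability tools visualize.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])
print(shap_values)
```

The paper's point is less about producing such output and more about whether data scientists probe it further or simply trust it.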

Personal Reflection

It was fascinating to see tools that HCI professionals typically use to understand many different aspects of a job turned onto computer science practitioners and algorithm designers as a sort of self-evaluation of the field. I was also surprised to see that there are so many possible errors in training data; I had assumed that these training datasets had been cleaned and verified to make sure there were no duplicates or missing data from them. That part reinforced the need for the tools to find issues with datasets.

The study uncovered that the visualizations made the data scientists over-confident in their results. It was interesting to see that once the tools surfaced an insight into the data, the data scientists didn't look more deeply into that result. That they were fine with not knowing why a key attribute led to a certain result showcased why they might need to look more deeply into the workings of the algorithms. They also gave a lot of similar answers, in that "I guess," "I suppose," and "not sure why" were all present and are fairly similar responses. It was furthermore odd that, during the survey, they weren't confident that the underlying models were reasonable but didn't think the dataset or model was to blame. Does this point to some amount of overconfidence in their own field?

Questions

  1. Since this covered the AI designers mainly, do you think there is a key aspect of HCI research that could use a more complete understanding of its practice and practitioners? I.e., is there an issue that could be surfaced if HCI practitioners performed an ethnography or survey on their colleagues?
  2. Since they had participants essentially perform a simulated task in the second phase, do you think this affected the results?
  3. Would seeing these data scientists work on their own datasets have made a difference to the results? Do you think it would have changed how the data scientists think about their own work?
  4. Since the dataset they used was a relatively low-risk dataset (i.e., it wasn’t like a recidivism predictor or loan-default prediction service), does that impact how the data scientists interacted with the tools?

02/26/2020 – Akshita Jha – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

Summary:
“Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning” by Kaur et al. talks about the interpretability tools that are being used to help data scientists and machine learning researchers. Very few of these tools have been evaluated to understand whether or not they achieve their goals with respect to interpretability. The authors study two interpretability techniques, GAMs and SHAP, in detail. They conduct a contextual inquiry and a survey of data scientists to figure out how they utilize the information provided by these tools. They highlight the qualitative themes that emerged and conclude with implications for researchers and tool designers.

Reflections:
There are two major aspects of interpretability: (i) building interpretable models, and (ii) users’ understanding of these interpretable models. The paper does a good job of providing an in-depth analysis of users’ understanding of these models. However, the authors focus on a data scientist’s view of these tools; I feel that the quality of the interpretability of these models should also be judged by non-expert end users. The authors talk about six issue themes: (i) missing values, (ii) changes in data, (iii) duplicate data, (iv) redundant features, (v) ad-hoc categorization, and (vi) debugging difficulties, which they incorporate into the contextual inquiry. More nuanced patterns might be revealed if a more in-depth study were conducted. Also, depending on the domain knowledge of the participants, the interpretability outputs might be interpreted differently; the authors should have tried to take this into account while surveying the candidates. Furthermore, most people have started using deep learning models, so it is important to focus on the interpretability of those models as well. The authors focus on tabular data, which might not be very representative of the real world, and a detailed study needs to be conducted to understand interpretability in deep learning models. Something else I found interesting was the authors attributing the way these tools are used to System 1 and System 2 thinking as described by Kahneman: when faced with issues such as missing values, humans make quick, automatic System 1 decisions unless they are encouraged to engage their cognitive thought process, which prompts System 2 thinking. The pilot interview was conducted on a very small group of users (N=6) to identify the common issues data scientists face in their work. A more representative survey of data scientists with different skill sets should have been conducted to help them better.
Questions:
1. What is post-hoc interpretability? Is that enough?
2. Should the burden lie on the developer to explain the predictions of a model?
3. Can we incorporate interpretability while making decisions?
4. How can humans help in such a scenario apart from evaluating the quality of the interpretable model?

02/26/20 – Lee Lisle – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary

            Dodge et al. cover a terribly important issue with artificial intelligence programs and biases inherited from historical datasets, and how to mitigate the inherent racism or other biases within. They also work to understand how to better communicate why and how AIs reach the recommendations they do. In an experiment, they look at communicating outcomes from COMPAS, a known-biased ML model for predicting recidivism among released prisoners. They cleaned the ML model to make race less impactful to the final decision, and then produced four ways of explaining the result of the model to 160 mTurk workers: Sensitivity, Input-influence, Case-based, and Demographic. “Input-influence” emphasizes how much each input affected the results, “Demographic” describes how each demographic affects the results, “Sensitivity” shows whether flipped demographics would have changed the results, and “Case-based” finds the most similar cases and details their results. They found that local explanations (case-based and sensitivity) had the largest impact on perceived fairness.
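As a rough illustration of what the sensitivity style amounts to (my own sketch with a hypothetical model interface, not the authors' code or the COMPAS features):

```python
# Sketch of a sensitivity-style explanation: flip one sensitive attribute and
# report whether the model's decision would change. The model is assumed to
# follow the scikit-learn predict() convention; feature layout is hypothetical.

def sensitivity_explanation(model, x, sensitive_index, alternative_value):
    """Explain whether flipping one attribute (e.g., race) changes the
    model's prediction for the single instance x (a list of feature values)."""
    original = model.predict([x])[0]
    flipped_x = list(x)
    flipped_x[sensitive_index] = alternative_value
    flipped = model.predict([flipped_x])[0]
    if original != flipped:
        return "Changing this attribute alone would flip the prediction."
    return "Changing this attribute alone would not change the prediction."
```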

Personal Reflection

This study was pretty interesting to me because it actually tries to adjust for the biases in the input data as well as to understand how to better convey insights from less-biased systems. I am still unsure that the authors removed all bias from the COMPAS system, but seeing that they lowered the race coefficient significantly shows they were working on it. In this vein, the paper made me want to read the paper they cited on how biases in these algorithms could be mitigated.

I found their various methods of communicating how the algorithm came to its recommendation to be rather incisive. I wasn't surprised that when the sensitivity explanation said the decision would be flipped if the individual's race were flipped, people perceived more issues with the ML decision. That method of communication seems to lead people to see issues with the dataset more easily in general.

The last notable part of the experiment is that they didn’t give a confidence value for each case – they stated that they could not control for it and so did not present it to participants. That seems like an important part of making a decision based on the algorithm. If the algorithm is really on the fence, but has to recommend one way or the other, it might make it easier to state that the algorithm is biased.

Questions

  1. Would removing the race (or other non-controllable bias) coefficient altogether affect the results too much? Is there merit in zeroing out the coefficients of these factors?
  2. Having an attention check in the mTurk workflow is, in itself, not surprising. However, the fact that all of the crowdworkers passed the check is surprising. What does this mean for other work that ignores a subset of data assuming the crowdworkers weren’t paying attention? (Like the paper last week that ignored the lowest quartile of results)
  3.  What combination of the four different types would be most effective? If you presented more than one type, would it have affected the results?
  4. Do you think showing the confidence value for the algorithm would impact the results significantly?  Why or why not?

02/26/2020 – Akshita Jha – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary:
“Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment” by Dodge et al. talks about explainable machine learning and how to ensure fairness. The authors conduct an empirical study involving around 160 Amazon Mechanical Turk workers. They demonstrate that certain explanations are considered “inherently less fair” while others may help enhance people’s confidence in the fairness of algorithms. They also talk about different kinds of fairness issues: (i) model-wide fairness and (ii) case-specific fairness discrepancies. They show that people react differently to different styles of explanation based on individual differences, and they conclude with a discussion on how to provide personalized and adaptive explanations. There are 21 different definitions of fairness. In general, fairness can be defined as “….discrimination is considered to be present if for two individuals that have the same characteristic relevant to the decision making and differ only in the sensitive attribute (e.g., gender/race) a model results in different decisions”. Disparate impact is the consequence of deploying unfair models, where one protected group is affected negatively compared to the unprotected group. This paper talks about the explanations given by machine learning models and how such explanations can be perceived as inherently fair or unfair.

Reflections:
The researchers attempt to answer three primary research questions: (i) How do different styles of explanation impact fairness judgments of an ML system? Here they study in depth whether certain explanation styles are more effective in teasing out the unfairness of models, and whether some explanations are perceived as inherently fairer. (ii) How do individual factors in cognitive style and prior position on algorithmic fairness impact fairness judgments with regard to different explanations? (iii) What are the benefits and drawbacks of different explanations in supporting fairness judgments of ML systems? The explanations offered can be based on input features, demographic features, sensitive features, or similar cases. The authors conduct an online survey and ask participants different questions; however, an individual's background might also influence the answers given by the Mechanical Turk workers. The authors perform both a qualitative and a quantitative analysis. One of the major limitations of this work is that the analysis was performed by crowd workers with limited experience, whereas in real life the decision is made by lawyers. Additionally, the authors could have used LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) for offering post-hoc explanations. The authors also did not study an important element, confidence, as they could not control for it.

Questions:
1. Which other models are unfair? Give some examples.
2. Are race and gender the only sensitive attributes? Can models discriminate based on some other attribute? If yes, which ones?
3. Who is responsible for building unfair ML models?
4. Are explanations of unfair models enough? Does that build enough confidence in the model?
5. Can you think of any adverse effects of providing model explanations?

02/26/2020 – Ziyao Wang – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

In this paper, the authors focus on the fairness of machine learning systems. As machine learning has been widely applied, it is important that models judge fairly, which requires evaluation by developers, users, and the general public. With this aim, the authors conduct an empirical study with different generated explanations to understand how they affect people's fairness judgments of ML systems. Through an experiment involving 160 MTurk workers, they found that fairness judgment of models is a complicated problem: certain explanations are considered inherently less fair while others enhance people's trust in the algorithm, different fairness problems may be more effectively exposed through different styles of explanation, and there are individual differences arising from each person's unique background and judgment criteria. Finally, they suggest that evaluation should support different needs of fairness judgment and consider individual differences.

Reflections:

This paper gives me three main thoughts. First, we should pay attention to explaining machine learning models. Second, we should consider different needs when evaluating the fairness of models. Finally, when training models or designing human-AI interactions, we should consider users' individual differences.

For the first point, the models may be well trained and perform well, yet the public may not trust them because they know little about the algorithms. If we can provide fair and friendly explanations, public trust in the output of machine learning systems may increase. Also, if people are provided explanations, they may propose suggestions related to practical situations, which will in turn improve the accuracy and fairness of the systems. For this reason, all machine learning system developers should pay more attention to writing appropriate explanations.

Second, the explanations and the models should consider different needs. For experienced data scientists, we could provide comprehensive explanations that let them dig deeper. For people experiencing a machine learning system for the first time, we should provide user-friendly, easy-to-understand explanations. For systems that will be used by users with different backgrounds, we may need to write different versions of the explanation, for example one user guide and one developer guide.

The third point is the most complicated one: it is hard to implement a system that satisfies people with different judgment criteria. Actually, I wonder whether it is possible to develop systems with an interface that lets users input their own preferences, which might address this problem a little. It is not possible to train a model that satisfies everyone's preferences. As a result, we can only train a model that satisfies most users or the majority of the public, or simply leave room for users to select their own preferences.

Questions:

What kinds of information should we provide to the users in the explanations?

Apart from the crowd-sourcing workers, which groups of people should also be involved in the survey?

What kinds of information would you like to have if you needed to judge a system?

02/26/2020 – Nurendra Choudhary – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summary

In this paper, the authors discuss the acceptance of imperfect AI systems by human users. They specifically consider the case of email-scheduling assistants for their experiments. Three features are adopted to interface between users and the assistant: Accuracy Indicator (indicates the expected accuracy of the system), Example-based Explanation design (explanations of sample test cases to help users develop mental models before use), and Control Slider design (letting users control the system's aggressiveness, i.e., the false positive vs. false negative rate).

The participants of the study completed a six-step procedure that analyzed their initial expectations of the system. The evaluation showed that the features helped set the right expectations for the participants. Additionally, it concluded that a high-recall system gave a pseudo-sense of higher accuracy than a high-precision one. The study shows that user expectations and acceptance can be satisfied not only through intelligible explanations but also by tweaking model evaluation metrics to emphasize one over another (recall over precision in this case).
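A small sketch (toy numbers, not the study's data) of the recall-versus-precision trade-off being described: the same scorer can be pushed toward a high-recall or a high-precision operating point just by moving its threshold:

```python
# Toy illustration: one set of confidence scores, two operating points.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                    # made-up ground truth
scores = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1, 0.7, 0.45]  # made-up confidences

# Low threshold favors recall; high threshold favors precision.
for threshold in (0.3, 0.65):
    y_pred = [int(s >= threshold) for s in scores]
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")
```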

Reflection

The paper explores user expectations of AI and their corresponding perceptions and acceptance. An interesting idea is tweaking evaluation metrics. In previous classes and in a lot of current research, we discuss utilizing interpretable or explainable AI as the fix for the problem of user acceptance. However, this research shows that even simple measures, such as preferring recall over precision, can create a pseudo-sense of higher accuracy for users. This makes me question the validity of current evaluation techniques. Current metrics are statistical measures designed for deterministic models, where they correlated directly with user acceptance because humans could comprehend the systems' behaviour. However, given the non-deterministic and less comprehensible nature of AI, old statistical measures may not be the right way to validate AI models. For our specific research problems, we need to study end users more closely and design metrics that correlate with user demands.

Another important aspect is the human comprehensibility of AI systems. We notice from the paper that the addition of such features significantly increased user acceptance and perception, and I believe there is a need for more such systems across the field. The features shape users' expectations of the system and also help the adoption of AI systems in real-world scenarios. The slider is a great example of a manual control that can be given to users to let them set their own expectations of the system. Explanations of the system also help users develop mental models so they can understand and adapt to AI changes faster. For example, search engines and recommender systems record information about users; if users understood what is stored and how it is utilized for recommendations, they would modify their usage accordingly to fit the system's requirements. This would improve system performance as well as user experience, and it would lend AI systems a sense of pseudo-determinism.

Questions

  1. Can such studies help us find the relevance of an evaluation metric in the problem's context?
  2. Evaluation metrics have been designed as statistical measures. Has this requirement changed? Should we design metrics based on user experience?
  3. Should AI be developed according to human expectations or independently?
  4. Can this also be applied to processes that generally do not directly involve humans such as recommender systems or search engines?

Word Count: 557
