02/26/2020 – Subil Abraham – Will you accept an imperfect AI?

Reading: Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–14. https://doi.org/10.1145/3290605.3300641

Different parts of our lives are being infused with AI magic. With this infusion, however, come problems, because the AI systems deployed aren’t always accurate. Users are used to software systems being precise and doing exactly the right thing, but unfortunately they can’t extend that expectation to AI systems, which are often inaccurate and make mistakes. It is therefore necessary for developers to set users’ expectations ahead of time so that users are not disappointed. This paper proposes three different visual methods of setting the user’s expectations of how well an AI system will work: an indicator depicting accuracy, a set of examples demonstrating how the system works, and a slider that controls how aggressively the system should work. The system under evaluation is a detector that identifies and suggests potential meetings based on the language in an email. The goal of the paper isn’t to improve the AI system itself, but rather to evaluate how well the different expectation-setting methods work given an imprecise AI system.

I want to note that I really wanted to see an evaluation of the effects of mixed techniques. I hope it will be covered in possible future work they do, but I am also afraid that such work might never get published because it would be classified as incremental (unless they come up with more expectation-setting methods beyond the three mentioned in this paper and do a larger evaluation). It is useful that we now have numbers to back up the claim that high-recall applications are perceived as more accurate under certain scenarios. It makes intuitive sense that it would be more convenient to deal with false positives (just close the dialog box) than false negatives (having to manually create a calendar event). Also, seeing the control slider brings to mind the trick some offices play where the climate control box is within easy reach of the employees but actually doesn’t do anything. It’s a placebo to make people think it got warmer or colder when nothing has changed. I realize that the slider in the paper is actually supposed to do what it advertises, but it makes me think of other places where a placebo slider could be given to users to make them think they have control when in fact the AI system remains completely unchanged.
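The precision/recall intuition above can be made concrete with a tiny sketch. The confusion counts below are purely illustrative, not taken from the paper: a high-recall tuning misses few real meetings (few false negatives) at the cost of extra dismissable suggestions, while a high-precision tuning makes few wrong suggestions but misses more real meetings.

```python
# Toy confusion counts for a meeting detector (illustrative numbers only).
def precision(tp, fp):
    # Of all suggested meetings, how many were real?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all real meetings, how many were suggested?
    return tp / (tp + fn)

# High-recall tuning: flag aggressively, few missed meetings, more false suggestions.
hr_p, hr_r = precision(tp=40, fp=20), recall(tp=40, fn=2)

# High-precision tuning: flag conservatively, few wrong suggestions, more misses.
hp_p, hp_r = precision(tp=30, fp=2), recall(tp=30, fn=12)

print(f"high recall:    precision={hr_p:.2f}, recall={hr_r:.2f}")
print(f"high precision: precision={hp_p:.2f}, recall={hp_r:.2f}")
```

Dismissing each of the 20 false suggestions is a one-click cost, while each of the 12 missed meetings in the high-precision case means manually creating a calendar event, which is the asymmetry the paper's participants seemed to respond to.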

  1. What other kinds of designs can be useful for expectation setting in AI systems?
  2. How would these designs look different for a more active AI system like medical prediction, rather than a passive AI system like the meeting detector?
  3. The paper claims that the results are generalizable for other passive AI systems, but are there examples of such systems where it is not generalizable?


02/26/20 – Lee Lisle – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

Summary

            Kaur et al. cover how data scientists are now grappling with ways of explaining their algorithms’ results to the public through interpretability tools. They note that machine learning algorithms are often “black boxes” that don’t typically convey how they reach certain results, but that there are several methods of interpreting the results of these algorithms, such as GAMs, LIME, and SHAP. The authors then conduct six interviews, a contextual inquiry of data scientists, and a large-scale survey to see if these tools are being used effectively. They found that, while some tools do perform better than others, these tools are being misused by data scientists, who misunderstood their intended use. The authors found that the participants either over-utilized or under-utilized the tools and trusted their output and impact too deeply.
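For readers unfamiliar with SHAP, the underlying idea is the Shapley value: a feature's attribution is its marginal contribution to the prediction, averaged over all orderings of the features. The sketch below computes this exactly for a toy two-feature linear model; it is a conceptual illustration, not the SHAP package's API, which uses efficient approximations for real models.

```python
from itertools import permutations
from math import factorial

def model(x):
    # Toy linear "score": income weighted heavily, age lightly.
    return 2.0 * x["income"] + 0.5 * x["age"]

def shapley(model, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every possible order of revealing the features."""
    features = list(instance)
    phi = dict.fromkeys(features, 0.0)
    for order in permutations(features):
        x = dict(baseline)
        prev = model(x)
        for f in order:
            x[f] = instance[f]   # reveal feature f
            cur = model(x)
            phi[f] += cur - prev  # marginal contribution in this order
            prev = cur
    n = factorial(len(features))
    return {f: v / n for f, v in phi.items()}

inst = {"income": 3.0, "age": 4.0}
base = {"income": 0.0, "age": 0.0}
print(shapley(model, inst, base))  # income: 6.0, age: 2.0
```

For a linear model the attributions reduce to coefficient times the deviation from the baseline, and they always sum to the gap between the instance's prediction and the baseline's, which is the property the tools visualize.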

Personal Reflection

It was fascinating to see tools that HCI professionals typically use to understand many different aspects of a job turned onto computer science practitioners and algorithm designers as a sort of self-evaluation of the field. I was also surprised to see that there are so many possible errors in training data; I had assumed that these training datasets had been cleaned and verified to make sure there were no duplicates or missing data from them. That part reinforced the need for the tools to find issues with datasets.

The study uncovered that the visualizations made the data scientists over-confident in their results. It was interesting to see that once the tools surfaced an insight into the data, the data scientists didn’t look more deeply into that result. That they were fine with not knowing why a key attribute led to a certain result showcased why they might need to look more deeply into the workings of the algorithms. Many participants gave similarly hedged answers: “I guess,” “I suppose,” and “not sure why” were all present. It was furthermore odd that, during the survey, participants weren’t confident that the underlying models were reasonable but didn’t think the dataset or model was to blame. Does this point to some amount of overconfidence in their own field?

Questions

  1. Since this covered the AI designers mainly, do you think there is a key aspect of HCI research that could use a more complete understanding of its practice and practitioners? I.e., is there an issue that could be seen if HCI practitioners performed an ethnography or survey on their colleagues?
  2. Since they had participants essentially perform a simulated task in the second phase, do you think this affected the results?
  3. Would seeing these data scientists work on their own datasets have made a difference to the results? Do you think it would have changed how the data scientists think about their own work?
  4. Since the dataset they used was a relatively low-risk dataset (i.e., it wasn’t like a recidivism predictor or loan-default prediction service), does that impact how the data scientists interacted with the tools?


02/26/2020 – Akshita Jha – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

Summary:
“Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning” by Kaur et al. talks about the interpretability tools that are being used to help data scientists and machine learning researchers. Very few of these tools have been evaluated to understand whether or not they achieve their goals with respect to interpretability. The authors study two tools in detail: GAMs and SHAP. They conduct a contextual inquiry and a survey of data scientists to figure out how they utilize the information provided by these machine learning tools for their benefit. They highlight the qualitative themes from the study and conclude with the implications for researchers and tool designers.

Reflections:
There are two major aspects of interpretability: (i) building interpretable models, and (ii) users’ understanding of these interpretable models. The paper does a good job of providing an in-depth analysis of the user’s understanding of these interpretable models. However, the authors focus on understanding a data scientist’s view of these tools; I feel that the quality of the interpretability of these models should also be judged by non-expert end users. The authors talk about the six themes captured by these tools: (i) missing values, (ii) changes in data, (iii) duplicate data, (iv) redundant features, (v) ad-hoc categorization, and (vi) debugging difficulties. They incorporate these into the “contextual inquiry”. More nuanced patterns might be revealed if an in-depth study were conducted. Also, depending on the domain knowledge of the participants, the interpretability scores might be interpreted differently; the authors should have tried to take this into account while surveying the candidates. In addition, most people have started using deep learning models, so it is important to focus on the interpretability of deep learning models as well. The authors focus on tabular data, which might not be very helpful in the real world, and a detailed study needs to be conducted in order to understand interpretability in deep learning models. Something else I found interesting was the authors attributing the way these tools are used to System 1 and System 2 thinking as described by Kahneman: humans make quick and automatic decisions based on ‘System 1’ unless they are encouraged to engage their cognitive thought processes, which prompts ‘System 2’ thinking. The pilot interview was conducted on a very small group of users (N=6) to identify the common issues faced by data scientists in their work. A more representative survey of data scientists with different skill sets should have been conducted to help them better.
Questions:
1. What is post-hoc interpretability? Is that enough?
2. Should the burden lie on the developer to explain the predictions of a model?
3. Can we incorporate interpretability while making decisions?
4. How can humans help in such a scenario apart from evaluating the quality of the interpretable model?


02/26/20 – Lee Lisle – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary

            Dodge et al. cover a terribly important issue with artificial intelligence programs and biases from historical datasets, and how to mitigate the inherent racism or other biases within. They also work to understand how to better communicate why and how AIs reach the recommendations they do. In an experiment, they look at communicating outcomes from COMPAS, a known biased ML model for predicting recidivism amongst released prisoners. They cleaned the ML model to make race less impactful to the final decision, and then produced four ways of explaining the result of the model to 160 mTurk workers: sensitivity, input-influence, case, and demographic. “Input-influence” emphasizes how much each input affected the results, “demographic” describes how each demographic affects the results, “sensitivity” shows which flipped attributes would have changed the results, and “case” finds the most similar cases and details those results. They found that local explanations (case and sensitivity) had the largest impact on perceived fairness.
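A sensitivity-style explanation can be sketched in a few lines: flip one attribute of a case and report whether the decision changes. The model and weights below are entirely hypothetical toys for illustration, not COMPAS or the paper's actual explanation generator.

```python
# Hypothetical toy classifier; 1 = predicted to reoffend.
def predict(person):
    score = 0.4 * person["priors"] + (0.5 if person["race"] == "black" else 0.0)
    return int(score >= 1.0)

def sensitivity_explanation(person, attribute, alternatives):
    """Report which counterfactual values of one attribute flip the decision."""
    original = predict(person)
    flips = [alt for alt in alternatives
             if predict(dict(person, **{attribute: alt})) != original]
    if flips:
        return f"Changing {attribute} to {flips} would have flipped the decision."
    return f"The decision does not depend on {attribute} for this case."

person = {"priors": 2, "race": "black"}
print(sensitivity_explanation(person, "race", ["white"]))
```

When the report says a race flip alone reverses the decision, the unfairness is laid bare in a single sentence, which is plausibly why this explanation style moved fairness judgments the most.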

Personal Reflection

This study was pretty interesting to me because it actually tried to adjust for the biases of the input data as well as to understand how to better convey insights from less-biased systems. I am still unsure that the authors removed all bias from the COMPAS system, but seeing that they lowered the coefficient significantly shows that they were working on it. In this vein, the paper made me want to read the paper they cited on how to mitigate biases in these algorithms.

I found their various methods of communicating how the algorithm came to its recommendation rather incisive. I wasn’t surprised that people perceived more issues with the ML decision when the sensitivity explanation showed that flipping the individual’s race would have flipped the decision. That method of communication seems to lead people to see issues with the dataset more easily in general.

The last notable part of the experiment is that they didn’t give a confidence value for each case; they stated that they could not control for it and so did not present it to participants. That seems like an important part of making a decision based on the algorithm. If the algorithm is really on the fence but has to recommend one way or the other, knowing that might make it easier to argue that the algorithm is biased.

Questions

  1. Would removing the race (or other non-controllable biases) coefficient altogether affect the results too much? Is there merit in zero-ing out the coefficient of these factors?
  2. Having an attention check in the mTurk workflow is, in itself, not surprising. However, the fact that all of the crowdworkers passed the check is surprising. What does this mean for other work that ignores a subset of data assuming the crowdworkers weren’t paying attention? (Like the paper last week that ignored the lowest quartile of results)
  3. What combination of the four different types would be most effective? If you presented more than one type, would it have affected the results?
  4. Do you think showing the confidence value for the algorithm would impact the results significantly?  Why or why not?


02/26/2020 – Akshita Jha – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary:
“Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment” by Dodge et al. talks about explainable machine learning and how to ensure fairness. They conduct an empirical study involving around 160 Amazon Mechanical Turk workers. They demonstrate that certain explanations are considered “inherently less fair” while others may help enhance people’s confidence in the fairness of algorithms. They also distinguish two kinds of fairness issues: (i) model-wide fairness and (ii) case-specific fairness discrepancies. They show that people react differently to different styles of explanation based on individual differences, and they conclude with a discussion of how to provide personalized and adaptive explanations. There are 21 different definitions of fairness. In general, fairness can be defined as follows: “…discrimination is considered to be present if for two individuals that have the same characteristic relevant to the decision making and differ only in the sensitive attribute (e.g., gender/race) a model results in different decisions”. Disparate impact is the consequence of deploying unfair models, where one protected group is affected negatively compared to the unprotected group. This paper talks about the explanations given by machine learning models and how such models can be inherently fair or unfair.
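Disparate impact, mentioned above, is often quantified as the ratio of favorable-outcome rates between groups. The sketch below uses made-up decision lists; the 0.8 cutoff is the common "four-fifths rule" heuristic from employment law, not a threshold from this paper.

```python
# Minimal sketch of a disparate impact measurement (illustrative data).
def positive_rate(decisions):
    # Fraction of favorable (1) decisions in a group.
    return sum(decisions) / len(decisions)

def disparate_impact(protected, unprotected):
    # Ratio of favorable-outcome rates; values well below 1 suggest the
    # protected group is disadvantaged.
    return positive_rate(protected) / positive_rate(unprotected)

# 1 = favorable decision (e.g., predicted low risk)
protected_group = [1, 0, 0, 1, 0]    # 40% favorable
unprotected_group = [1, 1, 0, 1, 1]  # 80% favorable

ratio = disparate_impact(protected_group, unprotected_group)
print(f"disparate impact ratio: {ratio:.2f}")  # 0.50, below the 0.8 heuristic
```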

Reflections:
The researchers attempt to answer three primary research questions: (i) How do different styles of explanation impact fairness judgment of an ML system? They study in depth whether certain explanations are more effective at teasing out the unfairness of the models, and they also analyze whether some explanations are inherently fairer. (ii) How do individual factors in cognitive style and prior position on algorithmic fairness impact fairness judgment with regard to different explanations? (iii) What are the benefits and drawbacks of different explanations in supporting fairness judgment of ML systems? The researchers offer various explanations based on input features, demographic features, sensitive features, and case-based explanations. The authors conduct an online survey and ask participants different questions; however, an individual’s background might also influence the answers given by the Mechanical Turk workers. The authors perform a qualitative as well as quantitative analysis. One of the major limitations of this work is that the analysis was performed by crowd workers with limited experience, whereas in real life the decision is made by lawyers. Additionally, the authors could have used LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) values to offer post-hoc explanations. The authors have also not studied an important element, confidence, as they did not control for it.

Questions:
1. Which other model is unfair? Give some examples?
2. Are race and gender the only sensitive attributes? Can models discriminate based on some other attribute? If yes, which ones?
3. Who is responsible for building unfair ML models?
4. Are explanations of unfair models enough? Does that build enough confidence in the model?
5. Can you think of any adverse effects of providing model explanations?


02/26/2020 – Ziyao Wang – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

In this paper, the authors focus on the fairness of machine learning systems. As machine learning has been widely applied, it is important that the models judge fairly, which requires evaluation by developers, users, and the general public. With this aim, the authors conduct an empirical study with different generated explanations to understand their effect on people’s fairness judgments of ML systems. With an experiment involving 160 MTurk workers, they found that the fairness judgment of models is a complicated problem. They found that certain explanations are considered inherently less fair while others enhance people’s trust in the algorithm, that different fairness problems may be more effectively exposed through different styles of explanation, and that there are individual differences because of each person’s unique background and judgment criteria. Finally, they suggested that the evaluation should support different needs of fairness judgment and consider individual differences.

Reflections:

This paper gives me three main thoughts. Firstly, we should pay attention to explaining machine learning models. Secondly, we should consider different needs when evaluating the fairness of the models. Finally, when training models or designing human-AI interactions, we should consider users’ individual differences.

For the first point, the models may be well trained and perform well. However, the public may not trust them because they know little about the algorithms. If we can provide fair and friendly explanations, public trust in the output of machine learning systems may increase. Also, if people are provided with explanations, they may propose suggestions related to practical situations, which will in turn improve the accuracy and fairness of the systems. For this reason, all machine learning system developers should pay more attention to writing appropriate explanations.

Secondly, the explanations and the models should consider different needs. For experienced data scientists, we could provide comprehensive explanations that let them dig deeper. For people who are experiencing a machine learning system for the first time, we should provide user-friendly, easy-to-understand explanations. For systems that will be used by users with different backgrounds, we may need to write different versions of the explanations, for example, one user instruction and one developer instruction.

The third point is the most complicated one: it is hard to implement a system that will satisfy people with different judgments. Actually, I wonder if it is possible to develop systems with an interface that lets users input their own biases, which may mitigate this problem a little. It is not possible to train a model that will satisfy everyone’s preferences. As a result, we can only train a model that satisfies most users or the majority of the public, or simply leave room for users to select their preferences.

Questions:

What kinds of information should we provide to the users in the explanations?

Apart from the crowd-sourcing workers, which groups of people should also be involved in the survey?

What kinds of information you would like to have if you need to judge a system?


02/26/2020 – Nurendra Choudhary – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summary

In this paper, the authors discuss the acceptance of imperfect AI systems by human users. They specifically consider the case of email-scheduling assistants for their experiments. Three features are adopted to interface between users and the assistant: an Accuracy Indicator (indicates the expected accuracy of the system), an Example-based Explanation design (explanation of sample test cases for pre-usage development of human mental models), and a Control Slider design (to let users control the system’s aggressiveness, i.e., the false positive vs. false negative rate).

The participants of the study completed a six-step procedure that analyzed their initial expectations of the system. The evaluation showed that the features helped set the right expectations for the participants. Additionally, it concluded that systems with high recall gave a pseudo-sense of higher accuracy than systems with high precision. The study shows that user expectations and acceptance can be satisfied not only through intelligible explanations but also by tweaking model evaluation metrics to emphasize one over another (recall over precision in this case).
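One plausible way the control slider could work under the hood is as a decision threshold on the detector's confidence score: aggressive settings lower the threshold (more suggestions, higher recall, more false positives), conservative settings raise it. The mapping below is a hypothetical sketch, not the paper's implementation.

```python
# Hypothetical mapping from a user-facing slider to a decision threshold.
def threshold_from_slider(slider):
    # slider in [0, 1]; 1 = most aggressive. Aggressive settings lower the
    # threshold (favor recall), conservative settings raise it (favor precision).
    return 0.9 - 0.6 * slider

def suggest_meeting(confidence, slider):
    # Suggest a meeting when the detector's confidence clears the threshold.
    return confidence >= threshold_from_slider(slider)

# A borderline email (confidence 0.5) is suggested only at aggressive settings.
print(suggest_meeting(0.5, slider=1.0))  # aggressive: threshold 0.3 -> True
print(suggest_meeting(0.5, slider=0.0))  # conservative: threshold 0.9 -> False
```

Framed this way, the slider never changes the underlying model at all; it only moves the cutoff, which is exactly why it can reshape a user's experience of the same imperfect classifier.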

Reflection

The paper explores user expectations of AI and their corresponding perceptions and acceptance. An interesting idea is tweaking evaluation metrics. In previous classes and a lot of current research, we discuss utilizing interpretable or explainable AI as the fix for the problem of user acceptance. However, this research shows that even simple measures, such as tweaking evaluation metrics to prefer recall over precision, can create a pseudo-sense of higher accuracy for users. This makes me question the validity of current evaluation techniques. Current metrics are statistical measures for deterministic models; they correlated directly with user acceptance because users could comprehend those systems’ behavior. However, given the nondeterministic and less comprehensible nature of AI, old statistical measures may not be the right way to validate AI models. For our specific research problems, we need to study end users more closely and design metrics that correlate with user demands.

Another important aspect is the human comprehensibility of AI systems. We notice from the paper that the addition of these features significantly increased user acceptance and perception. I believe there is a necessity for more such systems across the field. The features shape users’ expectations of the system and also help the adoption of AI systems in real-world scenarios. The slider is a great example of manual control that could be provided to users to enable them to set their expectations of the system. Explanations of the system also help develop human mental models so users can understand and adapt to AI changes faster. For example, search engines and recommender systems record information about users; if users understood what is stored and how it is utilized for recommendations, they would modify their usage accordingly to fit the system’s requirements. This would improve system performance as well as user experience, and it would lend a sense of pseudo-determinism to AI systems.

Questions

  1. Can such studies help us in finding relevance of evaluation metric in the problem’s context?
  2. Evaluation metrics have been designed as statistical measures. Has this requirement changed? Should we design metrics based on user experience?
  3. Should AI be developed according to human expectations or independently?
  4. Can this also be applied to processes that generally do not directly involve humans such as recommender systems or search engines?

Word Count: 557


02/26/2020 – Nurendra Choudhary – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary

In this paper, the authors design explainable machine learning models to enhance their fairness perception. In this case, they study COMPAS, a model that predicts a criminal’s chance of reoffending. They explain the drawbacks and fairness issues with COMPAS (it overestimates the chance for certain communities) and analyze the significance of the change that Explainable AI (XAI) can bring to this fairness issue. They generated automatic explanations for COMPAS utilizing previously developed templates (Binns et al. 2018). The explanations are based on four templates: Sensitivity, Case, Demographic, and Input-Influence.

The authors hire 160 MT workers meeting certain criteria, such as US residence and MT expertise. The workers are a diverse set, but this diversity shows no significant impact on the results’ variance. The experimental setup is a questionnaire that probes the workers’ criteria for making fairness judgments. The results show that the workers have heterogeneous criteria for making fairness judgments. Additionally, the experiment highlights two fairness issues: “unfair models (e.g., learned from biased data), and fairness discrepancy of different cases (e.g., in different regions of the feature space)”.

Reflection

AI works in a very stealthy manner because most algorithms detect patterns in a latent space that is incomprehensible to humans. The idea of using automatically generated standard templates to construct explanations of AI behavior should be generalized to other AI research areas. The experiments show the change in human behavior with respect to explanations. I believe such explanations could not only help the general population’s understanding but also help researchers narrow down the limitations of these systems.

From the case of COMPAS, I wonder about the future roles that interpretable AI makes possible. If an AI is able to give explanations for its predictions, then I think it could play the role of an unbiased judge better than humans. Societal biases are embedded in humans and might subconsciously affect our choices, and interpreting these choices in humans is a complex exercise in self-criticism. But for AI, systems such as the one in this paper can generate human-comprehensible explanations to validate their predictions, making AI an objectively fairer judge than humans.

Additionally, I believe evaluation metrics for AI lean towards improving overall predictive performance. Comparable models that emphasize interpretability should be given more importance. A drawback of such metrics, however, is the necessity of human evaluation for interpretability, which would impede the rate of progress in AI development. We need to develop better evaluation strategies for interpretability. In this paper, the authors hired 160 MT workers; given that this is a one-time evaluation, such a study is possible. However, if this needs to be included in the regular AI development pipeline, we need more scalable approaches to avoid prohibitively expensive evaluation costs. One method could be to rely on a less diverse test set during the development phase and increase diversity according to the real-world problem setting.

Questions

  1. How difficult is it to provide such explanations for all AI fields? Would it help in progressing AI understanding and development?
  2. How should we balance between explainability and effectiveness of AI models? Is it valid to lose effectiveness in return for interpretability?
  3. Would interpretability lead to adoption of AI systems in sensitive matters such as judiciary and politics?
  4. Can we develop evaluation metrics around suitability of AI systems for real-world scenarios? 

Word Count: 567


02/26/2020 – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning- Yuhang Liu

This paper discusses people’s dependence on interpretability tools for machine learning. As mentioned in the article, machine learning (ML) models are commonly deployed in fields from criminal justice to healthcare; machine learning has gone beyond academia and developed into an engineering discipline. To this end, interpretability tools have been designed to help data scientists and machine learning practitioners better understand how ML models work. This paper focuses on such software, which can be divided into two categories: the InterpretML implementation of GAMs (glass-box models) and the SHAP Python package (a post-hoc explanation technique for black-box models). The authors’ research shows that users trust machine-generated interpretations too much and rely too heavily on interpretability tools; few users can accurately describe the visualizations output by these tools. In the end, the authors conclude that visualizing the output of an interpretability tool can sometimes help data scientists find problems with datasets or models. For both tools, however, the existence of visualizations and the fact that the tools are publicly available have led to situations of excessive trust and misuse. Therefore, after the final experiments, the authors conclude that experts in human-computer interaction and machine learning need to work together, so that the two fields can interact to achieve better results.

First of all, after reading this article, I think that it is not only the interpretability tools of machine learning that invite over-trust; machine learning itself can also be over-trusted, which may be caused by many factors, such as the datasets. This reminds me of the course project I wanted to do this semester. My original motivation was that a single, standard dataset, written by a large number of experts over a long time, can cause the trained model to report too high an accuracy rate, so a dataset generated by crowdsourcing might get better results.

Secondly, for this article, I very much agree with the final solution proposed by the authors, which is to better integrate human-computer interaction and machine learning as a future research direction. These interpretability tools are a visual display of results, and better human-computer interaction design allows users to better extract the results of machine learning, understand them, and recognize the problems in them, instead of overly trusting the results. In the future, fewer and fewer users will understand machine learning, but more people will use it, and machine learning will become more and more of a tool. So I think that, on the one hand, the interaction should be designed so users can better understand their results; on the other hand, machine learning should be more diverse and able to adapt to more application scenarios. Only when both aspects are done better can these tools achieve their intended effects.

  1. Will machine learning be more academic or more tool-oriented in the future?
  2. If users do not know the meaning of the results, how can they understand the accuracy of the results more clearly without using interpretability software?
  3. The article mentioned that the joint efforts of human-computer interaction and machine learning will be required in the future; what changes should be made on the human-computer interaction side?


02/26/2020 – Dylan Finch – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Word count: 573

Summary of the Reading

This paper investigates explaining AI and ML systems. An easy way to explain an AI or ML system is to have another computer program help generate an explanation of how the system works. This paper works towards that goal, comparing four different programmatically generated explanations of AI and ML systems and seeing how they impact judgments of fairness. The different explanations had a large impact on perceptions of fairness and bias in the systems, with a large degree of variation between the explanation styles.

Not only did the kind of explanation used have a large impact on the perceived fairness of the algorithm, but the pre-existing feelings of the participants towards AI and ML and bias in these fields also had a profound impact on whether or not participants saw the explanations as fair or not. People who did not already trust AI fairness equally distrusted all of the explanations.

Reflections and Connections

To start, I think that this type of work is extremely useful to the future of the AI and ML fields. We need to be able to explain how these kinds of systems work and there needs to be more research into that. This issue of explainable AI becomes even more important when we put it in the context of making AI fair to the people who have to interact with it. We need to be able to tell if an AI system that is deciding whether or not to free people from jail is fair or not. The only way we can really know if these models are fair or not is to have some way to explain the decisions that the AI systems make. 

I think that one of the most interesting parts of the paper is the variation in the number of people with different circumstances who thought that the models were fair or not. Pre-existing ideas about whether or not AI systems are fair had a huge impact on whether or not people thought these models were fair when given an explanation of how they work. This shows how human of a problem this is and how hard it can be to decide if a model is fair or not, even when you have access to an explanation. Views of the model will differ from person to person. 

I also found it interesting how the type of explanation used had a big impact on the judgment of fairness. To me, this conjures up ideas of a future where the people who build algorithms can just pick the right kind of explanation to prove that their algorithm is fair, in the same way companies now use language in a very questionable way. I think that this field still has a long way to go and that it will become increasingly important as AI penetrates more and more facets of our lives.

Questions

  1. When each explanation produces such different results, is it possible to make a concrete judgment on the fairness of an algorithm?
  2. Could we use computers or maybe even machine learning to decide if an algorithm is fair or would that just produce more problems?
  3. With so many different opinions, even when the same explanation is used, who should be the judge if an algorithm is fair or not?
