02/26/2020 – Palakh Mignonne Jude – Interpreting Interpretability: Understanding Data Scientists’ Use Of Interpretability Tools For Machine Learning

SUMMARY

In this paper, the authors study two interpretability tools – the InterpretML implementation of GAMs and the SHAP Python package. They conducted a contextual inquiry and a survey of data scientists in order to analyze how well these tools help uncover common issues that arise when evaluating ML models. The results obtained during these studies indicate that data scientists tend to over-trust these tools. The authors first conducted pilot interviews with 6 participants to identify common issues faced by data scientists. The contextual inquiry included 11 participants who explored a dataset and an ML model in a hands-on manner via a Jupyter notebook, whereas the survey comprised 197 participants and was conducted through Qualtrics. For the survey, the participants were given a description of the dataset and a tutorial on the interpretability tool they were to use. The authors found that the visualizations provided by the interpretability tools, as well as the fact that these tools are popular and publicly available, led the data scientists to over-trust them.
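The SHAP package mentioned above attributes a model's prediction to its individual features using Shapley values. As a hedged illustration of the underlying idea (this is a minimal pure-Python sketch of exact Shapley values for a hypothetical toy model, not the SHAP package's own approximation machinery):

```python
from itertools import combinations
from math import factorial

# Hypothetical toy model for illustration: two features plus an interaction.
def model(x1, x2):
    return 2.0 * x1 + 1.0 * x2 + 0.5 * x1 * x2

def shapley_values(f, x, baseline):
    """Exact Shapley values: each feature's weighted average marginal
    contribution over all subsets, with absent features set to baseline."""
    n = len(x)
    phi = [0.0] * n
    features = list(range(n))
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in features]
                without_i = [x[j] if j in subset else baseline[j] for j in features]
                phi[i] += weight * (f(*with_i) - f(*without_i))
    return phi

phi = shapley_values(model, x=[1.0, 1.0], baseline=[0.0, 0.0])
# Efficiency property: the attributions sum to f(x) - f(baseline).
assert abs(sum(phi) - (model(1.0, 1.0) - model(0.0, 0.0))) < 1e-9
```

Exact enumeration is exponential in the number of features, which is why the real SHAP package relies on model-specific approximations; the visualizations the study's participants saw are plots of values like these.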

REFLECTION

I think it is good that the authors performed a study to observe how data scientists use interpretability tools. I was surprised to learn that a large number of these data scientists over-trusted the tools and that the visualizations impaired their ability to judge the tools as well. However, considering that the authors state that 'participants relied too heavily on the interpretability tools because they had not encountered such visualizations before', I wonder whether the authors should have recruited a separate pool of data scientists with more experience with such tools and visualizations and then presented a separate set of results for that group. I also found it interesting to learn that some participants used the tools to rationalize suspicious observations.

As indicated by the limitations section of this paper, I think a follow-up study that includes a richer dataset as well as interpretability techniques for deep learning would be very interesting to learn about and I wonder how data scientists would use such tools versus the ones studied in this paper.

QUESTIONS

  1. Considering the complexity of ML systems and the time it takes researchers to truly understand how to interpret ML models, both the contextual inquiry and the survey were conducted with people who had as little as 2 months of experience with ML. Would a study with experts in the field of ML (all with over 4 years of experience) have yielded different results? Perhaps these data scientists would have been better able to identify issues and would not have over-trusted the interpretability tools?
  2. Would a more extensive study comprising a number of different (commonly used as well as less commonly used) interpretability tools have changed the results? If the tools were not so easily available, would that truly impact the amount of trust users placed in them?
  3. Does a correlation exist between the amount of experience a data scientist has and the amount of trust they place in a given interpretability tool? Would replacing the visualizations with other representations of the models' interpretations impact the amount of trust the human had towards the tool?


02/26/2020 – Sukrit Venkatagiri – Interpreting Interpretability

Paper: Harmanpreet Kaur, Harsha Nori, Samuel Jenkins, Rich Caruana, Hanna Wallach, and Jennifer Wortman Vaughan. 2020. Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. In CHI 2020, 13 pages.

Summary: There have been a number of tools developed to aid in increasing the interpretability of ML models, which are used in a wide variety of applications today. However, very few of these tools have been studied with a consideration of the context of use and evaluated by actual users. This paper presents a user-centered evaluation of two ML interpretability tools using a combination of interviews, contextual inquiry, and a large-scale survey with data scientists.

From the interviews, they found six key themes: missing values, temporal changes in the data, duplicate data masked as unique, correlated features, ad-hoc categorization, and the difficulty of trying to debug or identify potential improvements. From the contextual inquiry with a glass-box model (GAM) and a post-hoc explanation technique (SHAP), they found a misalignment between data scientists’ understanding of the tools and their intended use. And finally, from the surveys, they found that participants’ mental models differed greatly, and that their interpretations of these interpretability tools also varied on multiple axes. The paper concludes with a discussion on bridging the HCI and ML communities and designing more interactive interpretability tools.
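A glass-box GAM like the one studied predicts via a sum of per-feature shape functions, which is why its per-feature plots are faithful by construction rather than a post-hoc approximation. A minimal sketch of that additive structure (the shape functions below are hypothetical stand-ins for illustration, not ones learned by InterpretML's implementation):

```python
import math

# Hypothetical "learned" shape functions, one per feature. In a real GAM
# these are fitted from data; here they are fixed for illustration.
shape_functions = {
    "age": lambda v: 0.03 * (v - 40),          # roughly linear effect
    "income": lambda v: math.log1p(v) - 10.0,  # diminishing returns
}
intercept = 0.5

def gam_predict(x):
    """A GAM's score is the intercept plus one additive term per feature,
    so each term IS that feature's exact contribution to this prediction."""
    contributions = {name: f(x[name]) for name, f in shape_functions.items()}
    return intercept + sum(contributions.values()), contributions

score, contribs = gam_predict({"age": 50, "income": 30000})
# Each plotted shape-function value equals the feature's true contribution:
assert abs(score - (intercept + sum(contribs.values()))) < 1e-12
```

This additivity is the key contrast with the post-hoc SHAP condition: for a GAM the visualization and the model are the same object, while for a black box the explanation is an approximation layered on top.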

Reflection:

Overall, I really liked the paper and it provided a nuanced as well as broad overview of data scientists’ expectations and interpretations of interpretability tools. I especially appreciate the multi-stage, mixed-methods approach that is used in the paper. In addition, I commend the authors for providing access to their semi-structured interview guide, as well as other study materials, and that they had pre-registered their study. I believe other researchers should strive to be this transparent in their research as well.

More specifically, it is interesting that the paper first leveraged a small pilot study to inform the design of a more in-depth “contextual inquiry” and a large-scale study. However, I do not believe the methods used for the “contextual inquiry” constitute a true contextual inquiry; rather, it is more like a user study involving semi-structured interviews. This is especially true since many of the participants were not familiar with the interpretability tools used in the study, which means that it was not their actual context of use/work.

I am also unsure how realistic the survey is, in terms of mimicking what someone would actually do, and appreciate that the authors acknowledge the same in the limitations section. A minor concern is also the 7-point scale that is used in the survey that ranges from “not at all” to “extremely,” which does not follow standard survey science practices.

I wonder what would happen if the participants were a) nudged to not take the visualizations at face value or to employ “system 2”-type thinking, and/or b) asked to use the tool for a longer period. Indeed, the authors do notice some emergent behavior in the findings, such as a participant questioning whether the tool was actually an interpretability tool. I also wonder what would have happened if two people had used the tool side-by-side, as a “pair programming” exercise.

It’s also interesting how varied participants’ backgrounds, skills, baseline expectations, and interpretations were. Certainly, this problem has been studied elsewhere, and I wonder whether the findings in this paper are a result of not only the failure of these tools to be designed in a user-centered manner, but also the broad range in technical skills of the users themselves. What would it mean to develop a tool for users with such a range in skillsets, especially statistical and mathematical skills? This certainly calls for increased certification within the ML software industry, especially given the rising demand for data scientists.

I appreciate the point surrounding Kahneman’s system 1 and system 2 work in the discussion, but I believe this section is possibly too short. I acknowledge that there are page restrictions, which meant that the results could not have been discussed in as much depth as is warranted for such a formative study.

Overall, this was a very valuable study that was conducted in a methodical manner and I believe the findings to be interesting to present and future developers of ML interpretability tools, as well as the HCI community that is increasingly interested in improving the process of designing such tools.

Questions:

  1. Is interpretability only something to be checked off a list, and not inspected at depth?
  2. How do you inspect the interpretability of your models, if at all? When do you know you’ve succeeded?
  3. Why is there a disconnect between the way these tools are intended to be used and how they are actually used? How can this be fixed?
  4. Do you think there needs to be greater requirements in terms of certification/baseline understanding and skills for ML engineers?


02/26/20 – Lulwah AlKulaib – Interpretability

Summary

Machine learning (ML) models are integrated into many domains nowadays (for example: criminal justice, healthcare, marketing, etc.). ML has moved beyond academic research and grown into an engineering discipline. Because of that, it is important to interpret ML models and understand how they work by developing interpretability tools. Machine learning engineers, practitioners, and data scientists have been using these tools. However, because there has been minimal evaluation of the extent to which these tools achieve interpretability, the authors study the use of two interpretability tools to uncover issues that arise when building and evaluating models. The interpretability tools are the InterpretML implementation of GAMs and the SHAP Python package. The authors conduct a contextual inquiry and survey 197 data scientists to observe how they use interpretability tools to uncover common issues that arise when building and evaluating ML models. Their results show that data scientists did utilize the visualizations produced by interpretability tools to uncover issues in datasets and models. Yet the ready availability of these tools has led researchers to over-trust and misuse them.

Reflection

Machine learning is now being used to address important problems like predicting crime rates in cities to help police distribute manpower, identifying cancerous cells, predicting recidivism in the judiciary system, and locating buildings at risk of catching fire. Unfortunately, these models have been shown to learn biases. Detecting these biases is subtle, especially for beginners in the field. I agree with the authors that it is troublesome when machine learning is misused, whether intentionally or out of ignorance, in situations where ethics and fairness are paramount. A lack of model explainability can lead to biased and ill-informed decisions. In our ethics class, we went over case studies where interpretability was lacking and contributed to racial bias in facial analysis systems [1], biased recidivism predictions [2], and textual gender biases learned from language [3]. Some of these systems were used in real life and have affected people’s lives. I think that conducting an analysis similar to the one presented in this paper before deploying systems into practice should be mandatory. It would give developers a better understanding of their systems and help them catch biased decisions before the systems go into public use. It is also important to inform developers of how dependable interpretability tools are and how to tell when they are over-trusting or misusing them. Interpretability is a “new” field within machine learning, and I have been seeing conferences add sessions about it lately. I am interested in learning more about interpretability and how we can adopt it in different machine learning models.

Discussion

  • Have you used any of the mentioned interpretability packages in your research? How did it help in improving your model?
  • What are case studies that you know of where machine learning bias is evident? Were these biases corrected? If so, How?
  • Do you have any interpretability related resources that you can share with the rest of the class?
  • Do you plan to use these packages in your project? 

References

  1. https://splinternews.com/predictive-policing-the-future-of-crime-fighting-or-t-1793855820
  2. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
  3. Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems (pp. 4349-4357).


02/26/2020 – Ziyao Wang – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

As machine learning models are deployed in a variety of industry domains, it is important to design for interpretability to help model users, such as data scientists and machine learning practitioners, better understand how these models work. However, there has been little research focused on evaluating the performance of these tools. The authors of this paper conducted experiments and surveys to fill this gap. They interviewed 6 data scientists from a large technology company to find out the most common issues faced by data scientists. They then conducted a contextual inquiry with 11 participants based on these common issues, using the InterpretML implementation of GAMs and the SHAP Python package. Finally, they surveyed 197 data scientists. With these experiments and surveys, the authors highlighted the misuse and over-trust problems and the need for communication between members of the HCI and ML communities.

Reflection:

Before reading this paper, I held the view that interpretability tools should be able to cover most data scientists’ needs. Now, however, I see that the tools for interpretation are not designed by the ML community, which results in a lack of accuracy in the tools. When data scientists or machine learning practitioners want to use these tools to learn how the models operate, they may face problems like misuse or over-trust. I do not think this is the users’ fault. Tools are designed to make users’ tasks more convenient; if a tool confuses its users, the developers should change the tool to provide a better user experience. In this case, the authors suggested that members of the HCI and ML communities should work together when developing the tools. This requires the members to leverage their respective strengths so that the designed tools let users understand the models easily while remaining user-friendly. Meanwhile, comprehensive instructions should be written to explain how users can use the tools to understand the models accurately and easily. In the end, both the efficiency and accuracy of the tools and of the models’ implementations will be improved.

From data scientists’ and machine learning practitioners’ point of view, they should try to avoid over-trusting the tools. The tools cannot fully explain the models, and there may be mistakes. Users should always be critical of the tools instead of fully trusting them. They should read the instructions carefully and understand how to use the tools and what the tools are for, what the models are being used for, and how to use the models. If they think carefully when using these tools and models, instead of guessing at the meaning of the tools’ output, the number of misuse and over-trust cases will decrease sharply.

Questions:

  1. How should the proposed interactive interpretability tool be designed? What kinds of interactions should be included?
  2. How do we design a tool that lets users dig into the models conveniently, instead of letting them use the models without knowing how the models work?
  3. How do we design tools that best leverage the strengths of users’ mental models?


02/26/20 – Lee Lisle – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

Summary

            Kaur et al. cover how data scientists are now grappling with ways of explaining their algorithms’ results to the public through interpretability tools. They note that machine learning algorithms are often “black boxes” that don’t typically convey how they arrive at certain results, but that there are several methods of interpreting the results of these algorithms, such as GAMs, LIME, and SHAP. The authors then conduct six interviews, a contextual inquiry of data scientists, and a large-scale survey to see if these tools are being used effectively. They found that, while some tools do perform better than others, these tools are being misused by data scientists in that they misunderstand their intended use. The authors found that the participants either over-utilized or under-utilized the tools and trusted their output and impact too deeply.

Personal Reflection

It was fascinating to see tools that HCI professionals typically use to understand many different aspects of a job turned onto computer science practitioners and algorithm designers as a sort of self-evaluation of the field. I was also surprised to see that there are so many possible errors in training data; I had assumed that these training datasets had been cleaned and verified to make sure there were no duplicates or missing data from them. That part reinforced the need for the tools to find issues with datasets.

The study uncovered that the visualizations made the data scientists over-confident in their results. It was interesting to see that once the tools surfaced an insight into the data, the data scientists didn’t look more deeply into that result. That they were fine with not knowing why a key attribute led to a certain result showcased why they might need to look more deeply into the workings of the algorithms. They gave a lot of similar answers, in that “I guess,” “I suppose,” and “not sure why” were all present and are fairly similar responses. It was furthermore odd that, during the survey, participants weren’t confident that the underlying models were reasonable but didn’t think the dataset or model was to blame. Does this point to some amount of overconfidence in their own field?

Questions

  1. Since this covered AI designers mainly, do you think there is a key aspect of HCI research that could use a more complete understanding of its practice and practitioners? I.e., is there an issue that could be surfaced if HCI practitioners performed an ethnography or survey of their colleagues?
  2.  Since they had participants essentially perform a simulated task in the second phase, do you think this affected the results?
  3.  Would seeing these data scientists work on their own datasets have made a difference to the results? Do you think it would have changed how the data scientists think about their own work?
  4. Since the dataset they used was a relatively low-risk dataset (i.e., it wasn’t like a recidivism predictor or loan-default prediction service), does that impact how the data scientists interacted with the tools?


02/26/2020 – Akshita Jha – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

Summary:
“Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning” by Kaur et al. discusses the interpretability tools that are being used to help data scientists and machine learning researchers. Very few of these tools have been evaluated to understand whether or not they achieve their goals with respect to interpretability. The authors extensively study two tools in detail: a GAM implementation and the SHAP package. They conduct a contextual inquiry and a survey of data scientists to figure out how they utilize the information provided by these machine learning tools for their benefit. They highlight the qualitative themes that emerged and conclude with implications for researchers and tool designers.

Reflections:
There are two major aspects of interpretability: (i) building interpretable models, and (ii) users’ understanding of these interpretable models. The paper does a good job of providing an in-depth analysis of users’ understanding of these interpretable models. However, the authors focus on understanding a data scientist’s view of these tools; I feel that the quality of the interpretability of these models should also be judged by unskilled end users. The authors talk about the six themes captured in their study: (i) missing values, (ii) changes in data, (iii) duplicate data, (iv) redundant features, (v) ad-hoc categorization, and (vi) debugging difficulties. They incorporate these into the “contextual inquiry”. More nuanced patterns might be revealed if an in-depth study were conducted. Also, depending on the domain knowledge of the participants, the interpretability scores might be interpreted differently; the authors should have tried to take this into account while surveying the candidates. In addition, most people have started using deep learning models, so it is important to focus on the interpretability of deep learning models. The authors focus on tabular data, which might not be very helpful in the real world; a detailed study needs to be conducted in order to understand interpretability in deep learning models. Something else I found interesting was the authors attributing the way these tools are used to system 1 and system 2 thinking as described by Kahneman: humans make quick, automatic decisions using ‘system 1’ unless they are encouraged to engage their deliberate cognitive processes, which prompts ‘system 2’ thinking. The pilot interview was conducted on a very small group of users (N=6) to identify the common issues data scientists face in their work; a more representative survey of data scientists with different skill sets should have been conducted to serve them better.

Questions:

  1. What is post-hoc interpretability? Is it enough?
  2. Should the burden lie on the developer to explain the predictions of a model?
  3. Can we incorporate interpretability while making decisions?
  4. How can humans help in such a scenario, apart from evaluating the quality of the interpretable model?


02/26/2020 – Yuhang Liu – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

This paper discusses people’s reliance on interpretability tools for machine learning. As mentioned in the article, machine learning (ML) models are commonly deployed in various fields, from criminal justice to healthcare. Machine learning has gone beyond academia and has developed into an engineering discipline. To this end, interpretability tools have been designed to help data scientists and machine learning practitioners better understand how ML models work. This paper focuses on such software, which can be divided into two categories: the InterpretML implementation of GAMs (glass-box models) and the SHAP Python package (a post-hoc explanation technique for black-box models). The authors’ results show that users trust machine-generated interpretations too much and rely too heavily on machine learning interpretability tools; few of these users could accurately describe the visualizations output by these tools. In the end, the authors conclude that the visualizations output by interpretability tools can sometimes help data scientists find problems with datasets or models. For both tools, however, the existence of visualizations and the fact that the tools are publicly available have led to situations of over-trust and misuse. Therefore, after the final experiments, the authors conclude that experts in human-computer interaction and machine learning need to work together to achieve better results.

First of all, after reading this article, I think that not only can the explanatory tools of machine learning lead people to over-trust; machine learning itself can also be over-trusted, which may be caused by many factors, such as the datasets. This reminds me of the course project I wanted to do this semester. My original motivation was that a single, standard dataset written by a large number of experts over a long time causes the trained model to report too high an accuracy rate, so a dataset generated by crowdsourcing might yield better results.

Secondly, I very much agree with the final solution proposed by the authors, which is to better integrate human-computer interaction and machine learning as a future research direction. These interpretability tools are a visual display of results; better human-computer interaction design allows users to better extract the results of machine learning, better understand them, and recognize the problems in them, instead of overly trusting the results. In the future, fewer and fewer users will understand machine learning, yet more people will use it, and machine learning will become more and more of a tool. So, on one hand, interaction design should be improved so that users better understand their results; on the other hand, machine learning should become more diverse and able to adapt to more application scenarios. Only when both aspects are done well can these tools achieve their intended effect.

  1. Will machine learning be more academic or more tool-oriented in the future?
  2. If users do not know the meaning of the results, how can they understand the accuracy of the results more clearly without using interpretive software?
  3. The article mentions that the joint efforts of human-computer interaction and machine learning will be required in the future; what changes should be made on the human-computer interaction side?
