02/26/2020 – Palakh Mignonne Jude – Interpreting Interpretability: Understanding Data Scientists’ Use Of Interpretability Tools For Machine Learning

SUMMARY

In this paper, the authors study two interpretability tools – the InterpretML implementation of GAMs and the SHAP Python package. They conducted a contextual inquiry and a survey of data scientists to analyze how well these tools help uncover common issues that arise when evaluating ML models. The results of these studies indicate that data scientists tend to over-trust these tools. The authors first conducted pilot interviews with 6 participants to identify common issues faced by data scientists. The contextual inquiry included 11 participants, who explored a dataset and an ML model hands-on in a Jupyter notebook, while the survey, conducted through Qualtrics, comprised 197 participants. For the survey, participants were given a description of the dataset and a tutorial on the interpretability tool they were to use. The authors found that the visualizations produced by the interpretability tools, as well as the fact that these tools are popular and publicly available, caused the data scientists to over-trust them.
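For context, the sketch below shows the kind of notebook workflow the contextual-inquiry participants worked with: fit a model on tabular data, then inspect it with the SHAP package's standard API. This is a minimal illustration, not the paper's actual study materials; the dataset and model here are stand-ins I chose for the example.

```python
# Minimal sketch, assuming the standard SHAP API; the dataset and
# model are placeholders, not the paper's actual study setup.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Fit any tree-based model on a tabular dataset.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer computes per-feature SHAP attributions for each row.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# The summary plot is the kind of polished visualization that, per the
# paper's findings, encouraged participants to over-trust the tool.
shap.summary_plot(shap_values, X)
```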

REFLECTION

I think it is good that the authors performed a study to observe how data scientists use interpretability tools. I was surprised to learn that a large number of these data scientists over-trusted the tools, and that visualizations impacted their ability to judge the tools as well. However, the authors' statement that 'participants relied too heavily on the interpretability tools because they had not encountered such visualizations before' makes me wonder whether the authors should have recruited a separate pool of data scientists with more experience with such tools and visualizations and presented a separate set of results for that group. I also found it interesting to learn that some participants used the tools to rationalize suspicious observations.

As indicated by the limitations section of this paper, I think a follow-up study that includes a richer dataset as well as interpretability techniques for deep learning would be very interesting to learn about, and I wonder how data scientists would use such tools compared to the ones studied in this paper.

QUESTIONS

  1. Considering the complexity of ML systems and the time it takes researchers to truly understand how to interpret ML models, both the contextual inquiry and the survey were conducted with people who had as little as 2 months of experience with ML. Would a study with experts in the field of ML (all with over 4 years of experience) have yielded different results? Perhaps these data scientists would have been better able to identify issues and would not have over-trusted the interpretability tools?
  2. Would a more extensive study comprising a number of different (commonly used as well as less common) interpretability tools have changed the results? If the tools were not so easily available, would that truly impact the amount of trust the users had in them?
  3. Does a correlation exist between the amount of experience a data scientist has and the amount of trust they place in a given interpretability tool? Would replacing visualizations with other representations of the models' interpretations impact the trust the human places in the tool?

2 thoughts on “02/26/2020 – Palakh Mignonne Jude – Interpreting Interpretability: Understanding Data Scientists’ Use Of Interpretability Tools For Machine Learning”

  1. Your comment/question on doing a study with more experienced scientists who don’t rely as heavily on the interpretability tools reminds me of our reading from last week (the “Updates in Human-AI Teams” paper). I think another study like you suggest would be beneficial, because we saw in last week’s paper that the users of an AI system gain an understanding of when and when not to trust its judgment, i.e., they place a high amount of faith in their tools. So more experienced people are going to have different perspectives.

  2. I agree with your assessment that it might have been beneficial to classify the participants into groups based on their experience with these tools. It might have given a better idea of how they’d use the tools in the real world (since they had already been using them, in theory). However, as I recall, the paper did mention that they selected participants based on whether they had any experience with the tools, and they had already limited their pool down to a low number of participants. Narrowing the pool any further would likely have caused issues with statistical significance.

    To answer your question 2, I’m not sure. If they included more tools, that would have increased the complexity of the experiment and may have also yielded participant numbers that were too low. However, I would be interested in the results if they were able to get enough.
