SUMMARY
In this paper, the authors study two interpretability tools: the InterpretML implementation of GAMs and the SHAP Python package. They conducted a contextual inquiry and a survey of data scientists to analyze how well these tools help uncover common issues that arise when evaluating ML models. The results of these studies indicate that data scientists tend to over-trust the tools. The authors first conducted pilot interviews with 6 participants to identify common issues faced by data scientists. The contextual inquiry included 11 participants, who explored the dataset and an ML model hands-on in a Jupyter notebook, whereas the survey comprised 197 participants and was conducted through Qualtrics. For the survey, the participants were given a description of the dataset and a tutorial on the interpretability tool they were to use. The authors found that the visualizations produced by these interpretability tools, together with the fact that the tools are popular and publicly available, led the data scientists to over-trust them.
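Since the findings hinge on the visual output of these two tools, a minimal sketch of the kind of notebook workflow involved may be useful (this assumes the standard public APIs of InterpretML and SHAP; the dataset, models, and calls below are illustrative stand-ins, not the actual study materials):

```python
# Illustrative sketch only (assumed standard InterpretML and SHAP APIs);
# the dataset and models are toy stand-ins, not the ones used in the study.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
import shap

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Glassbox GAM: InterpretML's Explainable Boosting Machine.
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
show(ebm.explain_global())                       # per-feature shape plots
show(ebm.explain_local(X_test[:5], y_test[:5]))  # per-instance explanations

# Post-hoc SHAP explanations for a separate black-box model.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
shap_values = shap.TreeExplainer(gbm).shap_values(X_test)
shap.summary_plot(shap_values, X_test)           # global feature-importance view
```

The interactive dashboards from show() and the SHAP summary plot are the kinds of polished visualizations the authors argue contributed to participants' over-trust.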
REFLECTION
I think it is good that the authors performed a study to observe how data scientists use interpretability tools. I was surprised to learn that a large number of these data scientists over-trusted the tools and that the visualizations impacted their ability to judge the tools as well. However, the authors' statement that 'participants relied too heavily on the interpretability tools because they had not encountered such visualizations before' makes me wonder whether the authors should have recruited a separate pool of data scientists with more experience with such tools and visualizations and presented a separate set of results for that group. I also found it interesting that some participants used the tools to rationalize suspicious observations.
As the limitations section of the paper indicates, a follow-up study that includes a richer dataset as well as interpretability techniques for deep learning would be very interesting, and I wonder how data scientists would use such tools compared to the ones studied in this paper.
QUESTIONS
- Considering the complexity of ML systems and the time it takes researchers to truly understand how to interpret ML models, both the contextual inquiry and the survey were conducted with people who had as little as 2 months of experience with ML. Would a study with experts in the field of ML (all with over 4 years of experience) have yielded different results? Perhaps these data scientists would have been better able to identify issues and would not have over-trusted the interpretability tools?
- Would a more extensive study comprising a number of different (commonly used as well as not-so-commonly used) interpretability tools have changed the results? If the tools were not so easily available, would that truly affect how much the users trusted them?
- Does a correlation exist between the amount of experience a data scientist has and their trust in a given interpretability tool? Would replacing the visualizations with other representations of the models' interpretations affect how much trust users place in the tool?