Summary
Kaur et al. cover how data scientists are now grappling with ways of explaining their algorithms' results to the public through interpretability tools. They note that machine learning models are often "black boxes" that don't convey how they arrive at their results, but that several methods exist for interpreting those results, such as GAMs, LIME, and SHAP. The authors then conduct six interviews, a contextual inquiry with data scientists, and a large-scale survey to see whether these tools are being used effectively. They found that, while some tools do perform better than others, data scientists often misuse these tools because they misunderstand their intended purpose. The authors found that participants either over-utilized or under-utilized the tools and placed too much trust in their output and impact.
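To make the kind of tooling the paper studies concrete, below is a minimal sketch of using the shap library to explain a tree model's predictions. The synthetic data, the RandomForestRegressor, and the feature setup are my own illustrative assumptions, not the paper's study materials.

```python
# Minimal sketch: per-feature explanations for a tree model's predictions with SHAP.
# The synthetic data and model choice are illustrative assumptions, not from the paper.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # 500 rows, 4 features
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)  # target driven by two features

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values: each prediction is decomposed into
# additive per-feature contributions relative to a baseline.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])        # shape (10, 4)

print(np.round(shap_values[0], 3))                 # contributions for the first example
```

A typical workflow would then feed these values into SHAP's summary or force plots, which are the sort of visual outputs the paper's participants interpreted (and, per the findings, sometimes over-trusted).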
Personal Reflection
It was fascinating to see tools that HCI professionals typically use to understand many different aspects of a job turned onto computer science practitioners and algorithm designers as a sort of self-evaluation of the field. I was also surprised to see how many possible errors there are in training data; I had assumed that these training datasets had been cleaned and verified to make sure there were no duplicates or missing values. That part reinforced the need for tools that surface issues in datasets.
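As a small illustration of what surfacing issues in a dataset can look like in practice, here is a minimal pandas sketch that flags duplicate rows and missing values before training; the file name and columns are hypothetical, not from the paper or its study.

```python
# Minimal sketch of basic dataset checks: duplicates and missing values.
# "training_data.csv" is a hypothetical file, not one used in the paper.
import pandas as pd

df = pd.read_csv("training_data.csv")

n_duplicates = df.duplicated().sum()   # count of exact duplicate rows
missing_per_column = df.isna().sum()   # missing values in each column

print(f"duplicate rows: {n_duplicates}")
print(missing_per_column[missing_per_column > 0])
```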
The study uncovered that the visualizations made the data scientists over-confident in their results. It was interesting to see that once the tools surfaced an insight into the data, the data scientists didn't look more deeply into that result. Their willingness to accept that a key attribute led to a certain result without knowing why showed exactly why they might need to look more closely at the workings of their algorithms. Their answers were also strikingly similar: "I guess," "I suppose," and "not sure why" all appeared in their responses. It was furthermore odd that, during the survey, participants weren't confident that the underlying models were reasonable, yet didn't think the dataset or model was to blame. Does this point to some amount of overconfidence in their own field?
Questions
- Since this study focused mainly on AI designers, do you think there is a key aspect of HCI research that could use a more complete understanding of its own practice and practitioners? That is, might issues surface if HCI practitioners performed an ethnography or survey of their colleagues?
- Since they had participants essentially perform a simulated task in the second phase, do you think this affected the results?
- Would seeing these data scientists work on their own datasets have made a difference to the results? Do you think it would have changed how the data scientists think about their own work?
- Since the dataset they used was relatively low-risk (i.e., it wasn't something like a recidivism predictor or a loan-default prediction service), does that affect how the data scientists interacted with the tools?