04/29/20 – Sukrit Venkatagiri – Accelerating Innovation Through Analogy Mining

Paper: Tom Hope, Joel Chan, Aniket Kittur, and Dafna Shahaf. 2017. Accelerating Innovation Through Analogy Mining. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17), 235–243. https://doi.org/10.1145/3097983.3098038

Summary: This paper addresses the challenge of mining analogies from large, real-world repositories, such as patent databases. Such databases are difficult to work with because they are highly relational but sparse. This is one reason why machine learning approaches do not fare well when applied to them: they cannot capture the underlying relational structure, which is essential for analogy mining. Hand-built structured corpora, meanwhile, are expensive to build, store, and update, and automation cannot easily be applied to them. The authors overcome these limitations by combining the creativity of the crowd with the affordable computational capabilities of RNNs. Their approach uses a structured purpose-mechanism schema to identify analogies between pairs of product descriptions. Finally, the authors evaluate the approach by asking graduate students to rate the ideas crowd workers generated on three dimensions: quality, novelty, and feasibility. They find that their approach increased the feasibility of ideas generated by participants in the study.
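
As a rough sketch of the core idea (my own illustration, not the authors' actual pipeline, which learns purpose and mechanism representations with RNNs over crowd annotations), TF-IDF vectors can stand in for those learned representations to score candidate analogies as "similar purpose, different mechanism":

```python
# Hypothetical sketch: score "analogies" as same purpose, different mechanism.
# The real paper learns purpose/mechanism representations with RNNs trained on
# crowd annotations; TF-IDF here is only a stand-in for those vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy product descriptions split into crowd-annotated purpose/mechanism parts.
products = [
    {"purpose": "keep drinks cold outdoors", "mechanism": "vacuum-insulated steel walls"},
    {"purpose": "keep drinks cold outdoors", "mechanism": "evaporative cooling sleeve"},
    {"purpose": "charge a phone while hiking", "mechanism": "fold-out solar panel"},
]

vec = TfidfVectorizer().fit(
    [p["purpose"] for p in products] + [p["mechanism"] for p in products]
)
P = vec.transform([p["purpose"] for p in products])
M = vec.transform([p["mechanism"] for p in products])

def analogy_score(i, j):
    """High when purposes match but mechanisms differ (a near-purpose, far-mechanism pair)."""
    purpose_sim = cosine_similarity(P[i], P[j])[0, 0]
    mechanism_sim = cosine_similarity(M[i], M[j])[0, 0]
    return purpose_sim * (1 - mechanism_sim)

print(analogy_score(0, 1))  # same purpose, different mechanism -> high
print(analogy_score(0, 2))  # different purpose -> low
```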

Reflection:
Overall, I really liked how the paper attempts to solve a hard problem with a scalable approach (crowds and RNNs) and tests it on a real-world dataset. I also liked how the paper defines similarity between ideas (i.e., analogies) in terms of their purpose and the mechanisms through which products work. Further, the paper suggests that more complex representations would be needed for research papers. This raises the question: how much more difficult is it to mine analogies for complex or more abstract ideas, compared to simple ones? Perhaps structured labels could help in that regard.

The approach itself is commendable since it is a great example of a mixed-initiative user interface that combines the creativity of the crowd and the affordable computation of RNNs. Further, this approach does not needlessly waste human computation. The authors also completed a thorough evaluation of the machine intelligence portion.

I also appreciate the approach taken toward turning something subjective—in this case, creativity—into something more objective, by breaking it down into different rateable metrics.

Finally, the idea of using randomly generated analogies to “spark creativity,” and the results that followed, suggest that creativity really does need exposure to diverse ideas. I wonder why this may be, and how to introduce such randomness into real-world work practice.

Questions:
1. How scalable do you think the system is? What other limitations does it have?
2. Can this approach be used to generate analogies in other fields? What would be different?
3. Do you think creativity is subjective? Can it be made into something objective?



04/22/2020 – Sukrit Venkatagiri – SOLVENT: A Mixed Initiative System for Finding Analogies between Research Papers

Paper: Joel Chan, Joseph Chee Chang, Tom Hope, Dafna Shahaf, and Aniket Kittur. 2018. SOLVENT: A Mixed Initiative System for Finding Analogies Between Research Papers. Proc. ACM Hum.-Comput. Interact. 2, CSCW: 31:1–31:21

Summary: In this paper, the authors attempt to help researchers by mining analogies across domains to support interdisciplinary research. The paper proposes a unique annotation schema that extends prior work by Hope et al. and has four key facets: background, purpose, mechanism, and findings. The paper also presents three interesting studies: first, collecting annotations from domain experts in research fields; second, using the Solvent system to generate analogies with real-world usefulness; and third, scaling Solvent up through crowdsourcing workflows. In each of the three studies, they used semantic vector representations of the annotations. The first study used a dataset of papers from CSCW annotated by a member of the research team, while the second study involved working with an interdisciplinary team in bioengineering and mechanical engineering to determine whether Solvent could help identify analogies not easily found with citation tree search. Finally, in the third study, the authors recruited crowd workers from Upwork and Amazon Mechanical Turk to generate annotations, and found that workers had difficulty with the purpose and mechanism annotations. On the whole, the Solvent system was found to help researchers find analogies effectively.

Reflection: Overall, I think this paper is well-motivated, and the three studies that form the basis for the results are impressive. It was also interesting that there was substantial agreement between crowd workers and researchers in their annotations. This is a useful finding more broadly: novices may be able to contribute to science not necessarily by doing science (especially as science gets harder for “normal” people to do, and is done in larger and larger teams), but by finding analogies between different disciplines’ literatures.

For their second study, the authors trained a word2vec model on a curated dataset of over 3,000 papers from three domains. This was good because they did not limit their work to just one domain and strove to generalize their findings. However, the domains are still largely engineering disciplines, although CSCW has a social science component to it. I wonder how the approach would work in other disciplines, such as between the pure sciences? That might be an interesting follow-up study.
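
As a toy illustration of that kind of setup (my own sketch, assuming gensim 4's Word2Vec interface and a three-sentence stand-in for the authors' ~3,000-paper corpus), annotation facets can be embedded by averaging word vectors and then compared across domains:

```python
# Minimal sketch of facet embeddings via word2vec, assuming gensim >= 4.0.
# The toy corpus stands in for the ~3,000-paper training set described in the paper.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "crowd workers annotate research papers with purpose and mechanism".split(),
    "soft robotic gripper grasps fragile objects using pneumatic actuation".split(),
    "recommender system suggests relevant papers using citation graphs".split(),
]
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=50, seed=1)

def embed(text):
    """Average word vectors for the words of an annotation facet (ignoring OOV words)."""
    words = [w for w in text.split() if w in model.wv]
    return np.mean([model.wv[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the "purpose" facet of two papers from different domains.
p1 = embed("grasps fragile objects")
p2 = embed("suggests relevant papers")
print(cosine(p1, p2))
```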

I wonder how such a system might be deployed more broadly, compared to the limited deployment in this paper. I also wonder how long it would have taken crowd workers, in total, to complete the tasks and generate the annotations.

Questions:

  1. What other domains do you think Solvent would be useful in? Would it easily generalize?
  2. Is majority vote an appropriate mechanism? What else could be used?
  3. What are the challenges to creating analogies?


04/22/2020 – Sukrit Venkatagiri – Opportunities for Automating Email Processing: A Need-Finding Study

Paper: Soya Park, Amy X. Zhang, Luke S. Murray, and David R. Karger. 2019. Opportunities for Automating Email Processing: A Need-Finding Study. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–12

Summary: This paper is a need-finding study exploring the opportunities and challenges of automating email processing. The authors conducted a mixed-methods study to pinpoint users’ expectations and needs for automated email handling, as well as the informational and computational support required for it. The study was divided into three main parts: what types of email automation users want, what types of information and computation are needed to support it, and a field deployment of a simple inbox scripting tool. The need-finding was done in two steps. First, a formative design workshop in which 13 computer science students created email processing rules. Second, a survey in which 77 people, along with 35 people without a technical background, answered questions to better understand categories of email automation and users’ needs. The results show a need for richer data models for email, better management features, use of internal and external context, and additional affordances. Finally, the paper describes YouPS, a platform that lets users write small scripts for their inboxes. The authors enlisted 12 email users and found that they wanted more automation in their email management, especially in terms of richer data models and automatic processing of email content.

Reflection: I agree with the premise of the paper: that we should and can help people better manage their email inboxes to reduce the amount of energy they spend making sense of them. I wonder why email has become so overwhelming in the first place, and how it has affected workplace productivity.

I especially like the multi-pronged approach that the authors took in this paper, with a formative study, a survey, and building a system. I believe this multi-stage approach is valuable and can provide multiple insights as well as opportunities for triangulating data.

With respect to their findings, I think the need for richer data models and rules, as well as ways to leverage internal and external email context, is very important. If we can understand, for example, a sender’s urgency level and the receiver’s commitments to that sender, then we could draft a rule prioritizing or deprioritizing their emails. I also think email templates and autofill options are useful; Google already does something similar, in a more intelligent way, with Gmail’s autocomplete (Smart Compose) feature.
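
For example, a rule along the lines imagined above might look like the following sketch (purely hypothetical; this is not YouPS's actual API, just a plain Python function over a minimal message record):

```python
# Hypothetical sketch of an email-prioritization rule using sender context.
# This is NOT YouPS's real API; it only illustrates the kind of rule users asked for.
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    subject: str
    body: str

# "Internal context": my outstanding commitments to particular senders.
commitments = {"advisor@university.edu": "weekly report"}

def priority(msg: Message) -> str:
    sender_urgent = any(w in msg.subject.lower() for w in ("urgent", "asap", "deadline"))
    owed_reply = msg.sender in commitments
    if owed_reply and sender_urgent:
        return "top"        # I owe them something AND they flag urgency
    if owed_reply or sender_urgent:
        return "high"
    return "normal"

msg = Message("advisor@university.edu", "Deadline for weekly report", "...")
print(priority(msg))  # -> "top"
```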

However, I wonder how many users actually make use of intelligent filters, and/or would make use of any new tools introduced in the future. It may be that only knowledge workers are bombarded with emails that require responses, while most other users simply receive spam (which, I believe, accounts for roughly 90-95% of all email sent worldwide). It would also be interesting to see how this differs between people’s work and personal email. I myself maintain a separate email address for communications that I know will be spammy, such as insurance applications.

Questions:

  1. How do you manage your email? Do you use filters?
  2. Do you manage your different inboxes differently? How?
  3. What do you think of YouPS? Would you use it? Why or why not?


04/07/20 – Sukrit Venkatagiri – CrowdScape: Interactively Visualizing User Behavior and Output

Paper: Jeffrey Rzeszotarski and Aniket Kittur. 2012. CrowdScape: interactively visualizing user behavior and output. In Proceedings of the 25th annual ACM symposium on User interface software and technology (UIST ’12), 55–62. https://doi.org/10.1145/2380116.2380125

Summary:

Crowdsourcing has been used to do intelligent tasks and knowledge work online, at scale, and at a lower price. However, there are many challenges in controlling quality in crowdsourcing. In prior approaches, quality control was done by evaluating output against gold standards or by examining worker agreement and behavior. Yet these approaches have many limitations, especially for creative tasks or other tasks that are highly complex in nature. This paper presents CrowdScape, a system that supports human evaluation of complex crowdsourcing task results through an interactive visualization backed by mixed-initiative machine learning. The paper describes the system’s features as well as its uses through four very different case studies: a Japanese-to-English translation task; a somewhat unusual task asking workers to pick their favorite color; a task asking workers to write about their favorite place; and a video-tagging task. Finally, the paper concludes with a discussion of the findings.

Reflection:

Overall, I really liked the paper and the CrowdScape system, and I found the multiple case studies really interesting. I especially liked the fact that the case studies varied in terms of complexity, creativity, and open-endedness. However, I found the color-picker task a little off-beat and wonder why the authors chose that task. 

I also appreciate that the system is built on top of existing work, e.g., Amazon Mechanical Turk (a necessity), as well as Rzeszotarski and Kittur’s Task Fingerprinting system for capturing workers’ behavioral traces. The scenario describing the more general use case was also very clear and concise. The fact that CrowdScape utilizes two diverse data sources—as opposed to just one—is interesting: it makes triangulating the findings easier, as well as spotting discrepancies in the data. More specifically, CrowdScape looks at workers’ behavioral traces as well as their output. This allows one to differentiate between workers in terms of their “laziness/eagerness” as well as the actual quality of their output. The system also aggregates the two kinds of features, and all of these are displayed as visualizations, which makes it easy for a requester to view tasks and quickly discard or include work.
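
As a toy sketch of that idea (my own simplification, not CrowdScape's actual implementation), per-worker behavioral-trace features and output features can be joined and used to flag submissions for review:

```python
# Toy sketch: combine behavioral traces with output features per worker.
# The thresholds and features are made up; CrowdScape itself couples this with
# interactive visualization and mixed-initiative ML rather than fixed rules.
import pandas as pd

traces = pd.DataFrame({
    "worker": ["w1", "w2", "w3"],
    "time_on_task_s": [240, 35, 180],
    "keystrokes": [410, 12, 290],
})
outputs = pd.DataFrame({
    "worker": ["w1", "w2", "w3"],
    "answer_length": [85, 9, 60],
})

df = traces.merge(outputs, on="worker")
# Flag workers whose traces AND outputs both look like low effort.
df["flag_for_review"] = (df["time_on_task_s"] < 60) & (df["answer_length"] < 20)
print(df)
```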

However, I wonder how useful these visualizations might be for tasks such as surveys, or other tasks that are less open-ended. Further, although the visualizations are useful, I wonder whether they should be used in conjunction with gold standard datasets, and how useful that combination would be. Although the paper demonstrates the potential uses of the system via case studies, it does not show whether real requesters find it useful. Thus, an evaluation with real-world users might help.

Questions:

  1. What do you think about the case study evaluation? Are there ways to improve it? How?
  2. What features of the system would you use as a requester?
  3. What are some drawbacks to the system?


03/25/20 – Sukrit Venkatagiri – Evaluating Visual Conversational Agents

Paper: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating Visual Conversational Agents via Cooperative Human-AI Games. In Fifth AAAI Conference on Human Computation and Crowdsourcing. Retrieved January 22, 2020 from https://aaai.org/ocs/index.php/HCOMP/HCOMP17/paper/view/15936

Summary: The paper measures human-AI team performance and compares it to AI-AI team performance. The authors use a game called GuessWhich to evaluate visual conversational agents (visual question-answering agents). GuessWhich requires humans and the AI system, ALICE, to interact, or converse, with each other.

In the game, there are two primary agents: the one who asks questions (the questioner) and the one who answers them (the answerer). The answerer responds to questions about a secret image, and the questioner uses those answers to try to pick the correct image out of a fixed pool of images. In the human-AI team, the human is the questioner and ALICE is the answerer; performance is measured by the number of rounds of questions taken to arrive at the correct image. There is also a QuestionerBot (QBot) that can be used instead of a human, so that human-AI performance can be compared against AI-AI performance: that is, ALICE-human versus ALICE-QBot.
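
A bare-bones evaluation harness for this kind of game might look like the sketch below (my own illustration with stub agents, not the authors' code; a real questioner would rank the image pool based on the dialog instead of guessing randomly):

```python
# Hypothetical harness for a GuessWhich-style game: the questioner (human or
# QBot) asks about a secret image, ALICE answers, and we count how many rounds
# it takes to guess correctly. The agents here are trivial stubs.
import random

def play_game(ask, answer, pool, target, max_rounds=10):
    """Return the number of rounds until the questioner guesses the target."""
    dialog = []
    for round_no in range(1, max_rounds + 1):
        q = ask(dialog, pool)
        a = answer(q, target)
        dialog.append((q, a))
        guess = random.choice(pool)        # stub guesser; a real one ranks the pool
        if guess == target:
            return round_no
    return max_rounds

ask = lambda dialog, pool: "is it outdoors?"   # stub questioner (human or QBot)
answer = lambda q, target: "yes"               # stub ALICE
pool = list(range(20))
rounds = [play_game(ask, answer, pool, target=random.choice(pool)) for _ in range(100)]
print(sum(rounds) / len(rounds))  # mean rounds-to-success as the team metric
```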

The paper further discusses the challenges of this approach, including the difficulty of obtaining robust question-answer pairs and the fact that humans may or may not learn from the AI, among others. Finally, the paper concludes that ALICE-RL, the higher-performing or “state of the art” agent when evaluated against QBot, shows no corresponding advantage in ALICE-human teams: gains measured in AI-AI evaluation do not carry over to human-AI teams. This points to an increasing disconnect between AI development that occurs independent of human input and human-in-the-loop interactive systems.

Reflection: This paper foregrounds a very important challenge: the gap between AI research and development and its use in the real world with actual human beings. One thing I found interesting in this paper is that AI systems remain ever-dependent on humans for input. Similar to what Gray and Suri describe in their book Ghost Work, as AI advances, there is a need for humans at the frontier of AI’s capabilities. This is known as AI’s “last mile” problem, and it will probably never cease to exist. It is an interesting paradox: AI seeks to replace humans, only to need humans for a new type of task.

I think this is one of the major limitations of developing AI independent of real-world applications and usage. If researchers only use synthetic data and toy cases within the lab, then AI development cannot really advance in the real world. Instead, AI researchers should strive to work with human computation and human-computer interaction researchers to further both groups’ needs and requirements. A similar pattern appears in Google Duplex, where a human takes over when the AI is unable to perform well.

Further, I find that there are some limitations to the paper, such as the experimental setup and the dataset that was used. I do not think that QBot was representative of a useful AI, and its questions were not on par with those a human would ask. I also believe that QBot needed to be able to dynamically learn from and adapt to the interaction, which would make for a fairer comparison between AI-AI and human-AI teams.

Questions:

  1. How might VQA challenges be made more difficult and applicable in the real-world?
  2. What are the other limitations of this paper? How can they be overcome?
  3. How would you use such an evaluation approach in your class project?
  4. There’s a lot of interesting data generated from the experiments, such as user trust and beliefs. How might that be useful in improving AI performance?


03/25/20 – Sukrit Venkatagiri – “Like Having a Really Bad PA”: Gulf between User Expectation and Experience

Paper: Ewa Luger and Abigail Sellen. 2016. “Like Having a Really Bad PA”: The Gulf between User Expectation and Experience of Conversational Agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). Association for Computing Machinery, New York, NY, USA, 5286–5297.

Summary: This paper presents findings from 14 semi-structured interviews conducted with users of existing conversational agent systems and highlights four key areas where current systems fail to support user interaction. It details how conversational agents are increasingly used in online systems, such as banking, as well as by larger companies like Google, Facebook, and Apple. The paper attempts to understand end users and their experiences of dealing with conversational agents, as well as the challenges they face in doing so, both on the user’s side and the conversational agent’s side. The findings highlight how conversational agents are used by end users for play, for hands-free input when they are unable to type, for specific and formal queries, and for simple tasks such as checking the weather. The paper also describes how, in most instances, conversational agents fail to bridge the gap between users’ expectations and the agents’ actual behavior, and suggests that incorporating playfulness may be useful. Finally, the paper uses Norman’s gulfs of execution and evaluation to provide implications for designing future systems.

Reflection:
This paper is very interesting, and I have had similar thoughts when using conversational agents in day-to-day life. I also appreciate the use of semi-structured interviews to get at users’ actual experiences of using conversational agents and how those experiences differed from their expectations prior to using these CAs.

This work also adds to prior work, confirming the existence of this gap, or gulf, between expectations and reality, and showing that users constantly expect more from CAs than CAs are capable of providing. The paper also speaks to the importance of designing conversational agents so that user expectations are set for them, rather than having users set their own expectations, as we saw in some papers from previous weeks. The authors also discuss emphasizing interaction and constant updates from the CA to improve end-user expectations.

The paper also suggests ways to hold researchers and developers accountable for the promises that they make when designing such systems, and overhauling the system based on user feedback. 

However, rather than focusing only on where conversational agents fail to support user interaction, I wish the paper had also examined where they successfully support it. Further, I wish the authors had sampled not only novice users but also experts, who might have had different expectations. It might be interesting to scale this work up as a survey to see how users’ expectations differ based on which conversational agent is being used.

Questions:

  1. How would you work to reduce the gulf between expectation and reality?
  2. What are the challenges to building useful and usable conversational AIs?
  3. Why are conversational AIs sometimes so limited? What affects their performance?
  4. Where do you think humans can play a role? I.e. as humans-in-the-loop?


03/04/20 – Sukrit Venkatagiri – Toward Scalable Social Alt Text

Paper: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind. In Fifth AAAI Conference on Human Computation and Crowdsourcing.

Summary:
This paper explores a variety of approaches for supporting blind and visually impaired (BVI) people with alt-text captions. The authors consider two baseline methods: existing computer vision (vision-to-language, V2L) captions and human-corrected captions (HCC). They also consider two workflows that do not depend on CV: the TweetTalk conversational workflow and the Structured Q&A workflow, whose structured questions were derived from the questions asked in TweetTalk. They found that V2L performed the worst and that, overall, any approach that used CV as a baseline did not perform well. The TweetTalk conversational approach is more generalizable, but it is harder to recruit workers for. Finally, they conducted a study of TweetTalk with 7 BVI people and learned that they found it potentially useful. The authors discuss their findings in relation to prior work, as well as the tradeoffs between human-only and AI-only systems, paid vs. volunteer work, and conversational assistance vs. structured Q&A. They also discuss the limitations of the work extensively.

Reflection:
Overall, I really liked this paper and found it very interesting. Their multiple approaches to evaluating human-AI collaboration (AI alone, human-corrected, human chat, asynchronous human answers) were interesting, as were the quality-perception ratings obtained from third-party workers. I think this paper makes a strong contribution, but I wish it went into more detail about exactly how the system worked, the different experimental setups, and any other interesting findings. Sadly, there is an 8-page limit, which may have prevented the authors from going into more detail.

I appreciate the fact that they built on and used prior work in this paper, namely MacLeod et al. 2017, Mao et al. 2012, and Microsoft’s Cognitive Services API. This way, they did not need to build their own database, CV algorithms, or real-time crowdworker recruiting system. Instead, it allowed them to focus on more high-level goals.

Their findings were interesting, especially the fact that human-corrected CV descriptions performed poorly. It is unclear how satisfaction differed between the various conditions for first-party ratings; it may be that users had context from the conversation that was not reflected in their ratings. The results also show that current V2L systems have worse accuracy than human-in-the-loop approaches. Sadly, there was no significant difference in accuracy between HCC and descriptions generated after TweetTalk, though Structured Q&A improved accuracy significantly.

Finally, the validation with BVI users is welcome, and I believe more Human-AI work needs to actually work with real users. I wonder how the findings might differ if they were used in a real, social context, or with people on MTurk instead of the researchers-as-workers.

Overall, this was a great paper to read, and I hope others build on this work, similar to how the authors here directly leveraged prior work to advance our understanding of human-AI collaboration for alt-text generation.

Questions:

  1. Are there any better human-AI workflows that might be used that the authors did not consider? How would they work and why would they be better?
  2. What are the limitations of CV that led to the findings in this paper that any approach with CV performed poorly?
  3. How would you validate this system in the real world?
  4. What are some other next steps for improving the state of the art in alt-text generation?


03/04/20 – Sukrit Venkatagiri – Pull the Plug?

Paper: Danna Gurari, Suyog Jain, Margrit Betke, and Kristen Grauman. 2016. Pull the Plug? Predicting If Computers or Humans Should Segment Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’16), 382–391.

Summary: 
This paper proposes a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher-quality segmentations for a given batch of images and methods. The framework uses a “pull-the-plug” model that predicts when to use human versus computer annotators. More specifically, the proposed system intelligently allocates computer effort to replace human effort for initial coarse segmentations, and then automatically identifies images for humans to re-annotate by predicting which images the automated methods did not segment well enough. This method could be used for a variety of use cases, and the paper tests it on three datasets and eight segmentation methods. The findings show that the method significantly outperformed prior work across a variety of metrics, including quality prediction, initial segmentation, fine-grained segmentation, and cost.
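
A minimal sketch of the allocation step (my own simplification of the "pull-the-plug" idea, with made-up quality scores; the paper builds a model that predicts segmentation quality rather than assuming it) might look like this:

```python
# Simplified sketch of budgeted allocation: re-annotate with humans only the
# images whose automatic segmentations are predicted to be lowest quality.
# The scores are made up for illustration.
predicted_quality = {
    "img_01": 0.92, "img_02": 0.41, "img_03": 0.78,
    "img_04": 0.35, "img_05": 0.88,
}
human_budget = 2  # how many images we can afford to have humans segment

ranked = sorted(predicted_quality, key=predicted_quality.get)  # worst first
send_to_humans = ranked[:human_budget]
keep_automatic = ranked[human_budget:]

print("human re-annotation:", send_to_humans)   # lowest predicted quality
print("keep computer result:", keep_automatic)
```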

Reflection:
Overall, this was an interesting paper to read, largely focused on performance and accuracy. The paper shows that the methods are superior to prior work and now represent the state of the art for image segmentation on these three datasets, as well as for saving costs.

I wonder what this paper might have looked like if it was more focused on creativity and innovation, rather than performance and cost-savings. For example, in HCI there are studies of using crowds to generate ideas, solve mysteries, and critique designs. Perhaps this approach might be used in a way that humans and machines can provide suggestions and they build off of each other.

More specifically, I wonder how the results would generalize to datasets other than the three used here, or to real-world applications such as self-driving cars. Certainly, a lot more work needs to be done, and the system would need to run in real time, meaning human computation might not be a feasible method for self-driving cars. Still, it could certainly be used for generating training datasets for self-driving car algorithms.

This entire approach relies on the proposed prediction module, and it would be interesting to explore other edge cases where the predictions are better made by humans rather than through machine intelligence.

Finally, the finding that the computer segmented images more similarly to experts than crowd workers was interesting, and I wonder why—was it because the computer algorithms were trained on expert-generated training sets? Perhaps the crowd workers would perform better over time or with training. In that case, the results might have been better overall when combining the two.

Questions:

  1. How might you use this approach in your class project?
  2. Where does CV fail and where can humans augment it? What about the reverse?
  3. What are the limitations of a “pull-the-plug” approach, and how can they be overcome?
  4. Where else might this approach be used?


02/26/20 – Sukrit Venkatagiri – Will You Accept an Imperfect AI?

Paper: Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–14.

Summary: 

This paper explores people’s perceptions and expectations of an intelligent scheduling assistant. The paper specifically considers three broad research questions: how the kinds of errors the AI focuses on avoiding affect user perception, how to set appropriate end-user expectations, and how expectation setting affects user satisfaction and acceptance. The paper explores these questions through an experimental setup, whose design process is described in detail.

The authors find that the expectation adjustment designs significantly affected the intended aspects of expectations, as hypothesized. They also find that a high-recall system led to significantly higher perceptions of accuracy and acceptance than a high-precision one, and that expectation adjustment worked through intelligible explanations and by tweaking which evaluation metric the model emphasized. The paper concludes with a discussion of the findings.

Reflection:

This paper presents some interesting findings using a relatively simple yet powerful “technology probe.” I appreciate the thorough exploration of the design space, taking into consideration design principles and how they were adapted to meet the required goals. I also appreciate the varied and nuanced research questions. However, I feel the setup may have been too simple to explore the questions in much depth. Certainly, this is valuable as a formative study, but more work needs to be done.

It was interesting that people valued high recall over high precision. I wonder if the results would differ among people with varied expertise, from different countries, or from different socioeconomic backgrounds. I also wonder how this might differ based on the application scenario, e.g., an AI scheduling assistant versus a movie recommendation system. In the latter, a user would not be aware of movies they were not recommended but would actually like, while with an email scheduling assistant, false negatives are easy to see.
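
To make the precision/recall tradeoff concrete with toy numbers (mine, not the paper's), imagine a scheduling assistant that flags emails containing meeting requests:

```python
# Toy precision/recall example for an email scheduling assistant (numbers are
# illustrative, not from the paper). y_true: does the email really contain a
# meeting request? pred: did the assistant flag it?
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
high_recall_pred    = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # catches every meeting, 2 false alarms
high_precision_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # no false alarms, misses 2 meetings

for name, pred in [("high recall", high_recall_pred), ("high precision", high_precision_pred)]:
    print(name,
          "precision =", precision_score(y_true, pred),
          "recall =", recall_score(y_true, pred))
# With the high-recall assistant, misses (false negatives) are rare but users see
# some wrong flags; with the high-precision one, flags are trustworthy but users
# notice the meetings it silently missed.
```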

I wonder how these techniques, such as expectation setting, might apply not only to users’ expectations of AI systems, but also to exploring the interpretability or explainability of more complex ML models.

At what point do explanations tend to result in the opposite effect? I.e. reduced user acceptance and preference? It may be interesting to experimentally study how different levels of explanations and expectation settings affect user perceptions versus a binary value. I also wonder how it might change with people of different backgrounds.

In addition, this experiment was relatively short in duration. I wonder how the findings would change over time. Perhaps users would form inaccurate expectations, or their mental models might be better steered through expectation-setting. More work is needed in this regard. 

Questions:

  1. Will you accept an imperfect AI?
  2. How do you determine how much explanation is enough? How would this work for more complex models?
  3. What other evaluation metrics can be used?
  4. When is high precision valued over high recall, and vice versa?


02/26/2020 – Sukrit Venkatagiri – Interpreting Interpretability

Paper: Harmanpreet Kaur, Harsha Nori, Samuel Jenkins, Rich Caruana, Hanna Wallach, and Jennifer Wortman Vaughan. 2020. Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20).

Summary: A number of tools have been developed to increase the interpretability of ML models, which are used in a wide variety of applications today. However, very few of these tools have been studied in their context of use or evaluated with actual users. This paper presents a user-centered evaluation of two ML interpretability tools, using a combination of interviews, a contextual inquiry, and a large-scale survey of data scientists.

From the interviews, they identified six key themes: missing values, temporal changes in the data, duplicate data masked as unique, correlated features, ad-hoc categorization, and the difficulty of debugging or identifying potential improvements. From the contextual inquiry with a glass-box model (a GAM) and a post-hoc explanation technique (SHAP), they found a misalignment between data scientists’ understanding of the tools and the tools’ intended use. Finally, from the survey, they found that participants’ mental models differed greatly and that their interpretations of these interpretability tools also varied along multiple axes. The paper concludes with a discussion of bridging the HCI and ML communities and designing more interactive interpretability tools.
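
As a rough sketch of the two styles of tool the paper studies (using a plain linear model and permutation importance as stand-ins, not the specific GAM and SHAP implementations from the study), the contrast looks something like this:

```python
# Sketch contrasting a "glass-box" model (inspect its own coefficients) with a
# post-hoc explanation (permutation importance). These are stand-ins only: the
# paper's contextual inquiry used a GAM and SHAP, not these exact tools.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Glass-box view: the (scaled) linear model's own coefficients are directly readable.
coefs = sorted(zip(X.columns, model[-1].coef_[0]), key=lambda t: abs(t[1]), reverse=True)
print("top coefficients:", coefs[:3])

# Post-hoc view: shuffle each feature and measure the drop in held-out accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
imps = sorted(zip(X.columns, result.importances_mean), key=lambda t: t[1], reverse=True)
print("top permutation importances:", imps[:3])
```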

Reflection:

Overall, I really liked the paper and it provided a nuanced as well as broad overview of data scientists’ expectations and interpretations of interpretability tools. I especially appreciate the multi-stage, mixed-methods approach that is used in the paper. In addition, I commend the authors for providing access to their semi-structured interview guide, as well as other study materials, and that they had pre-registered their study. I believe other researchers should strive to be this transparent in their research as well.

More specifically, it is interesting that the paper first leveraged a small pilot study to inform the design of a more in-depth “contextual inquiry” and a large-scale study. However, I do not believe the method used for the “contextual inquiry” constitutes a true contextual inquiry; rather, it is more like a user study involving semi-structured interviews. This is especially true since many of the participants were not familiar with the interpretability tools used in the study, which means it was not their actual context of use or work.

I am also unsure how realistic the survey is in terms of mimicking what someone would actually do, and I appreciate that the authors acknowledge this in the limitations section. A minor concern is the 7-point scale used in the survey, which ranges from “not at all” to “extremely” and does not follow standard survey-science practice.

I wonder what would happen if participants were a) nudged not to take the visualizations at face value, or to employ “system 2”-type thinking, and/or b) asked to use the tool for longer. Indeed, the authors do notice some emergent behavior in the findings, such as a participant questioning whether the tool was actually an interpretability tool. I also wonder what would have happened if two people had used the tool side by side, as a “pair programming” exercise.

It’s also interesting how varied participants’ backgrounds, skills, baseline expectations, and interpretations were. Certainly, this problem has been studied elsewhere, and I wonder whether the findings in this paper result not only from the failure to design these tools in a user-centered manner, but also from the broad range of technical skills among the users themselves. What would it mean to develop a tool for users with such a range of skillsets, especially statistical and mathematical skills? This arguably calls for increased certification, given the growing demand for data scientists, within the ML software industry.

I appreciate the point surrounding Kahneman’s system 1 and system 2 work in the discussion, but I believe this section is possibly too short. I acknowledge that there are page restrictions, which meant that the results could not have been discussed in as much depth as is warranted for such a formative study.

Overall, this was a very valuable study that was conducted in a methodical manner and I believe the findings to be interesting to present and future developers of ML interpretability tools, as well as the HCI community that is increasingly interested in improving the process of designing such tools.

Questions:

  1. Is interpretability only something to be checked off a list, and not inspected in depth?
  2. How do you inspect the interpretability of your models, if at all? When do you know you’ve succeeded?
  3. Why is there a disconnect between the way these tools are intended to be used and how they are actually used? How can this be fixed?
  4. Do you think there needs to be greater requirements in terms of certification/baseline understanding and skills for ML engineers?
