03/04/2020 – Vikram Mohanty – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Authors: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris

Summary

This paper studies how crowdsourcing can be used to evaluate automated approaches for generating alt-text captions for BVI (Blind or Visually Impaired) users on social media. Further, the paper proposes an effective real-time crowdsourcing workflow to assist BVI users in interpreting captions. The paper shows that the shortcomings of existing AI image captioning systems frequently hinder a user’s understanding of an image they cannot see, to the extent that even clarifying conversations with sighted assistants cannot correct the misunderstanding. Finally, the paper proposes a detailed set of guidelines for future iterations of AI captioning systems.

Reflection

This paper is another example of people working with imperfect AI. Here, the imperfection results not from a failure to collect meaningful datasets, but from building algorithms on constrained datasets without foresight of the application, i.e., alt text for BVI users. The paper demonstrates a successful crowdsourcing workflow that augments the AI’s suggestion, and serves as motivation for other HCI researchers to design workflows that integrate the strengths of interfaces, crowds, and AI.

The paper shows an interesting finding where the simulated BVI users found it easier to generate a caption from scratch than from the AI’s suggestion. This shows how the AI’s suggestion can bias a user’s mental model in the wrong direction, from where recovery might be costlier compared to no suggestion in the first place. This once again stresses the need for considering real-world scenarios and users in the evaluation workflow. 

The solution proposed here is bottlenecked by the challenges of real-time deployment with crowd workers. Despite that, the paper makes an interesting contribution in the form of guidelines for future iterations of AI captioning systems. Involving potential end-users and proposing systematic goals for an AI to achieve are both desirable in the long run.

Questions

  1. Why do you think people preferred to generate the captions from scratch rather than from the AI’s suggestions? 
  2. Do you ever re-initialize a system’s data/suggestions/recommendations to start from blank? Why or why not? 
  3. If you worked with an imperfect AI (which is more than likely), how do you envision mitigating the shortcomings when you are given the task to redesign the client app? 

03/04/2020 – Sushmethaa Muhundan – Pull the Plug? Predicting If Computers or Humans Should Segment Images

The paper proposes a resource allocation framework that intelligently distributes work between a human and an AI system in the context of foreground object segmentation. The study demonstrates the advantages of using a mix of humans and AI rather than either alone. The goal is to produce high-quality object segmentation results with considerably less human effort. Two systems are implemented that automatically decide when to transfer control from the human to the AI component and vice versa, depending on the quality of segmentation at each phase. The first system removes the human from the initial annotation step: computers generate a coarse object segmentation, which is then refined by segmentation tools. The second system predicts the quality of the annotations and automatically identifies the subset that needs to be re-annotated by humans. Three diverse datasets were used to train and validate the system, representing visible, phase contrast microscopy, and fluorescence microscopy images.
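The allocation idea behind the second system can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `allocate_to_humans`, the `budget` parameter, and the example quality scores are all hypothetical, and the quality predictor itself is assumed to exist already.

```python
def allocate_to_humans(predicted_quality, budget):
    """Return the indices of the `budget` images with the lowest predicted quality.

    predicted_quality: one score per image (higher means a better segmentation).
    budget: how many images humans can afford to re-annotate (hypothetical).
    """
    # Rank image indices from worst to best predicted segmentation quality.
    ranked = sorted(range(len(predicted_quality)), key=lambda i: predicted_quality[i])
    # Send only the worst `budget` images to humans; the rest keep the computer's result.
    return sorted(ranked[:budget])


# Hypothetical predicted quality scores for five images.
scores = [0.91, 0.42, 0.77, 0.15, 0.88]
print(allocate_to_humans(scores, budget=2))  # [1, 3]
```

The point of the sketch is that human effort is spent only where the predictor expects the automated result to be weakest.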

The paper explores leveraging the complementary strengths of humans and AI and allocates resources accordingly in order to reduce human involvement. I particularly liked the focus on quality throughout the paper. The mixed approach matches the quality of traditional systems that relied heavily on human involvement. The resulting system saves a significant number of hours of human effort while maintaining the quality of the foreground object segmentation, which is great.

Another aspect of the paper that I found impressive was the conscious effort to develop a single prediction model that is applicable across different domains. Three diverse datasets were employed as part of this initiative. The paper talks about the disadvantages of other systems that do not work well on multiple datasets. In such cases, only a domain expert or computer vision expert would be able to predict when the system would succeed. This paper claims that this is altogether avoided in this system. Also, the decision to intentionally include humans only once per image is good as opposed to the existing system where human effort is required multiple times during the initial segmentation phase of each image.

  1. This paper primarily focuses on reducing human involvement in the context of foreground object segmentation. What other applications can extend the principles of this system to achieve reduced involvement of humans in the loop while ensuring that quality is not affected?
  2. The system deals with predicting the quality of image segmentation outputs and involves the human to re-annotate only the lowest quality ones. What other ideas can be employed to ensure reduced human efforts in such a system?
  3. The paper implies that the system proposed can be applied across images from multiple domains. Were the three datasets described varied enough to ensure that this is a generalized solution?

03/04/2020 – Nurendra Choudhary – Combining crowdsourcing and Google Street View to identify street-level accessibility problems

Summary

In this paper, the authors discuss a crowdsourcing method that uses Amazon Mechanical Turk workers to identify accessibility issues in Google Street View images. They collect annotations at two levels: image-level and pixel-level. They evaluate intra- and inter-annotator agreement and report an accuracy of 81% (increased to 93% with minor quality control additions), which they conclude is feasible for real-world scenarios.

The authors open the paper with a discussion of why such approaches are necessary. The solution could lead to more accessibility-aware services. The paper uses precision, recall, and F1-score to consolidate and evaluate image-level annotations. For pixel-level annotations, the authors use two sets of evaluation metrics: overlap between annotated pixels and precision-recall scores. The experiments show a level of inter-annotator agreement that makes the system feasible in real-world scenarios. The authors also use majority voting between annotators to improve the accuracy further.
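To make these metrics concrete, here is a minimal sketch of how they could be computed. The label names and pixel coordinates are hypothetical, and the overlap is measured here as intersection over union, which is an assumption and may differ from the paper's exact definition.

```python
def precision_recall_f1(predicted, actual):
    """Image-level scores, where `predicted` and `actual` are sets of label names."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def pixel_overlap(mask_a, mask_b):
    """Intersection over union of two sets of (x, y) pixel coordinates."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0


# Hypothetical labels for one image: a worker's answer versus the ground truth.
worker = {"obstacle", "surface_problem"}
truth = {"obstacle", "no_curb_ramp"}
print(precision_recall_f1(worker, truth))  # (0.5, 0.5, 0.5)

# Hypothetical pixel-level outlines drawn by two workers.
outline_a = {(0, 0), (0, 1), (1, 0)}
outline_b = {(0, 1), (1, 0), (1, 1)}
print(pixel_overlap(outline_a, outline_b))  # 0.5
```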

Reflection

The paper introduces an interesting approach that uses crowdsourced annotations for static image databases. This leads me to ask which other, cheaper sources of images could be used for this purpose. For example, Google Maps provides a more frequently updated set of images. Acquiring these images is also more cost-effective. I think this would be a better alternative to the Street View images.

Additionally, the paper adopts majority voting to improve its results. Theoretically, with enough independent annotators, this should approach perfect accuracy. The method reaches 93% accuracy after the addition. I would like to see examples where the method fails. This would enable the development of better collation strategies in the future. I understand that in some cases the image might be too unclear. However, examples of such failures would give us more data to improve the strategies.
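A short calculation shows why majority voting helps but stops short of 100%. The sketch below rests on a simplifying assumption that is not from the paper: every annotator is independently correct with the same probability `p`. The value `p = 0.8` is purely illustrative.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent annotators are correct."""
    if n % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    # Sum the binomial probabilities of every outcome where correct votes win.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Illustrative per-annotator accuracy of 0.8 (an assumption, not a reported figure).
for n in (1, 3, 5, 7):
    print(n, round(majority_vote_accuracy(0.8, n), 4))
```

Under these assumptions the accuracy rises with each added pair of annotators but stays below 1 for any finite group. Real annotators also tend to fail on the same hard images, so their errors are correlated and the gain is smaller than this idealized model suggests.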

Also, the images contain much more data than is currently being collected. We could build an interpretable representation of such images that captures all the world information they contain. However, the computational efficiency and validity of such a representation are still questionable. But if we are able to build better information systems, such representations might enable a huge leap forward in AI research (similar to ImageNet). We could also combine this data to build a profile of any place to help future visitors (e.g., the accessibility of restaurants or schools). Furthermore, given the time-sensitivity of accessibility, I think a dynamic model would be better than the proposed static approach. However, this would require a cheaper method of acquiring street-view data. Hence, we need to look for alternative sources of data that may provide comparable performance while limiting the expenses.

Questions

  1. What is the generalization of this method? Can this be applied to any static image database? The paper focuses on accessibility issues. Can this be extended to other issues such as road repairs and emergency lane systems?
  2. Street View data collection requires significant effort and is also expensive. Could we use Google Maps to achieve reasonable results? What is a possible limitation of applying the same approach to Google satellite imagery?
  3. What about the time sensitivity of the approach? How will it track real-time changes to the system? Does this approach require constant monitoring?
  4. The images contain much more information. How can we exploit it? Can we use it to detect infrastructural issues with government services such as parks, schools, roads etc.? 

Word Count: 560

02/26/20 – Fanglan Chen – Will You Accept an Imperfect AI? Exploring Designs For Adjusting End-user Expectations of AI Systems

Summary

Kocielnik et al.’s paper “Will You Accept an Imperfect AI?” explores approaches for shaping end-users’ expectations before they first work with an AI system and studies how appropriate expectations affect users’ acceptance of the system. Prior work has shown that end-user expectations of AI-powered technologies are influenced by various factors, such as external information, knowledge and understanding, and first-hand experience. The researchers indicate that expectations vary among users and that users’ perception and acceptance of AI systems may be negatively impacted when their expectations are set too high. To fill the gap in understanding how end-user expectations can be directly and explicitly shaped, the researchers use a Scheduling Assistant, an AI system that automatically detects meeting requests in email, to study the impact of several methods of expectation shaping. Specifically, they explore two versions of the system whose classifiers have the same accuracy, but each is tuned to avoid a different type of error (False Positives or False Negatives). Based on their study, error types are strongly related to users’ subjective perceptions of accuracy and acceptance. Expectation adjustment techniques are proposed to make users fully aware of AI imperfections and enhance their acceptance of AI systems.

Reflection

We need to be aware that AI-based technologies cannot be perfect, just like nobody is perfect. Hence, there is no point in setting a goal that involves AI systems making no mistakes. Realistically defining what success and failure look like when working with AI-powered technologies is of great importance in adopting AI to improve on today’s imperfect solutions. That calls for an accurate positioning of where AI sits in the bigger picture. I feel the paper mainly focuses on how to set appropriate expectations but lacks a discussion of the different scenarios that shape users’ expectations of AI. For example, users’ expectations of the same AI system vary greatly across decision-making frameworks. In a human-centric decision-making process, the expectation of the AI component is comparatively low, as the AI’s role is more like a counselor who is allowed to make some mistakes. In a machine-centric system, all the decisions are made by algorithms, which lowers users’ tolerance of errors. Simply put, some AIs will require more attention than others, because the impact of errors or the cost of failures will be higher. Expectations of AI systems vary not only among different users but also under various usage scenarios.

To generate positive user experiences, AI needs to exceed expectations. One simple way to achieve this is to not over-promise the performance of AI in the beginning. This relates to the researchers’ intention in designing the Accuracy Indicator component of the Scheduling Assistant. In the study, they set the accuracy to 50%. This accuracy is actually very low for AI-based applications. I’m interested in whether the evaluation results would change with AI systems of higher performance (e.g., 70% or 90% accuracy). I think it is worthwhile to conduct a survey about users’ general expectations of AI-based systems.

Interpretability of AI is another key component that shapes user experiences. If people cannot understand how AI works or how it comes up with its solutions, and in turn do not trust it, they would probably not choose to use it. As people accumulate more positive experiences, they build trust with AI. In this way, easy-to-interpret models seem to be more promising to deliver success compared with complex black-box models. 

To sum up, by being fully aware of AI’s potential but also its limitations, and developing strategies to set appropriate expectations, users can create positive AI experiences and build trust in an algorithmic approach in decision making processes.

Discussion

I think the following questions are worthy of further discussion.

  • What is your expectation of AI systems in general? 
  • How would users’ expectations of the same AI system vary in different usage scenarios?
  • What are the negative impacts brought by the inflated expectations? Please give some examples. 
  • How can we determine which type of errors is more severe in an AI system?

02/26/20 – Sukrit Venkatagiri – Will You Accept an Imperfect AI?

Paper: Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–14.

Summary: 

This paper explores people’s perceptions and expectations of an intelligent scheduling assistant. The paper specifically considers three broad research questions: the impact of AI’s focus on error avoidance versus user perception, ways to set appropriate expectations, and impact of expectation setting on user satisfaction and acceptance. The paper explores this through an experimental setup, whose design process is explored in detail. 

The authors find that expectation adjustment designs significantly affected the desired aspects of expectations, as hypothesized. They also find that high recall resulted in significantly higher perceptions of accuracy and acceptance compared to high precision, and that expectation adjustment worked through intelligible explanations and by tweaking model evaluation metrics to emphasize one over the other. The paper concludes with a discussion of the findings.

Reflection:

This paper presents some interesting findings using a relatively simple, yet powerful “technology probe.” I appreciate the thorough exploration of the design space, taking into consideration design principles and how they were modified to meet the required goals. I also appreciate the varied and nuanced research questions. However, I feel like the setup may have been too simple to explore in more depth. Certainly, this is valuable as a formative study, but more work needs to be done. 

It was interesting that people valued high recall over high precision. I wonder if the results would differ among people with varied expertise, from different countries, and from different socioeconomic backgrounds. I also wonder how this might differ based on the application scenario, e.g. AI scheduling assistant versus a movie recommendation system. In the latter, a user would not be aware of what movies they were not recommended but that they would actually like, while with an email scheduling assistant, it is easy to see false negatives.

I wonder how these techniques, such as expectation setting, might apply not only to users’ expectations of AI systems, but also to exploring the interpretability or explainability of more complex ML models.

At what point do explanations tend to have the opposite effect, i.e., reduced user acceptance and preference? It may be interesting to experimentally study how different levels of explanation and expectation setting affect user perceptions, rather than treating them as a binary value. I also wonder how this might change with people of different backgrounds.

In addition, this experiment was relatively short in duration. I wonder how the findings would change over time. Perhaps users would form inaccurate expectations, or their mental models might be better steered through expectation-setting. More work is needed in this regard. 

Questions:

  1. Will you accept an imperfect AI?
  2. How do you determine how much explanation is enough? How would this work for more complex models?
  3. What other evaluation metrics can be used?
  4. When is high precision valued over high recall, and vice versa?

02/26/20 – Lulwah AlKulaib – Explaining Models

Summary

The authors believe that in order to ensure fairness in machine learning systems, it is mandatory to have a human in the loop process. In order to identify fairness problems and make improvements, they suppose relying on developers, users, and the general public is an effective way to follow that process. The paper conducts an empirical study with four types of programmatically generated explanations to understand how they impact people’s fairness judgments of ML systems. They try to answer three research questions:

  • RQ1 How do different styles of explanation impact fairness judgment of a ML system?
  • RQ2 How do individual factors in cognitive style and prior position on algorithmic fairness impact the fairness judgment with regard to different explanations?
  • RQ3 What are the benefits and drawbacks of different explanations in supporting fairness judgment of ML systems?

The authors focus on a racial discrimination case study in terms of model unfairness and case-specific disparate impact. They performed an experiment with 160 Mechanical Turk workers. They hypothesized that, because local explanations focus on justifying a particular case, they should more effectively surface fairness discrepancies between cases.

The authors show that:

  • Certain explanations are considered inherently less fair, while others can enhance people’s confidence in the fairness of the algorithm
  • Different fairness problems, such as model-wide fairness issues versus case-specific fairness discrepancies, may be more effectively exposed through different styles of explanation
  • Individual differences, including prior positions and judgment criteria of algorithmic fairness, impact how people react to different styles of explanation.

Reflection

This is a really informative paper. I like that it had a straightforward hypothesis and chose one existing case study that they evaluated. But I would have loved to see this addressed with judges instead of crowdworkers. They mentioned it in their limitations and I hope that they find enough judges willing to work on a follow-up paper. I believe that they would have insightful knowledge to contribute especially since they practice it. It would give a more meaningful analysis to the case study itself from professionals in the field.

I also wonder how this might scale to different machine learning systems that exhibit similar racial biases. Having a specific case study makes it harder to generalize, even for something in the same domain. But it is definitely worth investigating, since there are so many existing case studies! I also wonder whether, if the case study were changed, we’d notice a difference in the local vs. global explanation patterns in fairness judgment, and how a mix of both would affect the judgment, too.

Discussion

  • What are other ways you would approach this case study?
  • What are some explanations that weren’t covered in this study?
  • How would you facilitate this study to be performed with judges?
  • What are other case studies that you could generalize this to with small changes to the hypothesis?

02/26/2020 – Palakh Mignonne Jude – Interpreting Interpretability: Understanding Data Scientists’ Use Of Interpretability Tools For Machine Learning

SUMMARY

In this paper, the authors study two interpretability tools: the InterpretML implementation of GAMs and the SHAP Python package. They conducted a contextual inquiry and a survey of data scientists in order to analyze the ability of these tools to aid in uncovering common issues that arise when evaluating ML models. The results of these studies indicate that data scientists tend to over-trust these tools. The authors conducted pilot interviews with 6 participants to identify common issues faced by data scientists. The contextual inquiry included 11 participants who explored the dataset and an ML model hands-on in a Jupyter notebook, whereas the survey comprised 197 participants and was conducted through Qualtrics. For the survey, the participants were given access to a description of the dataset and a tutorial on the interpretability tool they were to use. The authors found that the visualizations provided by the interpretability tools, as well as the fact that these tools were popular and publicly available, caused the data scientists to over-trust these tools.

REFLECTION

I think it is good that the authors performed a study to observe the usage of interpretability tools by data scientists. I was surprised to learn that a large number of these data scientists over-trusted the tools and that visualizations impacted their ability to judge the tools as well. However, the authors’ statement that ‘participants relied too heavily on the interpretability tools because they had not encountered such visualizations before’ makes me wonder if the authors should have created a separate pool of data scientists who had more experience with such tools and visualizations and then presented a separate set of results for that group. I also found it interesting to learn that some participants used the tools to rationalize suspicious observations.

As indicated by the limitations section of this paper, I think a follow-up study that includes a richer dataset as well as interpretability techniques for deep learning would be very interesting to learn about and I wonder how data scientists would use such tools versus the ones studied in this paper.

QUESTIONS

  1. Considering the complexity of ML systems and the time it takes researchers to truly understand how to interpret ML, both the contextual inquiry and the survey were conducted with people who had as little as 2 months of experience with ML. Would a study with experts in the field of ML (all with over 4 years of experience) have yielded different results? Perhaps these data scientists would have been able to better identify issues and would not have over-trusted the interpretability tools?
  2. Would a more extensive study comprising a number of different (commonly used as well as not-so-commonly used) interpretability tools have changed the results? If the tools were not so easily available, would it truly impact the amount of trust the users had in the tools?
  3. Does a correlation exist between the amount of experience a data scientist has and the amount of trust for a given interpretability tool? Would the replacement of visualizations with other representations of interpretations of the models impact the amount of trust the human had towards the tool?

02/26/2020 – Sushmethaa Muhundan – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

The paper explores how people make fairness judgments of ML systems and the impact that different explanations can have on these fairness judgments. The paper also explores how providing personalized and adaptive explanations can support such fairness judgments. It is extremely important to ensure algorithmic fairness, and there is a need to consciously work towards avoiding the risk of amplifying existing biases. In this context, providing explanations can be beneficial in two respects: not only do they provide implementation details that would otherwise be a “black box” to a user, but they also facilitate better human-in-the-loop experiences by enabling people to identify fairness issues. The COMPAS recidivism data was used for the study, and four different explanation styles were examined: input-influence based, demographic-based, sensitivity-based, and case-based. The study highlights that there is no one-size-fits-all solution for an effective explanation. The dataset, context, kinds of fairness issues, and user profiles vary and need to be addressed individually. The paper proposes hybrid explanations as a solution, providing both an overview of the ML model and information about specific cases to support accurate fairness judgments.

While there has been a lot of research focus on developing non-discriminatory ML algorithms, this paper specifically deals with the human aspect which is necessary to identify and remedy fairness issues. I feel that this is equally important and is often overlooked. It was interesting to note that they auto-generated the explanations, unlike previous studies. 

With respect to the different explanation styles used, I found the sensitivity-based explanation particularly interesting since it clearly shows the difference in the prediction result if certain attributes were modified. According to me, this form of explanation, out of the four proposed, is extremely effective in bringing out any bias that may be present in the ML system.
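The idea behind a sensitivity-based explanation can be illustrated with a toy example. Everything below is hypothetical: `toy_risk_model` is a made-up, deliberately biased scoring rule with invented feature names, not the model or data used in the paper. The sketch only shows the mechanism of changing one attribute and reporting whether the prediction flips.

```python
def toy_risk_model(person):
    """Hypothetical, deliberately biased scoring rule used only for illustration."""
    score = 2 * person["prior_offenses"]
    if person["age"] < 25:
        score += 3
    if person["group"] == "B":  # the unfair term the explanation should expose
        score += 2
    return "high risk" if score >= 6 else "low risk"


def sensitivity_explanation(model, person, feature, alternative):
    """Describe what happens to the prediction if one feature is changed."""
    original = model(person)
    changed = model({**person, feature: alternative})
    if original == changed:
        return (f"Changing {feature} to {alternative!r} would not change "
                f"the prediction ({original}).")
    return (f"If {feature} had been {alternative!r}, the prediction would "
            f"change from {original} to {changed}.")


person = {"prior_offenses": 2, "age": 30, "group": "B"}
print(sensitivity_explanation(toy_risk_model, person, "group", "A"))
print(sensitivity_explanation(toy_risk_model, person, "age", 40))
```

In this toy case, changing only the group membership flips the prediction, which is exactly the kind of case-specific discrepancy that this explanation style makes visible.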

I felt that the input-influence based explanation was also effective since it had the +/- markers corresponding to features that match the particular case and this gives the users a clearer picture of which attributes specifically influenced the result thereby providing the implementation details to a certain extent.

The study results document various insights from participants, and I found some of them to be extremely fascinating. While some believed that certain predictions were biased, others found it normal for that verdict to be predicted. The results truly captured the diversity of opinions and perspectives on the same ML system based on the different explanations provided.

  1. Through this study, it is revealed that the perception of bias is not uniform and is extremely subjective. Given this lack of agreement on the definition of moral concepts, how can a truly unbiased ML system be achieved?
  2. What are some practices that can be followed by ML model developers to ensure that the bias in the input dataset is identified and removed?
  3. Apart from gender-bias and ethnic-bias, what are some other prevalent biases in existing ML systems that need to be eradicated?

02/26/2020 – Dylan Finch – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Word count: 556

Summary of the Reading

This paper examines the role of expectations and the role of focusing on certain types of errors to see how this impacts perceptions of AI. The aim of the paper is to figure out how the setting of expectations can help users better see the benefits of an AI system. Users will feel worse about a system that they think can do a lot and then fails to live up to those expectations rather than a system that they think can do less and then succeeds at accomplishing those smaller goals.

Specifically, this paper lays out some ways to better set user expectations: an Accuracy Indicator that tells users what accuracy to expect from the system, an explanation method based on examples to help increase user understanding, and the ability for users to adjust the performance of the system. The authors also show the usefulness of these 3 techniques and that systems tuned to avoid false positives are generally perceived as worse than those tuned to avoid false negatives.

Reflections and Connections

This paper highlights a key problem with AI systems: people expect them to be almost perfect and companies market them as such. Many companies that have deployed AI systems have not done a good job managing expectations for their own AI systems. For example, Apple markets Siri as an assistant that can do almost anything on your iPhone. Then, once you buy one, you find out that it can really only do a few very specialized tasks that you will rarely use. You are unhappy because the company sold you a much more capable product. With so many companies doing this, it is understandable that many people have very high expectations for AI. Many companies seem to market AI as the magic bullet that can solve any problem. But the reality is often much more underwhelming. I think that companies that develop AI systems need to play a bigger role in managing expectations. They should not sell their products as a system that can do anything. They should be honest and say that their product can do some things but not others and that it will make a lot of mistakes, because that is just how these things work.

I think that the most useful tool this team developed was the slider that allows users to choose between more false positives and more false negatives. I think that this system does a great job of incorporating many of the things they were trying to accomplish into one slick feature. The slider shows people that the AI will make mistakes, so it better sets user expectations. But, it also gives users more control over the system which makes them feel better about it and allows them to tailor the system to their needs. I would love to see more AI systems give users this option. It would make them more functional and understandable. 
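One plausible way such a slider could work is as a decision threshold on the classifier's confidence score. That is an assumption made here for illustration, not a description of the paper's implementation, and the scores and labels below are invented.

```python
def error_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given decision threshold."""
    false_positives = sum(
        1 for s, is_meeting in zip(scores, labels) if s >= threshold and not is_meeting
    )
    false_negatives = sum(
        1 for s, is_meeting in zip(scores, labels) if s < threshold and is_meeting
    )
    return false_positives, false_negatives


# Hypothetical confidence scores for eight emails and whether each one
# really contains a meeting request.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, True, False, False]

# Moving the slider changes the threshold and trades one error type for the other.
for threshold in (0.25, 0.50, 0.75):
    fp, fn = error_counts(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

With this toy data, a low threshold gives 2 false positives and 0 false negatives, a high threshold gives the reverse, and the middle setting gives one of each, so the total number of errors stays the same while their type changes.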

Questions

  1. Will AI ever become so accurate that these systems are no longer needed? How long will that take?
  2. Which of the 3 developed features do you think is most influential/most helpful?
  3. What are some other ways that AI developers could temper the expectations of users?

02/26/2020 – Ziyao Wang – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

As machine learning models are deployed in a variety of industry domains, it is important to design interpretability tools that help model users, such as data scientists and machine learning practitioners, better understand how these models work. However, little research has focused on evaluating how well these tools perform. The authors of this paper ran experiments and surveys to fill this gap. They interviewed 6 data scientists from a large technology company to find out the most common issues data scientists face. Then, based on these common issues, they conducted a contextual inquiry with 11 participants using the InterpretML implementation of GAMs and the SHAP Python package. Finally, they surveyed 197 data scientists. With the experiments and surveys, the authors highlighted the problems of misuse and over-trust and the need for communication between members of the HCI and ML communities.

Reflection:

Before reading this paper, I held the view that interpretability tools should be able to cover most of data scientists’ needs. However, I now think that the tools for interpretation are not designed by the ML community, which results in a lack of accuracy in the tools. When data scientists or machine learning practitioners want to use these tools to learn how the models operate, they may face problems like misuse or over-trust. I do not think this is the users’ fault. Tools are designed to make tasks more convenient for users. If the tools confuse users, the developers should change the tools to give users a better experience. In this case, the authors suggested that members of the HCI and ML communities should work together when developing the tools. This requires the members to leverage their respective strengths so that the tools let users understand the models easily while remaining user-friendly. Meanwhile, comprehensive instructions should be written to explain how users can use the tools to understand the models accurately and easily. As a result, the efficiency and accuracy of both the tools and the implementation of models would improve.

From the point of view of data scientists and machine learning practitioners, they should try to avoid over-trusting the tools. The tools cannot fully explain the models, and there may be mistakes. Users should always be critical of the tools instead of fully trusting them. They should read the instructions carefully and understand how to use the tools, what the tools are for, what the models are being used for, and how to use the models. If they think carefully when using these tools and models, instead of guessing the meaning of the tools’ results, the number of misuse and over-trust cases will decrease sharply.

Questions:

  1. How would you design the proposed interactive interpretability tool? What kinds of interactions should be included?
  2. How can a tool be designed so that users can conveniently dig into the models, instead of using the models without knowing how they work?
  3. How can tools be designed to make the most of users’ mental models?
