03/04/2020 – Vikram Mohanty – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Authors: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris

Summary

This paper studies how crowdsourcing can be used to evaluate automated approaches for generating alt-text captions for BVI (Blind or Visually Impaired) users on social media, and proposes an effective real-time crowdsourcing workflow to assist BVI users in interpreting captions. The paper shows that the shortcomings of existing AI image-captioning systems frequently mislead a user’s understanding of an image they cannot see, to the point where even clarifying conversations with sighted assistants cannot correct the error. The paper concludes with a detailed set of guidelines for future iterations of AI captioning systems.
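
As a concrete picture of that workflow, here is a minimal Python sketch of the conversational refinement loop: the AI caption seeds the exchange, and sighted crowd workers answer the user’s clarifying questions while updating the caption. The names refine_caption and ask_crowd are hypothetical stand-ins for illustration, not the authors’ implementation.

```python
# A minimal sketch of the conversational refinement loop, assuming a
# crowd backend reachable through ask_crowd (hypothetical, not the
# authors' implementation).

def refine_caption(image, ai_caption, ask_crowd, max_rounds=3):
    """Seed the conversation with the AI caption, then let the user's
    clarifying questions drive crowd-powered refinements."""
    caption = ai_caption
    history = [("system", caption)]
    for _ in range(max_rounds):
        question = input(f"Caption so far: {caption}\nQuestion (blank to stop): ")
        if not question.strip():
            break
        # ask_crowd is assumed to return an (answer, improved_caption) pair
        answer, caption = ask_crowd(image, caption, question, history)
        history.append(("user", question))
        history.append(("crowd", answer))
    return caption, history
```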

Reflection

This paper is another example of people working with imperfect AI. Here, the imperfection stems not from a failure to collect meaningful datasets, but from building algorithms on constrained datasets without foresight of the application, i.e., alt-text for BVI users. The paper demonstrates a successful crowdsourcing workflow that augments the AI’s suggestions, and it serves as motivation for other HCI researchers to design workflows that integrate the strengths of interfaces, crowds, and AI.

The paper reports an interesting finding: the simulated BVI users found it easier to generate a caption from scratch than to start from the AI’s suggestion. This shows how an AI’s suggestion can bias a user’s mental model in the wrong direction, from where recovery can be costlier than having had no suggestion in the first place. This once again stresses the need to include real-world scenarios and users in the evaluation workflow.

The solution proposed here is bottlenecked by the challenges of real-time deployment with crowd workers. Despite that, the paper makes an interesting contribution in the form of guidelines for future iterations of AI captioning systems. Involving potential end-users and setting systematic goals for an AI to achieve is a desirable practice in the long run.

Questions

  1. Why do you think people preferred to generate captions from scratch rather than to refine the AI’s suggestions? 
  2. Do you ever re-initialize a system’s data/suggestions/recommendations to start from a blank state? Why or why not? 
  3. If you have worked with an imperfect AI (which is more than likely), how would you mitigate its shortcomings if you were tasked with redesigning the client app? 

03/04/2020 – Sushmethaa Muhundan – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary

The paper proposes a resource-allocation framework that intelligently distributes work between humans and an AI system in the context of foreground object segmentation. The study demonstrates the advantage of using a mix of humans and AI over either alone. The goal is to produce high-quality object segmentation results while requiring considerably less human effort. Two systems are implemented as part of this paper; both automatically decide when to transfer control between the human and the AI component, depending on the segmentation quality encountered at each phase. The first system reduces human annotation effort by having computers generate the coarse object segmentations, which are then refined by segmentation tools. The second system predicts the quality of the resulting annotations and automatically identifies the subset that needs to be re-annotated by humans. Three diverse datasets were used to train and validate the system, representing visible, phase contrast microscopy, and fluorescence microscopy images.
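
The second system’s triage step can be read as ranking automatic masks by a predicted quality score and spending a fixed human budget on the worst ones. Below is a minimal Python sketch under that assumption; predict_quality and human_budget are hypothetical placeholders, not the paper’s actual model or interface.

```python
import numpy as np

def allocate_annotations(masks, predict_quality, human_budget):
    """Send only the worst-predicted automatic segmentations to humans.

    masks: candidate foreground masks from the automatic segmenter.
    predict_quality: a scoring model (higher = better); assumed here.
    human_budget: number of images humans will re-annotate.
    """
    scores = np.array([predict_quality(m) for m in masks])
    # Indices of the lowest-scoring masks, i.e. the predicted failures
    worst = set(np.argsort(scores)[:human_budget].tolist())
    return {"human": sorted(worst),
            "computer": [i for i in range(len(masks)) if i not in worst]}
```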

Reflection

The paper explores leveraging the complementary strengths of humans and AI, allocating work between them in order to reduce human involvement. I particularly liked the focus on quality throughout the paper. The mixed approach ensures that the system matches the quality of traditional systems that relied heavily on human involvement. The resulting system was able to cut significant hours of human effort while maintaining the quality of the foreground object segmentations, which is great.

Another aspect of the paper that I found impressive was the conscious effort to develop a single prediction model applicable across different domains; the three diverse datasets were employed as part of this effort. The paper discusses the disadvantage of systems that do not work well across multiple datasets: in such cases, only a domain expert or computer vision expert can predict when the system will succeed, a problem this paper claims to avoid altogether. Also, the decision to involve humans only once per image is a good one, as opposed to existing systems where human effort is required multiple times during the initial segmentation of each image.

Questions

  1. This paper primarily focuses on reducing human involvement in the context of foreground object segmentation. What other applications could adopt the principles of this system to reduce human involvement in the loop while ensuring that quality is not affected?
  2. The system predicts the quality of image-segmentation outputs and involves humans to re-annotate only the lowest-quality ones. What other ideas could be employed to reduce human effort in such a system?
  3. The paper implies that the system proposed can be applied across images from multiple domains. Were the three datasets described varied enough to ensure that this is a generalized solution?

03/04/2020 – Nurendra Choudhary – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

In this paper, the authors present a crowdsourcing method that uses Amazon Mechanical Turk workers to identify accessibility issues in Google Street View images. They collect annotations at two levels: image-level and pixel-level. They evaluate intra- and inter-annotator agreement and conclude that the labels reach an accuracy feasible for real-world scenarios: 81%, increased to 93% with minor quality-control additions.

The authors open the paper with a discussion of the necessity of such approaches; the solution could lead to more accessibility-aware systems. The paper uses precision, recall, and F1-score to consolidate and evaluate image-level annotations. For pixel-level annotations, the authors use two sets of evaluation metrics: overlap between annotated pixels, and precision-recall scores. The experiments show an inter-annotator agreement that makes the system feasible in real-world scenarios. The authors also use majority voting between annotators to improve accuracy further.
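
To make those pixel-level metrics concrete, here is a minimal Python sketch that computes overlap (intersection-over-union) along with precision, recall, and F1 between two binary annotation masks. The function name and the choice of mask_b as the reference are illustrative assumptions, not the paper’s code.

```python
import numpy as np

def pixel_agreement(mask_a, mask_b):
    """Overlap (IoU), precision, recall, and F1 between two binary
    pixel-level masks, treating mask_b as the reference annotation."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    overlap = inter / union if union else 1.0
    precision = inter / a.sum() if a.sum() else 1.0
    recall = inter / b.sum() if b.sum() else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"overlap": overlap, "precision": precision,
            "recall": recall, "f1": f1}
```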

Reflection

The paper introduces an interesting approach to using crowdsourced annotations on static image databases. This leads me to wonder about cheaper sources of images for this purpose. For example, Google Maps provides a more frequently updated set of images, and acquiring them is more cost-effective, so I think they would be a better alternative to Street View images.

Additionally, the paper adopts majority voting to improve its results. In theory, this should push accuracy close to perfect, yet the method reaches 93% after the addition. I would like to see examples where the method fails, as this would enable the development of better collation strategies in the future. I understand that in some cases the image might simply be too unclear, but examples of such failures would give us more data with which to improve the strategies.
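
For reference, the aggregation step being discussed can be as simple as the sketch below, which collapses several workers’ image-level labels into one decision per image by majority vote; the tie-handling ("unsure") is my assumption, not the authors’ rule.

```python
from collections import Counter

def majority_vote(labels_per_image):
    """Collapse several workers' image-level labels into one decision
    per image by simple majority; ties fall back to 'unsure' (an
    assumption for illustration, not the paper's rule)."""
    results = {}
    for image_id, labels in labels_per_image.items():
        (top, count), *rest = Counter(labels).most_common()
        tied = bool(rest) and rest[0][1] == count
        results[image_id] = "unsure" if tied else top
    return results

# Example: three workers label one image, two workers split on another
print(majority_vote({"img1": ["curb ramp missing"] * 2 + ["no problem"],
                     "img2": ["obstacle", "no problem"]}))
```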

Also, the images contain much more data than is currently being collected. We could build an interpretable representation that captures all of the world information contained in these images, although the computational cost and validity of such a representation are still questionable. If we are able to build better information systems from them, such representations might enable a huge leap forward in AI research (similar to ImageNet). We could also combine this data to build a profile of any place, one that helps users who want to visit it in the future (e.g., the accessibility of restaurants or schools). Furthermore, given the time-sensitivity of accessibility, I think a dynamic model would be better than the proposed static approach. However, this would require a cheaper method of acquiring street-view data, so we need to look for alternative sources that provide comparable performance while limiting expenses.

Questions

  1. How well does this method generalize? Can it be applied to any static image database? The paper focuses on accessibility issues; could it be extended to other issues such as road repairs and emergency lane systems?
  2. Street View data collection requires significant effort and is expensive. Could we use Google Maps imagery to achieve reasonable results? What is a possible limitation of applying the same approach to Google satellite imagery?
  3. What about the time-sensitivity of the approach? How would it track real-time changes? Does the approach require constant monitoring?
  4. The images contain much more information than is currently used. How can we exploit it? Can we use it to detect infrastructural issues with government services such as parks, schools, and roads? 
