03/04/2020 – Fanglan Chen – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

Hara et al.’s paper “Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems” explores a crowdsourcing approach to locating and assessing sidewalk accessibility issues by labeling Google Street View (GSV) imagery. Traditional approaches to sidewalk assessment rely either on street audits, which are very labor-intensive and expensive, or on reports from citizens. The researchers propose their interactive user interface as an alternative that deals with the issue proactively. Specifically, they investigate the viability of labeling sidewalk issues among two groups of diligent and motivated labelers (Study 1) and then explore the potential of relying on crowd workers to perform this labeling task, evaluating performance at different levels of labeling accuracy (Study 2). By investigating the viability of labeling across two groups (three members of the research team and three wheelchair users), Study 1 provides ground-truth labels for evaluating crowd workers’ performance and a baseline understanding of what labeling this dataset looks like. Study 2 explores the potential of using crowd workers to perform the labeling task; their performance is evaluated at both the image and pixel levels of labeling accuracy. The findings suggest that it is feasible to use crowdsourcing for both the labeling and verification tasks, which leads to final results of better quality.

Reflection

Overall, this paper proposes an interesting approach for sidewalk assessment. What I think about most is how feasibly it can be used to deal with real-world issues. In the scenario studied by the researchers, sidewalks in poor condition pose severe problems and relate to the larger issue of urban accessibility. The proposed crowdsourcing approach is novel. However, if we take a close look at the data source, we may question to what extent it can facilitate assessment in real time. It seems impossible to update the Google Street View (GSV) imagery on a daily basis, so the image sources are historical rather than reflective of the current conditions of road sidewalks.

I think image quality may be another big problem in this approach. Firstly, the resolution of GSV imagery is comparatively low, and images are sometimes captured under poor lighting conditions, which makes it challenging for crowd workers to make correct judgments. One possibility is to use existing machine learning models to enhance image quality by increasing resolution or adjusting brightness. That could be a good place to introduce machine learning algorithms to achieve better results on the task; a minimal preprocessing sketch follows below.
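As a rough illustration, the following sketch (assuming OpenCV is available; the function name and parameter choices are hypothetical) applies CLAHE contrast enhancement and bicubic upscaling. The upscaling is only a stand-in for a learned super-resolution model, which could be swapped in for better results.

```python
# Hypothetical preprocessing step for GSV images before showing them to
# crowd workers: boost contrast in dim panoramas, then upscale.
import cv2

def enhance_gsv_image(path, scale=2):
    img = cv2.imread(path)  # BGR image as loaded by OpenCV
    # Equalize lightness only, so colors are not distorted
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    # Bicubic upscaling; a learned super-resolution model would go here
    h, w = img.shape[:2]
    return cv2.resize(img, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)
```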

In addition, the focal point of the camera is another issue that may reduce the scalability of the project. The GSV imagery is not collected solely for sidewalk accessibility assessment, so it often contains a lot of noise (e.g., occluding objects). It would be interesting to conduct a study of what percentage of GSV imagery is of good quality with respect to the sidewalk assessment task.

Discussion

I think the following questions are worthy of further discussion.

  • Are there any other important accessibility issues that exist but are not considered in the study?
  • What improvements can you think of that the authors could make to their analysis?
  • What other potential human performance tasks can be explored by incorporating street view images?
  • How effectively do you think this approach can deal with urgent real-world problems?

03/04/2020 – Mohannad Al Ameedi – Real-Time Captioning by Groups of Non-Experts

Summary

In this paper, the authors propose a low-latency captioning solution for deaf and hard-of-hearing people that can work in a real-time setting. Although there are available solutions, they are either very expensive or of low quality. The proposed system allows people with hearing disabilities to request captioning at any time and get the result within a few seconds. The system depends on a combination of non-expert crowdsourcing workers and local staff to provide the captioning. Each request is handled by multiple people, and the result is a combination of all the participants’ input. The request is submitted as an audio stream, and the result is returned as text; a crowdsourcing platform is used to submit the request, and the result is retrieved within seconds. The proposed system uses an algorithm that works in a streaming manner, processing input as it is received and aggregating the results at the end. The system outperforms all other available options on both coverage and accuracy, and the proposed solution is feasible to apply in a production setting.

Reflection

I found the idea of real-time captioning very interesting. My understanding was that there is always latency when depending on crowdsourcing, so it could not be applied in real-world scenarios; it will be interesting to know how the system works as the number of users increases.

I also found the concept of multiple people working on the same audio stream and combining the results very interesting. Collecting captions from multiple people, figuring out what is unique and what is duplicated, and producing a final sentence, paragraph, or script is a challenging task.

This work is like multiple people working on one task, or multiple developers writing code to implement a single feature. Normally a supervisor or development lead would merge the results, but in this case the algorithm takes care of the merge, as the toy sketch below illustrates.
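As a toy illustration of such a merge (this is not the paper’s actual streaming alignment algorithm, and all names here are hypothetical), the sketch below greedily stitches each worker’s partial caption onto the growing transcript by finding the longest word overlap at the seam:

```python
# Toy caption merge: match the longest suffix of the merged transcript
# against a prefix of each new fragment, then append only the
# non-overlapping words.
def merge_captions(fragments):
    merged = fragments[0].split()
    for frag in fragments[1:]:
        words = frag.split()
        overlap = 0
        for k in range(min(len(merged), len(words)), 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        merged.extend(words[overlap:])
    return " ".join(merged)

# Two workers each caught an overlapping part of the same sentence
print(merge_captions(["the quick brown fox", "brown fox jumps over"]))
# -> "the quick brown fox jumps over"
```

A real system also has to handle typos, out-of-order arrival, and timing, which is what makes the merge genuinely hard.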

Questions

  • The authors measured the system on a limited number of users. Do you think the system will continue to outperform other methods if it gets deployed in a real-world setting?
  • Since live streaming is increasingly common at work, at school, and elsewhere, can we use the same concept to pass a URL and get instant captioning? What are the limitations of this approach?
  • What are the privacy concerns with this approach, especially if it gets used in the medical field? Normally a limited number of people are hired to help on such tasks, while crowdsourcing is open to a wide range of people.

03/04/2020 – Vikram Mohanty – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Authors: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris

Summary

This paper studies how crowdsourcing can be used to evaluate automated approaches for generating alt-text captions for BVI (Blind or Visually Impaired) users on social media. Further, the paper proposes an effective real-time crowdsourcing workflow to assist BVI users in interpreting captions. The paper shows that the shortcomings of existing AI image captioning systems frequently hinder a user’s understanding of an image they cannot see, to the extent that even clarifying conversations with sighted assistants cannot correct the misunderstanding. The paper finally proposes a detailed set of guidelines for future iterations of AI captioning systems.

Reflection

This paper is another example of people working with imperfect AI. Here, the imperfect AI is a result not of failing to collect meaningful datasets, but of building algorithms from constrained datasets without foresight of the application, i.e., alt-text for BVI users. The paper demonstrates a successful crowdsourcing workflow augmenting the AI’s suggestions, and it serves as motivation for other HCI researchers to design workflows that can integrate the strengths of interfaces, crowds, and AI together.

The paper shows an interesting finding: the simulated BVI users found it easier to generate a caption from scratch than from the AI’s suggestion. This shows how the AI’s suggestion can bias a user’s mental model in the wrong direction, from which recovery might be costlier than having no suggestion in the first place. This once again stresses the need to consider real-world scenarios and users in the evaluation workflow.

The solution proposed here is bottlenecked by the challenges of real-time deployment with crowd workers. Despite that, the paper makes an interesting contribution in the form of guidelines essential for future iterations of AI captioning systems. Involving potential end users and proposing systematic goals for an AI to achieve is a desirable goal in the long run.

Questions

  1. Why do you think people preferred to generate the captions from scratch rather than from the AI’s suggestions? 
  2. Do you ever re-initialize a system’s data/suggestions/recommendations to start from blank? Why or why not? 
  3. If you worked with an imperfect AI (which is more than likely), how do you envision mitigating the shortcomings when you are given the task to redesign the client app? 

03/04/2020 – Sushmethaa Muhundan – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary

The paper proposes a resource allocation framework that intelligently distributes work between humans and an AI system in the context of foreground object segmentation. The advantages of using a mix of both humans and AI, rather than either alone, are demonstrated via the study conducted. The goal is to ensure that high-quality object segmentation results are produced while involving considerably less human effort. Two systems are implemented as part of this paper that automatically decide when to transfer control from the human to the AI component and vice versa, depending on the quality of segmentation encountered at each phase. The first system eliminates the need for human annotation effort by using computers to generate coarse object segmentations, which are then refined by segmentation tools. The second system predicts the quality of the annotations and automatically identifies the subset that needs to be re-annotated by humans; a minimal sketch of this routing idea follows below. Three diverse datasets were used to train and validate the system, representing visible, phase contrast microscopy, and fluorescence microscopy images.
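To make the allocation idea concrete, here is a minimal sketch of the routing step, assuming a learned quality predictor along the lines the paper describes; the function and budget-based policy here are illustrative, not the authors’ exact implementation.

```python
# Hypothetical "pull the plug" routing: score each machine-generated
# segmentation with a learned quality predictor and send only the
# lowest-scoring images back to human annotators.
def allocate_annotations(segmentations, predict_quality, human_budget):
    """segmentations: list of (image_id, mask) pairs.
    predict_quality: learned model mapping a mask to a quality score.
    human_budget: how many images humans can afford to re-annotate."""
    ranked = sorted(segmentations, key=lambda item: predict_quality(item[1]))
    human_queue = ranked[:human_budget]   # worst predicted quality
    machine_kept = ranked[human_budget:]  # predicted good enough to keep
    return machine_kept, human_queue
```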

Reflection

The paper explores leveraging the complementary strengths of humans and AI, allocating resources so as to reduce human involvement. I particularly liked the focus on quality throughout the paper. This mixed-approach system ensures that the quality of traditional systems, which relied heavily on human involvement, is still met. The resulting system was able to cut significant hours of human effort while maintaining the quality of the resulting foreground object segmentation, which is great.

Another aspect of the paper that I found impressive was the conscious effort to develop a single prediction model that is applicable across different domains; three diverse datasets were employed as part of this initiative. The paper discusses the disadvantages of other systems that do not work well on multiple datasets: in such cases, only a domain expert or computer vision expert would be able to predict when the system would succeed. This paper claims that this problem is altogether avoided in its system. Also, the decision to intentionally involve humans only once per image is good, as opposed to existing systems where human effort is required multiple times during the initial segmentation phase of each image.

Questions

  1. This paper primarily focuses on reducing human involvement in the context of foreground object segmentation. What other applications could extend the principles of this system to reduce the involvement of humans in the loop while ensuring that quality is not affected?
  2. The system predicts the quality of image segmentation outputs and involves humans to re-annotate only the lowest-quality ones. What other ideas could be employed to reduce human effort in such a system?
  3. The paper implies that the proposed system can be applied across images from multiple domains. Were the three datasets described varied enough to ensure that this is a generalized solution?

03/04/2020 – Nurendra Choudhary – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

In this paper, the authors discuss a crowdsourcing method utilizing Amazon Mechanical Turk workers to identify accessibility issues in Google Street View images. They utilize two levels of annotation: image-level and pixel-level. They evaluate intra- and inter-annotator agreement and conclude that an accuracy of 81% (increased to 93% with minor quality control additions) is feasible for real-world scenarios.

The authors open the paper with a discussion of the necessity of such approaches; the solution could lead to more accessibility-aware systems. The paper uses precision, recall, and F1-score to consolidate and evaluate image-level annotations. For pixel-level annotations, the authors use two sets of evaluation metrics: overlap between annotated pixels and precision-recall scores. The experiments show a level of inter-annotator agreement that makes the system feasible in real-world scenarios. The authors also use majority voting between annotators to improve accuracy further; a sketch of these evaluation and aggregation steps appears below.
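For concreteness, here is a minimal sketch of the two evaluation styles mentioned above (assuming NumPy, with each annotation as a boolean mask where True marks a reported problem; the paper’s exact formulations may differ):

```python
import numpy as np

def pixel_precision_recall(pred_mask, true_mask):
    # Pixel-level agreement of one annotator's mask against ground truth
    tp = np.logical_and(pred_mask, true_mask).sum()
    precision = tp / max(pred_mask.sum(), 1)
    recall = tp / max(true_mask.sum(), 1)
    return precision, recall

def majority_vote(masks):
    # Keep only pixels marked by more than half of the annotators
    votes = np.sum(masks, axis=0)
    return votes > (len(masks) / 2)
```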

Reflection

The paper introduces an interesting approach to utilizing crowd-sourced annotations for static image databases. This leads me to wonder about other, cheaper sources of images that could be utilized for this purpose. For example, Google Maps provides a more frequently updated set of images, and acquiring those images is more cost-effective. I think this would be a better alternative to the street-view images.

Additionally, the paper adopts majority voting to improve its results. Intuitively, this should push accuracy close to perfect, yet the method reaches 93% after the addition. I would like to see examples where the method fails; this would enable the development of better collation strategies in the future. I understand that in some cases the image might simply be too unclear, but examples of such failures would give us more data to improve the strategies.

Also, the images contain much more data than is currently being collected. We could build an interpretable representation of such images that captures all the world information they contain, though the computational effectiveness and validity of this is still questionable. If we are able to build better information systems, such representations might enable a huge leap forward in AI research (similar to ImageNet). We could also combine this data to build a profile of any place so that it helps users who want to access it in the future (e.g., the accessibility of restaurants or schools). Furthermore, given the time-sensitivity of accessibility, I think a dynamic model would be better than the proposed static approach. However, this would require a cheaper method of acquiring street-view data. Hence, we need to look for alternative sources of data that may provide comparable performance while limiting expenses.

Questions

  1. How well does this method generalize? Can it be applied to any static image database? The paper focuses on accessibility issues; can it be extended to other issues such as road repairs and emergency lane systems?
  2. Street view data collection requires significant effort and is also expensive. Could we utilize Google Maps to achieve reasonable results? What is a possible limitation of applying the same approach to Google satellite imagery?
  3. What about the time sensitivity of the approach? How will it track real-time changes? Does this approach require constant monitoring?
  4. The images contain much more information. How can we exploit it? Can we use it to detect infrastructural issues with government services such as parks, schools, and roads?

