Authors: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris
Summary
This paper studies how crowdsourcing can be used to evaluate automated approaches for generating alt-text captions for BVI (Blind or Visually Impaired) users on social media. Further, the paper proposes an effective real-time crowdsourcing workflow to assist BVI users in interpreting captions. The paper shows that the shortcomings of existing AI image captioning systems frequently hinder a user’s understanding of an image they cannot see, much to the extent that clarifying conversations with sighted assistants can’t even correct. The paper finally proposes a detailed set of guidelines for future iterations of AI captioning systems.
Reflection
This paper is another example of people working with imperfect AI. Here, the imperfect AI is a result of not relying on collecting meaningful datasets, but as a result of building algorithms from constrained datasets without having a foresight of the application i.e. alt-text for BVI users. The paper demonstrates a successful crowdsourcing workflow augmenting the AI’s suggestion, and serves as a motivation for other HCI researchers to think of design workflows that can integrate the strengths of interfaces, crowds and AI together.
The paper shows an interesting finding where the simulated BVI users found it easier to generate a caption from scratch than from the AI’s suggestion. This shows how the AI’s suggestion can bias a user’s mental model in the wrong direction, from where recovery might be costlier compared to no suggestion in the first place. This once again stresses the need for considering real-world scenarios and users in the evaluation workflow.
The solution proposed here is bottlenecked by the challenges presented by real-time deployment with crowd workers. Despite that, the paper makes an interesting contribution in the form of guidelines essential for future iteration of AI captioning systems. Involving potential end-users and proposing systematic goals for an AI to achieve is a desirable goal in the long-run.
Questions
- Why do you think people preferred to generate the captions from scratch rather than from the AI’s suggestions?
- Do you ever re-initialize a system’s data/suggestions/recommendations to start from blank? Why or why not?
- If you worked with an imperfect AI (which is more than likely), how do you envision mitigating the shortcomings when you are given the task to redesign the client app?