03/04/2020 – Sushmethaa Muhundan – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

The popularity of social media has grown exponentially over the past few decades, and with it has come a flood of image content. Amidst this growth, people who are blind or visually impaired (BVI) often find it extremely difficult to understand such content. Although existing solutions offer limited capabilities to caption images and provide alternative text, these are often insufficient and, when inaccurate, negatively impact the experience of BVI users. This paper aims to improve the experience of BVI users by combining crowd input with existing automated captioning approaches. As part of the experiments, several workflows with varying degrees of human and automated involvement were designed and evaluated. The four frameworks introduced in this study are a fully automated captioning workflow, a human-corrected captioning workflow, a conversational assistant workflow, and a structured Q&A workflow. The authors observed that although the workflows involving humans in the loop were time-consuming, they increased user satisfaction by providing accurate descriptions of the images.

I really liked the paper's focus on improving the experience of blind or visually impaired users on social media and on ensuring that accurate image descriptions are provided so that BVI users understand the context. The paper explores innovative ways of leveraging humans in the loop to address this pervasive issue.

The particular platform targeted here, social media, comes with its own challenges: it is a setting where the context and emotions of an image matter as much as the literal description in giving BVI users enough information to understand the post. Another aspect I found interesting was the focus on scalability, which is extremely important in a social media setting.

I found the TweetTalk conversational workflow and the Structured Q&A workflow interesting because they offer a mixed approach, involving humans in the loop whenever necessary. The intent of the conversational workflow is to understand the aspects that make a caption valuable to a BVI user. I felt that this fundamental understanding is essential for building further systems that ensure user satisfaction.

It was good to see that the sample tweets were chosen from broad topic areas representing the various interests reported by blind users. An interesting insight from the study was that users preferred no caption over an inaccurate one, to avoid the cost of recovering from a misinterpretation based on an inaccurate caption.

  1. Despite being validated by 7 BVI people, the study largely involved simulating a BVI user’s behavior. Do the observations hold for scenarios with actual BVI users, or is the problem not fully captured by these simulations?
  2. Apart from the two new workflows introduced in this paper, what other techniques could improve the captioning of images on social media so that the captions capture the essence of the post?
  3. Besides social media, what other applications or platforms pose similar challenges from the perspective of BVI users? Can the workflows introduced in this paper be used to solve those problems as well?


3/4/20 – Jooyoung Whang – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

In this paper, the authors study the effectiveness of vision-to-language systems for automatically generating alt text for images and the impact of adding a human in the loop to this task. The authors set up four methods for generating alt text. The first is a simple implementation of modern vision-to-language alt-text generation. The second is a human-adjusted version of the first method. The third method is a more involved one, where a blind or visually impaired (BVI) user chats with a non-BVI user to gain more context about an image. The final method is a generalized version of the third, where the authors analyzed the patterns of questions asked during the third method to form a structured set of pre-defined questions that a crowdsource worker can answer directly without needing a lengthy conversation. The authors conclude that current vision-to-language techniques can, in fact, harm context understanding for BVI users, and that simple human-in-the-loop methods significantly outperform them. They also found that the structured-questions method worked the best.
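To make the four conditions concrete, here is a minimal Python sketch of how a single alt-text request might be routed through them. Every helper function here (`v2l_caption`, `crowd_correct`, `tweettalk_session`, `structured_qa`) is a made-up stub standing in for a component described in the paper, not the authors' actual implementation.

```python
# Hypothetical sketch of the four alt-text workflows compared in the paper.
# All helpers are stubs invented for illustration; they only mimic the roles of
# (1) a vision-to-language model, (2) crowd correction of its caption,
# (3) a live BVI-sighted conversation, and (4) asynchronous structured Q&A.

def v2l_caption(image_url: str) -> str:
    # Stub for an automated vision-to-language caption.
    return "a group of people standing in a room"

def crowd_correct(caption: str, image_url: str, tweet_text: str) -> str:
    # Stub: a crowd worker edits the automated caption for accuracy.
    return caption + " at a birthday party"

def tweettalk_session(image_url: str, tweet_text: str) -> list[str]:
    # Stub: transcript of a conversation between a BVI user and a sighted worker.
    return ["Q: What are the people doing?", "A: They are singing around a cake."]

def structured_qa(image_url: str, tweet_text: str) -> dict[str, str]:
    # Stub: answers to a fixed question set distilled from TweetTalk conversations.
    return {"setting": "a living room", "subjects": "four friends", "activity": "singing"}

def generate_alt_text(image_url: str, tweet_text: str, workflow: str) -> str:
    if workflow == "v2l":                # 1. fully automated baseline
        return v2l_caption(image_url)
    if workflow == "human_corrected":    # 2. human-adjusted automated caption
        return crowd_correct(v2l_caption(image_url), image_url, tweet_text)
    if workflow == "tweettalk":          # 3. conversational assistant
        return " ".join(tweettalk_session(image_url, tweet_text))
    if workflow == "structured_qa":      # 4. structured question answering
        answers = structured_qa(image_url, tweet_text)
        return "; ".join(f"{k}: {v}" for k, v in answers.items())
    raise ValueError(f"unknown workflow: {workflow}")

print(generate_alt_text("https://example.com/img.jpg", "Best night ever!", "structured_qa"))
```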

This was an interesting study that implicitly pointed out the limitation of computers in understanding social context, which is a human affordance. The authors stated that the results of a vision-to-language system often confused users because the system did not get the point of the image. This made me wonder whether this limitation could be overcome in the future.

I was also concerned about whether the authors’ proposed methods are even practical. Sure, the human-in-the-loop method involving MTurk workers greatly enhanced the description of a Twitter image, but based on their report, it takes too long to retrieve the description. The paper reports that answering one of the structured questions takes, on average, one minute, and that excludes the time it takes for an MTurk worker to accept a HIT. The authors suggested pre-generating alt text for popular tweets, but this does not completely solve the problem.
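To make this practicality concern concrete, here is a rough back-of-the-envelope latency estimate in Python. Only the one-minute-per-question figure comes from the paper; the question count and the HIT-acceptance delay are assumptions chosen purely for illustration.

```python
# Rough end-to-end latency estimate for the structured Q&A workflow.
SECONDS_PER_QUESTION = 60   # from the paper: ~1 minute per structured question on average
NUM_QUESTIONS = 5           # assumption: size of the structured question set
HIT_ACCEPT_DELAY = 120      # assumption: time for an MTurk worker to accept the HIT (seconds)

sequential = HIT_ACCEPT_DELAY + NUM_QUESTIONS * SECONDS_PER_QUESTION
parallel = HIT_ACCEPT_DELAY + SECONDS_PER_QUESTION  # each question sent to a different worker

print(f"sequential: ~{sequential / 60:.0f} min, parallel: ~{parallel / 60:.0f} min")
# Even in the optimistic parallel case the BVI user waits minutes per image,
# which is why pre-generating alt text for popular tweets only partly helps.
```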

I was also skeptical about the way the authors performed validation with the 7 BVI users. In their validation, they simulated their third method (TweetTalk, a conversation between BVI and sighted users). However, they did not do it using their application, but rather through a face-to-face conversation between the researchers and the participants. The authors claimed that they tried to replicate the environment as much as possible, but I think there can still be flaws, since the researchers serving as the sighted user already had expert knowledge about their experiment. Also, as stated in the paper’s limitations section, the validation was performed with too few participants, so it may not fully capture BVI users’ behaviors.

These are the questions that I had while reading this paper:

1. Do you think the authors’ proposed methods are actually practical? What could be done to make them practical if you don’t think so?

2. What do you think were the human affordances needed for the human element of this experiment other than social awareness?

3. Do you think the authors’ validation with the BVI users is sound? Also, the validation was only done for the third method. How can the validation be done for the rest of the methods?


03/04/20 – Akshita Jha – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Summary:
“Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind” by Salisbury et al. addresses the important problem of accessibility. The authors discuss the challenges that arise from an automatic image captioning system and how its imperfections may hinder a blind person’s understanding of social media posts with embedded imagery. The authors use mixed methods to evaluate and subsequently modify the captions generated by the automated system for images embedded in social media posts, studying how crowdsourcing can enhance existing workflows and provide scalable and useful alt text for the blind. The imperfections of current automated captioning systems hinder the user’s understanding of an image. The authors do a detailed analysis of the conversations they collected to design user-friendly experiences that can effectively assist blind users. The authors focus on three research questions: (i) What value is provided by a state-of-the-art vision-to-language API in assisting BVI users, and what are the areas for improvement? (ii) What are the trade-offs between alternative workflows for the crowd assisting BVI users? (iii) Can human-in-the-loop workflows result in reusable content that can be shared with other BVI users? The authors study varying levels of human engagement and automation to arrive at a final system that better captures the requirements for creating good-quality alt text for blind and visually impaired users.

Reflections:
This is an interesting work, as it addresses the often-ignored problem of accessibility. The authors focus on images embedded in social media posts. Most of the time, the automatic captions produced by a system trained with machine learning are inadequate and non-descriptive. This might not be much of a problem for day-to-day users, but it can be a huge challenge for blind people. The authors' analysis is thoughtful and keeps accessibility in mind. They validate their approach by running a follow-up study with seven blind and visually impaired users, who were asked to compare the uncorrected vision-to-language caption with the alt text provided by their system. The findings showed that blind and visually impaired users would prefer the conversational system designed by the authors to better understand the images. However, it would have been more helpful if the authors had gathered feedback from the target user group while developing the system instead of only asking users to test it. Also, the tweets used by the authors might not be representative of the kinds of tweets in the target users’ timelines.

Questions:
1. What do you think about the approach taken by the authors to generate the alt text?
2. Would it have been helpful to conduct a survey to understand the needs of blind and visually impaired users before developing the system?
3. Don’t you think using a conversational agent to understand the images embedded in tweets is too cumbersome and time-consuming?


03/04/2020 – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind – Yuhang Liu

Summary:

The authors of this paper observe that visually impaired users are limited by the availability of suitable alternative text when accessing images on social media, and that the benefit of new tools that automatically generate captions for blind users is unknown. Through experiments, the authors study how crowdsourcing can be used to evaluate the value provided by existing automated methods and to provide a scalable and useful alternative-text workflow for blind users. Using real-time crowdsourcing, the authors designed crowd-interaction experiments that vary the depth of interaction, which helps explain the shortcomings of existing methods. The experiments show that the shortcomings of existing AI image captioning systems often prevent users from understanding the images they cannot see, and that some conversations can even produce erroneous results, which greatly affects the user experience. The authors carried out a detailed analysis and designed a scalable workflow that asks crowd workers to help improve the displayed content and can effectively help users without real-time interaction.

Reflection:

First of all, I very much agree with the authors’ approach. In a society where social networks play an increasingly important role, we really should strive to make social media serve more people, especially the disadvantaged groups in our lives. Daily travel is inconvenient for blind people, and social media is their main way of understanding the world, so designing such a system would be a very good idea if it can help them.

Secondly, the authors used crowdsourcing to study the existing methods, and the method they designed is effective: as a relatively cheap source of human labor, crowdsourcing can test a large number of systems in a short time. However, I think this method also has some limitations. It may be difficult for crowd workers to think about the problem from the perspective of a blind person, so their ideas, although similar to those of blind users, may not be very accurate, leaving a gap between their results and those of actual blind users.

Finally, I have some doubts about the system proposed by the authors. They ultimately propose a workflow that combines different levels of automation and human participation, which means the interaction requires the participation of another person. This has some disadvantages: it introduces delay, and because it requires additional human labor, it may also require blind users to pay more. I think the ultimate direction of development should be to remove the human constraint, so we could compare the workers’ results with the original results and use the crowdsourced answers as training data for machine learning. I think this could reduce the cost of the system while increasing its efficiency, providing faster and better service for more blind users.

Question:

  1. Do you think there is a better way to implement these functions, such as learning from the workers’ answers to achieve a completely automatic captioning system?
  2. Are there some disadvantages to using crowdsourcing platforms?
  3. Would converting the text to speech be better for visually impaired users?


03/04/20 – Lulwah AlKulaib – SocialAltText

Summary

The authors propose a system that generates alt text for images embedded in social media posts by utilizing crowd workers. Their goal is a better experience for blind and visually impaired (BVI) users of social media. Existing tools provide imperfect descriptions, some through automatic caption generation and others through object recognition. These systems are not enough, as in many cases their results aren’t descriptive enough for BVI users. The authors study how crowdsourcing can be used for both:

  • evaluating the value provided by existing automated approaches
  • enabling workflows that provide scalable and useful alt text for BVI users

They utilize real-time crowdsourcing to run experiments that vary the depth of the crowd’s interaction in assisting visually impaired users. They show the shortcomings of existing AI image captioning systems and compare them with their method. The paper proposes two experiences:

  • TweetTalk: a conversational assistant workflow.
  • Structured Q&A: a workflow that builds upon and enhances state-of-the-art generated captions.

They evaluated the conversational assistant with 235 crowdworkers. They evaluated 85 tweets for the baseline image caption; each tweet was evaluated 3 times, for a total of 255 evaluations.
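For readers unfamiliar with the Structured Q&A idea, the sketch below shows one way such a workflow could be represented in code. The questions and the worker answers are invented for illustration and are not the authors' actual question set.

```python
# Hypothetical structured question set, loosely inspired by the paper's idea of
# distilling recurring TweetTalk questions into a fixed form for crowd workers.
STRUCTURED_QUESTIONS = [
    "Where was this photo taken?",
    "Who or what is the main subject?",
    "What is the subject doing?",
    "What mood does the image convey?",
]

def build_alt_text(answers: dict[str, str]) -> str:
    """Combine worker answers into a single alt-text string, in question order."""
    return " ".join(answers.get(q, "") for q in STRUCTURED_QUESTIONS).strip()

# Example worker answers (made up):
answers = {
    STRUCTURED_QUESTIONS[0]: "At a beach at sunset.",
    STRUCTURED_QUESTIONS[1]: "Two people and a dog.",
    STRUCTURED_QUESTIONS[2]: "They are walking along the shore.",
    STRUCTURED_QUESTIONS[3]: "The scene feels calm and warm.",
}
print(build_alt_text(answers))
```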

Reflection

The paper presents a novel concept, and the approach is a different take on utilizing crowdworkers. I believe the experiment would have worked better if it had been tested on some visually impaired users. Since the crowdworkers hired were not visually impaired, it is harder to say that BVI users would have the same reaction. Since the study targets BVI users, they should have been the pool of testers. People interact with the same element in different ways, and what the authors showed seemed too controlled. Also, the questions were not all the same across images, which makes the results harder to generalize. The presented model tries to solve a problem for social media photos, and not having a repeatable plan for each photo might make interpreting images difficult.

I appreciated the authors’ use of existing systems and their attempt at improving the AI-generated captions. Their results achieve better accuracy compared to state-of-the-art work.

I would have loved to see how different social media applications measure up against each other, since applications vary in how they present photos. Twitter, for example, allows a limited character count, while Facebook can present more text, which might help BVI users understand the image better.

In the limitations section, the authors mention that human-in-the-loop workflows raise privacy concerns and that the alt-text approach could generalize to friendsourcing and utilizing social network users. I wonder how that generalizes to social media applications in real time, and how reliable friendsourcing would be for BVI users.

Discussion

  • What improvements would you suggest to make the TweetTalk experiment better?
  • Do you know of any applications that use a human in the loop in real time?
  • Would you have any privacy concerns if one of the social media applications integrated a human-in-the-loop approach to help BVI users?


03/04/20 – Sukrit Venkatagiri – Toward Scalable Social Alt Text

Paper: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind. In Fifth AAAI Conference on Human Computation and Crowdsourcing.

Summary:
This paper explores a variety of approaches for supporting blind and visually impaired (BVI) people with alt-text captions. The authors consider two baseline methods that use existing computer vision approaches: vision-to-language (V2L) and human-corrected captions (HCC). They also consider two workflows that do not depend on CV approaches: the TweetTalk conversational workflow and the Structured Q&A workflow. Based on the questions asked during TweetTalk, they generated a set of structured questions to be used in the Structured Q&A workflow. They found that V2L performed the worst, and that overall, any approach with CV as a baseline did not perform well. The TweetTalk conversational approach is more generalizable, but it is difficult to recruit workers for it. Finally, they conducted a study of TweetTalk with 7 BVI people and learned that they found it potentially useful. The authors discuss their findings in relation to prior work, as well as the tradeoffs between human-only and AI-only systems, paid versus volunteer work, and conversational assistants versus structured Q&A. They also extensively discuss the limitations of this work.

Reflection:
Overall, I really liked this paper and found it very interesting. Their multiple approaches to evaluating human-AI collaboration were interesting (AI alone, human-corrected, human chat, asynchronous human answers), in addition to the quality-perception ratings obtained from third-party workers. I think this paper makes a strong contribution, but I wish the authors could go into more detail to clarify exactly how the system worked, the different experimental setups, and any other interesting findings. Sadly, there is an 8-page limit, which may have prevented them from going into more detail.

I appreciate the fact that they built on and used prior work in this paper, namely MacLeod et al. 2017, Mao et al. 2012, and Microsoft’s Cognitive Services API. This way, they did not need to build their own database, CV algorithms, or real-time crowdworker recruiting system. Instead, it allowed them to focus on more high-level goals.
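For context, a baseline caption from Microsoft's Cognitive Services (now Azure Computer Vision) can be requested with a single REST call. The sketch below is only an approximation: the endpoint path, API version, and response fields are assumptions based on the public describe-image operation and may differ from what the authors used in 2017.

```python
# Hedged sketch of fetching an automated caption from the Computer Vision
# "describe" operation. Resource URL and key are placeholders; the exact path
# and response shape should be checked against current API documentation.
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
KEY = "<subscription-key>"                                        # placeholder

def describe_image(image_url: str) -> str:
    resp = requests.post(
        f"{ENDPOINT}/vision/v3.2/describe",              # version is an assumption
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={"url": image_url},
        timeout=10,
    )
    resp.raise_for_status()
    captions = resp.json().get("description", {}).get("captions", [])
    return captions[0]["text"] if captions else "(no caption returned)"
```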

Their findings were interesting, especially the fact that human-corrected CV descriptions performed poorly. It is unclear how satisfaction differed between the various conditions for first-party ratings; it may be because users gained context through the conversation, but that context was not reflected in their ratings. The results also show that current V2L systems have worse accuracy than human-in-the-loop approaches. Sadly, there was no significant difference in accuracy between HCC and descriptions generated after TweetTalk, but SQA improved significantly.

Finally, the validation with BVI users is welcome, and I believe more Human-AI work needs to actually work with real users. I wonder how the findings might differ if they were used in a real, social context, or with people on MTurk instead of the researchers-as-workers.

Overall, this was a great paper to read, and I hope others build on this work, similar to how the authors here have directly leveraged prior work to advance our understanding of human-AI collaboration for alt-text generation.

Questions:

  1. Are there any better human-AI workflows that might be used that the authors did not consider? How would they work and why would they be better?
  2. What limitations of CV led to the paper's finding that any approach involving CV performed poorly?
  3. How would you validate this system in the real world?
  4. What are some other next steps for improving the state of the art in alt-text generation?


03/04/2020 – Vikram Mohanty – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Authors: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris

Summary

This paper studies how crowdsourcing can be used to evaluate automated approaches for generating alt-text captions for BVI (blind or visually impaired) users on social media. Further, the paper proposes an effective real-time crowdsourcing workflow to assist BVI users in interpreting captions. The paper shows that the shortcomings of existing AI image captioning systems frequently hinder a user’s understanding of an image they cannot see, often to the extent that clarifying conversations with sighted assistants cannot correct the misunderstanding. The paper finally proposes a detailed set of guidelines for future iterations of AI captioning systems.

Reflection

This paper is another example of people working with imperfect AI. Here, the imperfection is not the result of failing to collect meaningful datasets, but of building algorithms from constrained datasets without foresight of the application, i.e., alt text for BVI users. The paper demonstrates a successful crowdsourcing workflow that augments the AI’s suggestion, and it serves as motivation for other HCI researchers to design workflows that integrate the strengths of interfaces, crowds, and AI together.

The paper reports an interesting finding: the simulated BVI users found it easier to generate a caption from scratch than from the AI’s suggestion. This shows how the AI’s suggestion can bias a user’s mental model in the wrong direction, from which recovery might be costlier than having no suggestion in the first place. This once again stresses the need to consider real-world scenarios and users in the evaluation workflow.

The solution proposed here is bottlenecked by the challenges of real-time deployment with crowd workers. Despite that, the paper makes an interesting contribution in the form of guidelines essential for future iterations of AI captioning systems. Involving potential end users and proposing systematic goals for an AI to achieve is a desirable goal in the long run.

Questions

  1. Why do you think people preferred to generate the captions from scratch rather than from the AI’s suggestions? 
  2. Do you ever re-initialize a system’s data/suggestions/recommendations to start from blank? Why or why not? 
  3. If you worked with an imperfect AI (which is more than likely), how would you envision mitigating its shortcomings when tasked with redesigning the client app?
