VQA: Visual Question Answering

Paper: S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual Question Answering,” arXiv preprint arXiv:1505.00468, 2015.

Summary

This paper presents a dataset for evaluating AI algorithms involving computer vision, natural language processing, and knowledge representation and reasoning. Many tasks that we once thought would be difficult for an AI algorithm to perform, such as image labeling, have now become fairly commonplace. Thus, the authors hope to help push the boundaries of AI by providing a dataset for a more complex problem that combines multiple disciplines, which they name Visual Question Answering. Others can then use this dataset to test algorithms that attempt to solve this highly complex task.

The authors use around 200,000 images from the Microsoft Common Objects in Context dataset, along with 50,000 abstract scene images created by the authors. For each image, they collected three open-ended questions from crowdworkers, and for each question they collected ten answers from unique crowdworkers. An accuracy measure was then applied to each answer: if at least three of the collected responses to a question were identical to a given answer, that answer was deemed 100% accurate. This concluded the data collection for generating the dataset, after which they used crowdworkers to evaluate the complexity of the questions received.
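To make that accuracy rule concrete, here is a minimal sketch in Python. The function name and example data are mine, but the rule itself (full credit once at least three of the ten human answers agree with a response, partial credit otherwise) follows the description above.

```python
def vqa_accuracy(candidate_answer, human_answers):
    """Score one answer against the ten crowdsourced answers for a question.

    An answer earns full credit if at least three of the human answers
    match it exactly; otherwise credit scales down proportionally.
    """
    matches = sum(1 for a in human_answers if a == candidate_answer)
    return min(matches / 3.0, 1.0)


# Hypothetical example: "2" matches 4 of the 10 workers, so it scores 1.0;
# "3" matches only 1 worker, so it scores about 0.33.
answers = ["2", "2", "2", "2", "3", "two", "2 dogs", "two", "4", "2 "]
print(vqa_accuracy("2", answers))  # 1.0
print(vqa_accuracy("3", answers))  # 0.333...
```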

Three criteria were examined to justify the complexity of these questions: whether the image is necessary to answer the question, whether the question requires any commonsense knowledge not available in the image, and how well the question can be answered from the caption alone rather than the actual image. The supporting studies showed that the questions generally require the image to answer, that a fair number require some form of common sense, and that the questions are answered significantly better with access to the image than with access to the captions alone. Finally, the authors tested various algorithms against this dataset and found that current algorithms still significantly underperform humans. This suggests the dataset can meaningfully test the abilities of a new, more complex class of AI algorithms.

Reflection

While the purpose of this paper is focused on artificial intelligence algorithms, a large portion of it involves crowd work. It is not specifically mentioned in the body of the paper, but from the description, the acknowledgements, and the figure captions you can tell that the question and answer data was collected on Amazon Mechanical Turk. This isn’t surprising given the vast amount of data collected (nearly ten million answers). It would be interesting to learn more about how the tasks were set up and how workers were compensated, but the crowdsourcing aspects are not the focus of the paper.

The part of the paper I thought was most relevant to our studies of crowd work was the discussion of how to elicit the best complex, open-ended questions about the pictures. The authors used three different prompts to try to get the best questions out of the crowdworkers: ask a question that a toddler, an alien, or a smart robot would not be able to answer. I thought it was very interesting that the smart robot prompt produced the best questions. This prompt is actually fairly close to reality, since the smart robot could simply be considered a stand-in for modern AI algorithms. Good questions are the ones that can stump those algorithms, i.e., the smart robot.

I was surprised that the authors chose to go with exact text matches for all of their metrics, especially given last week’s discussion of my project on image-comparison labeling. The paper mentions a couple of reasons for this, such as not wanting answers like “left” and “right” to be grouped together, and because current algorithms don’t do a good enough job of synonym matching for this type of task. It would be interesting to see whether the results would differ at all if synonym matching were used. Since exact matching was applied in all scenarios, however, adding synonym matching would theoretically not change the relative results.
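As a rough illustration of what exact matching after light text cleaning might look like, here is a small sketch. The specific normalization steps (lowercasing, stripping punctuation, dropping articles) are my assumptions about typical cleaning, not rules quoted from the paper.

```python
import re

# Hypothetical cleaning rules; the paper only indicates that some text
# processing precedes exact matching, so treat these specifics as assumptions.
ARTICLES = {"a", "an", "the"}


def normalize(answer):
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)  # strip punctuation
    words = [w for w in answer.split() if w not in ARTICLES]
    return " ".join(words)


def exact_match(predicted, human_answer):
    # Exact string comparison after cleaning: "left" and "right" stay
    # distinct, and synonyms like "couch" vs. "sofa" do not match.
    return normalize(predicted) == normalize(human_answer)


print(exact_match("A dog.", "dog"))   # True
print(exact_match("couch", "sofa"))   # False
```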

That being said, this was a very interesting article that aims to find human tasks that computers still have difficulty with. With every year that passes, this set of tasks gets smaller and smaller. This paper is actually trying to help the set shrink more quickly, by providing a way to test new AI algorithms for effectiveness. The workers may not know it, but in completing the tasks in this paper they were actually working toward making their own jobs obsolete.

Questions

  • How would you have set up these question- and answer-gathering tasks in terms of the number of items each worker performs per HIT? How do you find the right number of tasks per HIT before a worker should simply finish the HIT and accept another one?
  • Is it just a coincidence that the “smart robot” prompt performed the best, or do you think there’s a reason the version closest to the truth produced the best results (are crowdworkers perceptive enough to understand what questions are difficult for AI)?
  • What do you think about the decision to use exact text matching (after some text cleaning) instead of any kind of synonym matching?
  • How much longer are humans going to be able to come up with questions that are more difficult for computers to answer?
