03/25/20 – Sukrit Venkatagiri – Evaluating Visual Conversational Agents

Paper: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating Visual Conversational Agents via Cooperative Human-AI Games. In Fifth AAAI Conference on Human Computation and Crowdsourcing. Retrieved January 22, 2020 from https://aaai.org/ocs/index.php/HCOMP/HCOMP17/paper/view/15936

Summary: This paper measures human-AI team performance and compares it to AI-AI performance. The authors use a game, called GuessWhich, to evaluate visual conversational agents (visual question-answering agents). GuessWhich requires humans and the AI system, ALICE, to converse with each other toward a shared goal.

In the game, there are two primary agents: one who asks questions (the questioner) and one who answers them (the answerer). The answerer is given a secret image drawn from a fixed pool of images; the questioner asks questions about that image and attempts to identify it within the pool. In the human-AI team, ALICE is the answerer and the human plays the questioner. Performance is measured by the number of questions needed to arrive at the correct image. A QuestionerBot, or QBot, can be used in place of the human, so that human-AI performance can be compared against AI-AI performance, that is, ALICE-human versus ALICE-QBot.
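To make the evaluation concrete, here is a minimal sketch (not the authors' code) of the GuessWhich loop as described above: the questioner asks rounds of questions, the pool of candidate images is re-ranked after each answer, and team performance is the number of rounds before the secret image is ranked first. The `ask`, `answer`, and `rerank` functions are hypothetical stand-ins for the questioner, the answerer (e.g. ALICE), and the guessing step.

```python
def play_guesswhich(pool, secret, ask, answer, rerank, max_rounds=9):
    """Return the number of question rounds until the secret image tops
    the ranking, or max_rounds if it never does."""
    dialog = []
    for round_num in range(1, max_rounds + 1):
        q = ask(dialog)                 # questioner produces a question
        a = answer(secret, q)           # answerer responds about the secret image
        dialog.append((q, a))
        ranking = rerank(pool, dialog)  # guess: order the pool by likelihood
        if ranking[0] == secret:
            return round_num
    return max_rounds

# Toy usage: a questioner that cycles through attributes, an oracle
# answerer, and a reranker that scores candidates by how many answers
# so far they are consistent with.
pool = [{"id": 0, "color": "red"},
        {"id": 1, "color": "blue"},
        {"id": 2, "color": "red", "animal": True}]
secret = pool[2]
attrs = ["color", "animal"]
ask = lambda dialog: attrs[len(dialog) % len(attrs)]
answer = lambda img, q: img.get(q)
rerank = lambda p, dialog: sorted(
    p, key=lambda img: -sum(img.get(q) == a for q, a in dialog))

rounds = play_guesswhich(pool, secret, ask, answer, rerank)  # -> 2
```

With these toy agents, one round of questions ("color") leaves two tied candidates, and the second round ("animal") disambiguates, so the team succeeds in two rounds; a weaker questioner would take more.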

The paper further discusses the challenges faced with these approaches, including the difficulty of obtaining robust question-answer pairs, and the fact that humans may or may not learn from the AI, among other such challenges. Finally, the paper concludes that ALICE_RL, the "state of the art" agent that performs best in AI-AI (ALICE-QBot) evaluation, shows no corresponding advantage in ALICE-human teams. This points to a growing disconnect between AI development that occurs independent of human input and the design of human-in-the-loop interaction systems.

Reflection: This paper foregrounds a very important challenge: the gap between AI research and development and its use in the real world with actual human beings. One thing I found interesting in this paper is that AI systems remain ever-dependent on humans for input. Similar to what Gray and Suri describe in their book, Ghost Work, as AI advances, there is a need for humans at the frontier of AI's capabilities. This is known as AI's "last mile" problem, and it will probably never cease to exist. It is an interesting paradox: AI seeks to replace humans, only to need humans for a new type of task.

I think this is one of the major limitations of developing AI independent of real-world applications and usage. If researchers only use synthetic data and toy cases within the lab, then AI cannot really advance in the real world. Instead, AI researchers should strive to work with human computation and human-computer interaction researchers to further both groups' needs and requirements. This approach has even been used in Google Duplex, where a human takes over when the AI is unable to perform well.

Further, I find that the paper has some limitations, such as its experimental setup and the dataset that was used. I do not think that QBot was representative of a useful AI, and its questions were not on par with a human's. I also believe that QBot needed to be able to dynamically learn from and interact with the human, which would make for a fairer comparison between AI-AI and human-AI teams.

Questions:

  1. How might VQA challenges be made more difficult and more applicable to the real world?
  2. What are the other limitations of this paper? How can they be overcome?
  3. How would you use such an evaluation approach in your class project?
  4. There’s a lot of interesting data generated from the experiments, such as user trust and beliefs. How might that be useful in improving AI performance?

3 thoughts on “03/25/20 – Sukrit Venkatagiri – Evaluating Visual Conversational Agents”

  1. From my understanding of the nature of the game they were evaluating, I don’t see the value of having the QBot adapt to the user as the game is being played. It seemed to me that the QBot already knows what the content of the picture is and is answering questions accordingly, so I don’t see where the dynamic adaptation would come from in this particular case.

  2. I agree with you on the importance of keeping humans in the loop. But I present a counter-argument that was given (essentially) by another paper we read this week: Evorus. The eventual goal is to kick humans out of the loop such that for this task we don’t need them. These papers seem at odds with each other, with this paper showing the downsides of removing the human, but Evorus showing the benefits. I think we should trend towards Evorus’ thesis, however, as that is the point of AI research – to eventually remove the costly components (humans) so that they can work on other tasks.

    I was also under the impression that the AI did perform updates to itself as the games wore on, which was the ALICE_RL condition? I could easily be mistaken though.

  3. VQA challenges can actually be utilized to find the best moves in board games. Additionally, they can be utilized to categorize types of land, or simply for authentication/access control.
