03/25/2019 – Nurendra Choudhary – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

In this paper, the authors measure the performance of human-AI teams instead of AI in isolation. They employ the GuessWhich game, built around a visual conversational agent called ALICE that interacts with humans.

The game includes two agents: a questioner and an answerer. The answerer has access to an image; the questioner asks questions about it, and based on the answers tries to guess the correct image from an exhaustive set of images. For the human-AI team, the answerer is ALICE and the questioner is a human. Performance is measured in terms of the number of questions needed for a correct guess. The authors also utilize a QBot (Questioner Bot) in place of the human for a comparative analysis between ALICE-human and ALICE-QBot teams.
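The round structure described above can be sketched as a simple loop. This is a toy illustration with made-up function names and string "images"; the paper's actual questioner and answerer are neural models, not these stand-ins.

```python
# Sketch of the GuessWhich loop: questioner asks, answerer (ALICE) replies,
# questioner guesses. Fewer rounds to a correct guess = better team performance.
def play_guesswhich(secret, pool, ask, answer, guess, max_rounds=9):
    """Run question-answer rounds; return the round on which the guess succeeds."""
    dialog = []
    for round_no in range(1, max_rounds + 1):
        q = ask(dialog)            # questioner (human or QBot) asks a question
        a = answer(secret, q)      # answerer (ALICE) has access to the image
        dialog.append((q, a))
        if guess(pool, dialog) == secret:
            return round_no
    return max_rounds

# Toy stand-ins: "images" are strings; the questioner naively enumerates them.
pool = ["cat", "dog", "owl"]

def make_ask(candidates):
    it = iter(candidates)
    return lambda dialog: next(it)

def answer(secret, q):
    return "yes" if q == secret else "no"

def guess(pool, dialog):
    q, a = dialog[-1]
    return q if a == "yes" else None

rounds = play_guesswhich("dog", pool, make_ask(pool), answer, guess)
print(rounds)  # "dog" is the second candidate enumerated, so 2 rounds
```

The number of rounds plays the role of the paper's team-performance metric: a better questioner-answerer pair needs fewer questions to identify the image.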

The authors discuss various challenges with these approaches, such as robustness to incorrect question-answer pairs and human adaptation to the AI. They conclude that ALICE-RL, the state of the art in the AI literature, performs well when paired with QBot but not when paired with humans. This highlights the disconnect between isolated AI development and development in teams with humans.

Reflection

The paper discusses an important problem: the disconnect between isolated AI development and real-world usage with humans in the loop. However, I feel there are some drawbacks in the experimental setup. In the ALICE-QBot condition, I do not agree with the temporally dynamic nature of the dialog, where QBot conditions each new question on its own previous questions. I think QBot should instead get access to the perfect set of prior questions (from humans) when generating the next question. This would make the comparison fairer and less dependent on QBot's own compounding errors.

An interesting point is the dependence of AI on humans. A perfect AI system should not rely on humans; however, current AI systems rely on humans to be useful in the real world. This leads to a paradox: we need to make AI systems human-compliant while moving towards the larger goal of independent AI.

To achieve the larger goal, I believe isolated development of AI is crucial. However, the systems also need to contribute to human society. For this, I believe we can deploy variants of the underlying system to support human behavior. This approach preserves isolated development and additionally collects auxiliary signals from human behavior that can further improve the AI's performance. The approach is already being applied effectively: in the case of Google Translate, the underlying neural network model was developed in isolation, while human corrections to its translations provide auxiliary information and improve the human-AI team's overall performance. This leads to a significant improvement in the translator's performance over time.

Questions

  1. Is it fair to use the GuessWhich game as an indicator of AI’s success? Shouldn’t we rely on the final goal of an AI to better appreciate the disconnect?
  2. Should the human-AI teams just be part of evaluation or also the development part? How would we include them in the development phase for this problem?
  3. The performance of ALICE relies on the QBot mechanism. Could we use human input to improve QBot’s question generation mechanism to improve its robustness?
  4. The experiment provides a lot of auxiliary data such as correct questions, relevance of questions to the images and robustness of bots with respect to their own answers. Can we integrate this data into the main architecture in a dynamic manner?

Word Count: 564

One thought on “03/25/2019 – Nurendra Choudhary – Evaluating Visual Conversational Agents via Cooperative Human-AI Games”

  1. Hello, Nurendra.

    I like your idea of separating training and adjustment.
    It sounds like you are saying that the basic model should be built in an isolated environment and human involvement should affect the hyper-parameters.
    I agree that an isolated environment is ideal for maximizing performance, as there are fewer dynamic properties.
    I wonder whether there are cases where this does not hold.

    As for your first question, yes, I do think that the results are a bit too constrained to be generalized.
    I think similar studies with different tasks should be repeated many times to get a reliable result.
    However, I also do think that a simple Q&A session is a common use of an AI, so while it’s not generalizable, I think the results can apply to many studies.
