03/25/2020 – Bipasha Banerjee – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary 

This paper aims to measure AI performance using a human-in-the-loop approach instead of relying only on traditional benchmark scores. For this purpose, the authors evaluated a visual conversational agent interacting with humans, effectively forming a Human-AI team. GuessWhich is a human computation game that was designed to study these interactions. The experiments were conducted on Amazon Mechanical Turk, and the AI component was the ABOT from [1], the agent that had proven most effective in that work. The agent, named ALICE, was evaluated in two variants: ALICE-SL, trained in a supervised manner on a visual dialog dataset, and ALICE-RL, pre-trained with supervised learning and fine-tuned with reinforcement learning. Both the human and the AI members of the team need to be aware of each other's imperfections and must infer information as needed. It was found that humans paired with ALICE-SL outperformed humans paired with ALICE-RL, which is contrary to the ranking observed when only AI agents interact with each other. Hence, the results suggest that AI benchmarking tools do not accurately represent performance when humans are in the loop.
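To make the setup concrete, below is a minimal, self-contained sketch of the GuessWhich game loop. This is my own toy illustration under simplifying assumptions, not the authors' code: the images are stand-in strings, the human plays through `input()`, and ALICE is simulated by a noisy keyword matcher rather than a trained visual dialog model.

```python
import random

# Toy sketch of the GuessWhich game loop (my illustration, NOT the paper's code).
# "Images" are plain strings; ALICE is a noisy keyword matcher.

NUM_ROUNDS = 9  # the dialog length used in the paper

def alice_answer(secret_image: str, question: str, error_rate: float = 0.1) -> str:
    """Stand-in for ALICE: says whether the asked-about word appears in the
    secret image's description, and is wrong `error_rate` of the time."""
    truthful = question.strip().lower() in secret_image.lower()
    if random.random() < error_rate:  # ALICE's imperfection
        truthful = not truthful
    return "yes" if truthful else "no"

def play_guesswhich(image_pool: list) -> bool:
    """One game: the human questions ALICE about a secret image for
    NUM_ROUNDS rounds, then tries to pick it out of the pool."""
    secret_image = random.choice(image_pool)
    for round_no in range(1, NUM_ROUNDS + 1):
        question = input(f"Round {round_no}/{NUM_ROUNDS} - ask about a word: ")
        print("ALICE:", alice_answer(secret_image, question))
    guess = input(f"Pick the secret image from {image_pool}: ")
    return guess == secret_image

if __name__ == "__main__":
    pool = ["a dog on a beach", "two cats on a sofa", "a red car in the rain"]
    print("You won!" if play_guesswhich(pool) else "Wrong guess.")
```

Even in this toy version, the key dynamic of the study is visible: because ALICE's answers are occasionally wrong, the human's success depends on how well they model and recover from the agent's imperfections.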

Reflection

The paper proposes a novel way of including humans in the loop when evaluating AI conversational agents. From an AI perspective, standard evaluation metrics like F1, precision, and recall are used to gauge the performance of a model. The paper builds on previous work [1] that considered only AI models interacting with each other, where a reinforcement learning model performed considerably better than a standard supervised one. However, when humans are in the loop, the supervised learning model performs better than its reinforcement learning counterpart. This signifies that our current AI evaluation techniques do not effectively take the human context into account.
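For reference, these offline metrics are computed purely from a model's predictions against ground-truth labels, in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

Nothing in these formulas models how a human partner interprets, trusts, or recovers from the agent's mistakes, which is exactly the gap the GuessWhich evaluation exposes.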

The authors mentioned that, at times, people would discover the strengths of the AI system and adapt their interactions accordingly. Hence, we can conclude that humans and AI are both learning from each other to some degree. This is a good way to leverage the strengths of both humans and AI.

It would also be interesting to see how well this combination identifies images correctly when the task is complex. If the stakes are high and the image-identification task involves both humans and machines, would such combinations still work? It was noted that the AI system did answer some questions incorrectly, which ultimately led to incorrect guesses by humans. Hence, in order to make such combinations work seamlessly, more testing and training with larger amounts of data are necessary.

Questions

  1. How would we extend this work to other complex applications? Suppose the AI and humans were required to identify potential threats in security, where the stakes are high?
  2. It was mentioned that the game was played for nine rounds. How was this threshold selected? Would it have worked better if the number were greater, or would more rounds only confuse humans further?
  3. The paper mentions that the “game-like setting is constrained by the number of unique workers” who accept their tasks. How can this constraint be mitigated?

References

[1] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Devi Parikh. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. https://arxiv.org/pdf/1703.06585.pdf

One thought on “03/25/2020 – Bipasha Banerjee – Evaluating Visual Conversational Agents via Cooperative Human-AI Games”

  1. Following up on your first question, I do like the potential of AI and humans working collaboratively. Within security, one improvement this collaboration could bring is an AI that determines the true threat level from all of the warnings and flags. With a system segregated into tiers based on perceived difficulty, issues start at tier one and escalate up to tier three until the problem is solved. An AI system that automatically routes issues to the proper tier, with its decisions confirmed by a security analyst, would likely save time.
