Summary
Chattopadhyay et al.’s paper “Evaluating Visual Conversational Agents via Cooperative Human-AI Games” explores the research question of whether progress on AI-AI evaluation translates into better performance of human-AI teams. To be effective on real-world problems, an AI system faces the challenge of adapting to human partners. Existing work measures progress in AI in isolation, without humans in the loop. Taking visual conversational agents as an example, recent work typically evaluates how well pairs of agents perform on goal-driven conversational tasks rather than on response retrieval from fixed dialogs. The researchers propose to evaluate AI agents in an interactive setting and design a human computation game, named GuessWhich, that continuously engages humans with agents in a cooperative way. By measuring the collaborative success of humans paired with two versions of the agent (one trained with supervised learning and one fine-tuned with reinforcement learning), their experiments find no significant difference in performance between the two versions when paired with human partners, which suggests a disconnect between AI-AI and human-AI evaluation.
Reflection
This paper conducts an interesting study by developing a human computation game to benchmark the performance of visual conversational agents as members of human-AI teams. Nowadays, we increasingly interact with intelligent, highly communicative technologies throughout our daily lives. For example, companies automate communication with their customers to make purchases more efficient and streamlined. However, it is difficult to define what success looks like in this case. Do dialog bots really add convenience, or do they put up a barrier between companies and their customers? Even though this paper proposes a method to evaluate AI agents in an interactive setting, it does not discuss how the evaluation generalizes to other communication-related tasks.
In addition, I think the task design is worth further discussion. The paper uses the number of guesses the human needs to identify the secret image as an indicator of human-AI team performance. When GuessWhich is played by human-human teams, how to strategize the questions seems to be an important part of the game, yet the paper gives little consideration to question strategies. Would it help to provide crowd workers with guidelines on communicating with machines? Is it possible that some questions are clear to humans but ambiguous to machines? Based on the experimental results, the majority of the questions are binary, which are comparatively easy for the conversational agent to answer. I think one reason for this is the information about the secret picture that is given up front. Taking the case presented in the paper as an example, the basic description of the picture is given as “A man sitting on a couch with a laptop.” Looking at the candidate pictures, few of them include all of these components. In other words, the description provided in the first round makes the secret picture not that “secret” anymore: it is enough on its own to narrow the pool down to two or three candidates. In this scenario, the role the visual conversational agent plays in the human-AI team is minimized and difficult to evaluate precisely.
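To make the evaluation critique concrete, here is a minimal sketch of how the number-of-guesses metric could be aggregated and compared across the two agent versions. The game logs, sample sizes, and the choice of a Mann-Whitney U test are my own assumptions for illustration; this is not the paper’s actual analysis pipeline.

```python
# Sketch only: hypothetical game logs, not data from the paper.
from scipy.stats import mannwhitneyu

# Number of guesses the human needed to find the secret image,
# grouped by which agent version the team was paired with.
guesses_sl = [3, 1, 4, 2, 5, 2, 3, 4]   # teams paired with the SL agent
guesses_rl = [2, 4, 3, 3, 1, 5, 2, 4]   # teams paired with the RL agent

mean_sl = sum(guesses_sl) / len(guesses_sl)
mean_rl = sum(guesses_rl) / len(guesses_rl)
print(f"mean guesses, SL: {mean_sl:.2f}  RL: {mean_rl:.2f}")

# Non-parametric test for a difference between the two conditions; a large
# p-value would be consistent with a "no significant difference" finding.
stat, p_value = mannwhitneyu(guesses_sl, guesses_rl, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.3f}")
```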
Discussion
I think the following questions are worth further discussion.
- Is it possible to generalize the evaluation method proposed in the paper to other communication-related tasks? Why or why not?
- In the literature, the AI agent fine-tuned with reinforcement learning has been found to outperform its supervised-learning counterpart. This paper finds no significant difference in accuracy between the two versions when evaluated via human-ALICE teams. What reasons can you think of to explain this?
- Do you think there are any improvements that can be made to the experimental design?
- What do you think are the challenges that come with human-in-the-loop evaluations?
In answer to your last question, I think that having humans in the loop presents many possible problems. If the humans need certain skills to test the system you are working on, it could be very hard to find enough people, even on crowdsourcing sites, to meet your demand. Beyond this, I think there are possible ethical concerns as well. If you are testing a completely new system that has the chance to psychologically or physically harm participants, it could be unsafe to bring humans into the loop.