Summary
The paper presents a cooperative human-AI game called GuessWhich. The game is a live, conversational interaction in which the human is shown multiple candidate photos while the AI, ALICE, has access to only the one correct photo; the human asks ALICE questions to work out which candidate is the correct choice. ALICE was trained on a publicly available visual dialog dataset in two versions, one with supervised learning and one with reinforcement learning, and both were used to evaluate human-AI team performance. The authors find no significant difference in performance between ALICE's supervised learning and reinforcement learning versions when paired with human partners. Their findings suggest that while self-talk and reinforcement learning are interesting directions for building better visual conversational agents, there appears to be a disconnect between AI-AI and human-AI evaluation: progress in the former does not seem to be predictive of progress in the latter. This underscores that measuring AI progress in isolation is of limited use for systems that ultimately require human-AI interaction.
Reflection
The concept presented in this paper is interesting. As someone who does not work in the HCI field, I found it eye-opening to consider the different ways that models I have worked on should not be measured in isolation, since the authors show that evaluating visual conversational agents through a human computation game gives results that differ from conventional AI-AI evaluation. This makes me wonder how such methods would apply to tasks in which automated metrics correlate poorly with human judgement, such as natural language generation in image captioning, and how a method inspired by this paper would differ from the methods suggested in last week's papers. Given the difficulties these tasks present and their interactive nature, it is clear that the most appropriate way to evaluate them is with a human in the loop; but how would a large-scale human-in-the-loop evaluation happen, especially with limited financial and infrastructure resources?
This paper made me think of the challenges that come with human-in-the-loop evaluations:
1- To run the evaluation properly, we need a set of clear and simple instructions for crowdworkers.
2- There should be a way to ensure the quality of the crowdworkers' work.
3- For the evaluation to be valid, communication between the human and the agent must be uninterrupted.
My takeaway from the paper is that while traditional platforms were adequate for evaluation tasks using automatic metrics, there is a critical need to support human-in-the-loop evaluation for free-form multimodal tasks.
Discussion
- How could the approach in this paper be used to evaluate tasks like image captioning?
- What other challenges come with human-in-the-loop evaluations?
- Is there a human-AI evaluation benchmark in the field of your project? How would you ensure that your results are comparable?
- How would you apply what you have learned about human-AI evaluation in your project?
- Have you worked with human-in-the-loop evaluation before? What was your experience?