Summary
Chattopadhyay et al.’s work details the problems with the current (pre-2018) methods of evaluating visual conversational agents. These agents, which are AIs designed to discuss what is in pictures, were typically evaluated through one AI (the primary visual conversational agent) describing a picture while another asked questions about it. However, the authors show how this kind of interaction does not adequately reflect how humans would converse with the agent. They use 2 visual conversation agents, dubbed ALICE_SL and ALICE_RL (for supervised and reinforcement learning, respectively) to play 20 questions with AMT workers. They found that there was no significant difference in the performance of the two versions of ALICE. This stood in contrast to the work done previously which found that ALICE_RL was significantly better than ALICE_SL when tested by AI-AI teams. Both ALICEs perform better than random chance, however. Furthermore, AI-AI teams require fewer guesses than the humans in Human-AI teams.
Personal Reflection
I first found their name for 20-questions was Guess What or Guess Which. This has relatively little to do with the paper, but it was jarring to me at first.
The first thing that struck me was their discussion of the previous methods. If the first few rounds of AI-AI evaluation were monitored, why didn’t they pick up that the interactions weren’t reflective of human usage? If the abnormality didn’t present until later on, could they have monitored late-stage rounds, too? Or was it generally undetectable? I feel like there’s a line of questioning here that wasn’t looked at that might benefit AI as well.
I was amused that, with all the paper being on AI and interactions with humans, that they chose the image set to be medium difficulty based on “manual inspection.” Does this indicate that the AIs don’t really understand difficulty in these datasets?
Another minor quibble is that they say each HIT was 10 games, but then state that they published HITs until they got 28 games completed on each version of ALICE and specify this meant 560 games. They overload the word ‘game’ without describing the actual meaning behind it.
An interesting question that they didn’t discuss investigating further is whether question strategy evolved over time for the humans. Did they change up their style of questions as time went on with ALICE? This might provide some insight as to why there was no significant difference.
Lastly, their discussion on the knowledge leak of evaluating AIs on AMT was quite interesting. I would not have thought that limiting the interaction each turker could have with an AI would improve the AI.
Questions
- Of all of the participants who started a HIT on AMT, only 76.7% of participants actually completed the HIT. What does this mean for HITs like this? Did the turkers just get bored or did the task annoy them in some way?
- The authors pose an interesting question in 6.1 about QBot’s performance. What do you think would happen if the turkers played the role of the answerer instead of the guesser?
- While they didn’t find any statistical differences, figure 4(b) shows that ALICE_SL outperformed ALICE_RL in every round of dialogue. While this wasn’t significant, what can be made of this difference?
- How would you investigate the strategies that humans used in formulating questions? What would you hope to find?