SUMMARY
In this paper, the authors design a cooperative game called GuessWhich (inspired by the 20-Questions game) to measure the performance of human-AI teams in the context of visual conversational agents. The AI system, ALICE, is based on the ABOT developed by Das et al. in a prior study conducted to measure the performance of AI-AI systems. Two variants of ALICE were considered for this study – ALICESL (trained in a supervised manner on the Visual Dialog dataset) and ALICERL (pre-trained with supervised learning and fine-tuned using reinforcement learning). The GuessWhich game was designed such that the human is the ‘questioner’ and the AI (ALICE) is the ‘answerer’. Both are given a caption that describes an image. While ALICE is shown this image, the human can only ask the AI questions (9 rounds of dialog) to better understand it. After these rounds, the human must select the correct image from a set of distractor images that are semantically similar to the image to be identified. The authors found that, contrary to expectation, improvements in AI-AI performance do not translate to improvements in human-AI performance.
REFLECTION
I like the gamification approach that the authors adopted for this study, and I believe that the design of the game works well in the context of visual conversational agents. The authors mention how they aimed to ensure that the game was ‘challenging and engaging’. This reminded me of the discussion we had in class about the paper ‘Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research’, and how researchers often put in extra effort to make tasks for crowd workers more engaging and meaningful. I also liked the approach used to identify ‘distractor’ images and felt that it was useful in making the game challenging for the crowd workers.
I thought it was interesting to learn that the AI-ALICE teams outperformed the human-ALICE teams. I wonder whether this is affected by the fact that ALICE can answer some questions incorrectly, and how such errors might shape the mental model the human forms of the agent. I also thought it was good that the authors accounted for knowledge leakage by ensuring that each human worker could play only a fixed number (10) of games.
I also liked that the authors gave workers performance-based incentives tied to the success of the human-AI team. It was good that the authors published the code for their design and provided a link to an example game.
QUESTIONS
- As part of the study conducted in this paper, the authors design an interactive image-guessing game. Can similar games be designed to evaluate human-AI team performance in other applications? What other applications could be included?
- Have any follow-up studies been performed to evaluate the QBOT in a ‘QBOT-human team’? In this scenario, would QBOTRL outperform QBOTSL?
- The authors found that some of the human workers adopted a single-word querying strategy with ALICE. Is there a specific reason that could have caused the humans to do so? Would they have questioned a human in a similar fashion? Would their style of querying have changed if they were unaware of whether the other party was a human or an AI system?
I liked your first question. I believe that yes, the image-guessing game can and should be extended to evaluate human-AI team performance in other settings. What about using it in a textual context? I know from my current work that it is very difficult for AI to extract the underlying meaning from text. Sentiment analysis exists, but human emotions range far beyond “positive,” “negative,” or “neutral.” Leveraging both human and AI capabilities would prove beneficial in this context.