The primary intent of this paper is to measure the performance of human-AI teams, here in the setting where the AI is a visual conversational agent cooperating with a human in a game. AI systems are often evaluated in isolation or through interaction with other AI systems. This paper asks whether such AI-AI evaluation can predict how the AI system will perform when it interacts with humans, which is essentially the human-AI team performance. To measure this, the authors use a game-with-a-purpose (GWAP) called GuessWhich. A human player interacts with an AI component (ALICE) that holds information about a secret image unknown to the human; the human asks questions about the image, uses the AI's answers as clues, and attempts to identify the secret image from a pool of images. Two versions of the AI component are compared: one trained in a purely supervised manner (SL), and one pre-trained with supervised learning and fine-tuned via reinforcement learning (RL). The results show no significant performance difference between the two versions when interacting with humans.
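To make the setup concrete, the interaction can be sketched as a simple loop: the human asks questions, the agent answers, and the team succeeds if the human's final guess matches the secret image. This is only an illustrative sketch, not the authors' implementation; the `Answerer` interface, the `ask`/`guess` callables standing in for the human player, and the turn budget are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Answerer(Protocol):
    """Stand-in for ALICE: any visual conversational agent that can
    answer free-form questions about a secret image."""
    def answer(self, question: str, dialog_history: List[str]) -> str: ...


@dataclass
class GuessWhichRound:
    pool: List[str]       # candidate image ids shown to the human
    secret_image: str     # known to the agent, hidden from the human
    max_turns: int = 9    # question budget per round (value assumed for illustration)


def play_round(round_: GuessWhichRound, agent: Answerer, ask, guess) -> bool:
    """Run one human-AI round. `ask` and `guess` are callables standing in
    for the human player (e.g., an MTurk worker in the actual study)."""
    history: List[str] = []
    for _ in range(round_.max_turns):
        question = ask(history)                  # human asks about the secret image
        reply = agent.answer(question, history)  # agent answers based on its knowledge of the image
        history.extend([question, reply])
    # Team success is judged by the human's final guess, not by any AI-AI metric.
    return guess(round_.pool, history) == round_.secret_image
```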
Human interaction with AI, whether direct or indirect, has grown rapidly, so it is notable that this paper focuses on the performance of the human-AI team rather than the AI in isolation. As AI is increasingly used alongside humans, a dependency is created that cannot be captured by measuring the performance of the AI component alone.
Although the two AI versions perform comparably when paired with humans, the experiments also show that the improvement the RL-fine-tuned agent achieves under AI-AI evaluation does not translate into better human-AI team performance. This is an interesting insight that challenges existing AI evaluation norms.
The cooperative game used in the experiments was also nontrivial to build, and it was interesting to learn how the AI was developed and how the pool of images was selected. The paper further considers the possibility that MTurk workers discover the AI's strengths and frame subsequent questions to exploit them. This is a fascinating possibility, as it ties back to the mental models humans build while interacting with AI systems.
- Given that the study was conducted with visual conversational agents, are these results generalizable to other settings?
- Human-AI team performance is observed to be not significantly different for the SL agent compared to the RL agent. What reasons might explain this discrepancy?
- Given that ALICE is imperfect, what would be the recovery cost of an incorrect answer? Would this substantially impact the performance observed?