This paper attempts to measure the performance of a visual conversational bot called ALICE when teamed with a human, as opposed to the common practice in modern AI research of measuring it against a counterpart AI. The authors design and deploy two versions of ALICE: one trained with supervised learning and the other with reinforcement learning. They had MTurk workers hold a Q&A session with ALICE to discover a hidden image, shown only to ALICE, within a pool of similar images. After a fixed number of questions, the MTurk workers were asked to guess which image was the hidden one. The authors evaluated performance using the workers' resulting mental rankings of the hidden image after their conversations with the AI. Previous work had found that the bot trained with reinforcement learning performed better than the supervised one. However, the authors discover that there is no significant difference when the bots are evaluated in a human-AI team.
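To make the evaluation concrete, here is a minimal sketch of how a mental-ranking metric like the one described could be computed: after each dialog, the worker ranks the image pool, and the score is the average rank of the hidden image across sessions (lower is better). The function and data names below are my own illustration, not the authors' code.

```python
# Hypothetical sketch of the guessing-game metric described above: after each
# dialog, the worker ranks the candidate pool, and we record where the hidden
# (target) image lands. Names and data layout are illustrative assumptions.
from statistics import mean

def target_rank(ranked_pool: list[str], target_id: str) -> int:
    """Return the 1-based position of the hidden image in the worker's ranking."""
    return ranked_pool.index(target_id) + 1

def mean_target_rank(sessions: list[dict]) -> float:
    """Average the hidden image's rank across all human-ALICE sessions (lower is better)."""
    return mean(target_rank(s["ranking"], s["target"]) for s in sessions)

# Toy usage: two sessions over a pool of four similar images.
sessions = [
    {"ranking": ["img_3", "img_1", "img_4", "img_2"], "target": "img_1"},  # rank 2
    {"ranking": ["img_2", "img_4", "img_1", "img_3"], "target": "img_2"},  # rank 1
]
print(mean_target_rank(sessions))  # 1.5
```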
This paper was a good reminder that the ultimate user at the end is a human. It's easy to forget that what computers prefer does not automatically translate to what works for a human. It was especially interesting to see that a significant performance difference in an AI-AI setting was rendered minimal with humans in the loop. It made me wonder what it was about the reinforcement-learned ALICE that QBOT preferred over the other version. Once that distinguishing factor is found, we might be able to help humans learn and adapt to the AI, leading to improved team performance.
It was a little disappointing that the same study, with QBOT as the subject, was left for future work. I would have loved to see the full picture. It could also have provided insight into the question above: what was it about reinforcement learning that QBOT preferred?
This paper identified that there is still a good distance between human cognition and AI cognition. If further studies find ways to minimize this gap, it would allow a quicker AI design process, where the resulting AI is effective for both human and AI partners without needing extra adjustments for the human side. It would be interesting to see whether it is possible to train an AI to think like a human in the first place.
These are the questions I had while reading this paper:
1. This paper was presented in 2017. Do you know of any other studies done since then that measured human-AI team performance? Do they agree that there is a gap between humans and AIs?
2. If you have experience training visual conversational bots, do you know whether a bot prefers some kinds of information over others? What is the most significant difference between a bot trained with supervised learning and one trained with reinforcement learning?
3. In this study, the MTurk workers were asked to make a guess after a fixed number of questions. The study does not measure the minimum or maximum number of questions needed, on average, to make an accurate guess. Do you think the accuracy of the guesses would increase proportionally with the number of questions? If not, what kind of curve do you think it would follow?