Summary:
The main objective of this paper is to measure AI agents through interactive downstream tasks performed by human-AI teams. To achieve this goal, the authors designed a cooperative game, GuessWhich, that requires a human to identify a secret image from a pool of images by engaging in a dialog with an answerer bot (ALICE). The image is known to ALICE but unknown to the human, so the human must ask relevant questions and pick out the secret image based on ALICE's answers. The paper presents two versions of ALICE: ALICE(SL), trained in a supervised manner on human dialogs about images, and ALICE(RL), which is pre-trained with supervised learning and fine-tuned via reinforcement learning for the image-guessing task. The results indicate that ALICE(RL)'s advantage over ALICE(SL) when evaluated with another AI does not carry over to evaluation with humans. Therefore, the paper concludes that there is a disconnect between benchmarking AI in isolation and benchmarking it in the context of human-AI interaction.
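To make the game mechanics concrete, here is a minimal sketch in Python of the interaction loop as I understand it. The `answerer`, `questioner`, and their `ask`/`answer`/`guess` methods are hypothetical placeholders standing in for ALICE and the human (or QBOT) player, not the authors' actual code.

```python
import random

def play_guesswhich(pool, answerer, questioner, num_rounds=9):
    """Run one cooperative GuessWhich game and report whether the team succeeded."""
    secret = random.choice(pool)  # known to the answerer (ALICE), hidden from the questioner
    dialog = []
    for _ in range(num_rounds):
        question = questioner.ask(dialog)            # e.g. "Is there a person in the image?"
        answer = answerer.answer(secret, question)   # ALICE answers about the secret image
        dialog.append((question, answer))
    guess = questioner.guess(pool, dialog)           # questioner picks the most likely image
    return guess == secret                           # team performance = guessing accuracy
```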
Reflection:
This paper reminds me of another paper that we discussed before, Updates in Human-AI Teams. Both are concerned with the impact of human involvement on AI performance. I think this is a great topic that deserves more attention. As the paper notes at the outset, as AI continues to improve, human-AI teams are inevitable. Many AI products are already widely used across society and in all walks of life, for example in predictive policing, life insurance estimation, sentencing, and medicine. All of these products require humans and AI to cooperate. Therefore, we can agree that the development and improvement of AI should always consider the impact of human involvement.
The QBOT-ABOT teams mentioned in this paper share an idea with GANs (generative adversarial networks): both train two systems jointly, without direct human supervision, and let them provide feedback to each other to enhance their performance. However, the authors make the point that it is unclear whether these QBOT and ABOT agents actually perform well when interacting with humans. This is an excellent point that we should always keep in mind when designing an AI system. Evaluation of an AI system should never be done in isolation; we should also consider human mental models, which requires us to think about how a human's mental model of the AI affects team performance when they work cooperatively. A suitable human-in-the-loop evaluation may be a more valuable measurement of an AI system.
Questions:
- When do you think we should measure performance with human involvement, and when should we not?
- What is the main point of this paper? Why do the authors use visual conversational agents to demonstrate that it is crucial to benchmark progress in AI in terms of how well it translates into helping humans perform a particular task?
- At the end of the paper, the authors mention that humans perceive ALICE(SL) and ALICE(RL) as comparable on all metrics. Why do you think humans reach this conclusion? Does that indicate that human involvement makes no difference between the two visual conversational agents?
Word Count: 537
I prefer to measure the performance of a system without human involvement in the first stage of development, because hiring human evaluators at that stage would be too expensive and inefficient. However, it is important to recruit human participants to evaluate the system before it is released. As the paper shows, there is a difference between the performance of human-AI cooperation and AI-AI cooperation: even if a system passes an AI-based evaluation, humans may still find it difficult to use.