03/25/2020 – Ziyao Wang – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

The authors propose that instead of only measuring AI progress in isolation, it is better to evaluate the performance of the whole human-AI team via interactive downstream tasks. They designed a human-AI cooperative game that requires a human to work with an answerer-bot to identify a secret image, known to the bot but unknown to the human, from a pool of images. At the end of the task, the human needs to pick the secret image out of the pool. Though the AI trained with reinforcement learning performs better than the AI trained with supervised learning when paired with an AI questioner, there is no significant difference between them when they work with humans. The result shows that there appears to be a disconnect between AI-AI and human-AI evaluations, which means that progress on the former does not seem predictive of progress on the latter.

Reflections:

This paper shows that there appears to be a disconnect between AI-AI and human-AI evaluations, which is an important point for future research. Compared with hiring people to evaluate models, using an AI system to evaluate other systems is cheaper and much more efficient. However, an AI system that is approved by the evaluating system may perform badly when interacting with humans. As the authors demonstrate, though the model trained with reinforcement learning outperforms the model trained with supervised learning in AI-AI evaluation, the two kinds of models perform similarly when they cooperate with human workers. A good example is a GAN. Even when the generator passes the evaluation of the discriminator, humans can still easily distinguish the generated results from real ones in most cases. For example, the generated images on the website thispersondoesnotexist.com pass the discriminator, yet for many of them we can easily spot abnormal regions that reveal the image is fake. This finding is important for future research: researchers should not focus only on simulated work environments, which can yield very different conclusions about a system's performance. Tests in which humans are involved in the workflow are needed to evaluate a trained model.

On the other hand, even though models evaluated only by AI may not meet human needs, the training process is much more efficient than supervised learning or any learning process that involves human evaluation. From this perspective, though AI-only evaluation may not be fully accurate, we can still apply this kind of measurement during development. For example, we can let an AI perform a first round of evaluation, which is cheap and highly efficient, and then have human evaluators assess the systems that pass the AI evaluation. As a result, the whole development process benefits from the advantages of both sides, and the evaluation can be both efficient and accurate.
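To make this idea concrete, here is a minimal Python sketch of such a two-stage pipeline. The model names, scoring stubs, threshold, and human-evaluation budget are all hypothetical placeholders, not anything taken from the paper; in practice the first stage would be an automated AI-AI metric and the second stage would be a human-AI team study.

```python
import random

def ai_evaluate(model_name: str) -> float:
    """Stage 1: cheap automated (AI-AI) score; stubbed here with a random value."""
    return random.random()

def human_evaluate(model_name: str) -> float:
    """Stage 2: expensive human-AI team score; also stubbed with a random value."""
    return random.random()

def two_stage_evaluation(candidates, ai_threshold=0.5, human_budget=3):
    # Filter candidates cheaply with the automated metric first.
    shortlist = [m for m in candidates if ai_evaluate(m) >= ai_threshold]
    # Spend the limited human-evaluation budget only on the shortlist.
    shortlist = shortlist[:human_budget]
    scores = {m: human_evaluate(m) for m in shortlist}
    # Pick the model that actually works best with humans.
    return max(scores, key=scores.get) if scores else None

print(two_stage_evaluation(["model_sl", "model_rl", "model_hybrid"]))
```

The point of the sketch is only the structure: the cheap automated metric narrows the candidate pool, and the expensive human evaluation is reserved for the few models that survive the first stage.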

Questions:

What else can the developed Cooperative Human-AI Game do?

What is the practical use of ALICE?

Is it certain that human-AI performance is more important than AI-AI performance? Is there any scenario in which AI-AI performance matters more?

One thought on “03/25/2020 – Ziyao Wang – Evaluating Visual Conversational Agents via Cooperative Human-AI Games”

  1. Hi Ziyao, to answer your second question, I think the practical use of ALICE could be in technical support chatbots, where users ask questions about a problem they are experiencing and the chatbot suggests the optimal solution, which plays a role similar to the secret image.
