Authors: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh
Summary
In this paper, the authors argue for building realistic evaluation benchmarks in the context of human-AI teams, as opposed to evaluating AI systems in isolation. They introduce two chatbots, one of which outperforms the other on standalone AI benchmarks. However, when the two are evaluated in a task setting that mimics their intended use, i.e., humans interacting with a chatbot to accomplish a common goal, they perform roughly the same. The work essentially points to a mismatch between benchmarking AI in isolation and benchmarking it in the context of human-AI teams.
Reflection
The core contribution of this paper is showing that evaluating AI systems in isolation will never give us the complete picture, and that we should therefore evaluate AI systems under the conditions in which they are intended to be used, with the target users who will actually be using them. In other words, the paper stresses the need for ecological validity in evaluation studies. The flip side of this contribution is, in some ways, reflected in the trend of AI systems falling short of their intended objectives in real-world scenarios.
Even though the GuessWhich evaluation is closer to a real-world scenario than vanilla in-isolation evaluation, it still remains an artificial setup. That said, the gap between it and a genuine real-world scenario (where a user interacts with a chatbot to accomplish an actual task, such as planning a trip) would likely be small.
The responses returned by the two bots are not wildly different (e.g., "beige" vs. "brown") since one served as the base for the other, so a human can adapt dynamically to either bot's responses and still accomplish the overall goal. It would also have been interesting to see how performance changes when the AI is drastically different, or actively sends the human down the wrong path.
This paper shows why it is important for AI and HCI researchers to work together to build meaningful datasets and to set up realistic ecosystems for evaluation benchmarks that are more relevant to potential users.
Questions
- If, in the past, you compared algorithms solely on the basis of precision-recall metrics (say, you built an algorithm and compared it against a baseline), do you feel the findings would hold up in a study with ecological validity?
- How would you evaluate a conversational agent? (Suggest something other than the GuessWhich platform.)
- How much worse or better (or simply different) would a chatbot have to be for humans to perform significantly differently than they do with the current ALICE chatbots in the GuessWhich evaluation setting? (Any kind of subjective interpretation is welcome.)