Summary
While the general public may not realize how much they interact with AI, people depend on it throughout the day, whether through a conversation with a phone assistant or in the backend systems of the bank they use. While evaluating AI against its own metrics is an imperative part of ensuring high quality, this team compared how two different models performed when working in collaboration with humans. To ensure a valid and fair comparison, they used a simple game (similar to Guess Who) in which the AI has to work with another AI or with a human to guess the selected image based on a series of questions. Though AI-AI collaboration provides good results, AI-human collaboration is relatively weaker.
Reflection
I appreciate the competitive nature of comparing a supervised learning (SL) model and a reinforcement learning (RL) model in the same game scenario, where each tries to help the human succeed by aiding them as best as it can. However, I take issue with one of their contributions: the relative comparison between the SL and RL bots. Within their contributions, they explicitly say they find “no significant difference in performance” between the different models. While they continue to describe the two methods as performing approximately equally, their self-reported data shows one model doing better in most measurements. Within Table 1 (the comparison of humans working with each model), SL is reported as having slightly better Mean Rank and Mean Reciprocal Rank (lower is better for the former, higher is better for the latter). Within Table 2 (the comparison of the multitude of teams), there was only one scenario where the RL model performed better than the SL model. Lastly, even in the participants’ self-reported perceptions, the SL model only decreased performance in 1 of 6 different categories. Though it may be a small decrease in performance, their diction downplays part of the argument they’re making. Though I admit the SL model having a better Mean Rank by 0.3 (from the Table 1 MR difference or the Table 2 Human row) doesn’t appear to be a big difference, I believe part of their contribution statement, “This suggests that while self-talk and RL are interesting directions to pursue for building better visual conversational agents…”, is not an accurate description, since by their own data it’s empirically disproven.
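To make the two metrics concrete, here is a minimal sketch of how Mean Rank and Mean Reciprocal Rank move in opposite directions. It assumes the usual definitions, where a rank is the position of the secret image in the list of guesses and rank 1 is a perfect guess. The rank lists and team names are made up purely for illustration and are not the paper's data.

```python
def mean_rank(ranks):
    """Average rank of the secret image; lower is better."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; higher is better, with a maximum of 1.0."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks for two teams over four games (NOT from the paper).
team_a = [1, 2, 4, 8]
team_b = [1, 3, 5, 9]

for name, ranks in [("team_a", team_a), ("team_b", team_b)]:
    print(name, "MR =", mean_rank(ranks), "MRR =", round(mean_reciprocal_rank(ranks), 4))
```

In this toy example the team with the lower Mean Rank also has the higher Mean Reciprocal Rank, which is the same pattern the "lower and then higher is better" reading of Table 1 relies on.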
Questions
- Though I admit I focus on the representation of the data and the delivery of their contributions while they focus on the human-in-the-loop aspect of the data, I imagine that within the machine learning environment a decrease in performance (by 0.3, or approximately 5%) would not be described as insignificant. Do you think their verbiage is truly representative of the result's relevance to machine learning?
- Do you think using more Turk workers (they used data from at least 56 workers) or adding an age requirement would change their data?
- Though evaluating the quality of collaboration between humans and AI is imperative to ensure AIs are made adequately, it seems common for there to be a disparity between how a model performs in Human-AI collaboration and how it performs in AI-AI collaboration. Because of this disconnect, their statement on progress between the two kinds of collaboration seems like a fundamental idea. Do you think this work is more idealistic or more fundamental in its contributions?
Hi Miles,
To answer your third question, I do believe it’s fundamental for us to understand Human-AI collaboration, especially when models are applied in such a manner. The contributions would have been more idealistic if the work had been applied to a model that does not need Human-AI collaboration, where studying only the AI-AI evaluation would be sufficient.