3/25/20 – Jooyoung Whang – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

This paper attempts to measure the performance of a visual conversational bot called ALICE with a human teammate, as opposed to what modern AI studies commonly do, which is measuring against a counterpart AI. The authors design and deploy two versions of ALICE: one trained by supervised learning and the other by reinforcement learning. The authors had MTurk workers hold a Q&A session with ALICE to discover a hidden image, shown only to ALICE, within a pool of similar images. After a fixed number of questions, the MTurk workers were asked to guess which one was the hidden image. The authors evaluated using the resulting mental rankings of the hidden images after the users’ conversations with the AI. Previous work had found that bots trained using reinforcement learning performed better than their supervised counterparts. However, the authors discover that there is no significant difference when evaluated in a human-AI team.
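To make the protocol concrete, here is a minimal, self-contained sketch of the GuessWhich-style loop described above: a fixed number of yes/no questions answered by a stand-in ALICE that sees the secret image, followed by a guess from the pool. The toy image pool, attribute-style questions, and scoring rule are all illustrative assumptions, not the paper’s implementation.

```python
# Toy sketch of the GuessWhich protocol: fixed Q&A rounds, then a guess.
# Pool, questions, and scoring are illustrative assumptions only.

NUM_ROUNDS = 9  # the paper uses a fixed number of question-answer rounds

# Toy image pool: each "image" is just a set of attributes.
pool = {
    "img_0": {"man", "couch", "laptop"},
    "img_1": {"man", "couch", "dog"},
    "img_2": {"woman", "desk", "laptop"},
}
secret = "img_0"  # only ALICE knows this

questions = ["laptop", "dog", "couch", "woman", "man", "desk", "cat", "phone", "tree"]

def alice_answer(question: str) -> bool:
    """Stand-in answerer: ALICE sees the secret image and answers yes/no."""
    return question in pool[secret]

# The questioner accumulates evidence over the fixed rounds of dialog...
evidence = {}
for q in questions[:NUM_ROUNDS]:
    evidence[q] = alice_answer(q)

# ...then ranks the pool by agreement with the answers and guesses the top image.
def agreement(attrs: set) -> int:
    return sum((q in attrs) == ans for q, ans in evidence.items())

ranking = sorted(pool, key=lambda img: agreement(pool[img]), reverse=True)
print("Guess:", ranking[0], "| rank of secret image:", ranking.index(secret) + 1)
```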

This paper was a good reminder that the ultimate user at the end is a human. It’s easy to forget that what computers prefer does not automatically translate to a human’s case. It was especially interesting to see that a significant performance difference in an AI-AI setting was rendered minimal with humans in the loop. It made me wonder what it was about the reinforcement-learned ALICE that QBOT preferred over the other version. Once we find that distinguishing factor, we might be able to make humans learn and adapt to the AI, leading to improved team performance.

It was a little disappointing that the same study with QBOT as the subject was left for future work. I would have loved to see the full picture. It could also have provided insight into what I’ve written above: what was it about reinforcement learning that QBOT preferred?

This paper identified that there’s still a good distance between human cognition and AI cognition. If further studies find ways to minimize this gap, it will allow a quicker AI design process, where the resulting AI will be effective for both humans and AIs without needing extra adjustments for the human side. It would be interesting to see if it is possible to train an AI to think like a human in the first place.

These are the questions I had while reading this paper:

1. This paper was presented in 2017. Do you know of any other studies done after this that measured human-AI performance? Do they agree that there’s a gap between humans and AIs?

2. If you have experience training visual conversational bots, do you know if a bot prefers some kinds of information over others? What is the most significant difference between a bot trained with supervised learning and one trained with reinforcement learning?

3. In this study, the MTurk workers were asked to make a guess after a fixed number of questions. The study does not measure the minimum or maximum number of questions needed, on average, to make an accurate guess. Do you think the accuracy of the guesses will increase proportionally as the number of questions increases? If not, what kind of curve do you think it will follow?


03/25/20 – Sukrit Venkatagiri – Evaluating Visual Conversational Agents

Paper: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating Visual Conversational Agents via Cooperative Human-AI Games. In Fifth AAAI Conference on Human Computation and Crowdsourcing. Retrieved January 22, 2020 from https://aaai.org/ocs/index.php/HCOMP/HCOMP17/paper/view/15936

Summary: The paper measures human-AI team performance and compares it to AI-AI performance. The authors make use of a game called GuessWhich to evaluate visual conversational agents (visual question-and-answer agents). GuessWhich leverages the fact that humans and the AI system, ALICE, have to interact, or converse, with each other.

In the game, there are two primary agents: the one who asks questions (the questioner) and the one who answers them (the answerer). The answerer is shown a secret image and responds to questions about it, while the questioner uses the answers to identify the correct image from a fixed set of images. In the human-AI team, the agent that answers questions is ALICE, and the “agent” that asks them is a human. Here, performance is measured by the number of questions and guesses needed to arrive at the correct image. There is also a QuestionerBot, or QBot, that is used instead of a human to compare human-AI performance against AI-AI performance, that is, ALICE-human versus ALICE-QBot.
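A hedged sketch of how the team-performance numbers described above might be aggregated: per-game guess counts for the two ALICE conditions, their means, and a nonparametric significance test. The guess counts are fabricated, and the choice of test is just one reasonable option, not necessarily what the authors used.

```python
# Hypothetical aggregation of the metric above: guesses needed per game,
# compared across the ALICE_SL and ALICE_RL conditions. Data are made up.
from statistics import mean
from scipy.stats import mannwhitneyu

guesses_sl = [2, 3, 1, 4, 2, 3, 2, 5, 1, 3]   # fabricated per-game guess counts
guesses_rl = [3, 2, 2, 4, 3, 3, 1, 5, 2, 3]

print(f"mean guesses  SL: {mean(guesses_sl):.2f}   RL: {mean(guesses_rl):.2f}")
stat, p = mannwhitneyu(guesses_sl, guesses_rl, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")  # large p => no significant difference
```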

The paper further discusses the challenges faced with these approaches, including the difficulty of obtaining robust question-answer pairs and the fact that humans may or may not learn from the AI, among other such challenges. Finally, the paper concludes that ALICE-RL, the higher-performing or “state of the art” version in AI-AI (ALICE-QBot) evaluation, shows no such advantage when paired with humans. This further points to the increasing disconnect between AI development that occurs independent of human input and human-in-the-loop interactive systems.

Reflection: This paper foregrounds a very important challenge: the gap between AI research and development and its use in the real world with actual human beings. One thing I found interesting in this paper is that AI systems are ever-dependent on humans for input. Similar to what Gray and Suri mention in their book, Ghost Work, as AI advances, there is a need for humans at the frontier of AI’s capabilities. This is known as AI’s “last mile” problem, and it will probably never cease to exist. This is an interesting paradox, where AI seeks to replace humans, only to need humans to do a new type of task.

I think this is one of the major limitations of developing AI independent of real-world applications and usage. If people only use synthetic data, and toy cases within the lab, then AI development cannot really advance in the real world. Instead, AI researchers should strive to work with human computation and human–computer interaction people to further both groups’ needs and requirements. This has even been used in Google Duplex, where a human takes over when the AI is unable to perform well.

Further, I find that there are some limitations to the paper, such as the experimental setup and the dataset that was used. I do not think that QBot was representative of a useful AI, and the questions it asked were not on par with human questions. I also believe that QBot needed to be able to dynamically learn from and interact with the human, which would make for a fairer comparison between AI-AI and human-AI teams.

Questions:

  1. How might VQA challenges be made more difficult and applicable to the real world?
  2. What are the other limitations of this paper? How can they be overcome?
  3. How would you use such an evaluation approach in your class project?
  4. There’s a lot of interesting data generated from the experiments, such as user trust and beliefs. How might that be useful in improving AI performance?


03/25/20 – Fanglan Chen – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Chattopadhyay et al.’s paper “Evaluating Visual Conversational Agents via Cooperative Human-AI Games” explores the research question of how progress in AI-AI evaluation translates to the performance of human-AI teams. To effectively deal with real-world problems, an AI system faces the challenge of adapting its performance to humans. Existing work measures progress in AI in isolation, without a human-in-the-loop design. Take visual conversational agents as an example: recent work typically evaluates how well agent pairs perform on goal-based conversational tasks rather than response retrieval from fixed dialogs. The researchers propose to evaluate the performance of AI agents in an interactive setting and design a human computation game, named GuessWhich, to continuously engage humans with agents in a cooperative way. By evaluating the collaborative success between humans and two versions of the AI agent (supervised learning and reinforcement learning), their experiments find that there is no significant difference in performance between the two versions when paired with human partners, which suggests a disconnect between AI-AI and human-AI evaluations.

Reflection

This paper conducts an interesting study by developing a human computation game to benchmark the performance of visual conversational agents as members of human-AI teams. Nowadays, we are increasingly interacting with intelligent and highly communicative technologies throughout our daily lives. For example, companies automate communication with their customers to make purchases more efficient and streamlined. However, it is difficult to define what success looks like in this case. Do the dialog bots really bring convenience, or are companies putting up a barrier to their customers? Even though this paper proposes a method to evaluate the performance of AI agents in an interactive setting, it does not discuss how to generalize the evaluation to other communication-related tasks.

In addition, I think the task design is worthy of further discussion. The paper uses the number of guesses the human needs to identify the secret image as an indicator of human-AI team performance. When GuessWhich is played among human-human teams, how to strategize the questions seems to be an important component of the game; however, the paper does not give much consideration to question strategies. Would it be helpful if some kind of guideline on communicating with machines were provided to the crowd workers? Is it possible that some questions are clear to humans but ambiguous to machines? Based on the experimental results, the majority of the questions are binary, which are comparatively easier for the conversational agents to answer. One reason for this, I think, is the information given about the secret picture. Take the case presented in the paper as an example: the basic description of the picture is given as “A man sitting on a couch with a laptop.” If we check the picture choices, we can observe that few of them include all of these components. In other words, the basic information provided in the first round makes the secret picture not that “secret” anymore, and the given description is enough to narrow the choices down to two or three candidates. In this scenario, the role the visual conversational agent plays in the human-AI team is minimized and difficult to evaluate precisely.
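A toy illustration of this point: simply matching the caption’s keywords against (hypothetical) image tags can already narrow the pool to a couple of candidates before any dialog happens. The pool, tags, and keyword extraction below are made up for illustration.

```python
# Matching caption keywords against hypothetical image tags: the caption alone
# already narrows the pool. All data here are fabricated for illustration.
caption = "A man sitting on a couch with a laptop"
keywords = {"man", "couch", "laptop"}

pool_tags = {
    "img_0": {"man", "couch", "laptop", "indoors"},
    "img_1": {"man", "couch", "laptop", "dog"},
    "img_2": {"woman", "couch", "laptop"},
    "img_3": {"man", "bench", "phone"},
    "img_4": {"man", "desk", "laptop"},
}

candidates = [img for img, tags in pool_tags.items() if keywords <= tags]
print("Candidates consistent with the caption:", candidates)  # only two remain
```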

Discussion

I think the following questions are worthy of further discussion.

  • Is it possible to generalize the evaluation method proposed in the paper to other communication-related tasks? Why or why not?
  • In the literature, the AI agent fine-tuned with reinforcement learning has been found to perform better than its supervised learning counterpart. This paper finds no significant difference in accuracy between the two versions when evaluated via a human-ALICE team. What reasons can you think of to explain this?
  • Do you think there are any improvements that can be made to the experimental design?
  • What do you think are the challenges that come with human-in-the-loop evaluations?


03/25/2020 – Bipasha Banerjee – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary 

This paper aims to measure AI performance using a human-in-the-loop approach, as opposed to sticking only to traditional benchmark scores. For this purpose, the authors evaluate the performance of a visual conversational agent interacting with humans, effectively forming a human-AI team. GuessWhich is the human computation game that was designed to study these interactions. The visual conversational agent, named ALICE, comes in two versions: ALICE_SL, trained in a supervised manner on a visual dialog dataset, and ALICE_RL, pre-trained with supervised learning and fine-tuned with reinforcement learning. Both the human and the AI components of the experiment need to be aware of each other’s imperfections and must infer information as and when needed. The experiments were performed with Amazon Mechanical Turk workers, and the AI component was based on the ABOT from [1], which had been found to be the most effective. It was found that the human-ALICE_SL teams performed slightly better than the human-ALICE_RL teams (though not significantly), which is contrary to the performance when only using AI. Hence, it suggests that AI benchmarking tools do not accurately represent performance when humans are present in the loop.

Reflection

The paper proposes a novel interaction to include humans in the loop when using AI conversational agents. From an AI perspective, standard evaluation metrics like F1, precision, and recall are used to gauge the performance of a model. The paper builds on previous work that considered only AI models interacting with each other, where the reinforcement learning model was found to perform considerably better than a standard supervised one. However, when humans are in the loop, the supervised learning model performs on par with (or slightly better than) its reinforcement learning counterpart. This signifies that our current AI evaluation techniques do not effectively take the human context into account.
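As a quick reminder of what those isolated metrics look like, here is a minimal sketch computing precision, recall, and F1 on fabricated binary predictions; by construction, such numbers say nothing about how the model behaves with a human in the loop, which is exactly the paper’s point.

```python
# Precision/recall/F1 on fabricated binary labels: isolated metrics that do not
# capture human-AI team performance.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```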

The authors mentioned that, at times, people would discover the strengths of the AI system and try to interact with it accordingly. Hence, we can conclude that humans and the AI are both learning from each other to some degree. This is a good way to leverage the strengths of both humans and AI.

It would also be interesting to see how this combination works in identifying images correctly when the task is complex. If the stakes are high and the image-identification task involves both humans and machines, would such combinations work? It was noted that the AI system did end up answering some questions incorrectly, which ultimately led to incorrect guesses by humans. Hence, in order to make such combinations work seamlessly, more testing and training with larger amounts of data are necessary.

Questions

  1. How would we extend this work to other complex applications? For example, what if the AI and humans were required to identify potential security threats, where the stakes are high?
  2. It was mentioned that the game was played for nine rounds. How was this threshold selected? Would it have worked better if the number were greater? Or would that just confuse humans more?
  3. The paper mentions that “game-like setting is constrained by the number of unique workers” who accept their tasks. How can this constraint be mitigated? 

References

[1] Das et al. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. https://arxiv.org/pdf/1703.06585.pdf


03/25/2020 – Ziyao Wang – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

The authors proposed that instead of only measuring AI progress in isolation, it is better to evaluate the performance of the whole human-AI team via interactive downstream tasks. They designed a human-AI cooperative game that requires a human to work with an answerer-bot to identify a secret image, known to the bot but unknown to the human, from a pool of images. At the end of the task, the human needs to identify the secret image from the pool. Though the AI trained by reinforcement learning performs better than the AI trained by supervised learning when paired with an AI questioner, there is no significant difference between them when they work with a human. The result shows that there appears to be a disconnect between AI-AI and human-AI evaluations, which means progress on the former does not seem predictive of progress on the latter.

Reflections:

This paper proposed that there appears to be a disconnect between AI-AI and human-AI evaluations, which is a good point for future research. Compared with hiring people to evaluate models, using an AI system to evaluate other systems is much more efficient and cheaper. However, an AI system approved by such an evaluating system may perform badly when interacting with humans. As the authors show, though the model trained by reinforcement learning performs better than the model trained by supervised learning in AI-AI evaluation, the two kinds of models have similar performance when they cooperate with human workers. A good example is the GAN setup. Even when the generative network passes the evaluation of the discriminative network, humans can still easily distinguish the generated results from real ones in many cases. For example, the generated images on the website thispersondoesnotexist.com pass the discriminative network, yet for many of them we can easily find abnormal parts in the pictures, which reveal that the picture is fake. This finding is important for future research: researchers should not only focus on simulated work environments for their systems, which can yield very different evaluation results; tests in which humans are involved in the workflow are needed to evaluate a trained model.

However, on the other side, even though models that pass only an AI-based evaluation may not meet human needs, such an evaluation is much more efficient than one that involves human judges. From this point of view, though an evaluation in which only AI is involved may not be that accurate, we can still apply this kind of measurement during development. For example, we can let an AI do a first-round evaluation, which is cheap and highly efficient, and then apply human evaluators to assess the systems that pass the AI evaluation. As a result, development can benefit from the advantages of both sides, and the evaluation process can be both efficient and accurate.
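A minimal sketch of the staged evaluation proposed above: a cheap AI-AI evaluation filters candidate models, and only those that pass move on to the more expensive human-in-the-loop evaluation. The model names, scores, and threshold are placeholders, not a real pipeline.

```python
# Staged evaluation sketch: AI-AI filter first, human evaluation for survivors.
# Scoring functions and threshold are hypothetical placeholders.
from typing import Callable, Dict, List

def staged_evaluation(
    models: List[str],
    ai_ai_score: Callable[[str], float],      # e.g., score in ALICE-QBot self-play
    human_ai_score: Callable[[str], float],   # e.g., score in ALICE-human games
    ai_threshold: float,
) -> Dict[str, float]:
    """Run the cheap AI-AI filter, then human evaluation on the survivors."""
    survivors = [m for m in models if ai_ai_score(m) >= ai_threshold]
    return {m: human_ai_score(m) for m in survivors}

# Toy usage with fabricated scores:
ai_scores = {"alice_sl": 0.62, "alice_rl": 0.71, "alice_tiny": 0.40}
human_scores = {"alice_sl": 0.58, "alice_rl": 0.57, "alice_tiny": 0.30}

result = staged_evaluation(
    models=list(ai_scores),
    ai_ai_score=ai_scores.get,
    human_ai_score=human_scores.get,
    ai_threshold=0.5,
)
print(result)  # only alice_sl and alice_rl reach the human evaluation stage
```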

Questions:

What else can the developed cooperative human-AI game do?

What is the practical use of ALICE?

Is it certain that human-AI performance is more important than AI-AI performance? Is there any scenario in which AI-AI performance is more important?


03/25/2020 – Palakh Mignonne Jude – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

SUMMARY

In this paper, the authors design a cooperative game called GuessWhich (inspired by the 20-Questions game) to measure the performance of human-AI teams in the context of visual conversational agents. The AI system, ALICE, is based on the ABOT developed by Das et al. in a prior study conducted to measure the performance of AI-AI systems. Two variants of ALICE are considered in this study: ALICE_SL (trained in a supervised manner on the Visual Dialog dataset) and ALICE_RL (pre-trained with supervised learning and fine-tuned using reinforcement learning). The GuessWhich game is designed such that the human is the ‘questioner’ and the AI (ALICE) is the ‘answerer’. Both are given a caption that describes an image. While ALICE is shown this image, the human can ask the AI multiple questions (9 rounds of dialog) to better understand the image. After these rounds, the human must select the correct image from a set of distractor images that are semantically similar to the image to be identified. The authors found that, contrary to expectation, improvements in AI-AI performance do not translate to improvements in human-AI performance.
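One plausible way to build such a pool of semantically similar distractors is to take the target image’s nearest neighbors in some image feature space, as sketched below; the random feature vectors stand in for real embeddings, and the paper’s actual selection procedure may differ.

```python
# Nearest-neighbor distractor selection in a feature space. The embeddings here
# are random stand-ins, not the paper's actual features or procedure.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))           # hypothetical image embeddings
features /= np.linalg.norm(features, axis=1, keepdims=True)

target_idx = 42
similarity = features @ features[target_idx]      # cosine similarity to the target
nearest = np.argsort(-similarity)
distractors = [i for i in nearest if i != target_idx][:15]
print("Distractor image indices:", distractors)
```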

REFLECTION

I like the gamification approach that the authors adopted for this study and I believe that the design of the game works well in the context of visual conversational agents. The authors mention how they aimed to ensure that the game was ‘challenging and engaging’. This reminded me of the discussion we had in class about the paper ‘Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research’, of how researchers often put in extra effort to make tasks for crowd workers more engaging and meaningful. I also liked the approach used to identify ‘distractor’ images and felt that this was useful in making the game challenging for the crowd workers.

I thought that it was interesting to learn that the AI-ALICE teams outperformed the human-ALICE teams. I wonder if this is impacted by the fact that ALICE could get some answers wrong and how that might affect the mental model generated by the human. I thought that it was good that the authors took into account knowledge leak and ensured that the human workers could only play a fixed number (10) of games.

I also liked that the authors gave performance-based incentives to the workers that were tied to the success of the human-AI team. I thought that it was good that the authors published the code of their design as well as provided a link to an example game.

QUESTIONS

  1. As part of the study conducted in this paper, the authors design an interactive image-guessing game. Can similar games be designed to evaluate human-AI team performance in other applications? What other applications could be included?
  2. Have any follow-up studies been performed to evaluate the QBOT in a ‘QBOT-human team’? In this scenario, would QBOT_RL outperform QBOT_SL?
  3. The authors found that some of the human workers adopted a single word querying strategy with ALICE. Is there any specific reason that could have caused the humans to do so? Would they have questioned a human in a similar fashion? Would their style of querying have changed if they were unaware if the other party was a human or an AI system?


03/25/20 – Lulwah AlKulaib – VQAGames

Summary

The paper presents a cooperative game between humans and AI called GuessWhich. The game is a live conversational interaction in which the human is given multiple photos as choices while the AI, ALICE, sees only one of them; the human asks ALICE questions to identify which photo is the correct choice. ALICE was trained using both supervised learning and reinforcement learning on a publicly available visual dialog dataset, and both versions were used in the evaluation of human-AI team performance. The authors find no significant difference in performance between ALICE’s supervised learning and reinforcement learning versions when paired with human partners. Their findings suggest that while self-talk and reinforcement learning are interesting directions to pursue for building better visual conversational agents, there appears to be a disconnect between AI-AI and human-AI evaluations: progress in the former does not seem to be predictive of progress in the latter. It is important to note that measuring AI progress in isolation is not as useful for systems that require human-AI interaction.

Reflection

The concept presented in this paper is interesting. As someone who doesn’t work in the HCI field, I found that it opened my eyes to the ways models I have worked on shouldn’t be measured in isolation. The authors showed that evaluating visual conversational agents through a human computation game gives results that differ from our conventional AI-AI evaluation. Thinking about this, I wonder how such methods would apply to tasks in which automated metrics correlate poorly with human judgment, such as natural language generation for image captioning, and how a method inspired by this paper would differ from the methods suggested in last week’s papers. Given the difficulties presented in these tasks and their interactive nature, it is clear that the most appropriate way to evaluate them is with a human in the loop, but how would a large-scale human-in-the-loop evaluation happen, especially with limited financial and infrastructure resources?

This paper made me think of the challenges that come with human in the loop evaluations:

1- In order to have it done properly, we must have a set of clear and simple instructions for crowdworkers.

2- There should be a way to ensure the quality of the crowdworkers. 

3- For the evaluation’s sake, we need uninterrupted communication.

My takeaway from the paper is that while traditional platforms were adequate for evaluation tasks using automatic metrics, there is a critical need to support human-in-the-loop evaluation for free-form multimodal tasks.

Discussion

  • What are the ways that we could use this paper to evaluate tasks like image captioning?
  • What are other challenges that come with human in the loop evaluations?
  • Is there a human-AI benchmark in the field of your project? How would you ensure that your results are comparable?
  • How would you utilize the knowledge about human-AI evaluation in your project?
  • Have you worked on evaluations with a human in the loop? What was your experience?


03/25/2020 – Nurendra Choudhary – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

In this paper, the authors measure the performance of human-AI teams instead of isolated AI. They employ a GuessWhich game in the context of a visual conversation agent system. The game works on interaction between humans and the AI system called ALICE.

The game includes two agents: the questioner and the answerer. The answerer has access to the secret image; the questioner asks it questions and, based on the answers, tries to guess the correct image from an exhaustive set of images. For the human-AI team, the answerer is ALICE and the questioner is the human. Performance is measured in terms of the number of questions needed for the correct guess. The authors also utilize a QBot (Questioner Bot) instead of the human for a comparative analysis between ALICE-human and ALICE-QBot teams.

The authors discuss various challenges with these approaches, such as robustness to incorrect question-answer pairs and humans learning about the AI, among others. They conclude that ALICE-RL, the state of the art in the AI literature, performs no better than ALICE-SL when paired with humans, despite doing so when paired with QBot. This highlights the disconnect between isolated AI development and development in teams with humans.

Reflection

The paper discusses an important problem: the disconnect between isolated AI development and its real-world usage with humans in the loop. However, I feel there are some drawbacks in the experimental setup. On the QBot side, I do not agree with the temporally dynamic nature of the questioning. I think QBot should get access to a perfect set of questions (from humans) when generating each new question. This would make the comparison fairer and less dependent on QBot’s own performance.

An interesting point is the dependence of AI on humans. A perfect AI system should not rely on humans; however, current AI systems rely on humans to be useful in the real world. This leads to a paradox where we need to make AI systems human-compliant while moving towards the larger goal of building independent AI.

To achieve the larger goal, I believe isolated development of AI is crucial. However, the systems also need to contribute to human society. For this, I believe we should utilize variants of the underlying system to support human behavior. This approach supports isolated development and additionally collects auxiliary data about human behavior, which can further improve the AI’s performance. This approach is already being applied effectively. For example, in the case of Google Translate, the underlying neural network model was developed in isolation. Human corrections to its translations provide auxiliary information and also improve the human-AI team’s overall performance. This leads to a significant overall improvement in the translator’s performance over time.

Questions

  1. Is it fair to use the GuessWhich game as an indicator of AI’s success? Shouldn’t we rely on the final goal of an AI to better appreciate the disconnect?
  2. Should the human-AI teams just be part of evaluation or also the development part? How would we include them in the development phase for this problem?
  3. The performance of ALICE relies on the QBot mechanism. Could we use human input to improve QBot’s question generation mechanism to improve its robustness?
  4. The experiment provides a lot of auxiliary data such as correct questions, relevance of questions to the images and robustness of bots with respect to their own answers. Can we integrate this data into the main architecture in a dynamic manner?



03/25/20 – Lee Lisle – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Chattopadhyay et al.’s work details the problems with the current (pre-2018) methods of evaluating visual conversational agents. These agents, which are AIs designed to discuss what is in pictures, were typically evaluated through one AI (the primary visual conversational agent) describing a picture while another asked questions about it. However, the authors show how this kind of interaction does not adequately reflect how humans would converse with the agent. They use two visual conversational agents, dubbed ALICE_SL and ALICE_RL (for supervised and reinforcement learning, respectively), to play 20 questions with AMT workers. They found no significant difference in the performance of the two versions of ALICE. This stood in contrast to previous work, which found that ALICE_RL was significantly better than ALICE_SL when tested in AI-AI teams. Both ALICEs perform better than random chance, however. Furthermore, AI-AI teams require fewer guesses than the humans in human-AI teams.

Personal Reflection

I first found that their name for 20 Questions was GuessWhat or GuessWhich. This has relatively little to do with the paper, but it was jarring to me at first.

The first thing that struck me was their discussion of the previous methods. If the first few rounds of AI-AI evaluation were monitored, why didn’t they pick up that the interactions weren’t reflective of human usage? If the abnormality didn’t present until later on, could they have monitored late-stage rounds, too? Or was it generally undetectable? I feel like there’s a line of questioning here that wasn’t looked at that might benefit AI as well.

I was amused that, with the paper being all about AI and interactions with humans, they chose the image set to be of medium difficulty based on “manual inspection.” Does this indicate that the AIs don’t really understand difficulty in these datasets?

Another minor quibble is that they say each HIT was 10 games, but then state that they published HITs until they got 28 games completed on each version of ALICE and specify this meant 560 games. They overload the word ‘game’ without describing the actual meaning behind it.

An interesting question that they didn’t discuss investigating further is whether question strategy evolved over time for the humans. Did they change up their style of questions as time went on with ALICE? This might provide some insight as to why there was no significant difference.

Lastly, their discussion on the knowledge leak of evaluating AIs on AMT was quite interesting. I would not have thought that limiting the interaction each turker could have with an AI would improve the AI.

Questions

  1. Of all of the participants who started a HIT on AMT, only 76.7% of participants actually completed the HIT. What does this mean for HITs like this? Did the turkers just get bored or did the task annoy them in some way?
  2. The authors pose an interesting question in 6.1 about QBot’s performance. What do you think would happen if the turkers played the role of the answerer instead of the guesser?
  3. While they didn’t find any statistical differences, figure 4(b) shows that ALICE_SL outperformed ALICE_RL in every round of dialogue. While this wasn’t significant, what can be made of this difference?
  4. How would you investigate the strategies that humans used in formulating questions? What would you hope to find?


03/25/2020 – Vikram Mohanty – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Authors: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh

Summary

In this paper, the authors propose coming up with realistic evaluation benchmarks in the context of human-AI teams, as opposed to evaluating AI systems in isolation. They introduce two chatbots, one better than the other on the basis of standalone AI benchmarks. However, in this paper, when they are evaluated in a task setting that mimics their intended use i.e. humans interacting with chatbots to accomplish a common goal, they perform roughly the same. The work essentially suggests a mismatch between benchmarking of AI in isolation and in the context of human-AI teams. 

Reflection

The core contribution of this paper showcases that evaluating AI systems in isolation will never give us the complete picture, and therefore, we should evaluate AI systems under the conditions they are intended to be used in with the targeted players who will be using it. In other words, the need for ecological validity of the evaluation study is stressed here. The flip side of this contribution is, in some ways, being reflected in the trend of AI systems falling short of their intended objectives in real-world scenarios.  
Even though the GuessWhich evaluation was closer to a real-world scenario than vanilla isolation evaluation methods, it still remains an artificial evaluation. However, the gap with a possible real-world scenario (where a user is actually interacting with a chatbot to accomplish some real-world task like planning a trip) would be minimal. 
The responses returned by the two bots are not wildly different (beige vs. brown), since one was the base for the other, and therefore a human brain can adapt dynamically to the chatbot’s responses and accomplish the overall goal. It would also have been interesting to see how performance changes if the AI were drastically different, or sent someone down the wrong path. 
This paper shows why it is important for AI and HCI researchers to work together to come up with meaningful datasets, setting up a realistic ecosystem for an evaluation benchmark that would be more relevant with potential users. 

Questions

  1. If, in the past, you compared algorithms solely on the basis of precision-recall metrics (let’s say, you built an algorithm and compared it with the baseline), do you feel the findings would hold up in a study with ecological validity? 
  2. How’d you evaluate a conversational agent? (Suggest something different from the GuessWhich platform)
  3. How much worse or better (or different) would a chatbot have to be for humans to perform significantly different from the current ALICE chatbots in the GuessWhich evaluation setting? (Any kind of subjective interpretation welcome) 
