3/25/20 – Jooyoung Whang – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

This paper attempts to measure the performance of a visual conversational bot called ALICE when teamed with a human, as opposed to what modern AI studies commonly do, which is measuring it against a counterpart AI. The authors design and deploy two versions of ALICE: one trained by supervised learning and the other fine-tuned by reinforcement learning. The authors had MTurk workers hold a Q&A session with ALICE to discover a hidden image, only shown to ALICE, within a pool of similar images. After a fixed number of questions, the MTurk workers were asked to guess which one was the hidden image. The authors evaluated using the resulting mental rankings of the hidden images after the users' conversations with the AI. Previous work had found that the bot trained using reinforcement learning performed better than the other. However, the authors discover that there is no significant difference when the evaluation happens in a human-AI team.
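For intuition, here is a minimal sketch (my own illustration, not the authors' code) of how such a rank-based score could be computed, assuming each game yields the position of the hidden image in the worker's final ranking of the pool:

```python
# Minimal sketch (not the authors' code): rank-based scoring of GuessWhich games.
# Assumes each game gives us the position of the hidden image in the player's
# final ranking of the candidate pool (1 = top of the list).

def mean_rank(ranks):
    """Average position of the hidden image across games (lower is better)."""
    return sum(ranks) / len(ranks)

def percentile_mean_rank(ranks, pool_size):
    """Fraction of the pool ranked below the hidden image, averaged over games
    (higher is better); a hypothetical analogue of a percentile-style metric."""
    return sum((pool_size - r) / (pool_size - 1) for r in ranks) / len(ranks)

# Hypothetical example: ranks of the hidden image in a pool of 9 candidates.
ranks = [1, 3, 2, 5, 1]
print(mean_rank(ranks))                # 2.4
print(percentile_mean_rank(ranks, 9))  # ~0.83
```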

This paper was a good reminder that the ultimate user at the end is a human. It's easy to forget that what computers prefer does not automatically translate to the human case. It was especially interesting to see that a significant performance difference in an AI-AI setting was rendered minimal with humans in the loop. It made me wonder what it was about the reinforcement-learned ALICE that QBot preferred over the other version. Once we find that distinguishing factor, we might be able to make humans learn and adapt to the AI, leading to improved team performance.

It was a little disappointing that the same study with QBots as the subject was left for future work. I would have loved to see the full picture. It could also have provided insight into what I've written above: what was it about reinforcement learning that QBot preferred?

This paper identified that there is still a good distance between human cognition and AI cognition. If further studies find ways to minimize this gap, it will allow a quicker AI design process, where the resulting AI is effective for both humans and other AIs without needing extra adjustments for the human side. It would be interesting to see whether it is possible to train an AI to think like a human in the first place.

These are the questions I had while reading this paper:

1. This paper was presented in 2017. Do you know of any other studies done after this one that measured human-AI performance? Do they agree that there's a gap between humans and AIs?

2. If you have experience training visual conversational bots, do you know if a bot prefers some information over others? What is the most significant difference between a bot trained with supervised learning and one trained with reinforcement learning?

3. In this study, the MTurk workers were asked to make a guess after a fixed number of questions. The study does not measure the minimum or maximum number of questions needed on average to make an accurate guess. Do you think the accuracy of the guesses will increase proportionally as the number of questions increases? If not, what kind of curve do you think it will follow? (A small curve-fitting sketch follows below.)
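Following up on question 3, here is a purely illustrative curve-fitting sketch with made-up numbers, showing how one could test whether guess accuracy saturates rather than growing proportionally with the number of questions:

```python
# Illustrative sketch with hypothetical data: fit a saturating curve to
# accuracy-vs-number-of-questions and compare it to a straight line.
import numpy as np
from scipy.optimize import curve_fit

questions = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
accuracy = np.array([0.20, 0.32, 0.41, 0.47, 0.52, 0.55, 0.57, 0.58, 0.59])  # made-up numbers

def saturating(q, a, k):
    """Accuracy approaches ceiling a at rate k as the number of questions grows."""
    return a * (1.0 - np.exp(-k * q))

params, _ = curve_fit(saturating, questions, accuracy, p0=[0.6, 0.3])
linear = np.polyfit(questions, accuracy, 1)

print("saturating fit (ceiling, rate):", params)
print("linear fit (slope, intercept):", linear)
```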


03/25/20 – Lulwah AlKulaib – AllWorkNoPlay

Summary

The paper studies a field deployment of a question-and-answer chatbot in the field of human resources. The authors focus on the users' conversational interactions with the chatbot. The HR chatbot provided company-related information assistance to 377 new employees for 6 weeks. The authors' motivation was that conversational interactions carry rich signals that can be used for inferring user status; these signals could be utilized to develop agents that adapt in terms of functionality and interaction style. By contrasting the signals, they show the various functions of conversational interactions. The authors discuss design implications for conversational agents and directions for developing adaptive agents based on users' conversational behaviors. In their paper, they try to address two main research questions:

• RQ1: What kinds of conversational interactions did users have with the QA agent in the wild?

• RQ2: What kinds of conversational interactions can be used as signals for inferring user satisfaction with the agent’s functional performance, and playful interactions?

They answer RQ1 by presenting a characterization of the users' conversational input and high-level conversational acts. After providing this characterization of the conversational interactions, the authors study what signals exist in them for inferring user satisfaction (RQ2).

Reflection

In the paper, the authors study and analyze conversations as signals of user satisfaction (RQ2). I found that part most interesting, as their results show that users were fairly divided in opinion when it comes to the chatbot's functionality and playfulness. This means that there's a need for adapting system functions and interaction styles to different users.

This observation makes me think of other systems with human-in-the-loop interaction and how system functions and interaction styles would affect users' satisfaction. In systems that aren't chatbot-based, how is that satisfaction measured? Also, when thinking of systems that handle a substantial amount of interaction, would it be different? Does it matter if satisfaction is self-reported by the user? Or would it be better to measure it based on their interaction with the system?

The paper acknowledges that the results are based on survey data as a limitation. The authors mention that they had a response rate of 34%, which means that they can't rule out self-selection bias. They also acknowledge that some observations might be specific to the workplace context and the user sample of the study.

The results in this paper provide some understanding of the functions of conversational behaviors with conversational agents, derived from human conversations. I would love to see similar resources for other, non-conversational systems and how user satisfaction is measured there.

Discussion

  • Is user satisfaction an important factor/evaluation method in your project?
  • How would you quantify user satisfaction in your project?
  • Would you measure satisfaction using a self-reported survey by the user? Or would you measure it based on the user's interaction with the system? And why?
  • Did you notice any other limitations in this paper other than the ones mentioned?


03/25/20 – Fanglan Chen – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Chattopadhyay et al.'s paper "Evaluating Visual Conversational Agents via Cooperative Human-AI Games" explores the research question of how progress in AI-AI evaluation translates to the performance of human-AI teams. To deal effectively with real-world problems, an AI system faces the challenge of adapting its performance to humans. Existing works measure progress in AI in isolation, without a human-in-the-loop design. Taking visual conversational agents as an example, recent work typically evaluates how well agent pairs perform on goal-based conversational tasks instead of response retrieval from fixed dialogs. The researchers propose to evaluate the performance of AI agents in an interactive setting and design a human computation game, named GuessWhich, to continuously engage humans with agents in a cooperative way. By evaluating the collaborative success between humans and two versions of the AI agent (one trained with supervised learning, one fine-tuned with reinforcement learning), their experiments find no significant difference in performance between the two versions when paired with human partners, which suggests a disconnect between AI-AI and human-AI evaluations.

Reflection

This paper conducts an interesting study by developing a human computation game to benchmark the performance of visual conversational agents as members of human-AI teams. Nowadays, we are increasingly interacting with intelligent and highly communicative technologies throughout our daily lives. For example, companies automate communication with their customers to make purchases more efficient and streamlined. However, it is difficult to define what success looks like in this case. Do the dialog bots really bring convenience, or are companies putting up a barrier to their customers? Even though this paper proposes a method to evaluate the performance of AI agents in an interactive setting, it does not discuss how to generalize the evaluation to other communication-related tasks.

In addition, I think the task design is worthy of further discussion. The paper uses the number of guesses the human needs to identify the secret image as an indicator of human-AI team performance. When GuessWhich is played by human-human teams, question strategy seems to be an important component of the game; however, the paper does not give much consideration to question strategies. Would it be helpful if some kind of guideline on communicating with machines were provided to the crowd workers? Is it possible that some questions are clear to humans but ambiguous to machines? Based on the experimental results, the majority of the questions are binary, which are comparatively easier for the conversational agents to answer. One reason behind this, I think, is the information about the secret picture given up front. In the case presented in the paper, the basic description of the picture is given as "A man sitting on a couch with a laptop." If we check the picture choices, we can observe that few of them include all of these components. In other words, the basic information provided in the first round made the secret picture not that "secret" anymore, and the given description is enough to narrow the choices down to two or three candidates. In this scenario, the role visual conversational agents play in the human-AI team is minimized and difficult to evaluate precisely.

Discussion

I think the following questions are worthy of further discussion.

  • Is it possible to generalize the evaluation method proposed in the paper to other communication-related tasks? Why or why not?
  • In the literature, AI agents fine-tuned with reinforcement learning have been found to perform better than their supervised learning counterparts. This paper finds that the accuracy of the two versions shows no significant difference when evaluated via a human-ALICE team. What reasons can you think of to explain this?
  • Do you think there are any improvements that can be made to the experimental design?
  • What do you think are the challenges that come with human-in-the-loop evaluations?


03/25/20 – Sukrit Venkatagiri – “Like Having a Really Bad PA”: Gulf between User Expectation and Experience

Paper: Ewa Luger and Abigail Sellen. 2016. “Like Having a Really Bad PA”: The Gulf between User Expectation and Experience of Conversational Agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). Association for Computing Machinery, New York, NY, USA, 5286–5297.

Summary: This paper presents findings from 14 semi-structured interviews conducted with users of existing conversational agent systems, and highlights four key areas where current systems fail to support user interaction. It details how conversational agents are increasingly used within online systems, such as banking systems, as well as by larger companies like Google, Facebook, and Apple. The paper attempts to understand end-users and their experiences in dealing with conversational agents, as well as the challenges that they face in doing so—both from the user and the conversational agents’ side. The findings highlight how conversational agents are used by end-users for play, hands-free speech when they are unable to type, for making specific and formal queries, and for simple tasks such as finding out the weather. The paper also talks about how, in most instances, conversational agents are unable to fill the gap between users’ expectations and the actual way the conversational agent behaves, and that incorporating playfulness may be useful. Finally, the paper uses Norman’s gulf of execution and evaluation to provide implications for designing future systems. 

Reflection:
This paper is very interesting, and I have had similar thoughts when using conversational agents in day-to-day life. I also appreciate the use of semi-structured interviews to get at users' actual experiences of using conversational agents and how they differed from their expectations prior to using these CAs.

This work also adds to prior work, confirming the existence of this gap or gulf between expectations and reality, and that users constantly expect more from CAs than CAs are capable of providing. The paper also speaks to the importance of designing conversational agents where user expectations are deliberately set, rather than having users form their own expectations, as we saw in some papers from previous weeks. The authors also discuss emphasizing interaction and constant updates from the CA to improve end-user expectations.

The paper also suggests ways to hold researchers and developers accountable for the promises that they make when designing such systems, and overhauling the system based on user feedback. 

However, rather than just focusing on where conversational agents failed to support user interaction, I wish the paper had also focused on where the system successfully supports user interaction. Further, I wish they had sampled users who were not only novices but also experts, who might have had different expectations. It might be interesting to scale up this work as a survey to see how users’ expectations differ based on the conversational agent that is being used.

Questions:

  1. How would you work to reduce the gulf between expectation and reality?
  2. What are the challenges to building useful and usable conversational AIs?
  3. Why are conversational AIs sometimes so limited? What affects their performance?
  4. Where do you think humans can play a role? I.e. as humans-in-the-loop?


03/25/20 – Fanglan Chen – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Summary

Huang et al.'s paper "Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time" explores a novel crowd-powered system architecture that can support gradual automation. The motivation of the research is that crowd-powered conversational assistants have been shown to achieve better performance than automated systems, but they cannot be widely adopted due to their monetary cost and response latency. Inspired by the idea of combining crowd-powered and automatic approaches, the researchers develop Evorus, a crowd-powered conversational assistant. Powered by three major components (learning to choose chatbots, reusing prior responses, and automatic voting), Evorus can automate more of its responses over time as new chatbots are added. The experimental results show that Evorus can evolve without compromising conversational quality. The proposed framework contributes to the research direction of how automation can be introduced efficiently into a deployed system.

Reflection

Overall, this paper proposes an interesting gradual-automation approach for empowering conversational assistants. One selling point of the paper is that users can converse with the proposed Evorus in open domains instead of limited ones. To achieve that goal, the researchers design the learning framework so that it assigns a slightly higher selection likelihood to newly-added chatbots in order to collect more data on them. I would imagine domain-specific data collection requires a large amount of time. Sufficient data collection in each domain seems important to ensure the quality of open-domain conversation. Similar to cold-start problems in recommender systems, the data collected for different domains is likely imbalanced; for example, certain domains may gain little or no data during the general data collection process. It is unclear how the proposed framework deals with this problem. One direction I can think of is to utilize machine learning techniques such as zero-shot learning (for domains that do not appear in prior conversations) and few-shot learning (for domains rarely discussed in prior conversations) to deal with the imbalanced data collected by the chatbot selector.
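As a rough sketch of the idea discussed above (my own simplification, not Evorus's actual algorithm), a selector could keep per-chatbot acceptance estimates and add an exploration bonus for bots with little data, which naturally favors newly-added or rarely-used domains:

```python
# Sketch of a chatbot selector with an exploration bonus for new bots.
# This is a simplified stand-in for Evorus's learned likelihoods, not the real system.
import math
import random

class ChatbotSelector:
    def __init__(self, bot_names, bonus=1.0):
        self.stats = {name: {"accepted": 0, "shown": 0} for name in bot_names}
        self.bonus = bonus  # weight of the exploration term

    def score(self, name):
        s = self.stats[name]
        rate = s["accepted"] / s["shown"] if s["shown"] else 0.0
        total = sum(b["shown"] for b in self.stats.values()) + 1
        # UCB-style bonus: bots with few observations get boosted.
        return rate + self.bonus * math.sqrt(math.log(total) / (s["shown"] + 1))

    def pick(self):
        return max(self.stats, key=self.score)

    def record(self, name, accepted):
        self.stats[name]["shown"] += 1
        self.stats[name]["accepted"] += int(accepted)

# Hypothetical bot names; acceptance is simulated in place of real crowd votes.
selector = ChatbotSelector(["weather_bot", "smalltalk_bot", "new_domain_bot"])
for _ in range(20):
    bot = selector.pick()
    selector.record(bot, accepted=random.random() < 0.5)
print(selector.stats)
```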

For the second component, the reuse of prior answers seems a good way to reduce the system's computational cost. However, text retrieval can be very challenging. Take lexical ambiguity as an example: polysemous words can hurt the accuracy of the retrieved results because the instances collected from the corpus mix the different contexts in which a polysemous word occurred. If the retrieval component cannot handle lexical ambiguity well, the reuse of prior answers may surface irrelevant responses to user conversations, which could introduce errors into the results.
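A minimal retrieval sketch (my own illustration, not Evorus's implementation) shows both the appeal and the limitation: plain TF-IDF matching reuses prior answers cheaply, but it has no sense of word sense, so a polysemous query like "book" can pull up responses from the wrong context.

```python
# Sketch: reuse prior responses via TF-IDF similarity, with a threshold.
# Bag-of-words retrieval like this cannot disambiguate polysemous words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prior_messages = [
    "can you book a table for dinner tonight",
    "what is a good book to read on machine learning",
    "book me a flight to Seattle next Monday",
]
prior_responses = [
    "Sure, which restaurant and what time?",
    "People often recommend 'Pattern Recognition and Machine Learning'.",
    "Okay, searching flights to Seattle for next Monday.",
]

vectorizer = TfidfVectorizer()
prior_vecs = vectorizer.fit_transform(prior_messages)

def reuse_response(query, threshold=0.3):
    """Return the most similar prior response, or None to fall back to the crowd."""
    sims = cosine_similarity(vectorizer.transform([query]), prior_vecs)[0]
    best = sims.argmax()
    return prior_responses[best] if sims[best] >= threshold else None

print(reuse_response("book something for dinner"))  # may match either sense of 'book'
```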

In the design of the third component, both workers and the vote bot can upvote suggested responses. A response candidate must accumulate sufficient vote weight before Evorus accepts it and sends it to the user. Depending on the vote-weight threshold, the latency could be long. In the design of user-centric applications, it is important to keep latency/runtime in mind. I feel the paper would be more persuasive if it provided supporting experiments on the latency of the proposed system.
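To make the latency concern concrete, here is a toy sketch (my own, with hypothetical weights and threshold) of a threshold-based acceptance rule: a response waits until its accumulated vote weight crosses a threshold or a deadline forces a decision, so the threshold directly trades response quality against response time.

```python
# Toy sketch of threshold-based response acceptance with a latency deadline.
import time

def wait_for_acceptance(get_votes, threshold=1.0, deadline_s=15.0, poll_s=0.5):
    """Poll accumulated vote weight until it passes the threshold or time runs out."""
    start = time.time()
    while time.time() - start < deadline_s:
        weight = sum(get_votes())  # e.g., worker upvotes plus an automatic vote bot
        if weight >= threshold:
            return True, time.time() - start
        time.sleep(poll_s)
    return False, deadline_s

# Hypothetical usage: two votes with weights 0.5 and 0.6 arrive for a candidate.
accepted, latency = wait_for_acceptance(lambda: [0.5, 0.6], threshold=1.0)
print(accepted, round(latency, 2))
```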

Discussion

I think the following questions are worthy of further discussion.

  • Do you think it is important to ensure users can converse with the conversational agents in open domains in all scenarios? Why or why not?
  • What improvements can you think of that the researchers could make to the Evorus learning framework?
  • At what point do you think the conversational agent can become purely automatic, or is it better to always have a human-in-the-loop component?
  • Do you consider utilizing the gradual automated learning framework in your project? If yes, how are you going to implement it?


03/25/20 – Sukrit Venkatagiri – Evaluating Visual Conversational Agents

Paper: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating Visual Conversational Agents via Cooperative Human-AI Games. In Fifth AAAI Conference on Human Computation and Crowdsourcing. Retrieved January 22, 2020 from https://aaai.org/ocs/index.php/HCOMP/HCOMP17/paper/view/15936

Summary: The paper measures human-AI team performance and compares it to AI-AI performance. The authors make use of a game, called GuessWhich, to evaluate visual conversational agents (visual question-and-answer agents). GuessWhich leverages the fact that the human and the AI system, ALICE, have to interact and converse with each other.

In the game, there are two primary agents: one that asks questions (the questioner) and one that answers them (the answerer). The answerer, ALICE, sees the secret image and responds to the questions asked of it, while the questioner tries to identify that image from a fixed pool. In the human-AI team, the human plays the questioner and ALICE answers. Performance is measured by how quickly the questioner arrives at the correct image. There is also a QuestionerBot, or QBot, used instead of a human to compare human-AI performance against AI-AI performance, i.e., ALICE-human versus ALICE-QBot.

The paper further discusses the challenges faced with these approaches, including the difficulty of obtaining robust question-answer pairs and the fact that humans may or may not learn from the AI, among other such challenges. Finally, the paper concludes that ALICE-RL, the "state of the art" agent in AI-AI (ALICE-QBot) evaluation, performs no better than ALICE-SL when paired with humans. This points to the increasing disconnect between AI development that occurs independent of human input and human-in-the-loop interaction systems.

Reflection: This paper foregrounds a very important challenge, that is, the gap between AI research and development, and its use in the real-world with actual human beings. One thing I found interesting in this paper is that AI systems are ever-dependent on humans for input. Similar to what Gray and Suri mention in their book, Ghost Work, as AI advances, there is a need for humans at the frontier of AI’s capabilities. This is known as AI’s “last mile” problem, and will probably never cease to exist. This is an interesting paradox, where AI seeks to replace humans, only to need humans to do a new type of task.

I think this is one of the major limitations of developing AI independent of real-world applications and usage. If people only use synthetic data, and toy cases within the lab, then AI development cannot really advance in the real world. Instead, AI researchers should strive to work with human computation and human–computer interaction people to further both groups’ needs and requirements. This has even been used in Google Duplex, where a human takes over when the AI is unable to perform well.

Further, I find that there are some limitations to the paper, such as the experimental setup and the dataset that was used. I do not think that QBot was representative of a useful AI and the questions were not on par. I also believe that QBot needed to be able to dynamically learn from and interact with the human, making for a more fair comparison between AI-AI and human-AI teams.

Questions:

  1. How might VQA challenges be made more difficult and applicable in the real-world?
  2. What are the other limitations of this paper? How can they be overcome?
  3. How would you use such an evaluation approach in your class project?
  4. There’s a lot of interesting data generated from the experiments, such as user trust and beliefs. How might that be useful in improving AI performance?


03/25/2020 – Bipasha Banerjee – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary 

This paper aims to measure AI performance using a human-in-the-loop approach as opposed to only sticking to traditional benchmark scores. For this purpose, the authors evaluate the performance of a visual conversational agent interacting with humans, effectively forming a human-AI team. GuessWhich is the human computation game that was designed to study these interactions. The visual conversational agent, named ALICE, was pre-trained in a supervised manner on a visual dialog dataset, and reinforcement learning was used to fine-tune one version of the model. Both the human and the AI components of the experiment need to be aware of each other's imperfections and must infer information as needed. The experiments were performed using Amazon Mechanical Turk workers, and the AI component was chosen as the ABot from the paper in [1], which had turned out to be the most effective. ALICE came in two versions: one trained with normal supervised learning (ALICE_SL) and the other fine-tuned with reinforcement learning (ALICE_RL). It was found that the human-ALICE_SL team outperformed the human-ALICE_RL combination, which was contrary to the performance when only using AI. Hence, it shows that AI benchmarking tools do not accurately represent performance when humans are present in the loop.
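As a side note on methodology, here is a minimal sketch (my own illustration, with placeholder numbers rather than the paper's data) of the kind of rank-based significance test one could run to check whether a human-ALICE_SL advantage over human-ALICE_RL is statistically reliable:

```python
# Sketch: compare human-ALICE_SL vs. human-ALICE_RL outcomes with a rank test.
# The numbers below are placeholders, not the paper's data.
from scipy.stats import mannwhitneyu

ranks_sl = [1, 2, 1, 3, 2, 1, 4, 2]  # hypothetical final ranks of the secret image
ranks_rl = [2, 1, 3, 2, 2, 1, 3, 3]

stat, p_value = mannwhitneyu(ranks_sl, ranks_rl, alternative="two-sided")
print(f"U={stat}, p={p_value:.3f}")  # a large p suggests no detectable difference
```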

Reflection

The paper proposes a novel interaction to include humans in the loop when using AI conversational agents. From an AI perspective, we look at standard evaluation metrics like F1, precision, and recall to gauge the performance of the model being used. The paper builds on previous work that considered only AI models interacting with each other, where it was found that a reinforcement learning model performed far better than a standard supervised technique. However, when humans are involved in the loop, the supervised learning model performs better than its reinforcement-learned counterpart. This signifies that our current AI evaluation techniques do not effectively take the human context into account.

The authors mentioned that, at times, people would discover the strengths of the AI system and try to interact accordingly. Hence, we can conclude that humans and AI are both learning from each other to some extent. This is a good way to leverage the strengths of both humans and AI.

It would also be interesting to see how this combination works in identifying images correctly when the task is complex. If the stakes are high and the image-identification task involves both humans and machines, would such combinations work? It was noted that the AI system did end up answering some questions incorrectly, which ultimately led to incorrect guesses by humans. Hence, in order to make such combinations work seamlessly, more testing and training with vast amounts of data are necessary.

Questions

  1. How would we extend this work to other, more complex applications? Suppose the AI and humans were required to identify potential threats (where the stakes are high) in a security setting?
  2. It was mentioned that the game was played for nine rounds. How was this threshold selected? Would it have worked better if the number was greater? Or would it rather confuse humans more?
  3. The paper mentions that “game-like setting is constrained by the number of unique workers” who accept their tasks. How can this constraint be mitigated? 

References

[1] https://arxiv.org/pdf/1703.06585.pdf


03/24/2020 – Akshita Jha – “Like Having a Really bad PA”: The Gulf between User Expectation and Experience of Conversational Agents

Summary:
"Like Having a Really Bad PA: The Gulf between User Expectation and Experience of Conversational Agents" by Luger and Sellen talks about conversational agents and the gap between user expectations and the responses given by the conversational agent. Conversational agents have been on the rise for quite some time now. All the big, well-known companies like Apple, Microsoft, Google, and IBM have their own proprietary conversational agents. The authors report the findings of interviews with 14 end-users in order to understand the interactional factors affecting everyday use. The findings show that end-users use conversational agents: (i) as a form of play, (ii) for a hands-free approach, (iii) for formal queries, and (iv) for simple tasks. The authors use Norman's "gulfs of execution and evaluation" and infer the possible implications of their findings for the design of future systems. The authors found that in the majority of instances the conversational agent was unable to fill the gap between user expectation and how the agent actually operates. It was also found that incorporating playful triggers and trigger responses in the systems increased human engagement.

Reflection:
This is an interesting work, as it talks about the gap between user expectations and system behavior, especially in the context of conversational agents. The researchers confirm that there is a "gulf" between expectation and reality and that end-users continually overestimate the amount of demonstrable intelligence that the system possesses. The authors also emphasized the importance of the design and interactability of these conversational agents to make them better suited for engaging users. Users expect the chatbot to converse like a human, but in reality, AI is far from it. The authors suggest considering ways (a) to reveal system intelligence, (b) to change the interactability to reflect the system's capability, (c) to rein in the promises made by the scientists, and (d) to revamp the system feedback given. A limitation of the study is that the sample is predominantly male and drawn from the UK; the findings presented in the paper might therefore be skewed. The primary use case for a conversational agent, not surprisingly, was "hands-free" usage for saving time. However, if the conversational agent results in an error, the process becomes more cumbersome and time-consuming than originally typing in the query. User tolerance in such cases might be low and lead to distrust, which can negatively affect the feedback the conversational agents receive. The authors also talk about the different approaches that end-users take to interact with Google Now vs. Siri. It would be interesting to see how user behavior changes with different conversational agents.

Questions:
1. What are your views about conversational agents?
2. Which conversational agent do you think performs the best? Why?
3. As a computer scientist, what can you do to make the end-users more aware of the limitations of conversational agents?
4. How can we best incorporate feedback into the system?
5. Do you think using multimodal representations of intelligence is the way forward? What challenges do you see in using such a form of representation?


03/25/2020 – Ziyao Wang – All Work and No Play? Conversations with a Question-and-Answer Chatbot in the Wild

The authors recruited 337 participants with diverse backgrounds. The participants were required to use CHIP, a QA agent system, for five to six weeks, after which they were asked to complete a survey about their use of the system. The authors then analyzed the survey results to find out what kinds of conversational interactions users had with the QA agent in the wild and what signals could be used for inferring user satisfaction with the agent's functional performance and playful interactions. Finally, the authors characterized users' conversational interactions with a QA agent in the wild, suggested signals for inferring user satisfaction that can be used to develop adaptive agents, and provided a nuanced understanding of the underlying functions of users' conversational behaviors, such as distinguishing conversations with instrumental purposes from conversations with playful intentions.

Reflections:

The QA agent is an important application of AI technology. Though this kind of system was designed to work as a secretary who knows everything that can be reached through the Internet, and it does well in conversations that serve primarily instrumental purposes, it can perform badly in conversations with playful intentions. For example, Siri can help you call someone, schedule an Uber, or search for the instructions for your device. However, when you are happy about something and sing a song to it, it is not able to understand your meaning and may disappoint you by responding that it cannot understand you. This is a hard task, as each person has their own habits and the AI can hardly distinguish whether a conversation has a playful intention or an instrumental purpose. As instrumental conversations are more important, the system tends to assume most conversations have instrumental purposes. This assumption ensures that no instrumentally-purposed conversation is missed; however, it may decrease users' satisfaction with the system when they want to play with the agent. Though developers understand this fact, it is still hard for an AI system to distinguish the purpose of a conversation.

This situation can be changed with the findings in this paper. The results show us how to distinguish the purpose of conversations and evaluate whether the user is satisfied with a conversation or not. As a result, the agent system can adapt itself to meet users' needs and increase users' satisfaction. Developers of QA agent systems should therefore consider the characterized forms of conversational interactions and the signals for inferring user satisfaction in their future development. I think QA agents in the future will become more adaptive to users' habits, and user satisfaction with these agents will increase.
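As a sketch of how such signals might feed a self-adaptive agent (my own illustration; the feature names and data are hypothetical, not the study's signals), a simple classifier over per-user conversational features could estimate satisfaction and trigger a change in interaction style:

```python
# Sketch: predict user satisfaction from conversational signals, then adapt.
# Features and data are hypothetical placeholders, not the study's signals.
from sklearn.linear_model import LogisticRegression

# Per-user features: [rephrased_queries, playful_utterances, closing_thanks, off_topic]
X = [
    [5, 0, 0, 1],
    [1, 3, 2, 0],
    [4, 1, 0, 2],
    [0, 2, 3, 0],
]
y = [0, 1, 0, 1]  # 0 = unsatisfied, 1 = satisfied (hypothetical labels)

model = LogisticRegression().fit(X, y)

def adapt(features):
    """Pick an interaction style based on the predicted satisfaction."""
    satisfied = model.predict([features])[0]
    return "keep playful style" if satisfied else "offer help / clarify capabilities"

print(adapt([3, 0, 1, 1]))
```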

Questions:

How can we make use of the characterized forms of conversational interactions? Are there any suggestions about what the agent should respond in each kind of conversation?

With the signals in conversational interactions for inferring user satisfaction, how can we develop a self-adaptive agent system?

Do young people use QA agents the most compared with other groups of people? What kinds of participants should also be recruited to extend the coverage of the findings?


03/25/2020 – Ziyao Wang – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

The authors proposed that instead of only measuring AI progress in isolation, it is better to evaluate the performance of the whole human-AI team via interactive downstream tasks. They designed a human-AI cooperative game that requires a human to work with an answerer bot to identify a secret image, known to the bot but unknown to the human, from a pool of images. At the end of the task, the human needs to identify the secret image from the pool. Though the AI trained by reinforcement learning performs better than the AI trained by supervised learning when paired with an AI questioner, there is no significant difference between them when they work with humans. The result shows that there appears to be a disconnect between AI-AI and human-AI evaluations: progress on the former does not seem predictive of progress on the latter.

Reflections:

This paper proposed that there appears to be a disconnect between AI-AI and human-AI evaluations, which is a good point for future research. Compared with hiring people to evaluate models, using an AI system to evaluate other systems is cheaper and much more efficient. However, an AI system approved by such an evaluating system may perform badly when interacting with humans. As the authors show, though the model trained by reinforcement learning performs better than the model trained by supervised learning in AI-AI evaluation, the two kinds of models have similar performance when they cooperate with human workers. A good example is the GAN framework: even when the generative network passes the evaluation of the discriminative network, humans can still easily distinguish many generated results from real ones. For example, the generated images on the website thispersondoesnotexist.com passed the discriminative network; however, for many of them we can easily find abnormal parts in the pictures that reveal they are fake. This finding is important for future research: researchers should not focus only on simulated working environments, which can yield very different results when evaluating a system's performance. Tests in which humans are involved in the workflow are needed to evaluate a trained model.

However, on the other side, even though AI-evaluated models may not fully meet human needs, the evaluation process is much more efficient than one that involves human evaluation. From this point of view, though an evaluation in which only AI is involved may not be that accurate, we can still apply this kind of measurement during development. For example, we can let the AI do a first round of evaluation, which is cheap and highly efficient, and then apply human evaluators to assess the AI-evaluated systems. As a result, development can benefit from the advantages of both sides, and the evaluation process can be both efficient and accurate.
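A rough sketch of this two-stage idea (my own illustration, with a made-up automatic metric and stub human judge): run a cheap automatic evaluation over all candidate models first, and send only the ones that clear a bar on to the more expensive human evaluation.

```python
# Sketch: cheap automatic evaluation as a first-round filter, humans as the second round.

def automatic_score(model_outputs, references):
    """Placeholder metric: exact-match rate between model outputs and references."""
    matches = sum(o == r for o, r in zip(model_outputs, references))
    return matches / len(references)

def two_stage_evaluation(candidates, references, human_eval, auto_threshold=0.5):
    survivors = {name: outs for name, outs in candidates.items()
                 if automatic_score(outs, references) >= auto_threshold}
    # Only the survivors incur the cost of human evaluation.
    return {name: human_eval(outs) for name, outs in survivors.items()}

# Hypothetical usage with made-up outputs and a stub human judge.
refs = ["yes", "no", "a red car"]
candidates = {
    "model_sl": ["yes", "no", "a red car"],
    "model_rl": ["yes", "yes", "a dog"],
}
print(two_stage_evaluation(candidates, refs, human_eval=lambda outs: "pending review"))
```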

Questions:

What else can the developed cooperative human-AI game do?

What is the practical use of ALICE?

Is it for sure that human-AI performance is more important than AI-AI performance? Is there any scenario in which AI-AI performance is more important?
