03/25/2020 – Dylan Finch – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Word count: 568

Summary of the Reading

This paper contributes to the field of human-AI interaction by presenting a new way to evaluate AI agents. Most evaluations of AI systems are done in isolation, with no human input: one AI system interacts with another, and their combined interaction forms the basis of the evaluation. This research instead brings humans into the loop, replacing one of the AI systems with a person, to better evaluate how AI works in a more realistic scenario: one where humans are present. The paper finds that these two evaluation methods can produce different results. Specifically, when comparing the AI systems, the one that performed worse when evaluated with another AI system actually performed better when evaluated with a human. This raises important questions about the way we test AI systems and suggests that testing should be more human-focused.

Reflections and Connections

I think that this paper highlights an important issue that I had never really thought about. Whenever we build any kind of new tool or system, it must be tested, and this testing process is extremely important in deciding whether or not the system works. The way we design tests is just as important as the way we design the system in the first place. If we design a great system but a bad test, and the test says the system doesn't work, we have lost a good idea because of a bad test. I think this paper will make me think more critically about how I design my tests in the future. I will put more care into them and make sure that they are well designed and will give me the results that I am looking for.

When these ideas are applied to AI, I think they get even more interesting. AI systems can be extremely hard to test, and it is often much easier to design another automated system, whether another AI system or just an automated script, to test an AI system than to recruit real people to test it. Machines don't require IRB approval, machines are available 24/7, and machines provide consistent results. However, when we are designing AI systems, and especially AI systems that are meant to be used by humans, it is important that we test them with humans. We cannot truly know whether a system designed to work with humans actually works until we test it with humans.

I hope that this paper will push more research teams to use humans in their testing. Especially with crowdsourcing platforms like MTurk, it is easier and cheaper than ever to get humans to test your systems.

Questions

  1. What other kinds of systems should use humans in testing, rather than bots or automated systems?
  2. Should all AI systems be tested with humans? When is it ok to test with machines?
  3. Should we be more skeptical of past results, considering that this paper showed that an evaluation conducted with machines actually produced a wrong result (the wrong ALICE bot was chosen as better by machines)?


03/25/2020 – Sushmethaa Muhundan – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

The primary intent of this paper is to measure the performance of human-AI teams, here in the context of the AI being a visual conversational agent, using a cooperative game. Oftentimes the performance of AI systems is evaluated in isolation or with respect to interaction with other AI systems. This paper attempts to understand whether such AI-AI performance evaluation can be extended to predict the performance of the AI system while it interacts with humans, which is essentially the human-AI team performance. To measure the effectiveness of AI in the context of human-AI teams, a game-with-a-purpose (GWAP) called GuessWhich is used. The game involves a human player interacting with an AI component, wherein the AI answers questions about a secret image that the human is unaware of. Via this question-answer model, the human asks questions regarding the image and attempts to identify the secret image from a pool of images. Two versions of the AI component are used in the experiment: one trained in a supervised manner, and the other pre-trained with supervised learning and fine-tuned via reinforcement learning. The experiment results show that there is no significant performance difference between the two versions when interacting with humans.
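To make the setup concrete, here is a minimal sketch of what one round of a GuessWhich-style interaction could look like. The class and function names (AnswerBot, play_round) and the canned reply string are invented for illustration and are not the paper's implementation; a real ALICE would condition its answers on the image, its caption, and the dialog history.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnswerBot:
    """Stand-in for ALICE: it knows the secret image's caption and answers questions about it."""
    secret_caption: str
    history: List[Tuple[str, str]] = field(default_factory=list)

    def answer(self, question: str) -> str:
        # A real agent would run a vision-and-language model here.
        reply = f"(answer to {question!r}, given the caption {self.secret_caption!r})"
        self.history.append((question, reply))
        return reply


def play_round(bot: AnswerBot, image_pool: List[str], num_questions: int = 9) -> str:
    """The human asks a fixed number of questions, then guesses which image is the secret one."""
    for _ in range(num_questions):
        question = input("Ask about the secret image: ")
        print("ALICE:", bot.answer(question))
    return input(f"Pick one image id from {image_pool}: ")
```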

The trend of humans interacting with AI, directly or indirectly, has increased exponentially, and therefore it was interesting that the focus of this paper is on the performance of the human-AI team and not the AI in isolation. Since it is becoming increasingly common to use AI in the context of humans, a dependency is created that cannot be captured by measuring the performance of the AI component alone.

The results show no significant performance difference between the two versions of the AI when interacting with humans: although the AI improved according to AI-AI performance evaluation, that improvement does not directly translate into better human-AI team performance. This was an interesting insight that challenges existing AI evaluation norms.

Also, the cooperative game used in the experiments was complicated from a development point of view, and it was interesting to understand how the AI was developed and how the pool of images was selected. The paper also explores the possibility of the MTurk workers discovering the strengths of the AI and framing subsequent questions accordingly in order to leverage those strengths. This was a very fascinating possibility, as it ties back to the mental models humans create while interacting with AI systems.

  1. Given that the study was conducted in the context of visual conversational agents, are these results generalizable outside of this context?
  2. It is observed that human-AI team performance is not significantly different for SL when compared to RL. What are some reasons you can think of that might explain this observed anomaly?
  3. Given that ALICE is imperfect, what would be the recovery cost of an incorrect answer? Would this substantially impact the performance observed?


3/25/2020 – Mohannad Al Ameedi – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Improvements in artificial intelligence systems are normally measured alone, without taking the human element into consideration. In this paper, the authors try to measure and evaluate human-AI team performance by designing an interactive visual conversational agent that involves both a human and an AI in solving a specific problem. The game assigns the AI system a secret image with a caption, which is not known to the human, and the human starts rounds of questions to guess the correct image from a pool of images. The agent maintains an internal memory of questions and answers to help maintain the conversation.

The authors use two versions of the AI system: the first is trained using supervised learning and the second is additionally fine-tuned using reinforcement learning. The second system outperforms the first in AI-AI evaluation, but the improvement does not translate well when interacting with humans, which shows that advances in an AI system do not necessarily mean advances in human-AI team performance.
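As a rough illustration of why the two evaluations can disagree, the sketch below compares agent versions by the average rank a guesser assigns to the true secret image after a dialog (lower is better). The setting names and the numbers are made up; only the comparison pattern reflects the idea of contrasting AI-AI and human-AI evaluation.

```python
from statistics import mean
from typing import Dict, List


def mean_rank(ranks_of_secret_image: List[int]) -> float:
    """Average position of the true image in the guesser's ranked candidate list."""
    return mean(ranks_of_secret_image)


def compare(settings: Dict[str, List[int]]) -> None:
    for name, ranks in settings.items():
        print(f"{name}: mean rank = {mean_rank(ranks):.2f}")


# Illustrative numbers only: the RL version looks better with an AI questioner,
# but the gap disappears with human questioners.
compare({
    "ALICE_SL + QBOT (AI-AI)": [6, 7, 5, 8, 6],
    "ALICE_RL + QBOT (AI-AI)": [3, 4, 3, 5, 4],
    "ALICE_SL + human":        [4, 5, 4, 6, 5],
    "ALICE_RL + human":        [4, 5, 5, 6, 4],
})
```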

Reflection

I found the idea of running two AI systems with the same human to be very interesting. Normally we think that advances in an AI system will lead to better usage by the human, but the study shows that this is not always the case. Putting the human in the loop while improving the AI system gives us the real performance of the system.

I also found the concept of reinforcement learning in conversational agents very interesting. Using online learning with positive and negative rewards can help improve the conversation between the human and the AI system, and can prevent the system from getting stuck on the same answer when the human asks the same question.

This work is somewhat like the concept of compatibility, where the human builds a mental model of the AI system. Advances in the AI system might not translate into better usage by the human, and this is what the authors showed: of the two AI systems they used, one was better than the other in isolation, but that improvement did not necessarily translate into better performance for the users.

Questions

  • The authors showed that improvement in the AI system alone does not necessarily lead to better performance when the system is used by a human. Can we involve the human in the process of improving the AI system, so that improvements in the AI system do lead to better team performance?
  • The authors use a single secret image known to the AI system but not to the human. Could we make the image unknown to the AI system as well, by providing a pool of images and letting the AI system select the appropriate one? And could we do that with acceptable response latency?
  • If we had to use conversational agents like bots in a production setting, do you think an AI system trained with supervised learning could respond faster than a system trained with reinforcement learning, given that the reinforcement learning system needs to adjust its behavior based on rewards or feedback?


03/25/20 – Myles Frantz – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Regardless of anyone's personal perception of chatbots, with around 1.4 billion people using them (smallbizgenius), their impact cannot be ignored. Intended to answer rudimentary (and often duplicated) questions, many of these chatbots are focused on the question-and-answer (QA) domain. Across these, usage of and feelings towards chatbots vary, usually based on the user, the chatbot, and the overall interaction. Focusing on the human-centric aspect of the conversation, the team proposed a conversational agent (CA, a chatbot within QA) along with a method to inspect sections of the conversation and determine whether the user enjoyed it. By introducing a hierarchy of specific natural language classifiers (NLCs), the team was able to use certain classifications, or signals, to determine a high-level abstraction of a message or conversation. While the CA did its job sufficiently, the team determined through their signal methodology that approximately 84% of people engaged in some sort of conversation with the CA outside of a normal question-and-answer scenario.
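A minimal sketch of the "hierarchy of classifiers" idea as I understand it: a top-level classifier separates task questions from conversational messages, and a second-level classifier refines the conversational label. The keyword rules and label names below are placeholders standing in for trained NLCs.

```python
def top_level(message: str) -> str:
    # A trained classifier would go here; a crude rule stands in for illustration.
    return "question" if message.rstrip().endswith("?") else "conversational"


def conversational_subtype(message: str) -> str:
    lowered = message.lower()
    if any(word in lowered for word in ("thanks", "thank you")):
        return "gratitude"
    if any(word in lowered for word in ("haha", "lol", ":)")):
        return "playful"
    return "other_social"


def classify(message: str) -> str:
    label = top_level(message)
    if label == "question":
        return label
    return f"conversational/{conversational_subtype(message)}"


print(classify("How do I reset my VPN password?"))  # -> question
print(classify("haha you're pretty smart"))         # -> conversational/playful
```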

Reflection

I am surprised at the results gleaned from this survey. While I should not be surprised, and should assume that the closer CAs (and AI in general) get to appearing human-like, the better the interaction will be, the percentage of "playful" or conversational messages still seemed relatively high. This may be due to the participant group (new hires out of college), though it is a promising sign of the progress being made.

I appreciate the angle this research took. Having a strong technical background, my immediate thought would be to ensure all the questions are answered correctly and to investigate how the CA can be integrated with other systems (like a Jenkins Slack bot polling the survey of a project). The success of a project, I believe, depends not only on how usable it is, but also on how user-friendly it is. Take MySpace and Facebook as an example: Facebook created a much easier to use and more user-centric experience (based on connecting people), while MySpace suffered from a lack of improvement in both of these aspects and is currently fading in usage.

Questions

  • With only 34% of the participants responding to the survey, do you think a higher response rate would have reinforced and backed up the data currently collected?
  • Given the general maturity and time availability of a new hire out of college, do you think current employees (who have been with the company for a while) would show this percentage of conversation? In short, do you think normally busy or more senior employees would have given similar conversational time to the CA?
  • Given the percentage of new hires who responded, and responded conversationally, to the CA, the opportunity arises for users to converse at length and disregard their current work in favor of a conversation (potentially as a form of escapism). Do you think that, if this kind of CA were implemented throughout companies, these kinds of capabilities would be abused or overused?


03/18/20 – Nan LI – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary:

The main objective of this paper is to measure AI agents through interactive downstream tasks performed by human-AI teams. To achieve this goal, the authors designed a cooperative game – GuessWhich – that requires a human to identify a secret image from a pool of images by engaging in a dialog with an answerer bot (ALICE). The image is known to ALICE but unknown to the human, so the human needs to ask related questions and pick out the secret image based on ALICE's answers. Two versions of ALICE are presented in the paper: ALICE(SL), which is trained in a supervised manner to simulate conversation with humans about images, and ALICE(RL), which is pre-trained with supervised learning and fine-tuned via reinforcement learning for the image-guessing task. The results indicate that ALICE(RL) performs better when evaluated with another AI than when evaluated with humans. Therefore, the paper concludes that there is a disconnect between benchmarking AI in isolation and evaluating it in the context of human-AI interaction.

Reflection:

This paper reminds me of another paper we have discussed before: Updates in Human-AI Teams. Both of them concern the impact of human involvement on AI performance. I think this is a great topic, and it is worth putting more attention on it, because, as the beginning of the paper says, as AI continues to advance, human-AI teams are inevitable. Many AI products are already widely used across society, in all walks of life, for example in predictive policing, life insurance estimation, sentencing, and medicine. These products all require humans and AI to cooperate. Therefore, we should already agree that the development and improvement of AI must always consider the impact of human involvement.

The QBOT-ABOT teams mentioned in this paper follow a similar idea to GANs (generative adversarial networks): both train two systems together and let them provide feedback to each other to enhance their performance. However, the author makes the point that it is unclear whether these QBOT and ABOT agents are indeed performing well when interacting with humans. This is an excellent point that we should always consider when we design an AI system. The measurement of an AI system should never be done in isolation; we should also consider human mental models. This requires us to consider how the human mental model will impact team performance when human and AI work cooperatively. A suitable human-involved evaluation may be a more valuable measurement for an AI system.
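To make the analogy concrete, here is a rough sketch of a QBOT-ABOT self-talk episode with a shared reward, which is the part that resembles two models training against each other. The qbot and abot objects and their methods (ask, answer, observe, guess_distance, update) are hypothetical placeholders, not the paper's API.

```python
def self_talk_episode(qbot, abot, image, num_rounds: int = 10) -> float:
    """One cooperative dialog; returns the cumulative reward used for RL fine-tuning."""
    total_reward = 0.0
    prev_distance = qbot.guess_distance(image)  # how far QBOT's current guess is from the true image
    for _ in range(num_rounds):
        question = qbot.ask()
        answer = abot.answer(image, question)
        qbot.observe(question, answer)
        new_distance = qbot.guess_distance(image)
        reward = prev_distance - new_distance  # shared reward: improvement in QBOT's guess
        total_reward += reward
        prev_distance = new_distance
    qbot.update(total_reward)  # placeholder for a policy-gradient style update
    abot.update(total_reward)
    return total_reward
```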

Question:

  1. When do you think we should measure performance with human involvement, and when should we not?
  2. What do you see as the main point of this paper? Why do the authors use visual conversational agents to make the point that it is crucial to benchmark progress in AI in terms of how it translates to helping humans perform a particular task?
  3. At the end of the paper, the authors mention that humans perceive ALICE(SL) and ALICE(RL) as comparable in terms of all metrics. Why do you think humans reach such a conclusion? Does that indicate that human involvement makes no difference between the two visual conversational agents?

Word Count: 537
