03/25/2020 – Yuhang Liu – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Summary:

This paper proposes Evorus, a system that combines crowd workers with machine learning to build a better chatbot. The authors’ motivation is that fully automated chatbots usually do not respond as well as crowd-powered conversational assistants, which is also evident in daily life, but crowd-powered assistants cost more and take longer to respond. So the authors built Evorus, a crowd-powered conversational assistant that becomes more efficient in three main ways:

  1. It can integrate other chatbots;
  2. It can reuse previous answers;
  3. It can learn to automatically approve responses.

In short, as users chat with the system, they can evaluate each response. When an automated response is not good enough, the system can draw on answers from crowd workers, and at the same time it learns from those question-and-answer exchanges so that similar questions can be answered quickly and automatically the next time. This speeds up question answering while also improving accuracy.
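To make the “reuse previous answers” idea concrete, here is a minimal sketch of looking up a crowd-approved answer to a similar past question; the memory structure, similarity measure, and threshold are my own illustrative assumptions, not the actual Evorus implementation.

```python
from difflib import SequenceMatcher

# (question, crowd-approved answer) pairs collected from earlier conversations
answer_memory = [
    ("what's the weather like in pittsburgh", "It's around 45F and cloudy in Pittsburgh."),
    ("recommend a good pizza place nearby", "People seem to like Mineo's in Squirrel Hill."),
]

def similarity(a, b):
    # crude string similarity; a real system would use a trained model
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def reuse_previous_answer(new_question, threshold=0.6):
    """Return the closest past answer, or None so the query falls back to crowd workers."""
    question, answer = max(answer_memory, key=lambda qa: similarity(new_question, qa[0]))
    return answer if similarity(new_question, question) >= threshold else None

print(reuse_previous_answer("how is the weather in pittsburgh today"))
```

If no stored question is similar enough, the query falls through to the crowd workers, which is what keeps quality up while the automated side learns.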

Reflection:

I believe that in daily life people regularly encounter automated question-answering systems. For example, on the UPS website a chatbot pops up to ask about your purpose, but it usually handles only a few directions. When people’s requests become more complicated, the conversation becomes more complicated, and the automated agent cannot handle it, which pushes users toward phone support or other pages. I think these responses mostly come from pre-scripted questions and answers, so the system proposed by the authors has very important practical value.

At the same time, I think the biggest advantage of having the system learn its answers from the crowd is that it can be updated continuously and promptly. Frequent updates usually consume a lot of manpower and resources, yet timely updates matter most in communication tools: online terminology and newly emerging vocabulary change very quickly. If the system can be updated frequently by learning from the crowd-approved answers in each conversation, it will have a very positive impact on both system maintenance and users.

Finally, the system pushes question answering into a wider field, not only in how it updates, revises, and answers questions, but more importantly by combining humans and machines and opening up what would otherwise be a sealed system so that it can be continuously updated. More and more innovative components can be added to it, which I think is even more meaningful than the system itself.

Question:

  1. What other domains do you think this system could be extended to?
  2. How should we evaluate crowd workers’ responses? In other words, how can we make sure a crowd worker’s response is better than a machine’s?
  3. What is the difference between the system described in this paper and other Q&A systems in use today?

03/25/2020 – Mohannad Al Ameedi – “Like Having a Really bad PA”: The Gulf between User Expectation and Experience of Conversational Agents

Summary

In this paper, the authors try to understand the user experience of conversational agents by examining the factors that motivate users to work with these agents, and they propose design considerations to overcome current limitations and improve human interaction. During their study, they found that there is a huge gap between user expectations and how conversational agents actually operate.

They also found that there are few studies of how agents are used on a daily basis, and that most existing studies focus on technical architecture, language learning, and other areas rather than on user experience.

The authors conducted interviews with 14 individuals who use conversational agents regularly, with ages ranging from 25 to 60 years. Some of these individuals have in-depth technical knowledge, while the others are regular users of technology.

They found that the key motivation for using conversational agents was saving time: users ask the CA to execute simple tasks that normally require multiple steps, like checking the weather, setting reminders, setting alarms, or getting directions. They also found that users often began engaging through playful interaction, like asking the CA to tell a joke or play music. Only a few users, those with technical knowledge, reported using these systems for basic work-related tasks.

Users’ interactions were mainly around non-critical tasks, and users reported that the agents were not very successful when asked to execute complex tasks. The study shows that users don’t trust conversational agents with critical tasks like sending emails or making phone calls, and that they need a visual confirmation to complete these kinds of tasks. They also mentioned that these systems don’t accept feedback and offer no transparency into how things work internally.

The authors suggest considering ways to reveal system intelligence, reconsidering the interactional promise made by humorous engagement, considering how best to indicate capability through interaction, and rethinking system feedback and design goals in light of the dominant use case, as areas for future investigation and development.

Reflection

I found the results reported by the study very interesting. Most users learned to use these CA systems as they went, trying different words and keywords until something worked, and the conversational agents failed to have a natural interaction with humans.

I had also assumed that companies like Google, Amazon, Microsoft, and Facebook had developed conversational systems that can do much more than answer simple questions while struggling with complex ones, but it appears that is not the case. These companies have developed very sophisticated AI systems and services, so it seems to me that limitations such as computational power or latency considerations are preventing these systems from performing well.

I agree with the authors that providing feedback can improve human interaction with CA systems, and that communicating capabilities can lower expectations, which reduces the gap between expectation and operation.

Questions

  • The authors mention that most users felt unsure as to whether their conversational agents had a capacity to learn. Can we use reinforcement learning to help a CA adapt and learn while engaging with users within a single session?
  • The authors mentioned that CA systems are generally good with simple tasks but struggle to understand human requests and handle complex tasks. Do you think there are technical limitations or other factors preventing these systems from performing well with humans? What are these factors?
  • The authors mentioned that in most instances, the operation of the CA systems failed to bridge the gap between user expectation and system operation. If that is the case for conversational agents, do you think we are far away from deploying autonomous cars, which are far more complicated than CAs and interact directly with their environment, in real-world settings?

03/25/2020 – Vikram Mohanty – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Authors: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh

Summary

In this paper, the authors propose coming up with realistic evaluation benchmarks in the context of human-AI teams, as opposed to evaluating AI systems in isolation. They introduce two chatbots, one better than the other on the basis of standalone AI benchmarks. However, in this paper, when they are evaluated in a task setting that mimics their intended use i.e. humans interacting with chatbots to accomplish a common goal, they perform roughly the same. The work essentially suggests a mismatch between benchmarking of AI in isolation and in the context of human-AI teams. 

Reflection

The core contribution of this paper showcases that evaluating AI systems in isolation will never give us the complete picture, and therefore, we should evaluate AI systems under the conditions they are intended to be used in with the targeted players who will be using it. In other words, the need for ecological validity of the evaluation study is stressed here. The flip side of this contribution is, in some ways, being reflected in the trend of AI systems falling short of their intended objectives in real-world scenarios.  
Even though the GuessWhich evaluation was closer to a real-world scenario than vanilla isolation evaluation methods, it still remains an artificial evaluation. However, the gap with a possible real-world scenario (where a user is actually interacting with a chatbot to accomplish some real-world task like planning a trip) would be minimal. 
The responses returned by the two bots are not wildly different (beige vs. brown), since one was the base for the other, and therefore a human brain can somehow adapt dynamically to the chatbot responses and accomplish the overall goal. It would also have been interesting to see how performance would change if the AI were drastically different, or sent someone down the wrong path.
This paper shows why it is important for AI and HCI researchers to work together to come up with meaningful datasets, setting up a realistic ecosystem for an evaluation benchmark that would be more relevant with potential users. 

Questions

  1. If, in the past, you compared algorithms solely on the basis of precision-recall metrics (let’s say, you built an algorithm and compared it with the baseline), do you feel the findings would hold up in a study with ecological validity? 
  2. How’d you evaluate a conversational agent? (Suggest something different from the GuessWhich platform)
  3. How much worse or better (or different) would a chatbot have to be for humans to perform significantly different from the current ALICE chatbots in the GuessWhich evaluation setting? (Any kind of subjective interpretation welcome) 

03/25/2020 – Vikram Mohanty – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Authors: Ting-Hao (Kenneth) Huang, Joseph Chee Chang, and Jeffrey P. Bigham

Summary

This paper discusses Evorus, a crowd-powered intelligent conversation agent that is targeted towards automation over time. It allows new chatbots to be integrated, reuses prior crowd responses, and learns to automatically approve responses. It demonstrates how automation can efficiently be deployed by augmenting an existing system. Users used Evorus through Google Hangouts.  

Reflection

There’s a lot happening in this paper, but then it’s perfectly justified because of the eventual target — fully automated system. This paper is a great example of how to carefully plan the path to an automated system from manual origins. It is realistic in terms of feasibility, and the transition from a crowd-based system to a crowd-AI collaborative system aimed towards a fully automated one seems organic and efficient as seen from the results. 
In terms of their workflow, they break down different elements i.e. chatbots and vote bots, and essentially, scope down the problem to just selecting a chatbot and voting on responses. A far-fetched approach would have been to build (or aim for) an end-to-end (God-mode) chatbot that can give the perfect response. Because the problem is scoped down, and depends on interpretable crowd worker actions, designing a learning framework around these actions and scoped down goals seems like a feasible approach. This is a great takeaway from the paper — how to break down a complex goal into smaller goals. Instead of attempting to automate an end-to-end complex task, crafting ways to automate smaller, realizable elements along the path seems like a smarter alternative. 
The voting classifier was carefully designed, considering a lot of interpretable and relevant features again such as message, turn and conversation levels. Again, this was evaluated with a real purpose i.e. reducing the human effort in voting. 
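As a rough illustration of that design (not the paper’s actual feature set), a feature extractor for such a vote classifier might combine signals at the message, turn, and conversation levels:

```python
# Hypothetical features for an automatic upvote classifier, combining message-,
# turn-, and conversation-level signals. The specific features are illustrative
# guesses, not the feature set actually used in Evorus.
def vote_features(candidate, turn_messages, conversation):
    candidate_words = set(candidate.lower().split())
    turn_words = set(" ".join(turn_messages).lower().split())
    convo_words = set(" ".join(conversation).lower().split())
    return {
        # message level: properties of the candidate response itself
        "msg_length": len(candidate.split()),
        "msg_is_question": int(candidate.strip().endswith("?")),
        # turn level: overlap with the user's current request
        "turn_overlap": len(candidate_words & turn_words) / max(len(turn_words), 1),
        # conversation level: overlap with the dialogue so far
        "convo_overlap": len(candidate_words & convo_words) / max(len(convo_words), 1),
    }

example = vote_features(
    "It's 45F and cloudy in Pittsburgh right now.",
    ["what's the weather in pittsburgh?"],
    ["hi there", "what's the weather in pittsburgh?"],
)
print(example)  # these features would feed a simple model such as logistic regression
```

Because every feature here is human-interpretable, it is easy to sanity-check why the classifier chose to upvote automatically, which fits the emphasis on interpretable crowd worker actions.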
This paper also shows how we can still build intelligent systems that improve over time on top of AI engines that we cannot (or, actually, may not have to) modify, i.e. third-party developer chatbots and off-the-shelf AI APIs. Crowd-AI collaboration can be useful for this, and therefore designing the user interaction(s) remains critical for a learning framework to be layered on top of the fixed AI engine, e.g. the vote bot or the select bot in this paper’s case.

Questions

  1. If you are working with an off-the-shelf AI engine that cannot be modified, how do you plan on building a system that improves over time? 
  2. What other (interaction) areas in the Evorus system do you see for a potential learning framework that would improve the performance of the system (according to the existing metrics)?
  3. If you were working on a complex task, would you prefer an end-to-end God-mode solution, or a slower approach that carefully breaks the task down and automates each element?

Subil Abraham – 03/25/2020 – Huang et al., “Evorus”

This paper introduces Evorus, a conversational assistant framework/interface that serves as a middleman to curate and choose appropriate responses to a client’s query. The goal of Evorus is to sit between a user and many integrated chatbots, while also using crowd workers to vote on which responses are best given the query and the context. This allows Evorus to be a general-purpose chatbot, because it is powered by many domain-specific chatbots and (initially) crowd workers. Evorus learns over time from the crowd workers’ votes on which responses to send for a query, based on its historical knowledge of previous conversations, and also learns which chatbot to direct a query to based on which chatbots responded to similar queries in the past. It also prevents bias against newer chatbots by giving them higher initial probabilities when they first start, so that they can still be selected even though Evorus has no historical data or prior familiarity with them. The ultimate ideal of Evorus is to eventually minimize the number of crowd worker interventions necessary, by learning which responses to vote on and pass through to the user, and thus save crowd work costs over time.
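A toy sketch of that cold-start protection might blend a bot’s observed acceptance rate with an optimistic prior that dominates while the bot has little history; the prior value and blending rule below are my assumptions rather than Evorus’s actual scoring function.

```python
import random

# Toy sketch of choosing a chatbot while protecting new bots from cold start.
# name: (accepted responses, total responses offered so far)
bots = {
    "weather_bot": (80, 100),
    "restaurant_bot": (30, 60),
    "brand_new_bot": (0, 0),   # no history yet
}

def selection_weight(accepted, total, prior=0.7, prior_strength=10):
    """Blend observed acceptance rate with an optimistic prior; with no history,
    the weight equals the prior, so brand-new bots still get picked sometimes."""
    return (accepted + prior * prior_strength) / (total + prior_strength)

weights = {name: selection_weight(a, t) for name, (a, t) in bots.items()}
chosen = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
print(weights, "->", chosen)
```

As a bot accumulates history, its observed acceptance rate takes over and the initial boost fades away.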

This paper seems to follow the theme of last week’s reading, “Pull the Plug? Predicting If Computers or Humans Should Segment Images”. In that paper, the application tries to judge the quality of an algorithm’s image segmentation and pass the task on to a human if it is not up to par. The goals of this paper seem similar, but for chatbots instead of image segmentation algorithms. I’m starting to think the idea of curation and quality checking is a common refrain that will pop up in other crowd-work-based applications if I keep reading in this area. I also find it an interesting choice that Evorus seems to allow multiple responses (either from bots or from crowd workers) to be voted in and displayed to the client. I suppose the idea here is that, as long as the responses make sense and add more information for the client, it is beneficial to allow multiple responses instead of trying to force a single, canonical response. Though I like this paper and the application it presents, one issue I have is that they don’t show a proper user study. Maybe they felt it was unnecessary because user studies on automatic and crowd-based chatbots have been done before and the results would be no different. But I still think they should have done some client-side interviews or observations, or at least shown a graph of the Likert scale responses they collected for the two phases.

  1. Do you see a similarity between this work and the Pull the Plug paper? Is the idea of curation and quality control and teaching AI how to do quality control a common refrain in crowd work research?
  2. Do you find the integration of Filler bot, Interview bot, and Cleverbot, which do not actually contribute anything useful to the conversation, of any use? Were they just there to add conversational noise? Did they serve a training purpose?
  3. Would a user study have shown anything interesting or surprising compared to a standard AI-based or crowd-based chatbot?

Subil Abraham – 03/25/2020 – Luger and Sellen, “Like Having a Really bad PA”

This paper takes a hard look at how useful conversational agents like Google Now and Siri are in the real world, in the hands of real users who try to use them in daily life. The authors conduct interviews with 14 users to get their general thoughts on how they use these tools and, in some cases, step-by-step details of how they do specific tasks. The paper draws out some interesting insights and provides useful recommendations on how to improve existing CAs. Recommendations include making design changes that inform users of the limits of what CAs can do, toning down some of the more personable aspects that give a false impression that CAs are equivalent to humans in understanding, and rethinking the design for easier use in hands-free scenarios.

The first thing I noticed, after having read and focused primarily on papers with some quantitative aspect, was that this paper is entirely focused on evaluating and interpreting the content of the interviews. I suppose this is another important way in which HCI research is done and shared with the world, because it focuses entirely on the human side. I think the authors draw some good interpretations and recommendations from it. The general problem I have with these kinds of studies is the small sample size, which rears up here too. But I can look past that because they still manage to get some good insights, make some good recommendations, and focus on a mode of interaction that is entirely dialogue-based. I do think that with a bigger sample size and some quantitative work, they could perhaps show trends in the failings of CAs. The most interesting insight for me is that CAs seem to have been designed with the assumption that they would be the focus of attention when used, when in reality people were trying to use them while doing something else and were not looking at their phone. So the feedback mechanism was useless for users who were trying to stay hands-free. From my perspective, that seems to be the most actionable change, and it can probably lead to (or maybe already has led to) interesting design research on how best to provide task feedback for different kinds of tasks in hands-free usage.

  1. What kind of design elements can be included to help people understand the limits of what the CA can do, and thereby avoid having unfulfillable expectations?
  2. Similarly, what kind of design elements would be useful to better suit the hands free usage of the CAs?
  3. Should CAs aim to be more task oriented like Google Now, or more personable like Siri? What’s your preferred fit?

03/25/2020 – Dylan Finch – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Word count: 568

Summary of the Reading

This paper makes several contributions to the field of human-AI interaction. It focuses on presenting a new way to evaluate AI agents. Most evaluations of AI systems are done in isolation, with no human input: one AI system interacts with another AI system, and their combined interaction forms the basis of the evaluation. This research presents a new way to evaluate AI systems by bringing humans into the loop and having them replace one of the AI systems, to better evaluate how AIs work in a more realistic scenario, one where humans are present. The paper finds that these two evaluation methods can produce different results. Specifically, when comparing the AI systems, the one that performed worse when evaluated against another AI system actually performed better when evaluated with a human. This raises important questions about the way we test AI systems and suggests that testing should be more human-focused.

Reflections and Connections

I think that this paper highlights an important issue that I had never really thought about. Whenever we build any kind of new tool or new system, it must be tested. And, this testing process is extremely important in deciding whether or not the system works. The way that we design tests is just as important as the way that we design the system in the first place. If we design a great system, but design a bad test and then the test says that the system doesn’t work, we have lost a good idea because of a bad test. I think this paper will make me think more critically about how I design my tests in the future. I will put more care into them and make sure that they are well designed and will give me the results that I am looking for. 

When these ideas are applied to AI, I think that they get even more interesting. AI systems can be extremely hard to test and oftentimes, it is much easier to design another automated system, whether that be another AI system or just an automated script, to test an AI system, rather than getting real people to test it. It is just much easier to use machines than it is to use humans. Machines don’t require IRB approval, machines are available 24/7, and machines provide consistent results. However, when we are designing AI systems and especially when we are designing AI systems that are made to be used by humans, it is important that we test them with humans. We cannot truly know if a system designed to work with humans actually works until we test it with humans. 

I hope that this paper will push more research teams to use humans in their testing. Especially with new tools like MTurk, it is easier and cheaper than ever to get humans to test your systems. 

Questions

  1. What other kinds of systems should use humans in testing, rather than bots or automated systems?
  2. Should all AI systems be tested with humans? When is it ok to test with machines?
  3. Should we be more skeptical of past results, considering that this paper showed that an evaluation conducted with machines actually produced a wrong result (the wrong ALICE bot was chosen as better by machines)?

03/25/2020 – Dylan Finch – “Like Having a Really bad PA”: The Gulf between User Expectation and Experience of Conversational Agents

Word count: 575

Summary of the Reading

This paper focuses on cataloguing the issues users have with conversational agents and how user expectations of what conversational agents can do differ dramatically from what these systems can actually do. This is where the main issue arises: the difference between perceived and actual usefulness. Conversational agents are the virtual assistants most of us have on our smartphones. They can do many things, but they often have trouble with more complicated tasks and may not be very accurate. Many participants in the study said that they would not use their conversational agents for complicated tasks that required precision, like writing long emails. Other users assumed that the conversational agents could do something, like booking movie tickets, but the system failed to accomplish the task the first few times it was tried. This made the user less likely to try those features in the future. The paper lists more of these types of issues and tries to present some solutions to them.

Reflections and Connections

I think that this paper highlights a big problem with conversation agents. It can sometimes be very hard to know what conversational agents can and cannot do. Oftentimes, there is no explicit manual that lists all of the kinds of questions that you can ask or that tells you the limits of what the agent can do. This is unfortunate because being upfront with users is the best way to set expectations to a reasonable level. Conversational agents should do a better job of working expectations into their setups or their instructions.

Companies should also do a better job of telling consumers what these agents can and cannot do. Companies like Apple and Google, some of the companies highlighted in the paper, often build their agents up to be capable of anything. Apple tries to sell you on Siri by promising that it can do basically anything, encourages you to use Siri for as many tasks as you can, and advertises this. But oftentimes Siri can’t do everything they imply it can, or if Siri can do it, she does it poorly. This compounds the problem even more because it sets user expectations extremely high. Then users actually try to use the agents, find out that they can’t do as many things as advertised, and give up on the system altogether. Companies could do a lot to solve this problem by simply being honest with consumers and saying that there are certain things their agents can do and certain things they cannot.

This is a real problem for people who use these kinds of technologies. When people do not know what kinds of questions the agents can actually answer, they may be too hesitant to ask any questions, severely limiting the usefulness of the agent. It would vastly improve the user experience if we could solve this issue and give people more accurate expectations of what conversational agents can do.

Questions

  1. How can companies better set expectations for their conversational agents?
  2. Does anyone else have a role to play in educating people on the capabilities of conversation agents besides companies?
  3. Do we, as computer scientists, have a role to play in educating people about the capabilities of conversational agents?

3/25/20 – Jooyoung Whang – All Work and No Play? Conversations with a Question-and-Answer Chatbot in the Wild

In this paper, the authors design and deploy a conversational QA bot called CHIP. It can provide domain-specific information (information about the company it was deployed at) and carry on off-topic conversation. The authors’ interest in this study was to observe and characterize the kinds of interactions CHIP’s users had with it and to measure CHIP’s performance. CHIP classified user intent with two classifiers, one predicting a broad category and the other a more specific one; based on whether the specific classification was a sub-category of the broader one, the appropriate response was given to the user. Training was done using data collected from other conversational QA agents and anonymized company emails. The authors observed that users mostly used CHIP for system inquiries, providing feedback, and playful chit-chat.
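A rough sketch of that two-level idea, with made-up intent categories and trivial keyword rules standing in for CHIP’s trained classifiers:

```python
# Rough sketch of two-level intent handling: a coarse classifier and a fine
# classifier, plus a consistency check that the fine intent is a sub-category
# of the coarse one. Categories and keyword rules are made up for illustration.
COARSE_OF = {            # fine intent -> the coarse intent it belongs to
    "vacation_policy": "work_inquiry",
    "it_support": "work_inquiry",
    "tell_joke": "chit_chat",
}

def coarse_classify(text):
    return "chit_chat" if "joke" in text.lower() else "work_inquiry"

def fine_classify(text):
    t = text.lower()
    if "vacation" in t:
        return "vacation_policy"
    if "laptop" in t or "password" in t:
        return "it_support"
    return "tell_joke"

def respond(text):
    coarse, fine = coarse_classify(text), fine_classify(text)
    if COARSE_OF.get(fine) == coarse:
        return f"[answer routed to the handler for '{fine}']"
    # classifiers disagree: ask the user to clarify instead of guessing
    return "Sorry, could you rephrase that?"

print(respond("How many vacation days do I get?"))    # consistent -> routed
print(respond("Tell me a joke about my laptop"))      # coarse/fine disagree -> clarify
```

Here the disagreement case simply asks for a rephrase, which is just one plausible way to use the sub-category check described above.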

I personally liked the study due to the interesting topic. My mother owns an Amazon Alexa. I’ve frequently seen Alexa trying to be humorous and I was astonished by how naturally human-like these conversational agents could act. At the start of this paper, I was curious about how the authors approached the highly abstract concept of playfulness in a technical paper. Using an intention classification layer was a great idea. I think it nicely encapsulates the user queries and improves response quality.

One interesting observation in the paper was that casual conversations often occurred in a mix with work-related conversations. Up to now, I thought the two types of conversations happened separately when chatting with a bot. I think this mix happens more frequently when talking with a human, so I assume it was the result of the users trying to anthropomorphize the agent.

Moving on to a more critical reflection, I think it would have been nicer if the paper focused more on one side of the types of conversations (i.e. playful conversations). The paper tries to address both work-related conversations and playful conversations at the same time. I know that the authors were interested in looking at human-AI interaction in the wild, but I think this also made the results less compact and lose focus. I also had the feeling that this study was very specific to the agent that the authors designed (CHIP). I am unsure how the results would generalize to other CAs.

These are the questions that I had while reading the paper:

1. The authors mentioned that certain nuances could be used to detect user intention. What would be considered a nuance of a playful intention? It seems that there’s a high correlation between the user’s urge to anthropomorphize the bot and a playful conversation. Could phrases like ‘you’ or ‘we’ be used to detect playful intention?

2. As in my reflection, I think this study is a bit too specific to CHIP. What do you think? Do you think the results will generalize well to other kinds of conversational bots?

3. According to this paper, work-related conversations and playful conversations frequently happened together. Would there be a case where a playful conversation will never happen? What kind of environment would not require a playful conversation?

03/25/2020 – Sushmethaa Muhundan – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

The paper explores the feasibility of a crowd-powered conversational assistant that is capable of automating itself over time. The main intent of building such a system is to dynamically support a vast set of domains by exploiting the capabilities of numerous chatbots and providing a universal portal to help answer users’ questions. The system, Evorus, supports multiple bots and, given a query, predicts which bot’s response is most relevant to the current conversation. This prediction is validated by crowd workers from MTurk, and the response with the most upvotes is sent to the user. The feedback gained from the workers is then used in a learning algorithm that helps improve the system. As part of this study, the Evorus chatbot was integrated with Google Hangouts and users’ queries were presented to MTurk workers via an interface. The workers are shown multiple possible answers, coming from various bots, for each query. They can then choose to upvote or downvote the answers presented, or respond to the query by typing in an appropriate answer. An automatic voting system was also devised with the aim of reducing workers’ involvement in the process. The results of the study showed that Evorus was able to automate itself over time without compromising conversation quality.
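As a loose sketch of the selection flow described above (candidates gathered from several bots, crowd votes tallied, and the top-voted reply sent), where the vote threshold, data shapes, and bot names are my own assumptions:

```python
def collect_candidates(query, bots):
    """Ask every integrated chatbot for a candidate reply to the query."""
    return {name: bot(query) for name, bot in bots.items()}

def pick_response(candidates, votes, min_score=1):
    """votes maps bot name -> (upvotes, downvotes); return the best reply or None."""
    scored = {name: votes.get(name, (0, 0))[0] - votes.get(name, (0, 0))[1]
              for name in candidates}
    best = max(scored, key=scored.get)
    # if nothing is voted up enough, escalate instead of sending a weak reply
    return candidates[best] if scored[best] >= min_score else None

bots = {
    "weather_bot": lambda q: "It's 45F and cloudy.",
    "chitchat_bot": lambda q: "I love talking about the sky!",
}
candidates = collect_candidates("what's the weather?", bots)
print(pick_response(candidates, {"weather_bot": (3, 0), "chitchat_bot": (1, 2)}))
```

In the real system the crowd can also type a reply directly, which is the natural fallback when no candidate clears the bar.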

I feel that the problem this paper is trying to solve is very real: the current landscape of conversational assistants like Apple’s Siri and Amazon’s Echo is limited to specific commands, and users need to be aware of the supported commands in order to get the most benefit from them. This often becomes a roadblock, as the AI bots are constrained to specific, pre-defined domains. Evorus tries to solve this problem by creating a platform that can integrate multiple bots and leverage their skill sets to answer a myriad of questions from different domains.

The paper’s consistent focus on reducing manual intervention through automation while maintaining quality was good. I found the voting bot particularly interesting: a learning algorithm was developed that used the upvotes and downvotes workers provided on previous conversations to learn the workers’ voting patterns and make similar decisions automatically. The upvotes and downvotes were also used to gauge the quality of candidate responses, and this was used as further input to predict the most suitable bots in the future.

Fact boards were another interesting feature: they included chat logs and recorded facts and were part of the interface provided to the workers to give context about the conversation. This ensures that workers are brought up to speed and can make informed decisions while responding to users.

  1. Given the scale at which information generation is growing, is the solution proposed in the paper feasible? Can this truly handle diverse domain queries while reducing human efforts drastically and also maintaining quality?
  2. Given the complexity of natural languages, would the proposed AI system be able to completely understand the user’s need and respond with relevant replies without human intervention? Would the role of a human ever become dispensable?
  3. How long do you think it would take for the training to be sufficient to entirely remove the human from the loop in the Evorus system?
