Summary
The paper studies a field deployment of a question-and-answer (QA) chatbot in the human resources (HR) domain, focusing on users' conversational interactions with the chatbot. The HR chatbot provided company-related information to 377 new employees over six weeks. The authors' motivation is that conversational interactions carry rich signals for inferring user status, and that these signals could be used to develop agents that adapt their functionality and interaction style to individual users. By contrasting these signals, they show the various functions that conversational interactions serve. The authors discuss design implications for conversational agents and directions for developing adaptive agents based on users' conversational behaviors. The paper addresses two main research questions:
• RQ1: What kinds of conversational interactions did users have with the QA agent in the wild?
• RQ2: What kinds of conversational interactions can be used as signals for inferring user satisfaction with the agent’s functional performance, and playful interactions?
They answer RQ1 by presenting a characterization of the users' conversational input and their high-level conversational acts. Building on this characterization, the authors then study what signals these interactions carry for inferring user satisfaction (RQ2).
Reflection
In the paper, the authors analyze conversations as signals of user satisfaction (RQ2). I found that part most interesting, as the results show that users were fairly divided in their opinions of the chatbot's functionality and playfulness. This suggests a need to adapt system functions and interaction styles to different users.
This observation makes me think of other human-in-the-loop systems and how system functions and interaction styles would affect user satisfaction there. In systems that aren't chatbot-based, how is satisfaction measured? Would it differ for systems that handle a substantial volume of interaction? Does it matter whether satisfaction is self-reported by the user, or would it be better measured from the user's interaction with the system?
The paper acknowledges as a limitation that the results are based on survey data. The authors report a response rate of 34%, which means they cannot rule out self-selection bias. They also acknowledge that some observations may be specific to the workplace context and the user sample of the study.
The results in this paper provide some understanding of the functions of conversational behaviors with conversational agents, derived from human conversations. I would love to see similar work for non-conversational systems and how user satisfaction is measured there.
Discussion
- Is user satisfaction an important factor or evaluation method in your project?
- How would you quantify user satisfaction in your project?
- Would you measure satisfaction using a self-reported user survey, or based on the user’s interaction with the system? Why?
- Did you notice any limitations in this paper other than the ones the authors mention?