In this paper, the authors design and deploy a conversational QA bot called CHIP. It can provide domain-specific information (information about the company it was deployed at) and engage in off-topic conversation. The authors' interest in this study was to observe and characterize the kinds of interactions users had with CHIP and to measure CHIP's performance. CHIP classified user intent with two classifiers: one assigned a broad category and the other a more specific one. Depending on whether the specific category was a sub-category of the broad one, the appropriate response was given to the user. Training was done on data collected from other conversational QA agents and on anonymized company emails. The authors observed that users typically used CHIP for system inquiry, feedback-giving, and playful chit-chat.
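To make the two-level routing idea concrete for myself, here is a minimal sketch, assuming generic scikit-learn-style text classifiers; the broad/specific labels, example utterances, and hierarchy below are my own hypothetical illustrations, not the paper's actual taxonomy or model.

```python
# Sketch of coarse/fine intent routing: one classifier predicts a broad
# category, another predicts a specific intent, and a specific response is
# used only when the specific intent falls under the predicted broad category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training utterances and labels (illustration only).
texts = ["where is the cafeteria", "how do I reset my badge",
         "tell me a joke", "do you like music"]
broad_labels = ["work", "work", "chitchat", "chitchat"]
fine_labels = ["facility_info", "it_support", "humor", "small_talk"]

# Hypothetical hierarchy: which specific intents belong to which broad category.
HIERARCHY = {"work": {"facility_info", "it_support"},
             "chitchat": {"humor", "small_talk"}}

broad_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, broad_labels)
fine_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, fine_labels)

def respond(utterance: str) -> str:
    broad = broad_clf.predict([utterance])[0]
    fine = fine_clf.predict([utterance])[0]
    if fine in HIERARCHY.get(broad, set()):
        # Predictions agree, so answer with the specific intent's response.
        return f"[{broad}/{fine}] handling your request"
    # Otherwise fall back to a generic response for the broad category.
    return f"[{broad}] could you rephrase that?"

print(respond("can you tell me a joke"))
```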
I personally liked the study because of its interesting topic. My mother owns an Amazon Alexa. I've often seen Alexa try to be humorous, and I was astonished by how naturally human-like these conversational agents can act. At the start of this paper, I was curious how the authors would approach the highly abstract concept of playfulness in a technical paper. Using an intent classification layer was a great idea. I think it nicely encapsulates user queries and improves response quality.
One interesting observation in the paper was that casual conversations often occurred mixed in with work-related conversations. Until now, I had assumed the two types of conversation happened separately when chatting with a bot. I think this mixing happens more often when talking with a human, so I assume it is the result of users trying to anthropomorphize the agent.
Moving on to a more critical reflection, I think it would have been nicer if the paper had focused more on one type of conversation (i.e., playful conversations). The paper tries to address both work-related and playful conversations at the same time. I know the authors were interested in observing human-AI interaction in the wild, but I think this also made the results less compact and less focused. I also had the feeling that this study was very specific to the agent the authors designed (CHIP). I am unsure how the results would generalize to other conversational agents (CAs).
These are the questions that I had while reading the paper:
1. The authors mentioned that certain nuances could be used to detect user intent. What would count as a nuance of a playful intent? There seems to be a strong correlation between the user's urge to anthropomorphize the bot and a playful conversation. Could phrases like 'you' or 'we' be used to detect playful intent? (A toy sketch of this idea follows the question list.)
2. As noted in my reflection, I think this study is a bit too specific to CHIP. What do you think? Do you think the results will generalize well to other kinds of conversational bots?
3. According to this paper, work-related and playful conversations frequently happened together. Would there be a case where a playful conversation would never happen? What kind of environment would not call for playful conversation?
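To make question 1 concrete, here is a toy heuristic for the pronoun-cue idea; it is purely my own hypothetical illustration, not a method from the paper, and real intent detection would obviously need much more signal than a keyword list.

```python
# Flag an utterance as potentially playful when it contains second-person or
# inclusive pronouns, which often accompany anthropomorphizing the agent.
import re

# Hypothetical cue list; these phrases are my own guesses, not from the paper.
PLAYFUL_CUES = re.compile(r"\b(you|your|we|our|are you|do you)\b", re.IGNORECASE)

def looks_playful(utterance: str) -> bool:
    return bool(PLAYFUL_CUES.search(utterance))

print(looks_playful("Do you ever get tired?"))   # True
print(looks_playful("Where is the HR portal?"))  # False
```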
Yes, I agree with you that this study is a bit too specific to CHIP. However, I think some of the points made in this research can be applied to other systems. For example, the authors characterized the forms of conversational interaction users had with the agent, including feedback-giving, playful chit-chat, system inquiry, and habitual communicative utterances, which should be considered in future development. Likewise, the signals they highlight in conversational interactions for inferring user satisfaction are valuable for the development of other applications.
I don’t think the results are generalizable. The chatbot works in a closed environment and might not work well if released in the wild.