This paper introduces Evorus, a conversational assistant framework that serves as a middleman between a user and many integrated chatbots, curating and choosing the appropriate responses to a client's query while crowd workers vote on which responses are best given the query and the conversational context. This allows Evorus to act as a general-purpose chatbot, because it is powered by many domain-specific chatbots and (initially) by crowd workers. Evorus learns over time from the crowd workers' votes which responses to send for a query, based on its historical knowledge of previous conversations, and it also learns which chatbot to direct a query to based on which chatbots responded well to similar queries in the past. It avoids biasing against newer chatbots by giving them higher initial selection probabilities, so they can still be chosen even though Evorus has no historical data or prior familiarity with them. The ultimate goal of Evorus is to minimize the number of crowd worker interventions needed, by learning which responses to vote on and pass through to the user automatically, and thus save crowd work costs over time.
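To make that selection mechanism concrete for myself, here is a minimal sketch in Python of how I imagine the "optimistic prior for new bots" idea could work: each bot's chance of being queried is weighted by a smoothed acceptance rate, so a bot with no history still starts with a favorable weight. The scoring formula, the pseudo-count values, and the `BotRecord` structure are my own illustrative assumptions, not the paper's actual model.

```python
from __future__ import annotations
import random
from dataclasses import dataclass


@dataclass
class BotRecord:
    """Running tally of how a chatbot's past responses were voted on."""
    accepted: int = 0   # responses that were upvoted / passed to the user
    total: int = 0      # responses the bot produced overall


def selection_weight(record: BotRecord,
                     prior_accepts: float = 3.0,
                     prior_total: float = 4.0) -> float:
    """Estimated acceptance rate, smoothed with an optimistic prior.

    A brand-new bot (0/0 history) gets prior_accepts / prior_total = 0.75,
    so it is still likely to be picked until real vote data accumulates.
    The pseudo-count values here are illustrative, not from the paper.
    """
    return (record.accepted + prior_accepts) / (record.total + prior_total)


def pick_bots(history: dict[str, BotRecord], k: int = 2) -> list[str]:
    """Sample k bots to query, weighted by their smoothed acceptance rate."""
    names = list(history)
    weights = [selection_weight(history[n]) for n in names]
    return random.choices(names, weights=weights, k=k)


# Example: an established bot with mixed votes vs. a brand-new bot.
history = {
    "weather_bot": BotRecord(accepted=40, total=100),
    "new_recipe_bot": BotRecord(),          # no history yet
}
print(pick_bots(history))
```

The paper's actual learning model is more involved (it also learns an automatic voting component), but this captures the intuition I described above: new bots are not starved of traffic just because the system has never seen them respond.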
This paper seems to follow on the theme of last week's reading, "Pull the Plug? Predicting If Computers or Humans Should Segment Images". In that paper, the system tries to judge the quality of an algorithm's image segmentation and pass the task to a human if the result is not up to par. The goals of this paper seem similar, but for chatbots instead of image segmentation algorithms. I'm starting to think that curation and quality checking is a common refrain that will pop up in other crowd-work-based applications if I keep reading in this area. I also find it an interesting choice that Evorus allows multiple responses (either from bots or from crowd workers) to be voted in and displayed to the client. I suppose the idea is that, as long as the responses make sense and add information for the client, it is beneficial to allow multiple responses rather than forcing a single, canonical one.

Though I like this paper and the application it presents, one issue I have is that the authors don't present a proper user study. Maybe they felt it was unnecessary because user studies of automatic and crowd-based chatbots have been done before and the results would be no different. But I still think they should have done some client-side interviews or observations, or at least shown a graph of the Likert-scale responses they collected for the two phases.
- Do you see a similarity between this work and the Pull the Plug paper? Is the idea of curation and quality control, and of teaching AI to perform that quality control, a common refrain in crowd work research?
- Do you find the integration of the Filler bot, Interview bot, and Cleverbot, which do not actually contribute anything useful to the conversation, of any use? Were they just there to add conversational noise, or did they serve a training purpose?
- Would a user study have shown anything interesting or surprising compared to a standard AI-based or crowd-based chatbot?