Summary
The paper examines user expectations around end-user applications. It is essential to set user expectations at an appropriate level so that the user does not find the end product underwhelming. Most related work in this area finds that user disappointment occurs when initial expectations are set too high. Initial expectations can originate from advertisements, product reviews, brands, word of mouth, etc. The authors tested their hypotheses on an AI-powered scheduling assistant, built with an interface similar to the Microsoft Outlook email client. The main purpose of the interface was to detect whether an email was sent with the intention of scheduling a meeting. If so, the AI would automatically highlight the meeting-request sentence and then allow the user to schedule the meeting. The authors designed three techniques for adjusting end-user expectations: an Accuracy Indicator, Example-based Explanations, and a Control Slider. Most of their hypotheses were supported. Notably, the version of the AI system tuned for high recall saw better user acceptance than the one tuned for high precision.
Reflection
The paper was an interesting read on adjusting end-user expectations. The AI scheduling assistant served as a UI testbed for evaluating users' reactions to and expectations of the system, and the authors ran experiments across the three design techniques. I was intrigued to find that the high-precision version did not result in a higher perception of accuracy. A practitioner with an ML background tends to prioritize precision, i.e., minimizing false positives. From this, we can infer that the task at hand should dictate which metric to prioritize. It is certainly true that here, a wrongly highlighted sentence would annoy the user less than completely missing the meeting details in an email. Hence, I would say this prioritization of high recall should be kept in mind and adjusted according to the end goal of the system.
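To make the precision/recall trade-off concrete, here is a minimal sketch in Python with hypothetical counts (my own illustrative numbers, not figures from the paper) showing how the two tunings diverge for the meeting-detection task:

```python
# Minimal sketch of the precision/recall trade-off for meeting detection.
# All counts below are made up for illustration only.

def precision(tp: int, fp: int) -> float:
    """Fraction of highlighted sentences that truly request a meeting."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of true meeting requests that the system highlights."""
    return tp / (tp + fn)

# High-precision tuning: few false highlights, but many missed meetings.
print(precision(tp=40, fp=2), recall(tp=40, fn=20))   # ~0.95, ~0.67

# High-recall tuning: almost no missed meetings, more spurious highlights.
print(precision(tp=58, fp=15), recall(tp=58, fn=2))   # ~0.79, ~0.97
```

In the second tuning, the cost of the extra false highlights is a momentary distraction, whereas in the first, a third of real meetings slip through unnoticed, which matches the paper's finding that users accepted the high-recall system more readily.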
It would also be interesting to see how such expectation-oriented experiments perform on other, more complex tasks. This AI scheduling task was straightforward, with only one correct answer. It is necessary to see how the expectation-based approach fares when the task is subjective. By subjective, I mean that the success of the task would vary from user to user. For example, in text summarization, the judgment of the quality of the end product would depend heavily on the user reading it.
Another critical thing to note is that expectations can also stem from a user's personal skill level and workload. For a crowd worker, a wrongly highlighted line might not matter much when the number of emails and tasks is small. But how likely is this to annoy busy professionals who deal with a large volume of emails and messages containing meeting requests? Multiple incorrect highlights a day are bound to disappoint such users.
Questions
- How does this extend to other complex user-interactive systems?
- Certain tasks are subjective, like text summarization. How would the system evaluate success and gauge expectations in cases where the quality of the result is judged differently by each user?
- How would the expectation vary with the skill level of the user?
Hi Bipasha, I agree with your point that which type of error to minimize depends heavily on the end goal of the system. I also think that what an individual uses the system for shapes the user's expectations. If the task tolerates mistakes, the user will have some tolerance for errors returned by the system. However, incorrect inferences by AI systems on high-stakes tasks can hardly be accepted, given the potential consequences. For the subjective tasks you mentioned, I think we still have multiple ways to evaluate. For example, in text summarization, we can define one metric to measure the relevance of the returned summary to the given article, and another to quantify how well the summary captures the information delivered by the whole article. That would give us a measurement similar to the Accuracy Indicator, which could be presented to the crowd workers. By adopting mechanisms such as attention, the model could highlight the source text it relied on for the final summary as an example-based explanation.
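To sketch what such metrics could look like, here is a rough word-overlap implementation in the spirit of ROUGE-1. The tokenizer, the example texts, and the exact precision/recall definitions are my own illustrative assumptions, not something specified in the thread or the paper:

```python
# Rough sketch of overlap-based summary metrics, in the spirit of ROUGE-1.
# The tokenization and metric definitions here are illustrative assumptions.

def tokens(text: str) -> list[str]:
    return text.lower().split()

def summary_relevance(summary: str, article: str) -> float:
    """Relevance: fraction of summary words that also appear in the article."""
    article_words = set(tokens(article))
    summary_words = tokens(summary)
    if not summary_words:
        return 0.0
    return sum(w in article_words for w in summary_words) / len(summary_words)

def summary_coverage(summary: str, article: str) -> float:
    """Coverage: fraction of article words captured by the summary."""
    summary_words = set(tokens(summary))
    article_words = tokens(article)
    if not article_words:
        return 0.0
    return sum(w in summary_words for w in article_words) / len(article_words)

article = "The team will meet on Friday at noon to review the launch plan."
summary = "Team meets Friday noon to review launch plan."
print(summary_relevance(summary, article), summary_coverage(summary, article))
```

A pair of scores like these could be aggregated into a single displayed figure, playing the role the Accuracy Indicator plays in the paper, even though the underlying task has no single correct answer.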