Summary
The authors (of which there are many) review HCI-related findings on human-AI interaction and distill them into eighteen guidelines organized into four categories (based on when the user encounters the AI's assistance). The work makes clear that the guidelines draw on the past twenty years of research: a review of industry guidelines, articles and editorials in the public domain, and a (non-exhaustive) survey of scholarly papers on AI design. In all, the authors collected 168 candidate guidelines, which they consolidated through affinity diagramming (filtering out concepts that were too “vague”), resulting in twenty concepts. Eleven members of their team at Microsoft then performed a modified discount heuristic evaluation (each identifying an application and an issue it exhibited) and refined the guidelines with that data, resulting in eighteen rules. Next, they ran a user study with 49 HCI experts, each of whom was given an AI tool and asked to evaluate it against the guidelines. Lastly, they had experts validate the revisions made in the previous phase.
Personal Reflection
These guidelines are genuinely helpful for evaluating an interface. As someone who has performed several heuristic evaluations outside of a class setting, having well-defined rules whose violations are easy to identify makes the process significantly quicker. Nielsen’s heuristics have been the gold standard for perhaps too long, so revisiting how guidelines are created is welcome. It also speaks to how recent this paper is, coming from the 2019 CHI conference.
Several things surprised me in this work. First, I was surprised that the authors disallowed contractions in their guidelines on the grounds that contractions were unclear. I haven’t heard that complaint before, and it seemed somewhat arbitrary; a contraction doesn’t change a sentence much (“doesn’t” in this sentence is clearly “does not”), though I may be mistaken here. I was also surprised to find the tables in Figure 1 hard to read, as if they were a bit too information-dense to clearly impart the findings. I was likewise surprised by the example for guideline 6, as suggesting personal pronouns while more or less implying there are only two is murky at best (I would have used a different example entirely). Lastly, the authors completely ignored the suggestion to keep the old guideline 15, stating their own reasons despite the experts’ preferences.
I also think this paper will be a valuable resource for future AI development. In particular, it offers plenty of ideas for our semester project. Furthermore, these guidelines can help early in the design of future interactions, catching and correcting interaction mistakes before many of these features are implemented.
Lastly, I thought it was amusing that the “newest” member of the team got a shout-out in the acknowledgements.
Questions
- The authors bring up trade-offs as a common occurrence when balancing these (and past) guidelines. Which of these guidelines do you think are easier or harder to bend?
- The authors ignored the suggestion of their own panel of experts in revising one of their guidelines. Do you think this is appropriate for this kind of evaluation, and why or why not?
- Can you think of an example of one of these guidelines not being followed in an app you use? What is it, and how could it be improved?
I also caught the guideline 15 issue, and I would really like to know why there was a disparity between the user study and the expert opinion. I was hoping the authors would provide an example comment, but I suppose they deemed it irrelevant or left it out to save space.