Reading: Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–13. https://doi.org/10.1145/3290605.3300233
With AI and ML making their way into every aspect of our electronic lives, it has become important to examine how well AI-powered features actually work when placed in front of users. To do that, we need a set of rules or guidelines to use as a reference for judging whether the interaction between a human and an AI-powered feature is functioning the way it should. This paper aims to fill that gap, collating over 150 recommendations for human-AI interaction and distilling them into 18 distinct guidelines that can be checked for compliance. The authors also describe how they refined and tailored these guidelines to remove ambiguity: through rounds of heuristic evaluation, experts matched the guidelines against sample interactions and judged whether each interaction adhered to or violated a guideline, or whether the guideline was even relevant to that interaction at all.
- Though it’s only mentioned briefly in the Discussion section, I’m glad they acknowledge the tradeoff between being very general (at which point the vocabulary you devise is useless and you have to start defining subcategories) and being very specific (at which point you have to start adding addendums and special cases willy-nilly). I think the set of guidelines in this paper does a good job of striking that balance.
- I do find it unfortunate that they anonymized the products they used to test interactions on. Maybe it is standard practice in this kind of HCI work not to name the exact products evaluated, to avoid dating the paper. It probably also gives them control of the narrative: they can describe each application purely in terms of the feature and interaction tested, without having to specify which version of the application was used on which day. Applications get updated all the time, and a violation might get patched and fixed, at which point the application is no longer a good example of the guideline adherence or violation that was noted earlier.
- It is interesting that a majority of the experts in phase 4 preferred the original version of guideline 15 (encourage feedback) over the revised version (provide granular feedback) that was successful in the user study. I wish they had explained or speculated about why that was.
- Why do you think the experts in phase 4 preferred the original version of guideline 15 over the revised version, even though the revised version was shown in the user study to cause less confusion with guideline 17?
- Are we going to see even more guidelines, or a revision of these guidelines, 10 years down the line when AI-assisted applications become even more ubiquitous?
- As the authors point out, the current ethics-related guidelines (5 and 6) may not be sufficient to cover all ethical concerns. What other guidelines should there be?