SUMMARY
The authors of this paper design and evaluate a mixed-initiative fact-checking approach that blends prior human knowledge with the efficiency of automated ML systems. The authors found that users tend to over-trust the model, which could degrade human accuracy. They conducted three randomized experiments: the first compares users who perform the task with or without viewing ML predictions, the second compares a static interface with an interactive one (that enables users to fix model predictions), and the third compares a gamified task design to a non-gamified one. The authors designed an interface that displays the claim, the predicted correctness, and relevant articles. For the first experiment, the authors considered responses from 113 participants, with 58 assigned to Control and 55 to System. For the second experiment, the authors considered responses from 109 participants, with 51 assigned to Control and 58 to Slider. For the third experiment, the authors considered responses from 106 participants and found no significant differences between the two groups.
REFLECTION
I liked the idea of a mixed-initiative approach to fact checking that builds on the affordances of both humans and AI. I appreciated that the authors designed the experiments such that the confidence scores (and therefore the fallibility) of the system were openly shown to the users. I also felt that the interface design was concise and appropriate without being overly complex. I also liked the design of the gamified approach and was surprised to learn that the game design did not impact participant performance.
I agree that for this case in particular, participant demographics may affect the results, especially since the news articles considered were mainly related to American news. I wonder how much of a difference in the results would be observed in a follow-up study that considers different demographics from those in this study. I also agree that caution must be exercised with such mixed-initiative systems, as imperfect data sets would have a considerable impact on model predictions, and that humans should not blindly trust the AI predictions. It would definitely be interesting to see the results obtained when users check their own claims and interact with other users’ predictions.
QUESTIONS
- The authors explain that the incorrect statement on Tiger Woods was due to the model having learnt the bi-gram ‘Tiger Woods’ incorrectly – something that a more sophisticated classifier may have avoided. How much of an impact would such a classifier have made on the results obtained overall? Have other complementary studies been conducted?
- The authors found that a smaller percentage of users used the sliders than expected. They state that while the sliders were intended to be intuitive, they may involve a learning curve, causing fewer users to adopt them. Would the use of a tutorial that enabled users to familiarize themselves with the sliders have helped in this case?
- Were the experiments conducted in this study adequate? Are there any other experiments that the authors should have conducted in addition to the ones mentioned?