Word count: 568
Summary of the Reading
This paper contributes to the field of human-AI interaction by presenting a new way to evaluate AI agents. Most evaluations of AI systems are done in isolation, with no human input: one AI system interacts with another, and their combined interaction forms the basis of the evaluation. This research instead brings humans into the loop, replacing one of the AI systems with a person, to evaluate how AI systems work in a more realistic scenario: one where humans are present. The paper finds that these two evaluation methods can produce different results. Specifically, the AI system that performed worse when evaluated against another AI system actually performed better when evaluated with a human. This raises important questions about the way we test AI systems and suggests that testing should be more human focused.
Reflections and Connections
I think this paper highlights an important issue that I had never really thought about. Whenever we build a new tool or system, it must be tested, and that testing process largely decides whether the system is judged to work. The way we design tests is just as important as the way we design the system in the first place. If we design a great system but a bad test, and the test says the system doesn't work, we have lost a good idea because of a bad test. This paper will make me think more critically about how I design my tests in the future: I will put more care into them and make sure they are well designed and actually measure what I intend to measure.
When these ideas are applied to AI, they get even more interesting. AI systems can be extremely hard to test, and it is often much easier to build another automated system, whether another AI or a simple script, to test an AI system than it is to recruit real people. Machines don't require IRB approval, they are available 24/7, and they provide consistent results. However, when we design AI systems, and especially AI systems meant to be used by humans, it is important that we test them with humans. We cannot truly know whether a system designed to work with humans actually works until we test it with humans.
I hope this paper will push more research teams to include humans in their testing. With crowdsourcing platforms like MTurk, it is easier and cheaper than ever to recruit human testers for your systems.
Questions
- What other kinds of systems should use humans in testing, rather than bots or automated systems?
- Should all AI systems be tested with humans? When is it acceptable to test with machines?
- Should we be more skeptical of past results, given that this paper showed a machine-only evaluation produced the wrong result (the wrong ALICE bot was judged better by machines)?