{"id":864,"date":"2020-03-24T18:40:23","date_gmt":"2020-03-24T18:40:23","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/cs6724s20\/?p=864"},"modified":"2020-03-24T23:37:21","modified_gmt":"2020-03-24T23:37:21","slug":"03-25-20-lee-lisle-evaluating-visual-conversational-agents-via-cooperative-human-ai-games","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/2020\/03\/24\/03-25-20-lee-lisle-evaluating-visual-conversational-agents-via-cooperative-human-ai-games\/","title":{"rendered":"03\/25\/20 \u2013 Lee Lisle \u2013 Evaluating Visual Conversational Agents via Cooperative Human-AI Games"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong>Summary<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Chattopadhyay\net al.\u2019s work details the problems with the current (pre-2018) methods of\nevaluating visual conversational agents. These agents, which are AIs designed\nto discuss what is in pictures, were typically evaluated through one AI (the\nprimary visual conversational agent) describing a picture while another asked\nquestions about it. However, the authors show how this kind of interaction does\nnot adequately reflect how humans would converse with the agent. They use two\nvisual conversational agents, dubbed ALICE_SL and ALICE_RL (for supervised and\nreinforcement learning, respectively) to play 20 questions with AMT workers.\nThey found that there was no significant difference in the performance of the\ntwo versions of ALICE. This stood in contrast to previous work, which\nfound that ALICE_RL was significantly better than ALICE_SL when tested by AI-AI\nteams. Both ALICEs perform better than random chance, however. 
Furthermore,\nAI-AI teams require fewer guesses than the humans in Human-AI teams.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Personal Reflection<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I first noticed that their name\nfor 20 questions was Guess What or Guess Which. This has relatively little to\ndo with the paper, but it was jarring to me at first.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first thing that\nstruck me was their discussion of the previous methods. If the first few rounds\nof AI-AI evaluation were monitored, why didn\u2019t they pick up that the\ninteractions weren\u2019t reflective of human usage? If the abnormality didn\u2019t\npresent until later on, could they have monitored late-stage rounds, too? Or was\nit generally undetectable? I feel like there\u2019s a line of questioning here that wasn\u2019t\nexplored and that might benefit AI as well.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I was amused that,\nwith the whole paper being about AI and its interactions with humans, they chose\nthe image set to be medium difficulty based on \u201cmanual inspection.\u201d Does this\nindicate that the AIs don\u2019t really understand difficulty in these datasets?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another minor quibble\nis that they say each HIT was 10 games, but then state that they published HITs\nuntil they got 28 games completed on each version of ALICE and specify this\nmeant 560 games. They overload the word \u2018game\u2019 without describing the actual\nmeaning behind it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">An interesting\nquestion that they didn\u2019t discuss investigating further is whether question\nstrategy evolved over time for the humans. Did they change up their style of\nquestions as time went on with ALICE? 
This might provide some insight as to why\nthere was no significant difference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Lastly, their\ndiscussion of the knowledge leak of evaluating AIs on AMT was quite\ninteresting. I would not have thought that limiting the interaction each turker\ncould have with an AI would improve the AI.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Questions<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Of all the participants who started a HIT on\nAMT, only 76.7% actually completed it. What does this mean\nfor HITs like this? Did the turkers just get bored or did the task annoy them\nin some way?<\/li><li>The authors pose an interesting question in 6.1\nabout QBot\u2019s performance. What do you think would happen if the turkers played\nthe role of the answerer instead of the guesser?<\/li><li>While they didn\u2019t find any statistically significant\ndifferences, Figure 4(b) shows that ALICE_SL outperformed ALICE_RL in every\nround of dialogue. While this wasn\u2019t significant, what can be made of this\ndifference?<\/li><li>How would you investigate the strategies that\nhumans used in formulating questions? What would you hope to find?<\/li><\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Summary Chattopadhyay et al.\u2019s work details the problems with the current (pre-2018) methods of evaluating visual conversational agents. These agents, which are AIs designed to discuss what is in pictures, were typically evaluated through one AI (the primary visual conversational agent) describing a picture while another asked questions about it. 
However, the authors show [&hellip;]<\/p>\n","protected":false},"author":105,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[87,81],"class_list":["post-864","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-class10","tag-vqagames"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/posts\/864","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/users\/105"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/comments?post=864"}],"version-history":[{"count":1,"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/posts\/864\/revisions"}],"predecessor-version":[{"id":866,"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/posts\/864\/revisions\/866"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/media?parent=864"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/categories?post=864"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/cs6724s20\/wp-json\/wp\/v2\/tags?post=864"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}