03/25/2020 – Sushmethaa Muhundan – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

The primary intent of this paper is to measure the performance of human-AI teams, which it does in the context of a visual conversational agent embedded in a cooperative game. Oftentimes the performance of AI systems is evaluated in isolation or with respect to interaction with other AI systems. This paper asks whether such AI-AI evaluation can be extended to predict how the AI system performs while interacting with humans, which is essentially human-AI team performance. To measure the effectiveness of the AI in the context of human-AI teams, a game-with-a-purpose (GWAP) called GuessWhich is used. The game involves a human player interacting with an AI component that answers questions about a secret image the human cannot see. Via this question-answer exchange, the human asks questions about the image and attempts to identify the secret image from a pool of images. Two versions of the AI component are used in this experiment: one trained in a supervised manner, and one pre-trained with supervised learning and fine-tuned via reinforcement learning. The experiment results show that there is no significant performance difference between the two versions when interacting with humans.
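
To make the game protocol concrete, here is a minimal sketch of the interaction loop described above. The `alice_answer` interface and the simple prompt-based human input are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the GuessWhich interaction loop: the AI (ALICE)
# answers questions about a secret image; the human finally guesses which
# image in the pool is the secret one. `alice_answer` is a stand-in for
# the trained visual dialog model, not the paper's actual code.

def play_guesswhich(image_pool, secret_image, alice_answer, num_rounds=9):
    dialog_history = []                      # ALICE conditions on prior Q&A
    for round_idx in range(num_rounds):
        question = input(f"Round {round_idx + 1} - ask about the image: ")
        answer = alice_answer(secret_image, question, dialog_history)
        dialog_history.append((question, answer))
        print(f"ALICE: {answer}")

    guess = int(input(f"Guess the secret image (0-{len(image_pool) - 1}): "))
    return image_pool[guess] == secret_image  # did the human-AI team succeed?


# Example usage with a trivial stand-in model:
if __name__ == "__main__":
    pool = ["cat.jpg", "dog.jpg", "beach.jpg"]
    success = play_guesswhich(
        pool, secret_image="beach.jpg",
        alice_answer=lambda img, q, hist: "yes" if "outdoor" in q.lower() else "no",
    )
    print("Team succeeded!" if success else "Team failed.")
```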

The trend of humans interacting with AI, directly or indirectly, has increased exponentially, so it was interesting that the focus of this paper is on the performance of the human-AI team and not the AI in isolation. Since it is becoming increasingly common to use AI in the context of humans, a dependency is created that cannot be captured by measuring the performance of the AI component alone.

While the results show that there is no significant performance difference between the two versions of the AI, they also show that improvements measured by AI-AI evaluation do not directly translate into better human-AI team performance. This was an interesting insight that challenges existing AI evaluation norms.

Also, the cooperative game used in the experiments was complicated from a development point of view, and it was interesting to understand how the AI was developed and how the pool of images was selected. The paper also explores the possibility of the MTurk workers discovering the strengths of the AI and framing subsequent questions to leverage those strengths. This is a fascinating possibility because it ties back to the mental models humans create while interacting with AI systems.

  1. Given that the study was conducted in the context of visual conversational agents, are these results generalizable outside of this context?
  2. It is observed that human-AI team performance is not significantly different for SL compared to RL. What are some reasons that could explain this observed anomaly?
  3. Given that ALICE is imperfect, what would be the recovery cost of an incorrect answer? Would this substantially impact the performance observed?


3/25/2020 – Mohannad Al Ameedi – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Improvements in artificial intelligence systems are normally measured alone, without taking the human element into consideration. In this paper, the authors try to measure and evaluate human-AI team performance by designing an interactive visual conversational agent that involves both a human and an AI to solve a specific problem. The game assigns the AI system a secret image with a caption, which is not known to the human, and the human asks rounds of questions to guess the correct image from a pool of images. The agent maintains an internal memory of questions and answers to help maintain the conversation.

The authors use two versions of the AI system: the first is trained using supervised learning and the second is fine-tuned using reinforcement learning. The second system outperforms the first in AI-AI evaluation, but the improvement doesn't translate well when interacting with humans, which shows that advances in an AI system don't necessarily mean advances in human-AI team performance.

Reflection

I found the idea of running two AI systems with the same humans to be very interesting. Normally we think that advances in an AI system lead to better usage by the human, but the study shows that this is not the case. Putting the human in the loop while improving the AI system gives us the real performance of the system.

I also found the concept of reinforcement learning in conversational agents to be very interesting. Using online learning with positive and negative rewards can help improve the conversation between the human and the AI system, for example by preventing the system from getting stuck on the same answer when the human asks the same question.
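
As a toy illustration of that idea, the sketch below shapes a per-turn reward so that repeating an earlier answer is penalized. The reward design here is invented for illustration; the paper's RL fine-tuning uses a task-based image-guessing reward, not this exact scheme.

```python
# Toy illustration of a per-turn reward that discourages a dialog agent from
# repeating itself. The actual RL fine-tuning in the paper uses a task-based
# reward (improvement in the image-guessing objective); this sketch only
# shows the general idea of shaping rewards from the conversation itself.

def turn_reward(answer, previous_answers, task_gain):
    reward = task_gain                      # e.g., improvement in guess quality
    if answer in previous_answers:
        reward -= 1.0                       # penalize giving the same answer again
    return reward


history = []
for answer, gain in [("a red bus", 0.4), ("a red bus", 0.1), ("near a park", 0.3)]:
    r = turn_reward(answer, history, gain)
    history.append(answer)
    print(f"{answer!r}: reward = {r:+.1f}")
```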

The work is somewhat like the concept of compatibility, where the human builds a mental model of the AI system. Advances in the AI system might not translate into better usage by the human, and this is what the authors show when they use two AI systems, one better than the other, yet the improvement does not necessarily translate into better performance by the users.

Questions

  • The authors showed that improvement in an AI system alone doesn't necessarily lead to better performance when the system is used by humans. Can we involve the human in the process of improving the AI system so that performance improves when the AI system gets improved?
  • The authors use a single secret image known to the AI system but not to the human. Can we make the image unknown to the AI system too, by providing a pool of images and having the AI system select the appropriate image? And can we do that with acceptable response latency?
  • If we have to use conversational agents like bots in a production setting, do you think a system trained with supervised learning can respond faster than one trained with reinforcement learning, given that reinforcement learning needs to adjust its behavior based on rewards or feedback?


03/25/20 – Myles Frantz – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Regardless of anyone's personal perception of chatbots, with around 1.4 billion people using them (smallbizgenius), their impact cannot be ignored. Intended to answer rudimentary, often duplicated questions, many of these chatbots are focused on the question-and-answer (QA) domain. Across these systems, usage of and feelings toward chatbots vary based on the user, the chatbot, and the overall interaction. Focusing on the human-centric aspect of the conversation, the team proposed a conversational agent (CA, a chatbot for QA) along with a method to inspect sections of a conversation and determine whether the user enjoyed it. By introducing a hierarchy of specific natural language classifiers (NLCs), the team was able to derive, from certain classifications or signals, a high-level abstraction of a message or conversation. While the CA did its job sufficiently, the team's signal methodology showed that approximately 84% of people engaged in some sort of conversation with the CA beyond a normal question-and-answer scenario.

Reflection

I am surprised at the results gleaned from this survey. While I should not be, since the closer a CA (and AI in general) gets to appearing human-like, the better the interaction presumably is, the percentage of "playful" or conversational messages seemed relatively high. This may be due to the participant group (new hires out of college), though it is a promising sign of the progress being made.

I appreciate the aspect (or angle) this research took. Having a strong technical background, my immediate thought is to ensure all the questions are answered correctly and to investigate how the CA could be integrated with other systems (like a Jenkins Slack bot polling the status of a project). The success of a project, I believe, depends not only on how usable it is but also on how user-friendly it is. Take the example of MySpace and Facebook: Facebook created a much easier-to-use and more user-centric experience (based on connecting people), while MySpace suffered from a lack of improvement in both aspects and is currently declining in usage.

Questions

  • With only 34% of the participants responding to the survey, do you think a higher response rate would have reinforced and backed up the data currently collected?
  • Given the general maturity and time availability of a new hire out of college, do you think current employees (who have been with the company for a while) would show the same percentage of conversation? In short, do you think normally busy or higher-up employees would have given similar conversational time to the CA?
  • Given the percentage of new hires who responded conversationally to the CA, the opportunity arises for users to chat freely and disregard their current work in favor of a conversation (potentially as a form of escapism). Do you think that if this kind of CA were deployed throughout companies, these capabilities would be abused or overused?


03/25/20 – Myles Frantz – All Work and No Play?

Summary

While the general public may not realize the full extent of their interaction with AI, people truly depend on it throughout the day, whether in a conversation with a phone assistant or in the backend of the bank they use. While comparing an AI against its own metrics is imperative to ensure high quality, this team compared how two different models fare when working in collaboration with humans. To ensure a valid and fair comparison, a simple game (similar to Guess Who) is used in which the AI has to work with another AI or with humans to guess the selected image based on a series of questions. Though the AI-AI collaboration provides good results, the AI-human collaboration is relatively weaker.

Reflection

I appreciate the competitive nature of comparing both the supervised learning (SL) and the reinforcement learning (RL) model in the same game scenario of helping the human succeed by aiding them as best as the agent can. However, I take issue with one of their contributions, the relative comparison between the SL and RL bots. Within their contributions they explicitly say they find "no significant difference in performance" between the models. While they describe the two methods as performing approximately equally, their self-reported data describes a better model on most measurements. In Table 1 (the comparison of humans working with each model), SL is reported as having a small advantage in both Mean Rank and Mean Reciprocal Rank (lower and higher is better, respectively). In Table 2 (the comparison of the various teams), there was only one scenario in which the RL model performed better than the SL model. Lastly, even in the participants' self-reported perceptions, the SL model was rated lower in only 1 of 6 categories. Though each difference may be small, their diction downplays part of the argument they're making. I admit the SL model having a better Mean Rank by 0.3 (from the Table 1 MR difference or the Table 2 Human row) does not appear to be a big difference, but I believe part of their contribution statement, "This suggests that while self-talk and RL are interesting directions to pursue for building better visual conversational agents…", is not an accurate description, since by their own data it is empirically contradicted.
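
For reference, here is how Mean Rank and Mean Reciprocal Rank are computed from the rank the secret image receives in each game; the ranks below are made up for illustration and are not the paper's data.

```python
# Mean Rank (MR, lower is better) and Mean Reciprocal Rank (MRR, higher is
# better) computed from the rank the secret image receives in each game.
# The ranks below are invented for illustration only.

def mean_rank(ranks):
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks_sl = [3, 1, 7, 2, 5]   # hypothetical ranks with the SL agent
ranks_rl = [4, 1, 8, 2, 6]   # hypothetical ranks with the RL agent

print(f"SL: MR={mean_rank(ranks_sl):.2f}  MRR={mean_reciprocal_rank(ranks_sl):.3f}")
print(f"RL: MR={mean_rank(ranks_rl):.2f}  MRR={mean_reciprocal_rank(ranks_rl):.3f}")
```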

Questions

  • I admit I focus on the representation of the data and the delivery of their contributions while they focus on the human-in-the-loop aspect; still, within a machine learning context I imagine a decrease in accuracy of 0.3 (approximately 5%) would not be described as insignificant. Do you think their verbiage is truly representative of its significance for machine learning?
  • Do you think more Turk workers (they used data from at least 56 workers) or adding age requirements would change their data?
  • Though evaluating the quality of collaboration between humans and AI is imperative to ensure AI systems are built adequately, there seems to be a common disparity between that kind of evaluation and AI-AI evaluation. Given this disconnect, their statement on the gap between the two kinds of evaluation seems like a fundamental idea. Do you think this work is more idealistic or more fundamental in its contributions?


03/18/2020 – Nan LI – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Summary:

The main objective of this paper is to address the monetary cost and response latency problems of crowd-powered conversational assistants. The first approach developed in this paper is to combine crowd workers and automated systems to achieve a high-quality, low-latency, and low-cost solution. Based on this idea, the paper presents a crowd-powered conversational assistant, Evorus, which gradually automates itself over time by including responses from various chosen chatbots, learning to reuse prior responses, and reducing crowd oversight via an automatic voting system. To design and refine this flexible framework for open-domain dialog, the authors conducted two phases of public field deployment and testing with real users. The final goal of the system is to let its automatic components gradually take over from the crowd.

Reflection:

I think this paper presents several excellent points. First, crowd-powered systems have been widely employed because of their low monetary cost and high convenience. However, as the number of required crowd workers and tasks increases, the expenditure on those workers grows to a non-negligible amount. Besides, even though platforms that enable hiring crowd workers quickly are available, the response latency is still significant. The authors recognized these deficiencies and tried to develop an approach to solve them.

Second, combining crowd workers and automated systems is a prevalent idea. The novelty of this paper is adding an automatic voting system to decide which response to send to the end user. This machine learning model enables high-quality responses while reducing crowd oversight. The increased error tolerance enables even an imperfect automation component to contribute to the conversation without hurting quality. Thus, the system can integrate more types of chatbots and extend the region of actions it explores. Besides, the balance between the weights of "upvote" and "downvote" gives Evorus flexible and fluid collaboration between humans and chatbots.
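
A minimal sketch of the weighted up/downvote idea follows: candidate responses accumulate votes from crowd workers and an automatic voter, and a response is sent once its score clears a threshold. The specific weights and threshold below are placeholders, not Evorus's actual parameters.

```python
# Simplified sketch of weighted voting over candidate responses. Upvotes and
# downvotes carry different weights, and a candidate whose score clears the
# threshold is sent to the user. All numbers are placeholders, not the actual
# weights learned or used by Evorus.

UPVOTE_WEIGHT = 1.0
DOWNVOTE_WEIGHT = -1.5     # downvotes count more, to protect response quality
SEND_THRESHOLD = 2.0

def score(votes):
    return sum(UPVOTE_WEIGHT if v == "up" else DOWNVOTE_WEIGHT for v in votes)

candidates = {
    "It opens at 9 am.": ["up", "up", "up"],          # crowd + automatic voter
    "I do not know.":    ["up", "down"],
}

for response, votes in candidates.items():
    s = score(votes)
    decision = "SEND" if s >= SEND_THRESHOLD else "hold"
    print(f"{decision:>4}  score={s:+.1f}  {response}")
```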

Third, another novel attribute of this system is reusing prior responses. I think the idea of enabling Evorus to find answers to similar queries in prior conversations and suggest them as responses to new queries is a key step that could eventually move the system from partially crowd-powered to fully automatic. This simulates how people learn from the past, and it is also what we do in daily conversation. As the system takes part in more conversations and memorizes more query-response pairs, it might be able to build a comprehensive database storing all types of conversational queries and responses. At that point, the system might ultimately become automatic. However, this database would still need to be updated, and become partially crowd-powered once in a while, given the constant change in information and in the way people communicate.
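
To illustrate the reuse idea, here is a sketch using simple string similarity from the standard library; Evorus's actual retrieval model is more sophisticated, so treat this purely as an illustration with made-up memory entries.

```python
# Illustration of reusing prior responses: find the stored query most similar
# to the new query and suggest its old response as a candidate. This uses
# difflib string similarity as a stand-in for Evorus's actual retrieval model.
from difflib import SequenceMatcher

memory = {
    "what time does the library open": "It opens at 9 am on weekdays.",
    "where can i park near campus":    "The Perry Street garage is closest.",
}

def suggest_response(new_query, memory, min_similarity=0.6):
    best_query, best_score = None, 0.0
    for old_query in memory:
        s = SequenceMatcher(None, new_query.lower(), old_query.lower()).ratio()
        if s > best_score:
            best_query, best_score = old_query, s
    if best_score >= min_similarity:
        return memory[best_query]          # candidate answer for the crowd to vote on
    return None                            # nothing similar enough; ask the crowd

print(suggest_response("when does the library open", memory))
```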

Question:

  • What do you think of this system? Do you think it is possible that the system only relies on automation one day?
  • What do you think about the voting system? Do you think it is a critical factor that enables a high-quality response? What do you think about the design of different weights for “upvote” and “downvote”?
  • It is a prevalent idea nowadays to combine crowd workers and AI systems to achieve high accuracy or high quality. However, the authors expect the system to rely increasingly on automation. Can you see the benefit if this expectation is achieved?

Word Count: 567


03/18/20 – Nan LI – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary:

The main objective of this paper is to measure AI agents through interactive downstream tasks performed by human-AI teams. To achieve this goal, the authors designed a cooperative game, GuessWhich, that requires a human to identify a secret image among a pool of images by engaging in a dialog with an answerer bot (ALICE). The image is known to ALICE but unknown to the human, so the human needs to ask related questions and pick out the secret image based on ALICE's answers. Two versions of ALICE are presented in the paper: ALICE(SL), which is trained in a supervised manner to simulate conversations with humans about images, and ALICE(RL), which is pre-trained with supervised learning and fine-tuned via reinforcement learning for the image-guessing task. The results indicate that ALICE(RL) evaluates better against another AI than it does with humans. Therefore, the paper concludes that there is a disconnect between benchmarking AI in isolation and in the context of human-AI interaction.

Reflection:

This paper reminds me of another paper we have discussed before: Updates in Human-AI Teams. Both concern the impact of human involvement on AI performance. I think this is a great topic, and it is worth putting more attention on it, because, as the beginning of the paper says, as AI continues to advance, human-AI teams are inevitable. Many AI products are already widely used across society, in all walks of life, for example predictive policing, life insurance estimation, sentencing, and medicine. These products all require humans and AI to cooperate. Therefore, we should agree that the development and improvement of AI should always consider the impact of human involvement.

The QBOT-ABOT teams mentioned in this paper follow a similar idea to GANs (generative adversarial networks): both train two systems together and let them provide feedback to each other to enhance their performance. However, the authors make the point that it is unclear whether these QBOT and ABOT agents actually perform well when interacting with humans. This is an excellent point that we should always consider when designing an AI system. The measurement of an AI system should never be isolated; we should also consider human mental models and how they impact team performance when humans and AI work cooperatively. A suitable human-involved evaluation may be a more valuable measurement for an AI system.

Question:

  1. When do you think we should measure performance with human involvement, and when should we not?
  2. What is the main point of this paper? Why does the author use visual conversational agents to argue that it is crucial to benchmark progress in AI in terms of how it translates to helping humans perform a particular task?
  3. The author mentions at the end of the paper that humans perceive ALICE(SL) and ALICE(RL) as comparable in terms of all metrics. Why do you think humans reach such a conclusion? Does that indicate that human involvement makes no difference between the two visual conversational agents?

Word Count: 537


02/26/2020 – Mohannad Al Ameedi – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary

In this paper, the authors study the social roles of editing tools in Wikipedia and the way vandalism fighting is addressed. The authors focus on the effect of automated tools, like bots, and assisted editing tools on the distributed editing process used by the encyclopedia. Wikipedia allows anyone in the world to edit the content of its articles, which makes keeping the quality of the content high a difficult task. The platform depends on a distributed social network of volunteers to approve or deny changes. Wikipedia uses a source control system to help users see the changes; it shows both versions of the edited content side by side, allowing the editor to see the change history. The authors mention that Wikipedia uses bots and automated scripts to help edit some content and fight vandalism, and they describe different tools used by the platform to assist the editing process. A combination of humans, automated tasks, and assisted editing tools lets Wikipedia handle such a massive number of edits and fight vandalism attempts. Most research papers that studied the editing process are outdated since they didn't pay close attention to these tools, while the authors highlight the importance of these tools in improving the overall quality of the content and allowing more edits to be performed. Technological tools like bots and assisted editing tools have changed the way humans interact with the system and have a significant social effect on the types of activities that are possible in Wikipedia.

Reflection

I found the idea of distributed editing and vandalism fighting in Wikipedia interesting. Given the massive amount of content in Wikipedia, it is very challenging to keep content quality high when anyone in the world with internet access can make an edit. The internal source control and the assisted tools used to support editing at scale are amazing.

I also found the use of bots to automate editing of some content interesting. These automated scripts can help expedite content refresh in Wikipedia, but they can also cause errors. Some tools mentioned in the paper don't even show the bots' changes, so I am not sure whether there is some method that can measure the accuracy of these bots.

The concept of distributed editing is similar to the concept of pull requests on GitHub, where anyone can submit a change to an open source project and only a group of project owners or administrators can accept or reject the changes.

Questions

  • Since millions or billions of people have smartphones nowadays, the amount of anonymous editing might significantly increase. Are these tools still efficient in handling such an increased volume of edits?
  • Can we use deep learning or machine learning to fight vandalism or spam? The edits performed on articles could be treated as a rich training dataset.
  • Why doesn't Wikipedia combine all the assisted editing tools into one tool that has the best of each? Do you think this is a good idea, or do more tools mean more innovation and more choices?


02/25/2020 – Mohannad Al Ameedi – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

In this paper, the authors study the effect of updating an AI system on human-AI team performance. The study focuses on decision-making systems where users decide whether to accept the AI system's recommendation or perform a manual process to make a decision. The authors call the expectations users build over the course of using the system a mental model. Improving the accuracy of the AI system might disturb the users' mental model and decrease the overall performance of the team. The paper mentions two examples, a readmission system used by doctors to predict whether a patient will be readmitted and another system used by judges, and shows the negative impact of system updates in both settings. The authors propose a platform in which users recognize objects, build a mental model, and receive rewards, and the resulting feedback is used to improve overall system performance, which encompasses both the AI system's accuracy and its compatibility.
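
A small sketch of one way to quantify update compatibility: the fraction of cases the old model got right that the new model still gets right. This follows the general idea of the paper, but the exact formulation and all predictions below are illustrative assumptions, not the authors' data.

```python
# Illustrative compatibility check between an old and an updated model:
# compatibility = fraction of examples the old model got right that the new
# model still gets right. A high-accuracy update with low compatibility can
# break the user's mental model. Predictions below are made up.

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def compatibility(old_preds, new_preds, labels):
    old_correct = [i for i, (p, y) in enumerate(zip(old_preds, labels)) if p == y]
    still_correct = sum(new_preds[i] == labels[i] for i in old_correct)
    return still_correct / len(old_correct)

labels    = [1, 0, 1, 1, 0, 1]
old_preds = [1, 0, 1, 0, 0, 0]   # old model: 4/6 correct
new_preds = [1, 1, 0, 1, 0, 1]   # new model: also 4/6 correct, but errs on new cases

print(f"old accuracy  = {accuracy(old_preds, labels):.2f}")
print(f"new accuracy  = {accuracy(new_preds, labels):.2f}")
print(f"compatibility = {compatibility(old_preds, new_preds, labels):.2f}")
```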

Reflection

I found the idea of compatibility very interesting. I always thought that the performance of the AI model on the validation set was the main and only factor that should be taken into consideration, and I never thought about the negative effect on the user experience or the user's mental model. Now I can see that the compatibility/performance tradeoff is key to deploying a successful AI agent.

At the beginning, I thought that the word compatibility was not the right term for the subject. My understanding was that compatibility in software systems refers to making sure a newer version of a system still works across different versions of the operating system, but now I think the user takes on a role similar to the operating system when dealing with the AI agent.

Updating the AI system looks similar to updating the user interface of an application, where users might not like a newly added feature or the new way the system handles a task.

Questions

  • The authors mention the patient readmission and judging examples to demonstrate how an AI update might affect users. Are there any other examples?
  • The authors propose a platform that can collect user feedback, but not in a real-world setting. Can we build a platform that gets feedback at run time using reinforcement learning, where a reward is calculated for each user action and the system adjusts whether to use the current model or the previous one?
  • If we want to use crowdsourcing to improve the performance/compatibility of an AI system, the challenge will be building a mental model for the user, since different users will take on different tasks and we have no control over choosing the same worker every time. Any ideas that could help in using crowdsourcing to improve the AI agent?


03/04/20 – Nan LI – Real-Time Captioning by Groups of Non-Experts

Summary:

In this paper, the authors focus on the main limitations of real-time captioning. They make the point that captioning with high accuracy and low latency requires expensive stenographers who must be booked in advance and who are trained to use specialized keyboards. The less expensive option is automatic speech recognition; however, its low accuracy and high error rate greatly hurt the user experience and cause many inconveniences for deaf people. To alleviate these problems, the authors introduce an end-to-end system called LEGION: SCRIBE, which enables multiple workers to caption simultaneously in real time and combines their input into a final answer with high precision, high coverage, and low latency. The authors experimented with crowd workers and other local participants and compared the results with CART, ASR, and individual workers. The results indicate that this end-to-end system with a group of workers can outperform both individuals and ASR regarding coverage, precision, and latency.

Reflection:

I think the authors make a good point about the limitations of real-time captioning, especially the inconvenience they bring to deaf and hard-of-hearing people. The greatest contribution of this end-to-end system is access to a cheap and reliable real-time captioning channel. However, I have several concerns about it.

First, this end-to-end system requires a group of workers. Even if each person is paid a low wage, as the captioning time increases, the total pay for all workers is still a significant overhead.

Second, to satisfy the coverage requirement, a high-precision, high-coverage, low-latency caption requires at least five or more workers working together. As mentioned in the experiment, the MTurk workers need to watch a 40-second video to understand how to use the system. Therefore, the system may not be able to find the required number of workers in time.

Third, since the system only combines the work of the workers, there is a coverage problem: if all of the workers miss a part of the speech, the system output will be incomplete. Based on my experience, if one person misses part of the content, most other people usually miss it too. As in the example presented in the paper, no worker typed "non-aqueous", which was used in a clip about chemistry.
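
A toy illustration of this coverage issue: combining several workers' partial captions covers more of the speech than any individual does, but any word that no worker types (like "non-aqueous") is still lost. This is not the paper's merging algorithm, just a set-based sketch with invented captions.

```python
# Toy illustration of caption coverage when combining partial inputs from
# several workers. The real SCRIBE system aligns and merges streams far more
# carefully; this only shows why words typed by nobody stay missing.

spoken = "the non-aqueous solution was heated to eighty degrees".split()

worker_captions = [
    "the solution was heated".split(),
    "solution heated to eighty degrees".split(),
    "the was heated to degrees".split(),
]

combined = set().union(*[set(words) for words in worker_captions])
covered = [w for w in spoken if w in combined]
missed  = [w for w in spoken if w not in combined]

print(f"coverage: {len(covered)}/{len(spoken)} words")
print("missed:", missed)        # 'non-aqueous' is missed by every worker
```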

Finally, I am considering combining human correction with ASR captioning. Humans have the strength of remembering previously mentioned knowledge, such as an abbreviation, yet they cannot type fast enough to cover all the content. ASR, on the other hand, usually does not miss any portion of the speech, yet it makes some unreasonable mistakes. Thus, it might be a good idea to let humans correct the inaccurate captions of ASR instead of trying to type all of the speech content.

Question:

  • What do you think of this end-to-end system? Can you evaluate it from different perspectives, such as expense and accuracy?
  • How would you solve the problem of inadequate speech coverage?
  • What do you think of the idea that combines human and ASR’s work together? Do you think it will be more efficient or less efficient?

Word Count: 517


03/04/20 – Nan LI – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary:

The main objective of this paper is to investigate the feasibility of using crowd workers to locate and assess sidewalk accessibility problems in Google Street View imagery. To achieve this goal, the authors conducted two studies examining the feasibility of finding and labeling sidewalk accessibility problems. The paper uses the results of the first study to establish that the labeling task is possible, to define what good labeling performance looks like, and to provide verified ground-truth labels that can be used to assess the performance of crowd workers. The paper then evaluates annotation correctness at two discrete levels of granularity: image level and pixel level. The former checks for the absence or presence of a label, and the latter examines labels in a more precise way related to image segmentation work in computer vision. Finally, the paper discusses quality control mechanisms, which include statistical filtering, an approach for revealing effective performance thresholds for eliminating poor-quality turkers, and a verification interface, a subjective approach to validating labels.

Reflection:

The most impressive point in this paper is the feasibility study, Study 1. This study not only investigates the feasibility of the labeling work but also provides a standard of good labeling performance and supplies validated ground-truth labels that can be used to evaluate crowd workers' performance. This pre-study provides the clues, the direction, and even the evaluation metrics for the later experiment, and it yields the most valuable information at the early stage of the research with very little workload and effort. I think it is a common research issue that we put a lot of effort into driving a project forward instead of preparing and investigating feasibility; as a result, we get stuck on problems we could have foreseen had we conducted a pre-study.

However, I don't think the pixel-level assessment is a good idea for this project, because the labeling task does not require such high accuracy for the inaccessible area, and marking it at the granularity of individual pixels is more precise than necessary. As the table of pixel-level agreement results indicates, the area overlap for both binary classification and multiclass classification is no more than 50%. Also, although the authors think even a 10-15% overlap agreement at the pixel level would be sufficient to localize problems in images, this makes me more confused about whether they want an accurate evaluation or not.
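
For context, here is a small sketch of a pixel-level overlap measure between two binary label masks (intersection of labeled pixels divided by their union). The tiny masks are toy arrays, and the paper's exact agreement formulation may differ from this.

```python
# Toy pixel-level agreement between two binary label masks: the overlap is
# the intersection of labeled pixels divided by their union. The 4x4 masks
# below are invented; the paper's exact agreement measure may differ.
import numpy as np

mask_a = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]], dtype=bool)

mask_b = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]], dtype=bool)

intersection = np.logical_and(mask_a, mask_b).sum()
union = np.logical_or(mask_a, mask_b).sum()
print(f"pixel overlap = {intersection / union:.2f}")   # 3 / 6 = 0.50
```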

Finally, considering our final project, it is worth thinking about the number of crowd workers we need for the task. We need to think about the accuracy of turkers per job. The paper makes the point that performance improves with turker count, but these gains diminish in magnitude as group size grows. Thus, we might want to figure out the trade-off between accuracy and cost so that we have a better idea of how many workers to hire.

Questions:

  • What do you think about the approach of this paper? Do you believe a pre-study is valuable? Would you apply this in your research?
  • What do you think about the metrics the authors used for evaluating labeling performance? What other metrics would you apply to assess the rate of overlap area?
  • Have you ever considered how many turkers you would need to hire to meet your accuracy needs for a task? How do you estimate this number?

Word Count: 578
