04/29/20 – Fanglan Chen – Accelerating Innovation Through Analogy Mining

Summary

Hope et al.’s paper “Accelerating Innovation Through Analogy Mining” studies how to boost knowledge discovery by searching for analogies in massive, unstructured real-world datasets. This research is motivated by the availability of large idea repositories that can be used as databases for finding analogous problems. However, it is very challenging to find useful analogies in massive and noisy real-world repositories. Manual and automated methods have their own advantages and disadvantages: hand-created databases have the rich relational structure that is central to analogy search but are expensive to build, while naive machine learning and information retrieval methods scale easily to large datasets using similarity metrics but fail to incorporate structural similarity. To address these challenges, the researchers explore the potential of learning “problem schemas,” which are simpler structural representations that specify the purpose and mechanism of a product. Their proposed approach leverages the power of crowdsourcing and recurrent neural networks to extract purpose and mechanism vector representations from product descriptions. The experimental results indicate that the learned vectors can support the search for analogies with higher precision and recall than traditional information retrieval methods.
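
To make the purpose/mechanism idea concrete, here is a minimal sketch (my own illustration, not the authors’ released code) of how an RNN encoder could map a tokenized product description to separate purpose and mechanism vectors and then rank candidates by purpose similarity. The model class, dimensions, and toy inputs are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PurposeMechanismEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Two heads project the final hidden state into purpose/mechanism spaces.
        self.purpose_head = nn.Linear(hidden_dim, hidden_dim)
        self.mechanism_head = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, token_ids):
        _, h = self.gru(self.embed(token_ids))   # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.purpose_head(h), self.mechanism_head(h)

# Toy usage: rank candidate products by purpose similarity to a query product.
encoder = PurposeMechanismEncoder(vocab_size=1000)
query = torch.randint(0, 1000, (1, 12))          # one tokenized description
candidates = torch.randint(0, 1000, (5, 12))     # five candidate descriptions
q_purpose, _ = encoder(query)
c_purpose, _ = encoder(candidates)
scores = F.cosine_similarity(q_purpose, c_purpose)  # higher = closer purpose
print(scores.argsort(descending=True))
```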

Reflection

This paper introduces an innovative approach for analogy search, a task very similar to the one in the “SOLVENT” paper we discussed last week. The idea of “weaker structural representations” is very interesting, and it allows more flexibility in automatic algorithm design than relying on fully structured analogical reasoning. My impression is that the human component in the model design is comparatively weaker than in the approaches presented in other readings we discussed before. Crowd work in this paper is leveraged to generate training data and to evaluate the experimental results. I have been thinking about whether there is any place with the potential to incorporate human interaction into the model design. As we know, recurrent neural networks have certain limitations. The first is long sequences: RNN models struggle to capture long-range dependencies in long sequence data, which largely constrains their usage scenarios. The second is the sliding-window style of processing, which ignores the continuity between the sliced subsequences and therefore stays closer to surface features than the deep structure the paper claims. I am wondering whether there is a possibility to leverage human interaction to overcome these shortcomings of the model itself.

The performance of machine learning models is highly driven by the quality of the training data. Given the crowdsourced product purpose and mechanism annotations, I feel there is a need to incorporate some quality control components into the proposed framework. A few papers we discussed before touched upon this point. Also, though very powerful in numerous complex tasks, deep learning models are usually criticized for their lack of interpretability. The RNN performance reported in the paper, in terms of recall and precision, is better than that of traditional information retrieval methods. However, those similarity-based methods have their own merits: their mechanisms and decision boundaries are more transparent, so it is possible to detect where a problem is and why the results are not desirable. In this case, there is a trade-off between model performance (accuracy, precision, and recall) and interpretability, and it is worth thinking about which one to choose over the other.

Discussion

I think the following questions are worthy of further discussion.

  • How do you think more human components could be incorporated into the model design?
  • Compared with last week’s approach, SOLVENT, which one do you think works better? Why?
  • What are some other potential applications for this system outside of knowledge discovery?
  • Do you think the recurrent neural network method is better than traditional similarity-based methods such as TF-IDF in the analogy search task and other NLP tasks? Why or why not?


04/29/20 – Fanglan Chen – DiscoverySpace: Suggesting Actions in Complex Software

Summary

Fraser et al.’s paper “DiscoverySpace: Suggesting Actions in Complex Software” explores how action suggestions can be incorporated into a complex system to help beginners and inexperienced users navigate it and gain confidence while interacting with it. This research is motivated by the observation that novice users might face several problems when trying to use complex software: (1) new users may be unfamiliar with the vocabulary used in the application, making it difficult to find the desired features; (2) online tutorials might not suit their situation and goals; (3) several shortcuts are available in the software to accomplish the same task, but inexperienced users might get overwhelmed by the multiple approaches and have no idea how to locate the most efficient one; (4) users only get exposure to a small number of software features, which may limit their awareness of its potential power. To address the above issues, the researchers develop DiscoverySpace, a prototype action-suggestion extension for Adobe Photoshop that helps beginners get started without feeling overwhelmed.

Reflection

I think this paper conducted a very interesting study on how the proposed action recommendations can help novice users build confidence, accomplish tasks, and discover features. Probably many of us have had a similar experience: when we try to get started with complex software used by professionals in another domain, the learning curve is so steep that we feel frustrated and discouraged at the beginning. If there is no real need to use the software, or lightweight alternatives are available, we may give up on it. The proposed extension does a good job of presenting some basic features of the complex software and provides “one-click” operations for easy tasks, which can give new users a sense of achievement while interacting with the system and make them want to continue exploring other features.

As presented in the analysis and experimental results, the largest improvement from the proposed Adobe Photoshop extension is among users who have just started using the software. The evaluation is mostly based on their self-reported confidence. However, another important aspect is ignored in the process: how much the users learned along the way. That relates to what goal we want to achieve by using professional software. If users just expect to leverage the power of Photoshop for really simple tasks, there are plenty of mobile applications available that require only a few easy clicks. If the objective is just to beautify a selfie, the Instagram application has built-in operations and is very easy to use. As we know, Photoshop is used by professionals for image editing. If users would like to learn how to use the software and build up their skills over time, the current version of the proposed extension does not seem to be helpful. The proposed approach encapsulates a sequence of operations into a single click. There is no denying that it is very easy to use, but the lightweight operations may not contribute to long-term learning. I am a Photoshop user and use it frequently as needed. The current version of the proposed extension may not be very useful to me, but I feel there is a lot of potential to make it more powerful. Firstly, it would be very useful to have a dialog box that presents the back-end steps carried out within a one-click function. Knowing the basic sequence to achieve a simple task can help users build their knowledge and know what to do, or at least what to search for, when they need to achieve a similar task. Secondly, it would be helpful to have some interactive modules that let users adjust a number of parameters, such as brightness, contrast, and so forth. These are fundamentals for users who want to enhance their skill level and get experienced with Photoshop.

Discussion

I think the following questions are worthy of further discussion.

  • In what user scenarios do you think the proposed software extension would be most useful?
  • Do you think it is helpful to bundle a sequence of operations into one click, or is there a need to present the operations step by step to users?
  • Do you think this approach, with newly released extensions, can also assist experts in complex professional software?
  • Can you think of some examples of other applications, in the same or different domains, that could benefit from the proposed approach?


04/22/20 – Fanglan Chen – SOLVENT: A Mixed Initiative System for Finding Analogies Between Research Papers

Summary

Chan et al.’s paper “SOLVENT: A Mixed Initiative System for Finding Analogies Between Research Papers” explores the feasibility of a mixed-initiative system in which a collaborative human-AI team categorizes research papers into relational schemas, which can then be used to identify analogous research papers and potentially lead to innovative knowledge discoveries. The researchers are motivated by the boom in research papers during recent decades, which makes searching for relevant papers within one domain or across domains more and more difficult. To facilitate paper retrieval and the search for interdisciplinary analogies, the researchers develop a mixed-initiative system called SOLVENT in which humans mark the key aspects of research papers (their background, purpose, mechanism, and findings) and a machine learning model extracts semantic representations from these key aspects, which facilitates identifying analogies across different research domains.

Reflection

I think this paper conducted an innovative study on how the proposed system can support knowledge sharing and discovery within one domain and across different research communities. In this era of research explosion, researchers would greatly benefit from using such a system for their own research and for exploring more interdisciplinary possibilities. That makes me think about why the system can achieve good performance by annotating the content of abstracts in the domains where the authors conducted experiments. As we know, abstracts usually summarize the most important points of a research paper at a high level, so it is intuitive and wise to use that part for annotation and downstream tasks. The researchers adopt pre-trained word embedding models to generate semantic vector representations for each component, which perform pretty well in the tasks presented in the paper. I would imagine that the framework would work especially well for experimentation-driven domains, such as computer science, civil engineering, and biology, in which research papers follow a specific writing structure. Can the proposed framework scale up to less structured text materials, such as essays and novels, by extending it to full content instead of focusing on abstracts? I think that would be an interesting future direction to explore.
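
As a concrete illustration of the embedding-based matching described above, the following sketch (a hypothetical example, not SOLVENT’s implementation) averages pre-trained word vectors for the purpose and mechanism fields of two papers and scores a “similar purpose, different mechanism” analogy. The random `embeddings` dictionary and the toy field texts are stand-ins for a real pre-trained model (e.g., GloVe) and real annotations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a real pre-trained embedding model (e.g., GloVe vectors).
embeddings = {w: rng.normal(size=50) for w in
              "model spread disease rumor simulate networks social survey".split()}

def field_vector(tokens):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

paper_a = {"purpose": "model disease spread".split(),
           "mechanism": "simulate networks".split()}
paper_b = {"purpose": "model rumor spread".split(),
           "mechanism": "social survey".split()}

purpose_sim = cosine(field_vector(paper_a["purpose"]), field_vector(paper_b["purpose"]))
mechanism_sim = cosine(field_vector(paper_a["mechanism"]), field_vector(paper_b["mechanism"]))
# Analogous papers share a purpose but differ in mechanism.
analogy_score = purpose_sim - mechanism_sim
print(round(analogy_score, 3))
```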

In addition, one potential future work discussed in the paper is to extend the content-based approach with graph-based approaches such as citation networks. I feel this is a novel idea and there is a lot of potential in this direction. Since the proposed system has the ability to find analogies across various research areas, I would be curious to see whether it is possible to generate a knowledge graph based on the analogy pairs, creating something like a research road map that indicates how the ideas from different papers in various research areas relate at a larger scope. I would imagine researchers would benefit from such a systematized collection of research ideas.

Discussion

I think the following questions are worthy of further discussion.

  • Would you use this system to support your own research? Why or why not?
  • Do you think the annotation categories capture the majority of research papers? Can you think of other categories the paper did not mention?
  • What do you think of the researchers’ approach of annotating the abstracts? Would it be helpful to expand this work to annotate the full content of the papers?
  • Do you think the domains involved in cross-domain research share the same purpose and mechanism? Can you think of some possible examples?


04/22/20 – Fanglan Chen – The Knowledge Accelerator: Big Picture Thinking in Small Pieces

Summary

Hahn et al.’s paper “The Knowledge Accelerator: Big Picture Thinking in Small Pieces” uses a distributed information synthesis task as a probe to explore the opportunities and limitations of accomplishing big picture thinking by breaking it down into small pieces. Most traditional crowdsourcing work targets simple and independent tasks, but real-world tasks are usually complex and interdependent, and may require big picture thinking. A few current crowdsourcing approaches support the breaking down of complex tasks by depending on a small group of people to manage the big picture view and control the ultimate objective. This paper proposes that a computational system can automatically support big picture thinking purely through the small pieces of work conducted by individuals. The researchers implement distributed information synthesis in a prototype system and evaluate the output of the system on different topics to validate the viability, strengths, and weaknesses of their proposed approach.

Reflection

I think this paper introduces an innovative approach for knowledge collection, one that can potentially replace a group of intermediate moderators/reviewers with an automated system. The example task explored in the paper is to answer a given question by collecting information in a parallel way. That relates to the question of how the proposed system enhances answer quality by compiling the collected pieces of information into a structured article. To facilitate similar question-answer tasks, we already have a variety of online communities and platforms. Take Stack Overflow, for example: it is a site for enthusiast programmers to learn and share their programming knowledge. A large number of professional programmers answer questions on a voluntary basis, and usually a question receives several answers detailing different approaches, with the best solution on top marked with a green check. You can check other answers as well in case the one you have tried does not work for you. I think the variety of answers from different people sometimes increases the possibility that the problem can be solved; the proposed system somewhat reduces that kind of diversity in the answers. Also, since one informative article is the final output of the system for a given question, its quality matters, but it seems hard for the vote-then-edit pattern to ensure the quality of the final answer without any reviewers.

In addition, we need to be aware that much work in the real world can hardly be conducted via crowdsourcing because of the difficulty of decomposing tasks into small, independent units, and, more importantly, because the objective goes beyond accelerating computation or collecting complete information. For creative work such as writing a song, editing a film, or designing a product, the goal is more to encourage creativity and diversity. In those scenarios, even with a clear big picture in mind, it is very difficult to assemble the small pieces of work produced by a group of recruited crowd workers into a good piece of work. As a result, I think the proposed approach is limited to comparatively less creative tasks in which each piece can be decomposed and processed independently.

Discussion

I think the following questions are worthy of further discussion.

  • Do you think the proposed system can completely replace the role of moderators/reviewers in that big picture? What are the advantages and disadvantages?
  • This paper discusses the proposed system for the question-answer task. What other possible applications could the system be helpful for?
  • Can you think of any possible ways to improve the system so that it scales up to other domains, or even non-AI domains?
  • Are you considering the breaking-down approach in your course project? If yes, how would you like to approach it?


04/15/20 – Fanglan Chen – Algorithmic Accountability

Summary

Diakopoulos’s paper “Algorithmic Accountability” explores the broad question of how algorithms exert their potential power and why they are worthy of scrutiny by journalists, and studies the role of computational approaches like reverse engineering in articulating algorithmic transparency. Automated decision-making algorithms are now used throughout businesses and governments. Given that such algorithmically informed decisions have the potential for significant societal impact, the goal of this paper is to position algorithmic accountability reporting as a mechanism for articulating and elucidating the power structures, biases, and impacts that automated algorithms exercise in our society. Using reverse engineering methods, the researcher conducted five case studies of algorithmic accountability reporting, covering autocompletion, autocorrection, political email targeting, price discrimination, and executive stock trading plans. The applicability of transparency policies for algorithms is also discussed, along with the challenges of conducting algorithmic accountability reporting as a broadly viable investigative method.

Reflection

I think this paper touches upon an important research question about the accountability of computational artifacts. Our society currently relies on automated decision-making algorithms in many different areas, ranging from dynamic pricing to employment practices to criminal sentencing. It is important that developers, product managers, and company/government decision-makers are aware of the possible negative social impacts and the need for public accountability when they design or implement algorithmic systems.

This research also makes me think about whether we need to be that strict with every algorithmic system. I think that to answer this question, we need to consider different application scenarios, which are not fully discussed in the paper. Take the object detection problem in computer vision, for example, and consider two application scenarios: one is to detect whether there is a car in an image for automatic labeling, and the other is to check whether there is a tumor in a computed tomography scan for disease diagnosis. Apparently, the required level of algorithmic accountability is much higher in the second scenario. Hence, in my opinion, the accountability of algorithms needs to be discussed in the context of the application scenario, together with the user’s expectations and the potential consequences when the algorithms go wrong.

The topic of this research is algorithmic accountability. As far as I am concerned, accountability is a broad concept, including but not limited to an obligation to report, explain, and justify algorithmic decision-making as well as to mitigate any potential harms. However, I feel this paper mainly focuses on the transparency aspect of the problem with little discussion of other aspects. There is no denying that transparency is one way algorithms can be made accountable, but, as the paper puts it, “[t]ransparency is far from a complete solution to balancing algorithmic power.” I think other aspects such as responsibility, fairness, and accuracy are worthy of further exploration as well. Considering these aspects throughout the design, implementation, and release cycles of algorithmic system development would lead to a more socially responsible deployment of algorithms.

Discussion

I think the following questions are worthy of further discussion.

  • What aspects other than transparency do you think would be important in the big picture of algorithmic accountability?
  • Can you think of some domain applications where we would hardly let automated algorithms make decisions for humans?
  • Do you think transparency potentially leaves the algorithm open to manipulation and vulnerable to adversarial attacks? Why or why not?
  • Who should be responsible if algorithmic systems make mistakes or have undesired consequences?


04/15/20 – Fanglan Chen – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

Summary

Nguyen et al.’s paper “Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking” explores the use of automatic fact-checking, the task of assessing the veracity of claims, as an assistive technology to augment human decision making. Many previous papers propose automated fact-checking systems, but few of them consider the possibility of having humans as part of a human-AI partnership to complete the same task. By involving humans in fact-checking, the authors study how people understand, interact with, and establish trust in an AI fact-checking system. The authors introduce their design and evaluation of a mixed-initiative approach to fact-checking, blending human judgment and experience with the efficiency and scalability of machine learning and automated information retrieval. Their user study shows that crowd workers involved in the task tend to trust the proposed system: participant accuracy on claims improves when they are exposed to correct model predictions. But sometimes the trust is so strong that exposure to the model’s incorrect predictions reduces their accuracy on the task.

Reflection

Overall, I think this paper conducted an interesting study on how the proposed system actually influences humans’ assessment of the factuality of claims in the fact-checking task. However, the model transparency studied in this research is different from what I expected. When talking about model transparency, I expect an explanation of how the training data was collected, what variables are used to train the model, and how the model works in a stepwise process. In this paper, the approach to increasing the transparency of the proposed system is to release the source articles on which the model bases its true or false judgment of a given claim. The next step is letting the crowd workers in the system group go through each source article and see whether it makes sense and whether they agree or disagree with the system’s judgment. In this task, I feel a more important transparency problem is how the model retrieves the articles and how it ranks them in the way presented. Noise in the training data may introduce bias into the model, but there is little we can tell merely from checking the retrieved results. That makes me think that there might be different levels of transparency: at one level, we can check the input and output at each step, and at another level, we may get exposure to what attributes the model actually uses to make the prediction.

The authors conducted three experiments with a participant survey on how users understand, interact with, and establish trust in a fact-checking system, and on how the proposed system actually influences users’ assessment of the factuality of claims. The experiments compare a control group and a system group to show that the proposed system actually works. Firstly, I would like to know whether the randomly recruited workers in the two groups have demographic differences that could potentially affect the final results. Is there a better way to conduct such experiments? Secondly, the performance difference between the two groups with regard to human error is quite small, and there is no additional proof that the difference is statistically significant. Thirdly, the paper reports experimental results on only five claims, including one with incorrectly supportive articles (claim 3), which does not seem representative, and the task is somewhat misleading as a result. Would it be better to apply quality control to the claims in the task design?

Discussion

I think the following questions are worthy of further discussion.

  • Do you think the article sources presented by the system lead users to develop more trust in the system?
  • What are the reasons that, for some claims, the retrieval results of the proposed system degrade human performance in the fact-checking task?
  • Do you think there is any flaw in the experimentation design? Can you think of a way to improve it?
  • Do you think we need personalized results in this kind of task where the ground truth is provided? Why or why not?


04/08/20 – Fanglan Chen – CrowdScape: Interactively Visualizing User Behavior and Output

Summary

Rzeszotarski and Kittur’s paper “CrowdScape: Interactively Visualizing User Behavior and Output” explores the research question of how to unify different quality control approaches to enhance the quality of work conducted by crowd workers. With emerging crowdsourcing platforms, many tasks can be accomplished quickly by recruiting crowd workers to collaborate in parallel. However, quality control is among the challenges faced by the crowdsourcing paradigm. Previous works focus on designing algorithmic quality control approaches based on either worker outputs or worker behavior, but neither approach is effective for complex or creative tasks. To fill that research gap, the authors develop CrowdScape, a system that leverages interactive visualization and mixed-initiative machine learning to support human evaluation of complex crowd work. Through experimentation on a variety of tasks, the authors show that combining information about worker behavior with worker outputs can help users better understand the crowd and further identify reliable workers and outputs.

Reflection

This paper conducts an interesting study by exploring the relationship between the outputs and behavior patterns of crowd workers to achieve better quality control in complex and creative tasks. The proposed method provides a unique angle on quality control and has wide potential usage in crowdsourced tasks. Although there is a strong relationship between final outputs and behavior patterns, I feel the design of CrowdScape relies too heavily on the behavior analysis of crowd workers. In many situations, good behavior can lead to good results, but that cannot be guaranteed. From my understanding, behavior analysis is more suitable as a screening mechanism in certain tasks. For example, in a video tagging task, workers who have watched the whole video are more likely to provide accurate tags. In this case, behavior such as watching is more like a necessary condition than a sufficient condition: the group of workers who finish watching the videos may still disagree on the tagging output, and a different quality control mechanism is still needed in that round. In creative and open-ended tasks, the behavior patterns are even more difficult to capture. By analyzing behavior using metrics such as time spent on the task, we cannot directly connect the analysis with a measurement of creativity.
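
To make the “necessary but not sufficient” point concrete, here is a small hypothetical sketch (not CrowdScape’s actual pipeline) of a two-stage check for the video tagging example: a behavioral screen followed by output aggregation. The field names, threshold, and submissions are illustrative assumptions.

```python
from collections import Counter

submissions = [
    {"worker": "w1", "watched_fraction": 0.95, "tag": "dog"},
    {"worker": "w2", "watched_fraction": 0.30, "tag": "cat"},  # skipped most of the video
    {"worker": "w3", "watched_fraction": 0.90, "tag": "dog"},
    {"worker": "w4", "watched_fraction": 0.88, "tag": "fox"},
]

# Stage 1: behavior as a necessary condition -- screen out low-engagement work.
screened = [s for s in submissions if s["watched_fraction"] >= 0.8]

# Stage 2: behavior alone is not sufficient -- aggregate the remaining outputs.
tag_counts = Counter(s["tag"] for s in screened)
final_tag, votes = tag_counts.most_common(1)[0]
print(final_tag, votes)  # -> dog 2
```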

In addition, I think the notion of quality for a crowdsourced task discussed in the paper is comparatively narrow. We need to be aware that quality control on crowdsourcing platforms is multifaceted: it depends on the workers’ knowledge of the specific task, the quality of the processes that govern the creation of tasks, the recruiting of workers, and the coordination of subtasks such as reviewing intermediate outputs, aggregating individual contributions, and so forth. A more comprehensive quality control cycle needs to take the following aspects into consideration: (1) a quality model that clearly defines the dimensions and attributes for controlling quality in crowdsourcing tasks; (2) assessment metrics that can be used to measure the values of the attributes identified by the quality model; and (3) quality assurance, which requires a set of actions that aim to achieve expected levels of quality. To prevent low quality, it is important to understand how to design for quality and how to intervene if quality drops below expectations on crowdsourcing platforms.

Discussion

I think the following questions are worthy of further discussion.

  • Can you think of some other tasks that may benefit from the proposed quality control method?
  • Do you think the proposed method can provide good quality control for the complex and creative tasks the paper suggests? Why or why not?
  • Do you think an analysis based on worker behavior can help determine the quality of work conducted by crowd workers? Why or why not?
  • In what scenarios do you think it would be more useful to trace worker behavior: informing workers beforehand or tracing without advance notice? Can you think of some potential ethical issues?


04/08/20 – Fanglan Chen – The State of the Art in Integrating Machine Learning into Visual Analytics

Summary

Endert et al.’s paper “The State of the Art in Integrating Machine Learning into Visual Analytics” surveys recent state-of-the-art models that integrate machine learning into visual analytics and highlights the advances achieved at the intersection of the two fields. In the data-driven era, how to make sense of data and how to facilitate a wider understanding of data attract the interest of researchers in various domains. It is challenging to discover knowledge from data while delivering reliable and interpretable results. Previous studies suggest that machine learning and visual analytics have complementary strengths and weaknesses, and many works explore the possibility of combining the two to develop interactive data visualizations that promote sensemaking and analytical reasoning. This paper presents a survey of the achievements of recent state-of-the-art models. It also provides a summary of opportunities and challenges for boosting the synergy between machine learning and visual analytics as future research directions.

Reflection

Overall, this paper presents a thorough survey of the progress that has been made by highlighting and synthesizing selected research advances. The recent advances in deep learning bring new challenges and opportunities to the intersection of machine learning and visual analytics. We need to be aware that the design of a highly accurate and efficient deep learning model is an iterative and progressive process of training, evaluation, and refinement, which typically relies on a time-consuming trial-and-error procedure where the parameters and model structures are adjusted based on user expertise. Visualization researchers are making initial attempts to visually illustrate intuitive model behaviors and debug the training processes of widely used deep learning models such as CNNs and RNNs. However, little effort has gone into tightly integrating state-of-the-art deep learning models with interactive visualizations to maximize the value of both. There is great potential in integrating deep learning into visual analytics for a better understanding of current practices.

As we know, the training of deep learning models requires a lot of data, but well-labeled data is often very expensive to obtain. Injecting a small number of user inputs into the models through a visual analytics system can potentially alleviate this problem. In real-world applications, a method is impractical if each specific task requires its own separate large-scale collection of training examples. To close the gap between academic research outputs and real-world requirements, it is necessary to reduce the sizes of required training sets by leveraging prior knowledge obtained from previously trained models in similar categories, as well as from domain experts. Few-shot learning and zero-shot learning are two of the open problems in the current practice of training deep learning models; they offer a way to incorporate prior knowledge about objects into a “prior” probability density function. Otherwise, models trained on given data and labels can usually solve only the pre-defined problems for which they were originally trained.
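
A minimal sketch of the “reuse prior knowledge to shrink the training set” idea discussed above, assuming a stand-in frozen backbone and a tiny labeled set; this illustrates plain transfer learning under my own assumptions, not any specific system from the survey.

```python
import torch
import torch.nn as nn

# Stand-in for a backbone trained on a related task; in practice it would be
# loaded from a previously trained model rather than initialized here.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
for p in backbone.parameters():
    p.requires_grad = False          # reuse prior knowledge; do not retrain it

head = nn.Linear(16, 3)              # only this part is fit on the small new dataset
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x_small = torch.randn(12, 32)        # e.g., a handful of labeled examples
y_small = torch.randint(0, 3, (12,))

for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(head(backbone(x_small)), y_small)
    loss.backward()
    optimizer.step()
print(loss.item())
```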

Discussion

I think the following questions are worthy of further discussion.

  • What other challenges or opportunities can you think of for a framework that incorporates machine learning and visual analytics?
  • How to best leverage the advantages of machine learning and visual analytics in a complementary way? 
  • Do you plan to utilize a framework to incorporate machine learning and visual analytics in your course project? If yes, how do you plan to approach it?
  • Are there any applications we use in daily life that you can think of as good examples of integrating machine learning into visual analytics?


03/25/20 – Fanglan Chen – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

Chattopadhyay et al.’s paper “Evaluating Visual Conversational Agents via Cooperative Human-AI Games” explores the research question of how progress in AI-AI evaluation translates to the performance of human-AI teams. To effectively deal with real-world problems, an AI system faces the challenge of adapting its performance to humans. Existing works measure progress in AI in isolation, without a human-in-the-loop design. Taking visual conversational agents as an example, recent work typically evaluates how well agent pairs perform on goal-based conversational tasks rather than on response retrieval from fixed dialogs. The researchers propose to evaluate the performance of AI agents in an interactive setting and design a human computation game, named GuessWhich, to continuously engage humans with agents in a cooperative way. By evaluating the collaborative success between humans and two versions of the AI agent (trained with self-supervised learning and with reinforcement learning), their experiments find no significant difference in performance between the two versions when paired with human partners, which suggests a disconnect between AI-AI and human-AI evaluations.

Reflection

This paper conducts an interesting study by developing a human computation game to benchmark the performance of visual conversational agents as members of human-AI teams. Nowadays, we increasingly interact with intelligent and highly communicative technologies throughout our daily lives. For example, companies automate communication with their customers to make purchases more efficient and streamlined. However, it is difficult to define what success looks like in this case. Do the dialog bots really bring convenience, or are companies putting up a barrier between themselves and their customers? Even though this paper proposes a method to evaluate the performance of AI agents in an interactive setting, it does not discuss how to generalize the evaluation to other communication-related tasks.

In addition, I think the task design is worthy of further discussion. The paper uses the number of guesses the human needs to identify the secret image as an indicator of human-AI team performance. When playing the GuessWhich game in human-human teams, how to strategize the questions seems to be an important component of the game, but the paper does not give much consideration to question strategies. Would it be helpful if some kind of guideline on communicating with machines were provided to the crowd workers? Is it possible that some of the questions are clear to humans but ambiguous to machines? Based on the experimental results, the majority of the questions are binary, which are comparatively easier for the conversational agents to answer. One reason behind this, I think, is the information about the secret picture that is given up front. Take the case presented in the paper as an example: the basic description of the picture is given as “A man sitting on a couch with a laptop.” If we look at the picture choices, we can observe that few of them include all of these components. In other words, the basic information provided in the first round makes the secret picture not that “secret” anymore, and the given description is enough to narrow the choices down to two or three candidates. In this scenario, the role the visual conversational agent plays in the human-AI team is minimized and difficult to evaluate precisely.

Discussion

I think the following questions are worthy of further discussion.

  • Is it possible to generalize the evaluation method proposed in the paper to other communication-related tasks? Why or why not?
  • In the literature, AI agents fine-tuned with reinforcement learning have been found to perform better than their self-supervised learning counterparts. This paper finds that the accuracy of the two versions shows no significant difference when evaluated via a human-ALICE team. What reasons can you think of to explain this?
  • Do you think there are any improvements that can be made to the experimental design?
  • What do you think are the challenges that come with human-in-the-loop evaluations?


03/25/20 – Fanglan Chen – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Summary

Huang et al.’s paper “Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time” explores a novel crowd-powered system architecture that supports gradual automation. The motivation for the research is that crowd-powered conversational assistants have been shown to achieve better performance than automated systems, but they cannot be widely adopted because of their monetary cost and response latency. Inspired by the idea of combining crowd-powered and automatic approaches, the researchers develop Evorus, a crowd-powered conversational assistant. Powered by three major components (learning to choose chatbots, reusing prior responses, and automatic voting), the assistant can take on more scenarios automatically over time as new chatbots are added. The experimental results show that Evorus can evolve without compromising conversational quality. The proposed framework contributes to the research direction of how automation can be introduced efficiently in a deployed system.

Reflection

Overall, this paper proposes an interesting gradual-automation approach for empowering conversational assistants. One selling point of the paper is that users can converse with the proposed Evorus in open domains instead of limited ones. To achieve that goal, the researchers design the learning framework to assign a slightly higher likelihood to newly added chatbots so they can collect more data. I would imagine domain-specific data collection requires a large amount of time, and sufficient data in each domain seems important to ensure the quality of open-domain conversation. Similar to the cold-start problem in recommender systems, the data collected for different domains is likely imbalanced; for example, certain domains may gain no or little data during the general data collection process. It is unclear how the proposed framework can deal with this problem. One direction I can think of is to utilize machine learning techniques such as zero-shot learning (for domains that do not appear in prior conversations) and few-shot learning (for domains rarely discussed in prior conversations) to deal with the imbalanced data collected by the chatbot selector.
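
The “slightly higher likelihood for newly added chatbots” idea can be sketched as a weighted selection with an exploration bonus that decays as a bot accumulates data. This is my own hypothetical illustration of the mechanism, with made-up bots, counts, and bonus values, not Evorus’s actual selector.

```python
import random

chatbots = {
    "weather_bot":    {"accepted": 80, "shown": 100},
    "chitchat_bot":   {"accepted": 50, "shown": 100},
    "new_travel_bot": {"accepted": 1,  "shown": 2},   # newly added, little data so far
}

def selection_weight(stats, bonus=0.5):
    acceptance_rate = stats["accepted"] / max(stats["shown"], 1)
    exploration_bonus = bonus / (1 + stats["shown"])  # shrinks as more data arrives
    return acceptance_rate + exploration_bonus

weights = [selection_weight(s) for s in chatbots.values()]
chosen = random.choices(list(chatbots), weights=weights, k=1)[0]
print(chosen, [round(w, 3) for w in weights])
```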

For the second component, the reuse of prior answers seems a good way to reduce the system’s computational cost. However, text retrieval can be very challenging. Take lexical ambiguity as an example: polysemous words can hinder the accuracy of the retrieved results because different contexts are mixed among the corpus instances in which a polysemous word occurs. If the retrieval component cannot handle lexical ambiguity well, the reuse of prior answers may surface irrelevant responses to user conversations, which could introduce errors into the results.

In the design of the third component, both workers and the vote bot can upvote suggested responses, and sufficient vote weight must be collected before Evorus accepts a response candidate and sends it to the user. Depending on the threshold for sufficient vote weight, the latency could be long. In the design of user-centric applications, it is important to keep latency and runtime in mind. I feel the paper would be more persuasive if it provided supporting experiments on the latency of the proposed system.
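
For the voting component, the acceptance logic I have in mind looks roughly like the sketch below: upvotes from workers and the vote bot accumulate weight until a threshold is cleared. The threshold and weights here are assumptions for illustration, not the deployed Evorus parameters; the point is simply that latency depends on how quickly enough weight arrives.

```python
# Illustrative threshold-based acceptance (assumed parameters, not Evorus's settings).
ACCEPT_THRESHOLD = 2.0
votes = [("worker", 1.0), ("worker", 1.0), ("vote_bot", 0.5)]

total = 0.0
accepted = False
for source, weight in votes:
    total += weight
    if total >= ACCEPT_THRESHOLD:   # enough weight collected -> send the response
        accepted = True
        break
print(accepted, total)
```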

Discussion

I think the following questions are worthy of further discussion.

  • Do you think it is important to ensure users can converse with the conversational agents in open domains in all scenarios? Why or why not?
  • What improvements do you think the researchers could make to the Evorus learning framework?
  • At what point do you think the conversational agent can become purely automatic, or is it better to always have a human-in-the-loop component?
  • Are you considering the gradual automation learning framework for your project? If yes, how are you going to implement it?
