4/29/2020 – Nan LI – DiscoverySpace: Suggesting Actions in Complex Software

Summary

This paper introduces an interface, DiscoverySpace, which provides task-level action recommendations. Its main objective is to help novices gain confidence when using sophisticated software. To achieve this goal, the author built the DiscoverySpace prototype as an extension panel for Adobe Photoshop. The prototype lets users explore the software's functionality by offering refinement or more radical action suggestions based on a classification of the current task or a natural-language search. The author conducted an experiment comparing participants in the DiscoverySpace condition, who used Photoshop with the panel, against participants in a control condition, who used Photoshop without it. The results indicate that beginners tended to gain confidence in the DiscoverySpace condition while losing confidence in the control condition, whereas there was no significant difference for participants who already had Photoshop expertise. Nevertheless, most participants indicated that this kind of interface is most useful for quickly exploring complicated software.

Reflection

Nowadays, when it comes to editing images, I believe the majority of people use a "one-click" retouching application on their phone, such as Meitu, Ulike, or VSCO. These tools offer a variety of retouching effects, give users powerful picture-editing capabilities, and require no particular expertise. I think this is what users expect from the paper's idea of "item-based" collaborative-filtering-style recommendations. From my perspective, what the prototype tries to achieve is to encapsulate a complicated process and present users with the most straightforward effect based on the image they selected.
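To make that concrete, here is a minimal sketch of what item-based collaborative filtering over editing actions could look like. The usage data and action names are invented, and this is not DiscoverySpace's actual recommender; it only illustrates the "people who applied this action also applied that one" idea.

```python
import math

# Hypothetical history: which actions each user has applied to their photos.
usage = {
    "user1": {"auto_tone", "red_eye_fix", "vintage_filter"},
    "user2": {"auto_tone", "red_eye_fix"},
    "user3": {"vintage_filter", "vignette"},
}

def item_similarity(a, b):
    """Cosine similarity between two actions based on co-usage across users."""
    users_a = {u for u, acts in usage.items() if a in acts}
    users_b = {u for u, acts in usage.items() if b in acts}
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

def recommend(current_actions, k=2):
    """Suggest actions most similar to what the user has already applied."""
    all_actions = set().union(*usage.values()) - set(current_actions)
    scores = {a: sum(item_similarity(a, c) for c in current_actions)
              for a in all_actions}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend({"auto_tone"}))  # e.g. ['red_eye_fix', 'vintage_filter']
```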

I agree with the assumption the author puts forward at the beginning, that "complex software offers power for experts, yet overwhelms new users"; my first experience of using Photoshop was just like what the author described. Thus, I think the author's work would improve user experience and confidence significantly. I wish I had had such an extension panel when I first used Photoshop; I might still be using it now.

Besides, I think the target users of Photoshop are people with higher image-editing requirements rather than people who just want to post their selfies to Instagram. Thus, I would expect the system to aim at helping novices get familiar with the software faster and at facilitating exploration.

However, I do not think automatic image analysis would help much. One benefit of letting users select the features of an image is that the system can give advice tailored to those user-selected features. For example, a user might perceive an image of a sunset, the seaside, and a back view of themselves as a landscape picture, while automatic image analysis would probably classify it as a portrait. Therefore, letting users enter the features of their pictures would help the system identify the aspects they want to emphasize.

Questions

  1. Even though the author just takes Photoshop as an example, I am still curious whether people without professional needs still use Photoshop frequently nowadays, given the emergence of so many "one-click" retouching apps. Do you agree with the assumption made at the beginning of the paper?
  2. Do you think it is helpful to encapsulate a sequence of operations that users can apply in one click while also demonstrating those operations to them?
  3. When would you prefer to use DiscoverySpace instead of "one-click" retouching apps?
  4. What kind of software (or what specific software) would you most like to see implement an idea similar to DiscoverySpace? Here the idea refers to a system that provides users with aggregated action suggestions based on their current tasks.

Word Count: 602


04/29/2020 – Nan LI – VisiBlends: A Flexible Workflow for Visual Blends

Summary

This paper addresses an advanced graphic design technique, visual blends, which combines two objects or concepts in a novel and meaningful way to convey a message symbolically. To support it, the author presents VisiBlends, a flexible hybrid system that facilitates the generation of visual blends through an iterative design process. The author first introduces and defines the problem of visual blends and then decomposes the process of creating them into sub-tasks. The backbone of this iterative design process is to let users first brainstorm about the concepts and then find relevant types of images. The users then annotate the images so that the system can automatically detect which images to blend. Finally, users evaluate each blend and decide whether or not to iterate the process. To find out whether the system could support both decentralized collaboration and co-located teams in generating visual blends, and whether it would help novices create blends efficiently, the author conducted three user studies. The results indicate that both decentralized and co-located groups can generate visual blends that express their messages efficiently, and that VisiBlends indeed helps novices generate visual blends.
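A minimal sketch of the automatic blend-detection step as I understand it follows. The matching rule and the annotations below are simplified guesses for illustration, not the paper's actual algorithm: the idea is simply that annotated images whose basic shapes agree become candidate pairs for blending.

```python
# Hypothetical annotations: each image gets a basic shape and whether the
# object fills that shape entirely ("fill") or only partially.
concept_a = [  # e.g., images for "coffee"
    {"file": "coffee_cup.png", "shape": "cylinder", "coverage": "fill"},
    {"file": "coffee_beans.png", "shape": "blob", "coverage": "partial"},
]
concept_b = [  # e.g., images for "new york"
    {"file": "liberty_torch.png", "shape": "cylinder", "coverage": "partial"},
    {"file": "taxi.png", "shape": "box", "coverage": "fill"},
]

def candidate_blends(images_a, images_b):
    """Pair images whose annotated shapes match; at least one should fill it."""
    for a in images_a:
        for b in images_b:
            if a["shape"] == b["shape"] and "fill" in (a["coverage"], b["coverage"]):
                yield a["file"], b["file"]

print(list(candidate_blends(concept_a, concept_b)))
```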

Reflection

I really like this paper, and I even want to try the system. Creating something novel out of thin air is always hard. Therefore, people are continually looking for tools that can stimulate creativity and brainstorming, hoping that these tools can inspire us. The paper we discussed last week, which presents a tool to help find analogies between papers, is trying to do the same thing. This really shows how essential creative inspiration is.

It was really enjoyable to see the study process, especially Study 2, group collaboration on blends for messages. The participants ran into many constraints, but they also handled them cleverly, for example by focusing on images of the other concept to increase the chance of finding a blend when images of one concept are limited. I particularly like the women + CS example: workers were trying to avoid gender stereotypes, even though it is tough to come up with a creative way to do so. Thus, the author concludes that it is hard to meet all the constraints, and we have to decide where to compromise.

The human visual system inspires me with a way to create a database of visual patterns. Since humans tend to recognize an object based on its 3D shape, silhouette, depth, color, and details, we could let a group of people identify a blurred shape that contains certain features but is not clear enough to reveal the actual object. Then, based on participants' responses, we could capture how people perceive the metaphor carried by that shape.

For the third study, the author points out an interesting phenomenon: participants who saw VisiBlends first and then had it removed performed much worse than participants who never saw VisiBlends at all. This reminds me of a participant in one of the previous papers who said they were afraid of being spoiled by the automatic system and ending up no longer thinking actively. I think this might be one such case.

Questions

  1. Do you think this system would help you generate a visual blend? Would you use the system to help with your design?
  2. The paper mentions that sometimes we need to compromise to achieve our goals. What do you think about this perspective? Can you think of examples of this situation?
  3. What is the essential tool for you when you are trying to brainstorm?
  4. For the iterative design process described in the paper, which part is the most significant for you? Which part do you think could be replaced by automation?


04/22/2020 – Nan LI – Opportunities for Automating Email Processing: A Need-Finding Study

Summary:

The main objective of this paper is to investigate users' needs regarding email management automation. To achieve this, the author conducted a mixed-methods need-finding study built around three probes. First, the author determined the categories of email automation needs through a workshop and then conducted a more extensive survey to deepen the understanding of the identified needs; the paper lists the primary needs identified in the workshop. Second, they investigated existing email automation software to see which demands have already been addressed, and they list eight significant functions of email scripts found on GitHub. Finally, they experimented with a programmable email system, YouPS, which allows users to customize email management automation using simple programmatic rules. The experiment lasted a week, during which the authors observed users' interactions with the system. The paper closes by discussing the limitations of current email clients and future work.

Reflection:

I think this is an essential topic given the large role email plays in daily study and life. Actually, I did not realize that email could play such an essential role in daily life before I came to America, because where I come from, people prefer instant messaging software, especially for private chats or group discussions. Over this year, I have gradually become accustomed to using email, and I have developed many habits that I had not noticed until this article identified them. For example, I mark a read email as "unread" if it contains important information. Even though email clients have a "flag" function, I still ignore the emails that I flag; in contrast, marking as unread is the best way to remind me that there is something important I need to deal with as soon as possible. Therefore, while reading this article, my reaction most of the time was: yes, this is just what I want; or, it would be wonderful if this demand could be met.

On the other hand, several of the identified demands have already been met. For example, we can reference or quote another email when composing, and aggregate responses from the same sender into a poll. Besides, I think email modes have also been implemented already (I have received automatic replies from faculty at our university when they are on vacation). These features make email more robust.

Regarding the third probe, there is an obvious limitation, which is also mentioned in the paper: the lack of non-programmer tools for automating email. However, before creating GUIs, I think the more important thing is to figure out whether people would actually use these email rules if we implemented them. For example, email clients have a "flag" function that makes vital emails stand out, and users can even choose a different color to distinguish an email. Nevertheless, I still prefer to mark an email as "unread" when I really need to deal with it soon. Therefore, it is worth thinking about how to implement these email rules so as to maximize utilization and convenience.
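As a minimal sketch of the kind of rule I have in mind, here is how my "mark important mail as unread" habit could be automated with Python's standard imaplib module. This is not the actual YouPS API; the server, account, and sender addresses are placeholders.

```python
import imaplib

# Hypothetical account details; replace with real values.
HOST = "imap.example.edu"
USER = "nan@example.edu"
PASSWORD = "app-specific-password"
IMPORTANT_SENDER = "advisor@example.edu"

with imaplib.IMAP4_SSL(HOST) as mail:
    mail.login(USER, PASSWORD)
    mail.select("INBOX")
    # Find already-read messages from the important sender.
    _, data = mail.search(None, f'(SEEN FROM "{IMPORTANT_SENDER}")')
    for num in data[0].split():
        # Remove the \Seen flag so the message shows up as unread again.
        mail.store(num, "-FLAGS", "\\Seen")
```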

Question:

  1. What are your particular needs for email automation? Which of the needs identified in the paper best fit yours? Do you have any other needs that were not mentioned in the paper?
  2. What do you think about the approach used to investigate users' needs for email automation in the first probe? It seems that this method only lets users brainstorm, and there were only 13 participants with an uneven gender distribution; do you think it works well?
  3. Many of the functions identified in the paper have actually already been achieved, for example referencing or quoting prior emails and aggregating a group of responses together with the initial request. Do you use these features? What do you think of them?

Word count: 630


04/22/2020 – Nan LI – SOLVENT: A Mixed Initiative System for Finding Analogies between Research Papers

Summary

This paper introduces a mixed-initiative system called SOLVENT, which aims to find analogies between research papers in different fields by combining human annotations of the critical features of a paper with a computational model that constructs a semantic vector from those annotations. The author conducted three studies to demonstrate the system's performance and efficiency. In the first study, people with specialized domain knowledge did the annotation work, and the system proved able to find known analogies. To demonstrate the effectiveness of SOLVENT at finding analogies in a real-world case, the author walked through a real-world scenario, explained the primary process, and had professionals without domain knowledge do the annotation; the results indicate that the relevant matches found by the system were judged to be both useful and novel. The third study showed that the system can be scaled up by having crowd workers annotate papers.
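A minimal sketch of the core matching step follows, assuming TF-IDF vectors over annotated "purpose" spans stand in for the paper's actual semantic model. The example papers, annotations, and query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical "purpose" annotations extracted by human annotators.
purpose_annotations = {
    "paper_A": "keep a liquid evenly distributed over a surface",
    "paper_B": "spread fertilizer uniformly across a field",
    "paper_C": "classify product reviews by sentiment",
}

query = "distribute coating material evenly on a substrate"

vectorizer = TfidfVectorizer()
corpus = list(purpose_annotations.values()) + [query]
vectors = vectorizer.fit_transform(corpus)

# Rank candidate papers by similarity of their purpose vector to the query.
scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
for (paper, _), score in sorted(zip(purpose_annotations.items(), scores),
                                key=lambda x: -x[1]):
    print(f"{paper}: {score:.2f}")
```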

Reflection

I think this paper brings up a novel and much-needed idea. One limitation of searching for related work online is that the scope of the results is too narrow if we search by keyword or citation: the results usually only include the related paper, papers that cite it, or the same paper published in a different venue. However, you can often find inspiration in a paper that is not directly relevant to what you are looking for, yet this kind of serendipity is usually hard to attain. Take our project as an example. It was inspired by one of the papers we read before, and we would like to improve on that paper's work further. It should have been straightforward to find related work because we already had previous work as a starting point. Disappointingly, we could not find many useful papers or even potentially relevant techniques; the result that appeared most frequently was the very paper that inspired us, just published in a different venue. Thus, from my perspective, this system is designed for finding inspiration through analogies, and if it can achieve this, it would be significant.

On the other hand, this seems like a costly approach because it requires a large number of workers to annotate a large corpus of papers in order to guarantee the system's performance. Besides, based on the results provided in the paper, the system's performance can only be described as "useful" rather than "efficient." If I urgently needed inspiration, I might try such a system, but I would not count on it.

Question

  1. What do you think of the idea presented in the paper that "scientific discoveries are often driven by finding analogies in distant domains"? Do you think this theory applies to the majority of people? Do you think finding analogies would inspire your work?
  2. What do you think about the system's "usefulness" and "efficiency"?
  3. Can you think of any other ways to utilize the SOLVENT system?
  4. The paper mentions that crowd workers can do the annotation work with no domain knowledge or even no experience reading research papers. Do you think this will influence the system's performance? What would your criteria be for recruiting workers?

Word Count: 538


04/15/2020 – Nan LI – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

Summary

This paper introduces a mixed-initiative model that allows humans and machines to check the authenticity of a claim cooperatively. The fact-checking model also follows existing interface design principles that support understandability and actionability. The main objectives of the design are transparency, support for integrating user knowledge, and explanation of system uncertainty. To prioritize transparency over raw predictive performance, the author used more transparent prediction models, linear models instead of deep neural networks. Further, users are allowed to change the source reputations and stances that feed into the system's prediction. To evaluate how the system could help users assess the factuality of claims, the author conducted three user studies with MTurk workers. The results indicate that users might over-trust the system: the system's prediction helps users when a claim is predicted correctly, but it degrades human performance when the prediction is wrong due to biases implicit in the training data.
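A minimal sketch of the kind of transparent linear aggregation I understand the paper to be describing follows. The feature names, weights, and numbers are made up for illustration and are not the paper's actual model; the point is only that a linear score lets users see how adjusting a source's reputation changes the verdict.

```python
# Hypothetical per-source evidence for a claim: each source has a stance
# (+1 supports, -1 refutes) and a reputation weight the user can adjust.
evidence = [
    {"source": "site_a", "stance": +1, "reputation": 0.9},
    {"source": "site_b", "stance": -1, "reputation": 0.4},
    {"source": "site_c", "stance": +1, "reputation": 0.6},
]

def claim_score(evidence):
    """Linear combination of stance * reputation; the sign gives the verdict."""
    return sum(e["stance"] * e["reputation"] for e in evidence)

score = claim_score(evidence)
print("predicted:", "true" if score > 0 else "false", f"(score={score:+.2f})")

# Because the model is linear, a user who distrusts site_a can lower its
# reputation and immediately see how the overall verdict changes.
evidence[0]["reputation"] = 0.2
print("after user adjustment:", f"{claim_score(evidence):+.2f}")
```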

Reflection

I think the design of this approach is valuable because it does not blindly pursue prediction accuracy but also considers the transparency, understandability, and actionability of the interface. These attempts should improve the user experience, since users have more knowledge of how the system works and thus place more trust in it. On the other hand, this might be exactly why users over-trust the system, as indicated in the paper's experimental results. Still, I think the design of the approach is a worthwhile attempt.

However, I do not see how this system can really help users. Although the design is very user-friendly, it does not leverage human ability; instead, it merely allows humans to participate in the fact-checking process. Even though the process is reasonable and understandable for users, what it expects of them requires too much mental work, such as reading a lot of information, thinking, and reasoning. The process makes sense, but it is too burdensome.

Moreover, based on the figures in the paper, I do not think the system facilitates users in determining the authenticity of claims, and I believe the experimental results show this as well. Further, I found that the accuracy of users' judgments depends more on the type of claim: different claims show significantly different accuracy, and this effect is even larger than the effect of the system.

It is also interesting to see users' feedback after they completed the task. It seems one of the users shared my opinion about the amount of information that had to be read. The most striking feedback is that users would be confused if they had more options; I think this only happens when they are not sure about the correct answer and have the right to change the system's output. Finally, we can also see from the comments that when the system makes the same judgment as the user, the user becomes more sure of their answer, while if the system's prediction disagrees with the user, it seriously affects the accuracy of the user's judgment. This is understandable: when someone questions your decision, no matter how confident you are, you waver a little, let alone when that someone is a machine with 70% accuracy.

Questions:

  1. Do you think the system could really help humans detect the factuality of claims? Why or why not?
  2. When designing the model, the authors gave up a higher-accuracy prediction model in favor of linear models in order to achieve transparency. What do you think of this trade-off? Which is more important for your design: transparency or accuracy?
  3. What do you think of the interface design? Does it provide too much information to users? Do you like the design or not?

Word Count: 638


04/15/2020 – Nan LI – ALGORITHMIC ACCOUNTABILITY Journalistic investigation of computational power structures

Summary:

In this paper, the author first presents the forms of algorithmic power: prioritization, classification, association, and filtering. Based on this description, the author concludes that a significant number of people are influenced by algorithmic outcomes and argues that it is important to interpret the output of algorithms in the course of making higher-level decisions. Next, the author examines the possibilities and weaknesses of requiring algorithmic transparency and therefore introduces an alternative method: reverse engineering. In this work, journalists combined interviews, document reviews, and reverse-engineering analysis to shed light on how algorithms function. The paper presents five case studies of journalistic investigations and discusses the challenges and opportunities of doing algorithmic accountability work. The primary inquiry process includes identifying a newsworthy algorithm, sampling its input-output relationships to study correlations, and finally seeking a story. The author closes with a series of suggestions regarding transparency policy for algorithms.

Reflection:

First, algorithmic accountability is not a new topic nowadays. Algorithms have penetrated our lives and are applied in all walks of life, not only in entertainment, learning, and daily tools, but also in areas with significant impact on security, privacy, and even the distribution of social resources. People are starting to ask: can algorithms be trusted, and to what extent? I have also seen many examples of people guessing at and analyzing the internal structure of such black boxes, and I want to share one of them.

The reverse-engineering approach, especially the process of sampling an algorithm's input-output relationships to study correlations, reminds me of a news report that identified bias in an algorithm. That algorithm was designed for individual risk assessment, predicting the likelihood that each defendant will commit a future crime. Such scores have become increasingly common in courtrooms across the nation, but in 2014 it was charged that the risk scores might be injecting bias into the courts. The way people found the bias in the algorithm is essentially reverse engineering. Here are their findings in that report (I also put the link below):

  • The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
  • White defendants were mislabeled as low risk more often than black defendants.

Based on this outcome, it seems that reverse engineering is both essential and effective, and I think it is a better way to examine algorithmic accountability than transparency. As mentioned in the article, leaving aside the trade-secret problem, disclosing the source code of algorithms might be helpful for specialists, but it cannot improve the experience of ordinary users, since they may not be able to make meaningful choices given their lack of expertise. Thus, identifying the problems of algorithms, rather than focusing on the implementation process, is more effective in encouraging designers to perfect their algorithms.
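As a minimal sketch of this input-output auditing idea, one could sample cases with known outcomes and compare error rates across groups. The records below are invented placeholders, not ProPublica's data, and the analysis is deliberately simplified.

```python
from collections import defaultdict

# Hypothetical sampled input-output records: the algorithm's risk label
# plus the observed outcome (did the person actually reoffend?).
records = [
    {"group": "black", "predicted_high_risk": True,  "reoffended": False},
    {"group": "black", "predicted_high_risk": True,  "reoffended": True},
    {"group": "white", "predicted_high_risk": False, "reoffended": True},
    {"group": "white", "predicted_high_risk": True,  "reoffended": True},
    # ... many more sampled cases would be needed in practice ...
]

def false_positive_rate(rows):
    """Share of people who did NOT reoffend but were labeled high risk."""
    negatives = [r for r in rows if not r["reoffended"]]
    if not negatives:
        return float("nan")
    return sum(r["predicted_high_risk"] for r in negatives) / len(negatives)

by_group = defaultdict(list)
for r in records:
    by_group[r["group"]].append(r)

for group, rows in by_group.items():
    print(group, f"false positive rate = {false_positive_rate(rows):.2f}")
```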

Questions:

  1. Do you think algorithms are trustworthy? How much confidence do you have in an algorithm?
  2. What do you think about transparency? How transparent do you think algorithms should be?
  3. What do you think of reverse engineering? Does it work? Do you have any other examples of this approach?

Link: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Word Count: 544


04/08/2020 – Nan LI – CrowdScape: Interactively Visualizing User Behavior and Output

Summary:

This paper demonstrates a system called CrowdScape that supports humans in evaluating the quality of crowd work by presenting interactive visualizations of worker behavior and worker outputs, combined with mixed-initiative machine learning (ML). The paper argues that quality control for complex and creative crowd work based on either post-hoc outputs or behavioral traces alone is insufficient; therefore, the author proposes that we can gain new insight into crowd worker performance by combining behavioral observations with knowledge of worker output. CrowdScape visualizes each individual's traces, including mouse movements, keypresses, scrolling, focus shifts, and clicks, on an abstract visual timeline. The system also aggregates these features using a combination of 1-D and 2-D matrix scatter plots to show their distributions and enable dynamic exploration, and it explores worker output by recognizing patterns in workers' submissions. Finally, CrowdScape enables users to build mental models of tasks and worker behaviors and to use these models, together with majority votes or gold standards, to verify worker output. The author also presents four case studies to illustrate practical operation and demonstrate the effectiveness of the system.
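A minimal sketch of the feature-aggregation idea is below. The event names and numbers are invented, and this is not CrowdScape's real pipeline (which logs events in the browser); it only shows how per-worker behavioral features can be collected into a table and viewed as a scatter-plot matrix.

```python
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Hypothetical per-worker behavioral features aggregated from raw event logs.
features = pd.DataFrame(
    {
        "worker": ["w1", "w2", "w3", "w4"],
        "time_on_task_s": [310, 45, 280, 60],
        "keypresses": [420, 12, 380, 25],
        "scroll_events": [55, 3, 48, 5],
        "focus_shifts": [2, 9, 3, 11],
    }
).set_index("worker")

# A scatter-plot matrix (as in CrowdScape's aggregate view) makes clusters of
# low-effort workers stand out against workers who engaged with the task.
scatter_matrix(features, diagonal="hist", figsize=(6, 6))
plt.show()
```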

Reflection:

I think the author makes a great point in addressing the quality-control issue of crowdsourcing. Quality-control approaches are limited, and for most systems that use crowdsourcing as a component, quality is not even guaranteed. The most frequently used approach I have seen so far is the policy that a worker's pay is determined by the quality of their work; it is the most natural way to encourage workers to provide high-quality results. Another straightforward approach is to accept the similar or identical answers (such as tags or counts) provided by the majority of workers.

Nevertheless, the author proposes that we should consider quality control for more complex and creative work, because these types of tasks appear more and more often, yet no appropriate quality-control mechanism exists for them. I think such a mechanism is essential in order to make better use of crowdsourcing.

I believe the most significant advantage of CrowdScape is that it can be used flexibly depending on the type of task. From the scenarios and case studies presented in the paper, users can evaluate workers' output using different attributes and interactive visualization methods depending on the task. Further, the types of visualization are varied, and each of them can reveal differences and patterns in workers' behavior and their work. The system design is impressive; judging from the figures in the paper combined with the explanations, the interface is user-friendly.

My only concern is that as the number of workers increases, the points and lines in the visualization will become so dense that no pattern can be detected. Therefore, the system might need data-filtering tools or additional interaction techniques to deal with this problem.

Questions:

  1. What are the most commonly used quality-control approaches? What quality-control approach will you apply to your project?
  2. There are many kinds of HITs on MTurk. What types of work do you think require quality control, and what kinds do not?
  3. For information visualization, one of the challenges is dealing with a large amount of data. How should we deal with this problem in the CrowdScape system?

Word Count: 588


04/08/2020 – Nan LI – Agency plus automation: Designing artificial intelligence into interactive systems

Summary:

The main objective of this paper is to present the significance and benefits of integrating interactive systems with artificial intelligence. The author points out that the current focus of AI is largely on full automation, which can mislead users due to inappropriate assumptions or biased training data. As a result, users may rely excessively on computational advice, which may lead to the loss of critical engagement and domain expertise. To address this issue, the author proposes combining human agency and automation to enhance human abilities instead of replacing human work, aiming to increase human productivity while preserving people's sense of control and responsibility. To investigate the most effective way of integrating automated methods into user-centric interactive systems, the paper examines the strategy of designing shared representations of possible actions: representations that support computational reasoning around tasks so that people can view, select, modify, or reject algorithmic suggestions. The author then reviews three practical implementations that apply these principles: Wrangler, a domain-specific language (DSL) for data transformation; an interactive system for data visualization and exploration; and the predictive translation memory (PTM) project. Finally, the author discusses the design properties and user studies of these three projects.
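A minimal sketch of the "shared representation" idea as I understand it follows. The transform name and the suggestion logic are invented, not Wrangler's actual API: the point is that the system proposes an editable transform object rather than silently applying it, so the user can view, modify, or reject it.

```python
from dataclasses import dataclass, replace

@dataclass
class SplitTransform:
    """A candidate data transform, shown to the user instead of auto-applied."""
    column: str
    delimiter: str

    def apply(self, rows):
        out = []
        for r in rows:
            first, second = r[self.column].split(self.delimiter, 1)
            out.append({**r, f"{self.column}_1": first, f"{self.column}_2": second})
        return out

rows = [{"name": "Doe, Jane"}, {"name": "Roe, Richard"}]

# The system suggests a transform inferred from the data (hypothetical heuristic).
suggestion = SplitTransform(column="name", delimiter=",")
print("suggested:", suggestion)

# The user inspects the shared representation and tweaks a parameter
# before accepting, keeping a sense of control over the automation.
accepted = replace(suggestion, delimiter=", ")
print(accepted.apply(rows))
```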

Reflection:

I really like the idea that the primary goal of AI should be enhancing humans, not replacing them. I think the author makes a good point that people currently tend to focus too much on fully automated AI, and the claim that AI can replace humans is even more exaggerated. Humans indeed make mistakes in judgment due to insufficient information or cognitive biases. However, humans have irreplaceable creativity and in-depth understanding of professional domain knowledge, which AI cannot achieve even with dominant computing power and exhaustive data. Thus, I could not agree more that we "need well-thought-out interactions of humans and computers to solve our most pressing problems."

The reason I am interested in and passionate about HCI is its elegance: a subtle and simple design idea can have a huge impact on overall performance. We are not necessarily seeking to develop a brand-new interactive system or a fancy interface, but to design from a humble direction, making subtle and rational adjustments that improve the user experience without causing any interruption. Just like the example demonstrated in the paper: the spelling and grammar checking routines included within a word processor.

On the other hand, a piece of user feedback in the article caught my attention: "These related views are so good, but it's also spoiling that I start thinking less. I am not sure if that's really a good thing." This is really striking. For a long time, we have been pursuing ways to make human work easier and more efficient, using AI or interaction design to take over tedious steps, and later even letting machines learn adaptively in place of human thinking. However, this comment points to the question, also stated in the paper, of whether we should accept having the computer "think for us." This is indeed a problem we need to consider when designing.

Questions:

  1. What do you think about the opinion that AI should be "enhancing us, not replacing us"? Do you agree with the idea that we should integrate the AI and IA perspectives?
  2. Which do you support more: direct manipulation or interface agents? Why?
  3. Take the data visualization system as an example. Would you like the system to provide adaptive recommendations? Do you find them helpful or annoying? Do you think they spoil you and stop you from thinking on your own initiative?


03/18/2020 – Nan LI – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Summary:

The main objective of this paper is to address the monetary cost and response latency problems of crowd-powered conversational assistants. The approach developed in this paper is to combine crowd workers and automated components to achieve a high-quality, low-latency, and low-cost solution. Based on this idea, the paper presents a crowd-powered conversational assistant, Evorus, which can gradually automate itself over time by including candidate responses from chosen chatbots, learning to reuse prior responses, and reducing crowd oversight via an automatic voting system. To design and refine this flexible framework for open-domain dialog, the authors carried out two phases of public field deployment and testing with real users. The final goal is for the automatic components of the system to gradually take over from the crowd.

Reflection:

I think this paper presents several excellent points. First, crowd-powered systems have been widely employed because of their low monetary cost and high convenience. However, as the number of required crowd workers and tasks increases, the expenditure on workers gradually grows to a non-negligible amount. Besides, even though platforms exist that make it possible to hire crowd workers quickly, response latency is still not negligible. The authors of this paper recognized these deficiencies and tried to develop an approach to solve them.

Second, combining crowd workers and automated systems is a prevalent idea. The novelty of this paper lies in adding an automatic voting system to decide which response to send to the end user. This machine learning model enables high-quality responses while reducing crowd oversight. The increased error tolerance lets even an imperfect automated component contribute to the conversation without hurting quality; thus, the system can integrate more types of chatbots and explore a broader range of actions. Besides, thanks to the balance between "upvotes" and "downvotes," Evorus enables flexible and fluid collaboration between humans and chatbots.

Third, another novel attribute of this system is reusing prior responses. I think enabling Evorus to find answers to similar queries in prior conversations and suggest them as responses to new queries is a key approach that could eventually turn the partially crowd-powered system into a completely automatic one. This simulates how people learn from the past, and it is also what we do in everyday conversation. Thus, as the system participates in more conversations and memorizes more query-response pairs, it might build a comprehensive database that stores all types of conversational queries and responses; at that point, the system might ultimately become fully automatic. However, this database would need to be updated, and the system would need to become partially crowd-powered again once in a while, given the constant change in information and in the way people communicate.
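A minimal sketch of this response-reuse idea follows. The similarity measure, vote weights, and stored pairs are invented for illustration and are not Evorus's actual model; the sketch only shows how prior answers, discounted by crowd votes, could be retrieved for a new query.

```python
from difflib import SequenceMatcher

# Hypothetical memory of prior query-response pairs with crowd vote tallies.
memory = [
    {"query": "what's the weather in blacksburg today",
     "response": "It looks sunny and around 70F in Blacksburg.",
     "upvotes": 4, "downvotes": 0},
    {"query": "recommend a pizza place near campus",
     "response": "Benny's on Main Street is a popular choice.",
     "upvotes": 2, "downvotes": 1},
]

def score(new_query, item, up_w=1.0, down_w=1.5):
    """Text similarity adjusted by crowd votes (downvotes weigh more here)."""
    sim = SequenceMatcher(None, new_query.lower(), item["query"].lower()).ratio()
    vote_bonus = up_w * item["upvotes"] - down_w * item["downvotes"]
    return sim + 0.05 * vote_bonus

def suggest(new_query):
    best = max(memory, key=lambda item: score(new_query, item))
    return best["response"]

print(suggest("how is the weather in blacksburg"))
```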

Question:

  • What do you think of this system? Do you think it is possible that the system will rely only on automation one day?
  • What do you think about the voting system? Do you think it is a critical factor in enabling high-quality responses? What do you think about the design of different weights for "upvotes" and "downvotes"?
  • It is a prevalent idea nowadays to combine crowd workers and AI systems to achieve high accuracy or high quality. However, the authors expect the system to rely increasingly on automation. Can you see the benefit if this expectation is achieved?

Word Count: 567


03/18/20 – Nan LI – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary:

The main objective of this paper is to evaluate AI agents through interactive downstream tasks performed by human-AI teams. To achieve this goal, the authors designed a cooperative game, GuessWhich, that requires a human to identify a secret image among a pool of images by engaging in a dialog with an answerer bot (Alice). The image is known to Alice but unknown to the human, so the human needs to ask related questions and pick out the secret image based on Alice's answers. The paper presents two versions of Alice: Alice(SL), which is trained in a supervised manner on human dialogs about images, and Alice(RL), which is pre-trained with supervised learning and fine-tuned via reinforcement learning for the image-guessing task. The results indicate that the advantage Alice(RL) shows when evaluated with another AI does not carry over to evaluation with human partners. Therefore, the paper concludes that there is a disconnect between benchmarking AI in isolation and benchmarking it in the context of human-AI interaction.
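As a minimal sketch of how such a game-based evaluation could be scored, one might rank the pool images by how well they match the answers collected so far and track the rank of the secret image over rounds. The pool, the tag-based encoding of answers, and the metric below are invented placeholders, not the paper's actual setup.

```python
# Hypothetical pool of images described by tags, and answers collected so far.
pool = {
    "img1": {"dog", "outdoor", "grass"},
    "img2": {"cat", "indoor", "sofa"},
    "img3": {"dog", "indoor", "sofa"},
}
secret = "img3"
answers_so_far = [{"dog"}, {"dog", "indoor"}]  # tags implied by Alice's answers

def rank_of_secret(pool, answers, secret):
    """Rank images by overlap with the answer tags; lower rank is better."""
    evidence = set().union(*answers)
    ranked = sorted(pool, key=lambda img: -len(pool[img] & evidence))
    return ranked.index(secret) + 1

for round_idx in range(1, len(answers_so_far) + 1):
    print(f"after round {round_idx}: secret image ranked",
          rank_of_secret(pool, answers_so_far[:round_idx], secret))
```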

Reflection:

This paper reminds me of another paper we discussed before, "Updates in Human-AI Teams." Both concern the impact of human involvement on AI performance. I think this is a great topic and worth more attention, because, as the beginning of the paper says, as AI continues to advance, human-AI teams are inevitable. Many AI products are already widely used across society, for example in predictive policing, life insurance estimation, sentencing, and medicine, and all of them require humans and AI to cooperate. Therefore, we have already reached agreement that the development and improvement of AI should always consider the impact of human involvement.

The QBOT-ABOT teams mentioned in this paper follow a similar idea to GANs (generative adversarial networks): both train two systems against each other and let them provide feedback to each other to improve their performance. However, the author points out that it is unclear whether these QBOT and ABOT agents actually perform well when interacting with humans. This is an excellent point that we should always consider when designing an AI system. Measuring an AI system should never be done in isolation; we should also consider human mental models and how they affect team performance when humans and AI work cooperatively. A suitable human-in-the-loop evaluation may be a more valuable measurement for an AI system.

Question:

  1. When do you think we should measure performance with human involvement, and when should we not?
  2. What do you see as the main point of this paper? Why do the authors use visual conversational agents to argue that it is crucial to benchmark progress in AI in terms of how it translates into helping humans perform a particular task?
  3. At the end of the paper, the authors mention that humans perceive Alice(SL) and Alice(RL) as comparable on all metrics. Why do you think humans reach such a conclusion? Does it indicate that human involvement makes no difference between the two visual conversational agents?

Word Count: 537
