3/4/20 – Jooyoung Whang – Pull the Plug? Predicting If Computers or Humans Should Segment Images

In this paper, the authors attempt to appropriately distribute human and computer resources for segmenting foreground objects in images to achieve highly precise segmentations. They explain that the segmentation process consists of roughly segmenting the image (initialization) and then running a fine-grained refinement pass to produce the final result, and they study both steps. To decide where to allocate human effort, the authors propose an algorithm that scores the acquired segmentations by detecting highly jagged boundary edges, non-compact segmentations, segmentations located near the image edge, and the ratio of the segmentation area to the full image. The authors find that a mix of humans and computers performs better for image segmentation than using either one exclusively.
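Since the scoring idea is built from simple, visible mask properties, here is a minimal sketch (in Python/NumPy) of how such cues could be computed from a binary foreground mask. The feature names and formulas are my own illustration of the kinds of cues described, not the paper's implementation.

```python
import numpy as np

def segmentation_quality_features(mask: np.ndarray) -> dict:
    """Illustrative quality cues for a binary foreground mask (True = foreground).

    These mirror the kinds of cues the paper describes (jagged boundaries,
    non-compact shapes, masks hugging the image border, extreme area ratios),
    but the exact formulas are hypothetical.
    """
    mask = mask.astype(bool)
    h, w = mask.shape
    area = mask.sum()
    if area == 0:
        return {"area_ratio": 0.0, "compactness": 0.0,
                "border_fraction": 0.0, "perimeter": 0.0}

    # Rough perimeter: foreground pixels with at least one background 4-neighbor.
    padded = np.pad(mask, 1, constant_values=False)
    all_neighbors_fg = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                        padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = (mask & ~all_neighbors_fg).sum()

    # Compactness: 1.0 for a perfect disk, smaller for jagged or elongated shapes.
    compactness = 4 * np.pi * area / max(perimeter, 1) ** 2

    # How much of the foreground sits on the image border.
    border = np.zeros_like(mask)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    border_fraction = (mask & border).sum() / area

    return {
        "area_ratio": float(area / (h * w)),
        "compactness": float(compactness),
        "border_fraction": float(border_fraction),
        "perimeter": float(perimeter),
    }
```

A porcupine-like object would drive the compactness cue down even for a correct mask, which is exactly the concern raised below.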

I liked the authors' proposed algorithm for detecting when a segmentation fails. It was interesting to see that they focused on visible features and qualities that humans can see instead of relying on deep neural networks, whose internal workings are often hard to interpret. At the same time, I am a little concerned about whether the proposed visual features for failed segmentations are enough to generalize and scale to all kinds of images. For example, the authors note that failed segmentations often have highly jagged edges. What if the foreground object (or an animal, in this case) were a porcupine? The score would be fairly low even when an algorithm correctly segments the creature from the background. Of course, the paper reports that the method generalized well to everyday images and biomedical images, so my concern may be a trivial one.

As I am not experienced in the field of image segmentation, I wondered whether there are cases where an image contains more than one foreground object and only one of them is of interest to a researcher. From my limited knowledge of foreground and background separation, a graph search is done by treating the image as a graph of connected pixels to find pixels that stand out; it does not care about "objects of interest." It made me curious whether it is possible to supply additional semantic information in the process.
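One common workaround for supplying that kind of hint is to seed a graph-based segmenter with a user-drawn box or scribble. Below is a sketch using OpenCV's GrabCut, where a hypothetical bounding box around the object of interest (say, the cat) serves as the extra semantic information; this illustrates the general idea, not anything from the paper.

```python
import numpy as np
import cv2

def segment_object_of_interest(image_bgr: np.ndarray, rect: tuple) -> np.ndarray:
    """Segment only the object inside a user-chosen bounding box using GrabCut.

    `rect` is (x, y, width, height) around the object of interest. The box acts
    as the "semantic" hint: pixels outside it are treated as definite background,
    so a second foreground object (e.g., a car) is ignored.
    """
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state for background
    fgd_model = np.zeros((1, 65), np.float64)  # internal GMM state for foreground
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model, 5,
                cv2.GC_INIT_WITH_RECT)
    # Keep pixels labeled definite or probable foreground.
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)

# Usage with a hypothetical image and box:
# image = cv2.imread("scene_with_cat_and_car.jpg")
# cat_mask = segment_object_of_interest(image, (50, 80, 200, 180))
```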

The following are the questions that I had while reading the paper:

1. Do you think the qualities that PTP looks for are enough to measure the quality of segmented images? What other properties would a failed segmentation have? One property I can think of is that failed segmentations often have disjoint parts.

2. Can you think of some cases where PTP could fail? Would there be any case where a segmentation scores very low even though it was done correctly?

3. As I've written in my reflection, are there methods that allow segmentation algorithms to consider "interest" in an object? For example, if an image contained both a car and a cat in the foreground and the researcher was interested in the cat, would the algorithm be able to separate out only the cat?


2/26/20 – Jooyoung Whang – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

The paper presents research on fairness, explainable artificial intelligence (XAI), and changes in people's judgments. The authors introduce a preprocessing method that reduces dataset bias for known bias-inducing attributes. They also examine four ways of explaining classification results: Sensitivity, Input-Influence, Case, and Demographic explanations. Using different combinations of these configurations, AI classifications of the COMPAS data were presented to MTurk workers for feedback. As a result, the paper reports that case-based explanations were often seen as less fair than other explanation methods. The authors also found that sensitivity explanations are the most effective at addressing unfairness. Finally, the paper shows that an evaluator's prior position on machine learning heavily impacts his or her reaction to a classifier's output and explanations.

When I looked at the paper's sample sensitivity explanation, it gave me a strong impression that the system was racist. I think many others would have had a similar thought, especially if they do not know much about machine learning and regression. Because of this, it concerned me that some people might be pushed toward making the opposite decision from the one the AI made, as a repulsed reaction. This clearly adds another bias in the opposite direction. I believe an explanatory model should only give helpful information about the model instead of introducing bias. Thinking of a possible solution, the authors could have rephrased the same information in a different way. For example, instead of bluntly saying that the classifier would have made a different decision, the system could have reported the probability for each label. This provides the same information while adding less obvious bias. Another solution would be preprocessing the data so that it does not contain the bias in the first place, as the authors suggested.
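To make the contrast concrete, here is a small sketch of the two phrasings: the blunt counterfactual statement a sensitivity-style explanation makes, versus the probability report suggested above. The toy model, feature names, and label wording are hypothetical, not the paper's COMPAS classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
age = rng.normal(35, 10, n)
priors = rng.poisson(2, n).astype(float)
group = rng.integers(0, 2, n).astype(float)      # hypothetical sensitive attribute
X = np.column_stack([age, priors, group])
y = (priors + group + rng.normal(scale=1.0, size=n) > 3).astype(int)  # 1 = "high risk"
clf = LogisticRegression().fit(X, y)

def sensitivity_statement(x: np.ndarray, attr: int = 2) -> str:
    """Blunt counterfactual: flip the sensitive attribute and compare predictions."""
    flipped = x.copy()
    flipped[attr] = 1 - flipped[attr]
    same = clf.predict(x.reshape(1, -1))[0] == clf.predict(flipped.reshape(1, -1))[0]
    return ("Changing the sensitive attribute would NOT change the prediction."
            if same else
            "Had the sensitive attribute been different, the prediction would flip.")

def probability_statement(x: np.ndarray) -> str:
    """The rephrasing proposed above: just report the probability for each label."""
    p_low, p_high = clf.predict_proba(x.reshape(1, -1))[0]
    return f"Estimated risk: {p_high:.0%} high, {p_low:.0%} low."

x = X[0]
print(sensitivity_statement(x))
print(probability_statement(x))
```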

I liked the idea of comparing the subjects' prior positions on using ML with their judgments of the classifier. This relates to a reflection I made last week, where I raised the possibility that people may put more weight on the cases where the model makes a wrong decision. As I expected, the paper reports that prior positions do in fact make a huge difference in a user's judgment. Either building more trust with the users or designing the software to effectively address both kinds of users would be needed to handle this issue.

The following are the questions I had while reading the paper:

1. Could preprocessing the data add bias instead of removing it? What if an attribute that was thought to be unneeded for the classification was actually crucial to the judgment?

2. The authors state that one of the limitations of their study is that it was conducted with MTurk workers rather than the actual users of the software. Do you think this was really a limitation? The attributes used for the classifier and the explanations in their experiment seemed general enough for non-professionals to make a meaningful judgment.

3. If you were to design a classifier with an explanation model, which explanation method would you pick (out of Sensitivity, Input-Influence, Case, and Demographic)? What do you like about the chosen method?


2/26/20 – Jooyoung Whang – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

This paper studies what an AI system can do to gain more user acceptance even when it is not perfect. The paper focuses on the concept of "expectation" and the discrepancy between an AI's ability and a user's expectations of the system. To explore this problem, the authors implemented an AI-powered scheduling assistant that mimics the look of MS Outlook. The agent detects whether an e-mail contains an appointment request and asks the user if he or she wants to add the event to the calendar. The system was intentionally made to perform worse than the originally trained model in order to explore mitigation techniques that boost user satisfaction with an imperfect system. After trying out various methods, the authors conclude that users prefer AI systems that focus on high precision, and that users like systems that give direct information about the system, show explanations, and support a certain measure of control.

This paper takes a fresh approach that appropriately addresses the limitations AI systems are likely to have. While many researchers have looked into methods of maximizing system accuracy, the authors of this paper studied ways to improve user satisfaction even without a high-performing AI model.

I did get the feeling that the designs for adjusting end-user expectations were a bit too static. Aside from the controllable slider, the other two designs were basically text and images with either an indication of the accuracy or a step-by-step guide on how the system works. I wonder if a more dynamic version, where the system reports on each specific instance, would be more useful. For example, for every new e-mail, the system could additionally report to the user how confident it is or why it thought the e-mail included a meeting request.
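As a rough illustration of this dynamic version, the sketch below attaches a confidence estimate and the triggering phrases to each detection. The keyword detector, confidence heuristic, and message wording are hypothetical stand-ins, not the paper's scheduling assistant.

```python
from dataclasses import dataclass

MEETING_CUES = ["meet", "schedule", "calendar", "appointment", "call at"]

@dataclass
class Detection:
    is_meeting_request: bool
    confidence: float          # assumed to come from the underlying model
    triggering_phrases: list

def detect_meeting(email_text: str) -> Detection:
    """Toy keyword detector standing in for the real model."""
    hits = [cue for cue in MEETING_CUES if cue in email_text.lower()]
    # Assumed heuristic: more cue hits means higher reported confidence.
    confidence = min(0.5 + 0.15 * len(hits), 0.95) if hits else 0.1
    return Detection(bool(hits), confidence, hits)

def explain_detection(d: Detection) -> str:
    """Per-instance message shown to the user alongside the suggestion."""
    if not d.is_meeting_request:
        return "No meeting request detected."
    cues = ", ".join(f'"{p}"' for p in d.triggering_phrases)
    return (f"This looks like a meeting request (confidence {d.confidence:.0%}), "
            f"based on: {cues}. Add it to your calendar?")

print(explain_detection(detect_meeting("Can we schedule a call at 3pm Friday?")))
```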

This research reminded me of a UX design technique: think-aloud testing. In all of their designs, the authors' common approach was to close the gap between user expectations and system performance. Think-aloud testing is also used to close that gap, by analyzing how users interact with a system and adjusting based on the results. I think this research approached it the opposite way: instead of adjusting the system, the authors' designs try to adjust the user's mental model.

The following are the questions that I had while reading the paper:

1. As I've written in my reflection, do you think the system would be accepted more if it reported some information about itself for each instance (each e-mail)? Do you think the system might appear to be making excuses when it is wrong? In what way would this dynamic version be more helpful than the static designs from the paper?

2. In the generalizability section, the authors state that they think some parts of their study are scalable to other kinds of AI systems. What other types of AI could benefit from this study? Which one would benefit the most?

3. Many AI applications today are deployed only after satisfying a fairly high accuracy threshold, which can require more funds and time for development. Do you think this research will allow stakeholders to lower that threshold? In the end, the stakeholders just want to achieve high user satisfaction.


2/19/20 – Jooyoung Whang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

According to the paper, most developers of classification or prediction systems focus on the quality of the predictions but not on the system's team performance with its users. The authors introduce the problems that can arise from current model-training loss criteria and provide new methods that address them. To develop a clearer picture of users' interactions with a classifier system, the authors build a web-based game called Caja and conduct a user study on Amazon Mechanical Turk. They conclude that an increase in the performance of the system does not necessarily mean that the team performance of the system with its users also increases. They also confirm that their proposed training method, which uses a new loss function and a concept called dissonance, improves team performance.

I liked the authors' new perspective on human-AI collaboration and model training. Now that I think of it, not considering the users of the system during development is contradictory to what the system is trying to achieve. One thing I was interested in and had thoughts about was their definition of dissonance. The term is used to compare and link the old model of a system with the newly updated model in terms of user expectations. I saw that the term penalizes a system when the new model misclassifies inputs that the old model used to get right. However, what if the users of the old system made predictions based on how the system was wrong? This may be a weird concern and probably an edge case, but if a user made decisions on the assumption that the system was wrong all the time, that person's team performance with the updated model will always be worse, even if the new system was trained with the suggested loss function.
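For concreteness, here is a small sketch of the update-compatibility idea as I understand it: give extra weight to "new errors," i.e., examples the old model got right but the new model gets wrong. The weighting scheme below is my own illustration of the concept, not the paper's exact loss formulation.

```python
import numpy as np

def compatibility_penalized_loss(y_true, old_pred, new_prob, lam=1.0):
    """Cross-entropy plus an extra weight on 'new errors' (old right, new wrong).

    y_true   : array of 0/1 labels
    old_pred : hard 0/1 predictions of the previously deployed model
    new_prob : new model's predicted probability of class 1
    lam      : strength of the dissonance penalty (hypothetical knob)
    """
    eps = 1e-12
    ce = -(y_true * np.log(new_prob + eps) + (1 - y_true) * np.log(1 - new_prob + eps))
    old_correct = (old_pred == y_true)
    new_wrong = ((new_prob >= 0.5).astype(int) != y_true)
    dissonance = old_correct & new_wrong          # errors users would not expect
    weights = 1.0 + lam * dissonance
    return float(np.mean(weights * ce)), int(dissonance.sum())

y = np.array([1, 0, 1, 1])
old = np.array([1, 0, 0, 1])            # old model was right on examples 0, 1, 3
new_p = np.array([0.9, 0.2, 0.8, 0.3])  # new model now misses example 3
loss, new_errors = compatibility_penalized_loss(y, old, new_p)
print(loss, new_errors)  # the miss on example 3 counts as a penalized new error
```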

The following are the questions that I had while reading the paper:

1. As I have written in my reflection, do you think the newly proposed training method will be effective if the users made decisions based on the idea that the system will always be wrong? Or is this too extreme and absurd a thought?

2. The design of Caja ensures that the user can never arrive at the solution alone because too much about the problem domain is hidden from the user. However, this is often not the case in real-world scenarios; the user of the system is often also an expert in the related field. Does this reduce the quality and trustworthiness of the results of this research? Why or why not?

3. The research started from the idea that interaction with the users must be considered when making an update to an AI system, in this case for human-AI collaboration. What if it were the opposite? For example, some AIs are built to compete with humans, like AlphaGo. These types of AIs are also developed with the goal of producing the most optimal solution to a given input without considering interaction with the user. How could training be modified to take users into account for competing AIs?


2/19/2020 – Jooyoung Whang – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

This paper describes how bots and humans interact and collaborate to moderate thousands of wiki pages and ban vandals. To study the use of moderation bots, the authors use a technique called trace ethnography. The technique traces the logs and records left by automated services to give insight into how moderation decisions were made using various tools. The authors explain how the tools facilitate distributed cognition and enhance teamwork among otherwise isolated vandal fighters. According to the paper, the set of vandalism warnings is logged on a potential vandal's talk page, which future vandal fighters then use to determine how severe a warning to give the user. Temporary bans are handled in a similar fashion: a ban request is sent to the administrators' ban request board, and the next time an administrator finds vandal activity by the same user, the ban is issued. The paper walks through a detailed use case to explain the process step by step.

The paper was interesting in that it shone a light on another benefit that automation can bring to collaborative work. The paper emphasizes that it was the automated bots and their efficient reporting system that created a decentralized network of human moderators, by pre-processing and analyzing the queued edits to form a ranked queue of potential vandal edits according to previous warnings. Since many effective scheduling algorithms exist, automated scheduling is a great way of coordinating human teamwork. Wikipedia's system reminded me of the thread pools used in modern software, except that each thread's task is carried out by a human.
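A toy version of that ranked queue might look like the following, where incoming edits are prioritized by the editor's prior warnings plus a bot-assigned suspicion score. This illustrates the workflow only; it is not the actual implementation of tools such as Huggle or ClueBot.

```python
import heapq
import itertools

class VandalQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add_edit(self, edit_id: str, prior_warnings: int, suspicion: float):
        # heapq is a min-heap, so negate to pop the most suspicious edit first.
        priority = -(prior_warnings + suspicion)
        heapq.heappush(self._heap, (priority, next(self._counter), edit_id))

    def next_edit(self) -> str:
        """Return the edit a human vandal fighter should review next."""
        return heapq.heappop(self._heap)[2]

q = VandalQueue()
q.add_edit("rev:1001", prior_warnings=3, suspicion=0.8)
q.add_edit("rev:1002", prior_warnings=0, suspicion=0.4)
q.add_edit("rev:1003", prior_warnings=1, suspicion=0.9)
print(q.next_edit())  # rev:1001 has the most prior warnings, so it is reviewed first
```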

Wikipedia's vandal-fighting system makes perfect use of human and AI affordances. The human side draws on linguistic and complex reasoning abilities to identify vandal edits. The AI side efficiently handles the many repetitive tasks, like sorting edit queues and logging and retrieving warnings.

The following are the questions that I had while reading the paper:

1. At the end of the use case presented in the paper, an obsolete report made after a user's ban was automatically removed by the system. This is an example of resolving a race condition. Could there be other conflicts that occur because of the order of edits? Would some of them be difficult for a bot to fix?

2. According to the paper, the time of each warning on a potential vandal's talk page does not seem to be considered when assigning a new warning. What if a user who had gotten four warnings decided to quit vandalizing, came back a few years later, and accidentally made an edit that was considered vandalism? The system would issue a temporary ban. Do you think this is fair? (A sketch of a time-windowed alternative follows these questions.)

3. According to the paper, vandal fighters are able to select from a range of helper bots in their activity. All these bots are compatible with each other because of the talk pages provided by Wikipedia. Would there be any cases where the different types of bots cause a problem or conflict with each other?
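Question 2 above hinges on whether warning recency should matter. As a hypothetical alternative policy (not how Wikipedia's tools behave, as far as the paper describes), escalation to a ban request could count only warnings issued within a recent window:

```python
from datetime import datetime, timedelta

BAN_THRESHOLD = 4                      # four warnings, as in the scenario above
WARNING_WINDOW = timedelta(days=365)   # hypothetical recency window

def should_request_ban(warning_dates: list, now: datetime) -> bool:
    """Escalate only if enough warnings fall within the recent window."""
    recent = [d for d in warning_dates if now - d <= WARNING_WINDOW]
    return len(recent) >= BAN_THRESHOLD

now = datetime(2020, 2, 19)
old_warnings = [datetime(2016, m, 1) for m in (1, 3, 5, 7)]  # four warnings, years ago
print(should_request_ban(old_warnings, now))  # False under the time-decayed policy
```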


2/5/2020 – Jooyoung Whang – Guidelines for Human-AI Interaction

The paper is a good distillation of the various design recommendations for human-AI interaction systems that have accumulated over more than 20 years since the rise of AI. The authors run four iterations of filtering to arrive at a final set of 18 guidelines that have been thoroughly reviewed and tested. Their sources include commercial AI products, user reviews, and related literature. Across the iterations, the authors:

1. Extracted the initial set of guidelines

2. Narrowed the set down via internal evaluation

3. Performed a user study to verify relevance and clarity

4. Tested the guidelines with experts in the field

The authors provide a nicely summarized table containing all the guidelines and examples of each. Rather than going in depth on the resulting guidelines themselves, the authors focus more on the process and the feedback they received. The authors conclude by stating that the provided guidelines are for general design cases rather than specific ones.

When I was examining the guideline table, I liked how it was divided into four cases across the design iteration. In a usability engineering class that I took, I learned that a product's design lifecycle consists of Analyze, Design, Prototype, and Evaluate, in that order (and can repeat). I could see that the guidelines focus a lot on Analyze, Design, and Evaluate. It was interesting that prototyping wasn't strongly implied in the guidelines. I assume it may be because the whole design iteration was considered a pass of prototyping. It may also be because it is too hard to create a low-fidelity prototype of a system involving artificial intelligence. The reason for going through a prototyping process is to quickly filter out what works and what doesn't. Since artificial intelligence requires extensive training and complicated reasoning, a pass of prototyping will accordingly take longer than for other kinds of products.

It is very interesting that the guidelines (for the long term) instruct that the AI system must inform users of its actions. In my experience using AI systems such as voice recognition before I knew about machine learning techniques, the system mostly appeared to be a black box. I have also observed many people who intentionally avoided such systems out of suspicion. I think revealing portions of that information and giving control to the users is a very good idea. It will allow more people to adjust to the system quickly.

The following are the questions that came up while I was reading the paper:

1. As in my reflection, it is expensive to go through an entire design process for human-AI systems. Would there be a good workaround for this problem?

2. How much control do you think is appropriate to give to the users of the system? The paper mentions informing users of how the system will react to certain actions and allowing them to choose whether or not to use the system. But can we, and should we, allow further control?

3. The paper focuses on general cases of designing human-AI systems, and the authors note that they intentionally left out special cases. What kinds of special systems do you think will not need to follow the guidelines?


1/22/20 – Jooyoung Whang – Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass Ch.0-1

The chapters introduce a new trend of work in today's world called "Ghost Work." The authors define this as hidden human effort behind services that appear to be automated. The authors reject the idea that artificial intelligence (AI) will soon rule over its creators, because AI sorely lacks the ability to deal with the dynamic world without humans in the loop. Ghost work is what provides the backbone of today's AI algorithms. Companies use ghost work APIs such as Amazon's Mechanical Turk to easily assign micro-tasks that AI alone cannot handle. Many people today enter this newly formed workforce as a way of earning income on an extremely flexible schedule that they control. The downside of ghost work is that it is often hard for the government to track, and no labor support such as life insurance is provided. However, the workers do have the ability to legally counter unfair treatment, as shown by an example from the authors.

Something that surprised me while reading the chapters was that the workforce consists primarily of people with a bachelor's degree or higher. From my prior experience with Amazon's MTurk, most of the work listed asked workers to fill out a multi-page survey for a penny. I did not think people with such a high level of education would want to do such repetitive jobs.

I am also skeptical about the authors' claim that this is a new trend that may replace the "primary work" trend. According to the chapters, only about 1% of on-demand workers can make minimum wage purely from ghost work. This makes another source of income necessary, as the chapters also mention. Many people before ghost work were already working multiple part-time jobs, yet the world did not say that the "work trend" had changed. How would ghost work be different? I agree that the market for ghost work is growing, but I think the risk of committing to ghost work will prevent it from becoming a major trend.

At the end of the reading, I concluded that ghost work is just a special kind of freelance work that involves supporting artificial intelligence. It also made me wonder: since freelancing has existed for a long time and ghost work strongly resembles a special type of freelance work, wouldn't other freelance workers and ghost workers have similar demographics?

The following are additional questions that arose during the reading:

1. Is the ghost workforce really able to support real-time applications? This would require ghost work to be constantly available at any time and anywhere in the world. Wouldn't there be downtime at some point?

2. Which country has the largest population of ghost workers? The chapters mention only the United States and India as the primary sources of the ghost workforce. Which other countries' populations provide ghost work?

3. How much work can ghost work replace? It seems that it can only replace work that can be done online. It cannot, for example, have testers evaluate prototype machinery.
