02/26/20 – Nan LI – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary:

The main objective of this paper is to investigate how people make fairness judgments of ML systems and how explanations impact those judgments. In particular, the authors explore the difference between a global explanation, which describes how the model works, and local explanations, which justify the decision for a specific case (sensitivity-based and case-based explanations). The authors also demonstrate how individual differences in cognitive style and prior position on algorithmic fairness affect fairness judgments under different explanations. To achieve this goal, they conducted an online survey-style study with Amazon Mechanical Turk workers who met specific criteria. The experimental results indicate that different explanations are effective for different kinds of fairness issues and user profiles. However, a hybrid explanation, which uses global explanations to comprehend and evaluate the model and local explanations to examine individual cases, may be essential for accurate fairness judgment. Furthermore, they show that individuals' prior positions on algorithmic fairness affect how they respond to different types of explanations.
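To make the global/local distinction concrete, here is a minimal, hypothetical sketch rather than the paper's actual explanation generator: a global explanation reports how the model weighs each feature overall, while a local, input-influence style explanation reports how each feature contributes to one specific case. The feature names and data below are invented for illustration.

```python
# Sketch of the global vs. local distinction; feature names and data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["prior_offenses", "age", "employment_length"]
X = rng.normal(size=(500, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Global explanation: one description of how the model weighs each feature overall.
for name, weight in zip(features, model.coef_[0]):
    print(f"global weight for {name}: {weight:+.2f}")

# Local (input-influence style) explanation: contribution of each feature to one case.
case = X[0]
for name, weight, value in zip(features, model.coef_[0], case):
    print(f"{name} contributes {weight * value:+.2f} to this prediction")
```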

Reflection:

First, I think this paper addresses a very critical and imminent topic. Since machine learning and AI systems moved from exploration to deployment, ML predictions have been widely used to make decisions in high-stakes fields such as healthcare and criminal justice. However, society has great doubts about how these systems make decisions. Many people cannot accept, or even understand, why such important decisions should be left to an algorithm. As a result, calls for algorithmic transparency are growing louder and louder. At this point, an effective, unbiased, and user-friendly explanation of an ML system, one that enables the public to identify fairness problems, would not only help ensure the fairness of the system but also increase public trust in its output.

However, it is also tricky that there is no one-size-fits-all solution for an effective explanation. I do understand that different people will react differently to explanations; nevertheless, I was somewhat surprised that people have such different opinions when judging fairness. Even though this is understandable considering their prior positions on algorithms, their cognitive styles, and their different backgrounds, it makes ensuring the fairness of a machine learning system more complex, since the system may need to take individual differences in fairness positions into account, which may require different corrective or adaptive actions.

Finally, this paper reminds me of another similar topic. When we explain how a model works, how much information should we provide? What kind of information should we withhold so that it cannot be abused? In this paper, the authors only mention that they provide two types of explanations: global explanations that describe how the model works, and local explanations that attempt to justify the decision for a specific case. However, they did not examine how much information about the underlying model is provided in the explanation. I think this is an interesting topic since we are investigating the impact of explanations on fairness judgment.

Question:

  1. Which type of explanation mentioned in this article would you prefer to see when judging the fairness of an ML system?
  2. How does users' perception of a machine learning system's fairness influence the process of ensuring fairness when designing the system?
  3. This experiment was conducted as an online survey with crowd workers instead of judges; do you think this has any influence on the experimental results?

Word Count: 564


02/26/20 – Myles Frantz – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summation

Even though Machine Learning took the title of "The Best Technology Buzzword" away from cyber security a few years ago, there are two fundamental problems with it: understanding how much each feature contributes to an answer, and ensuring the model itself is fair. The first problem limits progress on the second and has spawned its own field of research: explainable artificial intelligence. Because of this, it is difficult to ensure models are fair to sensitive data and don't learn biases. To help ensure models are understood as fair, this team concentrated on automatically generating four different types of explanations for the rationale of the model: input-influence-based, demographic-based, sensitivity-based, and case-based. By showing these explanations of the same models to a group of crowd workers, they were able to quantitatively determine that there is no single perfect explanation method. There must instead be a tailored and customized explanation method.

Response

I agree with the need for explainable machine learning, though I'm unsure about the impact of this team's work. Using work done previously for the four explanation types and their own preprocessor, they seemingly resolved a question only by continuing it. This may be due to my lack of experience reading psychology papers, though their rationalization for the explanation styles and fairness in judgement seems commonplace. Two of the three conclusions wrapping up the quantitative study seemed appropriate: case-based explanation seemed less fair, while local explanation was more effective. The latter conclusion, that people have a prior bias towards machine learning, seems redundant.

I can appreciate the lengths they went to in measuring the results with the Mechanical Turk workers. As a seemingly incremental paper (see the portion about their preprocessor), this may lead to more papers built off their gathered heuristics.

Questions

  • I wonder if the impact of the survey for the Mechanical Turk workers was limited by only using the four types of explanations studied. The conclusion of the paper indicated there is no good average and each explanation type was useful in one scenario or another. Would different explanations lead to a better overall explanation?
  • A known and understood limitation of this paper is the use of Mechanical Turk workers instead of actual judges. This may better represent a jury; however, it is hard to measure the full impact without including a judge. It would be costly and time-consuming, though it would help to better represent the full scenario.
  • Given only four types of explanation, would there be room for a combined or collaborative explanation? Though this paper mostly focuses on generating the explanations, there should be room to combine the factors to potentially create an overall good explanation, despite the paper limiting itself to only four explanations early on by fully utilizing the Binns et al. survey.


02/26/20 – Myles Frantz – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summation

Though Machine Learning techniques have advanced greatly within the past few years, human perception of the technology may severely limit its adoption and usage. To study how to better encourage and describe the process of using this technology, this team created a Scheduling Assistant to monitor and elicit how users tune and react to different expectations. Users tune the expectations of the AI (Scheduling Assistant) via a slider (from "Fewer detections" on the left to "More detections" on the right) that directly alters the false positive and false negative settings of the AI. This direct feedback loop gave users (Mechanical Turk workers) more confidence and a better understanding of how the AI works. Given the variety of users, though, an AI focused on High Precision was not the best general setting for the scheduling assistant.
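A rough sketch of how such a slider could work, under the assumption that it maps to a decision threshold on the classifier's confidence scores (the summary describes the slider only as altering the false positive/false negative settings, so the mapping and the data below are invented): pushing the slider toward "More detections" lowers the threshold, raising recall at the cost of precision.

```python
# Illustrative sketch only: map a "Fewer/More detections" slider to a decision
# threshold and observe the precision/recall trade-off on synthetic scores.
import numpy as np

def precision_recall(scores, labels, threshold):
    """Precision and recall for meeting-request detections above a threshold."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    fn = np.sum(~predicted & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def slider_to_threshold(slider):
    """Slider 0.0 = 'Fewer detections' (high threshold), 1.0 = 'More detections' (low threshold)."""
    return 0.9 - 0.8 * slider

# Synthetic confidence scores for emails that do (1) or do not (0) contain a meeting request.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(0.5 * labels + rng.normal(0.3, 0.2, size=1000), 0, 1)

for slider in (0.0, 0.5, 1.0):
    p, r = precision_recall(scores, labels, slider_to_threshold(slider))
    print(f"slider={slider:.1f}  precision={p:.2f}  recall={r:.2f}")
```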

Response

I like this kind of raw collaboration between the user and the Machine Learning system. It tailors the algorithm explicitly to the tendencies and mannerisms of each user, allowing easier customization and thus a higher likelihood of usage. This is reflected in the team's research hypothesis: "H1.1 An AI system focused on High Precision … will result in higher perceptions of accuracy …". In this example, each user (Mechanical Turk worker) was only using a subset of Enron emails to confirm or deny the meeting suggestions. Speculating further, if this type of system were used at scale across many users, being able to tune the AI would greatly encourage use.

I also strongly agree with the slider bar for easy tuning by the individual. In this format the user does not need great technological skill to use it, and it is relatively fast to use. Having it easily reachable within the same system also ensures a better connection between the user and the AI.

Questions

  • I would like to see a larger (and/or beta) study done with a major email provider. Each email provider likely has its own homegrown machine learning model; however, providing the capability to further tune the AI to each user's preferences and tendencies would be a great improvement. The only issue would be scalability and providing enough services to make this work for all users.
  • In tuning the ease of access and usability, I would like to see a study comparing the different types of interaction tools (sliders, buttons, Likert-scale settings, etc.). There is likely a study on the effectiveness of each type of interaction tool within a system; however, in the context of AI settings it is imperative to have the best tool. This would hopefully become an adopted standard: an easy-to-use tool accessible to everyone.
  • Following on from the first question, I would like to see this kind of study run on an internal mailing system, potentially at an academic level. Though this was studied with 150 Mechanical Turk workers and 400 internally provided workers, it was based on a sub-sample of the Enron email dataset. Providing this as a live beta test in a widely and actively used email system with live emails would be a true test that I would like to see.


02/26/20 – Nan LI – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summary:

The key motivation of this paper is to investigate the factors that influence user satisfaction with and acceptance of an imperfect AI-powered system; the example used in this paper is an email Scheduling Assistant. To achieve this goal, the authors conducted a number of experiments based on three techniques for setting expectations: Accuracy Indicator, Example-based Explanation, and Performance Control. Before the experiments, the authors present three main research questions: how High Precision (low False Positives) versus High Recall affects perceived accuracy and acceptance; which design techniques are effective for setting appropriate end-user expectations of AI systems; and what impact expectation-setting intervention techniques have. A series of hypotheses is also stated before the experiments. Finally, the experimental results indicate that the expectation-adjustment techniques demonstrated in the paper affected the intended aspects of expectations and were able to enhance user satisfaction with and acceptance of an imperfect AI system. Unexpectedly, the conclusion is that a High Recall system increases user satisfaction and acceptance more than a High Precision system.

Reflection

I think this paper addresses a critical concern of AI-powered systems from an interesting and practical direction. The whole approach of this paper reminds me of a previous paper that summarized guidelines for human-AI interaction. The first guideline is to make clear to the human what the AI can do, and the second is to make clear how well the AI can do what it can do. I think the three expectation-adjustment techniques are designed to give the user a clear sense of these two guidelines. However, instead of using text alone to inform the user, the authors designed three interfaces that combine visualization and text, striving for simplicity.

These designs inform the user of the system's accuracy very intuitively. Besides, they also allow the user to control the detection behavior, so that users can apply their own requirements. Thus, through several rounds of adjusting the control and experiencing the feedback, users eventually align their expectations with an appropriate setting. I believe this is the main reason these techniques successfully increase user satisfaction and acceptance of an imperfect AI system.

However, as the paper mentions, the conclusion that users are more satisfied with and accepting of a High Recall system than a High Precision one rests on the fact that, in their experimental platform, users can recover from a False Positive more easily than from a False Negative. In my view, the preference between High Recall and High Precision should differ across AI systems. Nowadays, AI systems are widely applied in high-stakes domains such as health care and criminal justice. For these systems, we might want to tune differently to optimize for different goals.

Questions:

  1. Can you see any other related guidelines reflected in the expectation-adjustment techniques designed in the paper?
  2. Is there any other way to adjust users' expectations of an imperfect AI system?
  3. What do you think are the key factors that can decrease user expectations? Do you have a similar experience?

Word Count: 525


02/19/2020 – Ziyao Wang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

The authors introduce the fact that in a human-AI hybrid decision-making system, updates aimed at improving the accuracy of the AI may harm the teamwork. Experienced workers who are advised by the AI system have built a mental model of it, which improves the correctness of the team's results. However, an update that improves the accuracy of the AI system may create a mismatch between the updated model and the worker's mental model. As a result, the user can no longer make appropriate decisions with the help of the AI system. In this paper, the researchers propose a platform named CAJA, which can help evaluate the compatibility between the AI and the human. With results from experiments using CAJA, developers can learn how to make updates compatible while still maintaining high accuracy.
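The review does not spell out how compatibility is scored, so the following is only a plausible sketch: measure the fraction of cases the old model got right that the updated model still gets right. Two updates with identical accuracy can then differ sharply in compatibility, which is exactly the mismatch with the user's mental model described above. All numbers here are invented.

```python
# Hedged sketch of one plausible compatibility metric: the share of examples the
# old model got right that the updated model still gets right.
import numpy as np

def compatibility(y_true, old_pred, new_pred):
    """Share of previously-correct predictions that remain correct after the update."""
    old_correct = old_pred == y_true
    still_correct = old_correct & (new_pred == y_true)
    return still_correct.sum() / old_correct.sum()

def accuracy(y_true, pred):
    return np.mean(pred == y_true)

y_true   = np.array([1, 0, 1, 1, 0, 1, 0, 0])
old_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # 6/8 correct
new_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # also 6/8 correct, but different errors

print("old accuracy:", accuracy(y_true, old_pred))
print("new accuracy:", accuracy(y_true, new_pred))
print("compatibility:", compatibility(y_true, old_pred, new_pred))  # < 1.0 despite equal accuracy
```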

Reflection:

Before reading this paper, I held the view that an AI system with higher accuracy is always better. However, this paper gives me a new point of view: instead of only the performance of the system, we should also consider the cooperation between the system and human workers. In this paper, updates to the AI system can break the mental model in the human's mind. Experienced workers have built a good working relationship with the AI tools: they know which advice should be taken and which may contain errors. If a patch makes the system more accurate overall while reducing the correctness of the part that humans trust, the accuracy of the whole hybrid system will also be reduced. Humans may not trust the updated system until they reach a new balance with it. During this period, the performance of the hybrid system may fall to a level even worse than keeping the previous, un-updated system. For this reason, developers should try to maximize the performance of the system before releasing the application to users, so that new updates do not make large changes to the system and humans can remain familiar with it.

We can learn from this that we should never ignore the interaction between humans and AI systems. A good design of the interaction can improve the performance of the whole system, while poor human-AI interaction can harm it. When we implement a system that needs both human and AI affordances, we should pay more attention to the cooperation between humans and AI and leverage the affordances of both sides, instead of focusing only on the AI. We should put ourselves in the position of designers of the whole system, with a view of the overall situation, rather than consider ourselves merely programmers who focus only on the program.

Questions:

What are the criteria for deciding whether an update is compatible or not?

Would releasing instructions for each update to users be valuable in reducing the harm of updates?

If we have a new version of the system that improves accuracy greatly, but the users' mental model is totally different from it, how do we reach a balance that maximizes the performance of the whole hybrid system?


02/19/20 – Lulwah AlKulaib- Dream Team

Summary

The authors mention that previous HCI research focused on ideal team structures and how roles, norms, and interaction patterns are influenced by systems. That research directed teams toward those structures by increasing shared awareness, adding channels of communication, and convening effective collaborators. Yet organizational behavior research denies the existence of universally ideal team structures and, through structural contingency theory, has demonstrated that the best team structure depends on the task, the members, and other factors. The authors introduce DreamTeam, a system that identifies effective team structures for each team by adapting teams to different structures and evaluating each fit. DreamTeam explores over time, experimenting with values along many dimensions of team structure such as hierarchy, interaction patterns, and norms. The system uses feedback, such as team performance or satisfaction, to iteratively identify the team structures that best fit each team. It helps teams identify the structures that are most effective for them by experimenting with different structures over time using multi-armed bandits.

Reflection

The paper presented a system that focuses on virtual teams. In my opinion, the presented system is a very specific application to a very specific problem. The authors address their long list of limitations, including how they don’t believe their system generalizes to other problems easily. I also believe that the way they utilize feedback in the system is complex and unclear. Their reward function did not explain how qualitative factors were taken into consideration. The authors mention that high variance tasks would require more time for DreamTeam to converge.

This means more time to get a response from the system, and I don't see how that would be useful if it slows teams down. Also, looking at the snapshot of the Slack integration, it seems that they gauge team satisfaction based on users' responses to a task, which is not always how collaboration on Slack works. The enthusiasm of the responses seems out of the norm. The authors did not address how their system would measure "team satisfaction" when there is little to no response. Would that be counted as a negative response, or would it be neutral? And even though their system worked well for the very specific task they chose, it was also a virtual team, which raises questions about how this method would apply to in-person or hybrid teams. Their environment was very tightly controlled. Even though they presented a good idea, I doubt how applicable it is to real-life situations.

Discussion

  • In your opinion, what makes a dream team?
  • Are you pro or against ideal team structures? Why?
  • What were the qualities of collaborators in the best group project/research you had?
  • What makes the “chemistry” between team members?
  • What does a successful collaborative team project look like during a cycle?
  • What tools do you use in project management? 
  • Would you use DreamTeam in your project?
  • What would you change in DreamTeam to make it work better for you?


2/19 – Dylan Finch – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Word count: 565

Summary of the Reading

This paper analyzes the use of autonomous technologies on Wikipedia. These technologies help keep the peace on the large platform, flagging malicious users and reverting inaccurate and spammy changes so that Wikipedia stays accurate and up to date. Many people may think that humans play the major role in policing the platform, but machines and algorithms also play a very large part, aiding humans in dealing with the large volume of edits.

Some tools are completely automated and can prevent vandalism with no human input. Other tools give human contributors tips to help them spot and fight vandalism. Humans work together with the automated systems and each other to edit the site and keep the pages vandal free. The way in which all of the editors edit together, even though they are not physically together or connected as a team, is an impressive feat of human and AI interaction.

Reflections and Connections

To start, I think that Wikipedia is such an interesting thing to examine for a paper like this. While many organizations have a similar structure, I think that Wikipedia is unique and interesting to study because it is so large, so distributed, and so widely used. It can be hard enough to get a small team of people to work together on documentation. At Wikipedia's size, the complexities of making it all work must be unimaginable. It is fascinating to find out how machines and humans work together at that scale to keep the site running smoothly. The ideas and analysis seen here can easily be applied to smaller systems that are trying to accomplish the same thing.

I also think that this article serves as a great reminder of the power of AI. The fact that AI is able to do so much to help editors keep the site running smoothly, even with all of the complexities of the site, is amazing, and it shows just how much power AI can have when applied to the right situation. A lot of the work done on Wikipedia is not hard work. The article mentions some of the things that bots do, like importing data and fixing grammatical mistakes. These things are incredibly tedious for humans to do, and yet they are perfect work for machines. They can do this work almost instantly, while it might take a human an hour. This not only serves as a great reminder of how AIs and humans can complement each other's abilities, but it also shows the power of the internet. Something like this never would have been possible before in the history of human civilization. The mere fact that we can do something like this now speaks to the amazing power of the current age.

Questions

  1. Does this research have applications elsewhere? What would be the best place to apply this analysis?
  2. Could this process ever be done with no human input whatsoever? Could Wikipedia one day be completely self-sufficient?
  3. This article talks a lot about how the bots of Wikipedia are becoming more and more important compared to the policies and social interactions between editors. Is this happening elsewhere? Are there bots in other places that we might not see or notice, even though they are doing a larger and larger share of the work?


2/19 – Dylan Finch – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Word count: 517

Summary of the Reading

This paper seeks to make it faster and easier for teams to find their ideal team structure. While many approaches allow teams to test out different team structures to find the best one, many of them take a lot of time and can greatly affect the people who work on the team. Oftentimes teams have to switch structures so frequently that it becomes hard to concentrate on getting work done.

The method proposed in the paper seemed to be very successful: it resulted in teams that were 38-46% more effective. The system works by testing different team structures and taking automatically generated feedback (like performance metrics) to figure out how effective each structure is. It then bases its future combinations on this feedback. Each time a new structure is tested, it varies on five dimensions: hierarchy, interaction patterns, norms of engagement, decision-making norms, and feedback norms.
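As a rough illustration of that feedback loop, here is a minimal epsilon-greedy bandit over a few made-up structure combinations. The paper's method uses temporally constrained multi-armed bandits; this sketch drops the temporal constraints, and the dimension values and reward numbers are invented.

```python
# Minimal epsilon-greedy bandit over candidate team structures (illustrative only).
import random
from itertools import product

# Each arm is one combination of (a subset of) the structural dimensions.
HIERARCHY = ["flat", "leader"]
INTERACTION = ["open_channel", "pairwise"]
DECISION_NORM = ["majority_vote", "leader_decides"]
ARMS = list(product(HIERARCHY, INTERACTION, DECISION_NORM))

def observe_team_performance(arm):
    """Stand-in for real feedback; a noisy score that favors one arbitrary structure."""
    base = 0.8 if arm == ("flat", "open_channel", "majority_vote") else 0.5
    return base + random.gauss(0, 0.1)

def run_bandit(rounds=60, epsilon=0.2):
    counts = {arm: 0 for arm in ARMS}
    values = {arm: 0.0 for arm in ARMS}   # running mean of observed rewards

    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.choice(ARMS)                 # explore a new structure
        else:
            arm = max(ARMS, key=lambda a: values[a])  # exploit the best so far

        reward = observe_team_performance(arm)        # e.g., tasks completed, satisfaction
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    return max(ARMS, key=lambda a: values[a])

print("best structure found:", run_bandit())
```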

Reflections and Connections

I think that this paper has an excellent idea for a system that can help teams to work better together. One of the most important things about a team is how it is structured. The structure of a team can make or break its effectiveness, so getting the structure right is very important to making an effective team. A tool like this that can help a team figure out the best structure with minimal interruption will be very useful to everyone in the business world who needs to manage a team. 

I also thought that it was a great idea to integrate the system into Slack. When I worked in industry last summer, all of the teams at my company used Slack, so it makes a lot of sense to implement this new system in a tool that people are already familiar with. The use of Slack also allows the creators to make the system friendlier. I think it is much better to get feedback from a human-like Slack bot than from some other heartless computer program. It is also very cool how the team members can interact with the bot in Slack.

I also found the dimensions that they used in the team structures to be interesting. It is valuable to be able to classify teams in some concrete way based on certain dimensions of how they perform. This also has a lot of real world applications. I think that a lot of the time, one of the hardest things in any problem space is just to quantify the possible states of the system. They did this very nicely with the team dimensions and all of their values. 

Questions

  1. Would you recommend this system to your boss at your next job as a way to figure out how to organize the team?
  2. Aside from the ones listed in the paper, what do you think could be some limitations of the current system?
  3. Do you think that the possible structures had enough dimensions and values for each dimension?


02/19/2020 – The Work of Sustaining Order in Wikipedia – Myles Frantz

Given an extensive website such as Wikipedia, there is bound to be an abundance of actors, both good and bad. With the scale and wide ruleset of the popular site, it would be nigh impossible for human moderators to handle the workload and cross-examine each page in depth. To alleviate this, programs that use machine learning were created to help track users' activity across the site in a single repository. Once the information is gathered there, a user acting in a malicious way can easily be caught by the system and their edits auto-reverted based on the machine learning model's predictions. Such was the case for the user in the case study, who attempted to slander a famous musician but was caught quickly and with ease.
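A toy sketch of the score-and-revert pattern described above. The scoring rule, thresholds, and names here are invented for illustration and are not how Wikipedia's actual anti-vandalism bots work; a real system would use a trained classifier rather than keyword counting.

```python
# Toy illustration of the score-and-revert pattern: score each edit, auto-revert
# high scores, queue medium scores for human review. Everything here is made up.
from dataclasses import dataclass

@dataclass
class Edit:
    user: str
    page: str
    added_text: str

SUSPICIOUS_TERMS = {"click here", "buy now", "idiot"}
REVERT_THRESHOLD = 0.8
FLAG_THRESHOLD = 0.4

def vandalism_score(edit: Edit) -> float:
    """Crude stand-in for an ML model's vandalism probability."""
    text = edit.added_text.lower()
    hits = sum(term in text for term in SUSPICIOUS_TERMS)
    shouting = edit.added_text.isupper() and len(edit.added_text) > 20
    return min(1.0, 0.4 * hits + (0.5 if shouting else 0.0))

def handle_edit(edit: Edit) -> str:
    score = vandalism_score(edit)
    if score >= REVERT_THRESHOLD:
        return f"auto-revert edit by {edit.user} on {edit.page} (score={score:.2f})"
    if score >= FLAG_THRESHOLD:
        return f"queue edit by {edit.user} for human review (score={score:.2f})"
    return "accept edit"

print(handle_edit(Edit("spammer42", "Famous Musician", "click here to buy now, idiot")))
print(handle_edit(Edit("editor7", "Famous Musician", "Corrected the album release year.")))
```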

I absolutely agree with all the moderation going on around Wikipedia. Given the site's domain, there is a vast number of pages that must be secured and protected (all to the same level). It is unrealistic to expect a non-profit website to be able to hire more manual workers to accomplish this task (in contrast to YouTube or Facebook). Also, the amount of context that must be followed to fully track down a malicious user manually would be completely exhausting. On the security side, for malware tracking there is a vast number of decompilers, raw binary program tracers, and even a custom virtual machine and operating system (Security Onion) that contains various programs "out of the box" ready to trace the full environment of the malware.

I disagree with one of the major issues raised, regarding the bots creating and executing their own moral agenda. Their behavior is entirely learned and based on various factors (such as the rules, the training data, and correction values). Though they have the power to automatically revert and edit someone else's page, these actions are done at the discretion of the person who created the rules. There will likely be some issues, but that is part of the overall learning process. These false positives can also be appealed if the author chooses to follow through, so it is not a fully final decision.

  • I would expect that with such a tool suite there would be a tool acting as a combination, a "Visual Studio Code"-like interface for all these tools. Having all these tools at the ready is useful; however, since time is of the essence, a tool wrapping all the common functions would be very convenient.
  • I would like to know how many reviews from moderators are completely biased. A moderator workforce should ideally be unbiased, but realistically that is unlikely to fully happen.
  • I would also like to see the percentage of false positives, even in a system this robust. New moderators are likely to flag or unflag something incorrectly if they are unfamiliar with the rules.


02/19/20 – Lulwah AlKulaib- OrderWikipedia

Summary

The paper examines the roles of software tools in English-language Wikipedia. The authors shed light on the process of counter-vandalism in Wikipedia. They explain in detail how participants and their assisted editing tools review Wikipedia contributions and enforce standards. They show that the editing process in Wikipedia is not a disconnected activity where editors force their views on others. Specifically, vandal fighting is shown as a distributed cognition process in which users come to know their projects and the users who edit them in a way that is impossible for a single individual. The authors describe the blocking of a vandal as a cognitive process made possible by a complex network of interactions between humans, encyclopedia articles, software systems, and databases. Humans and non-humans work to produce and maintain social order in the collaborative production of an encyclopedia with hundreds of thousands of diverse and often unorganized contributors. The authors introduce trace ethnography as a method for studying the seemingly ad-hoc assemblage of editors, administrators, bots, assisted editing tools, and others who constitute Wikipedia's vandal-fighting network.

Reflection

The paper comes off as a survey paper. I found that the authors explained some methods that already existed and used one of the authors' experience to elaborate on others' work. I couldn't see their contribution, but maybe that was needed 10 years ago? The tools they mentioned (Huggle, AIV, Twinkle, etc.) were standard tools for editing Wikipedia articles and monitoring edits made by others. They reflected on how those tools were helpful in making fighting vandalism an easier task. They mention that these tools facilitate reviewing each edited article by linking it to a detailed edit summary explaining why the edit was made, by whom, and from which IP addresses. They explain how such software is used to detect vandalism and how to revert to the correct version of an article. They presented a case study of a Wikipedia vandal and showed logs of the changes he was able to make in an hour. The authors also referenced Ed Hutchins, who explains the cognitive work that must be performed to keep US Navy ships on course at any given time, and how that is similar to what it takes to manage Wikipedia. Technological actors in Wikipedia, such as Huggle, make what would be a difficult task into a mundane affair; reverting edits becomes a matter of pressing a button. The paper was informative for someone who hasn't worked on editing Wikipedia articles, but I thought it could have been presented as a tutorial, which would have been more beneficial.

Discussion

  • Have you worked on Wikipedia article editing before?
  • Did you encounter using the tools mentioned in the paper?
  • Is there any application that comes to mind where this can be used other than Wikipedia?
  • Do you think such tools could be beneficial when it comes to open source software version control?
  • How would this method generalize to open source software version control?
