02/26/2020 – Nurendra Choudhary – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary

In this paper, the authors design explainable Machine Learning models to enhance their fairness perception. In this case, they study COMPAS, a model that predicts a criminal’s chance of reoffending. They explain the drawbacks and fairness issues with COMPAS (overestimates the chance for certain communities) and analyze the significance of change that Explainable AI (XAI) can bring to this fairness issue. They generated automatic explanations for COMPAS utilizing previously developed templates (Binns et al. 2018). The explanations are based on 4 templates: Sensitivity, Case, Demographic and Input-Influence. 

The authors hire 160 MT workers who meet certain criteria, such as US residence and MT expertise. The workers are a diverse group, but this diversity has no significant impact on the variance of the results. The experimental setup is a questionnaire that probes each worker’s criteria for making fairness judgements. The results show that the workers have heterogeneous criteria for making fairness judgements. Additionally, the experiment highlights two fairness issues: “unfair models (e.g., learned from biased data), and fairness discrepancy of different cases (e.g., in different regions of the feature space)”.

Reflection

AI works in a very stealthy manner. The reason is that most of the algorithms detect patterns in a latent space that is incomprehensible to humans. The idea of using automatically generated standard templates to construct explanations of AI behaviour should be generalized to other AI research areas. The experiments show how human behavior changes in response to explanations. I believe such explanations could not only help the general population’s understanding but also help researchers in narrowing down the limitations of these systems.

From the case of COMPAS, I question the future roles that interpretable AI makes possible. If AI is able to give explanations for its predictions, then I think it could play the role of an unbiased judge better than humans. Societal biases are embedded in humans and they might subconsciously affect our choices. Interpreting these choices in humans is a complex endeavour of self-criticism. For AI, however, systems such as those in the paper can generate human-comprehensible explanations to validate their predictions, thus making AI an objectively fairer judge than humans.

Additionally, I believe evaluation metrics for AI lean towards improving their overall prediction. However, I believe that comparable models that emphasize interpretability should be given more importance. But, a drawback to such metrics is the necessity of human evaluation for interpretability. This will impede the rate of progress in AI development. We need to develop better evaluation strategies for interpretability. In this paper, the authors hired 160 MT workers. Given it is a one-time evaluation, this study is possible. However, if this needs to be included in the regular AI development pipeline, we need more scalable approaches to avoid prohibitively expensive evaluation costs. One method could be to rely on a less-diverse test set for the development phase and increase diversity according to the real-world problem setting.

Questions

  1. How difficult is it to provide such explanations for all AI fields? Would it help in progressing AI understanding and development?
  2. How should we balance between explainability and effectiveness of AI models? Is it valid to lose effectiveness in return for interpretability?
  3. Would interpretability lead to adoption of AI systems in sensitive matters such as judiciary and politics?
  4. Can we develop evaluation metrics around suitability of AI systems for real-world scenarios? 

Word Count: 567

02/26/2020 – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning- Yuhang Liu

This paper discusses people’s dependence on interpretability tools for machine learning. As mentioned in the article, machine learning (ML) models are commonly deployed in fields ranging from criminal justice to healthcare. Machine learning has gone beyond academia and has developed into an engineering discipline. To this end, interpretability tools have been designed to help data scientists and machine learning practitioners better understand how ML models work. This paper focuses on two such tools: the InterpretML implementation of GAMs (glass-box models) and the SHAP Python package (a post-hoc explanation technique for black-box models). The results of the authors’ research show that users trust the tools’ output too much and rely too heavily on interpretability tools. Few of these users can accurately describe the visualizations that these tools produce. In the end, the authors conclude that the visualizations produced by an interpretability tool can sometimes help data scientists find problems with data sets or models. For both tools, however, the existence of visualizations and the fact that the tools are publicly available have led to over-trust and misuse. Therefore, after the final experiments, the authors conclude that experts in human-computer interaction and machine learning need to work together to achieve better results.

First of all, after reading this article, I think that it is not only the interpretability tools that people over-trust; machine learning itself also invites over-trust, which may have many causes, such as the data sets used. This reminds me of the course project I wanted to do this semester. My original motivation was that a single, standard data set written by a large number of experts over a long time could cause the trained model to report too high an accuracy rate, so a data set generated by crowdsourcing could be used instead to get better results.

Secondly, I very much agree with the final solution proposed by the authors, which is to better integrate human-computer interaction and machine learning as a future research direction. This is because these interpretability tools are a visual display of the results. Better human-computer interaction design allows users to extract the results of machine learning more easily, understand them better, and spot the problems in them, instead of overly trusting the results. In the future, fewer and fewer users will understand machine learning, yet more and more people will use it, and machine learning will become more and more of a tool, so I think that the interaction side should be improved to help users understand their results. On the other hand, machine learning should be more diverse and able to adapt to more application scenarios. Only when both aspects are done better can these tools have their intended effect.

  1. Will machine learning be more academic or more tool-oriented in the future?
  2. If users do not know the meaning of the results, how can they understand the accuracy of the results more clearly without using interpretability software?
  3. The article mentions that the joint efforts of human-computer interaction and machine learning will be required in the future. What changes should be made in human-computer interaction?

02/26/2020 – Dylan Finch – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Word count: 573

Summary of the Reading

This paper investigates explaining AI and ML systems. An easy way to explain AI and ML systems is to have another computer program help generate an explanation of how the AI or ML system works. This paper works towards that goal, comparing 4 different programmatically generated explanations of AI and ML systems and seeing how they impact judgments of fairness. These different explanations had a large impact on perceptions of fairness and bias in the systems, with a large degree of variation between the explanation styles.

Not only did the kind of explanation used have a large impact on the perceived fairness of the algorithm, but the pre-existing feelings of the participants towards AI and ML and bias in these fields also had a profound impact on whether or not participants saw the explanations as fair or not. People who did not already trust AI fairness equally distrusted all of the explanations.

Reflections and Connections

To start, I think that this type of work is extremely useful to the future of the AI and ML fields. We need to be able to explain how these kinds of systems work and there needs to be more research into that. This issue of explainable AI becomes even more important when we put it in the context of making AI fair to the people who have to interact with it. We need to be able to tell if an AI system that is deciding whether or not to free people from jail is fair or not. The only way we can really know if these models are fair or not is to have some way to explain the decisions that the AI systems make. 

I think that one of the most interesting parts of the paper is the variation in the number of people with different circumstances who thought that the models were fair or not. Pre-existing ideas about whether or not AI systems are fair had a huge impact on whether or not people thought these models were fair when given an explanation of how they work. This shows how human of a problem this is and how hard it can be to decide if a model is fair or not, even when you have access to an explanation. Views of the model will differ from person to person. 

I also found it interesting how the type of explanation used had a big impact on the judgment of fairness. To me, this conjures up ideas of a future where the people who build algorithms can just pick the right kind of explanation to prove that their algorithm is fair, in the same way companies now use language in a very questionable way. I think that this field still has a long way to go and that it will become increasingly important as AI penetrates more and more facets of our lives.

Questions

  1. When each explanation produces such different results, is it possible to make a concrete judgment on the fairness of an algorithm?
  2. Could we use computers or maybe even machine learning to decide if an algorithm is fair or would that just produce more problems?
  3. With so many different opinions, even when the same explanation is used, who should be the judge if an algorithm is fair or not?

02/26/20 – Nan LI – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary:

The main objective of this paper is to investigate how people make fairness judgments of an ML system and how explanations impact those judgments. In particular, the authors explore the difference between a global explanation, which describes how the model works, and local explanations, which are sensitivity-based and case-based. The authors also demonstrate how individual differences in cognitive style and prior position on algorithmic fairness affect fairness judgments under different explanations. To achieve this goal, the authors conducted an online survey-style study with Amazon Mechanical Turk workers who met specific criteria. The experiment results indicate that which explanation is effective varies with the kind of fairness issue and the user profile. However, a hybrid explanation that uses global explanations to comprehend and evaluate the model and local explanations to examine individual cases may be essential for accurate fairness judgment. Furthermore, the authors also demonstrate that individuals’ prior positions on the fairness of algorithms affect their response to different types of explanations.

Reflection:

First, I think this paper addresses a very critical and pressing topic. Since the advent of machine learning and AI systems, ML predictions have been widely deployed to make decisions in high-stakes fields such as healthcare and criminal prediction. However, societies have great doubts about how these systems make decisions. People cannot accept or even understand why these important decisions should be left to an algorithm. As a result, the community’s call for algorithmic transparency is growing louder and louder. At this point, an effective, unbiased and user-friendly explanation of an ML system that enables the public to identify fairness problems would not only help ensure the fairness of the ML system, but also increase public trust in its output.

However, it is also tricky that there is no one-size-fits-all solution for an effective explanation. I do understand that different people will react differently to explanations; nevertheless, I was rather surprised that people have very different opinions on the judgment of fairness. Even though this is understandable considering their prior positions on the algorithm, their cognitive styles, and their different backgrounds, it makes ensuring the fairness of a machine learning system more complex, since the system may need to take into account individual differences in fairness positions, which may require different corrective or adaptive actions.

Finally, this paper reminds me of another similar topic. When we explain how the model works, how much information should we provide? What kind of information should we withhold so that it will not be abused? In this paper, the authors only mention that they provide two types of explanations: global explanations that describe how the model works, and local explanations that attempt to justify the decision for a specific case. However, they did not examine how much information about the model is provided in the explanation. I think this is an interesting topic since we are investigating the impact of explanations on fairness judgment.

Question:

  1. Which type of explanations mentioned in this article would you prefer to see when you judge the fairness of the ML system?
  2. How does users’ perception of a machine learning system’s fairness influence the process of ensuring fairness when designing the system?
  3. This experiment was conducted as an online survey with crowd workers instead of judges. Do you think this would have any influence on the experiment results?

Word Count: 564

02/26/20 – Myles Frantz – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summation

Even though Machine Learning took the title of “The Best Technology Buzzword” away from cyber security a few years ago, there are two fundamental problems with it: understanding the percentage each feature contributes to an answer, and ensuring the model itself is fair. The first problem limits progress on the second and has spawned its own field of research: explainable artificial intelligence. Because of this, it is difficult to ensure that models treat sensitive data fairly and don’t learn biases. To help ensure models are understood as fair, this team has concentrated on automatically generating four different types of explanations for the rationale of the model: input-influence-based, demographic-based, sensitivity-based, and case-based. By showing these explanations of the same models to a group of crowd-workers, they were able to determine quantitatively that there is no single perfect explanation method. Instead, there must be a tailored and customized explanation method.

Response

I agree with the need for explainable machine learning, though I’m unsure about the impact of this team’s work. Using previous work for the four types and their own preprocessor, they seemingly resolved a question only by continuing it. This may be due to my lack of experience reading psychology papers, though their rationalization for the explanation styles and fairness in judgement seems commonplace. Two of the three conclusions wrapping up the quantitative study seemed appropriate: case-based explanation seemed less fair, while local-based explanation was more effective. The last conclusion, that people have a previous bias towards machine learning, seems redundant.

I can appreciate the lengths they went to in measuring the results against the mechanical turks. Although this seems to be an incremental paper (see the portion about their preprocessor), it may lead to more papers based on their gathered heuristics.

Questions

  • I wonder if the impact of the survey for the mechanical turks was limited due to only using the four different types of explanations studied. The conclusion of the paper indicated there is no good average and each explanation type was useful in one scenario or another. In this manner would different explanations lead to a good overall explanation?
  • A known and understood limitation of this paper was the use of mechanical turks instead of actual judges. This may be better as a representation of the jury; however, it is hard to measure the full impact without including the judge. It would be costly and time-consuming, though it would help to better represent the full scenario.
  • Given only the four different types of explanation, would there be room for a combined or collaborative explanation? Though this paper mostly focuses on generating the explanations, there should be room to combine the factors to potentially create an explanation that is good overall and on average, despite the paper limiting itself to only four explanations early on by fully utilizing the Binns et al. survey.

02/26/20 – Myles Frantz – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summation

Though Machine Learning techniques have advanced greatly within the past few years, human perception of the technology may severely limit its adoption and usage. To study how to better encourage and describe the process of using this technology, this team has created a Scheduling Assistant to monitor and elicit how users tune the system and react to different expectations. Users tune the expectations of the AI (Scheduling Assistant) via a slider (from “Fewer detections” on the left to “More detections” on the right) that directly alters the false positive and false negative settings in the AI. This direct feedback loop gave users (mechanical turk workers) more confidence and a better understanding of how the AI works. Given the wide variety of users, however, an AI focused on High Precision was not the best general setting for the scheduling assistant.
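
To make the slider idea concrete, here is a minimal sketch of how moving a detection threshold trades precision against recall. The scores, labels, and threshold values are made up for illustration, and mapping the slider to a single score threshold is my own assumption, not a description of how the paper's Scheduling Assistant is implemented.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when every score >= threshold is flagged as a meeting request."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Convention: with nothing flagged (or nothing to find), report 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical classifier scores for six emails and whether each truly
# contains a meeting request (illustrative data only).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

# Slider towards "Fewer detections": high threshold, favours precision.
fewer = precision_recall(scores, labels, threshold=0.70)
# Slider towards "More detections": low threshold, favours recall.
more = precision_recall(scores, labels, threshold=0.35)

print("fewer detections:", fewer)
print("more detections:", more)
```

In this toy data the high threshold flags no wrong emails but misses one real request, while the low threshold catches every request at the cost of one false positive, which is the trade-off the slider exposes to the user.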

Response

I like this kind of raw collaboration between the user and the Machine Learning system. This tailors the algorithm explicitly to the tendencies and mannerisms of each user, allowing easier customization and thus a higher likelihood of usage. This is supported by the team’s research hypothesis: “H1.1 An AI system focused on High Precision … will result in higher perceptions of accuracy …”. In this example each user (mechanical turk worker) was only using a subset of the Enron emails to confirm or deny the meeting suggestions. Speculating further, if this type of system were deployed at scale across many users, being able to tune the AI would greatly encourage use.

I also strongly agree with the slider bar for easy tuning by the individual. In this format the user does not need to have great technological skill to be able to use it, and it is relatively fast to use. Having it easily reachable within the same system also ensures a better connection between the user and the AI.

Questions

  • I would like to see a larger (and/or a beta) study done with a giant email provider. Each email provider likely has their own homegrown Machine Learning model; however, giving users the capability to further tune the AI to their own preferences and tendencies would be a great improvement. The only issue would be with scalability and providing enough services to make this work for all the users.
  • In tuning the ease of access and usability, I would like to see a study done comparing the different types of interaction tools (sliders, buttons, Likert scale settings, etc.). There likely is a study about the effectiveness of each type of interaction tool upon a system; however, in the context of AI settings it is imperative to have the best tool. This would hopefully become an adopted standard: an easy-to-use tool accessible to everyone.
  • Following along with the first question, I would like to see this kind of study provided to an internal mailing system, potentially at an academic level. Though this was studied with 150 mechanical turk workers and 400 internally provided workers, this was based on a sub-sample on the Enron email dataset. Providing this as a live-beta test in a widely and actively used email system with live emails would be a true test that I would like to see.

02/26/20 – Nan LI – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summary:

The key motivation of this paper is to investigate the factors that influence user satisfaction with and acceptance of an imperfect AI-powered system; the example used in this paper is an email Scheduling Assistant. To achieve this goal, the authors conducted a number of experiments based on three techniques for setting expectations: Accuracy Indicator, Example-based Explanation, and Performance Control. Before the experiments, the authors present three main research questions, which concern the impact of High Precision (low False Positives) versus High Recall on perceived accuracy and acceptance; effective design techniques for setting appropriate end-user expectations of AI systems; and the impact of expectation-setting intervention techniques. A series of hypotheses was also made before the experiments. Finally, the experiment results indicate that the expectation adjustment techniques demonstrated in the paper affected the intended aspects of expectations and were able to enhance user satisfaction with and acceptance of an imperfect AI system. Contrary to expectation, the conclusion is that a High Recall system yields higher user satisfaction and acceptance than a High Precision one.

Reflection

I think this paper addresses a critical concern of AI-powered systems from an interesting and practical direction. The whole approach of this paper reminds me of the previous paper, which summarized guidelines for Human-AI interaction. The first principle is to make clear to humans what the AI can do, and the second principle is to let humans understand how well the AI can do what it can do. Thus, I think the three expectation-adjusting techniques are designed to give the user a clear sense of these two guidelines. However, instead of using text only to inform the user, the authors designed three interfaces based on the principles of combining visualization and text and striving for simplicity.

These designs inform users of the system’s accuracy very intuitively. Besides, they also allow the user to control the detection accuracy, so that users can apply their own requirements. Thus, through several rounds of adjusting the control and experiencing the feedback, users finally align their expectations with an appropriate setting. I believe this is the main reason that these techniques can successfully increase user satisfaction with and acceptance of an imperfect AI system.

However, as users mentioned in the paper, the conclusion that users are more satisfied with and more accepting of a High Recall system than a High Precision one rests on the fact that, in the experiment platform, users can recover more easily from a False Positive than from a False Negative. In my view, the relative satisfaction with High Recall and High Precision should vary across AI systems. Nowadays, AI systems have been widely applied to high-stakes domains such as health care or criminal prediction. For these systems, we might want to tune different systems to optimize for different goals.

Questions:

  1. Can you see any other related guidelines applied to expectation adjustment techniques designed in the paper?
  2. Is there any other way that we can adjust the user expectation of an imperfect AI system?
  3. What do you think are the key factors that are able to decrease user expectations? Do you have a similar experience?

Word Count: 525

02/19/2020 – Ziyao Wang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

The authors point out that in human-AI hybrid decision-making systems, updates that aim to improve the accuracy of the AI system may harm the teamwork. Experienced workers who are advised by an AI system have built a mental model of it, which improves the correctness of the team’s results. However, updates that improve the accuracy of the AI system may create a mismatch between the updated model and the worker’s mental model. As a result, the user can no longer make appropriate decisions with the help of the AI system. In this paper, the researchers propose a platform named CAJA, which helps evaluate the compatibility between AI and human. With the results from experiments using CAJA, developers can learn how to make updates compatible while still achieving high accuracy.
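
One way to picture the accuracy/compatibility tension is with a small sketch. The score below (the share of cases the old model got right that the updated model still gets right) is a measure I am assuming for illustration; it may not match the paper's exact definition, and the predictions and labels are invented.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def compatibility(old_preds, new_preds, labels):
    """Share of cases the old model got right that the new model also gets right.

    Assumed, illustrative measure: 1.0 means the update never breaks a case
    the user had learned to trust; lower values mean more new errors there.
    """
    old_correct = [o == y for o, y in zip(old_preds, labels)]
    total = sum(old_correct)
    kept = sum(1 for ok, n, y in zip(old_correct, new_preds, labels) if ok and n == y)
    return kept / total if total else 1.0


# Invented data: six cases with binary labels.
labels = [1, 0, 1, 1, 0, 1]
old_model = [1, 0, 1, 0, 1, 1]  # correct on 4 of 6 cases
new_model = [1, 1, 1, 1, 0, 1]  # correct on 5 of 6 cases, but breaks case 2

old_acc = accuracy(old_model, labels)
new_acc = accuracy(new_model, labels)
score = compatibility(old_model, new_model, labels)

print(f"old accuracy={old_acc:.2f}, new accuracy={new_acc:.2f}, compatibility={score:.2f}")
```

Here the update is more accurate overall, yet it now fails on a case the old model handled correctly, which is exactly the kind of change that can clash with a worker's mental model.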

Reflection:

Before reading this paper, I believed that it is always good to have an AI system with higher accuracy. However, this paper gave me a new point of view. Instead of considering only the performance of systems, we should also consider the cooperation between the system and human workers. In this paper, updates to the AI system can destroy the mental model in the human’s mind. Experienced workers will have built up a good way of cooperating with the AI tools. They know which pieces of advice should be taken and which ones may contain errors. If a patch makes the system more accurate overall while reducing its correctness on the part that humans trust, the accuracy of the whole hybrid system will also be reduced. Humans may not trust the updated system until they find a new balance with it. During this period, the performance of the hybrid system may drop to a level even worse than that of the previous, non-updated system. For this reason, developers should also try to maximize the performance of the system before releasing the application to users. As a result, new updates will not make large changes to the system, and humans can become familiar with the updated system more easily.

We can learn from this that we should never ignore the interaction between humans and AI systems. A good interaction design can improve the performance of the whole system, while poor human-AI interaction may harm it. When we implement a system that needs both human affordances and AI affordances, we should pay more attention to the cooperation between human and AI. We should leverage the affordances of both sides instead of focusing only on the AI system. We should put ourselves in the position of the designer of the whole system, with a view of the overall situation, rather than considering ourselves only programmers who focus on the program.

Questions:

What are the criteria for deciding whether an update is compatible or not?

Would releasing instructions for each update to the users help reduce the harm of updates?

If we have a new version of a system that greatly improves accuracy, but the users’ mental model is totally different from it, how can we reach a balance that maximizes the performance of the whole hybrid system?

02/19/2020 – The Work of Sustaining Order in Wikipedia – Myles Frantz

Given an extensive website such as Wikipedia, there is bound to be an abundance of actors, both good and bad. With the scale and wide ruleset of the popular site, it would be nigh impossible for human moderators to handle the workload and cross-examine each page in depth. To alleviate this, programs that use machine learning were created to help track users’ activity across the site into a single repository. Once all the information is gathered there, a user acting in a malicious way can easily be caught by the system and auto-reverted based on the machine learning predictions. Such was the case for the user from the case study, who attempted to slander a famous musician but was caught quickly and with ease.

I absolutely agree with all the moderation going on around Wikipedia. Given the site domain, there are a vast number of pages that must be secured and protected (all to the same level). It is unrealistic to expect a non-profit website to be able to hire more manual workers to accomplish this same task (in contrast to YouTube or Facebook). Also, the amount of context that must be followed in order to fully track a malicious user down manually would be completely exhausting. On the security side, for malware tracking, there is a vast number of decompilers, raw binary program tracers, and even a custom Virtual Machine and Operating System (Security Onion) that contains various programs “out of the box” that are ready to track the full environment for the malware.

I disagree with one of the major issues raised, regarding the bots creating and executing their own moral agenda. Their behavior is completely learned and based on various factors (such as the rules, the training data, and correction values). Though they have the power to automatically revert and edit someone else’s page, these actions are taken at the discretion of the person who created the rules. There would likely be some issues, but that is part of the overall learning process. False positives could also be appealed if the author chooses to follow through, so the decision is not fully final.

  • I would expect that, with such a tool suite, there would be a tool that acts as a combination, a “Visual Studio Code”-like interface for all these tools. Having all these tools at the ready is useful; however, since time is of the essence, some tool wrapping all the common functions would be very convenient.
  • I would like to know how many reviews from moderators are completely biased. A moderator workforce should ideally be unbiased; however, realistically that is unlikely to fully happen.
  • I would also like to see the percentage of false positives, even in a system this robust. New moderators are likely to flag or unflag something incorrectly if they are unfamiliar with the rules.

2/19 – Dylan Finch – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Word count: 517

Summary of the Reading

This paper seeks to help make it faster and easier for teams to find their ideal team structure. While many services allow teams to test out many different team structures to find the best one, many of those services can take a lot of time and can greatly affect the people who work on the team. Often times they have to switch structures so often that it makes it hard for the teams to concentrate on getting work done. 

The method proposed in the paper seemed to be very successful. It resulted in teams that were 38-46% more effective. The system works by testing different team structures and taking automatically generated feedback information (like performance metrics) to figure out how effective each structure is. It then bases its future combinations on this feedback. Each time a new structure is tested, it varies along five dimensions: hierarchy, interaction patterns, norms of engagement, decision-making norms, and feedback norms.
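
To illustrate the general "try a structure, observe feedback, choose the next one" loop, here is a minimal sketch of a plain epsilon-greedy multi-armed bandit. This is a deliberately simpler technique than the temporally constrained bandits in the paper, and the arm names, success rates, and reward function are all hypothetical.

```python
import random


def epsilon_greedy(arms, reward_fn, rounds, epsilon=0.1, seed=0):
    """Plain epsilon-greedy bandit: explore with probability epsilon, else exploit.

    arms: list of arm names; reward_fn(arm, rng) returns a numeric reward.
    Returns (pull counts per arm, running mean reward per arm).
    """
    rng = random.Random(seed)
    counts = {arm: 0 for arm in arms}
    means = {arm: 0.0 for arm in arms}
    for _ in range(rounds):
        untried = [arm for arm in arms if counts[arm] == 0]
        if untried:
            arm = untried[0]  # try every arm once before exploiting
        elif rng.random() < epsilon:
            arm = rng.choice(arms)  # explore a random structure
        else:
            arm = max(arms, key=lambda a: means[a])  # exploit the best so far
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


# Hypothetical options for one dimension (hierarchy) with made-up,
# hidden success rates; a real system would observe team performance instead.
hidden_rate = {"no leader": 0.3, "single leader": 0.6, "rotating leader": 0.8}


def simulated_feedback(arm, rng):
    """Noisy 0/1 feedback standing in for a team performance metric."""
    return 1.0 if rng.random() < hidden_rate[arm] else 0.0


pulls, estimates = epsilon_greedy(list(hidden_rate), simulated_feedback, rounds=300)
print(pulls)
print(estimates)
```

A plain bandit like this can switch arms at any time, which is the disruption the summary above describes; as I understand it, the paper's contribution is to constrain when exploration happens so teams are not reorganized too often.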

Reflections and Connections

I think that this paper has an excellent idea for a system that can help teams to work better together. One of the most important things about a team is how it is structured. The structure of a team can make or break its effectiveness, so getting the structure right is very important to making an effective team. A tool like this that can help a team figure out the best structure with minimal interruption will be very useful to everyone in the business world who needs to manage a team. 

I also thought that it was a great idea to integrate the system into Slack. When I worked in industry last summer, all of the teams at my company used Slack. So, it makes a lot of sense to implement this new system in a system that people are already familiar with.  The use of Slack also allows the creators to make the system more friendly. I think it is much better to get feedback from a human-like Slack bot than some other heartless computer program. It is also very cool how the team members can interact with the bot in Slack. 

I also found the dimensions that they used in the team structures to be interesting. It is valuable to be able to classify teams in some concrete way based on certain dimensions of how they perform. This also has a lot of real world applications. I think that a lot of the time, one of the hardest things in any problem space is just to quantify the possible states of the system. They did this very nicely with the team dimensions and all of their values. 

Questions

  1. Would you recommend this system to your boss at your next job as a way to figure out how to organize the team?
  2. Aside from the ones listed in the paper, what do you think could be some limitations of the current system?
  3. Do you think that the possible structures had enough dimensions and values for each dimension?
