02/26/20 – Nan LI – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary:

The main objective of this paper is to investigate how people make fairness judgments of an ML system and how explanations impact those judgments. In particular, the authors explore the difference between a global explanation, which describes how the model works, and local explanations, such as sensitivity-based and case-based explanations, which justify the decision for a specific case. They also demonstrate how individual differences in cognitive style and prior position on algorithmic fairness affect fairness judgments under different explanations. To achieve this goal, the authors conducted an online survey-style study with Amazon Mechanical Turk workers who met specific criteria. The experimental results indicate that different explanations are effective for different kinds of fairness issues and user profiles. However, a hybrid explanation that uses global explanations to comprehend and evaluate the model and local explanations to examine individual cases may be essential for accurate fairness judgment. Furthermore, the results show that individuals' prior positions on algorithmic fairness affect how they respond to different types of explanations.

Reflection:

First, I think this paper addresses a critical and pressing topic. Since machine learning and AI systems began to be widely deployed, ML predictions have been used to make decisions in high-stakes fields such as healthcare and criminal justice. However, society has great doubts about how these systems make decisions; many people cannot accept, or even understand, why such important decisions should be left to an algorithm. As a result, calls for algorithmic transparency keep growing louder. Against this backdrop, an effective, unbiased, and user-friendly explanation of an ML system that enables the public to identify fairness problems would not only help ensure the fairness of the system but also increase public trust in its output.

However, it is also tricky that there is no one-size-fits-all solution for an effective explanation. I do understand that different people react differently to explanations; nevertheless, I was somewhat surprised by how much people's judgments of fairness differ. Even though this is understandable given their prior positions on algorithms, their cognitive styles, and their different backgrounds, it makes ensuring the fairness of a machine learning system more complex, since the system may need to account for individual differences in fairness positions, which may require different corrective or adaptive actions.

Finally, this paper reminds me of another related topic: when we explain how the model works, how much information should we provide, and what information should we withhold so that it cannot be abused? In this paper, the authors only mention that they provide two types of explanations: global explanations that describe how the model works, and local explanations that attempt to justify the decision for a specific case. However, they did not examine how much information about the underlying model the explanations should reveal. I think this is an interesting question, since we are investigating the impact of explanations on fairness judgment.

Question:

  1. Which type of explanation mentioned in this article would you prefer to see when judging the fairness of an ML system?
  2. How does users' perception of a machine learning system's fairness influence the process of ensuring fairness when designing the system?
  3. This experiment was conducted as an online survey with crowd workers instead of judges; do you think this influenced the experimental results?

Word Count: 564


02/26/20 – Myles Frantz – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summation

Even though Machine Learning has recently taken the title of "Best Technology Buzzword" away from cyber security, there are two fundamental problems with it: understanding how much each feature contributes to an answer, and ensuring the model itself is fair. The first problem limits progress on the second and has spawned its own field of research: explainable artificial intelligence. Because of this, it is difficult to ensure models are fair to sensitive data and do not learn biases. To help ensure models are understood as fair, this team concentrated on automatically generating four different types of explanations for the rationale of the model, covering different views of the data: input-influence-based, demographic-based, sensitivity-based, and case-based. By showing these explanations of the same models to a group of crowd workers, they were able to determine quantitatively that there is no one perfect explanation method; instead, explanation methods must be tailored and customized.
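
As a rough illustration of two of these styles, here is a minimal sketch (my own, not the paper's code, with hypothetical feature names): an input-influence-based explanation reports per-feature contributions to one decision, while a case-based explanation surfaces the most similar past case and its outcome.

    # Minimal sketch of two explanation styles on a toy linear model (illustration only).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                      # hypothetical features: priors, age, charge_degree
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)
    model = LogisticRegression().fit(X, y)

    case = X[0]                                        # the decision being explained

    # Input-influence-based (local): each feature's contribution to this case's score.
    contributions = dict(zip(["priors", "age", "charge_degree"], model.coef_[0] * case))
    print("feature contributions:", contributions)

    # Case-based (local): the most similar other training example and its outcome.
    nearest = 1 + np.argmin(np.linalg.norm(X[1:] - case, axis=1))
    print("most similar case had outcome:", y[nearest])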

Response

I agree with the need for explainable machine learning, though I'm unsure about the impact of this team's work. Building on prior work for the four explanation types and their own preprocessor, they seem to have resolved one question only by raising another. This may be due to my lack of experience reading psychology papers, but their rationale for the explanation styles and for fairness in judgment seems commonplace. Two of the three conclusions wrapping up the quantitative study seemed appropriate: case-based explanations seemed less fair, while local explanations were more effective. However, the conclusion that people bring a prior bias toward machine learning seems redundant.

I can appreciate the lengths they went to in measuring the results with Mechanical Turk workers. This seems to be an incremental paper (see the portion about their preprocessor), and it may lead to more papers built on the heuristics they gathered.

Questions

  • I wonder whether the impact of the Mechanical Turk survey was limited by using only the four types of explanations studied. The paper's conclusion indicated that there is no good all-around explanation and that each explanation type was useful in one scenario or another. In that light, would different explanations lead to a good overall explanation?
  • A known and acknowledged limitation of this paper is the use of Mechanical Turk workers instead of actual judges. This may better represent a jury; however, it is hard to measure the full impact without including judges. Doing so would be costly and time-consuming, though it would better represent the full scenario.
  • Given only four types of explanation, is there room for a combined or collaborative explanation? Though this paper mostly focuses on generating the explanations, there should be room to combine them into an explanation that is good overall, even though the paper limits itself to the four explanation types early on by fully reusing the Binns et al. survey.


02/26/20 – Myles Frantz – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summation

Though Machine Learning techniques have advanced greatly in the past few years, human perception of the technology may severely limit its adoption and usage. To study how to better encourage and describe the use of this technology, this team created a Scheduling Assistant and examined how users tune the system and react to different expectations. Tuning the AI's behavior via a slider (from "Fewer detections" on the left to "More detections" on the right) directly alters the balance of false positives and false negatives in the AI. This direct feedback loop gave users (Mechanical Turk workers) more confidence and a better understanding of how the AI works. However, given the variety of users, an AI focused on High Precision was not the best general setting for the scheduling assistant.
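
To make the slider mechanism concrete, here is a minimal sketch (my own assumption of how such a control could work, not the paper's implementation): the slider position sets a decision threshold on the classifier's score, trading false positives against false negatives.

    # Minimal sketch: mapping a detection slider to a classification threshold (illustration only).
    import numpy as np

    def threshold_from_slider(slider_pos):
        # slider_pos in [0, 1]: 0 = "Fewer detections" (favors precision),
        # 1 = "More detections" (favors recall). The linear mapping is an assumption.
        return 0.9 - 0.8 * slider_pos

    def detect_meetings(scores, slider_pos):
        # Flag an email as containing a meeting request when its score clears the threshold.
        return scores >= threshold_from_slider(slider_pos)

    scores = np.array([0.15, 0.4, 0.55, 0.7, 0.92])    # toy scores from a hypothetical classifier
    print(detect_meetings(scores, 0.1))                # conservative: fewer false positives
    print(detect_meetings(scores, 0.9))                # aggressive: fewer false negatives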

Response

I like this kind of direct collaboration between the user and the machine learning system. It tailors the algorithm explicitly to the tendencies and mannerisms of each user, allowing easier customization and thus a higher likelihood of use. This is supported by the team's research hypothesis: "H1.1 An AI system focused on High Precision … will result in higher perceptions of accuracy …". In this experiment, each user (a Mechanical Turk worker) worked only with a subset of Enron emails to confirm or deny the meeting suggestions. Speculating further, if this type of system were deployed at scale across many users, being able to tune the AI would greatly encourage use.

I also strongly agree with the slider bar as an easy way for individuals to tune the system. In this format the user does not need great technical skill to use it, and it is relatively fast. Having it easily reachable within the same system also builds a better connection between the user and the AI.

Questions

  • I would like to see a larger (and/or beta) study done with a major email provider. Each email provider likely has its own homegrown machine learning model; however, giving users the ability to further tune the AI to their own preferences and tendencies would be a great improvement. The main issue would be scalability and providing enough capacity to make this work for all users.
  • On ease of access and usability, I would like to see a study comparing different types of interaction tools (sliders, buttons, Likert-scale settings, etc.). There is likely existing work on the effectiveness of each type of interaction tool, but in the context of AI settings it is imperative to have the best one. This could hopefully become an adopted standard: an easy-to-use control accessible to everyone.
  • Following on from the first question, I would like to see this kind of study run on an internal mailing system, potentially at an academic institution. Though the study used 150 Mechanical Turk workers and 400 internally recruited workers, it was based on a sub-sample of the Enron email dataset. Running it as a live beta test in a widely and actively used email system with real emails would be the true test I would like to see.


02/26/20 – Nan LI – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summary:

The key motivation of this paper is to investigate the factors that influence user satisfaction with, and acceptance of, an imperfect AI-powered system; the example used in the paper is an email Scheduling Assistant. To achieve this goal, the authors conducted a series of experiments based on three techniques for setting expectations: Accuracy Indicator, Example-based Explanation, and Performance Control. Before the experiments, the authors present three main research questions, concerning the effect of a High Precision (low False Positive) versus a High Recall system on perceived accuracy and acceptance; effective design techniques for setting appropriate end-user expectations of AI systems; and the impact of expectation-setting intervention techniques. A series of hypotheses was also stated before the experiments. Finally, the experimental results indicate that the expectation-adjustment techniques demonstrated in the paper affected the intended aspects of expectations and were able to enhance user satisfaction with and acceptance of an imperfect AI system. Unexpectedly, they conclude that a High Recall system increases user satisfaction and acceptance more than a High Precision one.
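
For reference, the standard definitions behind these two settings (textbook formulas, not code taken from the paper) make the trade-off explicit:

    # Precision favors few false positives; recall favors few false negatives.
    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)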

Reflection

I think this paper addresses a critical concern about AI-powered systems from an interesting and practical direction. The overall approach reminds me of a previous paper that summarized guidelines for human-AI interaction. The first guideline is to make clear to people what the AI can do, and the second is to make clear how well the AI can do it. I think the three expectation-adjustment techniques are designed to give the user a clear sense of both. However, instead of informing the user with text alone, the authors designed three interfaces based on the principles of combining visualization with text and striving for simplicity.

These designs convey the system's accuracy very intuitively. They also allow the user to control the detection behavior, so that users can apply their own requirements. Thus, through several rounds of adjusting the control and experiencing the feedback, users eventually align their expectations with an appropriate setting. I believe this is the main reason these techniques successfully increase user satisfaction with and acceptance of an imperfect AI system.

However, as noted in the paper, the conclusion that users are more satisfied with and accepting of a High Recall system than a High Precision one rests on the fact that, in their experimental platform, users can recover from a False Positive more easily than from a False Negative. In my view, the preference between High Recall and High Precision should differ across AI systems. Moreover, AI systems are now widely applied in high-stakes domains such as healthcare and criminal justice, and for those systems we might want to tune the balance differently to optimize for different goals.

Questions:

  1. Can you see any other related guidelines applied in the expectation-adjustment techniques designed in the paper?
  2. Is there any other way to adjust user expectations of an imperfect AI system?
  3. What do you think are the key factors that can lower user expectations? Have you had a similar experience?

Word Count: 525
