02/26/2020 – Nurendra Choudhary – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary

In this paper, the authors design explanations for machine learning models to study how they affect fairness perception. In this case, they study COMPAS, a model that predicts a criminal defendant’s likelihood of reoffending. They explain the drawbacks and fairness issues with COMPAS (it overestimates the risk for certain communities) and analyze how much Explainable AI (XAI) can change this fairness issue. They generated automatic explanations for COMPAS using previously developed templates (Binns et al. 2018). The explanations are based on four templates: Sensitivity, Case, Demographic, and Input-Influence.
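
To make the template styles concrete, here is a minimal sketch (my own illustration, not the authors’ actual templates or wording) of how an input-influence and a sensitivity explanation could be generated from a simple linear scoring model; the feature names, weights, and threshold are hypothetical.

```python
def input_influence_explanation(features, weights):
    """List each feature and how it pushed the risk score up or down."""
    lines = ["The prediction was influenced by the following attributes:"]
    for name, value in features.items():
        contribution = weights.get(name, 0.0) * value
        direction = "increased" if contribution > 0 else "decreased"
        lines.append(f"- {name} = {value} {direction} the risk score by {abs(contribution):.2f}")
    return "\n".join(lines)

def sensitivity_explanation(features, weights, threshold, feature, new_value):
    """Say whether changing a single feature would flip the predicted label."""
    score = sum(weights.get(k, 0.0) * v for k, v in features.items())
    changed = dict(features, **{feature: new_value})
    new_score = sum(weights.get(k, 0.0) * v for k, v in changed.items())
    flipped = (score >= threshold) != (new_score >= threshold)
    outcome = "would change" if flipped else "would not change"
    return (f"If {feature} were {new_value} instead of {features[feature]}, "
            f"the prediction {outcome}.")

# Hypothetical defendant profile and linear weights.
profile = {"prior_offenses": 3, "age": 24, "employed": 0}
weights = {"prior_offenses": 0.4, "age": -0.02, "employed": -0.5}
print(input_influence_explanation(profile, weights))
print(sensitivity_explanation(profile, weights, threshold=0.5,
                              feature="prior_offenses", new_value=0))
```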

The authors hired 160 Mechanical Turk workers who met certain criteria, such as US residence and Mechanical Turk experience. The worker pool is diverse, but demographic differences show no significant effect on the variance of the results. The experimental setup is a questionnaire that elicits each worker’s criteria for making fairness judgments. The results show that workers have heterogeneous criteria for making fairness judgments. Additionally, the experiment highlights two types of fairness issues: “unfair models (e.g., learned from biased data), and fairness discrepancy of different cases (e.g., in different regions of the feature space)”.

Reflection

AI works in a very opaque manner. The reason is that most algorithms detect patterns in a latent space that is incomprehensible to humans. The idea of using automatically generated standard templates to construct explanations of AI behaviour should be generalized to other AI research areas. The experiments show how human behavior changes in response to explanations. I believe such explanations could not only help the general population’s understanding but also help researchers narrow down the limitations of these systems.

From the case of COMPAS, I wonder about the future roles that interpretable AI makes possible. If AI is able to give explanations for its predictions, then I think it could play the role of an unbiased judge better than humans can. Societal biases are embedded in humans and might subconsciously affect our choices. Interpreting these choices in humans is a complex exercise in self-criticism. But for AI, systems such as the one in this paper can generate human-comprehensible explanations to validate their predictions, potentially making AI an objectively fairer judge than humans.

Additionally, evaluation metrics for AI lean toward improving overall predictive performance. However, I believe that comparable models that emphasize interpretability should be given more importance. A drawback of such metrics is the need for human evaluation of interpretability, which would slow the pace of AI development. We need to develop better evaluation strategies for interpretability. In this paper, the authors hired 160 Mechanical Turk workers; for a one-time evaluation, such a study is feasible. However, if this needs to be included in the regular AI development pipeline, we need more scalable approaches to avoid prohibitively expensive evaluation costs. One method could be to rely on a less diverse test set during the development phase and increase diversity to match the real-world problem setting.

Questions

  1. How difficult is it to provide such explanations for all AI fields? Would it help in progressing AI understanding and development?
  2. How should we balance between explainability and effectiveness of AI models? Is it valid to lose effectiveness in return for interpretability?
  3. Would interpretability lead to adoption of AI systems in sensitive matters such as judiciary and politics?
  4. Can we develop evaluation metrics around the suitability of AI systems for real-world scenarios?

Word Count: 567


02/26/2020 – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning – Yuhang Liu

This paper discusses people’s dependence on interpretability tools for machine learning. As mentioned in the article, machine learning (ML) models are commonly deployed in fields ranging from criminal justice to healthcare. Machine learning has gone beyond academia and has developed into an engineering discipline. To this end, interpretability tools have been designed to help data scientists and machine learning practitioners better understand how ML models work. This paper focuses on such software, which falls into two categories: the InterpretML implementation of GAMs (glass-box models) and the SHAP Python package (a post-hoc explanation technique for black-box models). The authors’ results show that users trust the machine-generated interpretations too much and rely too heavily on these interpretability tools; few users can accurately describe the visualizations these tools output. In the end, the authors conclude that the visualizations produced by interpretability tools can sometimes help data scientists find problems with datasets or models. For both tools, however, the existence of visualizations and the fact that the tools are publicly available have led to over-trust and misuse. Therefore, after the final experiments, the authors conclude that experts in human-computer interaction and machine learning need to work together; better interplay between the two will achieve better results.
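
For readers unfamiliar with the two tool families, here is a rough sketch of how they are typically invoked: a glass-box EBM (InterpretML’s GAM variant) versus SHAP as a post-hoc explainer for a black-box model. Exact APIs can vary across versions, and the dataset is just a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Glass-box route: the model itself is interpretable.
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)
show(ebm.explain_global())        # per-feature shape functions and importances

# Post-hoc route: explain an opaque model after the fact.
import shap
blackbox = GradientBoostingClassifier().fit(X, y)
explainer = shap.Explainer(blackbox, X)
shap_values = explainer(X.iloc[:200])
shap.plots.beeswarm(shap_values)  # global view assembled from local attributions
```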

First of all, after reading this article, I think that it is not only the interpretability tools that people over-trust; machine learning itself is also over-trusted, which may be caused by many factors, including the datasets. This reminds me of the course project I wanted to do this semester. My original motivation was that a single, standard dataset written by a large number of experts over a long time can lead the trained model to report an inflated accuracy rate, so I proposed using a crowdsourced dataset instead, which may yield better results.

Secondly, I very much agree with the final solution proposed by the authors, which is to better integrate human-computer interaction and machine learning as a future research direction. These interpretability tools are essentially visual displays of results; better human-computer interaction design allows users to extract and understand the results of machine learning and recognize the problems in them, instead of trusting the results blindly. In the future, fewer and fewer users will understand machine learning internals, but more people will use machine learning, and it will become more and more of a tool. So on one hand, the interaction side should be designed to help users understand their results; on the other hand, machine learning should become more versatile and able to adapt to more application scenarios. Only when both aspects are done well can these tools achieve their intended effect.

  1. Will machine learning be more academic or more tool-oriented in the future?
  2. If users do not know the meaning of the results, how can they understand the accuracy of the results more clearly without using interpretability software?
  3. The article mentions that joint efforts of human-computer interaction and machine learning will be required in the future; what changes should be made on the human-computer interaction side?


02/26/2020 – Dylan Finch – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Word count: 573

Summary of the Reading

This paper investigates explaining AI and ML systems. An easy way to explain AI and ML systems is to have another computer program help generate an explanation of how the AI or ML system works. This paper works towards that goal, comparing four different programmatically generated explanations of AI and ML systems and seeing how they impact judgments of fairness. These different explanations had a large impact on perceptions of fairness and bias in the systems, with a large degree of variation between the explanation styles.

Not only did the kind of explanation used have a large impact on the perceived fairness of the algorithm, but the participants’ pre-existing feelings towards AI, ML, and bias in these fields also had a profound impact on whether they saw the explanations as fair. People who did not already trust AI fairness distrusted all of the explanations equally.

Reflections and Connections

To start, I think that this type of work is extremely useful to the future of the AI and ML fields. We need to be able to explain how these kinds of systems work and there needs to be more research into that. This issue of explainable AI becomes even more important when we put it in the context of making AI fair to the people who have to interact with it. We need to be able to tell if an AI system that is deciding whether or not to free people from jail is fair or not. The only way we can really know if these models are fair or not is to have some way to explain the decisions that the AI systems make. 

I think that one of the most interesting parts of the paper is the variation across people in different circumstances in whether they thought the models were fair. Pre-existing ideas about whether AI systems are fair had a huge impact on whether people thought these models were fair when given an explanation of how they work. This shows what a human problem this is and how hard it can be to decide if a model is fair, even with access to an explanation. Views of the model will differ from person to person.

I also found it interesting how the type of explanation used had a big impact on the judgment of fairness. To me, this conjures up ideas of a future where the people who build algorithms can simply pick the right kind of explanation to prove that their algorithm is fair, in the same way companies now use language in a very questionable way. I think that this field still has a long way to go and that it will become increasingly important as AI penetrates more and more facets of our lives.

Questions

  1. When each explanation produces such different results, is it possible to make a concrete judgment on the fairness of an algorithm?
  2. Could we use computers or maybe even machine learning to decide if an algorithm is fair or would that just produce more problems?
  3. With so many different opinions, even when the same explanation is used, who should judge whether an algorithm is fair or not?


02/26/20 – Nan LI – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary:

The main objective of this paper is to investigate how people make fairness judgments of ML systems and how explanations impact those judgments. In particular, the authors explore the difference between global explanations, which describe how the model works, and local explanations, which are sensitivity- and case-based justifications of individual decisions. Besides, the authors also examine how individual differences in cognitive style and prior position on algorithmic fairness affect fairness judgments under different explanations. To achieve this goal, the authors conducted an online survey-style study with Amazon Mechanical Turk workers meeting specific criteria. The experimental results indicate that different explanations are effective for different kinds of fairness issues and user profiles. However, a hybrid explanation, which uses global explanations to comprehend and evaluate the model and local explanations to examine individual cases, may be essential for accurate fairness judgment. Furthermore, they also demonstrated that individuals’ prior positions on the fairness of algorithms affect their responses to different types of explanations.

Reflection:

First, I think this paper addresses a very critical and imminent topic. Since the development and deployment of machine learning and AI systems, ML predictions have been widely used to make decisions in high-stakes fields such as healthcare and criminal justice. However, society has great doubts about how these systems make decisions; people cannot accept, or even understand, why these important decisions should be left to an algorithm. As a result, the community’s calls for algorithmic transparency are growing louder. At this point, an effective, unbiased, and user-friendly interpretation of an ML system, one that enables the public to identify fairness problems, would not only help ensure the fairness of the ML system but also increase public trust in its output.

However, it is also tricky that there is no one-size-fits-all solution for an effective explanation. I do understand that different people will react differently to explanations; nevertheless, I was somewhat surprised that people have very different opinions in their judgments of fairness. Even though this is understandable considering their prior positions on the algorithm, their cognition, and their different backgrounds, it makes ensuring the fairness of a machine learning system more complex, since the system may need to take individual differences in fairness positions into account, which may require different corrective or adaptive actions.

Finally, this paper reminds me of another related topic. When we explain how a model works, how much information should we provide? What information should we withhold so that it cannot be abused? In this paper, the authors only mention that they provide two types of explanations: global explanations that describe how the model works, and local explanations that attempt to justify the decision for a specific case. However, they did not examine how much information about the underlying model is revealed in the explanation. I think this is an interesting question, given that we are investigating the impact of explanations on fairness judgment.

Question:

  1. Which type of explanation mentioned in this article would you prefer to see when judging the fairness of an ML system?
  2. How does users’ perception of machine learning system fairness influence the process of ensuring fairness when designing the system?
  3. This experiment was conducted as an online survey with crowd workers instead of judges; do you think this would influence the experimental results?

Word Count: 564


02/26/20 – Myles Frantz – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summation

Even though Machine Learning recently took the title of “Best Technology Buzzword” away from cyber security, there are two fundamental problems with it: understanding how much each feature contributes to an answer, and ensuring the model itself is fair. The first problem limits progress on the second and has spawned its own field of research: explainable artificial intelligence. Because of this, it is difficult to ensure models treat sensitive data fairly and don’t learn biases. To help ensure models are understood to be fair, this team concentrated on automatically generating four different types of explanations for the rationale of the model: input-influence-based, demographic-based, sensitivity-based, and case-based. By showing these explanations of the same models to a group of crowd workers, they were able to determine quantitatively that there is no single perfect explanation method; instead, explanations must be tailored and customized.

Response

I agree with the need for explainable machine learning, though I’m unsure about the impact of this team’s work. Using work done previously for the four explanation types and their own preprocessor, they seemingly resolved a question only by continuing it. This may be due to my lack of experience reading psychology papers, though their rationalization of the explanation styles and of fairness in judgment seems commonplace. Two of the three conclusions wrapping up the quantitative study seemed appropriate: case-based explanation seemed less fair, while local explanation was more effective. However, the third conclusion, that people carry prior biases toward machine learning, seems redundant.

I can appreciate the lengths they went to in measuring the results with the Mechanical Turk workers. Though this seems like an incremental paper (see the portion about their preprocessor), it may lead to more papers built on the heuristics they gathered.

Questions

  • I wonder if the impact of the survey with the Mechanical Turk workers was limited by using only the four types of explanations studied. The conclusion of the paper indicates there is no good average and that each explanation type was useful in one scenario or another. Would different explanations lead to a good overall explanation?
  • A known and understood limitation of this paper is the use of Mechanical Turk workers instead of actual judges. This may better represent a jury; however, it is hard to measure the full impact without including a judge. It would be costly and time-consuming, though it would help to better represent the full scenario.
  • Given only four types of explanation, is there room for a combined or collaborative explanation? Though this paper mostly focuses on generating the explanations, there should be room to combine the factors to potentially create a good overall explanation, even though the paper limits itself to the four explanations early on by fully relying on the Binns et al. survey.


02/26/20 – Myles Frantz – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summation

Though Machine Learning techniques have advanced greatly within the past few years, human perception of the technology may severely limit its adoption and usage. To study how to better set expectations for this technology, this team created a Scheduling Assistant and monitored how users tune it and react to different expectations. Users tune the AI’s behavior via a slider (from “Fewer detections” on the left to “More detections” on the right) that directly alters the false positive and false negative trade-off of the AI. This direct feedback loop gave users (Mechanical Turk workers) more confidence and a better understanding of how the AI works. Given the wide variety of users, however, an AI focused on High Precision was not the best general setting for the scheduling assistant.
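
As a rough illustration (my own sketch, not the authors’ implementation), the slider can be thought of as moving the classifier’s decision threshold, which is what trades false negatives against false positives; the names and ranges below are made up.

```python
import numpy as np

def threshold_from_slider(slider, t_min=0.2, t_max=0.9):
    """slider in [0, 1]: 0 = 'Fewer detections' (high threshold, fewer false positives),
    1 = 'More detections' (low threshold, fewer missed meetings)."""
    return t_max - slider * (t_max - t_min)

def detect_meetings(scores, slider):
    """scores: model-estimated probabilities that an email contains a meeting request."""
    return scores >= threshold_from_slider(slider)

scores = np.array([0.15, 0.42, 0.58, 0.71, 0.93])   # hypothetical model outputs
print(detect_meetings(scores, slider=0.0))  # strict: fewer detections, more false negatives
print(detect_meetings(scores, slider=1.0))  # lenient: more detections, more false positives
```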

Response

I like this kind of raw collaboration between the user and the Machine Learning system. It tailors the algorithm explicitly to the tendencies and mannerisms of each user, allowing easier customization and thus a higher likelihood of usage. This is supported by the team’s research hypothesis: “H1.1 An AI system focused on High Precision … will result in higher perceptions of accuracy …”. In this study each user (Mechanical Turk worker) was only using a subset of Enron emails to confirm or deny the meeting suggestions. Speculating further, if this type of system were used at scale across many users, being able to tune the AI would greatly encourage use.

I also strongly agree with the slider bar for easy tuning by the individual. In this format the user does not need great technical skill to use it, and it is relatively fast to use. Having it easily reachable within the same system also ensures a better connection between the user and the AI.

Questions

  • I would like to see a larger (or beta) study done with a major email provider. Each email provider likely has its own homegrown Machine Learning model; however, providing the capability to further tune the AI to each user’s preferences and tendencies would be a great improvement. The only issue would be scalability and providing enough services to make this work for all users.
  • On ease of access and usability, I would like to see a study comparing the different types of interaction tools (sliders, buttons, Likert-scale settings, etc.). There are likely studies on the effectiveness of each type of interaction tool in general, but in the context of AI settings it is imperative to have the best tool. This would hopefully become an adopted standard, an easy-to-use tool accessible to everyone.
  • Following on from the first question, I would like to see this kind of study applied to an internal mailing system, potentially at an academic institution. Though this was studied with 150 Mechanical Turk workers and 400 internally recruited workers, it was based on a sub-sample of the Enron email dataset. Providing this as a live beta test in a widely and actively used email system with live emails would be the true test I would like to see.


02/26/20 – Nan LI – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summary:

The key motivation of this paper is to investigate the factors that influence user satisfaction with and acceptance of an imperfect AI-powered system; the example used in the paper is an email Scheduling Assistant. To achieve this goal, the authors conducted a number of experiments based on three techniques for setting expectations: Accuracy Indicator, Example-based Explanation, and Performance Control. Before the experiments, the authors present three main research questions, which concern the impact of High Precision (low False Positives) versus High Recall on perceived accuracy and acceptance, effective design techniques for setting appropriate end-user expectations of AI systems, and the impact of expectation-setting intervention techniques. A series of hypotheses was also stated before the experiments. Finally, the results indicate that the expectation-adjustment techniques demonstrated in the paper affect the intended aspects of expectations and can enhance user satisfaction and acceptance of an imperfect AI system. Unexpectedly, the conclusion is that a High Recall system increases user satisfaction and acceptance more than a High Precision system.
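
To make the High Precision versus High Recall distinction concrete, here is a tiny illustration (with made-up labels and scores, not the paper’s data) showing that the two settings are just different thresholds on the same classifier’s scores.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true  = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])              # hypothetical labels
y_score = np.array([.1, .2, .3, .4, .5, .55, .6, .7, .8, .9])   # hypothetical scores

for name, t in [("High Precision setting", 0.7), ("High Recall setting", 0.3)]:
    y_pred = (y_score >= t).astype(int)
    print(name,
          "precision:", round(precision_score(y_true, y_pred), 2),
          "recall:", round(recall_score(y_true, y_pred), 2))
# High Precision: few false positives, but some meetings are missed.
# High Recall: almost no meetings are missed, at the cost of false positives.
```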

Reflection

I think this paper addresses a critical concern about AI-powered systems from an interesting and practical direction. The whole approach reminds me of the earlier paper that summarized guidelines for Human-AI interaction. The first principle is to make clear to people what the AI can do, and the second is to make clear how well the AI can do what it does. I think the three expectation-adjustment techniques are designed to give the user a clear sense of these two guidelines. However, instead of using text alone to inform the user, the authors designed three interfaces based on the principles of combining visualization with text and striving for simplicity.

These designs inform the user of the system’s accuracy very intuitively. Besides, they also allow the user to control the detection behavior, so that users can apply their own requirements. Thus, through several rounds of adjustment and feedback, users eventually align their expectations with an appropriate setting. I believe this is the main reason these techniques can successfully increase user satisfaction and acceptance of an imperfect AI system.

However, as the authors mention in the paper, the conclusion that users are more satisfied with and accepting of a High Recall system than a High Precision one rests on the fact that, in their experimental platform, users can recover from a False Positive more easily than from a False Negative. In my view, the preference between High Recall and High Precision will differ across AI systems. Nowadays, AI systems are widely applied in high-stakes domains such as health care and criminal prediction. For these systems, we might want to optimize for different goals.

Questions:

  1. Can you see any other related guidelines applied in the expectation-adjustment techniques designed in the paper?
  2. Is there any other way that we can adjust users’ expectations of an imperfect AI system?
  3. What do you think are the key factors that can decrease user expectations? Do you have a similar experience?

Word Count: 525


02/19/2020 – Ziyao Wang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

The authors point out that in a human-AI hybrid decision-making system, updates that aim to improve the accuracy of the AI may harm the teamwork. Experienced workers who are advised by the AI system have built a mental model of it, which improves the correctness of the team’s results. However, an update that improves the accuracy of the AI system may create a mismatch between the updated model and the worker’s mental model, so the user can no longer make appropriate decisions with the AI’s help. In this paper, the researchers propose a platform named CAJA, which helps evaluate the compatibility between AI and humans. With results from experiments using CAJA, developers can learn how to make updates compatible while still maintaining high accuracy.
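
One simple way to quantify what “compatible” means here (a sketch in the spirit of the paper’s idea, not the CAJA platform itself) is to measure what fraction of the examples the old model got right remain right after the update; the data below are dummies.

```python
import numpy as np

def compatibility(y_true, old_pred, new_pred):
    """Fraction of previously-correct examples that remain correct after the update."""
    old_correct = old_pred == y_true
    if old_correct.sum() == 0:
        return 1.0
    return float((new_pred[old_correct] == y_true[old_correct]).mean())

y_true   = np.array([1, 0, 1, 1, 0, 1])
old_pred = np.array([1, 0, 0, 1, 0, 1])   # old model: 5/6 correct
new_pred = np.array([1, 1, 1, 1, 0, 0])   # updated model: 4/6 correct overall

print("updated accuracy:", round((new_pred == y_true).mean(), 2))              # 0.67
print("compatibility:   ", round(compatibility(y_true, old_pred, new_pred), 2))  # 0.6
```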

Reflection:

Before reading this paper, I held the view that it is always better to have an AI system with higher accuracy. However, this paper provides a new point of view: instead of only the performance of the system, we should also consider the cooperation between the system and human workers. In this paper, updates to the AI system break the mental model in the worker’s mind. Experienced workers have built a good cooperative relationship with the AI tools: they know which pieces of advice should be taken and which ones may contain errors. If a patch makes the system more accurate overall while reducing the correctness of the parts humans trust, the accuracy of the whole hybrid system will be reduced. Humans may not trust the updated system until they reach a new equilibrium with it, and during this period the performance of the hybrid system may drop to a level even worse than keeping the previous, un-updated system. For this reason, developers should try to maximize the performance of the system before releasing the application to users, so that later updates do not make large changes to the system and humans can stay familiar with it.

We can learn from this that we should never ignore the interaction between humans and AI systems. A good design of the interaction can improve the performance of the whole system, while a system with poor human-AI interaction may harm it. When we implement a system that needs both human and AI affordances, we should pay more attention to the cooperation between human and AI and leverage the affordances of both sides, instead of focusing only on the AI. We should put ourselves in the position of the designer of the whole system, with a view of the overall situation, rather than consider ourselves just programmers who focus only on the program.

Questions:

What are the criteria for deciding whether an update is compatible or not?

Would releasing instructions for each update to the users be valuable in reducing the harm of updates?

If we have a new version of the system that greatly improves accuracy, but the users’ mental model is totally different from it, how do we reach a balance that maximizes the performance of the whole hybrid system?


02/19/20 – Lulwah AlKulaib – Dream Team

Summary

The authors note that previous HCI research focused on ideal team structures and how roles, norms, and interaction patterns are influenced by systems. That line of research directed teams toward those structures by increasing shared awareness, adding channels of communication, and convening effective collaborators. Yet organizational behavior research denies the existence of universally ideal team structures and holds that structural contingency theory has demonstrated that the best team structure depends on the task, the members, and other factors. The authors introduce DreamTeam, a system that identifies effective team structures for each team by adapting teams to different structures and evaluating each fit. DreamTeam explores over time, experimenting with values along many dimensions of team structure such as hierarchy, interaction patterns, and norms. The system uses feedback, such as team performance or satisfaction, to iteratively identify the team structures that best fit each team. It helps teams identify the structures that are most effective for them by experimenting with different structures over time using multi-armed bandits.
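
As a sketch of the underlying idea (not DreamTeam’s actual algorithm), an epsilon-greedy multi-armed bandit could treat each candidate structure along one dimension as an arm and use a team-performance score as the reward; the structures and numbers here are hypothetical.

```python
import random

arms = ["flat hierarchy", "single leader", "rotating leader"]   # hypothetical structures
counts  = {a: 0 for a in arms}
rewards = {a: 0.0 for a in arms}

def choose_structure(epsilon=0.1):
    """Mostly exploit the best-known structure, occasionally explore another."""
    if random.random() < epsilon:
        return random.choice(arms)
    # Untried arms get priority so every structure is evaluated at least once.
    return max(arms, key=lambda a: rewards[a] / counts[a] if counts[a] else float("inf"))

def record_feedback(arm, performance_score):
    """Update the running reward estimate after the team works under this structure."""
    counts[arm] += 1
    rewards[arm] += performance_score

# One round: adopt a structure for a work period, then feed back a score.
structure = choose_structure()
record_feedback(structure, performance_score=0.72)   # e.g., task quality or satisfaction
```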

Reflection

The paper presents a system that focuses on virtual teams. In my opinion, the presented system is a very specific solution to a very specific problem. The authors acknowledge a long list of limitations, including that they don’t believe their system generalizes easily to other problems. I also find the way they use feedback in the system complex and unclear: their reward function does not explain how qualitative factors are taken into consideration. The authors also mention that high-variance tasks would require more time for DreamTeam to converge.

That means more time to get a response from the system, and I don’t see how that would be useful if it slows teams down. Also, looking at the snapshot of the Slack integration, it seems that they gauge team satisfaction based on users’ responses to a task, which is not always how collaboration on Slack plays out; the enthusiasm of the responses seems out of the norm. The authors did not address how their system would assess “team satisfaction” when there is little to no response. Would that be counted as a negative response, or as neutral? And even though their system worked well for the very specific task they chose, it was also a virtual team, which raises questions about how this method would apply to in-person or hybrid teams. Their experimental environment was very tightly controlled, and even though they present a good idea, I doubt how applicable it is to real-life situations.

Discussion

  • In your opinion, what makes a dream team?
  • Are you pro or against ideal team structures? Why?
  • What were the qualities of collaborators in the best group project/research you had?
  • What makes the “chemistry” between team members?
  • What does a successful collaborative team project look like during a cycle?
  • What tools do you use in project management? 
  • Would you use DreamTeam in your project?
  • What would you change in DreamTeam to make it work better for you?


2/19 – Dylan Finch – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Word count: 565

Summary of the Reading

This paper analyzes the autonomous technologies used on Wikipedia. These technologies help keep the peace on the large platform, flagging malicious users and reverting inaccurate or spammy changes so that Wikipedia stays accurate and up to date. Many people may think that humans play the major role in policing the platform, but machines and algorithms also play a very large part, helping humans deal with the huge volume of edits.

Some tools are completely automated and can prevent vandalism with no human input. Other tools give human contributors tips to help them spot and fight vandalism. Humans work together with the automated systems and each other to edit the site and keep the pages vandal free. The way in which all of the editors edit together, even though they are not physically together or connected as a team, is an impressive feat of human and AI interaction.
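
To give a flavor of how such tools might combine weak signals, here is a purely illustrative toy scorer; it is not the logic of any real Wikipedia bot, and every signal and weight below is made up.

```python
BAD_WORDS = {"!!!", "lol", "stupid"}          # hypothetical signal list

def vandalism_score(edit_text: str, chars_removed: int, editor_is_anonymous: bool) -> float:
    """Combine a few weak signals into a rough 0-1 suspicion score."""
    score = 0.0
    if editor_is_anonymous:
        score += 0.3
    if chars_removed > 500:                    # large deletions look suspicious
        score += 0.4
    if any(w in edit_text.lower() for w in BAD_WORDS):
        score += 0.3
    return min(score, 1.0)

score = vandalism_score("this page is stupid lol", chars_removed=800, editor_is_anonymous=True)
print(score)   # 1.0 -> queue for automatic revert or human review
```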

Reflections and Connections

To start, I think that Wikipedia is such an interesting thing to examine for a paper like this. While many organizations have a similar structure, I think that Wikipedia is unique and interesting to study because it is so large, so distributed, and so widely used. It can be hard enough to get a small team of people to work together on documentation; at Wikipedia’s size, the complexities of making it all work must be unimaginable. It is fascinating to find out how machines and humans work together at that scale to keep the site running smoothly. The ideas and analysis here can easily be applied to smaller systems trying to accomplish the same thing.

I also think that this article serves as a great reminder of the power of AI. The fact that AI is able to do so much to help editors keep the site running smoothly, despite all of the complexities of the site, is amazing, and it shows just how much power AI can have when applied to the right situation. A lot of the work done on Wikipedia is not hard work. The article mentions some of the things that bots do, like importing data and fixing grammatical mistakes. These tasks are incredibly tedious for humans and yet perfect work for machines, which can do almost instantly what might take a human an hour. This serves as a great reminder of the power of AIs and humans complementing each other’s abilities, and it also shows what the internet makes possible. Something like this never would have been possible before in the history of human civilization, and the mere fact that we can do it now speaks to the amazing power of the current age.

Questions

  1. Does this research have applications elsewhere? What would be the best place to apply this analysis?
  2. Could this process ever be done with no human input whatsoever? Could Wikipedia one day be completely self-sufficient?
  3. This article talks a lot about how the bots of Wikipedia are becoming more and more important, compared to the policies and social interactions between editors. Is this happening elsewhere? Are there bots other places that we might not see and might not notice, even though they are doing a larger and larger share of the work?
