02/19/2020 – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal – Sushmethaa Muhundan

The paper talks about the counter-vandalism process in Wikipedia, focusing on both the visible human efforts and the largely silent non-human efforts behind it. Fully automated anti-vandalism bots are a key part of this process and play a critical role in managing the content on Wikipedia. The actors involved range from fully autonomous software to semi-automated programs to user interfaces used by humans. A case study is presented: an account of detecting and banning a vandal. It aims to highlight the importance and impact of bots and assisted editing programs. Vandalism-reverting software uses queuing algorithms teamed with a ranking mechanism based on vandalism-identification algorithms. The queuing algorithm takes into account multiple factors, such as the kind of user who made the edit, the user's revert history, and the type of edit made. The software proves to be extremely effective in surfacing prospective vandals to reviewers. User talk pages are the forums used to take action after an offense has been reverted. This largely invisible infrastructure has been critical in insulating Wikipedia from vandals, spammers, and other malevolent editors.
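The ranked queue described above could be sketched roughly as follows. This is an illustrative assumption, not the actual Huggle or ClueBot implementation: the factor names and weights are invented for the example.

```python
import heapq

def suspicion_score(edit):
    """Hypothetical scoring of an edit; real tools use their own rules."""
    score = 0.0
    if edit["anonymous"]:                 # anonymous/IP editors rank higher
        score += 2.0
    score += 1.5 * edit["prior_reverts"]  # previously reverted users are suspect
    if edit["type"] == "section_blank":   # blanking content is a red flag
        score += 3.0
    return score

def build_review_queue(edits):
    # Max-priority queue: the most suspicious edits reach reviewers first.
    heap = [(-suspicion_score(e), i, e) for i, e in enumerate(edits)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

edits = [
    {"anonymous": False, "prior_reverts": 0, "type": "copyedit"},
    {"anonymous": True,  "prior_reverts": 2, "type": "section_blank"},
]
queue = build_review_queue(edits)  # the blanking edit is surfaced first
```

The key design point is that ranking, not filtering, is used: nothing is hidden from reviewers, but their attention is directed to the likeliest vandalism first.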

I feel that the case study helps us understand the internal workings of vandalism-reverting software, and it is a great example of handling a problem by leveraging the complementary strengths of AI and humans via technology. It is interesting to note that the cognitive work of identifying a vandal is distributed across a heterogeneous network and unified using technology! This lends speed and efficiency and makes the entire system robust. I found it particularly interesting that ClueBot, after identifying a vandal, reverted the edit within seconds. This edit did not have to wait in a queue for human review but was resolved immediately by the bot.

A pivotal feature of this ecosystem that I found fascinating is that domain expertise or skill is not required to handle such vandal cases. The only expertise required of vandal fighters is in the use of the assisted editing tools themselves, and in the kinds of commonsensical judgment those tools enable. This widens the pool of prospective workers, since specialized domain experts are not required.

  • The queuing algorithm takes into account multiple factors, such as the kind of user who made the edit, the user's revert history, and the type of edit made. Apart from the factors mentioned in the paper, what other factors could be incorporated into the queuing algorithm to improve its efficiency?
  • What are some innovative ideas that can be used to further minimize the turnaround reaction time to a vandal in this ecosystem?
  • What other tools can be used to leverage the complementary strengths of humans and AI using technology to detect and handle vandals in an efficient manner?

Read More

02/19/2020 – Updates in Human-AI teams: Understanding and Addressing the Performance/Compatibility Tradeoff – Sushmethaa Muhundan

The paper studies human-AI teams in decision-making settings, focusing specifically on updates made to the AI component and their influence on the human's decision-making process. In an AI-advised human decision-making interaction, the AI system recommends actions to the human. Based on this recommendation, along with past experience and domain knowledge, the human makes an informed decision: they can go ahead with the action recommended by the AI, or they can disregard the recommendation. Over the course of their interactions with AI systems, humans develop a mental model of the system, built by mapping scenarios where the AI's decisions were correct against those where they were incorrect, via the rewards and feedback the system provides. As part of the experiment, studies were conducted to establish relationships between updates to AI systems and team performance. User behavior was monitored using a custom platform, CAJA, built to gain insight into how updates to AI models affect the user's mental model and, consequently, team performance. Compatibility metrics were introduced, and several real-world domains were analyzed, including recidivism prediction, in-hospital mortality prediction, and credit risk assessment.

It was surprising to note that updates that improve the AI's performance may actually hurt team performance. My initial instinct was that as the AI's performance increased, team performance would increase proportionally, but this is not always the case. In some cases, despite an increase in the AI's performance, the new results are not consistent with the human's mental model; as a result, incorrect decisions are made based on past interactions with the AI, and overall team performance decreases. An interesting and relatable parallel is drawn to the concept of backward compatibility in software engineering. The notion of a compatible update is introduced through this analogy to describe the ideal scenario where updates to the AI do not introduce new errors.
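The idea of penalizing "new" errors can be written down as a modified training loss. The sketch below is a rough interpretation, not the paper's exact formulation: the function name, the 0.5 decision threshold, and the `lam` weight are assumptions for illustration.

```python
import math

def compatible_loss(y_true, p_new, p_old, lam=0.5, eps=1e-12):
    """Binary log loss plus an extra penalty on examples the old model
    classified correctly but the new model gets wrong (a sketch of a
    backward-compatibility-aware update loss)."""
    total = 0.0
    for y, pn, po in zip(y_true, p_new, p_old):
        pn = min(max(pn, eps), 1 - eps)
        base = -(y * math.log(pn) + (1 - y) * math.log(1 - pn))
        old_correct = (po > 0.5) == (y == 1)
        new_wrong = (pn > 0.5) != (y == 1)
        if old_correct and new_wrong:
            base += lam * base  # weight these "trust-breaking" errors more
        total += base
    return total / len(y_true)
```

With `lam = 0` this reduces to the ordinary log loss; increasing `lam` trades raw accuracy for consistency with the user's mental model of the old system.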

The platform developed to conduct the studies, CAJA, was an innovative way to overcome the challenges of testing in real-world settings. The platform abstracts away the specifics of problem-solving by presenting a range of problems that distill the essence of mental models and trust in one's AI teammate. It was very interesting to note that these problems were designed so that no human could be a task expert, thereby maximizing the importance of mental models and their influence on decision making.

  • What are some efficient means to share the summary of AI updates in a concise, human-readable form that captures the essence of the update along with the reason for the change?
  • What are some innovative ideas that can be used to reduce the cost incurred by re-training humans in an AI-advised human decision-making ecosystem?
  • How can developers be made more aware of the consequences of the updates made to the AI model on team performance? Would increasing awareness help improve team performance?

Read More

2/19/20 – Jooyoung Whang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

According to the paper, most developers of classification or prediction systems focus on the quality of the predictions but not on the system's team performance with the user. The authors introduce a problem that can arise under current model-training loss criteria and provide new methods that address it. To develop a clearer picture of users' interactions with a classifier system, the authors build a web-based game platform called Caja and conduct a user study on Amazon Mechanical Turk. They conclude that an increase in the system's performance does not necessarily mean that its team performance with users also increases. They also confirm that their proposed training method, using a new loss function and a new concept called dissonance, improves team performance.

I liked the authors' new perspective on human-AI collaboration and model training. Now that I think of it, not considering the users of a system during development contradicts what the system is trying to achieve. One thing that interested me was their definition of dissonance. The term is used to compare and link the old model of a system with the new, updated model in terms of user expectation. I saw that it penalizes a system when the new model misclassifies inputs that the old model used to get right. However, what if the users of the old system made predictions according to how the system was wrong? This may be a weird concern and probably an edge case, but if a user made decisions based on the belief that the system was wrong all the time, that person's team performance with the updated model will always be worse, even if the new system was trained with the suggested loss function.
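The quantity the dissonance idea protects can be computed directly: the fraction of examples the old model got right that the new model also gets right. This is a minimal sketch under my reading of the paper; the function name and list-based interface are my own.

```python
def compatibility(y_true, pred_old, pred_new):
    """Fraction of examples the old model got right that the new model
    also gets right; 1.0 means the update introduced no new errors."""
    old_right = [o == y for o, y in zip(pred_old, y_true)]
    both_right = [o == y and n == y
                  for o, n, y in zip(pred_old, pred_new, y_true)]
    return sum(both_right) / sum(old_right) if any(old_right) else 1.0

y   = [1, 0, 1, 1]
old = [1, 0, 0, 1]   # old model right on examples 0, 1, 3
new = [1, 1, 1, 1]   # new model right on 0, 2, 3 -- it breaks example 1
score = compatibility(y, old, new)   # 2 of the 3 preserved -> 2/3
```

Note that my edge case above is the mirror image: a user who relied on the old model's *errors* is invisible to this score, since it only tracks examples the old model got right.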

The following are the questions I had while reading the paper:

1. As I wrote in my reflection, do you think the newly proposed training method will be effective if users made decisions based on the idea that the system will always be wrong? Or is this too extreme and absurd a thought?

2. The design of Caja ensures that the user can never arrive at the solution on their own because too much about the problem domain is hidden from the user. However, this is often not the case in real-world scenarios: the user of a system is often also an expert in the related field. Does this reduce the quality and trustworthiness of this research's results? Why or why not?

3. The research started from the idea that interaction with users must be considered when making an update to an AI system, in this case for human-AI collaboration. What if it were the opposite? For example, there are AIs built to compete with humans, like AlphaGo. These types of AIs are also developed with the goal of producing the optimal solution for a given input without considering interaction with the user. How could training be modified to include users for competing AIs?

Read More

02/18/20 – Akshita Jha – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary:
“The Work of Sustaining Order in Wikipedia: The Banning of a Vandal” by Geiger and Ribes examines the role of software tools in the English Wikipedia, specifically autonomous and assisted editing. Wikipedia is a “free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.” Bots are “fully-automated software agents that perform algorithmically-defined tasks involved with editing, maintenance, and administration in Wikipedia.” Different bots have different functions, ranging from simple tasks like correcting grammatical errors to more complicated ones like detecting personal insults. The authors present a detailed case study, “The Banning of a Vandal,” and discuss “Huggle,” the most widely used assisted editing tool across Wikipedia, which queues edits for review. The user then has the option to perform a variety of actions, like ‘revert’ or ‘warn’, on each edit displayed, but cannot select which edit to review next. An anonymous user had been vandalizing multiple Wikipedia pages and was not discouraged by the warnings and comments from the moderators. Eventually, this rogue user was blocked through the network of moderators, or vandal fighters, and the bots, but the process was more cumbersome than expected. In addition to the quantitative and qualitative studies, the research also demonstrated the value of trace ethnography for studying such sociotechnical systems.

Reflections:
This is an interesting work. It was particularly insightful, as I was unaware of the role multiple bots play in Wikipedia editing. Bots and humans working cohesively have helped make Wikipedia the widely used resource it currently is. Making Wikipedia a free resource that allows editing by volunteers comes with a cost. This paper highlights the limitations of the Wikipedia bots and how a significant amount of effort is needed from multiple moderators to ban a vandal. Each moderator makes a local judgment, but the Wikipedia talk pages keep a record of all the warnings against a particular user. Certain kinds of vandalism, like inserting obscenities and profanities, are easy to detect. However, if a vandal deletes an important section from a Wikipedia page, identifying and rectifying it may take significant cognitive effort from moderators. An interesting question is how Wikipedia would be affected if it used a completely automated bot instead of the hybrid system it currently uses. Would the bots be able to determine the significance of an edit or a change? How would that change the moderators' behaviors and actions? Since automated tools help determine the kinds of social activities that are possible on Wikipedia, would a completely automated bot significantly alter Wikipedia and user involvement? It would also be interesting to see whether trace ethnography can be used to study Reddit, another big sociotechnical system.

Questions:
1. How did such a network come into place?
2. Do you think certain kinds of Wikipedia pages are more susceptible than others to vandalism?
3. Will completely automated bots help?
4. Can we conduct such a case study for Reddit? Why? Why not?

Read More

02/19/2020 – Nurendra Choudhary – Updates in Human-AI Teams

Summary

In this paper, the authors study human-AI team performance in contrast to the AI's individual performance and explain why this distinction matters. They describe the importance of human inference about AI tools: humans develop mental models of the AI's performance. Updates to the AI's algorithm are typically evaluated only on improvements in prediction. However, those improvements cause behavioral changes in the AI that do not fit the human's mental model and reduce the team's overall performance. To alleviate this, the authors propose a new log-loss-based objective that accounts for compatibility between human mental models and the AI model when making updates to it.

The authors conduct user studies to show how human mental models develop across different conditions. Additionally, they illustrate how overall team performance can degrade even as the AI's predictions improve. Furthermore, they show that adding the compatibility loss increases the team's overall performance while retaining the AI's predictive ability.

Reflection

Humans and AI form formidable teams in multiple environments, and I see such a study as a necessity for the further development of AI. Most state-of-the-art AI systems are not independently useful in the real world and rely on human intervention from time to time (as discussed in previous classes). As long as this situation persists, we cannot improve AI in isolation; we have to consider the humans involved in the task. The evaluation metrics currently used in AI research are focused entirely on the AI's predictions. This needs to change, and the paper is a great first step in that direction. I believe we should construct similar evaluation metrics for various other AI tasks. But if we develop our evaluation metrics around human-AI teams, we risk making AI systems reliant on human input, and there is a possibility that AI systems will never independently solve our problems. I believe the solution lies in interpretability.

Current AI techniques rely on statistical spaces that are not human-interpretable. Making these spaces interpretable enables human comprehensibility. Interpretable AI is a rising research topic in several subareas of AI, and I believe it can resolve the current dilemma: we could develop AI systems independently, and all the updates would be comprehensible to humans, who could update their mental models accordingly. But interpretability is not a trivial subject. Recent work has shown only incremental progress, and it still compromises prediction ability for interpretability. AI's effectiveness comes from its ability to recognize patterns in dimensions incomprehensible to human beings. Both the current paper and interpretability require human understanding of the model, and I am not sure this is always possible.

Questions

  1. Can we have evaluation metrics for other tasks based on this? Will it involve human evaluation? If so, how do we maintain comparative fairness across such metrics?
  2. If we continue evaluating Human-AI teams together, will we ever be able to develop completely independent AI systems?
  3. Should we focus on making the AI systems interpretable or their performance?
  4. Is interpretable AI the future for real-world systems? Imagine that, for every search query made, the user could see all the features that aid the system's decision-making process.

Word Count: 545

Read More

02/18/20 – Akshita Jha – Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator

Summary:
“Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator” by Jhaver et al. examines the popular social media website Reddit and its unusual collaboration between unpaid human moderators and an automated moderator. Reddit moderators use a heavily configurable automated program called ‘Automoderator’ to help decide which content should be removed from the website. The authors interview 16 Reddit moderators to understand how they benefit from the moderating tool ‘Automod’ and how they adapt and configure it to reflect a subreddit's policies and moderate it effectively. The authors also offer valuable insights that may benefit platform creators, designers of automated regulation systems, scholars of platform governance, and content moderators. They conclude by pointing out that the moderation system on Reddit is a collaborative effort between humans and automated systems. This hybrid system works, but there is definitely room for improvement in the development and deployment of these tools.

Reflections:
Online platforms can be a boon or a bane depending on how people choose to engage with them. Regulation might seem necessary to ensure that low-quality posts (which can be treated as noise) do not drown out informative and worthy posts on the site. However, this is a challenging task. Deciding whether a post is appropriate for a subreddit puts a lot of responsibility on the moderator. In some cases the moderator might be a bot, ‘Automod’; in other cases the platform relies on paid or unpaid volunteers. Reddit moderators are unpaid. The authors analyzed five different subreddits: ‘r/photoshopbattles’, ‘r/space’, ‘r/oddlysatisfying’, ‘r/explainlikeimfive’, and ‘r/politics’. It's interesting that some Reddit moderators prefer to implement moderation bots from scratch while others use tools made by others, and intriguing how using others' tools forms a sense of community of moderators within the larger Reddit community. Most subreddits use ‘Automod’, which was initially created by Chad Birch using the Reddit API in January 2012. However, a major drawback of this study is that all the moderators the authors interviewed were male. It would be helpful to get the perspective of female moderators, if there are any, since Reddit's user base is disproportionately male. I feel the authors should have selected ‘r/AskHistorians’ as one of the subreddits for analysis, since it is widely known to be highly moderated and content-driven. It would also have been interesting to dig into the comments that ‘Automod’ marked as offensive but that were not; this would help improve the moderator's performance while informing us of its limitations. One might also wonder about the consequences if a subreddit community grows larger: there might be a need to reflect on the existing tools and their scale.
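Automod is configured with rule definitions rather than code, and the paper's point is that moderators tune those rules to their subreddit's norms. The sketch below is a generic rule-matcher in that spirit; the rule fields and names are invented for illustration and are not Automod's actual YAML syntax.

```python
import re

# Hypothetical rules in the spirit of Automod configs (not real syntax).
RULES = [
    {"name": "no-links-from-new-users",
     "pattern": r"https?://", "max_account_age_days": 2, "action": "remove"},
    {"name": "profanity-filter",
     "pattern": r"\b(badword1|badword2)\b", "action": "report"},
]

def moderate(comment):
    """Return the (rule, action) pairs a rule-based moderator would apply."""
    actions = []
    for rule in RULES:
        if not re.search(rule["pattern"], comment["body"], re.IGNORECASE):
            continue
        max_age = rule.get("max_account_age_days")
        if max_age is not None and comment["account_age_days"] > max_age:
            continue  # this rule only targets new accounts
        actions.append((rule["name"], rule["action"]))
    return actions

spam = {"body": "check out http://spam.example", "account_age_days": 1}
actions = moderate(spam)  # the link rule fires; the profanity rule does not
```

A matcher like this also makes the false-positive question above concrete: logging every `(rule, action)` pair that a human later overturns is exactly the feedback loop that would reveal which rules are too aggressive.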

Questions:
1. Do you agree that social media content should be moderated?
2. What about the mental health of the moderators?
3. What kind of resources should be made available to the moderators, since they are dealing with sensitive content all the time?

Read More

02/19/2020 – Nurendra Choudhary – The Work of Sustaining Order in Wikipedia

Summary

In this paper, the authors discuss the problem of maintaining order in open-edit information corpora, specifically Wikipedia. They start by explaining the near-immunity of Wikipedia to vandalism, achieved through a synergy between humans and AI. Wikipedia is open to all editors, and the team behind the system is highly technical. However, the authors study how its immunity depends on the community's social behavior. They show that vandal fighters are networks of people who identify vandals based on a network of behavior. They are supported by AI tools, but banning a vandal is not yet a completely automated process. Banning a user requires individual editor judgments at a local level and a collective decision at a global level. This creates a heterogeneous network and emphasizes decision corroboration by different actors.

As stated in the conclusion, “this research has shown the salience of trace ethnography for the study of distributed sociotechnical systems.” Here, trace ethnography combines editors' abilities with data about their actions to analyze vandalism in Wikipedia.

Reflection

It is interesting to see that Wikipedia's vandal fighting involves such seamless cooperation between humans and AI. I think this is another case where AI can leverage human networks for support. More significantly, the tasks are not trivial; they require human specialization and not just plain effort. Collaboration is also a significant part of AI's capability. Human editors analyze articles in their local context. AI can efficiently combine the results and target the source of errors by building a heterogeneous network of such decisions. Further, human beings analyze these networks to ban vandals. This methodology applies the most important abilities of both humans and bots: the collaboration draws on the best attributes of humans (judgment) and of AI (pattern recognition). It also effectively deploys this collaboration against vandals, who are independent actors or small networks of bad actors without access to the bigger picture.

The methodology utilizes distributed work patterns to accomplish the different tasks of editing and moral agency. Distributing the work enables human involvement in trivial tasks. However, combining the results to draw logical inferences is not humanly possible, because the vast amount of data is incomprehensible to humans. But humans have the ability to develop algorithms that machines can apply at a larger scale to derive such inferences. The inferences do not have a fixed structure, though, and require human intelligence to turn them into actions against vandalism. Given that most cases of vandalism come from independent humans, a collaborative effort with AI can greatly turn the odds in the vandal fighters' favor, because AI aids humans by utilizing a bigger picture incomprehensible to humans alone.

Questions

  1. If vandals gain access to the network, will they be able to destroy the synergy?
  2. If there were more motivation, such as political or monetary gain, would it give rise to a mafia-like network of such bad actors? Would the current methodology still be valid in that case?
  3. Do we need a trustworthiness metric for each Wikipedia page? Can a page be used as a reference for absolute information?
  4. Wikipedia is a great example of crowdsourcing, and this is a great article on crowd control in these networks. Can this be extended to other crowdsourcing platforms like Amazon Mechanical Turk or information blogs?

Read More

02/05/20 – Mohannad Al Ameedi – Guidelines for Human-AI Interaction

Summary

In this paper, the authors propose 18 design guidelines for building human-AI infused systems. These guidelines aim to resolve the issues with many human-interaction systems that either don't follow guidelines or follow guidelines that have not been tested or evaluated. Such issues include producing offensive, unpredictable, or dangerous results, which may lead users to stop using these systems; proper guidelines are therefore necessary. In addition, advances in the AI field have introduced new user demands in areas like sound recognition, pattern recognition, and translation. The 18 guidelines help users understand what an AI system can and cannot do and how well it can do it; they call for showing information relevant to the task at hand with appropriate timing and in a way that fits the user's social and cultural context, making sure the user can request services when needed or dismiss unnecessary ones, offering explanations for why the system does certain things, maintaining a memory of the user's recent actions and learning from them, and more. The guidelines went through several phases: consolidating existing guidelines, heuristic evaluation, a user study, and expert evaluation and revision. The authors hope that these guidelines will help build better AI-infused systems that can scale and work well as the number of users grows and AI algorithms and systems advance.

Reflection

I found the idea of putting together design guidelines very interesting, as it creates a standard that can help in building human-AI systems, help in evaluating or testing them, and serve as a baseline when building large-scale AI-infused systems to avoid well-known issues seen in previous systems.

I also found the collection of academic and industry guidelines interesting, since it is drawn from over 20 years of human-computer interaction work, which is very valuable and rich information that can be used in different domains and fields.

I agree with the authors that AI-infused systems that do not follow such guidelines can be confusing, ineffective, and sometimes counterproductive when their suggestions or recommendations are irrelevant. That explains why some AI-enabled systems were popular for a time but could not satisfy user demands and eventually stopped being used.

Questions

  • Are these guidelines followed in Amazon Mechanical Turk?
  • The authors mention that there is a tradeoff between generality and specialization; what tradeoff factors do we need to consider?

Read More

02/05/20 – Ziyao Wang – Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research

The author surveys how crowdsourcing has been applied in machine learning research. First, previous research is reviewed to derive categories for the application of crowdsourcing in machine learning. The applications are broken into four categories: data generation, evaluating and debugging models, hybrid intelligence systems, and behavioral studies to inform machine learning research. In each category, the author discusses several specific areas, and for each area summarizes the related work. The author then analyzes how to better understand crowd workers. Though crowdsourcing has greatly helped machine learning research, the author does not ignore the problems in this system, such as dishonesty among workers. Finally, the survey offers researchers who apply crowdsourcing in machine learning several pieces of advice: maintain a good relationship with crowd workers, care about good task design, and use pilots.

Reflection:

From this survey, readers can gain a thorough view of the applications of crowdsourcing in machine learning research. It covers much of the state of the art in machine learning areas related to crowdsourcing. Traditional machine learning often faces problems such as a lack of data, models that cannot be properly evaluated, a lack of user feedback, or systems that are not trustworthy. With the application of crowdsourcing, these problems can be addressed with the help of crowd workers. Though this is only a survey of previous research, it gives readers a comprehensive view of this combination of technologies.

This survey reminds us of the importance of reviewing previous work. When we want to research a topic, there are thousands of papers that may help, and it is impossible to read them all. Instead, if there is a survey that summarizes previous work and categorizes it into more specific categories, we can easily get a comprehensive view of the topic, and new ideas may emerge. In this paper, through studying the four categories of crowdsourcing applications in machine learning, the author arrives at the idea of researching how to understand the crowd and finally makes suggestions for future researchers. Similarly, if we survey the area we want to work in for our projects, we may find out what is needed and what is novel in the field, which will contribute to the success of the projects and the development of the field.

Also, it is important to think critically. In this survey, though the author catalogs numerous contributions of crowdsourcing to machine learning research, he still discusses the potential risks of this application, for example, dishonesty among workers. This is important for future research and should not be ignored. In our projects, we should also think critically so that the drawbacks of the ideas we propose can be judged fairly and the projects can be practical and valuable.

Problems:

Which factors can contribute to a good task design?

Is there any solution that can eliminate the problem of dishonesty among workers, rather than merely mitigating it?

In experiments that aim to find out users' reactions to something, can the reactions of paid workers be considered similar to the reactions of real users?

Read More

02/05/2020 – The Role of Humans in Interactive Machine Learning – Subil Abraham

Reading: Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the People: The Role of Humans in Interactive Machine Learning. AI Magazine 35, 4: 105–120. https://doi.org/10.1609/aimag.v35i4.2513

Machine learning systems are typically built through collaboration between domain experts and ML experts. The domain experts provide data to the ML experts, who carefully configure and tune the ML model, which is then sent back to the domain experts for review; they recommend further changes, and the cycle continues until the model reaches an acceptable accuracy level. However, this tends to be a slow and frustrating process, and there is a need to involve the actual users in a more active manner. Hence, the study of interactive machine learning arose to identify how users can best interact with and improve ML models through faster, interactive feedback loops. This paper surveys the field, looking at what users like and don't like when teaching machines, what kinds of interfaces are best suited for these interaction cycles, and what unique interfaces can exist beyond the simple labeling-learning feedback loop.
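The labeling-learning feedback loop described above can be sketched with a toy learner that asks the user about the input it knows least about and updates immediately on each answer. This is a minimal illustration of the interaction pattern, not any system from the survey; the class and method names are my own.

```python
from collections import Counter

class InteractiveLabeler:
    """Toy interactive-ML loop: the model queries the feature it has the
    least evidence for, the user answers, and the model updates at once."""

    def __init__(self):
        self.counts = {}  # feature -> Counter of labels seen so far

    def predict(self, feature):
        c = self.counts.get(feature)
        return c.most_common(1)[0][0] if c else None  # majority label

    def query(self, features):
        # Ask the user about the feature with the fewest observed labels.
        return min(features,
                   key=lambda f: sum(self.counts.get(f, Counter()).values()))

    def teach(self, feature, label):
        self.counts.setdefault(feature, Counter())[label] += 1

learner = InteractiveLabeler()
learner.teach("short_text", "spam")                   # user gives one label
ask = learner.query(["short_text", "long_text"])      # model asks about the gap
learner.teach(ask, "ham")                             # user answers; model updates
```

Even this toy version shows the contrast with the expert-mediated cycle: every user action changes the model immediately, with no retraining handoff.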

When reading about the novel interfaces that exist for interactive machine learning, I find an interesting parallel between the development of the “Supporting Experimentation of Inputs” type of interface and that of text editors. The earliest text editor was the typewriter, where an input, once entered, could never be taken back; a correction required starting over or using ugly whiteout. With electronics came text editors where you could edit only one line at a time. And today we have advanced, feature-rich editors and IDEs with autocomplete suggestions, inline linting, and automatic type checking and error feedback. It would be interesting to see what the next stage of ML model editing would look like on this trajectory, going from simple “backspace key” experimentation to features paralleling what modern text editors offer for words. The idea of “Combining Models” as a way to create models draws another interesting parallel to car manufacturing, where cars went from being handcrafted to being built on an assembly line with standardized parts.

I also think their proposal for a universal language to connect the different ML fields might end up creating a language that is too general; the different fields, though initially unified, might split off again by using non-overlapping subsets of the language or by coining new words when the language is not specific enough.

  1. Is the task of creating a “universal language” a good thing? Or would we end up with something too general to be useful and cause fields to create their own subsets?
  2. What other kinds of parallels can we see in the development of machine learning interfaces, like the parallels to text editor development and car manufacturing?
  3. Where is the “Goldilocks zone” for ML systems that are giving context to the user for the sake of transparency? There is a spectrum between “Label this photo with no context” to “here is every minute detail, number of pixels, exact gps location, all sorts of other useless info”. How do we decide which information the ML system should provide as context?

Read More