02/19/20 – Fanglan Chen – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

Bansal et al.’s paper “Updates in Human-AI Teams” explores an interesting problem: the influence of updates to an AI system on overall team performance. AI systems are now deployed to support human decision making in high-stakes domains such as criminal justice and healthcare. When humans and AI systems work as a team, humans make decisions with reference to the AI’s inferences. A successful partnership requires that the human develop an understanding of the AI system’s performance, especially its error boundary. Updates with higher-performing algorithms can increase the AI’s predictive accuracy; however, they may require humans to regain interactive experience and rebuild their confidence in the AI system, and this adjustment process may actually hurt team performance. The authors introduce the concept of compatibility between an AI update and prior user experience and present methods for studying the role of compatibility in human-AI teams. Extensive experiments on three high-stakes classification tasks (recidivism, credit risk, and mortality) demonstrate that updates made without regard for compatibility can decrease team performance even when predictive accuracy improves. To improve the compatibility of an update, the authors propose a re-training objective that penalizes new failures introduced by the updated AI system. Their proposed compatible updates achieve a good balance in the performance/compatibility trade-off across different tasks.
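
To make “penalizing new failures” concrete, here is a minimal sketch, assuming a binary classifier and numpy arrays of predicted probabilities; the function name, the lam weight, and the indicator-style penalty are illustrative rather than the paper’s exact (differentiable) formulation.

```python
import numpy as np

def compatibility_penalized_loss(y, p_new, p_old, lam=1.0):
    """Log loss of the updated model plus an extra penalty on examples that
    the previously deployed model classified correctly but the update gets wrong.

    y     : (n,) array of 0/1 labels
    p_new : (n,) predicted probability of class 1 from the updated model
    p_old : (n,) predicted probability of class 1 from the deployed model
    lam   : weight of the compatibility penalty (hypothetical hyperparameter)
    """
    eps = 1e-12
    log_loss = -(y * np.log(p_new + eps) + (1 - y) * np.log(1 - p_new + eps))
    old_correct = (p_old >= 0.5) == (y == 1)   # old model was right
    new_wrong = (p_new >= 0.5) != (y == 1)     # updated model is wrong
    new_failures = old_correct & new_wrong     # errors introduced by the update
    return float(np.mean(log_loss + lam * new_failures * log_loss))
```

Setting lam to 0 recovers ordinary retraining, while larger values trade a little raw accuracy for fewer newly introduced errors, which is exactly the trade-off the paper studies.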

Reflection

I think pairing AI and humans as a team to take full advantage of the collaboration is a pretty neat idea. Humans are born with the ability to adapt to an uncertain and adverse world and with the capacity for logical reasoning. Machines cannot perform well in those areas but can compute efficiently and free people for higher-level tasks. Understanding how machines can efficiently enhance what humans do best, and how humans can augment the scope of what machines can do, is the key to rethinking and redesigning current decision-making systems.

What I find interesting about the research problem discussed in this paper is that the authors focus on unifying the decisions made by humans and machines, rather than merely on task performance, when recommending updates. In machine learning with no human involved, the goal is usually to achieve ever-better performance as evaluated by metrics such as accuracy, precision, and recall. Compatible updates can be seen as machine learning models with similar decision boundaries but better performance, which seems an even more difficult goal to accomplish. To get there, humans need to perform crucial roles. First, humans must train machines to achieve good performance on certain tasks. Next, humans need to understand and be able to explain the outcomes of those tasks, especially where AI systems fail; that requires an interpretability component in the system. As AI systems increasingly draw conclusions through opaque processes (the so-called black-box problem), there is a large demand for human experts who can explain model behavior to non-expert users. Last but not least, humans need to sustain the responsible use of AI systems by, for example, updating them for better decision making as discussed in the paper. That would require a large body of human experts who continually work to ensure that AI systems are functioning properly, safely, and responsibly.

The above discussion is one side of the coin, focusing on how humans can extend what machines can achieve. The other side is comparatively less discussed in the current literature: beyond extending physical capabilities, how humans can learn from interaction with AI systems and enhance their individual abilities is an interesting question to explore. I would imagine that, in an advanced Human-AI team, humans and AI systems communicate in a more interactive way that allows both to learn collaboratively from their own mistakes and from the rationale behind the correct decisions made by the other. That leads to another question: if AI systems can rival or exceed humans in high-stakes decision making such as recidivism prediction and underwriting, how risky is it to hand these tasks over to machines? How can we decide when to let humans take control?

Discussion

I think the following questions are worthy of further discussion.

  • What can humans do that machines cannot and vice versa?
  • What is the goal of decision making and what factors are stopping humans or machines from making good decisions? 
  • In the Human-AI teams discussed in the paper, how can humans benefit from the interaction with the AI systems?
  • The partnership introduced by the authors is more like a human-assisting-machine approach. Can you provide some examples of machine-assisting-human approaches?

2/19/20 – Lee Lisle – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary

            Geiger and Ribes cover the case of using automated tools, or “bots,” to prevent vandalism on the popular user-generated online encyclopedia Wikipedia. The authors detail how editors use popular distributed-cognition coordination services such as “Huggle,” and argue that these coordination applications shape the creation and maintenance of Wikipedia as much as the traditional social roles of editors do. The team of humans and AI works together to fight vandalism in the form of rogue edits. They show how bot-assisted edits grew from essentially 0% of edits in 2006 to 12% in 2009, with editors relying on even more bot assistance beyond that. They then take a deep dive, using a method they term “trace ethnography,” into how the editors came to ban a single vandal who committed 20 bad edits to Wikipedia in an hour.

Personal Reflection

            This work was eye-opening in seeing exactly how Wikipedia editors leverage bots and other distributed cognition to maintain order in Wikipedia. Furthermore, after reading this, I am much more confident in the accuracy of articles contained on the website (possibly to the chagrin of teachers everywhere). I was surprised how easily attack edits were repelled by the Wikipedia editors, considering that hostile bot networks could be deployed against Wikipedia as well.

            I also enjoyed the analogy between managing Wikipedia and navigating a naval vessel, in that both leverage significant amounts of distributed cognition in order to succeed. Showing how many roles are needed to understand the various jobs and to collaborate across people was quite effective.

            Lastly, their focus (trace ethnography) on a single vandal was an effective way of portraying what is essentially daily life for these maintainers. I was somewhat surprised that only four people were involved before banning a user; I had figured that each vandal took much longer to identify and remedy. How the process proceeded, where the vandal got repeated warnings before a (temporary) ban occurred, and how the bots and humans worked together in order to come to this conclusion, was a fascinating process that I hadn’t seen written in a paper before.

Questions

  1. One bot that this article didn’t look into is a Twitter bot that tracked all changes on Wikipedia made by IP addresses used by congressional members (@CongressEdits). Its intended audience is not specifically the editors of Wikipedia, but how might this help them? How does this bot help the general public? (It has since been banned, in 2018.) How might a tool like this be abused?
  2. How might a trace ethnography be used in other applications for HCI? Does this approach make sense for domains other than global editors?
  3. How can Huggle (or the other tools) be changed in order to tackle a different application, such as version control? Would it be better than current tools?
  4. Is there a way to exploit this system for vandals? That is, are there any weaknesses to human/bot collaboration in this case?

02/19/2020 – Nan LI – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Summary:

The paper starts from the theory that there is no universally ideal structure for effective teamwork. The best structure for a team is determined by its members, tasks, surroundings, etc. Thus, this paper presents a system that searches for the optimal team structure by adapting different team structures to each team and evaluating the fit based on team performance and teamwork feedback. However, the combination of the diverse dimensions of team structure and the arms (the different values of each dimension) forms a large search space. To avoid overwhelming teams with too many changes, the paper also leverages a model called multi-armed bandits with temporal constraints, which limits the number of arm selections based on several factors. The authors tested the platform with AMT workers and evaluated the system’s performance on a designed task with performance measures. The experiments confirmed that no two teams had the same optimal team structure, and that the optimal structure even differed for the same group when completing different tasks. The results also indicate that the DreamTeam platform can promote highly efficient teamwork.

Reflection:

First, I highly agree with the opinion that there is no universal ideal structure for effective teamwork. Besides, searching for the optimal structure by adapting different dimensions and evaluating each fit also seems reasonable. However, I think the experiment could gather more valuable information if the authors tested the platform with real groups instead of randomly formed groups, because I think a premise of becoming a group and completing a task together is that the group members are familiar with each other. Thus, this platform should be most effective in the early stages of team formation: before the team members are familiar with each other, they can use this system to find a temporarily optimal team structure, so that they can quickly cooperate and work as a team even though they do not know each other. Nevertheless, as familiarity among the group members rises, using this method to determine the optimal structure may become inefficient, because they may have already found the best structure along some dimensions as they get along and gain more experience working together.

I also considered another situation that is well suited to this system: when a long-established team is assigned a new type of task. The team’s working mode may then need to switch so that they can complete the new task most efficiently, and at that point the system’s support is needed to find the new optimal structure.

Finally, I find the constraints method mentioned in the article very inspiring. Perhaps we could improve the effectiveness of the DreamTeam platform by allowing users to remove in advance the dimensions they would not like to change, for example hierarchy or interaction patterns. In that case, reducing the number of combinations is more conducive to exhaustive testing, and the adapted structure should fit the team better.

Question:

  1. What do you think about using computational power to decide the optimal structure for teamwork?
  2. In this paper, the authors recruit random testers to form groups and complete the task; do you think this influences the results?
  3. Under what conditions do you think this platform would be most beneficial?

Word Count: 544

02/19/2020 – Human-Machine Collaboration for Content Regulation – Myles Frantz

Since the dawn of the internet, it has surpassed many expectations and become prolific throughout everyday life. Though there was initially a lack of standards in website design and forum moderation, these have relatively stabilized through various scientific approaches. A popular forum site, Reddit, uses a human-led human-AI collaboration to help automatically and manually moderate its ever-growing comments and threads. Searching through the top 100 subreddits (at the time of writing), the research team surveyed moderators from 5 varied and highly active subreddits. Owing to the easy-to-use API provided by Reddit, one of the most used moderation tools, Automod, began as a third-party bot that was later incorporated into Reddit itself. It is one of the more popular and common tools used by moderators on Reddit. Since it is very extensible, there is no common standard across subreddits; moderators within the 5 subreddits use the bot in similar but distinct ways. It is not the only bot used by moderators, either; other bots can be used to interact with and streamline it in a similar fashion. However, due to the complexity of the bots (whether technological or from a lack of interest in learning the tool), some subreddits let only a few people manage them, sometimes with damaging results. When issues arise, rather than merely reacting to users’ complaints, the paper argues for more transparency about the bot.

I agree with the author of the original Automod, who started out making the bot purely to automate several steps. Carrying this forward with the scaling of Reddit, I do believe it would be impossible for human moderators alone to keep up with the “trolls”.

I do disagree, though, with how knowledge of the Automod rules is spread out. I believe decentralizing that knowledge would make the system more robust, especially since the moderators are volunteers. It is natural for people to avoid what they don’t understand, whether out of fear of it in general or fear of the repercussions, but I don’t think putting all of the work on one moderator is necessarily the right answer.

  • One of my questions regards one of the suggested outcomes for Reddit: granting more visibility into Automod’s actions. Given the scale of Reddit, extending this kind of functionality automatically could incur a considerable memory and storage overhead. Reddit already stores vast amounts of data, and potentially doubling the storage required (if every comment reviewed by Automod were logged) may be a downfall of this approach.
  • Instead of surveying the top 60%, I wonder whether surveying lower-ranked subreddits (via RedditMetrics) with fewer moderators would show the same pattern of Automod use. I would imagine they would be forced to use the Automod tool in more depth and breadth due to the lack of available resources, though this is pure speculation.
  • A final question would be: to what degree are bots duplicated across subreddits? If there is a large degree of duplication, it may lead to vastly different experiences across subreddits, as seemingly is the case now, potentially causing confusion among new or returning users.

02/19/2020 – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff – Yuhang Liu

This paper first discusses the complementarity between humans and artificial intelligence. In many cases, humans and AI form a team: people make decisions after checking the AI’s inferences. This cooperation model has applications in many fields and has achieved significant results. Usually, such achievements require certain prerequisites. First, people must form their own judgments about the AI’s conclusions; at the same time, the AI’s results must be accurate. The tacit cooperation between the two can then improve efficiency. However, with updates to the AI system and the expansion of data, this cooperation can be broken: an update may shift where the AI makes mistakes, and because the error boundary changes, people’s prior understanding of the AI no longer holds. So after a system update, efficiency may actually decrease. This paper mainly studies this situation. The authors want updates to remain compatible with the previous behavior, and propose several methods to achieve more compatible and accurate updates.

The authors suggest that this idea was obtained by analogy: in software engineering, an updated system is considered backward compatible if it still supports legacy software. I greatly appreciate this kind of analogy, which is similar to bionics; we can continuously bring new ideas into the computing field through this kind of thinking. The method proposed in this paper is also very necessary. In ordinary artificial intelligence or machine learning practice, we usually build datasets from scratch each time and lack a notion of inheritance, which is very inconvenient. Adopting the idea of compatibility will save a great deal of effort and allow systems to serve people more smoothly.

This article introduces CAJA, a platform for measuring the impact of AI performance and updates on team performance. At the same time, a practical retraining objective is introduced to improve update compatibility; the main idea is to penalize new errors introduced by the update. It can also be seen from the text that trust is at the core of teamwork. Admittedly, trust is the essence of a team, but as only the basis for working together, I think more simulation and improvement are needed. Drawing on how humans learn, we know that after learning new things people do not suffer a negative impact on their previous skills; instead, they gain more perspectives and methods for thinking about a problem. So I think humans and machines should be blended, that is, treated as a team as a whole, so that the results can be more compatible and the human-machine interaction more successful.

Questions:

  1. What are the implications of compatible AI updates?
  2. How can we better treat people and machines as a whole?
  3. Will compatible updates affect the final training results?

2/19/20 – Lee Lisle – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

            Bansal et al. discuss how human-AI teams work in solving high-stakes issues such as hospital patient discharge scenarios or credit risk assessment. They point out that the humans in these teams often create a mental model of the AI suggestions, where the mental model is an understanding of when the AI is likely wrong about the outcome. The authors then show that updates to the AI can produce worse performance if they are not compatible with the mental model the human user has already formed. They go on to define types of compatibility for AI updates, as well as a few other key terms relating to human/AI teams. They develop a platform to measure how compatibility can affect team performance, and then measure the effectiveness of AI update compatibility through a user study using 25 mTurk workers. In all, they show that incompatible updates reduce performance as compared to no update at all.

Personal Reflection

            The paper was an interesting study in the effect of pushing updates without considering the user involved in the process. I hadn’t thought of the human as an exactly equal player in the team, where the AI likely has more information and could provide a better suggestion. However, it makes sense that the human leverages other sources of information and forms a better understanding of what choice to ultimately make.

            CAJA, the human/AI simulation platform, seems like a good way to test AI updates; however, I struggle to see how it can be used to test other theories, as the authors seem to suggest. It is, essentially, a simple user-learning game, where users figure out when to trust the machine and when to deviate. While this isn’t exactly my field of expertise, I only see changing the information flows and the underlying AI as ways of learning new things about human/AI collaboration. This would mean that terming it a platform is a little excessive.

Questions

  1. The authors mention that, in order to defeat mTurk scammers who click through projects like these quickly, they drop the lowest quartile (in terms of performance) out of their results. Do you think this was an effective countermeasure, or could the authors be cutting good data?
  2. From other sources, such as Weapons of Math Destruction, we can read how some AI suggestions are inherently biased (even racist) due to input data. How might this change the authors results? Do you think this is taken into consideration at all?
  3. One suggestion near the end of the paper stated that, if pushing an incompatible update, the authors of the AI should make the change explicit so that the user could adjust accordingly. Do you think this is an acceptable tradeoff to not creating a compatible update?  Why or why not?
  4. The authors note that, as the error boundary f became more complex, errors increased, so they kept to relatively simple boundaries. Is this an effective choice for this system, considering real systems are extremely complex? Why or why not?
  5. The authors state that they wanted the “compute” cost to be net 0. Does this effectively simulate real-world experiences? Is the opportunity-cost the only net negative here?

02/19/2020 – Nan LI – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary:

In this paper, the authors observe that it is now prevalent for humans and AI to form a team to make decisions, with the AI providing recommendations and humans deciding whether to trust them. In these cases, successful team collaboration is mainly based on humans’ knowledge of the AI system’s previous performance. The authors therefore propose that although an update to the AI system may enhance its predictive accuracy, it can hurt team performance, since the updated version of the AI system is usually not compatible with the mental model humans developed with the previous AI system. To address this problem, the authors introduce the concept of compatibility of an AI update with prior user experience. To examine the role of this compatibility in Human-AI teams, they propose methods and design a platform called CAJA to measure the impact of updates on team performance. The outcomes show that team performance can be harmed even when the updated system’s predictive accuracy improves. Finally, the paper proposes a re-training objective that promotes the compatibility of updates. In conclusion, to avoid diminishing team performance, developers should build more compatible updates without surrendering performance.
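
For reference, the compatibility the authors introduce can be summarized, roughly in the spirit of the paper, as the fraction of examples the previously deployed model h1 got right that the updated model h2 also gets right; a fully compatible update introduces no new errors. A hedged sketch of the definition:

```latex
C(h_1, h_2) \;=\; \frac{\bigl|\{\, x : h_1(x) = y \ \text{and}\ h_2(x) = y \,\}\bigr|}{\bigl|\{\, x : h_1(x) = y \,\}\bigr|}
```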

Reflection:

In this paper, the authors discuss a specific kind of interaction, AI-advised human decision making, as in the patient readmission example presented in the paper. In such cases, an incompatible update of the AI system would indeed harm team performance. However, I think the extent of the impact largely depends on the degree of interdependence between the human and the AI system.

If the system and the user are highly interdependent, neither is a specialist on the task, and the system’s prediction accuracy and the user’s knowledge have equal impact on the decision, then an incompatible update of the AI system will weaken team performance. Even though this effect can eventually be eliminated as the user and the system break in together, the cost of degraded decisions in a high-stakes domain will be very large.

However, if the system interacts with users frequently but its predictions are only one of the considerations humans use to make decisions and cannot directly determine the decision, then the impact of incompatible updates on team performance will be limited.

Besides, if humans have more expertise on the task and can promptly validate the correctness of a recommendation, then neither the improvement in system performance nor the new errors caused by the update will have much impact on the results. On the other hand, if the errors caused by an update do not affect team performance, then when updating the system we do not need to consider compatibility and only need to consider improving system performance. In conclusion, if there is not enough interaction between the user and the system, the degree of interdependence is not high, or the system only serves as an auxiliary or double check, then a system update will not have a great impact on team performance.

A compatible update is indeed helpful for users to quickly adapt to the new system, but I think the impact of an update largely depends on the interdependence between the user and the system, or on how large a role the system plays in the teamwork.

Besides, designing a compatible update also incurs extra cost. Therefore, I think we should consider minimizing the impact of system errors on the decision-making process when designing the system and establishing the human-AI interaction.

Question:

  1. What do you think about the concept of compatibility of AI updates?
  2. Can you think of any human-AI system examples to which the authors’ theory applies?
  3. Under what circumstances do you think the authors’ theory is most applicable, and when is it not?
  4. When we need to update the system frequently, do you think it is better to build a compatible update or to use an alternative method to address the human adaptation cost?
  5. In my opinion, humans are highly adaptable, and the cost required for humans to adapt is much smaller than the cost of developing a compatible update. What do you think?

Word Count: 682

02/19/2020 – Vikram Mohanty – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary

This paper, through a case study, highlights the invisible distributed cognition process that goes on underneath a collaborative environment like Wikipedia, and how different actors, both human and non-human, come together to achieve a common goal: banning a vandal on Wikipedia. The authors show the usefulness of trace ethnography as a method for reconstructing user actions and better understanding the role each actor plays in the larger scheme of things. The paper advocates for not dismissing the role of bots as mere force multipliers, but for seeing them through a different lens considering the wide impact they have.

Reflection

Similar to the “Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator” paper, this paper is a great example that intelligent agents (AI-infused systems, bots, scripts, etc.) should not be studied in isolation only, but through a socio-technical lens. In my opinion, that provides a more comprehensive picture of the goals these agents can and cannot achieve, the collaboration processes they may inevitably transform, the human roles they may affect and other unintended consequences than performance/accuracy metrics alone.

Trace ethnography is a powerful method for reconstructing user actions in a distributed environment, and using that to understand how multiple actors (human and non-humans) achieve a complex objective, by sub-consciously collaborating with each other. The paper advocates that bots/automation/intelligent agents should not be seen as just force multipliers or irrelevant users. This is important as a lot of current evaluation metrics focus only on quantitative measures such as performance or accuracy. This paints an incomplete, and sometimes, an irresponsible picture of intelligent agents, as they have now evolved to assume an irreplaceable role in the larger scheme of things (or goals).

The final decision-making privilege resides with the human administrator, and the whole socio-technical pipeline assists each step of decision-making with all possible information available so that checks and bounds (or order, as the paper puts it) are maintained at every stage. Automated decisions, whenever taken, are grounded in some confidence of certainty. In my opinion, while building AI models, researchers should think about the AI-infused system or the real-world setting of which these algorithms would be a part. This might motivate researchers to make these algorithms more transparent or interpretable. The lens of the user who is going to wield these models/algorithms might help further.

It’s interesting to see some of the principles of mixed-initiative systems being used here i.e. history of the vandal’s actions, templated messages, showing statuses, etc.

Questions

  1. Do you plan to use trace ethnography in your proposed project? If so, how? Why do you think it’s going to make a difference?
  2. What are some of the risks and benefits of employing a fully automated pipeline in this particular case study i.e. banning a Wikipedia vandal?
  3. A democratic online platform like Wikipedia supports the notion of anyone coming in and making changes, and thus necessitates deploying moderation workflows to curb bad actors. However, if a platform were restrictive to some degree, a post-hoc setup might not be necessary and the platform might be less toxic. This need not apply only to Wikipedia; it can also extend to SNSs like Twitter, Facebook, etc. What would you prefer, a democratic platform or a restrictive one?

02/19/20 – Fanglan Chen – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Summary

Zhou et al.’s paper “In Search of the Dream Team” introduces DreamTeam, a system that identifies effective team structures for each group of individuals by suggesting different structures and evaluating the fit of each. How a team works relates to its team structure, including roles, norms, and interaction patterns. Prior organizational behavior research doubts the existence of universally perfect structures; the rationale is simple: teams exhibit great diversity, so a single structure cannot suit the full functioning of every team. The proposed DreamTeam explores values along five dimensions of team structure: hierarchy, interaction patterns, norms of engagement, decision-making norms, and feedback norms. The system leverages feedback metrics such as team performance or satisfaction to iteratively identify the team structures that best organize each team. The authors also design multi-armed bandits with temporal constraints, an algorithm that determines the timing of exploration-exploitation trade-offs across multiple dimensions to avoid overwhelming teams with too many changes. In the experiments, DreamTeam is integrated with the chat platform Slack and achieves better performance and more diverse team structures compared with baseline methods.
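
As a rough illustration of the “multi-armed bandits with temporal constraints” idea, the sketch below is my own simplification rather than the authors’ implementation (the class name, the UCB rule, and the min_dwell parameter are assumptions): one bandit per team-structure dimension, with an arm only allowed to change after a minimum dwell period so the team is not hit with too many changes at once.

```python
import math
import random

class ConstrainedBandit:
    """UCB bandit for one team-structure dimension with a temporal constraint."""

    def __init__(self, arms, min_dwell=3):
        self.arms = list(arms)                       # candidate values for this dimension
        self.counts = {a: 0 for a in self.arms}      # times each arm has been tried
        self.rewards = {a: 0.0 for a in self.arms}   # cumulative feedback per arm
        self.min_dwell = min_dwell                   # rounds to keep an arm before switching
        self.current = random.choice(self.arms)
        self.rounds_on_current = 0

    def select(self):
        # Temporal constraint: keep the current arm until the dwell period expires.
        if self.rounds_on_current < self.min_dwell:
            return self.current
        total = sum(self.counts.values()) + 1
        def ucb(a):
            if self.counts[a] == 0:
                return float("inf")                  # explore untried arms first
            mean = self.rewards[a] / self.counts[a]
            return mean + math.sqrt(2 * math.log(total) / self.counts[a])
        choice = max(self.arms, key=ucb)
        if choice != self.current:
            self.current = choice
            self.rounds_on_current = 0
        return self.current

    def update(self, reward):
        # Record feedback (e.g., a task-performance or satisfaction score) for the active arm.
        self.counts[self.current] += 1
        self.rewards[self.current] += reward
        self.rounds_on_current += 1
```

A DreamTeam-like controller would hold one such bandit for each of the five dimensions (hierarchy, interaction patterns, norms of engagement, decision-making norms, feedback norms) and call update() with the team’s performance or satisfaction score after each round of tasks.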

Reflection

The authors design a system to facilitate the organization of virtual teams. Beyond the several limitations mentioned in the paper, I feel the proposed DreamTeam system is based on a comparatively narrow scope of what makes a dream team, and it seems difficult to generalize the framework to a variety of domains or platforms.

In the first place, I do not agree that there is a universal approach to designing or evaluating a so-called dream team. The components that make a dream team vary across domains. For example, in sports, I would say personality and talent play important roles in forming a dream team. Actually, it goes beyond the term “forming”: a group of talented individuals not only brings technical expertise to the team, but also contributes passion and a strong work ethic, and strives for peak performance in the pursuit of excellence. To extend that point, working with people who have similar personalities, values, and pursuits brings a certain chemistry to the teamwork, which potentially enables challenging problem solving and strategic planning. None of this is captured by the proposed dimensions, and it is nearly impossible to evaluate quantitatively.

Also, I think it is important to make every team member understand their role, such as why they need to tackle the tasks and how that ties to a larger purpose beyond their own needs. This provides a clear purpose and direction for where a group of people needs to move forward as a team. I do not think the authors emphasize how such understanding influences team members’ level of commitment. In addition, this kind of unified purpose can avoid duplication of members’ efforts and prevent pulling those efforts in multiple directions.

Last but not least, in my opinion, maximizing rewards is not the ideal way to determine the best team structure. Human society treasures process as well as results. Teamwork can be seen as successful as long as the whole team is motivated and working toward the goal. If too much emphasis is put on results, the joy will be drained out of the job for the team. As long as progressive steps are made toward achieving the goal within a reasonable time frame, the team will keep getting better. Building an ambitious, driven, and passionate team is just the start; we need to ensure that the team members survive and are nurtured so that they can deliver on their targets.

Discussion

I think the following questions are worthy of further discussion.

  • If you are the CEO of a company or a university president, would you consider using the proposed DreamTeam system to help organize your own team? Why or why not?
  • Do you think the five bandit dimensions capture everything that makes a dream team?
  • Do you think the proposed DreamTeam system can be generalized to various domains? Are there any domains in which you think the system would not contribute to an efficient team structure?
  • Is there anything you can think of to improve the proposed DreamTeam system?

02/19/2020 – Vikram Mohanty – Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator

Summary

This paper thoroughly summarizes existing work on content regulation on online platforms, but focuses on the human-machine collaboration aspect of this domain, which hasn’t been widely studied. While most of the work on automated content regulation has been about introducing or improving algorithms, this work adopts a socio-technical lens, trying to understand how human moderators collaborate with automated moderator scripts. As most online platforms like Facebook and Google are fairly secretive about their moderation activities, the authors focus on Reddit, which allows moderators to use the Reddit API and develop their own scripts for each subreddit. The paper paints a comprehensive picture of how human moderators collaborate around the Automoderator script, the different roles they assume, the other tools they use, and the challenges they face. Finally, the paper proposes design suggestions that can facilitate a better collaborative experience.

Reflection

Even though the Reddit Automoderator cannot be classified as an AI/ML tool, this paper sets a great example of how researchers can better assess the impact of intelligent agents on users, their practices, and their behavior. In most instances, it is difficult for AI/ML model developers to foresee exactly how the algorithms/models they build are going to be used in the real world. This paper does a great job of highlighting how the moderators collaborate amongst themselves, how different levels of expertise and tech-savviness play a role in the collaboration process, how things like community guidelines are affected, and how the roles of humans changed due to the bots amidst them. Situating the evaluation of intelligent/automated agents in a real-world usage scenario can give us a lot of insight into where to direct our efforts for improvement or how to redesign the overall system/platform where the intelligent agent is being served.

It’s particularly interesting to see how users (or moderators) with different experience assume different roles with regard to how, or by whom, the Automoderator scripts get modified. It may be an empirical question, but is a quick transition from newcomer/novice to expert useful for the community’s health, or are the roles reserved for these newcomers/novices essential? If it’s the former, then ensuring a quick learning curve with the usage of these bots/intelligent agents should be a priority for developers. Simulating what content will be affected by a particular change in the algorithm/script, as suggested in the discussion, can foster a quick learning curve for users (in addition to achieving the goal of minimizing false positives).

While the paper comments on how these automated scripts support the moderators, it would have been interesting to see a comparative study of no Automoderator vs. Automoderator. Of course, that was not the goal of this paper, but it could have helped paint the picture that the Automoderator adds to user satisfaction. Also, as the paper mentions, the moderators value their current level of control in the whole moderation process, and would therefore be uncomfortable in a fully automated setting or one in which they could not explain their decisions. This has major design implications not just for content regulation, but for pretty much any complex, collaborative task. The fact that end-users developed and used their own scripts, tailored to the community’s needs, is promising and opens up possibilities for tools that users with little or no technical knowledge can use to easily build and test their own scripts/bots/ML models.

With the introduction of the Automoderator, the role of the human moderators changed from their traditional job of just moderating content to ensuring that the Automoderator’s rules are updated, preventing users from gaming the system, and minimizing false positives. Automation creating new roles for humans, instead of replacing them, is pretty evident here. As the role of AI increases in AI-infused systems, it is also important to assess user satisfaction with the new roles.

Questions

  1. Do you see yourself conducting AI model evaluation with a wider socio-technical lens of how they can affect the target users, their practices and behaviors? Or do you think, evaluating in isolation is sufficient?
  2. Would you advocate for AI-infused systems where the roles of human users, in the process of being transformed, get reduced to tedious, monotonous, repetitive tasks? Do you think the moderators in this paper enjoyed their new roles?
  3. Would you push for fully automated systems or ones where the user enjoys some degree of control over the process?
