02/19/2020 – Bipasha Banerjee – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Summary:

The paper aims to find a "Dream Team" by adapting teams to different structures and evaluating each one. The authors try to identify the ideal team structure using a multi-armed bandit approach over time: the system selects the next structure to explore based on the reward earned in the previous round. They review extensive background research on groups in HCI, structural contingency theory from organizational behavior, and multi-armed bandits. A network of five bandits was created, one per dimension of team structure: hierarchy, interaction patterns, norms of engagement, decision-making norms, and feedback norms. Each dimension has several possible values; for hierarchy, for example, there are three: none, centralized (a leader is elected), and decentralized (decisions by majority vote). A global temporal constraint and dimensional temporal constraints determine at what stage teams are prepared to embrace changes and prevent too many dimensions from changing at once. The authors used the popular game Codenames, played through a Slack interface. They recruited 135 workers from Amazon Mechanical Turk and assigned them to five conditions: control, collectively chosen, manager chosen, bandit chosen, and DreamTeam. There were 35 teams, with seven teams per condition. DreamTeam-managed teams were found to outperform teams in the other conditions.
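
To make the mechanism concrete, below is a minimal sketch of how a per-dimension bandit with a limit on simultaneous changes might look. This is my own illustration, not the authors' implementation: the epsilon-greedy strategy, the constants, and the values listed for the non-hierarchy dimensions are all assumptions.

```python
import random

# Rough sketch (not the authors' implementation): one epsilon-greedy bandit per
# team-structure dimension, plus a crude stand-in for the dimensional temporal
# constraint that limits how many dimensions may change between rounds.
DIMENSIONS = {
    "hierarchy": ["none", "centralized", "decentralized"],    # values from the paper
    "interaction_patterns": ["open", "round_robin"],           # illustrative values
    "decision_making_norms": ["consensus", "leader_decides"],  # illustrative values
}

class DimensionBandit:
    def __init__(self, arms, epsilon=0.2):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}  # running mean reward per arm

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)                  # explore
        return max(self.arms, key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def next_structure(bandits, current, max_changes=1):
    """Propose the next team structure, changing at most `max_changes`
    dimensions per round (a stand-in for the dimensional temporal constraint)."""
    proposal = dict(current)
    for dim in random.sample(list(bandits), k=max_changes):
        proposal[dim] = bandits[dim].select()
    return proposal

bandits = {dim: DimensionBandit(vals) for dim, vals in DIMENSIONS.items()}
structure = {dim: vals[0] for dim, vals in DIMENSIONS.items()}
for round_reward in [0.3, 0.7, 0.5]:            # e.g., scores from rounds of play
    for dim, arm in structure.items():
        bandits[dim].update(arm, round_reward)
    structure = next_structure(bandits, structure)
```

In the paper, the reward would come from the team's performance on a round of Codenames, and the global and dimensional temporal constraints are handled more carefully than this single `max_changes` parameter.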

Reflection 

The paper was a nice read on selecting the ideal team structure to maximize productivity. The authors did extensive background research on team structures and included theories from both HCI and organizational behavior. Being from a CS background, I had no idea what team structure meant or the theory behind selecting an ideal structure. It was a very new concept for me, and the difference between the approaches taken in HCI and in organizational behavior was intriguing. The authors described their approach in detail and mathematically, which makes it easy to visualize both the problem and the method.

The most interesting section was the integration with Slack, where a Slack bot was used to guide the team with broadcast messages. It was interesting to see how different teams reacted to the messages the Slack bot posted: DreamTeam teams mostly adhered to the bot's suggestions, whereas some of the other conditions chose to ignore them. It would be good if the evaluation were also done on a different task. The game is relatively simple, and we don't know how the DreamTeam structure would perform on complicated tasks. It would be intriguing to see how this work could be extended.

The paper highlights a probabilistic approach to proposing the ideal team structure. One thing that was not very clear to me is how the Slack bot generates its suggestions beyond considering the current score and the best-performing structure. Is it using NLP techniques to deduce the sentiment of team members' comments and then posting a response?

Questions

  1. The authors used Slack to test their hypothesis. How would DreamTeam perform for real-life software development teams?
  2. The test subjects were Amazon Mechanical Turk workers, and the task was reasonably simple (the Codenames game). Would DreamTeam perform better than the other structures when the task is domain-specific and experts are involved? Would it lead to more conflicts?
  3. Could we use better NLP techniques and sentiment analysis to guide DreamTeam's suggestions?


02/19/2020 – Palakh Mignonne Jude – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

SUMMARY

In this paper, the authors focus on the efforts (both human and non-human) taken to moderate content on the English-language Wikipedia. The authors use trace ethnography to show how these 'non-human' technologies have transformed the way editing and moderation are performed on Wikipedia. These tools not only increase the speed and efficiency of the moderators, but also aid them in identifying changes that might otherwise have gone unnoticed; for example, the 'diff' feature for viewing a user's edits enables the 'vandal fighters' to easily spot malicious changes made to Wikipedia pages. The authors mention editing tools such as Huggle and Twinkle, as well as a bot called ClueBot that can examine edits and revert them based on a set of criteria such as obscenity, patent nonsense, and mass removal of content by a user. This synergy between tools and humans has helped monitor changes to Wikipedia in near real-time and has lowered the level of expertise required of reviewers, as an average volunteer with little to no knowledge of a domain is capable of performing these moderation tasks with the help of the aforementioned tools.
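
As a concrete (and entirely hypothetical) illustration of the kind of rule-based check described above, a revert decision over a single edit might look something like the sketch below; the word list, the threshold, and the function name are my own assumptions, not ClueBot's actual rules.

```python
# Hypothetical sketch of a ClueBot-style rule check on one edit.
# The rules (a banned-word list and a mass-removal threshold) only illustrate
# the criteria the paper mentions; they are not ClueBot's real logic.
BANNED_WORDS = {"badword1", "badword2"}   # placeholder obscenity list

def should_revert(old_text: str, new_text: str) -> bool:
    added_words = set(new_text.lower().split()) - set(old_text.lower().split())
    if added_words & BANNED_WORDS:
        return True                        # obscene content introduced
    if len(new_text) < 0.2 * len(old_text):
        return True                        # mass removal of content
    return False
```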

REFLECTION

I think it is interesting that the authors focus on the social effects that the various bots and assisted editing tools have on the activities carried out on Wikipedia. I especially liked the analogy drawn from the work of Ed Hutchins, in which a navigator comes to know the ship's trajectory through the distributed work of a dozen crew members; the authors note that blocking a vandal on Wikipedia similarly emerges from a complex network of interactions between software systems and human reviewers.

I thought it was interesting that the share of edits made by bots increased from 2-4% in 2006 to about 16.33% in just about four years, and this made me wonder what the current percentage of bot edits would be. The paper also mentions that the detection algorithms often discriminate against anonymous and newly registered users, which is why I found it interesting to learn that users were allowed to reconfigure their queues so that anonymous edits were not treated as more suspicious. The paper mentions ClueBot, which is capable of automatically reverting edits that contain obscene content; this made me wonder whether efforts have been made to develop bots that automatically revert edits containing hate speech and highly bigoted views.

QUESTIONS

  1. As indicated in the paper ‘Updates in Human-AI teams’, humans tend to form mental models when it comes to trusting machine recommendations. Considering that the editing tools in this paper are responsible for queuing the edits made as well as accurately keeping track of the number of warnings given to a user, do changes in the rules used by these tools affect human-machine team performance?
  2. Would restricting edits on Wikipedia to only users that are required to have non-anonymous login credentials (if not to the general public, non-anonymous to the moderators such as the implementation on Piazza wherein the professor can always view the true identity of the person posting the question) help lower the number of cases of vandalism?
  3. The study performed by this paper is now about 10 years old. What are the latest tools that are used by Wikipedia reviewers? How do they differ from the ones mentioned in this paper? Are more sophisticated detection methods employed by these newer tools? And which is the most popularly used assisted editing tool?


02/19/2020 – Palakh Mignonne Jude – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

SUMMARY

In this paper, the authors discuss the impact that updates made to an AI model can have on overall human-machine team performance. They describe the mental model that a human develops over the course of interacting with an AI system and how it is disrupted when the AI system is updated. They introduce the notion of 'compatible' AI updates and propose a new training objective that penalizes new errors (errors introduced by the new model that were not present in the original model). The authors introduce terms such as 'locally compatible updates', 'compatibility score', and 'globally compatible updates'. They performed experiments on high-stakes domains such as recidivism prediction, in-hospital mortality prediction, and credit risk assessment. They also developed CAJA, a web-based game platform for studying human-AI teams in which, the authors claim, no human can be a task expert. CAJA enables designers to vary parameters including the number of human-visible features, the AI's accuracy, and the reward function.
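
For intuition, a hedged sketch of a compatibility-style metric is shown below: the fraction of examples the old model got right that the updated model still gets right. The function name and exact formulation are my own simplification and may differ in detail from the paper's definition.

```python
# Hedged sketch of a compatibility-style score between an old and an updated
# model: 1.0 means the update introduced no new errors on examples the old
# model handled correctly.
def compatibility_score(y_true, old_preds, new_preds):
    kept = [(n, t) for t, o, n in zip(y_true, old_preds, new_preds) if o == t]
    if not kept:
        return 1.0
    return sum(n == t for n, t in kept) / len(kept)

# Example: the update fixes one old error but breaks one previously correct case.
print(compatibility_score([1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 1, 0]))  # -> 0.666...
```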

REFLECTION

I think this paper was very interesting, as I had never considered the impact that updates to an AI system can have on team performance. The idea of a mental model, as introduced by the authors, was novel to me, since I had never thought about the human aspect of utilizing AI systems that make recommendations. This paper reminded me of the multiple affordances mentioned in the paper 'An Affordance-Based Framework for Human Computation and Human-Computer Collaboration', wherein humans and machines pursue a common goal by leveraging each other's strengths.

I thought it was good that they defined the notion of compatibility to include the human's mental model, and I agree that developers retraining AI models tend to focus on improving the model's accuracy while ignoring the details of human-AI teaming.

I was also happy to read that the workers used as part of the study performed in this paper were paid on average $20/hour as per the ethical guidelines for requesters.

QUESTIONS

  1. The paper mentions the use of Logistic Regression and a multi-layer perceptron. Would a more detailed study of the types of classifiers used in such systems help?
  2. Would ML models that have better interpretability for the decisions made have given better initial results and prevented the dip in team performance? In such cases, would providing a simple ‘change log’ (as is done in a case of other software applications), have aided in preventing this dip in team performance or would it have still been confusing to the humans interacting with the system?
  3. How were the workers selected for the studies performed on the CAJA platform? Were there any specific criteria used to select such workers? Would the qualifications of the workers have affected the results in any way?


02/19/2020 – Sukrit Venkatagiri – In Search of the Dream Team

Paper:  Sharon Zhou, Melissa Valentine, and Michael S. Bernstein. 2018. In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18), 1–13. https://doi.org/10.1145/3173574.3173682

Summary: This paper introduces a system called DreamTeam that explores the search space of team designs in an online setting. The system does this through multi-armed bandits with temporal constraints, a type of algorithm that manages the timing of exploration-exploitation trade-offs across multiple bandits simultaneously. This addresses a classic question in HCI and CSCW: when should teams favor one approach over another? The paper contributes a computational method for identifying good team structures, a system that implements it, and an evaluation showing performance improvements of 46%. The paper concludes with a discussion of computational partners for improving group work, such as pointing out our biases and inherent limitations and helping us replan when the environment shifts.

Reflection:

I appreciate the way they evaluated the system and conducted randomized controlled trials for each of their experimental conditions. The evaluation was done on a collaborative intellective task, and I wonder how different the findings would be if the system had been evaluated on a creative task instead of an intellective or analytic one. Perhaps there is a different optimal "dream team" based not only on the people but also on the task itself.

I also appreciate the thorough system description and the fact that the system was integrated within Slack rather than built as its own standalone system. This increases the real-world generalizability of the system and also makes it easier for others to build on top of. In addition, hiring workers in real time would have been hard, and it's unclear how synchronous or asynchronous the study was.

One interesting aspect of the approach is that it manages exploration and exploitation simultaneously across the bandits. I wonder how the system might have fared if teams were given the choice to explore each structure on their own; probably worse.

Another interesting aspect is that the evaluation was conducted with strangers on MTurk. I wonder if the results would have differed if it had been a) in a co-located setting and/or b) among coworkers who already knew each other.

Finally, the paper is nearly two years old, and I don't see any follow-up work evaluating this system in the wild. I wonder why. Perhaps there is not much to gain through an in-the-wild evaluation, or perhaps such an evaluation did not fare well. Either way, it would be interesting to read about the results, good or bad.

Questions:

  1. Have you thought about building a Slack integration for your project instead of a standalone system?
  2. How might this system function differently if it were for a creative task such as movie animation?
  3. How would you evaluate such a system in the wild?


02/19/2020 – Sukrit Venkatagiri – The Case of Reddit Automoderator

Paper: Shagun Jhaver, Iris Birman, Eric Gilbert, and Amy Bruckman. 2019. Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator. ACM Transactions on Computer-Human Interaction (TOCHI) 26, 5: 31:1–31:35. https://doi.org/10.1145/3338243

Summary: This paper studies Reddit's Automod, a rule-based moderation tool that automatically filters content on subreddits and can be customized by moderators to suit each subreddit. The paper sought to understand how moderators use Automod and what advantages and challenges it presents. The authors discuss their findings in detail: there is a need for audit tools to tune Automod's performance, for a repository for sharing moderation tools, and for a better division of labor between human and machine decision-making. They conclude with a discussion of the sociotechnical practices that shape the use of these tools, how the tools help workers maintain their communities, and the challenges and limitations involved, as well as solutions that may help address them.
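
To picture what "rule-based" means here, below is a small, purely illustrative Python sketch of per-subreddit rules applied to an incoming comment. Automod itself is configured differently (moderators write rules in its own configuration format); the rule fields, actions, and thresholds below are my assumptions, not Automod's actual syntax.

```python
# Purely illustrative rule set for one subreddit; not Automod's real config.
RULES = [
    {"match_phrases": {"buy now", "free giveaway"}, "action": "remove",
     "reason": "spam phrase"},
    {"min_account_age_days": 2, "action": "filter",
     "reason": "account too new"},
]

def apply_rules(comment_text, account_age_days):
    """Return (action, reason) for the first matching rule, else ('approve', None)."""
    text = comment_text.lower()
    for rule in RULES:
        if any(p in text for p in rule.get("match_phrases", ())):
            return rule["action"], rule["reason"]
        min_age = rule.get("min_account_age_days")
        if min_age is not None and account_age_days < min_age:
            return rule["action"], rule["reason"]
    return "approve", None

print(apply_rules("Buy now and win a free giveaway!", account_age_days=30))
# ('remove', 'spam phrase')
```

The paper's point is that moderators write and tune exactly this kind of rule set by hand, which is where the need for audit tools and shared repositories comes from.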

Reflection:

I appreciate that the authors were embedded within the Reddit community for over one year and that they provide concrete recommendations for creators of new and existing platforms, for designers and researchers interested in automated content moderation, for scholars of platform governance, and for content moderators themselves.

I also appreciate the deep and thorough qualitative nature of the study, along with the screenshots; however, the paper may be too long and too detailed in some aspects. I wish there were a "mini" version of this paper. The quotes themselves were exciting and exemplary of the problems the users faced.

The finding that different subreddits configured and used Automod differently was interesting, and I wonder how much a moderator's skills and background affect whether and in what ways they configure and use Automod. Lastly, the conclusion is very valuable, especially as it is targeted towards different groups within and outside of academia.

Two themes that emerged, "becoming/continuing to be a moderator" and "recruiting new moderators", sound interesting, but I wonder why they were left out of the results. The paper does not provide any explanation in this regard.

Questions:

  1. How do you think subreddits might differ in their use of Automod based on their moderators' technical abilities?
  2. How can we teach people to use Automod better?
  3. What are the limitations of Automod? How can they be overcome through ML methods?


02/19/2020 – Bipasha Banerjee – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary: 

The paper discusses the software tools that help moderate content posted or altered on Wikipedia, the well-known online encyclopedia. Wikipedia was built on the idea that anyone with an internet connection and a device could edit pages on the platform. However, platforms with an "anyone can edit" mantra are prone to malicious users, a.k.a. vandals. Vandals are people who post inappropriate content in the form of text alteration, the introduction of offensive content, and so on. Human moderators can scan for offensive content and remove it. However, this is a tedious task: it is impossible for humans to monitor huge amounts of content and look for small changes hidden in a large body of text. To aid humans, there are automated and semi-automated tools that help with monitoring, editing, and the overall maintenance of the platform; examples include Huggle and Twinkle. These tools work with humans and help keep the platform free from vandals by flagging users and taking appropriate action as deemed necessary.

Reflection:

This paper was an interesting read on how offensive content is dealt with on platforms like Wikipedia. It was interesting to learn about the different tools and how they interact with humans and help them keep the platform clear of bad content. These tools are extremely useful, and they make use of the machine affordance of dealing with large amounts of data. However, I feel we should also discuss the fact that machines need human intervention to evaluate their performance. The paper mentions "leveraging the skills of volunteers who may not be qualified to review an article formally"; this statement is bold and raises a lot of open questions. Yes, it makes it easy to recruit people with less expertise, but at the same time it makes us aware that machines are taking over some jobs and undermining human expertise.

Most of the tools mentioned flag content based on words, phrases, or the deletion of large amounts of content. These can be described as rule-based rather than machine learning approaches. Could we implement machine learning and deep learning algorithms so that the tool learns from user behavior? Wikipedia is data-rich and could provide plenty of data for a model to train on. The paper mentioned that "significant removal of content" is placed higher in the filter queue. My only concern is that sometimes a user might press enter by mistake. For example, take the case of Git: users write code, and the changes are recorded and shown as a diff from the previous commit. If a coder writes only a line or two of new code but erroneously presses enter before or after an existing block, the whole block can show up as "newly added" in the diff. This is easy for a human to understand, but a machine would flag such content nonetheless. This may lead to extra work on items that normally would not have been in the queue at all, or would have ranked lower.

The paper talks about the "talk page" where warnings are posted by the tools. This is a very good step, as public shaming is needed to stop such harmful behavior. However, we could incorporate a harsher way to "shame" such users, perhaps by posting their usernames on the main homepage for every category. This won't work for anonymous users, but blocking their IP addresses could be a temporary fix. Overall, I feel that the human-computer interaction is well defined in the paper, and the concept of content-controlling bots makes our lives easier.

Questions:

  1. Are machines undermining human capabilities? Do we not need expertise any more?
  2. How can such tools utilize the vast amount of data better? E.g., for training deep learning models.
  3. How could such works be extended to other platforms like Twitter?


2/19/2020 – Jooyoung Whang – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

This paper introduces how bots and humans interact and collaborate to moderate thousands of wiki pages and ban vandal users. To study the use of moderation bots, the authors use a technique called trace ethnography. The technique traces the logs and records left by automated services to give insight into how moderation decisions were made using various tools. The authors explain how the tools facilitate distributed cognition and enhance teamwork among otherwise isolated vandal fighters. According to the paper, the set of vandal warnings is logged on the potential vandal's user talk page, which future vandal fighters then use to determine how severe a warning to give the user. Temporary bans are handled in a similar fashion: a ban request is sent to the administrators' ban request board, and the next time an administrator finds vandal activity by the same user, the ban is applied. The paper uses a detailed use case to explain the process step by step.
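
A minimal sketch of this escalation workflow, as I understand it from the summary above, is shown below. The four-warning threshold matches the process described in the paper's use case, but the data structures and function names are my own illustration, not Wikipedia's actual implementation.

```python
# Hedged sketch of the warning/ban escalation described above.
WARNING_THRESHOLD = 4          # after four warnings, a ban request is filed

talk_pages = {}                # username -> list of warning messages
ban_requests = []              # queue reviewed by administrators

def report_vandal_edit(username, reason):
    """The revert is handled elsewhere; here we only log the warning and escalate."""
    warnings = talk_pages.setdefault(username, [])
    warnings.append(f"Level-{min(len(warnings) + 1, WARNING_THRESHOLD)} warning: {reason}")
    if len(warnings) >= WARNING_THRESHOLD and username not in ban_requests:
        ban_requests.append(username)   # an administrator applies the ban on the next offense

for _ in range(5):
    report_vandal_edit("ExampleUser", "blanked a page")
print(len(talk_pages["ExampleUser"]), ban_requests)   # 5 ['ExampleUser']
```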

The paper was interesting in that it shone a light on another benefit that automation can bring to collaborative work. The paper emphasizes that it was the automated bots and their efficient reporting system that created a decentralized network of human moderators, by pre-processing and analyzing the queued edits to form a ranked queue of potential vandal edits according to previous warnings. Since many effective scheduling algorithms exist, automated scheduling is a great way of coordinating human teamwork. Wikipedia's system reminded me of the thread pools used in modern software, except that each task is carried out by a human.

Wikipedia's vandal-fighting system makes excellent use of both human and AI affordances. The human side contributes linguistic ability and complex reasoning to identify vandal edits, while the AI side efficiently handles the many repetitive tasks like sorting edit queues and logging and retrieving warnings.

The following are the questions that I had while reading the paper:

1. At the end of the use case presented in the paper, an obsolete report made after a user's ban was automatically removed by the system. This is an example of resolving a race condition. Could there be other conflicts that occur because of the order of edits? Would some of them be difficult for a bot to fix?

2. According to the paper, it seems that the timing of past warnings on a potential vandal's talk page is not considered when assigning a new warning. What if a user who had gotten four warnings decided to quit being a vandal, came back a few years later, and accidentally made an edit that was considered vandalism? The system would issue a temporary ban. Do you think this is fair?

3. According to the paper, vandal fighters are able to select from a range of helper bots in their activity. All these bots are compatible with each other because of the presence of a talk page provided by Wikipedia. Would there be any case where the different types of bots cause a problem or conflict with each other?


2/19/2020 – Jooyoung Whang – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

According to the paper, most developers of classification or prediction systems focus on the quality of the predictions but not on the system's team performance with the user. The authors introduce the problems that can arise from current model-training loss criteria and propose new methods that address them. To develop a clearer picture of users' interactions with a classifier system, the authors build a web-based game platform called CAJA and conduct a user study on Amazon Mechanical Turk. They conclude that an increase in the performance of the system does not necessarily mean that the system's team performance with its users also increases. They also confirm that their proposed training method, which uses a new loss function and a new concept called dissonance, improves team performance.
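
As a rough illustration of the idea (my own sketch with an assumed weighting; the paper's exact formulation may differ), the retraining objective adds an extra penalty on examples the old model classified correctly but the new model gets wrong:

```python
# Hedged sketch of a dissonance-penalized training loss: the usual per-example
# loss, plus an extra lambda-weighted term wherever the old model was correct,
# so new errors on previously-correct examples cost more.
def compatible_loss(y_true, old_preds, new_probs, base_loss, lam=0.5):
    total = 0.0
    for t, o, p in zip(y_true, old_preds, new_probs):
        loss = base_loss(t, p)
        if o == t:                           # old model was right here
            loss += lam * base_loss(t, p)    # dissonance penalty
        total += loss
    return total / len(y_true)

# Example with a simple squared-error base loss on predicted probabilities.
sq = lambda t, p: (t - p) ** 2
print(compatible_loss([1, 0], [1, 1], [0.9, 0.2], base_loss=sq, lam=0.5))
```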

I liked the authors' new perspective on human-AI collaboration and model training. Now that I think of it, not considering the users of a system during development contradicts what the system is trying to achieve. One thing I was interested in and had thoughts about was their definition of dissonance. The term is used to link the old model of a system with the newly updated model in terms of user expectations. I saw that the term penalizes the new system when it misclassifies inputs that the old model used to get right. However, what if the users of the old system made predictions based on how the system was wrong? This may be a weird concern and probably an edge case, but if a user made decisions based on the belief that the system was wrong all the time, that person's team performance with the updated model would always be worse, even if the new system was trained with the suggested loss function.

The following are the questions that I had while reading the paper:

1. As I wrote in my reflection, do you think the newly proposed training method will be effective if the users made decisions based on the idea that the system would always be wrong? Or is this too extreme and absurd a thought?

2. The design of CAJA ensures that the user can never arrive at the solution on their own because too much about the problem domain is hidden from the user. However, this is often not the case in real-world scenarios, where the user of the system is often also an expert in the related field. Does this reduce the quality and trustworthiness of the results of this research? Why or why not?

3. The research started from the idea that interaction with the users must be considered when making an update to an AI system, in this case for human-AI collaboration. What if it were the opposite? For example, there are AIs built to compete with humans, like AlphaGo. These types of AIs are also developed with the goal of producing the optimal response to a given input without considering interaction with the user. How could training be modified to account for users in competitive AIs?


02/19/2020 – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff – Sushmethaa Muhundan

The paper studies human-AI teams in decision-making settings, focusing specifically on updates made to the AI component and their subsequent influence on the human's decision-making process. In AI-advised human decision making, the AI system recommends actions to the human, who then makes an informed decision based on this recommendation, their past experience, and their domain knowledge. They can choose to go ahead with the action recommended by the AI or disregard the recommendation. Over the course of their interactions with an AI system, humans develop a mental model of it, built by mapping out scenarios where the AI's recommendations were correct versus incorrect, based on the rewards and feedback the system provides. As part of the experiment, studies were conducted to establish the relationship between updates to AI systems and team performance. User behavior was monitored using a custom platform, CAJA, built to gain insight into how updates to AI models influence the user's mental model and, consequently, team performance. Compatibility metrics were introduced, and several real-world domains were analyzed, including recidivism prediction, in-hospital mortality prediction, and credit risk assessment.

It was extremely surprising to note that updates that improve the AI's performance may actually hurt team performance. My initial instinct was that as the AI's performance increased, team performance would increase proportionally, but this is not always the case. In certain cases, despite an increase in the AI's performance, the new results may not be consistent with the human's mental model; as a result, incorrect decisions are made based on past interactions with the AI, and the overall team performance decreases. An interesting and relatable parallel is drawn to the concept of backward compatibility in software engineering with respect to updates. The notion of compatibility is introduced using this analogy to describe the ideal scenario in which updates to the AI do not introduce new errors.

The platform developed to conduct the studies, CAJA, was an innovative way to overcome the challenges of testing in real-world settings. The platform abstracts away the specifics of problem solving by presenting a range of problems that distill the essence of mental models and trust in one's AI teammate. It was very interesting to note that these problems were designed such that no human could be a task expert, thereby maximizing the importance of mental models and their influence on decision making.

  • What are some efficient means to share the summary of AI updates in a concise, human-readable form that captures the essence of the update along with the reason for the change?
  • What are some innovative ideas that can be used to reduce the cost incurred by re-training humans in an AI-advised human decision-making ecosystem?
  • How can developers be made more aware of the consequences of the updates made to the AI model on team performance? Would increasing awareness help improve team performance?


02/19/2020 – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal – Sushmethaa Muhundan

The paper talks about the counter-vandalism process on Wikipedia, focusing on both the human efforts and the silent non-human efforts involved. Fully automated anti-vandalism bots are a key part of this process and play a critical role in managing the content on Wikipedia. The actors involved range from fully autonomous software to semi-automated programs to user interfaces used by humans. A case study is presented that gives an account of detecting and banning a vandal, highlighting the importance and impact of bots and assisted editing programs. Vandalism-reverting software uses queuing algorithms combined with a ranking mechanism based on vandalism-identification algorithms. The queuing algorithm takes into account multiple factors, such as the kind of user who made the edit, the user's revert history, and the type of edit made. The software proves to be extremely effective at presenting prospective vandals to the reviewers. User talk pages are the forums used to take action after an offense has been reverted. This largely invisible infrastructure has been critical in insulating Wikipedia from vandals, spammers, and other malevolent editors.
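
To make the queuing idea concrete, here is a purely illustrative scoring function built from the factors listed above; the weights, thresholds, and feature names are my assumptions, not the actual ranking used by tools like Huggle.

```python
# Illustrative priority score for an edit in a vandal-fighting queue.
# Higher scores surface the edit to reviewers sooner; all weights are made up.
def edit_priority(is_anonymous, prior_warnings, chars_removed, matched_bad_words):
    score = 0.0
    score += 2.0 if is_anonymous else 0.0        # kind of user who made the edit
    score += 1.5 * prior_warnings                # the user's revert/warning history
    score += 1.0 if chars_removed > 500 else 0.0 # large removal of content
    score += 3.0 * matched_bad_words             # type of edit (e.g., obscenity)
    return score

# An anonymous editor with two prior warnings who blanked a section:
print(edit_priority(True, prior_warnings=2, chars_removed=800, matched_bad_words=0))  # 6.0
```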

I feel that the case study presented helps in understanding the internal workings of vandalism-reverting software, and it is a great example of handling a problem by leveraging the complementary strengths of AI and humans via technology. It is interesting to note that the cognitive work of identifying a vandal is distributed across a heterogeneous network and is unified using technology! This lends speed and efficiency and makes the entire system robust. I found it particularly interesting that ClueBot, after identifying a vandal, reverted the edit within seconds. The edit did not have to wait in a queue for a human or another bot to review; it was resolved immediately by the bot.

A pivotal feature of this ecosystem that I found very fascinating was the fact that domain expertise or skill is not required to handle such vandal cases. The only expertise required of vandal fighters is in the use of the assisted editing tools themselves, and the kinds of commonsensical judgment those tools enable. This widens the eligibility criteria for prospective workers since specialized domain experts are not required.

  • The queuing algorithm takes into account multiple factors like the kind of user who made the edit, revert history of the user as well as the type of edit made. Apart from the factors mentioned in the paper, what other factors can be incorporated into the queuing algorithm to improve its efficiency?
  • What are some innovative ideas that can be used to further minimize the turnaround reaction time to a vandal in this ecosystem?
  • What other tools can be used to leverage the complementary strengths of humans and AI using technology to detect and handle vandals in an efficient manner?
