2/19 – Dylan Finch – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Word count: 517

Summary of the Reading

This paper seeks to make it faster and easier for teams to find their ideal structure. While existing approaches let teams test out many different team structures to find the best one, they can take a lot of time and place a heavy burden on the people who work on the team. Often, teams have to switch structures so frequently that it becomes hard to concentrate on getting work done.

The method proposed in the paper appears to be very successful, producing teams that were 38-46% more effective. The system works by testing different team structures and using automatically generated feedback (such as performance metrics) to estimate how effective each structure is, then basing its future combinations on that feedback. Each new structure varies along five dimensions: hierarchy, interaction patterns, norms of engagement, decision-making norms, and feedback norms.
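
To make the mechanics concrete, below is a minimal sketch of how a bandit over team structures could operate, assuming an epsilon-greedy strategy and made-up dimension values; the paper's actual algorithm and reward model are more sophisticated than this.

import random
from itertools import product

# Hypothetical value sets for each dimension, for illustration only;
# the paper's actual value sets differ.
DIMENSIONS = {
    "hierarchy": ["flat", "leader-led"],
    "interaction_patterns": ["round-robin", "open"],
    "engagement_norms": ["strict", "relaxed"],
    "decision_norms": ["consensus", "majority-vote"],
    "feedback_norms": ["continuous", "end-of-task"],
}

class StructureBandit:
    """Epsilon-greedy bandit over team-structure configurations."""

    def __init__(self, epsilon=0.2):
        self.epsilon = epsilon
        self.arms = list(product(*DIMENSIONS.values()))
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}

    def select_structure(self):
        # Explore a random structure with probability epsilon,
        # otherwise exploit the best-performing one seen so far.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def record_feedback(self, arm, reward):
        # Update the running mean of the automatically generated
        # performance metric for this structure.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

In the real system, the reward would come from the automatically generated performance metrics the post mentions.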

Reflections and Connections

I think this paper presents an excellent idea for a system that can help teams work better together. One of the most important things about a team is how it is structured: the structure of a team can make or break its effectiveness, so getting it right is essential. A tool that can help a team figure out its best structure with minimal interruption will be very useful to anyone in the business world who needs to manage a team.

I also thought it was a great idea to integrate the system into Slack. When I worked in industry last summer, all of the teams at my company used Slack, so it makes a lot of sense to build this new tool on a platform people are already familiar with. The use of Slack also lets the creators make the system friendlier: I think it is much better to get feedback from a human-like Slack bot than from some faceless computer program. It is also very cool that team members can interact with the bot directly in Slack.

I also found the dimensions they used to describe team structures interesting. It is valuable to be able to classify teams in some concrete way along dimensions of how they operate, and this has many real-world applications. A lot of the time, one of the hardest parts of any problem space is simply quantifying the possible states of the system, and they did this very nicely with the team dimensions and their values.

Questions

  1. Would you recommend this system to your boss at your next job as a way to figure out how to organize the team?
  2. Aside from the ones listed in the paper, what do you think could be some limitations of the current system?
  3. Do you think that the possible structures had enough dimensions and values for each dimension?


02/19/2020 – The Work of Sustaining Order in Wikipedia – Myles Frantz

Given an extensive website such as Wikipedia, there is bound to be an abundance of actors, both good and bad. With the scale and wide ruleset of the popular site, it would be nigh impossible for human moderators to handle the workload and cross-examine each page in depth. To alleviate this, programs that use machine learning were created to help aggregate users' activity across the site into a single repository. Once that information is gathered, a user acting maliciously can easily be caught by the system and auto-reverted based on the machine learning predictions. Such was the case for the user in the case study, who attempted to slander a famous musician but was caught quickly and with ease.

I absolutely agree with all the moderation going on around Wikipedia. Given the site's domain, there is a vast number of pages that must be secured and protected (all to the same level). It is unrealistic to expect a non-profit website to hire enough manual workers to accomplish this task (in contrast to YouTube or Facebook). Also, the amount of context that must be followed to fully track a malicious user down manually would be completely exhausting. On the security side, for malware tracking there is a vast number of decompilers, raw binary program tracers, and even a custom virtual machine and operating system (Security Onion) that contains various programs "out of the box" ready to trace the full environment of the malware.

I disagree with one of the major issues raised, regarding bots creating and executing their own moral agenda. Their behavior is entirely learned and based on various factors (such as the rules, the training data, and correction values). Though they have the power to automatically revert and edit someone else's page, these actions are done at the discretion of the person who created the rules. There will likely be some issues, but that is part of the overall learning process. False positives can also be appealed if the author chooses to follow through, so the decision is not fully final.

  • With such a tool suite, I would expect a tool that acts as a combination of them all, a "Visual Studio Code"-like interface for these tools. Having all of these tools at the ready is useful; however, since time is of the essence, a tool wrapping all the common functions would be very convenient.
  • I would like to know how many reviews from moderators are completely biased. A moderator workforce should ideally be unbiased; realistically, however, that is unlikely to fully happen.
  • I would also like to see the percentage of false positives, even in a system this robust. New moderators are likely to flag or unflag something incorrectly if they are unfamiliar with the rules.


02/19/20 – Lulwah AlKulaib – The Work of Sustaining Order in Wikipedia

Summary

The paper examines the roles of software tools in the English-language Wikipedia. The authors shed light on the process of counter-vandalism in Wikipedia, explaining in detail how participants and their assisted editing tools review Wikipedia contributions and enforce standards. They show that editing in Wikipedia is not a disconnected activity in which editors force their views on others. Specifically, vandal fighting is shown to be a distributed cognition process in which users come to know their projects and the users who edit them in a way that would be impossible for a single individual. The authors claim that the blocking of a vandal is a cognitive process made possible by a complex network of interactions between humans, encyclopedia articles, software systems, and databases. Humans and non-humans work together to produce and maintain social order in the collaborative production of an encyclopedia with hundreds of thousands of diverse and often unorganized contributors. The authors introduce trace ethnography as a method for studying the seemingly ad-hoc assemblage of editors, administrators, bots, assisted editing tools, and others who constitute Wikipedia's vandal-fighting network.

Reflection

The paper comes off as a survey paper. I found that the authors explained some methods that already existed and used one of the authors' experience to elaborate on others' work. I couldn't see their contribution, but maybe that was what was needed 10 years ago? The tools they mention (Huggle, AIV, Twinkle, etc.) were standard tools for editing Wikipedia's articles and monitoring edits made by others. They reflect on how those tools made fighting vandalism an easier task. They mention that these tools facilitate reviewing each edited article by linking it to a detailed edit summary explaining why the edit was made, by whom, and from which IP addresses. They explain how such software is used to detect vandalism and revert to the correct version of an article. They present a case study of a Wikipedia vandal and show logs of the changes he was able to make in an hour. The authors also reference Ed Hutchins, who describes the cognitive work that must be performed to keep US Navy ships on course at any given time, and draw a parallel to what it takes to manage Wikipedia. Technological actors in Wikipedia, such as Huggle, turn what would be a difficult task into a mundane affair: reverting edits becomes a matter of pressing a button. The paper was informative for someone who hasn't worked on editing Wiki articles, but I thought it could have been presented as a tutorial, which would have been more beneficial.

Discussion

  • Have you worked on Wikipedia article editing before?
  • Did you encounter using the tools mentioned in the paper?
  • Is there any application that comes to mind where this can be used other than Wikipedia?
  • Do you think such tools could be beneficial when it comes to open source software version control?
  • How would this method generalize to open source software version control?


02/19/20 – Fanglan Chen – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

Bansal et al.'s paper "Updates in Human-AI Teams" explores an interesting problem: the influence of updates to an AI system on overall team performance. Nowadays, AI systems are deployed to support human decision making in high-stakes domains, including criminal justice and healthcare. In such human-AI teams, humans make decisions with reference to the AI's inferences. A successful partnership requires that the human develop an understanding of the AI system's performance, especially its error boundary. Updates with higher-performing algorithms can potentially increase the AI's predictive accuracy. However, they may require humans to regain interactive experience and rebuild their confidence in the AI system, and that adjustment process may actually hurt team performance. The authors introduce the concept of compatibility between an AI update and prior user experience and present methods for studying the role of compatibility in human-AI teams. Extensive experiments on three high-stakes classification tasks (recidivism, credit risk, and mortality) demonstrate that current AI systems are not provided with compatible updates, resulting in decreased performance after updating. To improve the compatibility of an update, the authors propose a re-training objective that penalizes new failures of the AI system. Their proposed compatible updates achieve a good balance of the performance/compatibility trade-off across different tasks.
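
As a rough illustration of that idea (my own sketch, not the authors' exact formulation), such a retraining objective can be written as the usual loss plus a penalty applied only on examples the pre-update model classified correctly, weighted by a hyperparameter lambda:

import torch.nn.functional as F

def compatibility_aware_loss(new_logits, labels, old_model_correct, lam=0.5):
    # Standard cross-entropy, computed per example.
    base = F.cross_entropy(new_logits, labels, reduction="none")
    # Extra penalty only where the pre-update model was already correct,
    # discouraging the updated model from introducing "new" failures.
    penalty = base * old_model_correct.float()
    return (base + lam * penalty).mean()

Tuning lam trades raw accuracy against compatibility with the user's existing mental model.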

Reflection

I think pairing AI and humans as a team to take full advantage of the collaboration is a pretty neat idea. Humans are born with the ability to adapt in the face of an uncertain and adverse world and with the capacity for logical reasoning. Machines cannot perform well in those areas, but they can achieve efficient computation and free people for higher-level tasks. Understanding how machines can efficiently enhance what humans do best, and how humans can augment the scope of what machines can do, is the key to rethinking and redesigning current decision-making systems.

What I find interesting about the research problem discussed in this paper is that the authors focus on unifying the decisions made by humans and machines, rather than merely on task performance when recommending updates. In machine learning with no human involved, the goal is usually to achieve ever-better performance as evaluated by metrics such as accuracy, precision, and recall. Compatible updates can be seen as machine learning algorithms with similar decision boundaries but better performance, which seems an even more difficult goal to accomplish. To get there, humans need to play crucial roles. First, humans must train machines to achieve good performance on certain tasks. Next, humans need to understand and be able to explain the outcomes of those tasks, especially where AI systems fail; that requires an interpretability component in the system. As AI systems increasingly draw conclusions through opaque processes (the so-called black-box problem), there is a large demand for human experts in the field to explain model behavior to non-expert users. Last but not least, humans need to sustain the responsible use of AI systems by, for example, providing the kinds of updates for better decision making discussed in the paper. That would require a large body of human experts who continually work to ensure that AI systems are functioning properly, safely, and responsibly.

The above discussion is one side of the coin, focusing on how humans can extend what machines can achieve. The other side is comparatively less discussed in the current literature: beyond extending physical capabilities, how humans can learn from interaction with AI systems and enhance their individual abilities is an interesting question to explore. I would imagine that, in an advanced human-AI team, humans and AI systems communicate in a more interactive way that allows for collaborative learning from their own mistakes and from the rationale behind the correct decisions made by each other. That leads to another question: if AI systems can rival or exceed humans in high-stakes decision making such as recidivism prediction and underwriting, how risky is it to hand those tasks over to machines? How can we decide when to let humans take control?

Discussion

I think the following questions are worthy of further discussion.

  • What can humans do that machines cannot and vice versa?
  • What is the goal of decision making and what factors are stopping humans or machines from making good decisions? 
  • In the Human-AI teams discussed in the paper, what can humans benefit from the interaction with the AI systems?
  • The partnership introduced by the authors is more like a human-assisting-machine approach. Can you provide some examples of machine-assisting-human approaches?


02/19/2020 – The Work of Sustaining Order in Wikipedia – Subil Abraham

This paper is a very interesting inside look at how the inner cogs of Wikipedia function, particularly at how vandalism is managed with the help of automated software tools. The tools, developed unofficially by Wikipedia contributors, were created out of necessity in order to a) make it easier to identify bad actors, b) automate and speed up reversion of vandalism, and c) give non-experts the power to police obvious vandalism, such as changed or deleted sections, without needing a subject matter expert to do a full review of the article. The paper uses trace ethnography to study the usage of these tools and puts forth an interesting case study of a vandal defacing various articles: through distributed actions by various volunteers, assisted by these tools, the vandal was identified, warned for their repeated offenses, and finally banned as their egregious actions continued, all within the span of 15 minutes and with no explicit coordination among the volunteers.

I find this to be a fascinating look at distributed cognition in action, wherein multiple independent actors are able to take independent actions that produce a cohesive result (in the case study, multiple volunteers and automated tools identifying a vandal and issuing warnings, ultimately resulting in their ban). I find myself thinking of the work of these tools as an equivalent of the human body's unconscious activities. For example, the act of walking is incredibly complex, involving precise coordination of hundreds of muscles all moving at the right moments; however, we do not have to think any harder than "I want to get from here to there," and our body handles the rest. That is what these tools feel like: something that handles the complex busywork and leaves the big decisions to us. I am wondering, though, how things have changed since 2009. The paper mentions that the bots tend to ignore changes made by other bots, presumably because those other bots are being managed by other volunteers, but that the bot configuration can be changed to explicitly monitor other bots. I wonder how much of that functionality is used now, because I am sure Wikipedia has to deal with a lot more politically motivated vandalism today, and much of it is being done by bots. Reddit is a big victim of this, so it is not hard to imagine Wikipedia facing the same problem. Of course, adversarial bots would have to be a lot more clever than just pretending to be friendly bots, because that might not cut it anymore. It is still an important thing to think about.

  1. How would the functionality of Huggle and its ilk fare in the space of Reddit’s automoderator, and vice versa? Are they dealing with fundamentally different things or is there overlap?
  2. How has dealing with vandalism changed on Wikipedia in the decade since this paper was published?
  3. Is there a place for a hierarchy of bots, where lower-level bots scan for vandalism and higher-level bots make the banning decisions, all with minimal human intervention? Or will there always need to be active human participation?


02/19/2020 – In Search of the Dream Team – Subil Abraham

How do you identify the best way to structure your team? What kind of leadership setup should it have? How should team members collaborate and make decisions? What kind of communication norms should they follow? These are all important questions to ask when setting up a team, but answering them is hard because there is no one right answer. Every team is different as a function of its members. So it is necessary to iterate on these dimensions and experiment with different choices to see which setup works best for a particular team. Earlier work in CSCW attempts this with "multi-armed bandits," where each dimension is independently experimented with by a so-called "bandit" (a computational decision maker) in order to collectively reach a configuration based on each bandit's recommendation for its dimension. However, this earlier work suffered from the problem of recommending too many changes and overwhelming the teams involved. This paper therefore proposes a version with temporal constraints that still provides the same benefits of exploration and experimentation while limiting how often changes are recommended, to avoid overwhelming the team.
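
To give a feel for what the temporal constraint might look like, here is a small sketch of my own (not the paper's implementation), assuming a bandit object that exposes a select_structure() method; a change is only surfaced to the team once a minimum number of rounds has passed since the last one, and the cooldown length is an assumed parameter:

class TemporallyConstrainedRecommender:
    """Wraps a bandit so structure changes are only recommended after a
    minimum number of rounds has elapsed since the previous change."""

    def __init__(self, bandit, min_rounds_between_changes=5):
        self.bandit = bandit
        self.cooldown = min_rounds_between_changes
        self.rounds_since_change = self.cooldown  # allow an initial change
        self.current_structure = None

    def step(self):
        proposed = self.bandit.select_structure()
        self.rounds_since_change += 1
        # Suppress the recommendation if the team changed structure recently,
        # so members are not overwhelmed by constant reconfiguration.
        if proposed != self.current_structure and self.rounds_since_change >= self.cooldown:
            self.current_structure = proposed
            self.rounds_since_change = 0
        return self.current_structure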

This is my first exposure to this kind of CSCW literature, and I find it a very interesting look into how computational decision makers can help build better teams. The idea of a computational agent looking at a team's performance and how it is functioning and making recommendations to improve team dynamics intuitively makes sense, because the team members themselves either can't take an objective view because of their biases, or may be afraid to make recommendations or propose experimentation for fear of upsetting the team dynamic. That this proposal incorporates temporal constraints into such systems is also a good idea, because humans can't deal with frequent change without being overwhelmed; having an external arbiter do that job instead is very useful. I wonder whether the failure of the human managers to experiment is because humans in general are risk averse, or because the managers that were picked were particularly risk averse. This ties into my next complaint, about the experiment sizes: both in the manager section and overall, I find the experiment size awfully small. I don't think you can capture proper trends, especially sociological trends such as those discussed in this paper, with experiments on just 10 teams. A larger experiment should have been done to identify broader trends before this paper was published, and assuming the related earlier work with multi-armed bandits had similar experiment sizes, those should have been larger experiments as well.

  1. Could we expand the DreamTeam recommendations so that, in addition to recommending changes along the different dimensions, it can also recommend more specific things? The main thing I was thinking of was: if it is changing hierarchy to a leader-based setup, could it also recommend a leader, or explicitly recommend that people vote on one, rather than just saying "hey, you guys now need to work with a leader-type setup"?
  2. Considering how limited the feedback DreamTeam could get is, what else could be added beyond just looking at the scores at different time steps?
  3. What would it take for managerial setups to be less loss averse? Is the point of creating something like DreamTeam to help and push managers to have more confidence in instituting change, or is it to just have a robot take care of everything, sans managers at all?


2/19/20 – Lee Lisle – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary

            Geiger and Ribes cover the use of automated tools, or "bots," to prevent vandalism on the popular user-generated online encyclopedia Wikipedia. The authors detail how editors use popular distributed-cognition coordination services such as "Huggle," and argue that these coordination applications affect the creation and maintenance of Wikipedia as much as the traditional social roles of editors. Teams of humans and AI work together to fight vandalism in the form of rogue edits. The authors note that bots assisted essentially 0% of edits in 2006 but 12% by 2009, with editors relying on even more bot assistance beyond that. They then deep-dive into how the editors came to ban a single vandal who committed 20 false edits to Wikipedia in an hour, an analysis they term a "trace ethnography."

Personal Reflection

            This work was eye-opening in seeing exactly how Wikipedia editors leverage bots and other distributed cognition to maintain order in Wikipedia. Furthermore, after reading this, I am much more confident in the accuracy of articles contained on the website (possibly to the chagrin of teachers everywhere). I was surprised how easily attack edits were repelled by the Wikipedia editors, considering that hostile bot networks could be deployed against Wikipedia as well.

            I also enjoyed the analogy between managing Wikipedia and navigating a naval vessel, in that both leverage significant amounts of distributed cognition in order to succeed. Showing how many roles are needed to understand the various jobs and how people collaborate across them was quite effective.

            Lastly, their trace-ethnography focus on a single vandal was an effective way of portraying what is essentially daily life for these maintainers. I was somewhat surprised that only four people were involved before a user was banned; I had figured that each vandal took much longer to identify and remedy. How the process proceeded, with the vandal getting repeated warnings before a (temporary) ban occurred, and how the bots and humans worked together to reach this conclusion, was fascinating, and not something I had seen written up in a paper before.

Questions

  1. One bot that this article didn't look into is a Twitter bot that tracked all changes on Wikipedia made from IP addresses used by congressional members (@CongressEdits). Its intended audience is not specifically the editors of Wikipedia, but how might it help them? How does this bot help the general public? (It has since been banned, in 2018.) How might a tool like this be abused?
  2. How might a trace ethnography be used in other applications for HCI? Does this approach make sense for domains other than global editors?
  3. How can Huggle (or the other tools) be changed in order to tackle a different application, such as version control? Would it be better than current tools?
  4. Is there a way to exploit this system for vandals? That is, are there any weaknesses to human/bot collaboration in this case?


2/19/20 – Lee Lisle – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

            Bansal et al. discuss how human-AI teams work on high-stakes problems such as hospital patient discharge decisions or credit risk assessment. They point out that the humans in these teams often form a mental model of the AI's suggestions, that is, an understanding of when the AI is likely to be wrong about the outcome. The authors then show that updates to the AI can produce worse performance if they are not compatible with the human user's already-formed mental model. They go on to define types of compatibility for AI updates, as well as a few other key terms relating to human-AI teams. They develop a platform to measure how compatibility can affect team performance, and then measure the effectiveness of compatible AI updates through a user study with 25 mTurk workers. In all, they show that incompatible updates reduce performance compared to no update at all.

Personal Reflection

            The paper was an interesting study in the effect of pushing updates without considering the user involved in the process. I hadn’t thought of the human as an exactly equal player in the team, where the AI likely has more information and could provide a better suggestion. However, it makes sense that the human leverages other sources of information and forms a better understanding of what choice to ultimately make.

            CAJA, the human-AI simulation platform, seems like a good way to test AI updates; however, I struggle to see how it can be used to test other theories, as the authors seem to suggest. It is, essentially, a simple user-learning game in which users figure out when to trust the machine and when to deviate. While this isn't exactly my field of expertise, I only see changing the information flows and the underlying AI as ways of learning new things about human-AI collaboration, which makes calling this a platform a little excessive.

Questions

  1. The authors mention that, in order to defeat mTurk scammers who click through projects like these quickly, they drop the lowest quartile (in terms of performance) out of their results. Do you think this was an effective countermeasure, or could the authors be cutting good data?
  2. From other sources, such as Weapons of Math Destruction, we can read how some AI suggestions are inherently biased (even racist) due to input data. How might this change the authors results? Do you think this is taken into consideration at all?
  3. One suggestion near the end of the paper stated that, if pushing an incompatible update, the authors of the AI should make the change explicit so that the user could adjust accordingly. Do you think this is an acceptable tradeoff to not creating a compatible update?  Why or why not?
  4. The authors note that errors increased as the error boundary f became more complex, so they kept to relatively simple boundaries. Is this an effective choice for this system, considering that real systems are extremely complex? Why or why not?
  5. The authors state that they wanted the “compute” cost to be net 0. Does this effectively simulate real-world experiences? Is the opportunity-cost the only net negative here?


02/19/2020 – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff – Yuhang Liu

This paper begins from the complementarity between humans and artificial intelligence. In many cases, humans and AI form a team in which people make decisions after checking the AI's inferences; this cooperation model has applications in many fields and has achieved significant results. Usually, such success requires certain prerequisites: people must be able to form their own judgments about the AI's conclusions, and the AI's results must be accurate. The tacit cooperation between the two improves efficiency. However, when the AI system is updated and its data expands, this cooperation can break down. Even if the AI's accuracy improves, the shift in its error boundary breaks people's understanding of the system, so after the update the team's efficiency may actually be reduced. This paper mainly studies this situation. The authors want updates to remain compatible with users' prior experience, so they propose several methods to achieve more compatible and accurate updates.

The authors suggest that this idea comes from an analogy: in software engineering, an updated system is considered backward compatible if it still supports legacy software. I greatly appreciate this kind of analogy, which is similar to bionics; we can continuously bring new ideas into computing through this kind of thinking. The method in this paper is also very necessary. In ordinary machine learning practice, we usually build datasets from scratch each time and lack a notion of inheritance, which is very inconvenient. Adopting compatibility will save a great deal of effort and allow systems to serve people more smoothly.

The article introduces CAJA, a platform for measuring the impact of AI performance and updates on team performance. It also introduces a practical retraining objective to improve update compatibility; the main idea is to penalize new errors. It is also clear from the text that trust is at the core of teamwork. Admittedly, trust is the essence of a team, but it is only the basis of the work; I think more simulation and improvement of the human element is needed. When people learn new things, their previous skills are not negatively affected; instead, they gain more perspectives and methods for thinking about a problem. So I think humans and machines should be blended and treated as a single team, so that results can be more compatible and human-machine interaction more successful.

Questions

  1. What are the implications of compatible AI updates?
  2. How can we better treat people and machines as a whole?
  3. Will making AI updates compatible affect the final training results?


02/19/2020 – Human-Machine Collaboration for Content Regulation – Myles Frantz

Since the dawn of the internet, it has surpassed many expectations and become prolific throughout everyday life. Though there was initially a lack of standards in website design and forum moderation, things have relatively stabilized with various systematic approaches. A popular forum site, Reddit, uses a human-led human-AI collaboration to help automatically and manually moderate its ever-growing comments and threads. Searching through the top 100 subreddits (at the time of writing), the team surveyed moderators from 5 varied and highly active subreddits. Thanks to the easy-to-use API provided by Reddit, one of the most used moderation tools was a third-party tool, AutoModerator, later incorporated into Reddit itself. It is one of the more popular and common tools used by moderators on Reddit. Since it is very extensible, there is no common standard across subreddits, and moderators within the 5 subreddits use the bot in similar but distinct ways. It is also not the only bot moderators use; other bots can interact with and streamline it in a similar fashion. However, due to the complexity of the bots (whether technological or a lack of interest in learning the tool), some subreddits let only a few people manage them, sometimes with damning results. Rather than reacting to users' complaints when issues happen, the paper argues for more transparency from the bot.

I agree with the original author of AutoModerator, who started off making the bot purely to automate several steps. Carrying this forward as Reddit scales, I do believe it would be impossible for human moderators alone to keep up with the "trolls".

I do disagree, though, with how knowledge of the AutoMod rules is spread out. I believe decentralizing that knowledge would make the system more robust, especially since the moderators are volunteers. It is natural for people to avoid what they don't understand, out of fear of it in general or of what repercussions may follow, but I don't think putting all of the work on one moderator is the right answer.

  • One of my questions regards one of the suggested outcomes for Reddit: granting more visibility into AutoMod's actions. Given the scale of Reddit, extending this kind of functionality automatically could incur much more memory and storage overhead. Reddit already stores a vast amount of data; potentially doubling that (if every comment reviewed by AutoMod were logged) may be a downfall of this approach.
  • Instead of surveying the top 60%, I wonder whether lower-ranked subreddits (via RedditMetrics) with fewer moderators would fit the same pattern of AutoMod use. I would imagine they would be forced to use the tool in more depth and breadth due to the lack of available resources, but this is pure speculation.
  • A final question: to what degree are bots duplicated across subreddits? If the percentage is high, it may lead to a vastly different experience from subreddit to subreddit, as seems to be the case now, potentially causing confusion among new or returning users.
