Subil Abraham – 03/04/2020 – Real-time captioning by groups of non-experts

This paper pioneers the approach of using crowd work for closed captioning systems. The scenario they target is classes and lectures, where a student can hold up their phone, record the speaker, and have the sound transmitted to crowd workers. The audio is passed along in bite-sized pieces for the crowd workers to transcribe, and the paper’s multiple sequence alignment algorithm combines those partial transcriptions. The focus of the tool is very much on real-time captioning, so the amount of time a crowd worker can spend on a portion of sound is limited. The authors design interfaces on the worker side to promote continuous transcription, and on the user side to allow corrections to the received transcriptions in real time, further enhancing quality. The authors had to deal with interesting challenges in resolving errors in the transcription, which they did by comparing transcriptions of the same section from different crowd workers and by using bigram and trigram data to validate the word ordering. Evaluations showed that precision remained stable while coverage increased with the number of workers, and the system had a lower error rate than automatic transcription and untrained individual transcribers.
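
To make the combining step concrete, here is a toy sketch (very much not the paper’s actual multiple sequence alignment implementation) of aligning two hypothetical overlapping worker captions on their shared words and stitching them together; a real system aligns many workers at once and uses n-gram statistics to resolve disagreements.

```python
# Toy sketch (not the paper's MSA algorithm): merge two overlapping partial
# transcripts by aligning them on matching word runs and keeping the words the
# workers agree on. Real systems align many workers and use n-gram statistics.
from difflib import SequenceMatcher

def merge_transcripts(a, b):
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(None, a_words, b_words)
    merged = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":              # both workers agree: keep the shared words
            merged.extend(a_words[i1:i2])
        else:                          # disagreement or gap: keep both candidates for now
            merged.extend(a_words[i1:i2])
            merged.extend(b_words[j1:j2])
    return " ".join(merged)

# Hypothetical worker captions of adjacent, overlapping audio snippets
w1 = "the professor said the exam will cover"
w2 = "exam will cover chapters three and four"
print(merge_transcripts(w1, w2))
# -> "the professor said the exam will cover chapters three and four"
```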

One thing that needs to be pointed out about this work is that ASR has been rapidly improving and has made significant strides since this paper was published. From my own anecdotal experience, YouTube’s automatic closed captions are getting very close to being fully accurate (however, thinking back on our reading of the Ghost Work book at the beginning of the semester, I wonder if YouTube is cheating a bit and using crowd work intervention on some of their videos to help their captioning AI along). I also find the authors’ solution for merging the transcriptions of the different sound bites interesting. How they would solve that was the first thing on my mind, because it was never going to be a matter of simply aligning timestamps; those were always going to be imprecise. So I do like their clever multi-part solution. Finally, I was a little surprised and disappointed that the WER was at ~45%, which was a lot higher than I expected. I was expecting the error rate to be a lot closer to that of professional transcribers, but unfortunately not. The software still has a way to go there.
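
For reference, the WER figure quoted above is the standard metric: word-level edit distance between the reference and the hypothesis, divided by the length of the reference. A minimal sketch, with made-up example sentences:

```python
# Standard word error rate: (substitutions + insertions + deletions) / reference length,
# computed with a word-level Levenshtein distance. The ~45% figure is on this scale,
# where professional stenographers score far lower.
def word_error_rate(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)

print(word_error_rate("the exam covers chapters three and four",
                      "the exam covers chapter three four"))  # ~0.29
```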

  1. How could you get the error rate down to a professional transcriber’s level? What is going wrong that causes it to be that high?
  2. It’s interesting to me that they couldn’t just play isolated sound clips but instead had to raise and lower volume on a continuous stream for better accuracy. Where are the other places humans work better when they have a continuous stream of data rather than discrete pieces of data?
  3. Is there an ideal balance between choosing precision and coverage in the context of this paper? This was something that also came up in last week’s readings. Should the user decide what the balance should be? How would they do it when there can be multiple users all at the same location trying to request captioning for the same thing?

Subil Abraham – 03/04/2020 – Pull the Plug

The paper proposes a way of deciding when a computer or a human should do the work of foreground segmentation of images. Foreground segmentation is a common task in computer vision where the idea is that there is an element in an image that is the focus of the image, and that element is what is needed for further processing. However, automatic foreground segmentation is not always reliable, so sometimes it is necessary to get humans to do it. The important question is deciding which images to send to humans for segmentation, because hiring humans is expensive. The paper proposes a machine learning method that estimates the quality of a given coarse- or fine-grained segmentation and decides whether it is necessary to bring in a human to do the segmentation. They evaluate their framework by examining the quality of different segmentation algorithms and are able to achieve quality equivalent to 100% human work by using only 32.5% human effort for GrabCut segmentation, 65% for Chan-Vese, and 70% for Lankton.
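
A minimal sketch of the routing idea (not the paper’s actual model): assume we already have a predicted quality score per automatic segmentation, then spend the limited human budget on the worst-scoring images. The image names, scores, and budget below are hypothetical; the quality predictor itself is the paper’s real contribution and is only stubbed out here.

```python
# "Pull the plug" routing sketch: rank images by predicted segmentation quality
# and send only the worst ones to human annotators, up to the available budget.
def allocate_human_effort(predictions, human_budget):
    """predictions: list of (image_id, predicted_quality in [0, 1])."""
    ranked = sorted(predictions, key=lambda p: p[1])          # worst first
    to_humans = {img for img, _ in ranked[:human_budget]}
    return {img: ("human" if img in to_humans else "machine")
            for img, _ in predictions}

# Hypothetical predicted-quality scores for five images, budget of two workers
scores = [("img1", 0.92), ("img2", 0.35), ("img3", 0.71), ("img4", 0.18), ("img5", 0.80)]
print(allocate_human_effort(scores, human_budget=2))
# -> img2 and img4 go to humans; the rest keep their automatic masks
```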

The authors have pursued a truly interesting idea in that they are not trying to create a better method of automatic image segmentation, but rather a way of determining whether the automatic segmentation is good enough. My initial thought was: couldn’t something like this be used to just make a better automated image segmenter? I mean, if you can tell the quality, then you know how to make it better. But apparently that’s a hard enough problem that it is far more helpful to just defer to a human when you predict that your segmentation quality is not where you want it. It’s interesting that they talk about pulling the plug on both computers and humans, but the paper seems focused on pulling the plug on computers, i.e., the human workers are the backup plan in case the computer can’t do quality work, and not the other way around. This applies to both of their cases, coarse-grained and fine-grained segmentation. I would like to see future work where the primary work is done by humans first, testing how effective it would be to pull the plug on the human work and where productivity would increase. This would have to be work in something that is purely in the human domain (i.e., it can’t be regular office work, because that is easily automatable).

  1. What are examples of work where we pull the plug on the human first, rather than pulling the plug on the computer?
  2. It’s an interesting turnaround that we are using AI effort to determine quality and decide when to bring humans in, rather than improving the AI of the original task itself. What other tasks could you apply this to, where there are existing AI methods but an AI way of determining quality and deciding when to bring in humans would be useful?
  3. How would you set up a segmentation workflow (or another application’s workflow) where, when you pull the plug on the computer or the human, you give the best result so far to the other for improvement, rather than starting over from scratch?

02/26/2020 – Subil Abraham – Will you accept an imperfect AI?

Reading: Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–14. https://doi.org/10.1145/3290605.3300641

Different parts of our lives are being infused with AI magic. With this infusion, however, come problems, because the AI systems being deployed aren’t always accurate. Users are used to software systems being precise and doing exactly the right thing, but unfortunately they can’t extend that expectation to AI systems, which are often inaccurate and make mistakes. Thus it is necessary for developers to set users’ expectations ahead of time so that the users are not disappointed. This paper proposes three different visual methods of setting the user’s expectations of how well the AI system will work: an indicator depicting accuracy, a set of examples demonstrating how the system works, and a slider that controls how aggressively the system should work. The system under evaluation is a detector that identifies and suggests potential meetings based on the language in an email. The goal of the paper isn’t to improve the AI system itself, but rather to evaluate how well the different expectation-setting methods work given an imprecise AI system.

I want to note that I really wanted to see an evaluation of the effects of mixed techniques. I hope that it will be covered in possible future work, but I am also afraid that such work might never get published because it would be classified as incremental (unless they come up with more expectation-setting methods beyond the three mentioned in this paper and do a larger evaluation). It is useful that we now have numbers to back up that high-recall applications under certain scenarios are perceived as more accurate. It makes intuitive sense that it would be more convenient to deal with false positives (just close the dialog box) than false negatives (having to manually create a calendar event). Also, the control slider brings to mind the trick some offices play where the climate control box is within easy reach of the employees but doesn’t actually do anything. It’s a placebo to make people think it got warmer/colder when nothing has changed. I realize that the slider in the paper is actually supposed to do what it advertises, but it makes me think of other places where a placebo slider could be given to users to make them think they have control when in fact the AI system remains completely unchanged.
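
As a rough guess at what such a slider might control under the hood, here is a sketch of a decision threshold applied to a hypothetical meeting detector’s confidence scores: lowering the threshold behaves like the high-recall setting (more false positives, fewer missed meetings) and raising it behaves like the high-precision setting. The emails and scores are made up, and the paper does not say the slider is implemented this way.

```python
# Sketch of an "aggressiveness" slider as a decision threshold on detector scores.
def detect_meetings(scored_emails, threshold):
    return [email for email, score in scored_emails if score >= threshold]

scored = [("lunch tomorrow at noon?", 0.81),
          ("thanks for the report", 0.12),
          ("can we sync on Thursday", 0.55),
          ("newsletter: weekly digest", 0.30)]

print(detect_meetings(scored, threshold=0.25))  # aggressive / high recall
print(detect_meetings(scored, threshold=0.75))  # conservative / high precision
```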

  1. What other kinds of designs can be useful for expectation setting in AI systems?
  2. How would these designs look different for a more active AI system like medical prediction, rather than a passive AI system like the meeting detector?
  3. The paper claims that the results are generalizable to other passive AI systems, but are there examples of such systems where they do not generalize?

02/26/2020 – Subil Abraham – Explaining models

A big concern with the usage of current ML systems is the issue of fairness and bias in their decisions. Bias can creep into ML decisions through either the design of the algorithm or through training datasets that are labeled in ways that encode bias against certain groups. The example used in this paper is the bias against African Americans in an ML system used by judges to predict the probability of a person re-offending after committing a crime. Fairness is hard to judge when ML systems are black boxes, so this paper proposes that if ML systems expose the reasons behind their decisions (i.e., the idea of explainable AI), the user can make a better judgement of the fairness of a decision. To this end, the paper examines the effect of four different kinds of explanations of ML decisions on people’s judgements of the fairness of those decisions.
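
To make the idea of a generated explanation concrete, here is a toy sketch (not the paper’s system) of an influence-style explanation for a linear risk model, where each feature’s contribution is simply its weight times its value; the feature names, weights, and values are invented for illustration.

```python
# Toy "feature influence" explanation for a linear risk model: report how much
# each (made-up) feature pushed the score up or down. Not the paper's system.
weights = {"prior_offenses": 0.9, "age": -0.4, "employment": -0.6}
defendant = {"prior_offenses": 2, "age": 1.5, "employment": 0}

contributions = {f: weights[f] * defendant[f] for f in weights}
score = sum(contributions.values())

print(f"predicted risk score: {score:.2f}")
for feature, c in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"  {feature}: {'+' if c >= 0 else ''}{c:.2f}")
```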

I believe this is a very timely and necessary paper, with ML systems being used more and more for sensitive and life-changing decisions. It is probably impossible to stop people from adopting these systems, so the next best thing is making explainability of ML decisions mandatory, so people can see and judge whether there was potentially bias in an ML system’s decision. It is interesting that people were mostly able to perceive that there were fairness issues in the raw data. You would think that would be hard, but the generated explanations may have worked well enough to help with that (though I do wish they could’ve shown an example comparing a raw data point and a processed data point to show how their pre-processing cleaned things up). I did wonder why they didn’t show confidence levels to the users in the evaluation, but their explanation that it was something they could not control for makes sense. People could have different reactions to confidence levels, some thinking that anything less than 100% is insufficient, others thinking that 51% is good enough. So keeping it out is limiting but logical.

  1. What other kinds of generated explanations could be beneficial, outside of the ones used in the paper?
  2. Checking for racial bias is an important case for fair AI. In what other areas are fairness and bias correction in AI critical?
  3. What would be ways that you could mitigate any inherent racial bias of the users who are using explainable AI, when they are making their decisions?

02/19/2020 – In Search of the Dream Team – Subil Abraham

How do you identify the best way to structure your team? What kind of leadership setup should it have? How should team members collaborate and make decisions? What kind of communication norms should they follow? These are all important questions to ask when setting up a team, but answering them is hard, because there is no one right answer. Every team is different as a function of its team members. So it is necessary to iterate on these dimensions and experiment with different choices to see which setup works best for a particular team. Earlier work in CSCW attempts this with “multi-armed bandits”, where each dimension is independently experimented with by a so-called “bandit” (a computational decision maker) in order to collectively reach a configuration based on each bandit’s recommendation for its dimension. However, this earlier work suffered from the problem of recommending too many changes and overwhelming the teams involved. Thus this paper proposes a version with temporal constraints that still provides the same benefits of exploration and experimentation while limiting how often changes are recommended, to avoid overwhelming the team.
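
As a sketch of the underlying idea (not the paper’s exact algorithm), here is an epsilon-greedy bandit over a single team-structure dimension that is only allowed to recommend a change every few rounds; the arm names, cooldown, and reward signal are all placeholders.

```python
# Epsilon-greedy bandit over one team-structure dimension with a temporal
# constraint: it may only switch its recommendation every `cooldown` rounds.
import random

class ConstrainedBandit:
    def __init__(self, arms, epsilon=0.2, min_rounds_between_changes=3):
        self.arms = arms
        self.epsilon = epsilon
        self.cooldown = min_rounds_between_changes
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}
        self.current = arms[0]
        self.rounds_since_change = 0

    def recommend(self):
        self.rounds_since_change += 1
        if self.rounds_since_change < self.cooldown:
            return self.current                                     # hold steady
        if random.random() < self.epsilon:
            choice = random.choice(self.arms)                       # explore
        else:
            choice = max(self.arms, key=lambda a: self.values[a])   # exploit
        if choice != self.current:
            self.current = choice
            self.rounds_since_change = 0
        return self.current

    def update(self, arm, reward):          # reward = observed team performance
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = ConstrainedBandit(["leader-led", "egalitarian", "rotating-lead"])
for _ in range(10):
    arm = bandit.recommend()
    bandit.update(arm, reward=random.random())   # stand-in for a real team score
```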

This is my first exposure to this kind of CSCW literature, and I find it a very interesting look into how computational decision makers can help make better teams. The idea of a computational agent looking at the performance of teams and how they’re functioning and making recommendations to improve the team dynamics intuitively makes sense, because the team members themselves either can’t take an objective view because of their bias, or could be afraid to make recommendations or propose experimentation for fear of upsetting the team dynamic. The fact that this proposal is about incorporating temporal constraints into these systems is also a cool idea, because of course humans can’t deal with frequent change; that would be very overwhelming. Having an external arbiter do that job instead is very useful. I wonder whether the failure of the human managers to experiment is because humans in general are risk averse, or because the managers that were picked were particularly risk averse. This ties into my next complaint about the experiment sizes: both in the manager section and overall, I find the experiment size awfully small. I feel like you can’t capture proper trends, especially sociological trends such as those discussed in this paper, with experiments of just 10 teams. I feel a larger experiment should have been done to identify larger trends before this paper was published. Assuming that the related earlier work with multi-armed bandits also had similar experiment sizes, those should have been larger experiments as well before they were published.

  1. Could we expand the DreamTeam recommendations so that, in addition to recommending changes along the different dimensions, it is also able to recommend more specific things? The main thing I was thinking of was: if it is changing the hierarchy to a leader-based setup, it also recommends a leader, or explicitly recommends people vote on a leader, rather than just saying “hey, you guys now need to work with a leader-type setup”.
  2. Considering how limited the feedback that DreamTeam could get was, what else could be added beyond just looking at the scores at different time steps?
  3. What would it take for managerial setups to be less risk averse? Is the point of creating something like DreamTeam to help and push managers to have more confidence in instituting change, or is it to just have a robot take care of everything, sans managers at all?

02/19/2020 – The Work of Sustaining Order in Wikipedia – Subil Abraham

This paper is a very interesting inside look at how the inner cogs of Wikipedia function, particularly relating to how vandalism is managed with the help of automated software tools. The tools, developed unofficially by Wikipedia contributors, were created out of necessity in order to a) make it easier to identify bad actors, b) automate and speed up reversion of vandalism, and c) give non-experts the power to police obvious vandalism, such as changing or deleting sections, without needing a subject matter expert to do a full review of the article. The paper uses trace ethnography to study the usage of these tools and puts forth an interesting case study of a vandal defacing various articles: through the distributed actions of various volunteers, assisted by these tools, the vandal was identified, warned for their repeated offenses, and finally banned as their egregious actions continued, all within the span of 15 minutes and with no explicit coordination among the volunteers.
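
The escalation the case study describes can be sketched as a simple warn-then-ban policy. The code below is a toy illustration of that shape only, not Huggle’s or any real Wikipedia bot’s actual logic, and the warning levels and ban threshold are made up.

```python
# Toy warn-then-ban escalation loop in the spirit of the case study.
from collections import defaultdict

WARNING_LEVELS = ["note", "caution", "warning", "final warning"]

class VandalismPatrol:
    def __init__(self, ban_threshold=4):
        self.warnings = defaultdict(int)   # editor -> number of warnings issued
        self.ban_threshold = ban_threshold

    def handle_edit(self, editor, looks_like_vandalism):
        if not looks_like_vandalism:
            return "keep"
        # Revert immediately, then escalate based on the editor's warning history
        level = self.warnings[editor]
        if level >= self.ban_threshold:
            return "revert + report for ban"
        self.warnings[editor] += 1
        return f"revert + {WARNING_LEVELS[level]}"

patrol = VandalismPatrol()
for _ in range(5):
    print(patrol.handle_edit("anon_vandal", looks_like_vandalism=True))
# note, caution, warning, final warning, then report for ban
```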

I find this to be a fascinating look at distributed cognition in action, wherein multiple independent actors take independent actions that produce a cohesive result (in the case study, multiple volunteers and automated tools identifying a vandal and issuing warnings, ultimately resulting in their ban). I find myself thinking of the work of these tools as a kind of equivalent to the human body’s unconscious activities. For example, the act of walking is incredibly complex, involving precise coordination of hundreds of muscles all moving at the right moments. However, we do not have to think any harder than “I want to get from here to there” and our body handles the rest. That’s kind of what these tools feel like: something that handles the complex busywork and leaves the big decisions to us. I am wondering, though, how things have changed since 2009. The paper mentions that the bots tend to ignore changes made by other bots, because presumably those other bots are being managed by other volunteers, but the bot configuration can be changed so that it explicitly monitors other bots. I wonder how much of that functionality is used now, because I am sure Wikipedia now has to deal with a lot more politically motivated vandalism, much of it done by bots. Reddit is a big victim of this, so it is not hard to imagine Wikipedia facing the same problem. Of course, adversarial bots would be a lot more clever than just pretending to be a friendly bot, because that might not cut it anymore. It’s still an important thing to think about.

  1. How would the functionality of Huggle and its ilk fare in the space of Reddit’s automoderator, and vice versa? Are they dealing with fundamentally different things or is there overlap?
  2. How has dealing with vandalism changed on Wikipedia in the decade since this paper was published?
  3. Is there a place for a hierarchy of bots, where lower-level bots scan for vandalism and higher-level bots make the decisions on banning, all with minimal human intervention? Or will there always need to be active human participation?

02/05/2020 – The Role of Humans in Interactive Machine Learning – Subil Abraham

Reading: Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the People: The Role of Humans in Interactive Machine Learning. AI Magazine 35, 4: 105–120. https://doi.org/10.1609/aimag.v35i4.2513

Machine learning systems are typically built through collaboration between domain experts and ML experts. The domain experts provide data to the ML experts, who carefully configure and tune the ML model, which is then sent back to the domain experts for review; they then recommend further changes, and the cycle continues until the model reaches an acceptable accuracy level. However, this tends to be a slow and frustrating process, and there exists a need to get the actual users involved in a more active manner. Hence, the study of interactive machine learning arose to identify how users can best interact with and improve ML models through faster, interactive feedback loops. This paper surveys the field, looking at what users like and don’t like when teaching machines, what kinds of interfaces are best suited for these interaction cycles, and what unique interfaces can exist beyond the simple labeling-learning feedback loop.
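
A minimal sketch of the kind of tight labeling-learning loop the survey is about, using scikit-learn’s incremental training so the model updates after every label the user provides. The example texts and the stand-in “user” function are invented, and this is not any particular system from the paper.

```python
# Interactive loop sketch: the user labels one example at a time and the model
# updates immediately, rather than waiting for an offline retraining cycle.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**12)
model = SGDClassifier()
classes = ["meeting", "not_meeting"]

def user_labels(text):
    """Stand-in for the human in the loop; normally an interactive prompt."""
    return "meeting" if "meet" in text or "sync" in text else "not_meeting"

stream = ["let's meet at 3pm", "quarterly report attached",
          "can we sync tomorrow?", "lunch menu for friday"]

for text in stream:
    x = vectorizer.transform([text])
    label = user_labels(text)                       # fast feedback from the user
    model.partial_fit(x, [label], classes=classes)  # model updates right away

print(model.predict(vectorizer.transform(["want to meet next week?"])))
```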

When reading about the novel interfaces that exist for interactive machine learning, I find there is an interesting parallel between the development of the “Supporting Experimentation of Inputs” type of interface and that of text editors. The earliest text editor was the typewriter, where an input, once entered, could never be taken back. A correction would require starting over or the use of an ugly whiteout. With electronics came text editors where you could edit only one line at a time. And finally, today we have advanced, feature-rich editors and IDEs with autocomplete suggestions, inline linting, and automatic type checking and error feedback. It would be interesting to see what the next stage of ML model editing would look like if it continued on this trajectory, going from simple “backspace key” style experimentation to features paralleling what modern text editors offer for words. The idea of allowing “Combining Models” as a way to create models draws another interesting parallel to car manufacturing, where cars went from being handcrafted to being built on an assembly line with standardized parts.

I also think their proposal for creating a universal language to connect the different ML fields might end up creating a language that is too general, and the different fields, though initially unified, might end up splitting off again, either by using only subsets of the language that don’t overlap with each other or by creating new words because the language has nothing specific enough.

  1. Is the task of creating a “universal language” a good thing? Or would we end up with something too general to be useful and cause fields to create their own subsets?
  2. What other kinds of parallels can we see in the development of machine learning interfaces, like the parallels to text editor development and car manufacturing?
  3. Where is the “Goldilocks zone” for ML systems that give context to the user for the sake of transparency? There is a spectrum between “Label this photo with no context” and “here is every minute detail, number of pixels, exact GPS location, all sorts of other useless info”. How do we decide which information the ML system should provide as context?

02/05/2020 – Guidelines for Human AI Interaction – Subil Abraham

Reading: Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), 1–13. https://doi.org/10.1145/3290605.3300233

With AI and ML making their way into every aspect of our electronic lives, it has become pertinent to examine how well they function when faced with users. In order to do that, we need some set of rules or guidelines that we can use as a reference to identify whether the interaction between a human and an AI-powered feature is actually functioning the way it should. This paper aims to fill that gap, collating 150 recommendations for human-AI interfaces and distilling them down into 18 distinct guidelines that can be checked for compliance. They also go through the process of refining and tailoring these guidelines to remove ambiguity through heuristic evaluations, where experts try to match the guidelines to sample interactions and identify whether the interaction adheres to or violates a guideline, or whether the guideline is relevant to that particular interaction at all.

  • Though it’s only mentioned in a small sentence in the Discussion section, I’m glad that they point out and acknowledge that there is a tradeoff between being very general (at which point the vocabulary you devise is useless and you have to start defining subcategories), and being very specific (at which point you need to start adding addendums and special cases willy-nilly). I think the set of guidelines in this paper does a good job of trying to strike that balance.
  • I do find it unfortunate that they anonymized the products they used to test interactions on. Maybe it is just standard practice in this kind of HCI work not to specify the exact products evaluated, to avoid dating the work in the paper. It probably makes sense: this way they have control of the narrative and can simply talk about each application in terms of the feature and interaction tested. This avoids having to grapple over which version of the application they used on which day, because applications get updated all the time and violations might get patched and fixed, so that the application is no longer a good example of the guideline adherence or violation that was noted earlier.
  • It is kind of interesting that a majority of the experts in phase 4 preferred the original version of guideline 15 (encourage feedback) as opposed to the revised version (provide granular feedback) that was successful in the user study. I wish they had explained or speculated about why that was.
  1. Why do you think the experts in phase 4 preferred the original version of guideline 15 over the revised version, even though the revised version was demonstrated to cause less confusion with guideline 17 than the original?
  2. Are we going to see even more guidelines, or a revision of these guidelines, 10 years down the line when AI-assisted applications become even more ubiquitous?
  3. As the authors pointed out, the current ethics-related guidelines (5 and 6) may not be sufficient to cover all ethical concerns. What other guidelines should there be?

01/29/20 – The Future of Crowd Work – Subil Abraham

Reading: Aniket Kittur, Jeffrey V. Nickerson, Michael Bernstein, Elizabeth Gerber, Aaron Shaw, John Zimmerman, Matt Lease, and John Horton. 2013. The Future of Crowd Work. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (CSCW ’13), 1301–1318. https://doi.org/10.1145/2441776.2441923

What can we do to make crowd work better than its current state of simple tasks, allowing more complexity and satisfaction for the workers? The paper tries to provide a framework for improving crowd work in that direction, framing the problem in terms of 12 research foci that need to be studied and advanced. These research foci are envisioned to promote the betterment of the current, less-than-stellar, sometimes exploitative nature of crowd work and turn it into something “we would want our children to participate” in.

I like their parallels to distributed computing, because it really is like that: trying to coordinate a bunch of people to complete some larger task by combining the results of smaller tasks. I work on distributed systems, so I appreciate the parallel they make because it fits my mental framework. I also find it interesting that one of the ways of doing quality control is to observe the worker’s process rather than just evaluating the output, but it makes sense: evaluating the process allows the requester to give guidance on what the worker is doing wrong and help improve the process, whereas by just looking at the output, you can’t know where things went wrong and can only guess. I also think their suggestion that crowd workers could move up to be full employees is somewhat dangerous, because it seems to incentivize the wrong things for companies. I’m imagining a scenario where a company is built entirely on utilizing high-level crowd work, advertising that you have opportunities to “move up”, “make your own hours”, and “hustle your way to the top”, where the reward is job security. I realize I just described what the tenure track may be like for an academic. But that kind of incentive structure seems exploitative and wrong to me. This kind of setup seems normal because it may have existed for a long time in academia, and prospective professors accept it because they are single-mindedly determined (and somewhat insane) enough to see it through. But I would hate for something like that to become the norm everywhere else.

  1. Did anyone feel like there was any avenue that wasn’t addressed? Or did the 12 research foci fully cover every aspect of potential crowd work research?
  2. Do you think the idea of moving up to employee status on crowd work platforms as a reward for doing a lot of good work is a good idea?
  3. What kind of off-beat innovations can we think of for new kinds of crowd platforms? Just as a random example – a platform for crowds to work with other crowds, like one crowd assigns tasks for another crowd and they go back and forth.

01/29/20 – Affordance-Based Framework for Human Computation and Human-Computer Collaboration – Subil Abraham

Reading: R. Jordon Crouser and Remco Chang. 2012. An Affordance-Based Framework for Human Computation and Human-Computer Collaboration. IEEE Transactions on Visualization and Computer Graphics 18, 12: 2859–2868. https://doi.org/10.1109/TVCG.2012.195

This paper creates a summary of data visualization innovations as well as more general human-computer collaboration tools for interpreting and drawing conclusions from data. The goal of the paper is to create a common language by which to categorize these tools, and thereby provide a way of comparing them and understanding exactly what is needed for a particular situation, rather than relying on researcher intuition alone. They set up a framework in terms of affordances: what a human or computer has the opportunity and capability to do given the environment. By framing things in terms of affordances, we are able to identify how a human and/or computer can contribute to the goal of a given task, as well as frame a system in comparison to other systems in terms of their affordances.

The idea of categorizing human-computer collaborations in terms of affordances is certainly an interesting and intuitive one. Framing the characteristics of the different tools and software we use in these terms is a useful way of looking at things. However, as useful as the framework is, having read a little bit about function allocation, I don’t see how hugely different affordances are from function allocation. The list of affordances is a bit more comprehensive than Fitts’s HABA-MABA list, but both seem to be conveying the same information. Perhaps I do not have the necessary breadth of knowledge to see the difference, but the paper doesn’t make any convincing argument that is easy for an outsider to this field to understand.

Questions for discussion:

  1. How effective is affordances as a framework? What use does it actually provide besides being one more set of standards? (relevant xkcd: https://m.xkcd.com/927/)
  2. There is a seemingly clear separation between human and machine affordances. But human adaptability seems to be a third kind of affordance, a hybrid affordance where a machine action is used to spark human ingenuity. Does that seem valid, or would you say that adaptability falls clearly into one of the two existing categories?
  3. Now that we have a language to talk about this, can we use this language, these different affordances, to combine them and create new applications? What would that look like? Or are we limited to just identifying an application by its affordances after its creation?
