03/04/20 – Lee Lisle – Real-Time Captioning by Groups of Non-Experts

March 4, 2020 Lorance R Lisle

Summary

Lasecki et al. present LEGION:SCRIBE, a novel captioning system for the deaf and hard of hearing population. Their implementation crowdsources multiple people per audio stream to achieve low-latency and highly accurate results. They note that the competitors are professional stenographers (who are expensive) and automatic speech recognition (ASR, which has large accuracy issues). They then describe their Multiple Sequence Alignment (MSA) approach, which aligns the output from multiple crowdworkers to get the best possible caption, and how they intend to evaluate SCRIBE. Their approach also allows tuning the output toward coverage or precision, where coverage provides a more complete caption and precision attains a lower word error rate. They then conducted an experiment where they transcribed a set of lectures using various methods, including several configurations of SCRIBE (varying the number of workers and the coverage setting) and an ASR. SCRIBE outperformed the ASR in both latency and accuracy.

Personal Reflection

This work is pretty relevant to me as my semester project is on transcribing notes for users in VR. I was struck by how quickly they were able to get captions from the crowd, but also by how many errors were still present in the finished product. In figure 11, the WER for CART was a quarter of that of their method, which got only slightly more than half of the words correct. And in figure 14, none of the transcriptions seem terribly acceptable, though CART was close. I wonder if their WER was so poor because of the nature of the talks or because there were multiple speakers in each scene. I wish that they had discussed how much impact having multiple speakers has on transcription services rather than the somewhat vague descriptions they had.
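For readers less familiar with the metric, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. This is the standard textbook definition, not necessarily the exact scoring script the authors used, and the function name and example sentences are my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one dropped word out of four reference words.
wer_dropped_word = word_error_rate("the quick brown fox", "the quick fox")
print(wer_dropped_word)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.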

It was interesting that they could get the transcriptions done through Mechanical Turk at the rate of $36 per hour. This is roughly 1/3 of the cost of their professional stenographer (at $1.75 per minute or $105 per hour). The cost savings are impressive, though the coverage could be a lot better.
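The comparison above can be checked with a few lines of arithmetic. The two rates come from the paragraph; the variable names are my own.

```python
# Rates quoted in the reflection above.
STENOGRAPHER_USD_PER_MINUTE = 1.75
CROWD_USD_PER_HOUR = 36.0

# Convert the stenographer's per-minute rate to an hourly rate.
stenographer_usd_per_hour = STENOGRAPHER_USD_PER_MINUTE * 60

# Fraction of the stenographer's cost that the crowd approach costs.
cost_ratio = CROWD_USD_PER_HOUR / stenographer_usd_per_hour

print(f"Stenographer: ${stenographer_usd_per_hour:.2f}/hour")
print(f"Crowd cost as a fraction of stenographer cost: {cost_ratio:.2f}")
```

The ratio comes out slightly above one third, which matches the "roughly 1/3" estimate.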

Lastly, I was glad they included one of their final sections, “Leveraging Hybrid Workforces,” as it is particularly relevant to this class. They were able to increase their coverage and precision by including an ASR as one of the inputs into their MSA combiner, regardless of whether they were using one worker or ten. This indicates that there is a lot of value in human-AI collaboration in this space.

Questions

If such low-latency wasn’t a key issue, could the captions get an even lower WER? Is it worth a 20 second latency? A 60 second latency? Is it worth the extra cost it might incur?
Combined with our reading last week on acceptable false positives and false negatives from AI, what is an acceptable WER for this use case?
Their MSA combiner showed a lot of promise as a tool for potentially different fields. What other ways could their combiner be used?
Error checking is a problem in many different fields, but especially crowdsourcing as the errors can be caused in many different ways. What other ways are there to combat errors in crowdsourcing? Would you choose this way or another?

03/04/20 – Lee Lisle – Combining Crowdsourcing and Google Street View to Identify Street-Level Accessibility Problems

March 4, 2020 Lorance R Lisle

Summary

Hara, Le, and Froehlich developed an interface that uses Google Street View to identify accessibility issues in city sidewalks. They then performed a study using three researchers and three accessibility experts (wheelchair users) to evaluate their interface. This served both as a way to assess usability issues with their interface and as a ground truth to verify the results of their second study. That study involved launching crowdworking tasks to identify accessibility problems as well as to categorize what type each problem is. Over 7,517 Mechanical Turk HITs, they found that crowdworkers could identify accessibility problems 80.6% of the time and could correctly classify the problem type 78.3% of the time. Combining their approach with a majority voting scheme, they raised these values to 86.9% and 83.8%.
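As a rough illustration of what a majority voting scheme does with redundant labels, here is a minimal sketch. It assumes a strict majority (more than half of the workers must agree), and the label strings are hypothetical; the paper's actual aggregation rules may differ.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the workers, else None."""
    if not labels:
        raise ValueError("need at least one label")
    label, count = Counter(labels).most_common(1)[0]
    # Strict majority: the winning label needs more than half of all votes.
    return label if count * 2 > len(labels) else None


# Hypothetical labels from three workers looking at the same image.
worker_labels = ["curb ramp missing", "curb ramp missing", "obstacle in path"]
print(majority_vote(worker_labels))
```

With an even number of workers a tie returns None, which is one reason odd group sizes are convenient for this kind of scheme.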

Personal Reflection

Their first step to see if their solution was even feasible seemed like an odd study. Their users were research members and experts, both of which are theoretically more driven than a typical crowdworker. Furthermore, I felt like internal testing and piloting would be more appropriate than a soft-launch like this. While they do bring up that they needed a ground truth to contextualize their second study, I initially felt that this should then be performed by only experts and not as a complete preliminary study. However, as I read more of the paper, I felt that the comparison between the groups (experts vs. researchers) was relevant as it highlighted how wheelchair bound people and able-bodied people can see situations differently. They could not have collected this data on Mechanical Turk alone as they couldn’t guarantee that they were recruiting wheelchair bound participants otherwise.

It was also good to see the human-AI collaboration highlighted in this study. Because they’re using the selections (and the images subsequently generated from those selections) as training data for a machine learning algorithm, the need for this kind of manual work should lessen in the future.

Their pay level also seemed very low at 1-5 cents per image. Even assuming a selection and classification takes only 10 seconds, their total page loading only takes 5 seconds, and they always get 5 cents per image, that’s $12 an hour for ideal circumstances.
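The estimate above can be written out as a short calculation. The timings and the 5-cent rate are the best-case assumptions from the paragraph, not measured values, and the variable names are my own.

```python
# Best-case assumptions from the paragraph above.
SECONDS_PER_LABEL = 10      # selection plus classification
SECONDS_PER_PAGE_LOAD = 5   # total page loading per image
CENTS_PER_IMAGE = 5         # top of the 1-5 cent range

seconds_per_image = SECONDS_PER_LABEL + SECONDS_PER_PAGE_LOAD
images_per_hour = 3600 // seconds_per_image

# Work in whole cents to avoid floating-point rounding, then convert.
cents_per_hour = images_per_hour * CENTS_PER_IMAGE
dollars_per_hour = cents_per_hour / 100

print(f"{images_per_hour} images/hour -> ${dollars_per_hour:.2f}/hour")
```

At the bottom of the range (1 cent per image) the same timings give only a fifth of that hourly rate.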

The good part of this research is that it cheaply identifies problems quickly. This can be used to identify a large number of issues and save time in deploying people to fix issues that are co-located in the same area rather than deploying people to find issues and then solve them with lesser potential coverage. It also solves a public need for a highly vulnerable population, which makes their solution’s impact even better.

Lastly, it was good to see how the various levels of redundancy impacted their results. The falloff from increasing past 5 workers was harsher than I expected, and the increase in identification is likely not worth doubling the cost of these tasks.

Questions

What other public needs could a Google Street View/crowdsourcing hybrid solve?
What are the tradeoffs for the various stakeholders involved in solutions like this? (The people who need the fixes, the workers who typically had to identify these problems, the workers who are deployed to maintain the identified areas, and any others)
Should every study measure the impact of redundancy? How might redundant workers affect your own projects?

02/26/20 – Lee Lisle – Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning

February 25, 2020 Lorance R Lisle

Summary

Kaur et al. cover how data scientists are now grappling with ways of explaining their algorithms’ results to the public through interpretability tools. They note that machine learning algorithms are often “black boxes” that don’t typically convey how they get to certain results, but that there are several methods of interpreting the results of these algorithms, such as GAMs, LIME, and SHAP. The authors then conduct six interviews, a contextual inquiry of data scientists, and a large-scale survey to see if these tools are being used effectively. They found that, while some tools do perform better than others, these tools are being misused by data scientists who misunderstand their intended use. The authors found that the participants either over-utilized or under-utilized the tools and trusted their output and impact too deeply.

Personal Reflection

It was fascinating to see tools that HCI professionals typically use to understand many different aspects of a job turned onto computer science practitioners and algorithm designers as a sort of self-evaluation of the field. I was also surprised to see that there are so many possible errors in training data; I had assumed that these training datasets had been cleaned and verified to make sure there were no duplicates or missing data from them. That part reinforced the need for the tools to find issues with datasets.

The study uncovered that the visualizations made the data scientists over-confident in their results. It was interesting to see that once the tools discovered an insight into the data, the data scientists didn’t look more deeply into that result. That they were fine with not knowing why a key attribute led to a certain result showcased why they might need to look more deeply into the workings of the algorithms. They gave a lot of similar answers, in that “I guess,” “I suppose,” and “not sure why” were all present and are fairly similar responses. It was furthermore odd that, during the survey, they weren’t confident that the underlying models were reasonable but didn’t think the dataset or model was to blame. Does this point to some amount of overconfidence in their own field?

Questions

Since this covered the AI designers mainly, do you think there is a key aspect of HCI research that could use a more complete understanding of its practice and practitioners? I.e., is there an issue that could be seen if HCI practitioners performed an ethnography or survey on their colleagues?
Since they had participants essentially perform a simulated task in the second phase, do you think this affected the results?
Would seeing these data scientists work on their own datasets have made a difference to the results? Do you think it would have changed how the data scientists think about their own work?
Since the dataset they used was a relatively low-risk dataset (i.e., it wasn’t like a recidivism predictor or loan-default prediction service), does that impact how the data scientists interacted with the tools?

02/26/20 – Lee Lisle – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

February 25, 2020 Lorance R Lisle

Summary

Dodge et al. cover a terribly important issue with artificial intelligence programs and biases from historical datasets, and how to mitigate the inherent racism or other biases within. They also work to understand how to better communicate why and how AIs reach the recommendations they do. In an experiment, they look at communicating outcomes from COMPAS, a known biased ML model for predicting recidivism amongst released prisoners. They cleaned the ML model to make race less impactful to the final decision, and then produced four ways of explaining the result of the model to 160 mTurk workers: Sensitivity, Input-influence, Case, and Demographic. “Input” emphasizes how much each input affected the results, “Demographic” describes how each demographic affects the results, “Sensitivity” shows which flipped demographics would have changed the results, and “Case” finds the most similar cases and details those results. They found that local-based explanations (case and sensitivity) had the largest impact on perceived fairness.

Personal Reflection

This study was pretty interesting to me because it actually tries to adjust for the biases of input data as well as to understand how to better convey insights from less-biased systems. I am still unsure that the authors removed all bias from the COMPAS system, but seeing that they did lower the coefficient significantly shows that they were working on it. In this vein, the paper made me want to read the paper they cited on how to mitigate biases in these algorithms.

I found their various methods on how to communicate how the algorithm came to its recommendation to be rather incisive. I wasn’t surprised that, when the sensitivity explanation said that the decision would be flipped if the individual’s race was flipped, people perceived more issues with the ML decision. That method of communication seems to lead people to see issues with the dataset more easily in general.

The last notable part of the experiment is that they didn’t give a confidence value for each case – they stated that they could not control for it and so did not present it to participants. That seems like an important part of making a decision based on the algorithm. If the algorithm is really on the fence, but has to recommend one way or the other, it might make it easier to state that the algorithm is biased.

Questions

Would removing the race (or other non-controllable biases) coefficient altogether affect the results too much? Is there merit in zeroing out the coefficient of these factors?
Having an attention check in the mTurk workflow is, in itself, not surprising. However, the fact that all of the crowdworkers passed the check is surprising. What does this mean for other work that ignores a subset of data assuming the crowdworkers weren’t paying attention? (Like the paper last week that ignored the lowest quartile of results)
What combination of the four different types would be most effective? If you presented more than one type, would it have affected the results?
Do you think showing the confidence value for the algorithm would impact the results significantly? Why or why not?

2/19/20 – Lee Lisle – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

February 18, 2020 Lorance R Lisle

Summary

Bansal et al. discuss how human-AI teams work in solving high-stakes issues such as hospital patient discharging scenarios or credit risk assessment. They point out that the humans in these teams often create a mental model of the AI suggestions, where the mental model is an understanding of when the AI is likely wrong about the outcome. The authors then show that updates to the AI can produce worse performance if they are not compatible with the already formed mental model of the human user. They go on to define types of compatibility for AI updates, as well as a few other key terms relating to human/AI teams. They develop a platform to measure how compatibility can affect team performance, and then measure AI update compatibility effectiveness through a user study using 25 mTurk workers. In all, they show that incompatible updates reduce performance as compared to no update at all.

Personal Reflection

The paper was an interesting study in the effect of pushing updates without considering the user involved in the process. I hadn’t thought of the human as an exactly equal player in the team, where the AI likely has more information and could provide a better suggestion. However, it makes sense that the human leverages other sources of information and forms a better understanding of what choice to ultimately make.

CAJA, the human/AI simulation platform, seems like a good way to test AI updates; however, I struggle to see how it can be used to test other theories as the authors seem to suggest. It is, essentially, a simple user-learning game, where users figure out when to trust the machine and when to deviate. While this isn’t exactly my field of expertise, I only see the chance to change information flows and the underlying AI as ways of learning new things about human/AI collaboration. This would mean that calling it a platform is a little excessive.

Questions

The authors mention that, in order to defeat mTurk scammers who click through projects like these quickly, they drop the lowest quartile (in terms of performance) out of their results. Do you think this was an effective countermeasure, or could the authors be cutting good data?
From other sources, such as Weapons of Math Destruction, we can read how some AI suggestions are inherently biased (even racist) due to input data. How might this change the authors’ results? Do you think this is taken into consideration at all?
One suggestion near the end of the paper stated that, if pushing an incompatible update, the authors of the AI should make the change explicit so that the user could adjust accordingly. Do you think this is an acceptable tradeoff to not creating a compatible update? Why or why not?
The authors note that, as the error boundary f became more complex, errors increased, so they kept to relatively simple boundaries. Is this an effective choice for this system, considering real systems are extremely complex? Why or why not?
The authors state that they wanted the “compute” cost to be net 0. Does this effectively simulate real-world experiences? Is the opportunity-cost the only net negative here?

2/19/20 – Lee Lisle – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

February 18, 2020 Lorance R Lisle

Summary

Geiger and Ribes cover the case of using automated tools or “bots” in order to prevent vandalism on the popular online and user-generated encyclopedia “Wikipedia.” The authors detail how editors use popular distributed cognition coordination services such as “Huggle,” and argue that these coordination applications affect the creation and maintenance of Wikipedia as much as the traditional social roles of editors. The team of human and AI work together to fight vandalism in the form of rogue edits. They cover how bots went from assisting essentially 0% of edits in 2006 to 12% in 2009, while editors use even more bot assistance. They then deep dive into how the editors came to ban a single vandal that committed 20 false edits to Wikipedia in an hour, an approach they term a “trace ethnography.”

Personal Reflection

This work was eye-opening in seeing exactly how Wikipedia editors leverage bots and other distributed cognition to maintain order in Wikipedia. Furthermore, after reading this, I am much more confident in the accuracy of articles contained on the website (possibly to the chagrin of teachers everywhere). I was surprised how easily attack edits were repelled by the Wikipedia editors, considering that hostile bot networks could be deployed against Wikipedia as well.

I also generally enjoyed the analogy of how managing Wikipedia is like navigating a naval vessel in that both leverage significant amounts of distributed cognition in order to succeed. Showing how many roles are needed in order to understand various jobs and collaborate between people was quite effective.

Lastly, their focus (trace ethnography) on a single vandal was an effective way of portraying what is essentially daily life for these maintainers. I was somewhat surprised that only four people were involved before banning a user; I had figured that each vandal took much longer to identify and remedy. How the process proceeded, where the vandal got repeated warnings before a (temporary) ban occurred, and how the bots and humans worked together in order to come to this conclusion, was a fascinating process that I hadn’t seen written in a paper before.

Questions

One bot that this article didn’t look into is a Twitter bot that tracked all changes on Wikipedia made by IP addresses used by congressional members (@CongressEdits). Its audience is not specifically intended to be the editors of Wikipedia, but how might this help them? How does this bot help the general public? (It was banned in 2018.) How might a tool like this be abused?
How might a trace ethnography be used in other applications for HCI? Does this approach make sense for domains other than global editors?
How can Huggle (or the other tools) be changed in order to tackle a different application, such as version control? Would it be better than current tools?
Is there a way to exploit this system for vandals? That is, are there any weaknesses to human/bot collaboration in this case?

2/5/20 – Lee Lisle – Guidelines for Human-AI Interaction

February 5, 2020 Lorance R Lisle

Summary

The authors (of which there are many) go over the various HCI-related findings for Human-AI interaction and organize them into eighteen guidelines across four categories (based on when the user encounters the AI assistance). The work makes sure the reader knows that it draws on the past twenty years of research, as well as a review of industry guidelines, articles and editorials in the public domain, and a (non-exhaustive) survey of scholarly papers on AI design. In all, they found 168 guidelines, which they then grouped through affinity diagramming (filtering out concepts that were too “vague”), resulting in twenty concepts. Eleven members of their team at Microsoft then performed a modified discount heuristic evaluation (where they identified an application and its issue) and refined their guidelines with that data, resulting in 18 rules. Next, they performed a user study with 49 HCI experts where each was given an AI tool and asked to evaluate it. Lastly, they had experts validate the revisions made in the previous phase.

Personal Reflection

These guidelines are actually quite helpful in evaluating an interface. As someone who has performed several heuristic evaluations in a non-class setting, having defined rules that can be easily determined if they’ve been violated makes the process significantly quicker. Nielsen’s heuristics have been the gold standard for perhaps too long, so revisiting the creation of guidelines is ideal. It also speaks to how new this paper is, being from 2019’s CHI conference.

Various things surprised me in this work. First, I was surprised that they stated that contractions weren’t allowed for their guidelines because they weren’t clear. I haven’t heard that complaint before, and it seemed somewhat arbitrary. A contraction doesn’t change a sentence much (“doesn’t” in this sentence is clearly “does not”), but I may be mistaken here. I was also surprised to find their tables in figure 1 to be hard to read, as if maybe it was a bit too information dense to clearly impart their findings. I was also surprised about their example for guideline 6, as suggesting personal pronouns and kind of stating there are only 2 is murky, at best (I would’ve used a different example entirely). Lastly, the authors completely ignored the suggestion of keeping the old guideline 15, stating their own reasons despite the experts’ preferences.

I also think this paper in particular will be a valuable resource for future AI development. In particular, it can give a lot of ideas for our semester project. Furthermore, these guidelines can help early on in the process of designing future interactions, as they can refine and correct interaction mistakes before the implementation of many of these features.

Lastly, I thought it was amusing the “newest” member of the team got a shout-out in the acknowledgements.

Questions

The authors bring up trade-offs as being a common occurrence in balancing these (and past) guidelines. Which of these guidelines do you think is easier or harder to bend?
The authors ignored the suggestion of their own panel of experts in revising one of their guidelines. Do you think this is appropriate for this kind of evaluation, and why or why not?
Can you think of an example of one of these guidelines not being followed in an app you use? What is it, and how could it be improved?

2/5/20 – Lee Lisle – Principles of Mixed-Initiative User Interfaces

February 5, 2020 Lorance R Lisle

Summary

The author, Horvitz, proposes a list of twelve principles for mixed-initiative, or AI-assisted, programs that should underlie all future AI-assisted programs. He also designs a program called LookOut, which focuses on email messaging and scheduling. It will automatically parse emails (and, it seems, other messaging services) and extract possible event data so that the user can add the event to their calendar, inferring dates and locations when needed. It also has an intermediary step where the user can edit the suggested event fields (time/location/etc.). In the description of LookOut’s benefits, the paper clearly lays out some probability theory of how it guesses what the user wants. It also lays out why each behind-the-scenes AI function is performed the way it is in LookOut.

Personal Reflection

I was initially surprised about this paper’s age; I had thought that this field was defined later than it apparently was. For example, Google was founded only a year before this paper was published. It was even more jarring to see Windows 95 (98?) in the figures. Furthermore, when the author starts describing LookOut, I realized that this is baked into a lot of email systems today, such as Gmail and the Apple Mail application, where they automatically can create links that will add events to your various calendars. The other papers we have read for this class tend to stay towards overviews or surveys of literature rather than a single example and deep dive into explaining its features.

It is interesting that “poor guessing of user’s goals” has been an issue this long. This problem is extremely persistent and speaks to how hard it is to algorithmically decide or understand what a user wants or needs. For example, LookOut was trained on 1000 messages, while today’s services are (likely) trained on millions, if not orders of magnitude more. While I imagine the performance is much better today, I’m curious what the comparative rates of false positives/negatives are.

This paper was strong overall, with a deep dive into a single application rather than an overview of many. Furthermore, it made arguments that are, for the most part, still relevant in the design of today’s AI-assisted programs. However, I would have liked the author to specifically mention the principles as they came up in the design of his program. For example, he could have said that he was fulfilling his 5th principle in the “Dialog as an Option for Action” section. However, this is a small quibble in the paper.

Lastly, while AI assistants should likely have an embodiment occasionally, the Genie metaphor (along with Clippy™ style graphics) is gladly retired now, and should not be used again.

Questions

Are all of the principles listed still important today? Is there anything they missed with this list, that may have arisen from faster and more capable hardware/software?
Do you think it is better to correctly guess what a user wants or is it better to have an invocation (button, gesture, etc.) to get an AI to engage a dialog?
Would using more than one example (LookOut, in this case) have strengthened the paper’s argument of what design principles were needed? Why or why not?
Can an AI take action incorrectly and not bother a user? How, and in what instances for LookOut might this be performed?

01/29/20 – Lee Lisle – An Affordance-Based Framework for Human Computation and Human-Computer Collaboration

January 28, 2020 Lorance R Lisle

Summary

Crouser and Chang make the argument that visual analytics is defined as “the science of analytical reasoning facilitated by visual interactive interfaces,” and is being pushed through two main directions of thought – human computation and human-computer collaboration. However, there’s no common design language between the two subdisciplines. Therefore, they took it upon themselves to do a survey of 1217 papers, whittling them down to 49 representative papers to then find common threads that can help define the fields. They then categorized the research by which affordances it studies, either for humans or for machines. Humans are naturally better with visual perception, visuospatial thinking, audiolinguistic ability, sociocultural awareness, creativity, and domain knowledge, while machines are better with large-scale data manipulation, data storage and collection, efficient data movement, and bias-free analysis. The authors then suggest that research explore human adaptability and machine sensing as well as discuss when to use these strategies.

Personal Reflection

When reading this I did question a few things about the studies. For example, in bias-free analysis, while they do admit that human bias can be introduced during the programming, they fail to acknowledge the bias that can be present in the input data. Entire books have been written (Weapons of Math Destruction being one) that cover how “bias-free” algorithms can be fed input data that have clear bias, resulting in a biased system regardless of it being hard-coded in the algorithm.

Outlining these similarities between various human-computer collaborations allows other researchers to scope projects better. Bringing up the deficiencies of certain approaches allows for avoidance of the same pitfalls.

The complexity measure questions section, however, felt a little out of place considering it was the first time it was brought up in the paper. Still, it asked strong questions that definitely impact this area of research. If ‘running time’ for a human is a long time, this could mean there are improvements to be made and areas where we can introduce more computer aid.

Questions

This kind of paper is often present in many different fields. Do you find these summary papers useful? Why or why not? Since it’s been 8 years since this was published, is it time for another one?
Near the end of the paper, they ask what the best way is to measure human work. What are your ideas? What are the tradeoffs for the types they suggested (input size, information density, human time, space)?
Section 6 makes it clear that using multiple affordances at once needs to be balanced in order to be effectively used. Is this an issue with the affordances or an issue with the usability design of the tools?
The authors mention two areas of further study in section 7: Human adaptability and machine sensing. Have these been researched since this paper came out? If not, how would you tackle these issues?

01/22/2020 – Lee Lisle – Ghost Work

January 28, 2020 Lorance R Lisle

Summary

Ghost Work’s introduction and first chapter cover a bird’s-eye look at the new gig economy of crowd intelligence tasks. They cover several anecdotes of people working on these tasks in various situations, mostly in the United States and India. The introduction also wants to get across that these types of tasks are for problems that AI can’t solve or needs training to be able to solve. The text also tries to impart that there is nothing to fear from AI – automation has always happened through new technologies, and there will always be more work generated by the blind spots of the newer technologies. The first chapter then goes through several different scenarios for these workers, starting with the worst working conditions and then working up to the “best.” Lastly, it points out that there are possible moral issues with the whole setup, citing a lawsuit in which workers for a specific company argued that they were essentially being paid minimum wage working full time with no benefits.

Personal Reflection

I thought it was interesting to better understand how these gig workers came into being and why they’re needed. However, I couldn’t stop thinking about the human element. Yes, the text seems to drive you towards that direction, but it’s not until the last 2-3 pages of the book that it ever actually asks the question “Is this right?” The first few specific anecdotes in the first chapter were chilling – these were people with not just undergraduate degrees but post-graduate degrees who were working for a paycheck that put them below the poverty line. Arguably only the last 2 companies mentioned, Upwork and Amara, were even close to acceptable living conditions. LeadGenius was close, as there were tiers and “promotions” that could be earned through working, but they still seemed to pay very little for quality work. MTurk really stood out as the worst. As talented as the example worker was, she was only earning $16,000 a year working (according to the intro) 10 hours a day, and she was happy it was better than the $4,400 she earned the first year. The text even (insultingly) points out that $4,400 is more than earning $0. Furthermore, working on mTurk required additional work to figure out what good HITs would open and what requesters to avoid, as well as to learn the tips and tricks of the trade. At least in a Starbucks you get paid the full amount during your training. This text and the whole ghost work gig-economy industry feels like share-cropping, where the workers are cheated out of their proper valuation.

Questions

How could the ghost-work/gig-economy be regulated? Is self-regulation as shown by the Mechanical Turk forums and subreddits enough?
Now knowing what life is like for these workers, could you ethically use this service? Are the rosy stories of “I can fill in gaps on my resume” or “I can’t work standard hours so the flexibility is nice” or “I have another source of income so this is just free money” enough to counteract the underpayment of these workers?
Which of the businesses that set up gig workers seems like the best tradeoff for requesters and workers? Why?
What do you think about the idea of having to screen “employers” on Mechanical Turk? How can this impact the pay rate?