03/25/2020 – Vikram Mohanty – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Authors: Ting-Hao (Kenneth) Huang, Joseph Chee Chang, and Jeffrey P. Bigham

Summary

This paper presents Evorus, a crowd-powered conversational assistant designed to automate itself over time. It allows new chatbots to be integrated, reuses prior crowd responses, and learns to automatically approve candidate responses. The paper demonstrates how automation can be deployed efficiently by augmenting an existing system. Users interacted with Evorus through Google Hangouts.

Reflection

There's a lot happening in this paper, but that is justified by the eventual target: a fully automated system. The paper is a great example of how to carefully plan the path from manual origins to an automated system. It is realistic in terms of feasibility, and the transition from a crowd-based system to a crowd-AI collaborative system, aimed eventually at full automation, comes across as organic and efficient in the results.
In terms of their workflow, the authors break the system down into distinct elements, i.e., chatbots and vote bots, and essentially scope the problem down to selecting a chatbot and voting on responses. A far-fetched approach would have been to build (or aim for) an end-to-end (God-mode) chatbot that always gives the perfect response. Because the problem is scoped down and depends on interpretable crowd worker actions, designing a learning framework around those actions and narrower goals becomes feasible. This is a great takeaway from the paper: how to break a complex goal into smaller goals. Instead of attempting to automate an end-to-end complex task, crafting ways to automate smaller, realizable elements along the path seems like a smarter alternative.
The voting classifier was carefully designed, drawing on interpretable and relevant features at the message, turn, and conversation levels, and it was evaluated against a real purpose: reducing the human effort spent on voting. A rough sketch of what such a classifier might look like appears below.
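As a thought exercise, here is a minimal sketch of a vote-approval classifier along these lines, assuming a simple logistic regression over hand-crafted features; the feature set, training data, and threshold are illustrative and not the paper's actual design.

```python
# A toy sketch of a vote-approval classifier, assuming logistic regression
# over hand-crafted features at the message, turn, and conversation levels.
# The features, data, and threshold below are illustrative, not Evorus's.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(candidate, user_turn, history):
    return [
        len(candidate.split()),                        # message level: response length
        int("?" in user_turn),                         # turn level: user asked a question
        len(history),                                  # conversation level: prior turns
        sum(w.lower() in " ".join(history).lower()     # overlap with conversation history
            for w in candidate.split()),
    ]

# Toy training data: past candidate responses and whether the crowd upvoted them.
X = np.array([
    extract_features("It should be sunny and warm today", "What's the weather like?", ["hi", "hello!"]),
    extract_features("asdf", "What's the weather like?", ["hi", "hello!"]),
])
y = np.array([1, 0])  # 1 = crowd approved, 0 = crowd rejected
clf = LogisticRegression().fit(X, y)

# At run time: auto-upvote only when confident, otherwise defer to crowd voters.
candidate = "Sunny this afternoon, around 70F"
p_approve = clf.predict_proba([extract_features(candidate, "What's the weather like?", ["hi"])])[0, 1]
print("auto-upvote" if p_approve > 0.9 else "defer to crowd voters")
```

The key design choice is the confidence threshold: uncertain candidates fall back to the crowd, which is what lets automation grow gradually without degrading answer quality.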
This paper also shows how we can build intelligent systems that improve over time on top of AI engines we cannot (or do not actually need to) modify, e.g., third-party chatbots and off-the-shelf AI APIs. Crowd-AI collaboration can be useful here, so designing the user interaction(s) remains critical for any learning framework layered on top of a fixed AI engine, e.g., the vote bot or the select bot in this paper's case.

Questions

  1. If you are working with an off-the-shelf AI engine that cannot be modified, how do you plan on building a system that improves over time? 
  2. What other (interaction) areas in the Evorus system do you see for a potential learning framework that would improve the performance of the system (according to the existing metrics)?
  3. If you were working on a complex task, would you prefer an end-to-end God-mode solution, or a slower approach of carefully breaking the task down and automating each element?


03/04/2020 – Vikram Mohanty – Combining crowdsourcing and google street view to identify street-level accessibility problems

Authors: Kotaro Hara, Vicki Le, and Jon Froehlich

Summary

This paper examines the feasibility of using AMT crowd workers to label sidewalk accessibility problems in Google Street View. The authors create ground truth datasets with the help of wheelchair users and find that Turkers reach an accuracy of 81%. The paper also discusses quality control and improvement methods, which proved effective, raising accuracy to 93%.

Reflection

This paper reminded me of Jeff Bigham's quote: "Discovery of important problems, mapping them onto computationally tractable solutions, collecting meaningful datasets, and designing interactions that make sense to people is where HCI and its inherent methodologies shine." It's a great example of two important things mentioned in the quote: a) discovery of important problems, and b) collecting meaningful datasets. The paper's contribution section mentions that the collected datasets will be used to build computer vision algorithms, and the workflow involves the potential end-users (wheelchair users) early on in the process. Further, the paper attempts to use Turkers to generate datasets comparable in quality to those produced by the wheelchair users, essentially setting a high quality bar for potential AI training data. This is a desirable approach for training datasets, and it can help prevent the kinds of problems in popular datasets outlined here: https://www.excavating.ai/

The paper also proposes two generalizable methods for improving data quality from Turkers. Filtering out low-quality workers during data collection by seeding in gold standard data may require designing modular workflows, but the time investment may well be worth it. A rough sketch of that filtering idea follows.
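Here is a hedged sketch of the gold-standard seeding idea, assuming each worker's answers can be checked against a handful of verified labels and that workers below an accuracy threshold are excluded; the labels, threshold, and data structures are made up for illustration.

```python
# A rough sketch of gold-standard filtering: compare each worker's answers
# against a few verified labels and drop workers below an accuracy threshold.
# The labels, threshold, and data structures are made up for illustration.

def worker_accuracy(worker_answers, gold_labels):
    """Both arguments map task_id -> label; returns None if no gold tasks were seen."""
    shared = [t for t in worker_answers if t in gold_labels]
    if not shared:
        return None
    return sum(worker_answers[t] == gold_labels[t] for t in shared) / len(shared)

def filter_workers(all_answers, gold_labels, threshold=0.8):
    """Keep answers only from workers who pass the gold-standard check."""
    kept = {}
    for worker, answers in all_answers.items():
        acc = worker_accuracy(answers, gold_labels)
        if acc is None or acc >= threshold:
            kept[worker] = answers
    return kept

gold = {"img_1": "curb_ramp_missing", "img_2": "obstacle_in_path"}
answers = {
    "worker_a": {"img_1": "curb_ramp_missing", "img_2": "obstacle_in_path", "img_3": "surface_problem"},
    "worker_b": {"img_1": "no_problem", "img_2": "no_problem", "img_3": "no_problem"},
}
print(list(filter_workers(answers, gold)))  # ['worker_a'] -- worker_b is filtered out
```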

It's great to see how this work evolved to form the basis for Project Sidewalk, a live project where volunteers can map sidewalk accessibility in their neighborhoods.

Questions

  1. What’s your usual process for gathering datasets? How is it different from this paper’s approach? Would you be willing to involve potential end-users in the process? 
  2. What would you do to ensure quality control in your AMT tasks? 
  3. Do you think collecting more fine-grained data for training CV algorithms will come at the cost of the interface being too complex for Turkers?


03/04/2020 – Vikram Mohanty – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Authors: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris

Summary

This paper studies how crowdsourcing can be used to evaluate automated approaches for generating alt-text captions for BVI (Blind or Visually Impaired) users on social media. Further, the paper proposes an effective real-time crowdsourcing workflow to assist BVI users in interpreting captions. The paper shows that the shortcomings of existing AI image captioning systems frequently hinder a user's understanding of an image they cannot see, to the extent that even clarifying conversations with sighted assistants cannot correct the misunderstanding. The paper concludes with a detailed set of guidelines for future iterations of AI captioning systems.

Reflection

This paper is another example of people working with imperfect AI. Here, the AI is imperfect not because meaningful datasets weren't collected, but because the algorithms were built from constrained datasets without foresight about the application, i.e., alt-text for BVI users. The paper demonstrates a successful crowdsourcing workflow that augments the AI's suggestion, and it serves as motivation for other HCI researchers to design workflows that integrate the strengths of interfaces, crowds, and AI.

The paper reports an interesting finding: the simulated BVI users found it easier to generate a caption from scratch than from the AI's suggestion. This shows how an AI's suggestion can bias a user's mental model in the wrong direction, from which recovery might be costlier than having had no suggestion in the first place. This once again stresses the need for considering real-world scenarios and users in the evaluation workflow.

The solution proposed here is bottlenecked by the challenges of real-time deployment with crowd workers. Despite that, the paper makes an interesting contribution in the form of guidelines for future iterations of AI captioning systems. Involving potential end-users and proposing systematic goals for an AI to achieve is desirable in the long run.

Questions

  1. Why do you think people preferred to generate the captions from scratch rather than from the AI’s suggestions? 
  2. Do you ever re-initialize a system’s data/suggestions/recommendations to start from blank? Why or why not? 
  3. If you worked with an imperfect AI (which is more than likely), how do you envision mitigating its shortcomings when tasked with redesigning the client app?


02/26/2020 – Vikram Mohanty – Will you accept an imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Authors: Rafal Kocielnik, Saleema Amershi, Paul Bennett

Summary

This paper discusses the impact of end-user expectations on the subjective perception of AI-based systems. The authors conduct studies to better understand how different types of errors (i.e., false positives and false negatives) are perceived differently by users, even when accuracy remains the same. The paper uses the context of an AI-based scheduling assistant (in an email client) to demonstrate three design interventions for helping end-users adjust their expectations of the system. The studies show that these three techniques were effective in preserving user satisfaction and acceptance of an imperfect AI-based system.

Reflection

This paper is essentially an evaluation of the first two guidelines from the "Guidelines for Human-AI Interaction" paper, i.e., making clear what the system can do and how well it can do it.

Even though the task in the study was artificial (i.e., using workers from an internal crowdsourcing platform instead of real users of a system and subjecting them to an artificial task instead of a real one), the study design, the research questions, and the inferences from the data initiate a conversation about giving special attention to the user experience in AI-infused systems. Because the tasks were artificial, we could not assess scenarios where users actually have a dog in the fight, e.g., missing an important event by over-relying on the AI assistant and subsequently depending less on its suggestions.

The task here was scheduling events from emails, which is simple enough that users can almost immediately assess how well the system performs. Furthermore, the authors manipulated the dataset to prepare the High Precision and High Recall versions of the system; a rough sketch of such a manipulation appears below. Conducting this study in a real-world scenario would require a better understanding of user mental models with respect to AI imperfections. It becomes trickier when those imperfections cannot be accurately assessed in a real-world context, e.g., search engines may retrieve pages containing the keywords but fail to account for context, and thus may not always give users what they want.
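To make the manipulation concrete, here is a toy sketch (not the paper's actual procedure or numbers) showing how one could hold accuracy fixed while producing a high-precision version that only misses events and a high-recall version that only raises false alarms.

```python
# A toy sketch (not the paper's actual procedure or numbers): start from a
# perfect classifier and inject only one kind of error, so the High Precision
# and High Recall versions share the same accuracy but feel very different.

def simulate_predictions(labels, n_errors, error_type):
    """error_type='fn' flips positives to negatives (misses events);
    error_type='fp' flips negatives to positives (false alarms)."""
    preds = list(labels)
    flipped = 0
    for i, y in enumerate(labels):
        if flipped == n_errors:
            break
        if error_type == "fn" and y == 1:
            preds[i] = 0
            flipped += 1
        elif error_type == "fp" and y == 0:
            preds[i] = 1
            flipped += 1
    return preds

def metrics(labels, preds):
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, preds))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, preds))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, preds))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, preds))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return accuracy, precision, recall

labels = [1] * 10 + [0] * 10                                    # 10 emails with events, 10 without
high_precision_version = simulate_predictions(labels, 4, "fn")  # only misses events
high_recall_version = simulate_predictions(labels, 4, "fp")     # only false alarms
print(metrics(labels, high_precision_version))  # (0.8, 1.0, 0.6)
print(metrics(labels, high_recall_version))     # (0.8, ~0.71, 1.0)
```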

The paper makes an excellent case for digging deeper into error recovery costs and correlating them with why participants in this study preferred a system with a higher rate of false positives. This is critical for system designers to keep in mind while dealing with uncertain agents like an AI core, and it matters even more in high-stakes scenarios.

Questions

  1. The paper starts off with the hypothesis that avoiding false positives is considered better for user experience, and therefore systems are optimized for high precision. The findings however contradicted it. Can you think about scenarios where you’d prefer a system with a higher likelihood of false positives? Can you think about scenarios where you’d prefer a system with a higher likelihood of false negatives?
  2. Did you think the design interventions were exhaustive? How would you have added on to the ones suggested in the paper? If you were to adopt something for your own research, what would it be? 
  3. The paper discusses factoring in other aspects, such as workload, both mental and physical, and the criticality of consequences. How would you leverage these aspects in design interventions? 
  4. If you used an AI-infused system every day (to the extent it’s subconsciously a part of your life)
    1. Would you be able to assess the AI imperfections purely on the basis of usage? How long would it take for you to assess the nature of the AI? 
    2. Would you be aware if the AI model suddenly changed underneath? How long would it take for you to notice the changes? Would your behavior (within the context of the system) be affected in the long term? 


02/26/20 – Vikram Mohanty – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Authors: Jonathan Dodge, Q. Vera Liao, Yunfeng Zhang, Rachel K. E. Bellamy, Casey Dugan

Summary

This paper discusses how different types of programmatically generated explanations can impact people's fairness judgments of ML systems. The authors conduct studies with Mechanical Turk workers by showing them profiles from a recidivism dataset along with explanations of a classifier's decisions. The findings show that certain explanations can enhance people's confidence in the fairness of the algorithm, and that individual differences, including prior positions and judgment criteria regarding algorithmic fairness, affect how people react to different styles of explanation.

Reflection

For the sake of the study, participants were shown only one type of explanation. While that worked for the purpose of this study, there is value in seeing global and local explanations together. For example, the input-influence explanations can highlight the features that make a profile more or less likely to re-offend, and allowing the user to dig deeper into those features through a local explanation can add clarity; a minimal sketch of such a local explanation follows. There is scope for building interactive platforms with the "overview first, details on demand" philosophy. It is, therefore, interesting to see the paper discuss the potential of a human-in-the-loop workflow.
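For illustration, here is a minimal sketch of what an input-influence style local explanation could look like for a single profile, assuming a simple linear classifier; the feature names, weights, and profile values are hypothetical and not taken from the study's dataset.

```python
# A hypothetical input-influence style local explanation for one profile,
# assuming a simple linear classifier. Feature names, weights, and values are
# invented for illustration and are not from the study's recidivism dataset.
import numpy as np

feature_names = ["prior_offenses", "age", "charge_severity"]
weights = np.array([0.8, -0.05, 0.6])    # hypothetical model coefficients
bias = -1.5

profile = np.array([3, 24, 2])           # one (made-up) defendant profile

contributions = weights * profile        # per-feature influence on this decision
score = contributions.sum() + bias
prediction = "likely to re-offend" if score > 0 else "unlikely to re-offend"

print(f"Prediction: {prediction} (score = {score:.2f})")
for name, c in sorted(zip(feature_names, contributions), key=lambda x: -abs(x[1])):
    print(f"  {name}: {c:+.2f}")
```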

I agree with the paper that a focus on data-oriented explanations has the unintended consequence of shifting blame away from the algorithms, which can slow down the "healing process" from the biases we encounter when we use these systems. Re-assessing the "how" explanations, i.e., how the decisions were made, is the right approach. "The Effect of Population and 'Structural' Biases on Social Media-based Algorithms – A Case Study in Geolocation Inference Across the Urban-Rural Spectrum" by Johnson et al. illustrates how bias can be attributed to the design of the algorithms themselves rather than to population biases in the underlying data sources.

The paper makes an interesting contribution regarding participants' prior beliefs and positions and how these shape the way they perceive fairness judgments. In my opinion, as a system developer, it seems like a good option to take a position (an informed one, of course, and one that depends on the task) and advocate for normative explanations, rather than trying to appease everyone and reinforcing meaningless biases that could otherwise have been avoided.

Questions

  1. Based on Figure 1, what other explanations would you suggest? If you were to pick 2 explanations, which 2 would you pick and why?
  2. If you were to design a human-in-the-loop workflow, what sort of input would you seek from the user? Can you outline some high-level feedback data points for a dummy case?
  3. Would normative explanations frustrate you if your beliefs didn't align with them (even though the explanations make perfect sense)? Would you adapt to the explanations? (P.S. Read about the backfire effect here: https://youarenotsosmart.com/2011/06/10/the-backfire-effect/)


02/19/2020 – Vikram Mohanty – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

Summary

This paper, through a case study, highlights the invisible distributed cognition process that goes on underneath a collaborative environment like Wikipedia, and how different actors, both human and non-human, come together to achieve a common goal: banning a vandal on Wikipedia. The authors show the usefulness of trace ethnography as a method for reconstructing user actions and better understanding the role each actor plays in the larger scheme of things. The paper advocates for not dismissing bots as mere force multipliers, but seeing them through a different lens given the wide impact they have.

Reflection

Similar to the “Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator” paper, this paper is a great example that intelligent agents (AI-infused systems, bots, scripts, etc.) should not be studied in isolation only, but through a socio-technical lens. In my opinion, that provides a more comprehensive picture of the goals these agents can and cannot achieve, the collaboration processes they may inevitably transform, the human roles they may affect and other unintended consequences than performance/accuracy metrics alone.

Trace ethnography is a powerful method for reconstructing user actions in a distributed environment and for understanding how multiple actors (human and non-human) achieve a complex objective by subconsciously collaborating with each other. The paper argues that bots/automation/intelligent agents should not be seen as just force multipliers or irrelevant users. This is important, as many current evaluation metrics focus only on quantitative measures such as performance or accuracy. That paints an incomplete, and sometimes irresponsible, picture of intelligent agents, which have now evolved to assume an irreplaceable role in the larger scheme of things (or goals).

The final decision-making privilege resides with the human administrator, and the whole socio-technical pipeline assists each step of decision-making with all the information available, so that checks and bounds (or order, as the paper puts it) are maintained at every stage. Automated decisions, whenever taken, are grounded in some confidence of certainty. In my opinion, while building AI models, researchers should think about the AI-infused system or the real-world setting of which these algorithms would be a part. This might motivate researchers to make the algorithms more transparent or interpretable. Adopting the lens of the user who will wield these models/algorithms might help further.

It's interesting to see some of the principles of mixed-initiative systems being used here, e.g., the history of the vandal's actions, templated messages, showing statuses, etc.

Questions

  1. Do you plan to use trace ethnography in your proposed project? If so, how? Why do you think it’s going to make a difference?
  2. What are some of the risks and benefits of employing a fully automated pipeline in this particular case study i.e. banning a Wikipedia vandal?
  3. A democratic online platform like Wikipedia supports the notion of anyone coming in and making changes, and thus necessitates moderation workflows to curb bad actors. However, if a platform were restrictive to some degree, a post-hoc setup might not be necessary and the platform might be less toxic. This applies not only to Wikipedia but also to SNS like Twitter, Facebook, etc. Which would you prefer, a democratic platform or a restrictive one?


02/19/2020 – Vikram Mohanty – Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator

Summary

This paper thoroughly summarizes existing work on content regulation on online platforms, but focuses on the human-machine collaboration aspect of this domain, which hasn't been widely studied. While most work on automated content regulation has been about introducing or improving algorithms, this work adopts a socio-technical lens to understand how human moderators collaborate with automated moderator scripts. As most online platforms like Facebook and Google are fairly secretive about their moderation activities, the authors focus on Reddit, which allows moderators to use the Reddit API and develop their own scripts for each subreddit. The paper paints a comprehensive picture of how human moderators collaborate around the Automoderator script, the different roles they assume, the other tools they use, and the challenges they face. Finally, the paper proposes design suggestions that can facilitate a better collaborative experience.

Reflection

Even though the Reddit Automoderator cannot be classified as an AI/ML tool, this paper sets a great example of how researchers can better assess the impact of intelligent agents on users, their practices, and their behavior. In most instances, it is difficult for AI/ML model developers to foresee exactly how the algorithms/models they build will be used in the real world. This paper does a great job of highlighting how the moderators collaborate amongst themselves, how different levels of expertise and tech-savviness play a role in the collaboration process, how things like community guidelines are affected, and how the roles of humans changed due to the bots amidst them. Situating the evaluation of intelligent/automated agents in a real-world usage scenario can give us many insights into where to direct our efforts for improvement, or how to redesign the overall system/platform where the intelligent agent is deployed.

It's particularly interesting to see how users (or moderators) with different levels of experience assume different roles with regard to how, or by whom, the Automoderator scripts get modified. It may be an empirical question, but is a quick transition from newcomer/novice to expert useful for the community's health, or are the roles reserved for these newcomers/novices essential? If it's the former, then ensuring a quick learning curve with these bots/intelligent agents should be a priority for developers. Simulating what content will be affected by a particular change in the algorithm/script, as suggested in the discussion, can foster a quick learning curve for users (in addition to minimizing false positives); a toy sketch of such a simulation follows.
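As a rough illustration of that simulation idea, here is a toy keyword rule replayed over past posts to preview what it would remove; this is plain Python for illustration, not the actual AutoModerator rule syntax.

```python
# A toy preview of a rule change: replay a keyword rule over past posts so
# moderators can see what it would remove before deploying it. This is plain
# Python for illustration, not the actual AutoModerator YAML rule syntax.

def rule_matches(post, banned_phrases):
    text = (post["title"] + " " + post["body"]).lower()
    return any(phrase in text for phrase in banned_phrases)

def simulate_rule(past_posts, banned_phrases):
    """Return the posts the proposed rule would have removed."""
    return [p for p in past_posts if rule_matches(p, banned_phrases)]

past_posts = [
    {"title": "FREE giveaway inside!!", "body": "click here to win", "removed_by_mods": True},
    {"title": "Discussion: moderation policy", "body": "thoughts on the new rules?", "removed_by_mods": False},
]

flagged = simulate_rule(past_posts, banned_phrases=["free giveaway", "click here"])
false_positives = [p for p in flagged if not p["removed_by_mods"]]
print(f"{len(flagged)} post(s) flagged, {len(false_positives)} likely false positive(s)")
```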

While the paper comments on how these automated scripts support the moderators, it would have been interesting to see a comparative study of no Automoderator vs. Automoderator. Of course, that was not the goal of this paper, but it could have helped establish that the Automoderator adds to user satisfaction. Also, as the paper mentions, the moderators value their current level of control in the moderation process, and would therefore be uncomfortable in a fully automated setting, or one where they could not explain their decisions. This has major design implications not just for content regulation, but for pretty much any complex, collaborative task. The fact that end-users developed and used their own scripts, tailored to the community's needs, is promising and opens up possibilities for tools that users with little or no technical knowledge can use to easily build and test their own scripts/bots/ML models.

With the introduction of the Automoderator, the role of the human moderators changed from their traditional job of just moderating content to ensuring that the Automoderator's rules are updated, preventing users from gaming the system, and minimizing false positives. Automation creating new roles for humans, instead of replacing them, is pretty evident here. As the role of AI increases in AI-infused systems, it is also important to assess user satisfaction with the new roles.

Questions

  1. Do you see yourself conducting AI model evaluation with a wider socio-technical lens of how they can affect the target users, their practices and behaviors? Or do you think, evaluating in isolation is sufficient?
  2. Would you advocate for AI-infused systems where the roles of human users, in the process of being transformed, get reduced to tedious, monotonous, repetitive tasks? Do you think the moderators in this paper enjoyed their new roles?
  3. Would you push for fully automated systems or ones where the user enjoys some degree of control over the process?


2/5/2020 – Jooyoung Whang – Principles of Mixed-Initiative User Interfaces

This paper seeks to determine when it is better to allow direct user manipulation versus automated services (agents) in a human-computer interaction system. The author arrives at the concept of mixed-initiative user interfaces: systems that seek to achieve maximum efficiency by drawing on the strengths of both sides and their collaboration. In the proposal, the author claims that the major factors to consider when providing automated services are addressing performance uncertainty and predicting the user's goals. According to the paper, many poorly designed systems fail to gauge when to provide automated service and misinterpret user intention. To overcome these problems, the paper argues that automated services should be provided only when it is certain they can give more benefit than the user doing the task manually. The author also writes that effective and natural transfer of control to the user should be provided, so that users can efficiently recover from errors and keep stepping toward their goals. The paper also provides a use case of a system called "LookOut."

I greatly enjoyed and appreciated the example that the author provided. I personally have never used LookOut, but it seemed like a good program from reading the paper. I liked that the program gracefully handled subtleties such as recognizing phrases like "Hmm.." to sense that a user is thinking. It was also interesting that the paper tries to infer a user's intentions using a probabilistic model. I recognized keywords such as utility and agents that also frequently appear in the machine learning context. In my previous machine learning experience, an agent acted according to policies leading to maximum utility scores. The paper's approach is similar, except it involves user input and the utility is the user's goal achievement or intention. The paper was a nice refresher of what I learned in AI courses, as well as a way of putting humans into the context.
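To make the utility framing concrete, here is a small worked example (with hypothetical utility values, not the paper's) of choosing between acting autonomously, offering a dialog, and doing nothing, given the inferred probability that the user wants to schedule an event.

```python
# A small worked example of the expected-utility idea: act autonomously, ask
# via a dialog, or do nothing, depending on the inferred probability that the
# user actually wants to schedule something. Utility values are hypothetical.

def expected_utility(p_goal, u_if_goal, u_if_no_goal):
    return p_goal * u_if_goal + (1 - p_goal) * u_if_no_goal

def choose_action(p_goal):
    options = {
        # (utility if the user wants scheduling, utility if they do not)
        "schedule automatically": (1.0, -0.8),  # great if right, costly if wrong
        "ask via dialog":         (0.7, -0.2),  # helpful if right, mildly annoying if wrong
        "do nothing":             (0.0,  0.0),
    }
    scores = {a: expected_utility(p_goal, *u) for a, u in options.items()}
    return max(scores, key=scores.get)

for p in (0.2, 0.6, 0.95):
    print(f"p(goal) = {p:.2f} -> {choose_action(p)}")
# Low probability -> do nothing, medium -> dialog, high -> act autonomously.
```

With these made-up utilities, low probabilities lead to inaction, intermediate ones to a dialog, and high ones to autonomous action, which mirrors the thresholded behavior the paper describes.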

The following are the questions that I came up with while reading the paper:

1. The paper puts a lot of effort into trying to accurately infer user intention. What if the intention were provided in the first place? For example, the user could start using the system by selecting their goal from a concise list. Would this benefit the system and user satisfaction? Would there be cases where it wouldn't (such as misinterpreting even the explicitly provided goal)?

2. One of the previous week’s readings provided the idea of affordances (what a computer or a human is each better at doing than the other). How does this align with automated service versus direct human manipulation? For example, since computers are better at processing big data, tasks related to this would preferably need to be automated.

3. The paper seems to assume that the user always has a goal in mind when using the system. How about purely exploratory systems? In scientific research settings, there are a lot of times when the investigators don’t know what they are looking for. They are simply trying to explore the data and see if there’s anything interesting. One could claim that this is still some kind of a goal, but it is a very ambiguous one as the researchers don’t know what would be considered interesting. How should the system handle these kinds of cases?


02/05/20 – Vikram Mohanty – Principles of Mixed-Initiative User Interfaces

Paper Authors: Eric Horvitz

Summary

This is a formative paper on how mixed-initiative user interfaces should be designed, taking into account the principles surrounding users' ability to directly manipulate objects and combining them with the principles of interface agents targeted at automation. The paper outlines 12 critical factors for the effective integration of automated services with direct manipulation interfaces, and illustrates these points through different features of LookOut, a piece of software that provides automated scheduling services from emails in Microsoft Outlook.

Reflection

  1. This paper has aged well over the last 20 years. Even though this work has led to updated renditions that take into account recent developments in AI, the core principles outlined here (i.e., being clear about the user's goals, weighing costs and benefits before intervening in the user's actions, letting users refine results, etc.) still hold true today.
  2. The AI research landscape has changed a lot since this paper came out. To give some context, modern AI-based techniques such as deep learning weren't prevalent, due to both the lack of datasets and the lack of computing power. The internet was nowhere near as big as it is right now. The cost of automating everything back then would obviously have been bottlenecked by the lack of datasets. That feels like a strong motivation for aligning automated actions with the user's goals and actions and factoring in context-dependent costs and benefits, e.g., assigning a likelihood that an email message that has just received the focus of attention is in the goal category of "User will wish to schedule or review a calendar for this email" versus "User will not wish to schedule or review a calendar for this email," based on the content of the message. This is predominantly goal-driven and involves exploring the problem space to generate the necessary dataset. Today, we are not bottlenecked by a lack of computing power or the unavailability of datasets, and if we do not follow what the paper advocates about aligning automated actions with the user's goals and actions, or factoring in the context, we may end up with meaningless datasets or unnecessary automation.
  3. These principles do not treat agent intervention lightly at all. In a fast-paced world, in the race towards automation, this particular point might get lost easily. For LookOut’s intervention with a dialog or action, multiple studies were conducted to identify the most appropriate timing of messaging services as a function of the nature of the message. Carefully handling the presentation of automated agents is crucial for a positive user experience.
  4. The paper highlights how the utility of the system taking action when a goal is not desired can depend on any combination of the user's attention status, the screen real estate, or how rushed the user is. This does not seem like something that can be easily determined by the system or the algorithm developers on their own. System developers or designers may have a better understanding of such real-world scenarios, and therefore this calls for researchers from both fields to work together towards a shared goal.
  5. Uncertainties or the limitations of AI should not come in the way of solving hard problems that can benefit users. Designing intelligent user interfaces that leverage the complementary strengths of humans and AI can help solve problems that neither party could solve on its own. HCI folks have long been at the forefront of thinking about how humans will interact with AI, and of doing the work that allows them to do so effectively.

Questions

  1. Which principles, in particular, do you find useful if you are designing a system where the intelligent agent is supposed to aid users in open-ended problems that do not have a clear predetermined right/wrong solution, e.g., search engines or Netflix recommendations?
  2. Why don't we see the "genie" or "Clippy" anymore? What does that tell us about "employing socially appropriate behaviors for agent-user interaction"?
  3. A) For folks who work on building interfaces, do you feel some elements could be made smarter? How do you see yourself using these principles in your work? B) For folks who work on developing intelligent algorithms, do you consider end-user applications in your work? How do you see yourself using these principles? Can you imagine scenarios where your algorithm isn't 100% accurate?


02/05/20 – Vikram Mohanty – Power to the People: The Role of Humans in Interactive Machine Learning

Paper Authors: Saleema Amershi, Maya Cakmak, W. Bradley Knox, Todd Kulesza

Summary

This paper highlights the usefulness of intelligent user interfaces, and the power of human-in-the-loop workflows, for improving machine learning models, and makes the case for moving from traditional machine learning workflows to interactive machine learning platforms. Implicitly, domain experts, i.e., the potential users of such applications, can provide high-quality data points. The role of user interfaces and user experience in facilitating that is illustrated via numerous examples. The paper outlines challenges and future directions of research for better understanding how user interfaces interact with learning algorithms and vice versa.

Reflections

  1. The case study with proteins and biochemists illustrates a classic case of the frustration associated with iterative design while striving to align with user needs. In this example, however, the problem space was focused on getting an ML model right for the users. As the case study showed, interactive machine learning applications seemed to be the right fit for solving this problem, as opposed to the experts iteratively tuning the model by hand (a bare-bones sketch of such an interactive loop appears after this list). The research community is rightfully moving in the direction of producing smarter applications, and in order to ensure more (better?) intelligibility of these applications, building user interfaces/applications for interactive machine learning seems to be an effective and cost-efficient route.
  2. In the realm of intelligent user interfaces, human users are good for far more than just providing quality training data, but my reflection will center on the "human-in-the-loop" aspect to keep the discussion aligned with the paper's narrative. The paper, without explicitly mentioning it, also shows how we can get good-quality training labels without relying solely on crowdsourcing platforms like AMT or Figure Eight, by instead focusing on the potential users of such applications, who are often domain experts. The trade-off between collecting data from novice workers on AMT and from domain experts is pretty obvious: quality vs. cost.
  3. The authors, through multiple examples, also make an effective argument about the inevitable role of user interfaces in ensuring a stream of good-quality data. The paper further stresses the importance of user experiences in generating rich and meaningful datasets.
  4. "Users are People, Not Oracles" is the first point, and it seems to be a pretty important one. If applications are built with the sole intention of collecting training data, there's a risk of the user experience being sacrificed, which may compromise data quality, and the cycle breaks down.
  5. Because it is difficult to decouple the contributions of the interface design and the chosen algorithm, coming up with an effective evaluation workflow seems like a challenge. However, it is very context-dependent, and following recent guidelines such as https://pair.withgoogle.com/ or https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/ can go a long way in improving these interfaces.
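As a companion to point 1, here is a bare-bones sketch of an interactive machine learning loop of the kind the paper advocates, where the model suggests, the domain expert confirms or corrects, and the model updates incrementally; the classes, features, and "expert" here are illustrative stand-ins.

```python
# A bare-bones interactive machine learning loop: the model suggests a label,
# the domain expert confirms or corrects it, and the model updates
# incrementally. Classes, features, and the "expert" are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import SGDClassifier

def interactive_loop(items, get_user_label, featurize, classes=(0, 1)):
    model = SGDClassifier()
    fitted = False
    for item in items:
        x = featurize(item).reshape(1, -1)
        suggestion = int(model.predict(x)[0]) if fitted else 0
        # In a real system the suggestion is shown in the UI here;
        # the expert confirms or corrects it while doing their actual work.
        label = get_user_label(item, suggestion)
        model.partial_fit(x, [label], classes=np.array(classes))
        fitted = True
    return model

# Toy usage: an "expert" who labels longer texts as class 1.
items = ["short note", "a much longer piece of text from the expert's day-to-day work", "tiny"]
featurize = lambda s: np.array([len(s), s.count(" ")], dtype=float)
get_user_label = lambda item, suggestion: int(len(item) > 20)
model = interactive_loop(items, get_user_label, featurize)
```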

Questions

  1. For researchers working on crowdsourcing platforms, even if it's for a simple labeling task, how did you handle poor-quality data? Did you ever re-evaluate your task design (interface/user experience)?
  2. Let's say you work in a team with domain experts. The domain experts use an intelligent application in their everyday work to accomplish a complex task A (the main goal of the team), and as a result, you get data points (let's call them A-data). As a researcher, you see the value of collecting data points B-data from the domain experts, which may improve the efficiency of task A. However, in order to collect B-data, the domain experts have to perform task B, which is an extra task that deviates from A (their main objective and what they are paid for). How would you handle this situation? [This is pretty open-ended]
  3. Can you think of any examples where collecting negative user feedback (which can significantly improve the learning algorithm) also fits the natural usage of the application?
