04/29/2020 – Vikram Mohanty – VisiBlends: A Flexible Workflow for Visual Blends

Authors: Lydia B. Chilton, Savvas Petridis, and Maneesh Agrawala.

Summary

This paper discusses visual blends, an advanced design technique for drawing attention to a message. It presents VisiBlends, a workflow for creating visual blends that decomposes the process into computational techniques and human micro-tasks. Users can collaboratively create visual blends through steps involving brainstorming, synthesis, and iteration. An evaluation of the workflow showed that it improved novices’ ability to create visual blends, and that it works well for both decentralized and co-located groups.

Reflection

Since I personally have a poor sense of design, I greatly appreciated how this paper read and what it has to contribute. Creativity is a complex topic, and designing computational tools to support it can get really tricky. This paper was easy to read and very well supported by figures, helping readers understand not only the concept of visual blends but also the findings. VisiBlends opens up new possibilities for how tools can extend support to other design applications such as web design. (To my knowledge, there are AI engines for generating web design templates, color schemes, etc., but I am not aware of user studies for these AI solutions.)

This paper echoes something that we have read in a lot of papers — decomposing big tasks into smaller, meaningful chunks. However, decomposing creative tasks can become tricky. Here, the steps were simple enough for onboarding novice users, and the algorithm was intuitive. This human-AI collaboration seemed somewhat unique to me, particularly because the success of the whole endeavor also depended on how well the user understood how the algorithm works. This is a stark contrast to the black-box vision of algorithms. Will this become difficult as the algorithm gets more complex in order to support more complex blends?

All of the findings supported the usefulness of the VisiBlends approach. However, I wonder whether the task, or the concept of visual blends itself (ticking off the checklist of requirements), is too complex to grasp on a first attempt. I am sure the training process, which the authors stress is important, was thorough and comprehensive. But, at the end of the day, it boils down to learning from experience. I feel one would understand the requirements of visual blends better through the tool and may face difficulty in the control condition.

I really liked how the participants iterated in different ways to improve upon the initial visual blend. This is a great demonstration of human-machine collaboration, where people use machine suggestions to refine the original parameters and improve the whole process. I am also glad they addressed the issue of gender stereotypes for the “Women in CS” design task as that was something on my mind as well.

Questions

  1. What do you feel about decomposing creative tasks? What are the challenges? Do you think it’s always possible?
  2. Do you think users should always have a good sense of how the algorithm works? When do you think it’s necessary?
  3. What changes are necessary for this to scale up to more complex designs? How would a complex algorithm affect the whole process?


04/29/2020 – Vikram Mohanty – DiscoverySpace: Suggesting Actions in Complex Software

Authors: C. Ailie Fraser, Mira Dontcheva, Holger Winnemöller, Sheryl Ehrlich, and Scott Klemmer.

Summary

This paper proposes DiscoverySpace, an extension panel for Adobe Photoshop that helps onboard new users by helping them execute tasks and explore the software’s features. DiscoverySpace suggests task-level action macros to apply to photographs based on their visual features; these actions are retrieved from an online user community. The findings from the user evaluation showed that it helped novices maintain confidence, accomplish tasks, and discover features.
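
To make the summary concrete, here is a minimal sketch of the kind of matching such a panel could perform; the action names, visual features, and thresholds are all hypothetical and not DiscoverySpace’s actual implementation:

# Hypothetical sketch: suggest community-contributed action macros whose
# preconditions match simple visual features of the current photo.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    requires_faces: bool = False
    max_brightness: float = 1.0  # only suggest if the photo is darker than this

community_actions = [
    Action("Brighten shadows", max_brightness=0.4),
    Action("Smooth skin", requires_faces=True),
    Action("Vintage film look"),
]

def suggest(photo_features, actions):
    """photo_features: dict with 'brightness' in [0, 1] and 'num_faces'."""
    suggestions = []
    for action in actions:
        if action.requires_faces and photo_features["num_faces"] == 0:
            continue
        if photo_features["brightness"] > action.max_brightness:
            continue
        suggestions.append(action.name)
    return suggestions

print(suggest({"brightness": 0.3, "num_faces": 2}, community_actions))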

Reflection

This paper addresses an important problem that I often encounter while building tools/interfaces for novice users — how do we efficiently handle the onboarding process? While DiscoverySpace, in its current form, is far from being the ideal tool, it still opens the door to leveraging the strengths of recommender engines, NLP, and UI design to build future onboarding tools for complex software.

Something we discussed in the last class – DiscoverySpace also demonstrates how need-finding exercises can be translated into design goals, and in doing so, it increases the likelihood of the tool being useful. I wonder how this process can be scaled up into a general-purpose workflow for designing onboarding tools for any software (or maybe that is not necessary).

In one of my previous reflections, I mentioned how in-app recommendations for using different features helped users explore more, a stark contrast to the notion of filter bubbles. This paper demonstrated a similar finding, which leads me to believe that in-app feature recommendations may be useful for exploring the space when users, by themselves, cannot explore it or are unaware of the unknown space. I am hoping to see a future study by the RecSys community, if there isn’t one already, to understand the correlation between tool feature recommendations and the user’s expertise level.

This paper certainly made me think a lot more about general-purpose applicability, i.e., how we can build a toolkit that works for any software. I really liked the discussion section of the paper, as it covered the topics that would essentially form the pathway to such a toolkit. Building a corpus of actions is certainly not impossible, considering how many users a piece of software has. Most plugins and themes are user-generated, and that’s possible because of the low barrier to contribution. Similar pathways and incentives can be created for users to build a repository of actions, which could be easily imported into DiscoverySpace, or whatever the future version is called. The AI engine can also learn, as data grows, when to recommend which kinds of actions. Considering how big the open-source community is, that would be a good starting place to deploy such a toolkit.

Questions

  1. Have you ever had trouble onboarding novice users for your software/tool/interface? What did you wish for, then?
  2. Do you think a general-purpose onboarding toolkit can be built based on the concepts in DiscoverySpace?
  3. The paper mentions the issue of end-user control in DiscoverySpace. What are the challenges of designing DiscoverySpace if we think about extending the user base to expert users (obviously, the purpose will change)?


04/22/2020 – Vikram Mohanty – Opportunities for Automating Email Processing: A Need-Finding Study

Authors: Soya Park, Amy X. Zhang, Luke S. Murray, David R. Karger

Summary

This paper addresses the problem of automating email processing. Through an elaborate need-finding exercise with different participants, the paper synthesizes the different aspects of email management that users would want to automate, and the types of information and computation needed to achieve that. The paper also surveys existing email automation software to understand what has already been achieved. The findings show the need for a richer data model for rules, more ways to manage attention, leveraging internal and external email context, complex processing such as response aggregation, and affordances for senders.

Reflection

This paper demonstrates why need-finding exercises are useful, particularly when the scope of automation is endless and one needs to figure out what deserves attention. This approach also helps developers/companies avoid proposing one-size-fits-all solutions and, when it comes to automation, avoid end-to-end automated solutions that often fall short (in the case of email, it’s certainly debatable what qualifies as an end-to-end solution). Despite the limitations mentioned in the paper, I feel the paper took steps in the right direction by gathering multiple opinions to help scope the email-processing automation problem down into meaningful categories. Probing developers who have shared code on GitHub certainly added great value, helping the authors understand how experts think about the problem.

One of the findings was that users re-purposed existing affordances in their email clients to fit their personal needs. Does that mean the original design did not factor in user needs? Or does it mean that email clients need to evolve along with these new user needs?

NLP can support building richer data models for emails by learning their latent structures over time. I am sure there’s enough data out there for training models. Of course, there will be biases and inaccuracies, but that’s where design can help mitigate the consequences.

Most of the needs were filter/rule-based, and therefore it made sense to deploy YouPS and see how participants used it. Going forward, it will be really interesting to see how non-computer-scientists use a GUI + natural-language version of YouPS to fit their needs. The findings there would make it clear which automation aspects should be prioritized for development first.
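
As a toy illustration of the filter/rule style of automation these needs describe, here is a hedged sketch in plain Python; the message structure and the rule logic are hypothetical and not YouPS’s actual API:

import re

def decide(message):
    """Return an action for a message dict with 'sender', 'subject', and 'body'.
    A hypothetical user rule: archive newsletters, flag advisor emails that
    mention a deadline, and leave everything else alone."""
    sender, body = message["sender"].lower(), message["body"].lower()
    if "newsletter" in sender or "unsubscribe" in body:
        return "archive"            # low-attention mail, skip the inbox
    if sender.endswith("@advisor.example.edu") and re.search(r"\bdeadline\b", body):
        return "flag"               # needs my attention soon
    return "keep"

print(decide({"sender": "news@newsletter.example.com",
              "subject": "Weekly digest",
              "body": "Click here to unsubscribe"}))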

As an end-user of an email client, some, if not most, of my actions happen at a sub-conscious level. For example, there are certain types of emails I mark as read without thinking for even one second. I wonder if a need-finding exercise, as described in this paper, would be able to capture those thoughts. Or, in addition to all the categories proposed in this paper, there could be one where an AI attempts to make sense of your actions and shows you a summary of what it thinks. The user can then reflect on whether the AI’s “sensemaking” holds up or needs tweaking, and eventually let it be automated. This would be a mixed-initiative solution that can, over a period of time, effectively adapt to the user’s needs, though it certainly depends on the AI being good enough to interpret the patterns in the user’s actions.

Questions

  1. Keeping the scope/deadlines of the semester class project aside, would you consider a need-finding exercise for your class project? How would you do it? Who would be the participants?
  2. Did you find the different categories for automated email processing exhaustive? Or would you have added something else?
  3. Do you employ any special rules/patterns in handling your email?


04/22/2020 – Vikram Mohanty – SOLVENT: A Mixed Initiative System for Finding Analogies between Research Papers

Authors: Joel Chan, Joseph Chee Chang, Tom Hope, Dafna Shahaf, Aniket Kittur.

Summary

This paper addresses the problem of finding analogies between research problems across similar or different domains by providing computational support. The paper proposes SOLVENT, a mixed-initiative system where humans annotate aspects of research papers that denote their background (the high-level problems being addressed), purpose (the specific problems being addressed), mechanism (how they achieved their purpose), and findings (what they learned/achieved), and a computational model constructs a semantic representation from these annotations that can be used to find analogies among the research papers. The authors evaluated this system against baseline information retrieval approaches and also with potential target users, i.e., researchers. The findings showed that SOLVENT performed significantly better than the baseline approaches, and the analogies were useful for the users. The paper also discusses implications for scaling up.

Reflection

This paper demonstrates how human-interpretable feature engineering can improve existing information retrieval approaches. SOLVENT addresses an important problem faced by researchers, i.e., drawing analogies to other research papers. Drawing from my own personal experiences, this problem has presented itself at multiple stages, be it while conceptualizing a new problem, figuring out how to implement a solution, trying to validate a new idea, or eventually, writing the Related Work section of a paper. It goes without saying that SOLVENT, if commercialized, would be a boon for the thousands of researchers out there. It was nice to see the evaluation include real graduate students, as their validation seemed the most applicable for such a system.

SOLVENT demonstrates the principles of mixed-initiative interfaces effectively by leveraging the complementary strengths of humans and AI. Humans are better at understanding context, in this case, that of a research paper. AI can help by quickly scanning a database to find other articles with similar “context”. I really like the simple idea behind SOLVENT, i.e., how would we, humans, find analogical ideas? We would look for a similar purpose and/or similar or different mechanisms. So, how about we do just that? It’s a great case of how human-interpretable intuitions translate into intelligent system design. Something I have reflected on in previous papers — it always helps to look for answers by beginning from the problem and understanding it better, and that’s reflected in what SOLVENT ultimately achieves, i.e., scoring over an end-to-end automation approach.
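
To make that intuition concrete, here is a minimal sketch (my own, not the authors’ exact pipeline) that ranks papers by a similar purpose while penalizing a similar mechanism, using TF-IDF vectors over hypothetical human annotations:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical annotated corpus: each paper has "purpose" and "mechanism" spans.
papers = [
    {"id": "A", "purpose": "detect cracks in bridges early",
     "mechanism": "vibration sensors and signal processing"},
    {"id": "B", "purpose": "identify structural damage in buildings",
     "mechanism": "crowdsourced photo annotation"},
    {"id": "C", "purpose": "summarize long legal documents",
     "mechanism": "sequence-to-sequence neural models"},
]

query = {"purpose": "find early signs of damage in infrastructure",
         "mechanism": "acoustic sensing"}

texts = [p["purpose"] for p in papers] + [p["mechanism"] for p in papers]
vectorizer = TfidfVectorizer().fit(texts + [query["purpose"], query["mechanism"]])

def sim(a, b):
    return cosine_similarity(vectorizer.transform([a]), vectorizer.transform([b]))[0, 0]

# Rank by similar purpose, then penalize a similar mechanism to surface
# "same problem, different approach" analogies.
for p in papers:
    score = sim(query["purpose"], p["purpose"]) - 0.5 * sim(query["mechanism"], p["mechanism"])
    print(p["id"], round(score, 3))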

The findings are definitely interesting, particularly the drive for scaling up. Turkers certainly provided an improvement over the baseline, even though their annotations fared worse than those of the experts and the Upwork crowd. I am not sure what the longer-term implication here is, though. Should Turkers be used to annotate larger datasets? Or should the researchers figure out a way to improve Turker annotations? Or train the annotators? These are all interesting questions. One long-term implication is to re-format abstracts into a background + purpose + mechanism + findings structure right at the submission stage, although this still does not cover the thousands of prior papers. Overall, this paper certainly opens doors for future analogy-mining approaches.

Questions

  1. Should conferences and journals re-format the abstract template into a background + purpose + mechanism + findings to support richer interaction between domains and eventually, accelerate scientific progress?
  2. How would you address annotating larger datasets?
  3. How did you find the feature engineering approach used in the paper? Was it intuitive? How would you have done it differently?


04/15/2020 – Vikram Mohanty – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

Authors: An T. Nguyen, Aditya Kharosekar, Saumyaa Krishnan, Siddhesh Krishnan, Elizabeth Tate, Byron C. Wallace, and Matthew Lease

Summary

This paper proposes a mixed-initiative approach to fact-checking, combining human and machine intelligence. The system automatically finds and retrieves relevant articles from a variety of sources. It then infers the degree to which each article supports or refutes the claim, as well as the reputation of each source. Finally, the system aggregates this body of evidence to predict the veracity of the claim. Users can adjust the source reputation and stance of each retrieved article to reflect their own beliefs and/or correct any errors they perceive, which in turn updates the AI model. The paper evaluates this approach through a user study on Mechanical Turk.
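
For intuition, here is a minimal sketch of the kind of evidence aggregation described above; the weighting scheme and numbers are assumptions on my part, not the paper’s actual model:

from dataclasses import dataclass

@dataclass
class Evidence:
    source: str
    stance: float      # -1.0 (refutes) .. +1.0 (supports), model-predicted
    reputation: float  # 0.0 .. 1.0, model-predicted source reliability

def predict_veracity(evidence):
    """Reputation-weighted average of stances; > 0 leans true, < 0 leans false."""
    total_weight = sum(e.reputation for e in evidence) or 1.0
    return sum(e.stance * e.reputation for e in evidence) / total_weight

evidence = [
    Evidence("site-a.example", stance=+0.8, reputation=0.9),
    Evidence("site-b.example", stance=-0.4, reputation=0.3),
]
print(predict_veracity(evidence))   # initial prediction

# A user disagrees with the predicted stance for site-b and corrects it;
# the aggregate prediction updates accordingly.
evidence[1].stance = +0.2
print(predict_veracity(evidence))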

Reflection

This paper, in my opinion, succeeds as a nice implementation of all the design ideas we have been discussing in class for mixed-initiative systems. It factors in user input, combines it with an AI model’s output, and shows users a layer of transparency in terms of how the AI makes its decision. However, fact-checking, as a topic, is too complex to be solved by a simplistic single-user prototype. So, I view this paper as opening up doors for building future mixed-initiative systems that can rely on similar design principles but also factor in the complexities of fact-checking (which may require multiple opinions, user-user collaboration, etc.).

Therefore, for me, this paper contributes an interesting concept in the form of a mixed-initiative prototype, but beyond that, I think the paper falls short of making clear who the intended users are (end-users or journalists) or the scenario it is designed for. The evaluation with Turkers seemed to indicate that anyone can use it, which opens up the possibility of very easily creating individual echo chambers and essentially making the current news-consumption landscape worse.

The results also showed the possibility of the AI biasing users when it is wrong, and a future design would have to factor that in. One of the users felt overwhelmed because there was a lot going on with the interface, so a future system also needs to address the issue of information overload.

The authors, however, did a great job discussing these points in detail, including the potential for misuse and some of the limitations. Going forward, I would love to see this work form the basis for a more complex socio-technical system that allows for nuanced inputs from multiple users, interaction with a fact-checking AI model that can improve over time, and a longitudinal evaluation with journalists and end-users on actual dynamic data. The paper, despite the flaws arising from the topic, succeeds in demonstrating human-AI interaction design principles.

Questions

  1. What are some of the positive takeaways from the paper?
  2. Did you feel that fact-checking, as a topic, was addressed in a very simple manner, and deserves more complex approaches?
  3. How would you build a future system on top of this approach?
  4. Can a similar idea be extended for social media posts (instead of news articles)? How would this work (or not work)?


04/15/2020 – Vikram Mohanty – Algorithmic Accountability: Journalistic Investigation of Computational Power Structures

Authors: Nicholas Diakopoulos

Summary

This paper discusses the challenges involved in algorithmic accountability reporting and the reverse engineering approaches used to frame a story. The author interviewed four journalists who have reported on algorithms, and discusses five different case studies to present the methods and challenges involved. Finally, the paper outlines the need for transparency and potential ethical issues.

Reflection

This paper offers great insights into the decision-making process behind the reporting on different algorithms and applications. It is particularly interesting to see the lengths journalists go to in figuring out a story and its reporting value. The paper is a great read even for non-technical folks, as it introduces the concepts of association, filtering, classification, and prioritization with examples that can be understood universally. While discussing the different case studies, the paper manages to paint a picture of the challenges the journalists encountered in a very easy-to-understand manner (e.g., incorrectly determining that Obama’s campaign targeted users by age) and therefore succeeds in showing why reporting on algorithmic accountability is hard!

In most cases, the space of potential inputs is too large to be explored easily, making the field even more challenging. This often necessitates the skills of computational social scientists to conduct additional studies, collect additional data, and draw inferences. The paper makes a great point about reverse engineering offering more insights than directly asking the algorithm developers, as unintended consequences would never surface without investigating the algorithms in operation. Another case of “we need more longitudinal studies with ecological validity”!

It was very interesting to see the discussion around last-mile interventions at the user-interface stage (in the autocomplete case). It shows that (some of the) developers are self-aware and therefore try to ensure that the user experience is an ethical one. Even though they may fall short, it’s a good starting point. This also demonstrates why augmenting an existing pipeline (be it data/AI APIs or models) to make it work for the end-user is desirable (something that some of the papers discussed in this class have shown).

The questions around ethics, as usual, do not have an easy answer — for instance, whether such reporting enables developers to make their algorithms more difficult to investigate in the future. However, regulations around transparency can go a long way in holding algorithms accountable. The paper does a great job synthesizing the challenges across all the case studies and outlines four high-level points for how algorithms can become transparent.

Questions

  1. Would you add anything more to the reverse engineering approaches discussed for the different case studies in the paper? Would you have done anything differently?
  2. If you were to investigate the power structures of an algorithm, which application/algorithm would you choose? What methods would you follow?
  3. Any interesting case studies that this paper misses out on?


04/08/2020 – Vikram Mohanty – CrowdScape: Interactively Visualizing User Behavior and Output

Authors: Jeffrey M Rzeszotarski, Aniket Kittur

Summary

This paper proposes CrowdScape, a system that supports human evaluation of crowd work through interactive visualization of behavioral traces and worker output, combined with mixed-initiative machine learning. Different case studies are discussed to showcase the utility of CrowdScape.

Reflection

The paper addresses the issue of quality control, a long-standing problem in crowdsourcing, by combining two standalone approaches that researchers currently adopt: a) inferring quality from worker behavior and b) analyzing worker output. Combining these signals is advantageous as it provides a more complete picture, either by offering corroborating evidence about good workers or, in some cases, complementary evidence that helps identify them. Just analyzing the worker output might not be enough, as there’s an underlying chance that it is no better than a random coin toss.

Even though it was a short text in parentheses, I really liked the fact that the authors explicitly sought permission to record the worker interaction logs. 

Extrapolating to other similar or dissimilar behavior using machine learning seems intuitive here, as the data and features (i.e., the building blocks) of the model are very meaningful and perfectly relevant to the task, rather than a black box. As a result, it’s not surprising to see it work almost everywhere. In the one case where it didn’t work, the system made up for it by showing that the complementary case works. This sets a great example for designing predictive models on top of behavioral traces that actually work.
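
As a rough illustration of this idea, and purely as an assumption-laden sketch rather than CrowdScape’s implementation, one could turn raw interaction logs into per-worker behavioral features and retrieve workers whose behavior resembles a hand-picked “good” example:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def featurize(events):
    """events: list of (timestamp_sec, event_type) tuples for one worker."""
    times = [t for t, _ in events]
    return np.array([
        (max(times) - min(times)) if times else 0.0,       # total time on task
        sum(1 for _, e in events if e == "keypress"),       # typing activity
        sum(1 for _, e in events if e == "scroll"),         # reading/scrolling
        sum(1 for _, e in events if e == "focus_change"),   # tab switching
    ])

# Hypothetical trace logs keyed by worker id.
logs = {
    "w1": [(0, "focus_change"), (5, "scroll"), (40, "keypress"), (90, "keypress")],
    "w2": [(0, "keypress"), (2, "keypress"), (4, "keypress")],
    "w3": [(0, "scroll"), (30, "scroll"), (70, "keypress"), (120, "keypress")],
}

workers = list(logs)
X = np.stack([featurize(logs[w]) for w in workers])

# The requester marks w1 as a "good" worker; retrieve behaviorally similar workers.
neighbors = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = neighbors.kneighbors(X[[workers.index("w1")]])
print([workers[i] for i in idx[0]])   # w1 itself plus its nearest neighbor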

Moreover, the whole system was built agnostic of the task, and the evaluations justified it. However, I am not sure whether the system’s best use case is vetting multiple workers recruited for a single task, or identifying a set of good workers to subsequently retain for other tasks in the pipeline. I am guessing it is the latter, as the former might be an expensive approach to getting high-quality responses.

On the other hand, I feel the implications of this paper go beyond just crowdsourcing quality control. CrowdScape, or a similar system, can provide assistance for studying user behavior/experience in any interface (web for now), which is important for evaluating interfaces.

Questions

  1. Does your evaluation include collecting behavioral trace logs? If so, what are some of your hypotheses regarding user behavior?
  2. How do you plan on assessing quality control?
  3. What kind of tasks do you see CrowdScape being best applicable for? (e.g. single task, multiple workers)


04/08/2020 – Vikram Mohanty – Agency plus automation: Designing artificial intelligence into interactive systems

Authors: Jeffrey Heer

Summary

The paper discusses interactive systems in three different areas — data wrangling, exploratory analysis, and natural language translation — to showcase the use of “shared representations” of tasks, where machines augment human capabilities instead of replacing them. All the systems highlight the balancing of the complementary strengths and weaknesses of humans and machines, while promoting human control.

Reflection

This paper makes the case for intelligence augmentation, i.e., augmenting human capabilities with the strengths of AI rather than striving to replace them. Developers of intelligent user interfaces can build effective collaborative systems by carefully designing the interface to ensure that the AI component “reshapes” the shared representations that users contribute to, rather than “replacing” them. This is always a complex task, and therefore requires scoping down from the notion that AI can automate everything by focusing on these editable shared representations. This has another benefit: it exploits the strengths of AI in a sum-of-parts manner rather than as an end-to-end mechanism, where the AI is more likely to be erroneous. The paper discusses three different case studies where a mixed-initiative deployment succeeded in meeting user expectations in terms of experience and output.

It was particularly interesting to see participants complaining that the Voyager system, despite being good, spoilt them as it made them think less. This can hamper the adoption of such systems. A reasonable design implication here would be allowing users to choose the features they want, or giving them the agency to adjust the degree of automation/suggestions. This also suggests the importance of conducting longitudinal studies to understand how users use the different features of an interface, i.e., whether they use one but not the other.

According to some prior work, machine-suggested recommendations have been known to perpetuate filter bubbles; in other words, users are exposed to a similar set of items and miss out on the rest. Here, the Voyager recommendations work in contrast to that prior work by allowing users to explore the space, analyze different charts and data points they wouldn’t otherwise notice, and combat confirmation bias. In short, the system does what it claims to do, i.e., augment the capabilities of humans in a positive sense using the strengths of the machine.

Questions

  1. In the projects you are proposing for the class, does the AI component augment the human capabilities or strive to replace it (eventually)? If so, how?
  2. How do you think developers should cater to cases where users are less likely to adopt a system because it impedes their creativity?
  3. Do you think AI components should allow users to explore the space more than they normally would? Any possible pitfalls (information overload, unnatural tasks/interactions, etc.)?


3/25/20 – Jooyoung Whang – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

This paper attempts to measure the performance of a visual conversational bot called ALICE with a human teammate, as opposed to what modern AI research commonly does, which is measuring performance against a counterpart AI. The authors design and deploy two versions of ALICE: one trained by supervised learning and the other by reinforcement learning. The authors had MTurk workers hold a Q&A session with ALICE to discover a hidden image, shown only to ALICE, within a pool of similar images. After a fixed number of questions, the MTurk workers were asked to guess which one was the hidden image. The authors evaluated performance using the resulting mental rankings of the hidden image after the users’ conversations with the AI. Previous work had found that the bot trained with reinforcement learning performed better than the other; however, the authors discover that there is no significant difference when the bots are evaluated in a human-AI team.

This paper was a good reminder that the ultimate user at the end is a human. It’s easy to forget that what computers prefer does not automatically translate over to a human’s case. It was especially interesting to see that a significant performance difference in an AI-AI setting became minimal with humans in the loop. It made me wonder what it was about the reinforcement-learned ALICE that QBOT preferred over the other version. Once we find that distinguishing factor, we might be able to help humans learn and adapt to the AI, leading to improved team performance.

It was a little disappointing that the same study with QBOT as the subject was left for future work. I would have loved to see the full picture. It could have also provided insight into what I’ve written above: what was it about the reinforcement-learned version that QBOT preferred?

This paper identified that there’s still a good distance between human cognition and AI cognition. If further studies find ways to minimize this gap, it will allow a quicker AI design process, where the resulting AI will be effective for both human and AI without needing extra adjustments for the human side. It would be interesting to see if it is possible to train an AI to think like a human in the first place.

These are the questions I had while reading this paper:

1. This paper was presented in 2017. Do you know of any other studies done since then that measured human-AI team performance? Do they agree that there’s a gap between humans and AIs?

2. If you have experience training visual conversational bots, do you know if a bot prefers some information over others? What is the most significant difference between a bot trained with supervised learning and one trained with reinforcement learning?

3. In this study, the MTurk workers were asked to make a guess after a fixed number of questions. The study does not measure the minimum or maximum number of questions needed, on average, to make an accurate guess. Do you think the accuracy of the guesses will increase proportionally as the number of questions increases? If not, what kind of curve do you think it will follow?


03/25/2020 – Vikram Mohanty – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Authors: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh

Summary

In this paper, the authors propose coming up with realistic evaluation benchmarks in the context of human-AI teams, as opposed to evaluating AI systems in isolation. They introduce two chatbots, one better than the other on the basis of standalone AI benchmarks. However, when the chatbots are evaluated in a task setting that mimics their intended use, i.e., humans interacting with chatbots to accomplish a common goal, they perform roughly the same. The work essentially suggests a mismatch between benchmarking AI in isolation and in the context of human-AI teams.

Reflection

The core contribution of this paper is showing that evaluating AI systems in isolation will never give us the complete picture, and that we should therefore evaluate AI systems under the conditions they are intended to be used in, with the target users who will be using them. In other words, the need for ecological validity in the evaluation study is stressed here. The flip side of this contribution is, in some ways, reflected in the trend of AI systems falling short of their intended objectives in real-world scenarios.

Even though the GuessWhich evaluation was closer to a real-world scenario than vanilla isolated evaluation methods, it still remains an artificial evaluation. However, the gap between it and a possible real-world scenario (where a user is actually interacting with a chatbot to accomplish some real-world task like planning a trip) would be minimal.

The responses returned by the two bots are not wildly different (beige vs. brown), since one was the base for the other, and therefore a human brain can adapt dynamically to the chatbot responses and accomplish the overall goal. It would also have been interesting to see how performance changes when the AI is drastically different, or sends someone down the wrong path.

This paper shows why it is important for AI and HCI researchers to work together to come up with meaningful datasets, setting up a realistic ecosystem for an evaluation benchmark that would be more relevant to potential users.

Questions

  1. If, in the past, you compared algorithms solely on the basis of precision-recall metrics (let’s say, you built an algorithm and compared it with the baseline), do you feel the findings would hold up in a study with ecological validity? 
  2. How would you evaluate a conversational agent? (Suggest something different from the GuessWhich platform)
  3. How much worse, better, or different would a chatbot have to be for humans to perform significantly differently than they do with the current ALICE chatbots in the GuessWhich evaluation setting? (Any kind of subjective interpretation welcome)
