4/29/20 – Lee Lisle – DiscoverySpace: Suggesting Actions in Complex Software

Summary

               Fraser et al.’s paper discusses how difficult some software can be to learn, especially for novices, and proposes a way to ameliorate the steep learning curves such programs often possess. Their solution is the DiscoverySpace interface – a tool that suggests task-level macros to help novice users discover new capabilities of the software they want to use. They test this with a prototype that helps participants learn how to use the popular image-editing program Photoshop. After gathering 115 different macros from various sources, they designed their workflow and interface and ran a study to evaluate it. In a study with 28 participants, they found that participants were significantly less likely to be discouraged or to say they couldn’t figure something out when the DiscoverySpace tool was installed alongside Photoshop.
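
To make the idea of suggested actions concrete, here is a rough sketch of my own (not the authors’ implementation) of how simple image properties might be mapped to one-click actions; the property names and action labels are entirely hypothetical.

```python
# Hypothetical sketch of DiscoverySpace-style action suggestion:
# map simple image properties to ready-made "actions" (macros).
# The property names and action list below are illustrative only.

def suggest_actions(image_props):
    """Return a list of suggested one-click actions for an image."""
    suggestions = []
    if image_props.get("mean_brightness", 0.5) < 0.3:
        suggestions.append("Brighten shadows")
    if image_props.get("is_portrait", False):
        suggestions.append("Smooth skin")
        suggestions.append("Whiten teeth")
    if image_props.get("saturation", 0.5) < 0.2:
        suggestions.append("Vintage black-and-white look")
    return suggestions or ["Auto-enhance"]

print(suggest_actions({"mean_brightness": 0.2, "is_portrait": True}))
# ['Brighten shadows', 'Smooth skin', 'Whiten teeth']
```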

Personal Reflection

               This paper provides a fantastic workflow for easing novice users into new and difficult programs. I liked it because it provides a slightly more customized experience than the YouTube video walkthroughs and online tutorials that I’m accustomed to using. I would even like to actually use this interface for Photoshop, as it’s one program I attempted to break into a few times early in my college career, always failing because there were too many features described too obliquely.

               I was surprised that the authors removed the suggestions that contained pauses and dialogue. I would have expected those to be better able to give the user the appropriate background for the effects they wanted to achieve. However, once they explained their reasoning – that the explanations often were not enough and confused users – it made a lot more sense to remove them altogether.

               I’m not sure how I feel about their comment that they will later support “paid” actions, where macros that do something considered of higher quality require some form of compensation for the macro creator. I don’t think an academic paper is the place for that sort of suggestion, as it doesn’t really add to the software or the approach the paper presents. Any tool an academic paper presents could be used in commercial software, so why is that of particular note here?

               Lastly, and this is more of a quibble, I was more put off than I expected by the text-wrapped images seen in figures 3 and 4. That layout is harder to read here than it would be in a casual magazine-type publication, and it should be reserved for that sort of venue.

Questions

  1. Do you think the 115 collected actions are enough of a testbed for the prototype tool? That is, should they have more in order to cover a broader range of possible uses? How would they generate more?
  2. Beyond using image analysis to present some initial ideas to the users, what other ways might you improve their approach to make it more automated, or do you think there’s enough or too much automation already?
  3. What other programs could use this approach, and how might they integrate it into their platforms?


4/29/20 – Lee Lisle – VisiBlends: A Flexible Workflow for Visual Blends

Summary

Chilton et al.’s paper describes a flexible workflow for creating visual blends that the authors dub “VisiBlends” (do you see what they did there?). Lack of imagination in naming notwithstanding, the workflow takes two concepts as input and moves through brainstorming, image finding and annotation, automatic blending, and evaluation of the results. They performed three studies testing the workflow, showing that decentralized groups of people can brainstorm and generate blends through microtasks. The first phase involves associating different words with the input themes to create a broad base of kinds of images. The second phase involves searching for related imagery. The third phase asks crowdworkers to annotate the found images for basic shapes and the coverage of those shapes. The fourth stage is performed by an AI and involves shape matching between images to combine the two themes, while the final stage (also automated) blends the images based on those matches. Their studies confirm that decentralized groups, collaborative groups, and novices can all use this workflow to create visual blends.
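
As a way to think through the shape-matching stage, here is a minimal sketch of my own of how annotated images from the two concepts might be paired; the annotation format and matching rule are my assumptions, not the authors’ actual code.

```python
# Rough sketch of the shape-matching stage: pair images of the two
# concepts whose annotated basic shape matches and whose shape
# coverage is similar. The annotation format here is an assumption.

def match_shapes(concept_a, concept_b, coverage_tol=0.2):
    """Return candidate (a, b) image pairs suitable for blending."""
    matches = []
    for a in concept_a:
        for b in concept_b:
            same_shape = a["shape"] == b["shape"]
            similar_coverage = abs(a["coverage"] - b["coverage"]) <= coverage_tol
            if same_shape and similar_coverage:
                matches.append((a["file"], b["file"]))
    return matches

orange_imgs = [{"file": "orange1.jpg", "shape": "circle", "coverage": 0.9}]
globe_imgs = [{"file": "globe1.jpg", "shape": "circle", "coverage": 0.8},
              {"file": "map1.jpg", "shape": "rectangle", "coverage": 0.7}]
print(match_shapes(orange_imgs, globe_imgs))  # [('orange1.jpg', 'globe1.jpg')]
```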

Personal Reflection

               I liked this work overall as a way for people to get interesting designs out of a few keywords about whatever they’re working on. I was somewhat surprised that the second step (“Finding Images”) was not an automatic process. When I read the introduction, I had figured this step was automated by image-recognition software, since these are not complex images but images of single objects. The Workflow section, however, makes it clear that these images are essentially another phase of the brainstorming process. Even so, I was concerned that this was a rather complex microtask, since it asked workers to apply several somewhat involved filters and then find ten images that satisfied them.

               I thought the images in Figure 7 were somewhat deceptive, however. The caption states that there was “aesthetic editing by an artist,” which implies they already had a visual designer employed. If that was the case, why is the expert not performing the expert task? I would have liked to see the actual resulting images, as they do show (some of them) in the later studies.

               The refinement process they introduced in the first study was also interesting, in that refinement was more than just asking for more results – the user actually iterated on the design process to find similarly shaped items between the two categories. This shows human intelligence being applied to a problem the AI had difficulty solving – recognizing why the AI was having trouble was a key part of the process.

               Lastly, I would have liked to see what would have happened if all three groups (decentralized, group collaboration, and novices) were given the same concepts to generate ideas from. Which group might have performed the best? Also, since this is quite decentralized, I would have liked to see an mTurk deployment to see how the crowd could perform this task as well.

Questions

  1. As discussed above, an interesting part of this paper was how human intelligence was employed to refine the AI’s process, thereby giving it better inputs. In what other situations is human insight into why an AI is struggling a good way to solve the problem?
  2. When a workflow creates microtasks like this, is it more helpful to test it with participants who come into a lab or with workers recruited through a crowdworking site like Mechanical Turk? Should both be done? Why or why not?
  3. Would you use a service like this for a creative image in your work? Why or why not?


4/22/20 – Lee Lisle – Opportunities for Automating Email Processing: A Need-Finding Study

Summary

Park et al.’s paper covers the thankless task of email management. They discuss how people spend too much time reading and responding to emails, and how it would be nice to have some automation for dealing with the deluge of electronic ASCII flooding our days. They interviewed 13 people in a design-workshop setting, where participants came up with 42 different rules for dealing with email; from these rules, they identified five overarching categories. The authors then sent out a survey and received 77 responses on how people would use a “smart robot” to handle their email, identifying 6 categories of possible automation. They also turned to GitHub to find existing automation that coders have built for email, searching for codebases that work with the IMAP standard, which yielded 8 more categories. Taking all of this data, the authors created an email automation tool they call YouPS (cute) and identified how today’s email clients would need to change to fully support the desired automation.

Personal Reflection

               I have to admit, when I first saw that they specified they gathered 13 “email users,” I laughed. Isn’t that just “people”? Furthermore, a “smart robot” is just a machine learning algorithm. And then there’s the name of their mail handler, “YouPS.” This paper is full of funny little expressions and puns that I aspire to create one day.

               While I liked that they found that senders wanted recipients to have an easier time dealing with their email, I wasn’t terribly surprised by it. If I want a reply to an email, I’d rather the recipient get it and be able to deal with it immediately than risk them forgetting about my request altogether. That’s the best of both worlds, where all parties involved have the right amount of time to apply to pressing concerns.

               I also appreciated that they were able to get responses from people not affiliated with a university, as research is often too narrowly focused on college students.

               Lastly, I enjoyed the abstraction they created with their YouPS system. While it is essentially just an API that lets users write standard Python against an email library, it seems genuinely useful for many different tasks.
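
I don’t know the exact YouPS API from this reading alone, so as a hedged sketch, here is what a comparable rule looks like in plain Python using the standard imaplib library; the server, credentials, mailbox name, and the “archive newsletters” rule are all hypothetical.

```python
import imaplib
import email

# Hedged sketch of a YouPS-style rule written with the standard library.
# The server, credentials, and "archive newsletters" rule are hypothetical.
conn = imaplib.IMAP4_SSL("imap.example.com")
conn.login("me@example.com", "app-password")
conn.select("INBOX")

_, data = conn.search(None, "UNSEEN")
for num in data[0].split():
    _, msg_data = conn.fetch(num, "(RFC822)")
    msg = email.message_from_bytes(msg_data[0][1])
    # Rule: move obvious newsletters out of the inbox automatically.
    if "list-unsubscribe" in (k.lower() for k in msg.keys()):
        conn.copy(num, "Newsletters")        # assumes this folder exists
        conn.store(num, "+FLAGS", "\\Deleted")
conn.expunge()
conn.logout()
```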

Questions

  1. What is your biggest pet peeve about the way email is typically handled? How might automation solve that issue?
  2. Grounded Theory is a method that pulls a ton of data out of written or verbal responses, but requires a significant effort. Did the team here effectively use grounded theory, and was it appropriate for this format? Why or why not?
  3. How might you solve sender issues in email? Is it a worthwhile goal, or is dealing with those emails trivial?
  4. What puns can you create based on your own research? Would you use them in your papers? Would you go so far as to include them in the titles of your works?


4/22/20 – Lee Lisle – SOLVENT: A Mixed Initiative System for Finding Analogies between Research Papers

Summary

Chan et al.’s paper discusses a way to find analogies between research papers through mixed-initiative analysis. They combine humans, who annotate sections of abstracts, with machine learning algorithms that identify key words in those sections, distilling the research down to a base analogy. They then compare across abstracts to find papers with the same or similar characteristics. This enables researchers to find related research as well as to potentially apply new methods to different problems. They evaluated the technique through three studies. The first used grad students annotating abstracts from their own domain as a “best-case” scenario; the tool worked very well with the annotated data compared to using all words. The second study looked at finding analogies for similar problems, using out-of-domain experts to annotate abstracts; the tool found more possible new directions than the all-words baseline. Lastly, the third study sought to scale up using crowdsourcing. While the annotations from MTurkers were of lesser quality, they still outperformed the all-words baseline.
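
As a hedged sketch of the underlying idea – not the authors’ actual pipeline – comparing two papers by their annotated “purpose” words might boil down to averaging word vectors and taking a cosine similarity. The tiny embedding table below is fabricated purely for illustration.

```python
import numpy as np

# Hedged sketch of analogy scoring over annotated abstract spans:
# average word vectors for the "purpose" words of each paper and
# compare with cosine similarity. The tiny embedding table is fake.
embeddings = {
    "reduce":  np.array([0.9, 0.1]),
    "noise":   np.array([0.8, 0.3]),
    "denoise": np.array([0.85, 0.2]),
    "images":  np.array([0.1, 0.9]),
    "audio":   np.array([0.2, 0.8]),
}

def span_vector(words):
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

paper_a_purpose = ["reduce", "noise", "images"]
paper_b_purpose = ["denoise", "audio"]
# High similarity: same purpose expressed in different domains.
print(round(cosine(span_vector(paper_a_purpose), span_vector(paper_b_purpose)), 3))
```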

Personal Reflection

               I liked this tool quite a bit, as it seems a good way to get unstuck in the research black hole and find new ways of solving problems. I also appreciated that the annotations didn’t necessarily require domain-specific or even researcher-specific knowledge, despite the jargon involved. Furthermore, though it confused me initially, I liked how they used their own abstract as an extra figure of sorts – annotating their own abstract with their own approach was a good idea. It cleverly shows how the approach works without the reader having to finish the entire paper.

               I did find a few things confusing about the paper, however. They state in one section that the GloVe model doesn’t work very well, but then use it in another. Why go back to it if it had already disappointed the researchers in an earlier phase? Another complication is that they didn’t define the dataset used in the third study. Where did the papers come from? I can glean that it was from one of the prior two studies, but I think it’s relevant to ask whether it was the domain-specific or the domain-agnostic dataset (or both).

               I was also curious about total deployment time for this kind of approach. Did the crowd analyze all of the papers in 10 minutes? 60 minutes? A day? Given how parallelizable the task is, I imagine the analysis could be completed very quickly. While this task doesn’t need to be fast, speed could be an excellent bonus of the approach.

Questions

  1. This tool seems extremely useful. When would you use it? What would you hope to find using this tool?
  2.  Is the annotation of 10,000 research papers worth $4000? Why or why not?
  3. Based on their future work, what do you think is the best direction to go with this approach? Considering the cost of the crowdworkers, would you pay for a tool like this, and how much would be reasonable?


4/15/20 – Lee Lisle – Algorithmic Accountability: Journalistic Investigation of Computational Power Structures

Summary

Diakopoulos’s paper makes the point that algorithms exert a power over users that is rarely expressed clearly to them, even when those algorithms have massive influence over users’ lives. The author identifies four ways algorithms exert this power: prioritization, classification, association, and filtering. After a brief description of each, the author argues that transparency is key to balancing these powers.

The author then discusses a series of algorithmic systems and shows how each exerts some amount of power without informing the user, using autocompletion on Google and Bing, autocorrection on the iPhone, political emails, price discrimination, and stock trading as examples. The author then uses interviews to gain insight into how journalists come to understand these algorithms and write stories about them. This reporting is a form of accountability: journalists use this information to help users understand the technology around them.

Personal Reflection

               I thought this paper brought up a good point that was also seen in other readings this week: even if the user is given agency over the final decision, the AI biases them toward a particular set of actions. Even if the weaknesses of the AI are understood, as in the Bansal et al. paper on updates, the participant is still biased by the actions and recommendations of the AI. This power, combined with the effect it can have on people’s lives, can greatly change their course.

               The author also makes the point that interviewing designers is a form of reverse-engineering. I had not thought of it that way before, so it was an interesting insight into journalism. Furthermore, the idea that although AIs are black boxes, their inputs and outputs can be manipulated so that the interior workings can be better understood was another thing I hadn’t considered.

               I was already aware of most of the cases the author presents as examples of algorithms exerting power. For example, I have used different computers and private browsing modes in the past to make sure I was getting the best deal on travel or hotels.

               Lastly, I thought the idea of journalists having to uncover these (potential) algorithmic malpractices raised an interesting quandary. Once they do, they publish a story, but most people will likely never hear about it. There’s a real problem here of how to warn people about red flags in algorithms, and I felt the paper didn’t discuss it thoroughly enough.

Questions

  1. Are there any specific algorithms that have biased you in the past? How did they? Was it a net positive, or net negative result? What type of algorithmic power did it exert?
  2. Which of the four types of algorithmic power is the most serious, in your opinion? Which is the least?
  3. Did any of the cases surprise you? Do they change how you may use technology in the future?
  4. In what ways can users abuse these AI systems?


04/15/20 – Lee Lisle – Believe it or Not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

Summary

Nguyen et al.’s paper discusses the rise of misinformation and the need to combat it with tools that can verify claims while maintaining users’ trust. They designed an algorithm that finds sources similar to a given claim in order to determine whether the claim is accurate, weighting the sources by their reputation. They then ran three studies (each with over 100 participants) in which users could interact with the tool and change settings (such as source weighting). The first study found that participants trusted the system too much – when it was wrong, they tended to be wrong as well, and when it was right, they were more typically correct. The second study allowed participants to change the inputs and inject their own expertise into the scenario; it found that the sliders did not significantly impact performance. The third study focused on gamification of the interface and found no significant difference either.
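
As a hedged sketch of the general idea – not the authors’ actual model – aggregating the stance of retrieved sources toward a claim, weighted by source reputation, might look like this; the sources, stances, and reputation scores are made up.

```python
# Hedged sketch (not the authors' model): aggregate the stance of
# retrieved sources toward a claim, weighted by source reputation.
# stance is +1 (supports), -1 (refutes); reputation is in [0, 1].

def claim_score(evidence):
    """Return a value in [-1, 1]; positive suggests the claim is true."""
    total_weight = sum(e["reputation"] for e in evidence)
    if total_weight == 0:
        return 0.0
    weighted = sum(e["stance"] * e["reputation"] for e in evidence)
    return weighted / total_weight

evidence = [
    {"source": "reuters.example",    "stance": -1, "reputation": 0.9},
    {"source": "randomblog.example", "stance": +1, "reputation": 0.2},
]
print(claim_score(evidence))  # about -0.636, i.e. the claim looks false
```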

Personal Reflection

               I enjoyed this paper from a 50,000-foot perspective, as the authors tested many different interaction types and found what could be considered negative results. I think papers showing that not every intervention helps have a certain extra relevance – they demonstrate that there’s more at work than just novelty.

I especially appreciated the study on the effectiveness of gamification. The prevailing theory is often that gamification increases user engagement and, with it, a tool’s effectiveness. While the paper does not conclusively show that gamification cannot do this, it certainly lends credence to the idea that gamification is not a cure-all.

However, I took some slight issue with their AI design. In particular, the AI determined that the phrase “Tiger Woods” indicated a supportive stance. While their position was that AIs are flawed (true), I felt this error was quite a bit worse than we can expect from typical AIs, especially ones being tweaked to avoid such scenarios. I would have liked to see experiments 2 and 3 run with a better AI, as it does not seem like they cross-compared the studies anyway.

Questions

  1. Does the interface design including a slider to adjust source reputations and user agreement on the fly seem like a good idea? Why or why not?
  2.  What do you think about the attention check and its apparent failure to accurately check? Should they have removed the participants with incorrect answers to this check?
  3. Should the study have included a pre-test to determine how the participants’ world view may have affected the likelihood of them agreeing with certain claims? I.E., should they have checked to see if the participants were impartial, or tended to agree with a certain world view? Why or why not?
  4. What benefit do you think the third study brought to the paper? Was gamification proved to be ineffectual, or is it a design tool that sometimes doesn’t work?


4/8/2020 – Lee Lisle – The State of the Art in Integrating Machine Learning into Visual Analytics

Summary

               Endert et al.’s focus in this paper is on how machine learning and visual analytics have blended together to create tools for sensemaking with large, complex datasets. They first explain various models of sensemaking and how they can affect learning and understanding, along with models of interactivity in visual analytics that complement sensemaking. The authors then lightly describe some machine learning models and frameworks to establish baseline knowledge for the paper. They organize the machine learning techniques currently used in visual analytics into four categories: dimension reduction, clustering, classification, and regression/correlation models. They then discuss papers in each of these categories along a second dimension, in which the user either modifies parameters and the computational domain or defines analytical expectations, with the machine learning model assisting in each case. The authors close by pointing out several new ways of blending machine learning and visual analytics, such as steerable machine learning, building training models from user interaction data, and automated report generation.
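
To ground two of those four categories, here is an illustrative sketch of my own (using scikit-learn on random data) of the kind of dimension reduction and clustering that often sits behind a visual analytics view.

```python
# Illustrative sketch of two of the survey's ML categories as they
# often appear behind a visual analytics view: dimension reduction
# (PCA) for layout, clustering (k-means) for grouping. Data is random.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 50))            # e.g., 100 documents, 50 features

coords = PCA(n_components=2).fit_transform(docs)             # 2-D layout for a scatterplot
labels = KMeans(n_clusters=4, n_init=10).fit_predict(docs)   # cluster coloring

print(coords.shape, np.bincount(labels))     # (100, 2) and the cluster sizes
```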

Personal Reflection

               This paper was an excellent summary of the field of visual analytics and the various ways machine learning has been blended into it. Furthermore, several of the included papers have informed my own research into visual analytics and sensemaking. I was somewhat surprised that, though the authors mention virtual reality, they don’t cover some of the tools that have been developed for immersive analytics. As a side note, the authors used many acronyms and did not explain all of them; virtual reality, for example, was referenced only once and only by its acronym. When they used it in the context of dimension reduction, I was initially confused because they hadn’t defined it, while they defined the acronyms for visual analytics and machine learning twice in the same paragraph of the introduction.

               Their related-work coverage was impressive and really covered a lot of angles on sensemaking and visual analytics. While I don’t have the background in machine learning, I assume it covered that side equally well.

               I also thought the directions they suggest for future development were a good selection of ideas. I could see many of them applying to my work on the Immersive Space to Think (IST): automated report generation would be a great way to start out in IST, and a way to synthesize and perform topic analysis on any annotations made in IST could support further analytical goals.

Questions

  1. What observations do you have on their suggestions for future development in visual analytics? What would you want to tackle first and why?
  2. In what ways do the human and the machine work together in each category of machine learning (dimension reduction, clustering, classification, and regression/correlation)? What affordances does each use?
  3. Which training method do you think leads to higher quality outputs? Unmodified training sets or user interaction steerable machine learning?


4/8/20 – Lee Lisle – Agency Plus Automation: Designing Artificial Intelligence into Interactive Systems

Summary

Heer’s paper focuses on reframing AI and machine learning as everyday interactions that assist users in their work rather than trying to replace them. He reiterates many times in the introduction that humans should remain in control while the AI assists them in completing the task, and even brings up the recent Boeing automation mishaps as an example of why keeping the human in the loop is so essential to future developments. The author then describes several tools in data formatting, data visualization, and natural-language translation that use AI to suggest actions based on the user’s interactions with the data, as well as domain-specific languages (DSLs) that can quickly perform actions through code. The review of his work shows that users want more control, not less, and that these tools increase productivity while letting the user ultimately make all of the decisions.
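
As a toy illustration of that suggestion-plus-control pattern – not a reproduction of Heer’s actual tools – a system can propose a data transform as inspectable code and leave the final decision to the human; the heuristic and transform names below are hypothetical.

```python
# Toy illustration (not Heer's actual tools): the system proposes a
# data transform based on the data it sees, and the human decides
# whether to apply it, keeping the user in control of the final action.

def suggest_transform(column_values):
    """Suggest a transform based on a simple heuristic over the data."""
    if all(v.strip().isdigit() for v in column_values):
        return ("cast_to_int", lambda col: [int(v) for v in col])
    return ("strip_whitespace", lambda col: [v.strip() for v in col])

column = [" 12", "7 ", "40"]
name, transform = suggest_transform(column)
print(f"Suggested: {name}")
if input("Apply? [y/n] ") == "y":   # the human stays in the loop
    column = transform(column)
print(column)
```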

Personal Reflection

               I enjoyed this paper as an exploration of the various ways people can employ semantic interaction in interfaces to boost productivity. Furthermore, the explorations of how users can do this without giving up control were remarkable. I hadn’t realized that the basic idea behind autocorrect/autocomplete could apply in so many different ways in these domains. However, I did notice the author mentioning that in certain cases there were too many options for what to do next. I wonder how much ethnographic research needs to go into determining each action that’s likely (or even possible) in each case, and what overhead the AI puts on the system.

               I also wonder how these interfaces will shape work in the future. Will humans adapt to these interfaces and essentially create new routines and processes in their work? As autocomplete/correct often creates errors, will we have to adapt to new kinds of errors in these interfaces? At what point does this kind of interaction become a hindrance? I know that, despite the number of times I have to correct it, I wouldn’t give up autocomplete in today’s world.

Questions

  1. What are some everyday AI-assisted interactions you use in specialized programs and applications, beyond just autocorrect? Do you always utilize these features?
  2. The author took three fairly everyday activities and created new user interfaces with accompanying AI with which to create better tools for human-AI collaboration. What other everyday activities can you think of that you could create a similar tool for?
  3. How would you gather data to create these programs with AI suggestions? What would you do to infer possible routes?
  4. The author mentions expanding these interfaces to (human/human) collaboration. What would have to change in order to support this? Would anything?
  5. DSLs seem to be a somewhat complicated addition to these tools. Why would you want to use them, and is learning the DSL worth the effort?
  6. Is ceding control to AI always a bad idea? What areas do you think users should cede more control or should gain back more control?


03/25/20 – Lee Lisle – Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Summary

            Often, humans have to be trained in how to talk to AIs so that the AI understands them and what it is supposed to do. This puts the onus on the human to adjust rather than having the AI adjust to the human. One way of addressing that issue has been to use crowdsourced responses so that the human and the AI can understand each other through what is essentially a middle-man approach. Huang et al.’s work creates a hybrid crowd- and AI-powered conversational assistant that aims to be fast enough for a human to interact with naturally while still producing the higher-quality, natural responses that the crowd can create. It accomplishes this by reusing responses previously generated by the crowd as high-quality responses are identified over time. They deployed Evorus over a period of 5 months, covering 281 conversations that gradually moved from crowd responses to chatbot responses.

Personal Reflection

I liked that the authors took an unusual stand early in the paper for this line of research – that, while it has long been suggested that AI will eventually take over for the crowd in many of these systems, this hasn’t happened despite a lot of research. This stand highlights that the crowd has long been performing tasks it was supposedly only doing temporarily.

Also, I realized I could possibly adapt my own work on automatic speech recognition (ASR) to improve itself with a similar approach. For example, if I took several outputs from different ASR algorithms along with crowd transcriptions and used ranked voting to choose the best transcription, the system could perhaps eventually wean itself off the crowd as well.
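
Here is a simplified sketch of that voting idea; real systems such as ROVER vote at the word level after aligning the transcripts, so treat this plurality vote over whole strings as a toy version.

```python
from collections import Counter

# Simplified sketch of the voting idea floated above: pool candidate
# transcriptions from several ASR engines and from crowd workers and
# keep the most common one.

def pick_transcription(candidates):
    counts = Counter(t.strip().lower() for t in candidates)
    best, _ = counts.most_common(1)[0]
    return best

candidates = [
    "meet me at noon",      # ASR engine A
    "meat me at noon",      # ASR engine B
    "meet me at noon",      # crowd worker
]
print(pick_transcription(candidates))  # "meet me at noon"
```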

It was also interesting that they took a Reddit-style approach with an upvote/downvote system for selecting responses. This approach seems to have long legs for fielding appropriate responses via the crowd.

The last observation I would like to make is that they had an interesting and diverse set of bots, though I question the usefulness of some of them. For example, I don’t really understand how the filler bot can be useful except when the AI doesn’t really understand what is happening. I had also thought the interview bot would perform poorly, as the interviews it pulled its training data from would be particular to certain types of people.

Questions

  1. Considering the authors felt that Evorus wasn’t a complete system but a stepping stone toward one, what do you think they could do to improve it? That is, what more can be done?
  2. What other domains within human-AI collaboration could use this approach of the crowd being a scaffold that the AI develops upon until the scaffold is no longer needed? Is the lack of these deployments evidence that the developers don’t want to leave this crutch or is it due to the crowd still being needed?
  3. Does the weighting behind the upvotes and downvotes make sense? Should the votes have equal (yet opposite) weighting or should they be manipulated as the authors did? Why or why not?
  4. Should the workers be incentivized for upvotes or downvotes? What does this do to the middle-of-the-road responses that could be right or wrong?


03/25/20 – Lee Lisle – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

Summary

            Chattopadhyay et al.’s work details the problems with the then-current (pre-2018) methods of evaluating visual conversational agents. These agents, which are AIs designed to discuss what is in pictures, were typically evaluated by having one AI (the primary visual conversational agent) describe a picture while another AI asked questions about it. However, the authors show how this kind of interaction does not adequately reflect how humans would converse with the agent. They use two visual conversational agents, dubbed ALICE_SL and ALICE_RL (for supervised and reinforcement learning, respectively), to play a 20-questions-style guessing game with AMT workers. They found no significant difference in the performance of the two versions of ALICE. This stands in contrast to previous work, which found that ALICE_RL was significantly better than ALICE_SL when tested in AI-AI teams. Both ALICEs perform better than random chance, however, and AI-AI teams require fewer guesses than the humans in human-AI teams.
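
As a hedged sketch of the evaluation idea (not the paper’s exact protocol), after the question rounds the guesser ranks the image pool by estimated similarity to the dialogue, and performance is the rank of the true image, averaged over games; the scores below are made up.

```python
# Hedged sketch of the evaluation idea: after the question rounds, the
# guesser ranks the image pool by estimated similarity to the dialogue,
# and performance is the rank of the true image (lower is better).
# Scores here are made up; the real agents use learned representations.

def rank_of_target(scores, target_id):
    """scores: {image_id: similarity}; returns the 1-based rank of the target."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target_id) + 1

pool_scores = {"img_01": 0.31, "img_07": 0.84, "img_15": 0.55}
print(rank_of_target(pool_scores, "img_15"))  # 2

ranks = [2, 5, 1, 9]                 # ranks over several games
print(sum(ranks) / len(ranks))       # mean rank = 4.25
```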

Personal Reflection

My first note is that their name for the 20-questions-style game was GuessWhat or GuessWhich. This has relatively little to do with the paper’s substance, but it was jarring to me at first.

The first thing that really struck me was their discussion of the previous methods. If the first few rounds of AI-AI evaluation were monitored, why didn’t anyone pick up that the interactions weren’t reflective of human usage? If the abnormality didn’t present itself until later on, could they have monitored late-stage rounds, too? Or was it generally undetectable? I feel like there’s a line of questioning here that wasn’t explored and that might benefit AI research as well.

I was amused that, for a paper all about AI and its interactions with humans, they chose the image set to be of medium difficulty based on “manual inspection.” Does this indicate that the AIs don’t really understand difficulty in these datasets?

Another minor quibble is that they say each HIT was 10 games, but then state that they published HITs until 28 games were completed on each version of ALICE, and specify that this meant 560 games. They overload the word “game” without explaining what it actually means in each case.

An interesting question that they didn’t discuss investigating further is whether question strategy evolved over time for the humans. Did they change up their style of questions as time went on with ALICE? This might provide some insight as to why there was no significant difference.

Lastly, their discussion on the knowledge leak of evaluating AIs on AMT was quite interesting. I would not have thought that limiting the interaction each turker could have with an AI would improve the AI.

Questions

  1. Of all of the participants who started a HIT on AMT, only 76.7% actually completed it. What does this mean for HITs like this? Did the turkers just get bored, or did the task annoy them in some way?
  2. The authors pose an interesting question in 6.1 about QBot’s performance. What do you think would happen if the turkers played the role of the answerer instead of the guesser?
  3. Figure 4(b) shows that ALICE_SL outperformed ALICE_RL in every round of dialogue, even though the difference wasn’t statistically significant. What can be made of this difference?
  4. How would you investigate the strategies that humans used in formulating questions? What would you hope to find?
