04/29/2020 – Subil Abraham – Chilton et al., “VisiBlends”

Visual blending is the idea of taking two objects or concepts that you want to represent and combining them in a way that keeps both concepts identifiable, while also implying that one concept applies to the other. Creating visual blends is a very creative process, but with VisiBlends the authors have built a system that greatly streamlines it and even splits it up into microtasks. The process is broken into ideating related concepts, searching for single representative images with simple iconic shapes for those concepts, and annotating the shapes on the images. The system then takes this information, creates different combinations (blends) of the images, and returns them to the user for evaluation and further iteration. The authors conduct three case studies looking at how VisiBlends can be used in different situations. They also note the limitation that VisiBlends can only deal with simple iconic shapes and cannot handle more complex media like animation.
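
To make the blend-generation step concrete for myself, here is a rough Python sketch of how shape-annotated images might be paired into candidate blends. The data model and the compatibility rule are my own simplification for illustration, not the authors’ actual implementation.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical, simplified stand-in for one step of the VisiBlends pipeline:
# pairing shape-annotated images into candidate blends.

@dataclass
class AnnotatedImage:
    concept: str      # e.g. "coffee", "energy"
    path: str         # location of the image file
    shape: str        # annotated iconic shape, e.g. "circle", "cylinder"
    coverage: float   # rough fraction of the object the shape covers (0..1)

def compatible(a: AnnotatedImage, b: AnnotatedImage) -> bool:
    """Two images can blend if their annotated shapes match closely enough."""
    return a.shape == b.shape and abs(a.coverage - b.coverage) < 0.3

def candidate_blends(concept_a, concept_b):
    """Enumerate (base, overlay) pairs for the user to evaluate and iterate on."""
    for img_a, img_b in product(concept_a, concept_b):
        if compatible(img_a, img_b):
            # Each compatible pair yields two candidates, one per overlay direction.
            yield (img_a, img_b)
            yield (img_b, img_a)

coffee = [AnnotatedImage("coffee", "mug.png", "cylinder", 0.9)]
energy = [AnnotatedImage("energy", "battery.png", "cylinder", 0.8)]
for base, overlay in candidate_blends(coffee, energy):
    print(f"blend: overlay {overlay.path} onto {base.path}")
```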

The visual blended images are some of the most powerful imagery I’ve ever come across. They convey ideas so well. I think this is a really good project that streamlines the process of creating these powerful images. I am actually shocked at how simple the steps are (granted, it takes a lot more work to actually make it look good). But still, very surprising. Initially, I felt that the system was very limited because all it was doing was cropping and overlaying one picture on top of the other. How could that possibly be useful? But then I realized that the real value comes from the fact that it can perform so many blends automatically, with no human effort, and demo them all. Its utility comes from the speed and iteration of the visual blends we can do through it. It’s also really interesting how the tool lets you visualize really unintuitive combinations (like the McDonald’s + energy example). Where a human doing it would be limited by their preconceived notions of both concepts, a machine doesn’t have those blocks and can therefore present any combination of zany ideas that a human can look at and go “Oh! That does work!”. So it serves as the perfect tool to come up with ideas because it does not have any inhibitions.

  1. What kind of workflow would be necessary to do something like this, but for animation and gifs instead of static images?
  2. Do you think this streamlined workflow would impede some creative ideas from being conceptualized because people’s thought processes are trained to think this way?
  3. In your opinion, does VisiBlends function better as a centralized collaborative tool (where everyone is in the same room) or as a distributed collaborative tool (i.e. using crowd workers on crowd work platforms)?


04/29/2020 – Subil Abraham – Fraser et al., “DiscoverySpace”

Software tends to grow in complexity as time goes on, with more features being added as more needs arise. This is especially problematic for consumer software intended for a wide array of people, where a user who just needs to do one specific thing can’t figure out how amid the vast number of options arrayed before them. DiscoverySpace is proffered as a solution to this problem specifically for Photoshop, letting users browse crowdsourced pre-built actions to apply to their image. It adds an extra panel and asks the user to describe the kind of image they have. DiscoverySpace then does keyword matching to identify the most suitable collection of actions to suggest, returning a random sample of results to promote discoverability of new Photoshop features. The authors conduct two studies: one examining how novices interact with the vanilla software while experts guide them and show them what they can do, and another comparing novices’ usage of Photoshop with and without DiscoverySpace. They find that DiscoverySpace greatly helps novices perform their tasks with much less struggle, but also that it is limiting because it only applies effects to the whole image.
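
To picture how the suggestion panel might work, here is a minimal Python sketch of keyword matching followed by random sampling. The action catalog and tags are made up; the real system’s matching is presumably more sophisticated than this.

```python
import random

# Invented action catalog; in reality these would be crowdsourced Photoshop actions.
ACTIONS = [
    {"name": "Vintage Fade", "tags": {"portrait", "retro", "people"}},
    {"name": "HDR Landscape", "tags": {"landscape", "outdoor", "nature"}},
    {"name": "Soft Skin", "tags": {"portrait", "people", "beauty"}},
    {"name": "Dramatic B&W", "tags": {"portrait", "landscape", "moody"}},
]

def suggest(description, k=3, seed=None):
    """Keyword-match the user's image description against action tags,
    then return a random sample of the matches to encourage discovery."""
    keywords = set(description.lower().split())
    matches = [a for a in ACTIONS if a["tags"] & keywords]
    rng = random.Random(seed)
    # Sampling rather than ranking keeps less obvious actions in view.
    return rng.sample(matches, min(k, len(matches)))

print([a["name"] for a in suggest("outdoor portrait of people")])
```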

I feel like a really good expansion of this work would be to have an additional window pop up after an action has been applied, with various tweakable parameters specific to that action. This could really promote Photoshop’s plugin ecosystem and make it more accessible for ordinary folks to contribute plugins, not just companies that specialize in it. Another idea in this paper that I find interesting is the use of random sampling to find and suggest actions. I think this is genius because it promotes curiosity about the different things Photoshop can offer and, slowly over time, allows the user to become more familiar with all the features, particularly if they look through the History panel to see what effects were applied when they clicked on a DiscoverySpace action. It can serve as a useful learning tool for those who prefer to learn from practical examples, helping them emergently get an idea of how Photoshop as a whole functions.

  1. Would you find this a useful learning tool or do you only see it as something that is used to just immediately serve your purpose?
  2. For what other software would something like this be useful?
  3. Is the random sampling they do to promote discoverability a good idea? Would their purpose be better served by optimizing suggestions solely for the specific task at hand?


04/22/2020 – Subil Abraham – Chan et al., “SOLVENT”

Academic research strives for innovation in every field. But in order to innovate, researchers need to scour the length and breadth of their field, as well as adjacent fields, to get new ideas, understand current work, and make sure they are not repeating someone else’s work. An automated system that scours the published research landscape for similar existing work would be a huge boost to productivity. SOLVENT aims to solve this problem by having humans annotate research abstracts to identify the background, purpose, mechanism, and findings, and using that as a schema to index papers and retrieve similar papers (which have also been annotated and indexed) that match closely on one or a combination of those categories. The authors conduct three case studies to validate their method: finding analogies within a single domain, finding analogies across different domains, and checking whether annotation by crowd workers is feasible. They find that their method consistently outperforms baselines in finding analogous papers.
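
To see how the facet-based indexing could work in practice, here is a small sketch using TF-IDF similarity over purpose and mechanism annotations. The paper’s actual representation is different and richer; TF-IDF is just a simple stand-in to illustrate the matching idea, and the example papers are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each paper's abstract is assumed to be already annotated into facet spans.
papers = {
    "paper_a": {"purpose": "reduce traffic congestion in cities",
                "mechanism": "model roads as a flow network and optimize"},
    "paper_b": {"purpose": "improve blood flow through narrowed arteries",
                "mechanism": "simulate fluid flow through a network of vessels"},
    "paper_c": {"purpose": "classify images of handwritten digits",
                "mechanism": "train a convolutional neural network"},
}

def facet_text(p, facets=("purpose", "mechanism")):
    """Concatenate only the chosen facets, mirroring purpose+mechanism matching."""
    return " ".join(p[f] for f in facets)

ids = list(papers)
vec = TfidfVectorizer()
matrix = vec.fit_transform([facet_text(papers[i]) for i in ids])

query = vec.transform([facet_text(papers["paper_a"])])
scores = cosine_similarity(query, matrix).ravel()
ranked = sorted(zip(ids, scores), key=lambda x: -x[1])
print(ranked)  # paper_b should rank above paper_c as the cross-domain analogy
```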

I think something like this would be incredibly helpful for researchers and would significantly streamline the research process. I would gladly use it myself because it would save so much time. Of course, as the authors point out, the issue is scale: for it to be useful, a very large chunk (ideally all) of currently published research needs to be annotated according to the presented schema. This could be integrated as a gamification mechanism in Google Scholar, where the user is occasionally asked to annotate an abstract; that way, you could do it at scale. I also find it interesting that purpose+mechanism produced better results than background+purpose+mechanism. I would have figured that the extra context the background provides would yield better matches. But given the goal of finding analogies even across different fields, perhaps background+purpose+mechanism rightly does not perform as well because it gets too specific by providing too much information.

  1. Would you find use for this? Or would you prefer seeking out papers on your own?
  2. Do you think that the annotation categories are appropriate? Would other categories work? Maybe more or fewer categories?
  3. Is it feasible to expand on this work to cover annotating the content of whole papers? Would that be asking too much?


04/22/2020 – Subil Abraham – Park et al., “Opportunities for automating email processing”

Despite many different innovations from many different directions trying to revolutionize text communication, the humble email has lived on, and the use of email and email clients has adapted to current demands. The authors of this paper investigate the current needs of email users through a workshop and a survey, and they analyze open source code repositories that automate email tasks, in order to identify what users currently need, what existing clients fail to solve, and which tasks people take the initiative to solve programmatically themselves. They identify several high-level categories of needs: additional metadata on email, the ability to leverage internal and external context, managing attention, not overburdening the receiver, and automated content processing and aggregation. They create a Python tool called YouPS that provides an API with which a user can write scripts to perform email automation tasks. They study the users of their tool for a week and note the kinds of work they automate with it, finding that about 40% of the rules users created with YouPS could not have been implemented in their ordinary email client.
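
To get a feel for what managing email programmatically means, here is a sketch of two rules written against Python’s standard imaplib. This is not YouPS’s actual API; the host, account, and addresses are placeholders, and the rules themselves are just the kind of thing the paper describes.

```python
import imaplib

def run_rules(host="imap.example.com", user="me@example.com", password="..."):
    """Apply two hypothetical rules: flag unread mail from an advisor,
    and file newsletters out of the inbox."""
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select("INBOX")

    # Rule 1: flag unread messages from a specific sender.
    _, data = imap.search(None, '(UNSEEN FROM "advisor@example.edu")')
    for num in data[0].split():
        imap.store(num, "+FLAGS", "\\Flagged")

    # Rule 2: move anything from a newsletter sender into a separate folder.
    _, data = imap.search(None, '(FROM "newsletter@example.com")')
    for num in data[0].split():
        imap.copy(num, "Newsletters")
        imap.store(num, "+FLAGS", "\\Deleted")
    imap.expunge()
    imap.logout()
```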

It’s fascinating that there is so much efficiency to be gained by letting people manage their email programmatically. I feel like this should have been a solved problem by now, but apparently there is still room for innovation. It’s also possible that what YouPS provides couldn’t really be done in an existing client, either because existing clients try to be as user friendly as possible to the widest variety of people (how many people actually know what IMAP does?), or because email clients have accumulated so much cruft that adding a programmable layer would be incredibly hard. I get why their survey participants skew towards computer science students, and why their solution gravitates towards solving the email problem in a way better suited to people in computer science. But I also think that, in the long term, keeping YouPS the way it is is the right way to go. With every additional layer of abstraction you add, you lose flexibility. GUIs are not the way to go; rather, people will adapt to using the programming API as programming becomes more prevalent in daily life. I also find the idea of modes really interesting and useful, and it is definitely something I would like to have in my email client.
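
As a toy illustration of what modes could look like, here is a sketch where each mode is just a bundle of rules applied to incoming messages. The rules, addresses, and message fields are invented for illustration, not taken from YouPS.

```python
def research_mode(msg):
    """During the week: archive CFP spam, surface mail from my institution."""
    if "cfp" in msg["subject"].lower():
        return "archive"
    if msg["from"].endswith("@example.edu"):
        return "notify"
    return "inbox"

def weekend_mode(msg):
    """On weekends: only surface mail from family, defer everything else."""
    return "notify" if msg["from"] in {"mom@example.com"} else "defer"

MODES = {"research": research_mode, "weekend": weekend_mode}

def handle(msg, active_mode="research"):
    # Switching modes swaps the whole rule set at once.
    return MODES[active_mode](msg)

print(handle({"from": "prof@example.edu", "subject": "Meeting"}, "research"))
```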

  1. What kind of modes would you set up with different sets of rules, e.g. a research mode or a weekend mode?
  2. Do you think that changing YouPS to be more GUI based would be beneficial because it would reach a wider audience? Or should it keep its flexibility at the cost of perhaps not achieving widespread adoption?
  3. How would you think about training an ML model that can capture internal and external context in your email?


04/15/2020 – Subil Abraham – Nguyen et al., “Believe it or not”

In today’s era of fake news, where new information is constantly spawning everywhere, the importance of fact checking cannot be overstated. The public has a right to remain informed and to obtain true information from accurate, reputable sources. But all too often, people are inundated with so much information that the cognitive load of fact checking everything would be too much. Automated fact checking has made strides, but previous work has focused primarily on model accuracy and not on the people who need to use the models. This paper is the first to study an interface for humans to use a fact checking tool. The tool is pretrained on the Emergent dataset of annotated articles and sources and uses two models: one that predicts an article’s stance on a claim, and another that estimates the accuracy of the claim based on the reputation of the sources. The application takes a claim and retrieves articles that discuss it, uses the stance model to classify whether each article is for or against the claim, and then predicts the claim’s accuracy based on the collective reputation of its sources. It conveys that its models may be inaccurate and provides confidence levels for its predictions. It also provides sliders for human verifiers to adjust the predicted stance of the articles and the source reputations according to their beliefs or new information. The authors run three experiments to test the efficacy of the tool for human fact checkers. They find that users tend to trust the system, which can be problematic when the system is inaccurate.
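
To make the aggregation concrete, here is a back-of-the-envelope sketch of how a claim score could combine per-article stance with per-source reputation, and how the sliders could feed back into it. The formula, sources, and numbers are my own guess for illustration, not the paper’s exact model.

```python
articles = [
    # (source, stance in [-1, 1] where +1 = supports the claim, model confidence)
    ("site_a", +0.8, 0.9),
    ("site_b", -0.6, 0.7),
    ("site_c", +0.3, 0.4),
]
reputation = {"site_a": 0.9, "site_b": 0.5, "site_c": 0.2}  # slider-adjustable

def claim_score(articles, reputation, stance_override=None):
    """Reputation- and confidence-weighted average of article stances."""
    stance_override = stance_override or {}
    num = den = 0.0
    for source, stance, conf in articles:
        stance = stance_override.get(source, stance)  # user moved this stance slider
        weight = reputation[source] * conf            # reputation is also user-adjustable
        num += weight * stance
        den += weight
    return num / den if den else 0.0

print(round(claim_score(articles, reputation), 2))                    # model-only estimate
print(round(claim_score(articles, reputation, {"site_b": +0.2}), 2))  # after a slider tweak
```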

I find it interesting that in the first experiment, the System group’s error rate somewhat follows the stance classifier’s error rate. The crowd workers are probably not independently verifying the stance of the articles and simply trust the predicted stance they are shown. This could potentially be mitigated by adding incentives (like extra reward) to get them to actually read the articles in full. On the flip side, their accuracy (supposedly) improves when they are given sliders to modify the stances and reputations. Maybe that interactivity was the cue they needed to understand that the predicted values aren’t set in stone and could be inaccurate. Though I find it strange that the Slider group in the second experiment did not adjust the sliders if they were questioning the sources. What I find even stranger is that the authors decided to keep the claim that letting users adjust the sliders made them more accurate. That claim is what most readers will take away unless they read the experiments and their caveats carefully. And I don’t like that they kept the second experiment’s results despite those results not showing any useful signal. Ultimately, I don’t buy their push that this tool, as it stands, is useful for the general user. I also don’t really see how it could serve as a technological mediator for people with opposing views, at least not the way they describe it. I could see it serving as a useful automation tool for expert fact checkers as part of their work, but not for the ordinary user, which is whom they model by using crowd workers. I like the ideas the paper is going for, of automated fact checking that helps the ordinary user, and I’m glad they acknowledge the drawbacks. But there are too many drawbacks for me to fully buy into the claims of this paper. It’s poetic that I have my doubts about the claims of a paper describing a system that asks you to question claims.

  1. Do you think this tool would actually be useful in the hands of an ordinary user? Or would it serve better in the hands of an expert fact checker?
  2. What would you like to see added to the interface, in addition to what they already have?
  3. This is a larger question, but is there value in having transparency into the machine learning models the way they have done it (sliders we can manipulate to see the final value change)? How much detail is too much? And for more complex models where you can’t have that instantaneous feedback (like style transfer), how do you provide explainability?
  4. Do you find the experiments rigorous enough and conclusions significant enough to back up the claims they are making?


04/15/2020 – Subil Abraham – Diakopoulos, “Algorithmic accountability”

Algorithms have pervaded our everyday lives because computers have become essential to them. This pervasiveness means they need to be closely scrutinized to ensure that they function as they should, without bias, and obey the guarantees their creators have promised. Algorithmic accountability is a category of journalism in which journalists investigate these algorithms to validate their claims and find violations. The goal is to find mistakes, omissions, or bias creeping into the algorithms, because even though computers do exactly what they’re told, they are still created by humans with blind spots. The paper classifies four kinds of decisions that algorithmic decision making falls under. It argues that transparency alone is not enough, because full transparency is often prevented by trade secret claims. It instead leans on reverse engineering, where journalists feed in inputs and observe the outputs without looking at the inner workings, since they are usually dealing with black box algorithms. The paper examines five case studies of journalists who have done such investigations via reverse engineering, and lays out a theory and methodology for finding newsworthy stories in this space.
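
As a toy illustration of the input/output probing approach, here is a sketch that audits a made-up black-box scoring function by flipping one attribute at a time and measuring the output gap. Everything here, including the biased scorer, is invented for illustration.

```python
import random

def score_applicant(profile):
    # Pretend this is the black box: the auditor cannot read this code,
    # only call it. (Here it sneaks in a penalty based on zip code.)
    base = 600 + 2 * profile["income_k"]
    return base - (40 if profile["zip_code"].startswith("19") else 0)

def audit(n=1000, seed=0):
    """Probe with paired inputs that differ only in the attribute under test."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n):
        profile = {"income_k": rng.randint(20, 150), "zip_code": "20001"}
        twin = dict(profile, zip_code="19104")   # change only the probed attribute
        gaps.append(score_applicant(profile) - score_applicant(twin))
    return sum(gaps) / len(gaps)

print(f"average score gap attributable to zip code: {audit():.1f}")
```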

This paper is a very interesting look, from a non-CS/HCI perspective, at how algorithms function in our lives; it comes from journalism and examines the painstaking way journalists investigate these algorithms. Though not the focus, this work also brings to light the incredible roadblocks that come with investigating proprietary software, especially software from large, secretive companies who would leverage laws and expensive lawyers to fight such investigations if they are not in their favor. In an ideal world, everyone would have integrity and would disclose all the flaws in their algorithms, but that’s unfortunately not the case, which is why the work these journalists do is important, especially when they don’t have easy access to the algorithms they’re investigating and sometimes don’t have access to the right inputs. There is a danger here that a journalist could end up being discredited because they did the best investigation they could with the limited resources they had, only for the PR team of the company under investigation to latch onto a poor assumption or two and discredit the otherwise good work. The difficulty of performing these investigations, especially for journalists who may not have prior training or experience with computers, exemplifies the need for at least some computer science education for everyone, so that people can better understand the systems they’re dealing with and have a better handle on running investigations as algorithms pervade even more of our lives.

  1. Do you think some of the laws in place that allow companies to obfuscate their algorithms should be relaxed to allow easier investigation?
  2. Do you think current journalistic protections are enough for journalists investigating these algorithms?
  3. What kind of tools or training can be given to journalists to make it easier for them to navigate this world of investigating algorithms?


Subil Abraham – 04/08/2020 – Rzeszotarski and Kittur, “CrowdScape”

Quality control in crowd work is straightforward for straightforward tasks. A task like transcribing the text in an image is fairly easy to evaluate because there is only one right answer. Requesters can use gold standard tests to evaluate crowdworkers’ output directly and determine whether they have done a good job, or use task fingerprinting to determine whether worker behavior indicates a genuine effort. The authors propose CrowdScape as a way to combine both types of quality analysis, worker output and worker behavior, through a mix of machine learning and innovative visualization methods. CrowdScape includes a dashboard that provides a bird’s-eye view of different aspects of worker behavior in the form of graphs. These graphs show both aggregate behavior across all crowdworkers and the timeline of individual actions a crowd worker takes on a particular task (scrolling, clicking, typing, and so on); these behavioral traces identify where the crowdworker spends their time by looking at their actions and how long each one takes. The authors conduct multiple case studies on different kinds of tasks to show that their visualizations help separate the workers who make an effort to produce quality output from those who are just phoning it in.
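
To make the idea of behavioral traces concrete, here is a small sketch that collapses a worker’s timestamped event log into per-action time totals, the kind of features that timelines and aggregate plots could be built from. The event names and times are invented.

```python
from collections import defaultdict

def time_per_action(events):
    """events: list of (timestamp_seconds, action), assumed sorted by time.
    Each action is credited with the time until the next event."""
    totals = defaultdict(float)
    for (t0, action), (t1, _) in zip(events, events[1:]):
        totals[action] += t1 - t0
    return dict(totals)

trace = [(0, "read"), (12, "scroll"), (15, "type"), (55, "scroll"),
         (58, "type"), (120, "submit")]
print(time_per_action(trace))
# Long contiguous 'type' spans with pauses suggest genuine writing effort.
```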

CrowdScape provides an interesting visual solution to the problem of evaluating whether workers are being sincere in completing complex tasks. Creative work especially, where you ask the crowd worker to write something of their own, is notoriously hard to evaluate because there is no gold standard test you can apply. So I find the behavior-tracking visualizer, where different colored lines along a timeline represent different actions, genuinely useful. Someone who is making an effort will show long blocks of typing with pauses for thinking. I can see how different behavioral heuristics could be applied for different tasks to determine whether the workers are actually doing the work. I have to admit, though, that I find the scatter plots kind of obtuse and hard to parse. I’m not entirely sure how we’re supposed to read them or what information they convey, so I feel the interface itself could do better at communicating exactly what the graphs are showing. There is promise for releasing this as a commercial or open source product (if it isn’t one already) once the interface is polished. One last thing is the ability for the requester to group “good” submissions, after which CrowdScape uses machine learning to find other similar “good” submissions. However, the paper only mentions this and does not describe how it fits into the interface as a whole, which I felt was another shortcoming of the design.
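
Since the paper does not spell out how the “find more like these good submissions” step works, here is only a guess at one plausible mechanism: nearest neighbors over simple behavioral features. The feature values are invented.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Per-worker behavioral features: seconds typing, seconds scrolling, seconds idle.
features = np.array([
    [102.0, 6.0, 12.0],   # worker 0: lots of typing
    [  5.0, 2.0, 90.0],   # worker 1: mostly idle
    [ 95.0, 10.0, 20.0],  # worker 2: similar to worker 0
    [  8.0, 1.0, 80.0],   # worker 3: similar to worker 1
])

liked = features[[0]]                      # requester marked worker 0 as "good"
nn = NearestNeighbors(n_neighbors=2).fit(features)
_, idx = nn.kneighbors(liked)
print("suggested similar submissions:", idx[0].tolist())  # includes worker 2
```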

  1. What would a good interface for the grouping of the “good” output and subsequent listing of other related “good” output look like?
  2. In what kind of crowd work would CrowdScape not be useful (assuming you were able to get all the data that CrowdScape needs)?
  3. Did you find all the elements of the interface intuitive and understandable? Were there parts of it that were hard to parse?


Subil Abraham – 04/08/2020 – Heer, “Agency plus automation”

A lot of work has been done on improving computers so that humans can use them better, and, separately, on helping machines do work by themselves. The paper makes the case that, in the quest for automation, research on augmenting humans by improving the intelligence of their tools has fallen by the wayside, leaving a rich area of exploration. The paper examines three tools in this space that work with users in a specific domain and predict what they might need or want next based on context clues from the user. Two of the three tools, Data Wrangler and Voyager, use domain specific languages to represent the possible operations to the user, thus providing a shared representation of data transformations for the user and the machine. The last tool, for language translation, does not provide a shared representation but presents suggestions directly, because there is no real way to use a DSL here short of exposing the parse tree, which doesn’t make sense for an ordinary end user. The paper also suggests several directions for future work: better monitoring and introspection tools in these human-AI systems, letting the AI design shared representations based on the domain instead of having them pre-designed by a human, and techniques to identify the right balance between human control and automation for a given domain.
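
To illustrate what a shared representation buys you, here is a toy sketch of a tiny declarative transform language that both the user and a suggestion engine can read and emit. This is not Data Wrangler’s actual DSL, just the general idea; the operations and data are invented.

```python
TABLE = [{"name": " Alice ", "age": "34"}, {"name": "Bob", "age": ""}]

# The shared vocabulary of operations, legible to both sides.
OPS = {
    "trim":       lambda rows, col: [{**r, col: r[col].strip()} for r in rows],
    "drop_empty": lambda rows, col: [r for r in rows if r[col] != ""],
}

def apply_program(rows, program):
    """program: list of (op_name, column) pairs, readable by human and machine."""
    for op, col in program:
        rows = OPS[op](rows, col)
    return rows

def suggest_ops(rows):
    """The machine side: propose operations by inspecting the data."""
    suggestions = []
    for col in rows[0]:
        if any(r[col] != r[col].strip() for r in rows):
            suggestions.append(("trim", col))
        if any(r[col] == "" for r in rows):
            suggestions.append(("drop_empty", col))
    return suggestions

program = suggest_ops(TABLE)       # machine suggests; user can read and edit this list
print(program)
print(apply_program(TABLE, program))
```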

The paper uses these three projects as a framing device to discuss the idea of developing better shared representations and their importance in human-AI collaboration. I think it’s an interesting take, especially the idea of using DSLs as a means of communicating ideas between the human user and the AI underneath. The author backs away from discussing what a DSL would look like for the translation software, since anything beyond autocomplete suggestions doesn’t really make sense in that domain, but I would be interested in further exploration there. I also find it interesting, and it makes sense, that people might not like machine predictions being thrust upon them, either because it influences their thinking or because it is just annoying. I think the tools discussed strike a good balance by staying out of the user’s way. Yes, the user will be influenced, but that is inevitable, because the other option is to give no predictions at all and get no benefit.

Although I see the point that the article is trying to make about shared representations (at least, I think I do), I really don’t see the reason for the article existing besides just the author saying “Hey look at my research, this research is very important and I’ve done things with it including making a startup”. The article doesn’t contribute any new knowledge. I don’t mean for that to sound harsh, and I can understand how reading this article is useful from a meta perspective (saves us the trouble of reading the individual pieces of research that are summarized in this article and trying to connect the dots between them).

  1. In the translation task, why wouldn’t a parse tree work? Are there other kinds of structured representations that would aid a user in the translation task?
  2. Kind of a meta question, but do you think this paper was useful on its own? Did it provide anything outside of summarizing the three pieces of research the author was involved in?
  3. Is there any way for the kind of software discussed here, where it makes suggestions to the user, to avoid influencing the user and interfering with their thought process?


Subil Abraham – 03/25/2020 – Huang et al., “Evorus”

This paper introduces Evorus, a conversational assistant framework/interface that serves as a middleman to curate and choose appropriate responses for a client’s query. The goal of Evorus is to sit between a user and many integrated chatbots, while also using crowd workers to vote on which responses are best given the query and the context. This allows Evorus to be a general purpose chatbot, because it is powered by many domain specific chatbots and (initially) crowd workers. Evorus learns over time from the crowd workers’ votes which responses to send for a query, based on its historical knowledge of previous conversations, and also learns which chatbot to direct a query to based on which chatbots responded well to similar queries in the past. It also avoids biasing against newer chatbots by giving them higher initial probabilities when they first join, so they can still be selected even though Evorus has no historical data or prior familiarity with them. The ultimate goal of Evorus is to minimize the number of crowd worker interventions necessary, by learning which responses to vote on and pass through to the user, and thus save crowd work costs over time.
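
Here is a simplified sketch of how the selection and learning loop might look: keep a running acceptance estimate per bot, give brand-new bots an optimistic prior so they still get picked, and update from crowd votes. The bot names, numbers, and update rule are my own simplification, not the paper’s actual model.

```python
from collections import defaultdict

PRIOR_ACCEPTS, PRIOR_TRIALS = 3, 4          # optimistic prior: new bots start at 0.75

accepts = defaultdict(lambda: PRIOR_ACCEPTS)
trials = defaultdict(lambda: PRIOR_TRIALS)

def acceptance_rate(bot):
    """Estimated chance the crowd (or the system) accepts this bot's response."""
    return accepts[bot] / trials[bot]

def pick_bot(bots):
    """Route the query to the bot with the best estimated acceptance rate."""
    return max(bots, key=acceptance_rate)

def record_vote(bot, accepted):
    """Fold a crowd vote on this bot's response back into its estimate."""
    trials[bot] += 1
    accepts[bot] += 1 if accepted else 0

bots = ["weather_bot", "restaurant_bot", "brand_new_bot"]
record_vote("weather_bot", True)
record_vote("restaurant_bot", False)
print(pick_bot(bots), {b: round(acceptance_rate(b), 2) for b in bots})
```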

This paper seems to follow on the theme of last week’s reading “Pull the Plug? Predicting If Computers or Humans Should Segment Images”. In that paper, the application is trying to decide on the quality of image segmentation of an algorithm, and pass it on to a human in case it was not up to par. The goals of this paper seem similar to that, but for chat bots instead of image segmentation algorithms. I’m starting to think the idea of curation and quality checking is a common refrain that will pop up in other crowd work based applications, if I keep reading in this area. I also find it an interesting choice that Evorus seems to allow multiple responses (either from bots or from crowd workers) to be voted in and displayed to the client. I suppose the idea here is that, as long as the responses made sense and they add more information that can be given to the client, it’s beneficial to allow multiple responses instead of trying to force a single, canonical response. Though I like this paper and the application that it presents, one issue I have is that they don’t show a proper user study. Maybe they felt it was unnecessary because user studies on automatic and crowd based chatbots have been done before and the results of these would be no different. But I still think they should’ve done some client side interviews or observations, or at least shown a graph of the Likert scale responses they collected for the two phases.

  1. Do you see a similarity between this work and the Pull the Plug paper? Is the idea of curation and quality control and teaching AI how to do quality control a common refrain in crowd work research?
  2. Did you find the integration of Filler bot, Interview bot, and Cleverbot, which are not actually contributing anything useful to the conversation, of any use? Were they just there to add conversational noise? Did they serve a training purpose?
  3. Would a user study have shown anything interesting or surprising compared to a standard AI-based or crowd-based chatbot?


Subil Abraham – 03/25/2020 – Luger and Sellen, “Like Having a Really Bad PA”

This paper takes a hard look at how useful conversational agents like Google Now and Siri are in the real world, in the hands of real users who try to use them in daily life. The authors conduct interviews with 14 users to get their general thoughts on how they use these tools and, in some cases, step-by-step details of how they do specific tasks. The paper draws some interesting insights and provides useful recommendations for improving existing CAs. Recommendations include making design changes that inform users of the limitations of what the CAs can do, toning down some of the more personable aspects that give a false impression that the agents are equivalent to humans in understanding, and rethinking the design for easier use in hands free scenarios.

The first thing I noticed, after having read and focused primarily on papers with some quantitative aspect, was that this paper is focused entirely on evaluating and interpreting the content of the interviews. I suppose this is another important way in which HCI research is done and shared with the world, because it focuses entirely on the human side. I think the authors draw some good interpretations and recommendations from it. The general problem I have with these kinds of studies is the small sample size, which rears up here too. But I can look past that because they are still able to get some good insights, make some good recommendations, and focus on a mode of interaction that is entirely dialogue based. I do think that with a bigger sample size and some quantitative work, they could perhaps show trends in the failings of CAs. The most interesting insight for me is that CAs seem to have been designed under the assumption that they would be the focus of attention when used, when in reality people were trying to use them while doing something else and were not looking at their phone. So the feedback mechanism was useless for users who were trying to stay hands free. From my perspective, that seems like the most actionable change, and it can probably lead to (or may already have led to) interesting design research on how best to provide task feedback for different kinds of tasks in hands free usage.

  1. What kind of design elements can be included to help people understand the limits of what the CA can do, and thereby avoid having unfulfillable expectations?
  2. Similarly, what kind of design elements would be useful to better suit the hands free usage of the CAs?
  3. Should CAs aim to be more task oriented like Google Now, or more personable like Siri? What’s your preferred fit?
