04/29/20 – Jooyoung Whang – Accelerating Innovation Through Analogy Mining

This paper sought to find analogies in big, messy, real-world natural language data by applying a purpose-and-mechanism structure. The authors created binary purpose and mechanism vectors for each document, setting a word's entry to 1 if that word expressed the document's purpose (or mechanism) and 0 otherwise. The authors could then evaluate distances between these vectors to help users generate creative ideas that have a similar purpose but a different mechanism. The authors utilized Mturk both to generate training sets and for evaluation. They measured creativity in terms of novelty, quality, and feasibility, and they report significantly improved performance over baselines of plain TF-IDF and random selection.
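
Since the binary-vector idea is easy to lose in prose, here is a minimal sketch of how I understand it, with a made-up vocabulary, made-up product annotations, and a simple cosine-based "near purpose, far mechanism" score. This is my own illustration, not the authors' pipeline.

```python
import numpy as np

# Tiny made-up vocabulary and annotations (purpose words, mechanism words).
vocab = ["clean", "floor", "dust", "vacuum", "brush", "spray", "filter"]
products = {
    "robot vacuum":  (["clean", "floor", "dust"], ["vacuum", "filter"]),
    "floor sweeper": (["clean", "floor", "dust"], ["brush"]),
    "air purifier":  (["clean", "dust"], ["filter"]),
}

def binarize(words, vocab):
    """1 if the vocab word was annotated as purpose/mechanism, else 0."""
    words = set(words)
    return np.array([1.0 if w in words else 0.0 for w in vocab])

purpose = {k: binarize(p, vocab) for k, (p, m) in products.items()}
mechanism = {k: binarize(m, vocab) for k, (p, m) in products.items()}

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

# Near-purpose, far-mechanism: reward purpose similarity, penalize mechanism similarity.
query = "robot vacuum"
scores = {
    other: cosine(purpose[query], purpose[other]) - cosine(mechanism[query], mechanism[other])
    for other in products if other != query
}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```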

This paper appeared to be similar to the SOLVENT paper from last week, except that this one worked with product descriptions, used only the purpose-mechanism structure, and evaluated based on creativity. It was actually a more interesting read for me because it was more relevant to my project. I was especially inspired by the authors' method of evaluating creativity, and I think I may be able to do something similar for my project.

I took special note of the amount of compensation the authors paid to Mturk workers and tried to reverse-calculate the time they allotted for each worker. The authors paid $1.50 for a task that required redesigning an existing product using 12 near-purpose, far-mechanism solutions found by the authors' approach. This must be a lot of reading (assuming 150 words per solution, that's 1,800 words, leaving out the instructions!) and creative thinking. Based on the amount paid, the authors seem to have expected participants to finish in about 10 minutes. I am unsure if this amount was appropriate, but based on the authors' results, it seems to have been successful. It was difficult for me to gauge how much I should pay for my project's tasks, but I think this study gave me a good anchor point. My biggest dilemma was balancing the number of creative references provided by my workers against the quality (more time needed to generate, thus more expensive) of each reference.
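
For my own planning, here is the back-of-the-envelope version of that reverse calculation. Only the $1.50 pay and the 12 solutions come from the paper; the reading speed, word count, and ideation time are my assumptions.

```python
# Back-of-the-envelope reverse calculation of the time budget behind the pay.
pay_per_task = 1.50          # dollars per HIT (from the paper)
solutions_per_task = 12      # near-purpose far-mechanism solutions to read (from the paper)
words_per_solution = 150     # assumed
reading_speed = 250          # words per minute, assumed
ideation_minutes = 3         # assumed time to write the redesign idea

reading_minutes = solutions_per_task * words_per_solution / reading_speed
total_minutes = reading_minutes + ideation_minutes
hourly_rate = pay_per_task / (total_minutes / 60)

print(f"~{reading_minutes:.1f} min reading, ~{total_minutes:.1f} min total")
print(f"implied hourly rate: ${hourly_rate:.2f}/hr")
```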

These are the questions that I had while reading the paper:

1. One of the reasons why the SOLVENT paper expanded its analogy structure to purpose-background-mechanism-findings was that not all papers had a “mechanism” or a “solution” (i.e., some papers simply reported findings about a problem or domain). Do you think the same applies to this study?

2. Do you think the amount of compensation the authors paid was appropriate? If not, how much do you think would have been appropriate? I would personally really like to read some answers to this question to apply to my project.

3. What other ways could be used to measure “creativity”? The authors did a great job of breaking down creativity into smaller measurable components (though still qualitative ones) like novelty, quality, and feasibility. Would there be a different method? Would there be more measurable components? Do you think the authors’ method captures the entirety of creativity?


04/29/20 – Jooyoung Whang – DiscoverySpace: Suggesting Actions in Complex Software

In this paper, the authors introduce an add-on prototype interface called DiscoverySpace that recommends actionable items to users of Photoshop. For this application, the authors focused on providing the following:

– List possible actions at the start

– Use human language

– Show before-and-after previews of each action

– Offer faceted browsing

– Provide relevant and possibly new suggestions.
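
To make the list above concrete for myself, here is a rough guess at the kind of data structure that could back these principles (plain-language action names, before/after previews, and facet tags). It is purely illustrative and not the actual DiscoverySpace implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ActionSuggestion:
    name: str                 # plain human language (principle 2)
    before_preview: str       # before/after thumbnails (principle 3)
    after_preview: str
    facets: dict = field(default_factory=dict)  # for faceted browsing (principle 4)

actions = [
    ActionSuggestion("Make the photo look vintage",
                     "previews/vintage_before.jpg", "previews/vintage_after.jpg",
                     {"subject": "any", "style": "retro"}),
    ActionSuggestion("Brighten a dark portrait",
                     "previews/bright_before.jpg", "previews/bright_after.jpg",
                     {"subject": "person", "style": "natural"}),
]

def browse(actions, **selected_facets):
    """Faceted browsing: keep actions whose facets match every selected value."""
    return [a for a in actions
            if all(a.facets.get(k) == v for k, v in selected_facets.items())]

print([a.name for a in browse(actions, subject="person")])
```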

The authors conducted a between-subjects study to measure the performance of the software. One control group did not have access to DiscoverySpace and used plain Photoshop, whereas the other group did. The authors report improved performance for users of DiscoverySpace.

In this interface, the authors require users to provide information about an image at the start. They mention this could be improved in the future with automatic image analysis. I think this feature is desperately needed, at least in my case. When I decide to use a tool, I prefer the tool to configure basic things for me automatically. Also, surprisingly many users interact with interfaces in the wrong way, even very simple ones; I am certain some people will fail to configure DiscoverySpace. Object classification in images is pretty well-established today, so I think this feature would greatly improve the tool's accessibility.
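
As a sketch of the auto-configuration I have in mind, a pretrained image classifier could seed the initial facets from the photo itself. Assuming a recent torchvision, something like the following could work; the model choice and the way the labels are used as facets are my assumptions, not anything from the paper.

```python
import torch
from PIL import Image
from torchvision import models

# Use a pretrained classifier to guess what is in the photo and seed the
# interface's facets with the top labels (my own sketch, not DiscoverySpace).
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def guess_facets(image_path, top_k=3):
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        probs = model(preprocess(img).unsqueeze(0)).softmax(dim=1)[0]
    top = probs.topk(top_k)
    return [weights.meta["categories"][int(i)] for i in top.indices]

# facets = guess_facets("my_photo.jpg")  # e.g., ["golden retriever", "tennis ball", ...]
```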

It was interesting to find that users still could not figure out some functionality 50% of the time while using DiscoverySpace. That is certainly better than the 80% of the control group, but the percentage still looks too high. I think this says something about the complexity of the tool or the lack of information in the database used to provide the suggestions.

Overall, I felt that the study was done in a bit of a rush. In the results section, the study's participants mention missing functionality such as dialing down the effect of a filter. I think the idea of the tool itself is pretty cool, but it could have benefited from more time; refining the tool a bit more might have produced better results.

The following are the questions that I had while reading the paper:

1. Do you think this tool has a benefit over simply using an Internet search? The authors state that their tool suggests more efficient solutions. However, I think it would take a significantly shorter time to search on the Internet. What do you think? Would you use DiscoverySpace?

2. Did you also feel that the study was a bit rushed? What do you think would have changed if the authors had spent more time refining the tool? Would the participants have provided more positive feedback? What parts of the tool could be improved?

3. It seems that the participants' survey ratings were used to measure creativity. What other metrics could have been used? I would especially like to hear about this since my project is about creative writing. Would it be possible to measure creativity without direct human input? Would the method be quantitative or qualitative?


04/22/20 – Jooyoung Whang – SOLVENT: A Mixed Initiative System for Finding Analogies between Research Papers

This paper proposes a novel mixed-initiative method called SOLVENT in which the crowd annotates the relevant parts of a document according to purpose and mechanism, and the documents are then represented in a vector space. The authors identify obstacles to representing technical documents with the purpose-mechanism concept using crowd workers, such as technical jargon, multiple sub-problems in one document, and the presence of understanding-oriented papers. Therefore, the authors modify the structure to hold background, purpose, mechanism, and findings instead. With each document represented by this structure, the authors were able to apply natural language processing techniques to perform analogical queries, and they found better query results than baseline all-words representations. To test scaling, the authors had workers on Upwork and Mturk annotate technical documents. They found that the workers struggled with the concepts of purpose and mechanism but still provided improvements for analogy mining.
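
My mental model of the per-field representation is roughly the sketch below: one TF-IDF vector per annotated field, with query-time weights that reward similarity in some fields and penalize it in others. The toy annotations and weights are mine; this is not the authors' exact pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up annotated papers (only purpose and mechanism shown for brevity).
docs = {
    "paper_a": {"purpose": "reduce noise in crowd labels",
                "mechanism": "aggregate votes with expectation maximization"},
    "paper_b": {"purpose": "reduce noise in sensor readings",
                "mechanism": "apply a kalman filter"},
    "paper_c": {"purpose": "reduce noise in crowd labels",
                "mechanism": "train workers with gold standard questions"},
}
names = list(docs)
fields = ["purpose", "mechanism"]
matrices = {f: TfidfVectorizer().fit_transform(d[f] for d in docs.values())
            for f in fields}

def query(name, weights):
    """Positive weight: want that field similar; negative: want it different."""
    idx = names.index(name)
    score = sum(w * cosine_similarity(matrices[f][idx], matrices[f])[0]
                for f, w in weights.items())
    return sorted(((s, n) for s, n in zip(score, names) if n != name), reverse=True)

# Near-purpose, far-mechanism query for paper_a:
print(query("paper_a", {"purpose": 1.0, "mechanism": -1.0}))
```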

I think this study would go nicely together with document summarization studies. It would especially help since the annotations are organized by specific categories. I remember one of our class's projects involved ETDs and required summaries; I think this study could have benefited that project given enough time.

This study could also have benefited my project. One of the sample use cases that the paper introduced was improving creative collaboration between users, which is similar to my project about providing creative references to a creative writer. However, if I wanted to apply this study to my project, I would need to additionally label each of the references provided by the Mturk workers by purpose and mechanism, which would add to the cost of providing each creative reference. This study would have been very useful if I had enough funds and wanted higher-quality content rankings in terms of analogy.

It was interesting that the authors mentioned papers from different domains could still share the same purpose-mechanism. It made me wonder whether researchers would really want similar purpose-mechanism papers from a different domain. I understand multi-disciplinary work is being highlighted these days, but would each of the disciplines involved in a study try to address the same purpose and mechanism? Wouldn't they address different components of the project?

The following are the questions that I had while reading the paper.

1. The paper notes that many technical documents are understanding-oriented papers that have no purpose-mechanism mappings. The authors resolved this problem by defining a larger mapping that is able to include these documents. Do you think the query results would have been of higher quality if the mapping had been kept compact instead of expanded? For example, would it have helped if the system separated purpose-mechanism and purpose-findings?

2. As mentioned in my reflection, do you think the disciplines involved in a multi-disciplinary project all have the same purpose and mechanism? If not, why?

3. Would you use this paper for your project? In other words, does your project require users or the system to locate analogies inside a text document? How would you use the system? What kind of queries would you need out of the possible combinations (background, purpose, mechanism, findings)?


04/22/20 – Jooyoung Whang – Opportunities for Automating Email Processing: A Need-Finding Study

In this paper, the authors explore the kinds of automated functionality users want from E-mail interfaces. The authors held workshops with technical and non-technical people to learn about these needs. They found needs such as additional or richer E-mail data models involving latent information, internal or external context, using mark-as-read to control notifications, self-destructing event E-mails, different representations of E-mail threads, and content processing. Afterward, the authors mined GitHub repositories that held actual implementations of E-mail automation and labeled them. The most prevalent implementations automated repetitive processing tasks. Beyond the needs identified in their first probe, the authors also found needs such as using the E-mail inbox as middleware and analyzing E-mail statistics. The authors did a final study by providing users with their own programmable E-mail inbox interface called YouPS.

I really enjoyed reading the section about probes 2 and 3, where actual implementations were done using IMAP libraries. I especially liked the one about notifying the respondent using flashing visuals on a Raspberry Pi; it looks like a very creative and fun project. I also noticed that many of the automations involved processing repetitive tasks. This again confirms the machine affordance of handling many repetitive tasks.

I personally thought YouPS was a very useful tool. I also frequently have trouble organizing my tens of thousands of unread E-mails, consisting mostly of advertisements, and I think YouPS could serve me nicely in fixing this. I found that YouPS is public and accessible online (https://youps.csail.mit.edu/editor). I will definitely return to this interface once time permits and start dealing with my monstrosity of an inbox. YouPS nicely addresses the complexity of developing a custom inbox management system. I am not familiar with the IMAP protocol, which hinders me from implementing E-mail-related functionality in my personal projects, so a library like YouPS that simplifies the protocol would be very valuable to me.
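
Since I do not know the YouPS API itself, here is the kind of rule I have in mind written against Python's standard imaplib instead; the host, account, and sender address are placeholders.

```python
import imaplib

# Placeholder credentials and sender; mark unread newsletter mail as read.
HOST, USER, PASSWORD = "imap.example.com", "me@example.com", "app-password"
AD_SENDER = "newsletter@example.com"

with imaplib.IMAP4_SSL(HOST) as imap:
    imap.login(USER, PASSWORD)
    imap.select("INBOX")
    status, data = imap.search(None, f'(UNSEEN FROM "{AD_SENDER}")')
    if status == "OK":
        for num in data[0].split():
            imap.store(num, "+FLAGS", "\\Seen")  # mark as read
```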

The following are the questions that I had while reading this paper.

1. What kind of E-mail automation would you want to make given the ability to make any automation functionality?

2. The authors mentioned in their limitations that their study's participants were mostly technical programmers. What difference would there be between programmers and non-programmers? If the study had been done with only non-programmers, do you think the authors would have seen a different result? Is there something specifically relevant to programmers that resulted in the existing implementations of E-mail automation? For example, maybe programmers usually deal with more technical E-mails?

3. What interface is desirable for non-programmers to meet their needs? The paper mentions that one participant did not like that current interfaces required many clicks and much typing to create an automation rule, and that the rules didn't even work properly. What would be a good way for non-programmers to develop an automation rule? The creation of a rule requires a lot of logical thinking involving many if-statements. What would be a minimum requirement or qualification for non-programmers to create an automation rule?


04/15/20 – Jooyoung Whang – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

In this paper, the authors state that current fully automatic fact-checking systems fall short in three respects: model transparency, taking world facts into consideration, and communicating model uncertainty. So the authors built a system that includes humans in the loop. Their proposed system uses two classifiers: one predicts the reliability of each document supporting a claim, and the other predicts the document's veracity. Using these weighted classifications, the confidence of the system's prediction about a claim is shown to the user, and users can further steer the system by adjusting its weights. The authors conducted a user study of their system with Mturk workers. They found that their approach was effective, but also noted that too much information or misleading predictions can lead to large user errors.
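
The aggregation step, as I read it, boils down to something like the sketch below: each retrieved article carries a reliability score and a support score, and the claim-level confidence is a weighted average the user can re-weight. The numbers are invented and this is my reading, not the authors' actual model.

```python
# Invented article scores: reliability in [0, 1], support for the claim in [-1, 1].
articles = [
    {"source": "newswire", "reliability": 0.9, "support": +0.8},
    {"source": "blog",     "reliability": 0.4, "support": -0.6},
    {"source": "tabloid",  "reliability": 0.2, "support": +0.3},
]

def claim_confidence(articles, user_weights=None):
    """Reliability-weighted average of support; user_weights lets the reader
    boost or dampen individual sources."""
    user_weights = user_weights or {}
    num = den = 0.0
    for a in articles:
        w = a["reliability"] * user_weights.get(a["source"], 1.0)
        num += w * a["support"]
        den += w
    return num / den if den else 0.0  # > 0 leans true, < 0 leans false

print(claim_confidence(articles))                 # system default
print(claim_confidence(articles, {"blog": 3.0}))  # user boosts the blog's influence
```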

First off, it was hilarious that the authors cited Wikipedia to introduce information literacy in a paper about evaluating information. I personally took it as a subtle joke left by the authors. However, it also led me to a question about the system: if I did not miss it, the authors did not explain where the relevant sources or articles supporting a claim came from. I was a little concerned that some of the articles used in the study might not have been reliable sources.

Also, the authors conducted the user study using their own predefined set of claims. While I understand this was needed for an efficient study, I wanted to know how the system would work in the wild. If a user searched a claim that he or she knows to be true, would the system agree with high confidence? If not, would the user have been able to correct the system using the interface? It seemed that some portion of the users were confused, especially by the error-correction part of the system. I think these things would have been valuable to know and would seriously need to be addressed if the system were to become a commercial product.

These are the questions that I had while reading the paper:

1. How much user intervention do you think is enough for these kinds of systems? I personally think that if users are given too much power over the system, they will apply their biases to the correction and get false positives.

2. What would be a good way for the system to only retrieve ‘reliable’ sources to reference? Stating that a claim is true based on a Wikipedia article would obviously not be so assuring. Also, academic papers cannot address all claims, especially if they are social claims. What would be a good threshold? How could this be detected?

3. Given the current system, would you believe the results that the system gives? Do you think the system addresses the three requirements that the authors say all fact-checking systems should possess? I personally think that system transparency is still lacking. The system shows a lot about what kind of sources it used and how much weight it puts on them, but it does not really explain how it made the decision.


04/15/20 – Jooyoung Whang – What’s at Stake: Characterizing Risk Perceptions of Emerging Technologies

In this paper, the authors conduct a survey listing known technological risks and asking participants to rate the severity of each risk. The authors state that their research is an extension of prior work from the 1980s. The survey was administered to experts and non-experts, with experts recruited from Twitter and non-experts from Mturk. From both the older work and their own, the authors found that people tend to rate voluntary risks low even if in reality they are high. They also found that many emerging technological risks were regarded as involuntary, and that non-experts tended to underestimate the risks of new technologies. Based on their findings, the authors introduce a risk-sensitive design approach: a risk-perception graph that can be used to decide whether a proposed technology is perceived by non-experts to be as risky as experts think it is, or whether the risk is underestimated, and whether the design is acceptable.

This paper nicely captures how users perceive technical risk. I liked that the paper did not stop at explaining the results but went further and proposed a tool for technical designers. However, it was a little unclear to me how to use the tool. The risk-perception graph that the authors show only has "low" and "high" as its axis labels, which are very subjective terms. A way to quantify risk perception would have served nicely.
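
One simple way to quantify it would be to compare mean severity ratings between the two groups per risk. The ratings below are made up purely to illustrate the calculation, not taken from the paper.

```python
# Made-up severity ratings on a 1-7 scale, just to illustrate a gap score.
expert_ratings = {
    "biometric data leak": [6, 7, 6, 7],
    "smart-home outage":   [4, 5, 4, 4],
}
layperson_ratings = {
    "biometric data leak": [4, 3, 5, 4, 4],
    "smart-home outage":   [4, 4, 5, 4, 3],
}

def mean(xs):
    return sum(xs) / len(xs)

for risk in expert_ratings:
    gap = mean(expert_ratings[risk]) - mean(layperson_ratings[risk])
    flag = "underestimated by non-experts" if gap > 1 else "roughly aligned"
    print(f"{risk}: gap = {gap:+.1f} ({flag})")
```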

This paper also made me wonder what the point is of providing terms of use for a product if users feel they have been involuntarily exposed to risk. I feel like a better representation is needed. For example, a summary outlining the most important risks in a few short sentences, with details provided via a separate link, would be more effective than throwing a wall of text at a (most likely) non-technical user.

I also think one way to address the gap in risk perception between designers and users is to involve users in the development process in the first place. I am unsure of the exact term, but I recall learning about a users-in-the-loop development cycle in a UX class. This development method allows designers to fix user problems early in the process and end up with higher-quality products. I feel it would also better inform designers about potential risks.

These are the questions that I had while reading the paper:

1. What are some disasters that may happen due to the gap in risk perception between users and designers of a system? Would any additional risks occur due to this gap?

2. What would be a good way to reduce the gap in risk perception? Do you think using the risk-perception graph from the paper is useful for addressing this gap? How would you measure the risk?

3. Would you use the authors' proposed risk-sensitive design approach in your project? What kinds of risks do you expect from your project? Are they technical issues, and do you think your users will underestimate the risk?


04/08/20 – Jooyoung Whang – CrowdScape: Interactively Visualizing User Behavior and Output

In this paper, the authors try to help Mturk requesters by providing them with an analysis tool called CrowdScape. CrowdScape is an ML + visualization tool for viewing and filtering Mturk worker submissions based on the workers' behaviors. The user of the application can threshold on certain behavioral attributes such as time spent or typing delay. The application takes two inputs: worker behavior and results. The behavior input is time-series data of user activity, and the result is what the worker submitted for the Mturk task. The authors focused on finding similarities among the answers and graphing them with parallel coordinates. The authors conducted a user study by launching four different tasks and recording worker behavior and results, and they conclude that their approach is useful.
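
The behavioral-threshold part is easy to picture as a few lines of pandas over made-up submission logs, which is roughly the kind of filtering the CrowdScape UI supports interactively. This sketch is mine, not the tool's code.

```python
import pandas as pd

# Made-up submission log: behavior traces summarized per worker plus the answer.
submissions = pd.DataFrame({
    "worker":        ["w1", "w2", "w3", "w4"],
    "seconds_spent": [210, 25, 180, 300],
    "key_presses":   [420, 12, 260, 510],
    "answer":        ["detailed text ...", "asdf", "another detailed answer", "long answer ..."],
})

# Threshold on behavioral attributes, as a requester might do in the UI.
likely_good = submissions[(submissions.seconds_spent >= 60) &
                          (submissions.key_presses >= 100)]
print(likely_good[["worker", "answer"]])
```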

This paper's approach of integrating worker behavior and results to filter good output was interesting, although I think the system needs to overcome one problem to be effective, and that problem lies in ethics. The authors explicitly stated that they obtained consent from their pool of workers to collect behavioral data. However, some Mturk requesters may decide not to do so, with ill intentions; this could result in intrusion into private information and could even lead to theft. On the other hand, upon giving consent, the worker becomes aware of being monitored, which could result in unnatural behavior that is undesirable for system testing.

I thought the individual visualized graphs and figures were effective for understanding and filtering by worker behavior. However, the overall CrowdScape interface looked a bit overpacked with information. I think a small feature to show or hide some of the graphs would be desirable. The same problem existed with another information-exploration system from a project that I've worked on; in my experience, an effective solution was to provide a set of menus that sorted attributes hierarchically.

These are the questions that I had while reading the paper:

1. A big selling point of CrowdScape is that it can be used to filter and retrieve a subset of the results (those thought to be of high quality). What else could this system be used for? For example, I think it could be used for rejecting undesired results. Suppose you needed 1000 results and launched 1000 HITs. You know you will get some low-quality results, but since there are too many submissions, it would take forever to filter them by eye. CrowdScape would help accelerate the process.

2. Do you think you can use CrowdScape for your project? If so, how would you use it? CrowdScape is useful if you, the researcher, are the endpoint of the Mturk task (as in, the result is ultimately used by you). My project uses the results from Mturk in a systematic way without them ever reaching me, so I don't think I'll use CrowdScape.

3. Do you think the graphs available in CrowdScape are enough? What other features would you want? For one, I'd love to have a boxplot of the worker-behavior attributes.


04/08/20 – Jooyoung Whang – Agency plus automation: Designing artificial intelligence into interactive systems

This paper investigates methods for achieving AI + IA, that is, enhancing human performance with automated methods rather than completely replacing it. The author notes that effective automation should, first, bring significant value; second, be unobtrusive; third, not require precise user input; and finally, adapt to the user. The author takes these points into account and introduces three interactive systems that he built. All of these systems use machine computing to handle the initial or small repetitive tasks and rely on human computing to make corrections and improve quality. They are all collaborative systems in which AI and humans work together to boost each other's performance: the AI part of the system tries to predict user intentions while the human drives the work.

This paper reminded me of Smart-Built Environments (SBEs), a term I learned in a Virtual Environments class. An SBE is an environment where computing is seamlessly integrated and interaction with it is very natural; it is capable of "smartly" providing appropriate services to humans in a non-intrusive way. For example, a system where the lights automatically turn on when a person enters a room is a smart feature. I felt that this paper was trying to build something similar in a desktop environment. One core difference is that SBEs also try to tackle immersion and presence (terms frequently used for evaluating virtual environments). I wonder if the author knows about SBEs or got his project ideas from them.

While reading the paper, I wasn't sure the author handled the "unobtrusive" part effectively. One of the introduced systems, Wrangler, is an assistive tool for preprocessing data. It tries to predict user intention upon observing certain user behavior and recommends available data transformations in a side panel. I believe this approach was meant to mimic Google's query auto-completion feature. However, I don't think it works as well: Google's auto-complete suggestions appear right below where the user is typing, whereas Wrangler's suggestions appear off in a side corner. This requires users to avert their eyes from the point of the previous interaction, and that is obtrusive.

These are the questions that I had while reading the paper:

1. Do you know of any other systems that try to seamlessly integrate AI and human tasks? Are those systems effective? How so?

2. The author of this paper mostly uses AI to predict user intentions and process repetitive tasks. What other capabilities of AI would be available for naturally integrating with human tasks? What other tasks that are hard for humans but that machines excel at could be integrated?

3. Do you agree that "the best kind of system is one where the user does not even know he or she is using it"? Would there ever be a case where it is crucial that the user feels the presence of the system as a separate entity? This thought came to me because systems can (and ultimately do) fail at some point. If none of the users understand how the system works, wouldn't that be a problem?


03/25/20 – Jooyoung Whang – All Work and No Play? Conversations with a Question-and-Answer Chatbot in the Wild

In this paper, the authors design and deploy a conversational QA bot called CHIP. It is capable of providing domain-specific information (information about the company where it was deployed) and carrying on off-topic conversation. The authors' interest in this study was to observe and characterize the kinds of interactions CHIP users had and to measure CHIP's performance. CHIP classified user intent with two classifiers, one predicting a broad category and the other a more specific one; the appropriate response was given to the user based on whether the specific classification was a sub-category of the broader one. Training was done using data collected from other conversational QA agents and anonymized company E-mails. The authors observed that users mostly used CHIP for system inquiries, providing feedback, and playful chit-chat.
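
My understanding of the two-level intent check is roughly the sketch below, where a response is produced only when the specific intent falls under the broad one. The categories, handlers, and toy classifiers are invented, not CHIP's actual design.

```python
# Broad categories and the specific intents they cover (invented taxonomy).
BROAD_TO_SPECIFIC = {
    "work_qa": {"ask_directions", "ask_policy"},
    "playful": {"chitchat", "joke_request"},
}

def respond(utterance, broad_clf, specific_clf, handlers):
    broad = broad_clf(utterance)
    specific = specific_clf(utterance)
    if specific in BROAD_TO_SPECIFIC.get(broad, set()):
        return handlers[specific](utterance)
    return "Sorry, could you rephrase that?"  # the two classifiers disagree

# Toy stand-ins for the two trained classifiers and their response handlers.
broad_clf = lambda u: "playful" if "you" in u.lower() else "work_qa"
specific_clf = lambda u: "joke_request" if "joke" in u.lower() else "ask_directions"
handlers = {
    "joke_request":   lambda u: "Why did the bot cross the road? To A/B test it.",
    "ask_directions": lambda u: "The cafeteria is on floor 2.",
}
print(respond("Can you tell me a joke?", broad_clf, specific_clf, handlers))
```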

I personally liked the study because of its interesting topic. My mother owns an Amazon Alexa; I've frequently seen Alexa trying to be humorous and have been astonished by how naturally human-like these conversational agents can act. At the start of this paper, I was curious how the authors would approach the highly abstract concept of playfulness in a technical paper. Using an intent-classification layer was a great idea; I think it nicely encapsulates user queries and improves response quality.

One interesting observation in the paper was that casual conversations often occurred in a mix with work-related conversations. Up to now, I thought the two types of conversations happened separately when chatting with a bot. I think this mix happens more frequently when talking with a human, so I assume it was the result of the users trying to anthropomorphize the agent.

Moving on to a more critical reflection, I think it would have been nicer if the paper had focused on one type of conversation (e.g., playful conversations). The paper tries to address both work-related and playful conversations at the same time. I know the authors were interested in looking at human-AI interaction in the wild, but I think this also made the results less compact and less focused. I also had the feeling that this study was very specific to the agent that the authors designed (CHIP), and I am unsure how the results would generalize to other conversational agents.

These are the questions that I had while reading the paper:

1. The authors mentioned that certain nuances could be used to detect user intent. What would be considered a nuance of playful intent? There seems to be a high correlation between a user's urge to anthropomorphize the bot and a playful conversation. Could words like 'you' or 'we' be used to detect playful intent?

2. As in my reflection, I think this study is a bit too specific to CHIP. What do you think? Do you think the results will generalize well to other kinds of conversational bots?

3. According to this paper, work-related conversations and playful conversations frequently happened together. Would there be a case where a playful conversation will never happen? What kind of environment would not require a playful conversation?


03/04/20 – Jooyoung Whang – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

In this paper, the authors study the effectiveness of vision-to-language systems for automatically generating alt text for images and the impact of having humans in the loop for this task. The authors set up four methods for generating alt text. The first is a straightforward application of a modern vision-to-language alt text generator. The second is a human-adjusted version of the first method. The third method is more involved: a blind or visually impaired (BVI) user chats with a sighted user to gain more context about an image. The final method is a generalized version of the third, where the authors analyzed the patterns of questions asked during those conversations to form a structured set of pre-defined questions that a crowd worker can answer directly without a lengthy conversation. The authors conclude that current vision-to-language techniques can, in fact, harm context understanding for BVI users, that simple human-in-the-loop methods significantly outperform them, and that the structured-questions method worked best.
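
To picture the structured-question method, I imagine the HIT looking something like the sketch below: a fixed question list answered once per image and stitched into alt text. The specific questions are my paraphrase, not the authors' actual set.

```python
# My paraphrase of what a structured-question HIT could look like.
STRUCTURED_QUESTIONS = [
    "Who or what is the main subject of the photo?",
    "What is the subject doing?",
    "Where does the photo appear to be taken?",
    "Is there any visible text in the image? If so, what does it say?",
]

def build_alt_text(answers):
    """Stitch a worker's answers into a single alt-text string."""
    parts = [a.strip().rstrip(".") for a in answers if a.strip()]
    return ". ".join(parts) + "." if parts else ""

answers = ["A golden retriever", "catching a frisbee mid-air", "a grassy park", ""]
print(build_alt_text(answers))
```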

This was an interesting study that implicitly pointed out the limitation of computers in understanding social context, which is a human affordance. The authors stated that the results of a vision-to-language system often confused users because the system did not get the point. This made me wonder whether this limitation can be overcome in the future.

I was also concerned about whether the authors' proposed methods are even practical. Sure, the human-in-the-loop method involving Mturk workers greatly enhanced the descriptions of Twitter images, but based on their report, it takes too long to retrieve a description. The paper reports that answering one of the structured questions takes one minute on average, excluding the time it takes for a Mturk worker to accept a HIT. The authors suggested pre-generating alt text for popular Tweets, but this does not completely solve the problem.

I was also skeptical about the way the authors performed validation with the 7 BVI users. In their validation, they simulated their third method (TweetTalk, a conversation between BVI and sighted users). However, they did not do it using their application but rather through a face-to-face conversation between the researchers and the participants. The authors claimed that they tried to replicate the environment as much as possible, but I think there could still be flaws, since the researchers serving as the sighted users already had expert knowledge of the experiment. Also, as stated in the paper's limitations section, the validation was performed with too few participants, which may not fully capture BVI users' behaviors.

These are the questions that I had while reading this paper:

1. Do you think the authors’ proposed methods are actually practical? What could be done to make them practical if you don’t think so?

2. What do you think were the human affordances needed for the human element of this experiment other than social awareness?

3. Do you think the authors' validation with the BVI users is sound? Also, the validation was only done for the third method. How could the rest of the methods be validated?
