04/29/20 – Lulwah AlKulaib-Siangliulue et al., “IdeaHound”

Summary

Collaborative-creative online communities, where users propose ideas and solutions to existing problems, have shown that the resulting suggestions are often simple, repetitive, and mundane. As a result, organizations treat these platforms as marketing venues rather than innovation resources. Targeted creativity interventions aimed at small groups and individuals have demonstrated that they can improve creative outcomes. The authors propose a system that improves the quality of ideas proposed on large-scale collaborative platforms. IdeaHound is an ideation system that integrates the task of defining semantic relationships among ideas into the primary task of idea generation. The proposed system creates a semantic model of the solution space by combining implicit human actions with machine learning. The system uses the semantic model to enable three main idea-enhancing interventions: sampling diverse examples, exploring similar ideas, and providing a visual overview of the emerging solution space. Users can draw on these features to produce higher-quality ideas. The authors conducted studies and experiments to show the effectiveness of the system and the need it fills.
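To make the "semantic model plus interventions" idea concrete, here is a minimal hypothetical sketch (my own simplification, not the authors' implementation). It assumes we already have an idea-to-idea text-similarity matrix and a co-placement matrix derived from where workers drop sticky notes; it blends the two and greedily samples diverse example ideas:

# Hypothetical sketch (not the paper's implementation): blend text similarity with
# implicit "placed nearby on the whiteboard" signals, then greedily sample diverse ideas.
import numpy as np

def diverse_examples(text_sim, co_placement, k=3, alpha=0.5):
    """text_sim, co_placement: (n x n) similarity matrices with values in [0, 1]."""
    sim = alpha * text_sim + (1 - alpha) * co_placement   # combined semantic model
    chosen = [int(np.argmin(sim.sum(axis=1)))]            # start from an "outlier" idea
    while len(chosen) < k:
        nearest_sim = sim[:, chosen].max(axis=1)          # similarity to closest chosen idea
        nearest_sim[chosen] = np.inf                      # never re-pick an idea
        chosen.append(int(np.argmin(nearest_sim)))        # farthest-point sampling
    return chosen

# Toy example with 4 ideas: ideas 0 and 1 are near-duplicates, 2 and 3 are distinct.
text_sim = np.array([[1.0, 0.9, 0.2, 0.1],
                     [0.9, 1.0, 0.3, 0.2],
                     [0.2, 0.3, 1.0, 0.4],
                     [0.1, 0.2, 0.4, 1.0]])
print(diverse_examples(text_sim, text_sim, k=2))          # -> [3, 0], two dissimilar ideas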

Reflection

This was an interesting paper to read. I have never used an ideation platform and have no background in this area, but I learned a lot from this reading. To my understanding, the proposed system would be helpful in crowd-led communities that propose ideas for an issue or a problem. The system seems useful for keeping track of semantically similar ideas and grouping them for users as they type their own. This would drastically reduce repetitiveness and help users build on top of others' ideas. Also, giving users a visual overview of the solution space is another tool that would help in brainstorming. I think the visualization is helpful for identifying different topics, the least explored ideas, and the most heavily focused-on ideas, which could help redirect users in different directions based on the cluster visualization.

I think that the interface, when looking at the big picture, is reasonable and very helpful. Yet when looking at the whiteboard and clusters of ideas up close, I feel this could have been done better. Even though using sticky notes to resemble scrap notes is simple, the page looks crowded and can be overwhelming for some users (at least for me). I believe there should be best practices for organizing ideas, and maybe a different interface would help make that process less messy and easier to understand.

I am conflicted about the paper. I like the idea behind the system, but the user interface is just too overwhelming for me. I do not know what the best visual element to replace the sticky notes would be, or how that section of the system should be altered, but maybe a future version could offer choices for different users (sticky notes, or perhaps a menu with toggle and sorting options?).

Discussion

  • Do you agree with the user interface design for the user workspace? 
  • How would you design this system’s user interface? 
  • Have you ever been part of a crowd-based ideation process? What were your thoughts about that experience? Would you use a system similar to the proposed one for your task?
  • What would be a better way than sticky notes to deal with a massive number of ideas?


04/29/20 – Lulwah AlKulaib-Fraser et al., “DiscoverySpace”

Summary

As software develops over time, its complexity increases with each newly added feature. This accumulated complexity can be beneficial for experts, yet it presents problems for beginner end users. When developing off-the-shelf software, developers must consider the different technical backgrounds of their end users and try to make the interface accessible to all potential users. To address this difficulty in Adobe Photoshop, the authors present DiscoverySpace, a prototype extension panel that suggests task-level action macros to apply to photographs based on visual features. The extension is meant to help new Adobe Photoshop users by making suggestions once the user starts a task (opens a picture), supporting search in simple human language, showing previews of what a suggestion does (before and after), offering faceted browsing to make searching easier, and showing suggestions relevant to the user's current task, which also alerts them to new or unknown possibilities. The authors investigate the effectiveness of the extension by running a study comparing two groups, one using the extension and the other not. They find that action suggestions may keep new users from losing confidence in their abilities, help them accomplish their tasks, and help them discover new features.
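To illustrate the flavor of "suggestions based on visual features," here is a hedged toy sketch (not the authors' implementation; the tag names and the overlap-based matching rule are my own assumptions):

# Hypothetical sketch: rank action macros by how well their tags match
# visual features detected in the photo the user just opened.
def suggest_actions(photo_features, action_library, top_n=3):
    scored = []
    for action in action_library:
        overlap = len(photo_features & action["tags"])   # shared tags
        if overlap:
            scored.append((overlap, action["name"]))
    scored.sort(reverse=True)                            # most relevant first
    return [name for _, name in scored[:top_n]]

actions = [
    {"name": "Whiten teeth",    "tags": {"face", "smile"}},
    {"name": "Dramatic sunset", "tags": {"outdoor", "sky"}},
    {"name": "Soften skin",     "tags": {"face", "portrait"}},
]
print(suggest_actions({"face", "portrait", "indoor"}, actions))
# -> ['Soften skin', 'Whiten teeth']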

Reflection

As an on-and-off Adobe Photoshop user, I was interested in this paper and this extension. I thought it would be nice to have those suggestions as a reminder when I return to the software after months of not using it. Since I am more focused on Adobe Lightroom when it comes to editing photos, it is easy for me to confuse the panels and actions available in the two programs. I was somewhat surprised that users who had the extension still reported that they couldn't figure something out 50% of the time. Even though that was a 30% drop compared with users who did not have the extension, it still raises the question: where is the problem? Was it the software? Did the extension lack some detail? Or was it simply that users need time to become familiar with the interface?

I was also puzzled when I saw that the authors used random sampling when suggesting actions to the user. Editing photos is a process, and depending on the photo there are actions that should be taken before others. Maybe random suggestions are best suited for learning about the interface or the effect of each action; otherwise, I don't think it was the best functionality to propose.

I don't know if I agree with how the authors measured performance confidence in their survey. Using technology has always made us feel more confident; I trust a calculator more than doing quick mental arithmetic. I felt that this wasn't a fair comparison measure.

Discussion

  • Would you use this extension if you had Adobe Photoshop? Why? Or Why not?
  • What would you change about this extension? Why?
  • Can you think of other extensions that you use on a regular basis that are useful in terms of learning about some software or platform?


04/22/20 – Lulwah AlKulaib-Accelerator

Summary

Most crowdsourcing tasks in the real world are submitted to platforms as one big task because of the difficulty of decomposing tasks into small, independent units. The authors argue that by decomposing and distributing tasks we could make better use of the resources crowdsourcing platforms provide, at a lower cost than the traditional method. They propose a computational system that frames small, interdependent tasks so that together they represent one big-picture product. Because this is difficult, the authors investigate its viability by prototyping the system, testing how the distributed information is combined after all tasks are done, and evaluating the output across multiple topics. The system compared well to top existing information sources on the web, and it exceeded or approached the quality ratings of highly curated, reputable sources. The authors also suggest design patterns that should help other researchers and systems when thinking about breaking big-picture projects into smaller pieces.
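For intuition only, here is a hedged toy sketch of the decomposition idea (my own illustration, not the authors' system; post_hit, the stages, and the prompts are hypothetical stand-ins for posting small ~$1 HITs):

# Hypothetical sketch: a big task expressed as a pipeline of small microtasks,
# where each stage's output becomes the next stage's input.
def run_pipeline(big_question, post_hit, n_sources=3):
    # Stage 1: several independent workers each find one source.
    sources  = [post_hit(f"Find one useful web source about: {big_question}")
                for _ in range(n_sources)]
    # Stage 2: extract a key passage from each source (one HIT per source).
    snippets = [post_hit(f"Copy the key passage from: {s}") for s in sources]
    # Stage 3: a single worker groups the passages into topics.
    outline  = post_hit("Group these passages into topics: " + " | ".join(snippets))
    # Stage 4: write the final answer from the outline.
    return post_hit(f"Write one paragraph per topic in: {outline}")

# post_hit would post a small HIT and return the worker's text; stubbed here.
fake_worker = lambda prompt: f"[worker output for: {prompt[:40]}...]"
print(run_pipeline("How do I unclog a drain?", fake_worker))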

Reflection

This was an interesting paper. I hadn't thought about breaking a project down into smaller pieces to save on costs, or that doing so could yield better-quality results. I agree that some existing tasks are too big, complex, and time consuming, and maybe those need to be broken down into smaller tasks. I still can't imagine how breaking tasks down so small that each costs no more than $1 generalizes to all the existing projects we have on Amazon MTurk.

The authors mention that their system's output, even though it performs strongly, was generated by non-expert workers who did not see the big picture, and that it should not be thought of as a replacement for expert creation and curation of content. I agree with that. No matter how good the crowd is, if they are non-experts without access to the full picture, some information will be missing, which could lead to mistakes and imperfection. That shouldn't be compared to a domain expert, who would do a better job even if it costs more. Cost should not be the reason we favor the results of this system.

The suggested design patterns were a useful touch, and the way they were explained helps in understanding the proposed system as well. I think we should adopt some of these design patterns as best we can in our projects. Learning about this paper this late in our experiment design would make it hard to break our tasks down into simpler ones and test that theory on different topics. I would have loved to see how each of us would report on it, since we have an array of different experiments and simplifying some tasks could be impossible.

Discussion

  • What are your thoughts on breaking down tasks to such a small size?
  • Do you think that this could be applicable to all fields and generate similarly good quality? If not, where do you think this might not perform well?
  • Do you think that this system could replace domain experts? Why? Why not?
  • What applications is this system best suited for?


04/22/20 – Lulwah AlKulaib-Solvent

Summary

The paper argues that scientific discoveries are often based on analogies from distant domains. Nowadays, it is difficult for researchers to keep finding analogies because of the rapidly growing number of papers in each discipline and the difficulty of finding useful analogies in unfamiliar domains. The authors propose Solvent, a mixed-initiative system for finding analogies between research papers, to address this issue. They hire human annotators who structure academic paper abstracts into different aspects, and a model then constructs semantic representations from the provided annotations. The resulting semantic representations are used to find analogies among research papers within a domain and across different domains. In their studies, they show that the proposed system finds more analogies than existing baseline approaches in the information retrieval field. They outperform the state of the art, show that the annotations can generalize beyond the domain, and find that experts judge the analogies surfaced by the semantic model to be useful. Their system is a step in a new direction toward computationally augmented knowledge sharing between different fields.
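As a rough, hedged illustration of the core idea (my simplification, not the paper's exact pipeline): once abstracts are annotated into aspects such as purpose and mechanism, papers with analogous purposes can be surfaced by comparing simple vector representations of the purpose spans alone. The example papers and purpose texts below are invented:

# Hypothetical sketch: find "similar purpose, possibly different mechanism" analogies
# by comparing TF-IDF vectors of the annotated purpose spans only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    {"title": "Drug delivery via nanoparticles", "purpose": "deliver a payload to a precise target"},
    {"title": "Guided missile control",          "purpose": "steer a payload to a precise target"},
    {"title": "Image compression survey",        "purpose": "reduce storage size of images"},
]

purpose_vecs = TfidfVectorizer().fit_transform([p["purpose"] for p in papers])
sim = cosine_similarity(purpose_vecs)

query = 0  # find analogies for the first paper
ranked = sorted((i for i in range(len(papers)) if i != query),
                key=lambda i: sim[query, i], reverse=True)
print([papers[i]["title"] for i in ranked])  # most purpose-similar papers first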

Reflection

This was a very interesting paper to read. The authors' use of scientific ontologies and scholarly discourse, like those in the Core Information about Scientific Papers (CISP) ontology, makes me think about how relevant their work is, even though their goal differs from that of the corpora paper. I found the section where they explain adapting the annotation methods for research papers very useful for a different project.

One thing I had in mind while reading the paper was how well this scales to larger datasets. As the studies show, the datasets are relatively small. The authors explain in the limitations that part of the bottleneck is having a good set of gold-standard matches to evaluate their approach against. I think that's a valid reason, but it still leaves open the questions of what scaling up would require and how well it would work.

When going over their results and seeing how they outperformed existing state-of-the-art models and approaches, I also thought about real-world applications and how useful this model is. I never thought of using analogies to drive discovery across scientific domains; I always assumed it would be more reliable to have a co-author from that domain weigh in. Especially nowadays, with the vast communities of academics and researchers on social media, it's no longer that hard to find a collaborator in a domain that isn't yours. Also, looking at their results, their high precision only held when recommending the top k% of most similar analogy pairs. I wonder whether automating that has a greater impact than using the knowledge of a domain expert.

Discussion

  • Would you use a similar model in your research?
  • What would be an application that you can think of where this would help you while working on a new domain?
  • Do you think that this system could be outperformed by using a domain expert instead of the automated process?


04/15/20 – Lulwah AlKulaib-RiskPerceptions

Summary

People's choice to use a technology is associated with many factors, one of which is the perception of associated risk. The authors wanted to study how perceived risk influences technology use, so they adapted a survey instrument from the risk-perception literature to assess the mental models of users and technologists around the risks of emerging, data-driven technologies, for example identity theft and personalized filter bubbles. The authors surveyed 175 individuals on MTurk for comparative and individual assessments of risk, including characterizations using psychological factors. They report findings on group differences between experts (tech employees) and non-experts (MTurk workers) in how they assess risk and what factors may contribute to their conceptions of technological harm. They conclude that technologists see these risks as posing a bigger threat to society than non-experts do. Moreover, across groups, participants did not see technological risks as voluntarily assumed. The differences in how participants characterize risk have implications for the future of design, decision making, and public communication, which the authors discuss under the label of risk-sensitive design.

Reflection

This was an interesting paper. Being a computer science student has always been one of the reasons I question technology: why is a service offered for free, what's in it for the company, and what does it gain from my use?

It is interesting to see that the authors' findings are close to my real-life experiences. When I talk to friends who are more interested in a service that makes something easier for them, and I mention the associated risks, they usually haven't thought about those risks, so they don't consider them when making those decisions. Some of those risks are important for them to understand, since a lot of the available technology (apps, at least) could be used maliciously against its users.

I believe that experts and non-experts view risk differently, and that should be highlighted. This explains how problems like the filter bubble mentioned in the paper have become so concerning. It is very important to know how to respond when there is such a huge gap between how experts and the public think about risk. There should be a conversation to bridge the gap and educate the public in ways that are easy to perceive and accept.

I also think the way designers are applying risk-sensitive design techniques to new technologies is important. It helps introduce technology in a more comforting, socially responsible way. It feels gradual rather than sudden, which makes users more receptive to using it.

Discussion

  • What are your thoughts about the paper?
  • How do you define technology risk?
  • What are the top 5 risks that you can think of in technology from your point of view? How do you think that would differ when asking someone who does not have your background knowledge?
  • What are your recommendations for bridging the gap between experts and non-experts when it comes to risk?


04/15/20 – Lulwah AlKulaib-BelieveItOrNot

Summary

Fact-checking needs to be done in a timely manner, especially nowadays when it is used on live TV shows. While existing work presents many automated fact-checking systems, the human in the loop is neglected. This paper presents the design and evaluation of a mixed-initiative approach to fact-checking. The authors combine human knowledge and experience with the efficiency and scalability of automated information retrieval and machine learning. They present a user study in which participants used the proposed system to support their own assessment of claims. The results suggest that individuals tend to trust the system: participant accuracy in assessing claims improved when they were exposed to correct model predictions. Yet participants also over-trusted the system when the model was wrong, and exposure to incorrect predictions often reduced human accuracy. Participants who were given the option to interact with these incorrect predictions were often able to improve their own performance. This suggests that for better models, transparency is essential, especially in human-computer interaction, since AI models may fail and humans could be the key factor in correcting them.
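To make the mixed-initiative loop concrete, here is a small hypothetical sketch (not the authors' system; the ModelOutput fields, prompts, and example claim are my own assumptions). The model surfaces retrieved evidence, its prediction, and its confidence, and the human makes the final call:

# Hypothetical sketch of a mixed-initiative fact-checking loop:
# the model proposes, the human disposes.
from dataclasses import dataclass
from typing import List

@dataclass
class ModelOutput:
    claim: str
    evidence: List[str]    # retrieved snippets the human can inspect
    predicted_true: bool   # the model's veracity prediction
    confidence: float      # surfacing this is part of being transparent

def assess(output, ask_human):
    print("Claim:", output.claim)
    for snippet in output.evidence:
        print("  evidence:", snippet)
    verdict = "TRUE" if output.predicted_true else "FALSE"
    print(f"Model says: {verdict} (confidence {output.confidence:.0%})")
    # The human sees the model's evidence and prediction, then makes the final call.
    return ask_human("Your verdict -- is the claim true? (y/n): ")

final = assess(
    ModelOutput("Bats are blind.", ["Most bat species can see."], False, 0.83),
    ask_human=lambda prompt: input(prompt).strip().lower() == "y",
)
print("Final (human) verdict:", final)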

Reflection

I enjoyed reading this paper. It was very informative about the importance of transparent models in AI and machine learning, and about how transparent models can improve performance when we include the human in the loop.

In their limitations, the authors discuss important points about relying on crowdworkers. They explain that MTurk participants should not all be given the same weight when analyzing their responses, since different participant demographics or incentives may influence the findings. For example, non-US MTurk workers may not be representative of American news consumers or familiar with the latest news, and that could affect their responses. The authors also acknowledge that MTurk workers are paid by the task, and that could cause some of them to simply agree with the model's response, whether or not they really do, just so they can complete the HIT and get paid. They found a minority of such responses, which made me think about ways to mitigate this. As in last week's papers, studying an MTurk worker's behavior while completing the task might indicate whether the worker actually agrees with the model or is just trying to get paid.

The authors mention the negative impact that could potentially stem from their work: as we saw in their experiment, the model made a mistake but the humans over-trusted it. Dependence on AI and technology makes users give these systems more credit than they should, and such errors could shape users' perception of the truth. Addressing these limitations should be an essential requirement for further work.

Discussion

  • Where would you use a system like this most?
  • How would you suggest to mitigate errors produced by the system?
  • As humans, we trust AI and technology more than we should; how would you redesign the experiment to ensure that the crowdworkers actually check the presented claims?


04/08/20 – Lulwah AlKulaib-CrowdScape

Summary

The paper presents a system that supports the evaluation of complex crowd work through mixed-initiative machine learning and interactive visualization. The system addresses quality-control challenges that occur on crowdsourcing platforms. Previous work based quality control on either worker output or worker behavior alone, which was not effective for evaluating complex tasks. The suggested system combines a worker's behavior and output to evaluate complex crowd work. Its features allow users to develop hypotheses about their crowd, test them, and refine selections based on machine learning and visual feedback. The authors use MTurk and Rzeszotarski and Kittur's Task Fingerprinting system to create an interactive data visualization of the crowd workers. They posted four varieties of complex tasks: translating text from Japanese to English, picking a favorite color with an HSV color picker and writing its name, writing about a favorite place, and tagging science tutorial videos from YouTube. They conclude that information gathered from crowd workers' behavior is useful for reinforcing or contradicting one's conception of the cognitive processes workers use to complete tasks, and for developing and testing mental models of the behavior of workers who produce good or bad output. This helps users identify further good workers and output in a kind of positive feedback loop.
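A hedged toy sketch of the underlying idea (not the authors' implementation; the feature names and numbers are invented): describe each worker by both behavioral-trace features and output features, hand-label a few examples, and let a simple classifier flag the rest:

# Hypothetical sketch: combine behavioral traces with output features to
# flag likely low-quality crowd work, starting from a few labeled examples.
from sklearn.ensemble import RandomForestClassifier

# Each row: [seconds on task, key presses, tab switches, output length, unique words]
features = [
    [310, 420, 1, 220, 150],   # careful worker
    [290, 390, 2, 250, 160],   # careful worker
    [ 25,  10, 6,  30,  12],   # rushed / copy-paste worker
    [ 40,  15, 5,  20,  10],   # rushed / copy-paste worker
    [275, 350, 1, 200, 140],   # unlabeled worker
    [ 30,  12, 7,  25,  11],   # unlabeled worker
]
labels = [1, 1, 0, 0]          # requester hand-labels a few: 1 = good, 0 = bad

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features[:4], labels)          # learn from the labeled seed set
print(clf.predict(features[4:]))       # -> likely [1, 0]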

Reflection

This paper presents an interesting approach to discovering low-quality responses from crowd workers. Combining these two methods is an interesting idea, and it makes me think about our project and what limitations might arise from following their method of logging crowd workers' behavior. I had not thought about disclosing to crowdworkers that their behavior while responding is being recorded, and now I want to check previous work on whether that disclosure affects workers' responses. I found it interesting that crowdworkers used machine translation in the Japanese-to-English translation task even when they knew their behavior was being recorded. I assume that because there was no requirement to speak Japanese, or the requirements were relaxed, crowd workers were able to perform the task using tools like Google Translate; if such requirements were in place, the workers wouldn't be paid for the task. This also alerted me to the importance of clear task requirements and explanations for crowd workers, since some Turkers could abuse the system and give us low-quality results simply because the rules weren't clear.

Having the authors list their limitations was useful for me. It gave me another perspective on how to evaluate the responses we get in our project and what we can do to improve our feedback approach.

Discussion

  • Would you use behavioral traces as an element in your project? If yes, would you tell the crowd workers that you are collecting that data? Why? Why not?
  • Do you think that implicit feedback and behavioral traces can help determine the quality of a crowd worker’s answer? Why? Or why not?
  • Do you think that collecting such feedback is a privacy issue? Why? Or Why not?


04/08/20 – Lulwah AlKulaib-Agency

Summary

The paper considers the design of systems that enable rich and adaptive interaction between people and algorithms. The authors attempt to balance the complementary strengths and weaknesses of humans and algorithms while promoting human control and skillful action. They aim to employ AI methods while ensuring that people remain in control, supporting the position that people should be unconstrained in pursuing complex goals and exercising domain expertise. They share case studies of interactive systems they developed in three fields: data wrangling, exploratory analysis, and natural language translation, each of which integrates proactive computational support into an interactive system. For each case study, they examine the strategy of designing shared representations that augment interactive systems with predictive models of users' capabilities and potential actions, surfaced via interaction mechanisms that enable user review and revision. These models enable automated reasoning about tasks in a human-centered fashion and can adapt over time by observing and learning from user behavior. To improve outcomes and support learning by both people and machines, they describe the use of shared representations of tasks augmented with predictive models of human capabilities and actions. They conclude with how we might better construct and deploy systems that integrate agency and automation via shared representations. They also found that neither automated suggestions nor direct manipulation plays a strictly dominant role; rather, a fluent interleaving of both modalities can enable more productive, yet flexible, work.

Reflection

The paper was very interesting to read, and the case studies presented were thought-provoking. They are all based on research that I have read while learning about natural language processing, and thinking of them as suggestive examples makes me wonder how user-interface toolkits might affect the design and development of models.

I also wonder, as raised in the future work, how to evaluate systems across varied levels of agency and automation. What would the goal of that evaluation process be? Would it differ across machine learning disciplines? The case studies in the paper used specific evaluation metrics, and I wonder how those generalize to other models. What other methods could be used for evaluation in the future, and how does one compare two systems when comparing their results is no longer enough?

I believe this paper sheds some light on how evaluation criteria can be topic-specific, and those criteria will be shared across applications relevant to human learning. It is important to pay attention to how such systems promote interpretability, learning, and skill acquisition instead of deskilling workers. It is also essential that we think of designs that optimize the trade-offs between automated support and human engagement.

Discussion

  • What is your takeaway from this paper?
  • Do you agree that we need better design tools that aid the creation of effective AI-infused interactive systems? Why? Or Why not?
  • What determines a balanced AI-human interaction?
  • When is AI agency/control harmful? When is it useful?
  • Is ensuring that humans remain in control of AI models important? If models were trained by domain experts with domain expertise, why do we mistrust them?


03/25/20 – Lulwah AlKulaib-AllWorkNoPlay

Summary

The paper studies a field deployment of a question-and-answer chatbot in the human resources domain, focusing on users' conversational interactions with the chatbot. The HR chatbot provided company-related information assistance to 377 new employees for 6 weeks. The authors' motivation was that conversational interactions carry rich signals for inferring user status, and these signals could be used to develop agents that adapt in both functionality and interaction style. By contrasting the signals, they show the various functions of conversational interactions. The authors discuss design implications for conversational agents and directions for developing adaptive agents based on users' conversational behaviors. In the paper, they address two main research questions:

• RQ1: What kinds of conversational interactions did users have with the QA agent in the wild?

• RQ2: What kinds of conversational interactions can be used as signals for inferring user satisfaction with the agent’s functional performance, and playful interactions?

They answer RQ1 by presenting a characterization of the users' conversational input and high-level conversational acts. After providing this characterization, the authors study what signals exist in the interactions for inferring user satisfaction (RQ2).
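As a purely hypothetical illustration of what "conversational interactions as satisfaction signals" might look like in code (my own simplification, not the paper's analysis; the signal categories and patterns are invented), one could count simple cues in a user's chat log and feed them to a satisfaction model:

# Hypothetical sketch: turn a user's chat log into simple signal counts
# (rephrasings, thanks, complaints, playful turns).
import re

SIGNALS = {
    "rephrase":  r"\b(i mean|in other words|let me rephrase)\b",
    "gratitude": r"\b(thanks|thank you|great)\b",
    "complaint": r"\b(useless|wrong|not what i asked)\b",
    "playful":   r"\b(lol|haha|tell me a joke)\b",
}

def signal_counts(user_turns):
    counts = {name: 0 for name in SIGNALS}
    for turn in user_turns:
        for name, pattern in SIGNALS.items():
            if re.search(pattern, turn.lower()):
                counts[name] += 1
    return counts

chat = ["Where do I submit my timesheet?",
        "That is not what I asked.",
        "I mean the weekly timesheet.",
        "Great, thanks!"]
print(signal_counts(chat))
# -> {'rephrase': 1, 'gratitude': 1, 'complaint': 1, 'playful': 0}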

Reflection

In the paper, the authors analyze conversations as signals of user satisfaction (RQ2). I found that part most interesting: their results show that users were fairly divided in opinion about the chatbot's functionality and playfulness, which means there is a need to adapt system functions and interaction styles for different users.

This observation makes me think of other human-in-the-loop systems and how system functions and interaction styles would affect user satisfaction there. In systems that aren't chatbot-based, how is satisfaction measured? When systems handle a substantial amount of interaction, would it be different? Does it matter if satisfaction is self-reported by the user, or would it be better to measure it from their interaction with the system?

The paper acknowledges that the results are based on survey data as a limitation. The authors mention a response rate of 34%, which means they cannot rule out self-selection bias. They also acknowledge that some observations might be specific to the workplace context and the user sample of the study.

The results in this paper provide some understanding of the functions of conversational behaviors with conversational agents, derived from human conversations. I would love to see similar resources for non-conversational systems and how user satisfaction is measured there.

Discussion

  • Is user satisfaction an important factor/evaluation method in your project?
  • How would you quantify user satisfaction in your project?
  • Would you measure satisfaction using a self reported survey by the user? Or would you measure it based on the user’s interaction with the system? And why?
  • Did you notice any other limitations in this paper other than the ones mentioned?


03/25/20 – Lulwah AlKulaib-VQAGames

Summary

The paper presents GuessWhich, a cooperative game between humans and AI. The game is a live, conversational interaction in which the human is shown multiple photos as choices while the AI holds only one; the human asks the AI, ALICE, questions to identify which photo is the correct choice. ALICE was trained with both supervised learning and reinforcement learning on a publicly available visual dialog dataset and was then used to evaluate human-AI team performance. The authors find no significant difference in performance between ALICE's supervised-learning and reinforcement-learning versions when paired with human partners. Their findings suggest that while self-talk and reinforcement learning are interesting directions for building better visual conversational agents, there appears to be a disconnect between AI-AI and human-AI evaluations: progress in the former does not seem to be predictive of progress in the latter. It is important to note that measuring AI progress in isolation is not as useful for systems that require human-AI interaction.
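For intuition, here is a hedged toy sketch (my own, not the paper's protocol or metric) of scoring a GuessWhich-style round by how many question-answer turns the human needs to isolate ALICE's hidden image:

# Hypothetical sketch: score a human-AI round by the number of turns it takes
# for the human to narrow the candidate pool down to the bot's hidden image.
def play_round(pool, secret, ask_bot, guess_from, max_turns=5):
    candidates = list(pool)
    for turn in range(1, max_turns + 1):
        question = f"Turn {turn}: what stands out in your image?"
        answer = ask_bot(secret, question)           # ALICE answers about its hidden image
        candidates = [img for img in candidates      # human prunes images that are
                      if guess_from(img, answer)]    # inconsistent with the answer
        if candidates == [secret]:
            return turn                              # fewer turns = better teamwork
    return max_turns + 1                             # the team failed to converge

# Toy stand-ins: images are tag sets; the "bot" reveals one tag of its image per turn.
images = [{"dog", "park"}, {"cat", "sofa"}, {"dog", "beach"}]
secret = images[2]
reveal = iter(sorted(secret))                        # yields "beach", then "dog"
print(play_round(images, secret,
                 ask_bot=lambda img, q: next(reveal),
                 guess_from=lambda img, tag: tag in img))   # -> 1 turn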

Reflection

The concept presented in this paper is interesting. As someone who doesn't work in the HCI field, it opened my eyes to the different ways the models I have worked on shouldn't be measured in isolation. The authors showed that evaluating visual conversational agents through a human computation game gives results that differ from conventional AI-AI evaluation. This makes me wonder how such methods would apply to tasks in which automated metrics correlate poorly with human judgment, like natural language generation for image captioning, and how a method inspired by this paper would differ from the methods suggested in last week's papers. Given the difficulties these tasks present and their interactive nature, it is clear that the most appropriate way to evaluate them is with a human in the loop, but how would a large-scale human-in-the-loop evaluation happen, especially with limited financial and infrastructure resources?

This paper made me think of the challenges that come with human-in-the-loop evaluations:

1- In order to have it done properly, we must have a set of clear and simple instructions for crowdworkers.

2- There should be a way to ensure the quality of the crowdworkers. 

3- For the evaluation’s sake, we need uninterrupted communication.

My takeaway from the paper is that while traditional platforms were adequate for evaluation tasks using automatic metrics, there is a critical need to support human-in-the-loop evaluation for free-form multimodal tasks.

Discussion

  • What are the ways that we could use this paper to evaluate tasks like image captioning?
  • What are other challenges that come with human in the loop evaluations?
  • Is there a benchmark of human-AI performance in the field of your project? How would you ensure that your results are comparable?
  • How would you utilize the knowledge about human-AI evaluation in your project?
  • Have you worked with measuring evaluations with a human in the loop? What was your experience?
