03/04/20 – Lulwah AlKulaib- CrowdStreetView

Summary

The authors try to assess the accessibility of sidewalks by hiring AMT workers to analyze Google Street View images. Traditionally, sidewalk assessment is conducted in person via street audits, which are highly labor-intensive and expensive, or through reports called in by citizens. The authors propose their system as a proactive alternative to these approaches. They perform two studies:

  • A feasibility study (Study 1): examines the feasibility of the labeling task with six dedicated labelers including three wheelchair users
  • A crowdsourcing study (Study 2): investigates the comparative performance of turkers

In study 1, since labeling sidewalk accessibility problems is subjective and potentially ambiguous, the authors investigate the viability of labeling across two groups:

  • Three members of the research team
  • Three wheelchair users – accessibility experts

They use the results of Study 1 to provide ground truth labels for evaluating crowdworkers’ performance and to get a baseline understanding of what labeling this dataset looks like. In Study 2, the authors investigate the potential of using crowd workers to perform the labeling task. They evaluate their performance on two levels of labeling accuracy:

  • Image level: tests for the presence or absence of the correct label in an image 
  • Pixel level: examines the pixel level accuracies of the provided labels

They show that AMT workers are capable of finding accessibility problems with an accuracy of 80.6% and determining the correct problem type with an accuracy of 78.3%. Results improve when majority voting is used as a labeling technique, reaching 86.9% and 83.9%, respectively. In total, they collected 13,379 labels and 19,189 verification labels from 402 workers. Their findings suggest that crowdsourcing both the labeling task and the verification task leads to better quality results.
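As a side note, the majority-voting aggregation described above is simple to illustrate. Below is a minimal sketch (my own illustration, not the authors’ code) of combining several workers’ labels for the same image; the image IDs, label names, and tie-breaking behavior are illustrative assumptions.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties are broken by first occurrence."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labels from three workers for each Street View image.
worker_labels = {
    "image_001": ["curb ramp missing", "curb ramp missing", "object in path"],
    "image_002": ["surface problem", "surface problem", "no sidewalk"],
}

aggregated = {img: majority_vote(votes) for img, votes in worker_labels.items()}
print(aggregated)
# {'image_001': 'curb ramp missing', 'image_002': 'surface problem'}
```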

Reflection

The authors treat wheelchair users as the accessibility experts in this paper, whereas in real life such assessments are usually done by civil engineers. I wonder how that choice affected their labels and results, since street accessibility concerns more than wheelchair users. It would be worth investigating with a pool of experts from multiple backgrounds.

I also think that curating the dataset of photos to work on was a requirement for this labeling system; otherwise it would have been a tedious amount of work on “bad” images. The dataset requires refinement before it can be labeled, so I can’t imagine how this would scale to Google Street View as a whole.

In addition, the focal point of the camera was not considered, which reduces the scalability of the project. Even though the authors suggest installing cameras angled toward sidewalks, until that is implemented I don’t see how this model could work well in the real world (outside a controlled experiment).

Discussion

  • What are improvements that the authors could have done to their analysis?
  • How would their labeling system work for random Google Street View photos?
  • How would the focal point of the GSV camera affect the labeling? 
  • If cameras were angled towards sidewalks, and we were able to get a huge amount of photos for analysis, what would be a good way to implement this project?


03/04/20 – Lulwah AlKulaib- SocialAltText

Summary

The authors propose a system that generates alt text for images embedded in social media posts by utilizing crowd workers. Their goal is a better experience for blind and visually impaired (BVI) users of social media. Existing tools provide imperfect descriptions, some via automatic caption generation and others via object recognition; these systems are not enough, as in many cases their results aren’t descriptive enough for BVI users. The authors study how crowdsourcing can be used for both:

  • Evaluating the value provided by existing automated approaches
  • Enabling workflows that provide scalable and useful alt text for BVI users

They utilize real-time crowdsourcing to run experiments with varying depths of crowd interaction in assisting visually impaired users. They show the shortcomings of existing AI image captioning systems and compare them with their method. The paper suggests two experiences:

  • TweetTalk: a conversational assistant workflow.
  • Structured Q&A: a workflow that builds upon and enhances state-of-the-art generated captions.

They evaluated the conversational assistant with 235 crowdworkers. They also evaluated 85 tweets for the baseline image caption, with each tweet evaluated three times for a total of 255 evaluations.

Reflection

The paper presents a novel concept, and their approach is a different take on utilizing crowdworkers. I believe the experiment would have worked better if they had tested it on visually impaired users. Since the crowdworkers hired were not visually impaired, it is harder to say that BVI users would have the same reaction; since the study targets BVI users, they should have been the pool of testers. People interact with the same element in different ways, and what the authors showed seemed too controlled. Also, the questions were not the same for all images, which makes the results harder to generalize. The presented model tries to solve a problem for social media photos, and the lack of a repeatable plan for each photo might make interpreting images difficult.

I appreciated the authors’ use of existing systems and their attempt at improving the AI-generated captions. Their results achieve better accuracy than state-of-the-art work.

I would have loved to see how different social media applications compare with one another, since applications vary in how they present photos. Twitter, for example, allows only a limited character count, while Facebook can present more text, which might help BVI users understand the image better.

In the limitations section, the authors mention that human-in-the-loop workflows raise privacy concerns and that the alt text approach could generalize to friendsourcing and utilizing social network users. I wonder how that would generalize to social media applications in real time, and how reliable friendsourcing would be for BVI users.

Discussion

  • What are improvements that you would suggest to better the TweetTalk experiment?
  • Do you know of any applications that use human in the loop in real time?
  • Would you have any privacy concerns if one of the social media applications integrated a human in the loop approach to help BVI users?


02/26/20 – Lulwah AlKulaib- Explaining Models

Summary

The authors believe that in order to ensure fairness in machine learning systems, it is mandatory to have a human-in-the-loop process, and they posit that relying on developers, users, and the general public is an effective way to identify fairness problems and make improvements. The paper conducts an empirical study with four types of programmatically generated explanations to understand how they impact people’s fairness judgments of ML systems. It tries to answer three research questions:

  • RQ1 How do different styles of explanation impact fairness judgment of an ML system?
  • RQ2 How do individual factors in cognitive style and prior position on algorithmic fairness impact the fairness judgment with regard to different explanations?
  • RQ3 What are the benefits and drawbacks of different explanations in supporting fairness judgment of ML systems?

The authors focus on a racial discrimination case study, examining both model-wide unfairness and case-specific disparate impact. They performed an experiment with 160 Mechanical Turk workers. Their hypothesis was that, because local explanations focus on justifying a particular case, they should more effectively surface fairness discrepancies between cases.

 The authors show that: 

  • Certain explanations are considered inherently less fair, while others can enhance people’s confidence in the fairness of the algorithm
  • Different fairness problems, such as model-wide fairness issues versus case-specific fairness discrepancies, may be more effectively exposed through different styles of explanation
  • Individual differences, including prior positions and judgment criteria of algorithmic fairness, impact how people react to different styles of explanation.

Reflection

This is a really informative paper. I like that it had a straightforward hypothesis and evaluated a single existing case study. But I would have loved to see this addressed with judges instead of crowdworkers. The authors mention this in their limitations, and I hope they find enough judges willing to work on a follow-up paper. I believe judges would have insightful knowledge to contribute, especially since they make such decisions in practice, and their involvement would give a more meaningful analysis of the case study from professionals in the field.

I also wonder how this might scale to different machine learning systems that cover similar racial biases. Having a specific case study makes it harder to generalize, even within the same domain, but it is definitely worth investigating since there are so many existing case studies. I also wonder whether, with a different case study, we would notice a difference in the local vs. global explanation patterns in fairness judgment, and how a mix of both would affect the judgment, too.
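To make the local vs. global distinction concrete, here is a minimal sketch (my own illustration, not the paper’s explanation generator) contrasting a global explanation (model-wide feature weights) with a local explanation (per-feature contributions for a single case). The feature names and data are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["prior_arrests", "age", "employment_years"]   # made-up feature names
X = rng.normal(size=(200, 3))
y = (X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Global explanation: one weight per feature, describing the whole model.
print("Global weights:", dict(zip(features, model.coef_[0].round(2))))

# Local explanation: per-feature contributions (weight * value) for a single case.
case = X[0]
contributions = (model.coef_[0] * case).round(2)
print("Local contributions for case 0:", dict(zip(features, contributions)))
```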

Discussion

  • What are other ways you would approach this case study?
  • What are some explanations that weren’t covered in this study?
  • How would you facilitate this study to be performed with judges?
  • What are other case studies that you could generalize this to with small changes to the hypothesis?


02/26/20 – Lulwah AlKulaib- Interpretability

Summary

Machine learning (ML) models are integrated in many domains nowadays (for example, criminal justice, healthcare, and marketing). The universal presence of ML has moved beyond academic research and grown into an engineering discipline. Because of that, it is important to interpret ML models and understand how they work by developing interpretability tools. Machine learning engineers, practitioners, and data scientists have been using these tools. However, because the extent to which these tools achieve interpretability has received minimal evaluation, the authors study the use of two such tools to uncover issues that arise when building and evaluating models: the InterpretML implementation of GAMs and the SHAP Python package. They conduct a contextual inquiry and a survey of 197 data scientists to observe how they use interpretability tools to uncover common issues in ML pipelines. Their results show that data scientists did utilize visualizations produced by interpretability tools to uncover issues in datasets and models. Yet the availability of these tools has also led to over-trust and misuse.
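For context, here is a minimal sketch of how the two packages are typically invoked on a tabular dataset. This is an assumption-laden illustration (a scikit-learn-style setup with the public PyPI packages), not the authors’ study materials.

```python
import shap
import xgboost
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import train_test_split

# A demo dataset bundled with SHAP (census income prediction).
X, y = shap.datasets.adult()
y = y.astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# InterpretML's GAM implementation (Explainable Boosting Machine).
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
show(ebm.explain_global())               # interactive per-feature shape functions

# SHAP explanations for a conventional gradient-boosted model.
model = xgboost.XGBClassifier().fit(X_train, y_train)
shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap.summary_plot(shap_values, X_test)   # global view of feature attributions
```

These are exactly the kinds of visualizations the study participants relied on, and the paper’s point is that such plots invite over-interpretation when used without care.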

Reflection

Machine learning is now being used to address important problems like predicting crime rates in cities to help police distribute manpower, identifying cancerous cells, predicting recidivism in the judiciary system, and locating buildings at risk of catching fire. Unfortunately, these models have been shown to learn biases. Detecting these biases is subtle, especially for beginners in the field. I agree with the authors that it is troubling when machine learning is misused, whether intentionally or out of ignorance, in situations where ethics and fairness are paramount. A lack of model explainability can lead to biased and ill-informed decisions. In our ethics class, we went over case studies where interpretability was lacking and contributed to racial bias in facial analysis systems [1], biased recidivism predictions [2], and textual gender biases learned from language [3]. Some of these systems were used in real life and have affected people’s lives. I think that running an analysis similar to the one presented in this paper before deploying systems into practice should be mandatory. It would give developers a better understanding of their systems and help them avoid biased decisions that could be corrected before going into public use. It is also important to inform developers about how dependable interpretability tools are and how to tell when they are over-trusting or misusing them. Interpretability is a “new” field in machine learning, and I’ve been seeing conferences add sessions about it lately. I’m interested in learning more about interpretability and how we can apply it to different machine learning models.

Discussion

  • Have you used any of the mentioned interpretability packages in your research? How did it help in improving your model?
  • What are case studies that you know of where machine learning bias is evident? Were these biases corrected? If so, How?
  • Do you have any interpretability related resources that you can share with the rest of the class?
  • Do you plan to use these packages in your project? 

References

  1. https://splinternews.com/predictive-policing-the-future-of-crime-fighting-or-t-1793855820
  2. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
  3. Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems (pp. 4349-4357).


02/19/20 – Lulwah AlKulaib- Dream Team

Summary

The authors note that previous HCI research has focused on ideal team structures and on how roles, norms, and interaction patterns are influenced by systems. That research directed teams toward those structures by increasing shared awareness, adding channels of communication, and convening effective collaborators. Yet organizational behavior research denies the existence of universally ideal team structures; structural contingency theory has demonstrated that the best team structure depends on the task, the members, and other factors. The authors introduce DreamTeam, a system that identifies effective team structures for each team by adapting teams to different structures and evaluating each fit. DreamTeam explores over time, experimenting with values along many dimensions of team structure such as hierarchy, interaction patterns, and norms. The system uses feedback, such as team performance or satisfaction, to iteratively identify the structures that best fit each team, experimenting with different structures over time using multi-armed bandits.
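To make the bandit idea concrete, here is a minimal sketch (my own illustration, not DreamTeam’s implementation) of treating candidate team structures as arms and using team feedback as the reward. The structure names, the scores, and the epsilon-greedy strategy are assumptions.

```python
import random

class StructureBandit:
    def __init__(self, structures, epsilon=0.2):
        self.structures = structures                 # candidate team structures (arms)
        self.epsilon = epsilon                       # exploration rate
        self.counts = {s: 0 for s in structures}
        self.values = {s: 0.0 for s in structures}   # running mean reward per arm

    def choose(self):
        # Explore a random structure with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(self.structures)
        return max(self.structures, key=lambda s: self.values[s])

    def update(self, structure, reward):
        # Incremental mean update with the observed team feedback.
        self.counts[structure] += 1
        n = self.counts[structure]
        self.values[structure] += (reward - self.values[structure]) / n

# Each "round" the team works under one structure and reports a feedback score.
bandit = StructureBandit(["flat hierarchy", "single leader", "daily check-ins"])
for team_feedback in [0.6, 0.4, 0.8, 0.7, 0.9]:      # hypothetical scores
    arm = bandit.choose()
    bandit.update(arm, team_feedback)
print("Best structure so far:", max(bandit.values, key=bandit.values.get))
```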

Reflection

The paper presents a system that focuses on virtual teams. In my opinion, it is a very specific application to a very specific problem. The authors acknowledge a long list of limitations, including that they don’t believe the system generalizes easily to other problems. I also find the way they utilize feedback complex and unclear: their reward function does not explain how qualitative factors are taken into consideration. The authors mention that high-variance tasks would require more time for DreamTeam to converge, which means more time to get a response from the system, and I don’t see how that is useful if it slows teams down.

Also, looking at the snapshot of the Slack integration, it seems that they measure team satisfaction based on users’ responses to a task, which is not always how collaboration on Slack works; the enthusiasm of the responses seems out of the norm. The authors do not address how their system would handle “team satisfaction” when there is little to no response. Would that count as a negative response, or as neutral? And even though the system worked well for the very specific task they chose, it was also a virtual team, which raises questions about how this method would apply to in-person or hybrid teams. Their controlled environment was very controlled. Even though they present a good idea, I doubt how applicable it is to real-life situations.

Discussion

  • In your opinion, what makes a dream team?
  • Are you pro or against ideal team structures? Why?
  • What were the qualities of collaborators in the best group project/research you had?
  • What makes the “chemistry” between team members?
  • What does a successful collaborative team project look like during a cycle?
  • What tools do you use in project management? 
  • Would you use DreamTeam in your project?
  • What would you change in DreamTeam to make it work better for you?


02/19/20 – Lulwah AlKulaib- OrderWikipedia

Summary

The paper examines the role of software tools in English-language Wikipedia. The authors shed light on the process of counter-vandalism in Wikipedia, explaining in detail how participants and their assisted editing tools review Wikipedia contributions and enforce standards. They show that the editing process in Wikipedia is not a disconnected activity where editors force their views on others. Specifically, vandal fighting is shown as a distributed cognition process in which users come to know their projects and the users who edit them in a way that would be impossible for a single individual. The authors claim that blocking a vandal is a cognitive process made possible by a complex network of interactions between humans, encyclopedia articles, software systems, and databases. Humans and non-humans work together to produce and maintain a social order in the collaborative production of an encyclopedia with hundreds of thousands of diverse and often unorganized contributors. The authors introduce trace ethnography as a method for studying the seemingly ad hoc assemblage of editors, administrators, bots, assisted editing tools, and others who constitute Wikipedia’s vandal-fighting network.

Reflection

The paper comes off as a survey paper. I found that the authors explained some methods that already existed and used one of the authors’ experience to elaborate on others’ work. I couldn’t see their contribution, but maybe that was what was needed ten years ago. The tools they mention (Huggle, AIV, Twinkle, etc.) were standard tools for editing Wikipedia articles and monitoring edits made by others. The authors reflect on how those tools made fighting vandalism an easier task. They mention that these tools facilitate reviewing each edited article by linking it to a detailed edit summary explaining why the edit was made, by whom, and from which IP addresses. They explain how such software is used to detect vandalism and how to revert to the correct version of an article. They present a case study of a Wikipedia vandal and show logs of the changes he was able to make in an hour. The authors also reference Ed Hutchins, who describes the cognitive work required to keep US Navy ships on course at any given time, and how that is similar to what it takes to manage Wikipedia. Technological actors in Wikipedia, such as Huggle, turn what would be a difficult task into a mundane affair: reverting edits becomes a matter of pressing a button. The paper was informative for someone who hasn’t worked on editing Wikipedia articles, but I think it could have been presented as a tutorial, which would have been more beneficial.

Discussion

  • Have you worked on Wikipedia article editing before?
  • Did you encounter using the tools mentioned in the paper?
  • Is there any application that comes to mind where this can be used other than Wikipedia?
  • Do you think such tools could be beneficial when it comes to open source software version control?
  • How would this method generalize to open source software version control?


02/05/20 – Lulwah AlKulaib- Making Better Use of the Crowd

Summary

The survey provides an overview of machine learning projects utilizing crowdsourcing research. The author focuses on four application areas where crowdsourcing can be used in machine learning research: data generation, model evaluation and debugging, hybrid intelligence systems, and behavioral studies to inform ML research. She argues that crowdsourced studies of human behavior can be valuable for understanding how end users interact with machine learning systems, and that these studies are also useful for understanding crowdworkers themselves. She explains that understanding crowdworkers helps in defining best-practice recommendations for working with the crowd. The case studies she presents show how to effectively run a crowdwork study and provide additional sources of motivation for workers. They also address how common dishonesty is on crowdsourcing platforms and how to mitigate it, and they reveal the hidden social network of crowdworkers, debunking the misconception that crowdworkers are independent and isolated. The author concludes with new best practices and tips for projects that use crowdsourcing, and she emphasizes the importance of pilots to a project’s success.

Reflection

This paper focuses on answering the question: how can crowdsourcing advance machine learning research? It asks readers to consider how machine learning researchers think about crowdsourcing, suggesting an analysis of the multiple ways in which crowdsourcing can benefit, and sometimes benefit from, machine learning research. The author focuses her attention on four categories:

  • Data generation:

She analyzes case studies that aim to improve the quality of crowdsourced labels.

  • Evaluating and debugging models:

She discusses some papers that used crowdsourcing in evaluating unsupervised machine learning models.

  • Hybrid intelligence systems:

She shows examples of utilizing the “human in the loop” and how these systems are able to achieve more than would be possible with state of the art machine learning or AI systems alone because they make use of people’s skills and knowledge.

  • Behavioral studies to inform machine learning research:

This category discusses interpretable machine learning models design, the impact of algorithmic decisions on people’s lives, and questions that are interdisciplinary in nature and require better understanding of how humans interact with machine learning systems and AI.

The remainder of the survey provides best practices for crowdsourcing, derived from multiple case studies. She addresses dishonest and spam-like behavior, how to set payments for tasks, what incentives exist for crowdworkers, how crowdworkers can motivate each other, and the communication and collaboration between crowdworkers.
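One common mitigation for the dishonest and spam-like behavior discussed above is seeding tasks with “gold” questions whose answers are already known and filtering out workers who miss too many of them. The sketch below is my own illustration of that idea (the worker IDs, answers, and 0.7 threshold are assumptions), not a method taken from the survey.

```python
# Gold (attention-check) questions with known answers, seeded among real tasks.
gold_answers = {"q1": "A", "q2": "C", "q3": "B"}

# Hypothetical responses to the gold questions from two workers.
worker_responses = {
    "worker_1": {"q1": "A", "q2": "C", "q3": "B"},   # consistent with the gold labels
    "worker_2": {"q1": "B", "q2": "B", "q3": "B"},   # likely spamming one option
}

def gold_accuracy(responses):
    hits = sum(responses.get(q) == answer for q, answer in gold_answers.items())
    return hits / len(gold_answers)

# Keep only workers who pass an (assumed) 70% accuracy threshold on gold questions.
trusted = [w for w, r in worker_responses.items() if gold_accuracy(r) >= 0.7]
print("Trusted workers:", trusted)   # ['worker_1']
```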

I found the community of crowdworkers the most interesting part to read about. We have always thought of them as isolated, independent workers; finding out about the forums, how they promote good jobs, and how they encourage one another was surprising.

I also find the suggested tips and best practices beneficial for crowdsourcing task posters, especially those new to the environment.

Discussion

  • What was something unexpected that you learned from this reading?
  • What are your tips for new crowdsource platform users?
  • What would you utilize from this reading into your project planning/work?


02/05/20 – Lulwah AlKulaib- Power to the people

Summary

The paper argues that end users have little involvement in application development nowadays. The authors mention that developers apply machine learning techniques to solve problems but limit their interaction with end users to mediation by practitioners. This results in a long process with multiple iterations, which limits the users’ ability to affect the models. The authors shed light on the importance of studying users in these systems and present case studies as examples of how such systems could result in better user experiences and more effective learning systems. They bring to our attention the advantages of studying user interaction with interactive machine learning systems and some pitfalls that developers must watch out for. They also present case studies of novel interfaces for interactive machine learning, clarify the different ways to create richer interactions with users, and emphasize the importance of evaluating these interfaces with end users. The authors conclude by underlining that any approach should be appropriately evaluated and tested before deployment, since permitting richer user interactions is often, but not always, beneficial. They believe that by acknowledging the challenges in this approach, we can produce better machine learning systems as well as better end users.

Reflection

This paper focuses on the importance of the end user’s role in interactive machine learning systems. It raises questions about how users can effectively influence machine learning systems and how those systems can appropriately influence users. The paper also presents case studies that explain how people interact with machine learning systems. In those cases, some unexpected results were found: people violated assumptions of the machine learning algorithm or weren’t willing to comply with them. Other cases showed that studies can lead to insights about the input and output types that interactive machine learning systems should support. The paper discusses case studies of novel interfaces for interactive machine learning, whether the novelty comes from new methods of receiving input or of presenting output. The authors mention that new input techniques can give users more control over the system, while new output techniques can make the system more transparent or understandable. The paper does note, however, that not all novel interfaces were beneficial; certain input and output types created obstacles for users, reducing the accuracy of the learned model. The paper raises a good point about how different end users have different needs and expectations of these systems, so rich interaction techniques must be designed accordingly. I agree with the authors that conducting studies of novel interactive machine learning systems is critical, and that those studies could form the basis of guidelines for future interactive learning systems.
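As an aside, the interaction loop the paper describes can be sketched in a few lines: the system surfaces an example it is uncertain about, the end user labels it, and the model retrains on the growing labeled set. This is my own minimal illustration (the dataset, the least-confidence heuristic, and the simulated user label are assumptions), not code from the paper.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
labeled_idx = list(range(10))                 # start with a handful of labeled examples
model = LogisticRegression(max_iter=1000)

for _ in range(5):                            # five interaction rounds
    model.fit(X[labeled_idx], y[labeled_idx])
    uncertainty = 1 - model.predict_proba(X).max(axis=1)   # least-confidence score
    uncertainty[labeled_idx] = -1             # don't re-ask about already-labeled items
    query = int(uncertainty.argmax())         # example shown to the user
    # In a real system the end user supplies this label; ground truth stands in here.
    labeled_idx.append(query)

print("Accuracy after interaction:", round(model.score(X, y), 3))
```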

Discussion

  • How would you apply interactive machine learning in your project?
  • Have you encountered such systems in other research papers you have read?
  • What are applications that could benefit from utilizing interactive machine learning systems?
  • How would you utilize some case studies suggestions from the paper in a machine learning model rather than the user experience?


01/29/20 – Lulwah AlKulaib- Human Computation: A Survey & Taxonomy of a Growing Field

Summary:

The paper briefly covers the history of human computation: the first dissertation (2005), the first workshop (2009), and the different backgrounds of scholars in the field. The authors agree with Von Ahn’s definition of human computation as “… a paradigm for utilizing human processing power to solve problems that computers cannot yet solve,” and mention multiple definitions from other papers and scholars. They believe that two conditions need to be satisfied to constitute human computation:

  • The problems fit the general paradigm of computation and so might someday be solvable by computers.
  • The human participation is directed by a computational system or process.

They present a classification for human computation systems along six main dimensions:

  • Motivation
  • Human skill
  • Aggregation
  • Quality control
  • Process
  • Task-request cardinality

The authors also explain how to find new research problems based on the proposed classification system:

  • Combining different dimensions to discover new applications.
  • Creating new values for a given dimension.

Reflection:

An interesting issue the authors discuss is their belief that the Wikipedia model does not belong to human computation, because current Wikipedia articles are created through a dynamic social process of discussion about the facts and presentation of each topic among a network of authors and editors. I never thought of Wikipedia as human computation, although there are tasks there that I believe could be classified as such, especially in non-English articles. As we all know, the NLP field has created great solutions for the English language, yet some languages, even widely spoken ones, are playing catch-up. This brings me to disagree with the authors’ opinion about Wikipedia. I agree that some parts of Wikipedia are related to social computing, like allowing collaborative writing, but they also have human computation aspects, such as linked-data identification for the infobox in Arabic articles. Even though NLP techniques might work for English articles on Wikipedia, Arabic is still behind when it comes to such tasks, and machines are unable to complete them correctly.

On another note, I like the way the authors broke up their classification and explained each section. It clarified their point of view, and they provided an example for each part. I think the distinctions were addressed in detail, and they left enough room to consider the classification of future work. I believe this is why other scientists have adopted the classification; seeing that the paper has been cited more than 900 times makes me believe there is some agreement in the field.

Discussion:

  1. Give examples of human computation tasks.
  2. Do you agree/disagree with the authors’ opinion about Wikipedia’s articles being excluded from the human computation classification?
  3. How is human computation different from crowdsourcing, social computing, data mining, and collective intelligence?
  4. Can you think of a new human computation system that the authors didn’t discuss? Classify it according to the dimensions mentioned in the paper.
  5. Do you agree with the authors’ classification system? Why/Why not?
  6. What is something new that you learned from this paper?


01/29/20 – Lulwah AlKulaib- An Affordance Based Framework for Human Computation and Human-Computer Collaboration

Summary:

The authors reviewed literature from top-ranking conferences in visual analytics, human-computer interaction, and visualization. From 1271 papers, they identified 49 representative of human-computer collaborative problem solving. In their analysis of those 49 papers, they found design patterns that depend on a set of human and machine-intelligence affordances. The authors believe these affordances form the basis of a common framework for understanding and discussing the analyzed collection, and they use them to describe the properties of two human-computer collaboration projects: reCAPTCHA and PatViz. The authors explain each case study and how it leverages human and machine affordances. They also suggest a list of underexplored affordances and scenarios in which they might be useful. The authors believe their framework will benefit the field of visual analytics as a whole, and in presenting the preliminary framework, they aspire to have laid the foundation for a more rigorous analysis of tools presented in the field.

Reflection:

The paper presents a summary of the state of research in human-computer collaboration and related fields as of 2012. The authors considered most of the advances at that time to lack a cohesive direction, which sets a negative tone in that part of the paper. They emphasized their point of view by posing three questions that they claim could not be answered systematically:

  • How do we tell if a problem would benefit from a collaborative technique?
  • How do we decide which tasks to delegate to which party, and when?
  • How does one system compare to others trying to solve the same problem?

Another point worth discussing is the authors’ answer to the second question: that researchers using the language of affordances would be steered toward matching tasks to the strengths of humans or machines instead of matching them based on their deficiencies. I’m not sure I agree; the case studies they provided were not enough to back this claim or to support its use in their discussion section.

The authors also raise a point about the importance of developing a common language to describe how much and how well affordances are being leveraged. I agree with their proposal and believe that this measure exists in other fields like AI, as they mentioned.

Discussion:

  • What are the values of having the suggested method to evaluate projects?
  • The authors argue against using crowdsourcing for problem solving. Do you agree with them? Why/Why not?
  • Are affordances sufficient for understanding crowdsourcing problems? Why/Why not?
  • What is the best way to measure human work? (other than those mentioned in the paper)
  • How do we account for individual differences in human operators? (other than those mentioned in the paper)
  • Give examples that the authors didn’t propose for the questions that they mention initially: 
    • How do we tell if a problem would benefit from a collaborative technique?
    • How do we decide which tasks to delegate to which party, and when?
    • How does one system compare to others trying to solve the same problem?
