03/04/20 – Nan LI – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary:

The main objective of this paper is to investigate the feasibility of using crowd workers to locate and assess sidewalk accessibility problems in Google Street View imagery. To this end, the authors conducted two studies examining the feasibility of finding and labeling sidewalk accessibility problems. The paper uses the results of the first study to establish that the labeling task is possible, to define what good labeling performance looks like, and to provide verified ground truth labels that can be used to assess the performance of crowd workers. The paper then evaluates annotation correctness at two levels of granularity: the image level and the pixel level. The former checks for the presence or absence of a label, while the latter examines labels more precisely, in a manner related to image segmentation work in computer vision. Finally, the paper discusses quality control mechanisms, including statistical filtering, an approach for revealing effective performance thresholds for eliminating poor-quality turkers, and a verification interface, a subjective approach to validating labels.

Reflection:

The most impressive part of this paper is the feasibility study, Study 1. This study not only investigates the feasibility of the labeling work but also provides a standard for good labeling performance and yields validated ground truth labels that can be used to evaluate the crowd workers' performance. This pre-study provides the clues, the direction, and even the evaluation metrics for the later experiment. It delivers the most valuable information for the early stage of the research at a very low cost in workload and effort. I think a common research pitfall is that we put a lot of effort into driving the project forward instead of preparing and investigating feasibility. As a result, we get stuck on problems we could have foreseen had we conducted a pre-study.

However, I don't think the pixel-level assessment is a good idea for this project. The labeling task does not require such high accuracy for the inaccessible area, and marking it at the granularity of individual pixels is overly precise. As the table of pixel-level agreement results in the paper indicates, the area overlap for both binary classification and multiclass classification is no more than 50%. Moreover, although the authors argue that even a 10-15% overlap agreement at the pixel level would be sufficient to localize problems in images, this leaves me more confused about whether the authors want an accurate evaluation or not.
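To make concrete what the pixel-level agreement is measuring, here is a minimal sketch of one common overlap measure, intersection over union, between two workers' binary label masks. This is an illustrative stand-in rather than the paper's exact agreement computation, and the masks below are made up.

```python
import numpy as np

def pixel_overlap(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union between two binary label masks.

    mask_a, mask_b: boolean arrays of the same shape, True where a worker
    marked a pixel as part of an accessibility problem.
    Returns 1.0 when both masks are empty (trivial agreement).
    """
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return 1.0 if union == 0 else intersection / union

# Toy example: two workers marking the same obstacle in a 4x4 image.
worker_1 = np.zeros((4, 4), dtype=bool)
worker_2 = np.zeros((4, 4), dtype=bool)
worker_1[1:3, 1:3] = True   # 4 pixels marked
worker_2[1:4, 1:3] = True   # 6 pixels marked, 4 of them overlapping
print(pixel_overlap(worker_1, worker_2))  # 4 / 6 ≈ 0.67
```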

Finally, considering our final project, it is worth thinking about the number of crowd workers we need for the task. We need to think about the accuracy of turkers per job. The paper makes the point that performance improves with turker count, but that these gains diminish in magnitude as group size grows. Thus, we might want to figure out the trade-off between accuracy and cost so that we have a better basis for deciding how many workers to hire.
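The diminishing returns can be illustrated with a simple independence model: if each turker labels an image correctly with some fixed probability, the accuracy of a majority vote grows with group size, but by ever smaller amounts. The sketch below assumes independent, equally accurate workers (which real turkers are not) and uses 80% only as an illustrative per-worker accuracy, roughly in line with the single-turker image-level accuracy the paper reports.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent workers,
    each correct with probability p, produces the correct label.
    Ties (possible when n is even) are counted as incorrect here."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

# Illustration: assume each turker labels an image correctly 80% of the time.
for n in (1, 3, 5, 7, 9):
    print(n, round(majority_vote_accuracy(0.8, n), 3))
# 1 0.8, 3 0.896, 5 0.942, 7 0.967, 9 0.98 -> gains shrink as the group grows
```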

Questions:

  • What do you think about the approach taken in this paper? Do you believe a pre-study is valuable? Would you apply this in your own research?
  • What do you think about the metrics the author used for evaluating labeling performance? What other metrics would you apply to assess the overlap area?
  • Have you considered how many turkers you would need to hire to meet your accuracy requirements for a task? How would you evaluate that number?

Word Count: 578


03/04/20 – Nan LI – Real-Time Captioning by Groups of Non-Experts

Summary:

In this paper, the authors focus on the main limitations of real-time captioning. They point out that captioning with high accuracy and low latency requires expensive stenographers who must be booked in advance and are trained to use specialized keyboards. The less expensive option is automatic speech recognition; however, its low accuracy and high error rate greatly degrade the user experience and cause many inconveniences for deaf users. To alleviate these problems, the authors introduce an end-to-end system called LEGION:SCRIBE, which enables multiple workers to provide simultaneous captioning in real time and combines their input into a final answer with high precision, high coverage, and low latency. The authors experimented with crowd workers and other local participants and compared the results with CART, ASR, and individual workers. The results indicate that this end-to-end system with a group of workers can outperform both individuals and ASR in coverage, precision, and latency.

Reflection:

First, I think the authors make a good point about the limitations of real-time captioning, especially the inconvenience it brings to deaf and hard-of-hearing people. Thus, the greatest contribution of this end-to-end system is that it provides access to a cheap and reliable real-time captioning channel. However, I have several concerns about it.

First, this end-to-end system requires a group of workers. Even if each person is paid a low wage, as the captioning time increases, the combined pay for all workers still becomes a significant overhead.

Second, to satisfy the coverage requirement, a high-precision, high-coverage, low-latency caption requires five or more workers working together. As mentioned in the experiment, the MTurk workers need to watch a 40-second video to understand how to use the system. Therefore, the system may not be able to recruit the required number of workers in time.

Third, since the system only combines the work of the workers, there is a coverage problem: if all of the workers miss a piece of information, the system output will be incomplete. In my experience, if one person misses part of the information, most other people usually miss it too. As the example in the paper shows, no worker typed “non-aqueous” in a clip about chemistry.

Finally, I am considering combining human correction with ASR captions. Humans are good at retaining previously mentioned knowledge, for example an abbreviation, yet they cannot type fast enough to cover all of the content. ASR, on the other hand, usually does not miss any portion of the speech, yet it makes some unreasonable mistakes. Thus, it might be a good idea to let humans correct inaccurate ASR captions instead of trying to type all of the speech content.

Question:

  • What do you think of this end-to-end system? Can you evaluate it from different perspectives, such as expense and accuracy?
  • How would you solve the problem of inadequate speech coverage?
  • What do you think of the idea of combining human and ASR work? Do you think it would be more efficient or less efficient?

Word Count: 517


03/04/20 – Lee Lisle – Combining Crowdsourcing and Google Street View to Identify Street-Level Accessibility Problems

Summary

Hara, Le, and Froehlich developed an interface that uses Google Street View to identify accessibility issues on city sidewalks. They then performed a study using three researchers and three accessibility experts (wheelchair users) to evaluate their interface. This served both as a way to assess usability issues with the interface and as a ground truth to verify the results of their second study. That study involved launching crowdworking tasks to identify accessibility problems as well as to categorize the type of each problem. Over 7,517 Mechanical Turk HITs, they found that crowdworkers could identify accessibility problems 80.6% of the time and correctly classify the problem type 78.3% of the time. Combining their approach with a majority voting scheme raised these values to 86.9% and 83.8%.

Personal Reflection

Their first step of checking whether their solution was even feasible seemed like an odd study. Their users were research-team members and experts, both of whom are theoretically more driven than a typical crowdworker. Furthermore, I felt that internal testing and piloting would be more appropriate than a soft launch like this. While they do bring up that they needed a ground truth to contextualize their second study, I initially felt that this should have been performed by experts only rather than as a complete preliminary study. However, as I read more of the paper, I felt that the comparison between the groups (experts vs. researchers) was relevant, as it highlighted how wheelchair-bound people and able-bodied people can see situations differently. They could not have collected this data on Mechanical Turk alone, as they couldn't otherwise guarantee that they were recruiting wheelchair-bound participants.

It was also good to see the human-AI collaboration highlighted in this study. Because the selections (and the subsequent images generated by those selections) are used as training data for a machine learning algorithm, the approach should lessen the need for this kind of work in the future.

Their pay level also seemed very low at 1-5 cents per image. Even assuming a selection and classification takes only 10 seconds, page loading takes only 5 seconds, and workers always get 5 cents per image, that works out to $12 an hour under ideal circumstances.

The good part of this research is that it identifies problems cheaply and quickly. It can be used to identify a large number of issues and to save time by deploying people to fix issues that are co-located in the same area, rather than deploying people to first find issues and then solve them with less potential coverage. It also addresses a public need for a highly vulnerable population, which makes the solution's impact even greater.

Lastly, it was good to see how the various levels of redundancy impacted their results. The falloff from increasing past 5 workers was harsher than I expected, and the increase in identification is likely not worth doubling the cost of these tasks.

Questions

  1. What other public needs could a Google Street View/crowdsourcing hybrid solve?
  2. What are the tradeoffs for the various stakeholders involved in solutions like this? (The people who need the fixes, the workers who typically had to identify these problems, the workers who are deployed to maintain the identified areas, and any others)
  3. Should every study measure the impact of redundancy? How might redundant workers affect your own projects?


03/04/20 – Lee Lisle – Real-Time Captioning by Groups of Non-Experts

Summary

Lasecki et al. present LEGION:SCRIBE, a novel captioning system for the deaf and hard-of-hearing population. Their implementation crowdsources multiple people per audio stream to achieve low-latency as well as highly accurate results. They then explain that the competitors are professional stenographers (who are expensive) and automatic speech recognition (ASR, which has large issues with accuracy). They go over how they evaluate SCRIBE with their Multiple Sequence Alignment (MSA) approach, which aligns the output from multiple crowdworkers to produce the best possible caption. Their approach also allows trading quality between coverage and precision, where coverage provides a more complete caption and precision attains a lower word error rate. They then conducted an experiment in which they transcribed a set of lectures using various methods, including several configurations of SCRIBE (varying the number of workers and the coverage setting) and an ASR. SCRIBE outperformed the ASR in both latency and accuracy.
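To make the idea of combining partial captions concrete, here is a toy sketch that buckets each worker's time-stamped words into fixed windows and keeps the most common word per window. This is only a rough stand-in for the paper's MSA combiner, which aligns word sequences directly; the timestamps, window size, and example sentence are all made up.

```python
from collections import Counter, defaultdict

def merge_captions(worker_streams, window=0.5):
    """Toy combiner: bucket each worker's (time_sec, word) pairs into
    fixed-width time windows and keep the most common word per window.
    Not the paper's MSA algorithm, just an illustration of the idea."""
    buckets = defaultdict(Counter)
    for stream in worker_streams:
        for t, word in stream:
            buckets[int(t / window)][word.lower()] += 1
    return " ".join(buckets[i].most_common(1)[0][0] for i in sorted(buckets))

# Three workers each catch only part of "the reaction is non aqueous".
w1 = [(0.0, "the"), (0.6, "reaction"), (1.1, "is")]
w2 = [(0.6, "reaction"), (1.2, "is"), (1.6, "non"), (2.1, "aqueous")]
w3 = [(0.1, "the"), (1.1, "is"), (2.2, "aqueous")]
print(merge_captions([w1, w2, w3]))  # "the reaction is non aqueous"
```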

Personal Reflection

This work is quite relevant to me, as my semester project is on transcribing notes for users in VR. I was struck by how quickly they were able to get captions from the crowd, but also by how many errors were still present in the finished product. In Figure 11, the WER for CART was a quarter of that of their method, which only got slightly more than half of the words correct. And in Figure 14, none of the transcriptions seem terribly acceptable, though CART was close. I wonder whether their WER was so poor because of the nature of the talks or because there were multiple speakers in each scene. I wish they had discussed how much impact having multiple speakers has on transcription services rather than giving the somewhat vague descriptions they did.
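For reference, word error rate is usually computed as a word-level edit distance normalized by the length of the reference transcript. The sketch below shows that standard definition (not necessarily the exact evaluation script used in the paper), with a made-up reference and hypothesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the usual word-level edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the reaction is non aqueous",
                      "the reaction is aqueous"))  # 0.2 (one missed word)
```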

It was interesting that they could get the transcriptions done through Mechanical Turk at a rate of $36 per hour. This is roughly a third of the cost of their professional stenographer (at $1.75 per minute, or $105 per hour). The cost savings are impressive, though the coverage could be a lot better.

Lastly, I was glad they included one of their final sections, “Leveraging Hybrid Workforces,” as it is particularly relevant to this class. They were able to increase their coverage and precision by including an ASR as one of the inputs to their MSA combiner, regardless of whether they were using one worker or ten. This indicates that there is a lot of value in human-AI collaboration in this space.

Questions

  1. If such low latency weren’t a key issue, could the captions achieve an even lower WER? Is it worth a 20-second latency? A 60-second latency? Is it worth the extra cost it might incur?
  2. Combined with our reading last week on acceptable false positives and false negatives from AI, what is an acceptable WER for this use case?
  3. Their MSA combiner showed a lot of promise as a tool for potentially different fields. What other ways could their combiner be used?
  4. Error checking is a problem in many different fields, but especially crowdsourcing as the errors can be caused in many different ways. What other ways are there to combat errors in crowdsourcing? Would you choose this way or another?


03/04/2020 – Dylan Finch – Real-Time Captioning by Groups of Non-Experts

Word count: 564

Summary of the Reading

This paper aims to improve the accessibility of audio streams by making it easier to create captions for deaf listeners. The typical solution to this problem is to hire stenographers: expensive, highly trained professionals who require specialized keyboards. In other cases, people with less training create the captions, but these captions may take longer to write, creating latency between what is said in the audio and the captions. This is not desirable, because it makes it harder for a deaf person to connect the audio with any accompanying video. This paper aims to marry cheap, easy-to-produce captions with the ability to have the captions created in real time and with little latency. The solution is to use many people who do not require specialized training. Working together, a group of crowd workers can achieve high caption coverage of the audio with a latency of only 2.9 seconds.

Reflections and Connections

I think that this paper highlights one of the coolest things that crowdsourcing can do. It can take big, complicated tasks that used to require highly trained individuals and make them accomplishable by ordinary people. This is extremely powerful. It makes all kinds of technologies and techniques much more accessible. It is hard to hire one highly trained stenographer, but it is easy to hire a few ordinary people. This is the same idea that powers Wikipedia: many people make small edits, each contributing specialized knowledge, and together they create a highly accessible and complete collection of knowledge. This same principle can and should be applied to many more fields. I would love to see what other professions could be democratized by using many ordinary people to replace one highly trained person.

This research also shows how it is possible to break up tasks that may traditionally have been thought of as atomic. Transcribing audio is a very hard task to solve using crowd workers because there are no natural discrete tasks that could be sent to them: the stream of audio is continuous and always changing. However, this paper shows that it is possible to break this activity into manageable chunks that can be accomplished by crowd workers; the researchers just needed to think outside the box. I think this kind of thinking will become increasingly important as more and more work is crowdsourced. As we learn how to solve more and more problems using crowdsourcing, the question becomes less and less whether we can solve something with the crowd and much more how we can break the problem into manageable pieces that the crowd can do. This kind of research has applications elsewhere, too, and I think it will be much more important in the future.

Questions

  1. What are some similar tasks that could be crowdsourced using a method similar to the one described in the paper?
  2. How do you think that crowdsourcing will impact the accessibility of our world? Are there other ways that crowdsourcing could make our world more accessible?
  3. Do you think there will come a time when most professions can be accomplished by crowd workers? What do you think the extent of crowd expertise will be?


03/04/2020 – Dylan Finch – Pull the Plug?

Word count: 596

Summary of the Reading

The main goal of this paper is to make image segmentation more efficient. Image segmentation, as it stands, requires humans to help with the process; there are simply some images that machines cannot segment on their own. However, there are many cases where a segmentation algorithm can do all of the work by itself. This presents a problem: we do not know when we can use an algorithm and when we have to use a human, so we have to have humans review all of the segmentations, which is highly inefficient. This paper tries to solve the problem by introducing an algorithm that decides when a human is required to segment an image. The process described in the paper involves scoring each machine-generated segmentation and then giving humans the task of reviewing the lowest-scoring images. Overall, the process was very effective and saved a lot of human effort.
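A minimal sketch of this routing idea, assuming we already have a quality score per machine segmentation and a fixed human budget, might look like the following. The image names, scores, and budget are hypothetical; in the paper the scoring function itself is learned.

```python
def route_to_humans(segmentations, human_budget):
    """Given (image_id, predicted_quality_score) pairs, send the lowest-scoring
    images to human annotators and keep the algorithm's output for the rest."""
    ranked = sorted(segmentations, key=lambda item: item[1])
    needs_human = {img for img, _ in ranked[:human_budget]}
    keeps_machine = {img for img, _ in ranked[human_budget:]}
    return needs_human, keeps_machine

# Hypothetical quality scores for four machine-segmented images.
scores = [("img1", 0.92), ("img2", 0.35), ("img3", 0.71), ("img4", 0.18)]
humans, machine = route_to_humans(scores, human_budget=2)
print(humans)   # {'img2', 'img4'}  -> sent to crowd workers
print(machine)  # {'img1', 'img3'}  -> algorithm output accepted
```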

Reflections and Connections

I think that this paper gives a great example of how humans and machines should interact, especially when it comes to humans and AIs. Oftentimes we set out in research with the goal of creating a completely automated process that removes the human and tries to build an AI, or some other kind of machine, that does all of the work. This is often a bad solution. AIs, as they currently are, are not good enough to do most complex tasks entirely by themselves. For tasks like image segmentation, this is an especially big issue: these tasks are very easy for humans and very hard for AIs. So, it is good to see researchers who are willing to use human strengths to make up for the weaknesses of machines. I think it is a good thing to have the two working together.

This paper also gives us some very important research, trying to answer the question of when we should use machines and when we should use humans. This is a very tough question, and it comes up in a lot of different fields. Humans are expensive, but machines are often imperfect. It can be very hard to decide when to use one or the other. This paper does a great job of answering this question for image segmentation, and I would love to see similar research in other fields explaining when it is best to use humans and when to use machines.

While I like this paper, I do also worry that it is simply moving the problem, rather than actually solving it. Now, instead of needing to improve a segmentation algorithm, we need to improve the scoring algorithm for the segmentations. Have we really improved the solution or have we just moved the area that now needs further improvement? 

Questions

  1. How could this kind of technology be used in other fields? How can we more efficiently use human and machine strengths together?
  2. In general, when do you think it is appropriate to create a system like this? When should we not fully rely on AI or machines?
  3. Did this paper just move the problem, or do you think that this method is better than just creating a better image segmentation algorithm? 
  4. Does creating systems like this stifle innovation on the main problem?
  5. Do you think machines will one day be good enough to segment images with no human input? How far off do you think that is?


3/4/20 – Jooyoung Whang – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

In this paper, the authors study the effectiveness of vision-to-language systems for automatically generating alt text for images and the impact of a human in the loop on this task. The authors set up four methods for generating alt text. The first is a straightforward implementation of a modern vision-to-language alt text generator. The second is a human-adjusted version of the first method. The third method is more involved: a blind or visually impaired (BVI) user chats with a non-BVI user to gain more context about an image. The final method is a generalized version of the third, where the authors analyzed the patterns of questions asked during the third method to form a structured set of pre-defined questions that a crowdsource worker can answer directly without needing a lengthy conversation. The authors conclude that current vision-to-language techniques can in fact harm context understanding for BVI users, and that simple human-in-the-loop methods significantly outperform them. They also found that the structured-questions method worked best.

This was an interesting study that implicitly pointed out the limitation of computers in understanding social context, which is a human affordance. The authors stated that the results of a vision-to-language system often confused users because the system did not get the point. This made me wonder whether this limitation can be overcome in the future.

I was also concerned about whether the authors' proposed methods are even practical. The human-in-the-loop method involving MTurk workers greatly enhanced the descriptions of Twitter images, but based on their report, it takes too long to retrieve a description. The paper reports that answering one of the structured questions takes, on average, one minute, and this excludes the time it takes for an MTurk worker to accept a HIT. The authors suggested pre-generating alt text for popular Tweets, but this does not completely solve the problem.

I was also skeptical about the way the authors performed validation with the 7 BVI users. In their validation, they simulated their third method (TweetTalk, a conversation between BVI and sighted users). However, they did not do it using their application but rather through a face-to-face conversation between the researchers and the participants. The authors claimed that they tried to replicate the environment as closely as possible, but I think there could still be flaws, since the researchers serving as the sighted users already had expert knowledge about their experiment. Also, as stated in the paper's limitations section, the validation was performed with too few participants, which may not fully capture BVI users' behaviors.

These are the questions that I had while reading this paper:

1. Do you think the authors’ proposed methods are actually practical? What could be done to make them practical if you don’t think so?

2. What do you think were the human affordances needed for the human element of this experiment other than social awareness?

3. Do you think the authors’ validation with the BVI users is sound? Also, the validation was only done for the third method. How can the validation be done for the rest of the methods?


03/04/2020 – Vikram Mohanty – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Authors: Kotaro Hara, Vicki Le, and Jon Froehlich

Summary

This paper discusses the feasibility of using AMT crowd workers to label sidewalk accessibility problems in Google Street View. The authors created ground truth datasets with the help of wheelchair users and found that Turkers reached an accuracy of 81%. The paper also discusses some quality control and improvement methods, which were shown to be effective, i.e., they improved the accuracy to 93%.

Reflection

This paper reminded me of Jeff Bigham’s quote – “Discovery of important problems, mapping them onto computationally tractable solutions, collecting meaningful datasets, and designing interactions that make sense to people is where HCI and its inherent methodologies shine.” It’s a great example of two important things mentioned in the quote: a) discovery of important problems, and b) collecting meaningful datasets. The paper’s contribution section mentions that the collected datasets will be used for building computer vision algorithms, and the paper’s workflow involves the potential end-users (wheelchair users) early in the process. Further, the paper attempts to use Turkers to generate datasets that are comparable in quality to those produced by the wheelchair users, essentially setting a high quality standard for potential AI training datasets. This is a desirable approach for training datasets, and it can help prevent the kinds of problems in popular datasets outlined here: https://www.excavating.ai/

The paper also proposed two generalizable methods for improving data quality from Turkers. Filtering out low-quality workers during data collection by seeding in gold standard data may require designing modular workflows, but the time investment may well be worth it. 
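As a rough illustration of the gold-standard seeding idea, the sketch below drops workers whose accuracy on seeded gold items falls below a threshold. The worker IDs, label names, and the 0.75 cutoff are hypothetical, not values from the paper.

```python
def filter_workers(answers, gold, min_accuracy=0.75):
    """Drop workers whose accuracy on seeded gold-standard items is too low.

    answers: worker_id -> {item_id: label} for all items the worker did.
    gold:    item_id -> correct label, for the seeded gold items only.
    """
    kept = {}
    for worker, labels in answers.items():
        graded = [labels[item] == truth for item, truth in gold.items() if item in labels]
        accuracy = sum(graded) / len(graded) if graded else 0.0
        if accuracy >= min_accuracy:
            kept[worker] = labels
    return kept

gold = {"gsv_12": "curb_ramp_missing", "gsv_34": "obstacle"}
answers = {
    "worker_a": {"gsv_12": "curb_ramp_missing", "gsv_34": "obstacle", "gsv_99": "no_problem"},
    "worker_b": {"gsv_12": "no_problem", "gsv_34": "surface_problem", "gsv_99": "obstacle"},
}
print(list(filter_workers(answers, gold)))  # ['worker_a']
```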

It’s great to see how this work evolved to form the basis for Project Sidewalk, a live project where volunteers can map accessibility issues in their neighborhoods.

Questions

  1. What’s your usual process for gathering datasets? How is it different from this paper’s approach? Would you be willing to involve potential end-users in the process? 
  2. What would you do to ensure quality control in your AMT tasks? 
  3. Do you think collecting more fine-grained data for training CV algorithms will come at the cost of the interface no longer being simple enough for Turkers?


03/04/2020- Bipasha Banerjee – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

The paper by Hara et al. attempts to address the problem of sidewalk accessibility by using crowd workers to label the data. The authors make several contributions beyond simply having crowd workers label images. They conduct two studies: a feasibility study and an online crowdsourcing study using AMT. The first study aims to find out how practical it is to label sidewalks using reliable workers (experts); it also gives an idea of baseline performance and provides validated ground truth data. The second study aims to establish the feasibility of using Amazon Mechanical Turk workers for this task. They evaluate accuracy at the image level as well as the pixel level. The authors also conducted a thorough background study on current sidewalk accessibility issues, current audit methods, and prior work on crowdsourcing and image labeling. They succeeded in showing that untrained crowd workers can correctly identify and label sidewalk accessibility issues in Google Street View imagery.

Reflection

Combining crowdsourcing and Google Street View to identify street-level accessibility issues is essential and useful for people. The paper was an interesting read, and the authors described the system well. In the video [1], the authors show the instructions for the workers. The video gave a fascinating insight into how the task was designed, explaining every labeling step in detail.

The paper mentions accessibility in general, but the research is restricted to wheelchair users. This works for the first study, as these users are able to label the obstacles correctly, which gives us ground truth data for the next study and establishes the feasibility of using crowd workers to identify and label accessibility problems effectively. However, accessibility problems on sidewalks are also faced by other groups, such as people with reduced vision. I am curious to see how the experiments would differ if the user group and its needs changed.

The experiments are based on Google Street View, which is not always up to date. There are apps that give people real-time updates on traffic while driving, such as Waze [2]. I was wondering whether it would be beneficial if Google Maps or another app offered dynamic updates for sidewalks. It would not only help pedestrians but also help the authorities determine which sidewalks are frequently used and which issues people face most often. The paper is a bit old, and newer technology would surely help users. The paper [3] by the same author represents a major advancement in collecting sidewalk accessibility data and is a good read based on more recent technology.

The paper mentions that active feedback to crowd workers would help improve labeling. I think dynamic, real-time feedback would be immensely helpful. I do understand that this is challenging to implement with crowd workers, but an internal study could be conducted: two or more people would work simultaneously, with one labeling and the rest giving feedback, or in some other combination.

Questions

  1. Sidewalk accessibility has been discussed for people with accessibility problems. They have considered people in wheelchairs for their studies. I do understand that such people would be needed for study 1, where labeling is a factor. However, how does the idea extend to people with other accessibility issues like reduced vision?
  2. This paper was published in 2013. The authors mention in the conclusion that improvements in GSV and computer vision will help overall. Has any further study been conducted? How much modification of the current system is needed to accommodate advancements in GSV and computer vision in general?
  3. Can dynamic feedback to workers be implemented? 

References 

[1] https://www.youtube.com/watch?v=aD1bx_SikGo

[2] https://www.waze.com/waze

[3] http://kotarohara.com/assets/Papers/Saha_ProjectSidewalkAWebBasedCrowdsourcingToolForCollectingSidewalkAccessibilityDataAtScale_CHI2019.pdf


03/04/2020 – Sushmethaa Muhundan – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

The popularity of social media has increased exponentially over the past few decades, and with it comes a wave of image content flooding social media. Amidst this growing popularity, people who are blind or visually impaired (BIV) often find it extremely difficult to understand such content. Although existing solutions offer limited capabilities to caption images and provide alternative text, these are often insufficient and, when inaccurate, have a negative impact on the experience of BIV users. This paper aims to provide a better platform to improve the experience of BIV users by combining crowd input with existing automated captioning approaches. As part of the experiments, numerous workflows with varying degrees of human and automated-system involvement were designed and evaluated. The four frameworks introduced in this study are a fully automated captioning workflow, a human-corrected captioning workflow, a conversational assistant workflow, and a structured Q&A workflow. It was observed that although the workflows involving humans in the loop were time-consuming, they increased user satisfaction by providing accurate descriptions of the images.

Throughout the paper, I really liked the focus on improving the experience of blind or visually impaired users on social media and on ensuring that accurate image descriptions are provided so that BIV users understand the context. The paper explores innovative means of leveraging humans in the loop to solve this pervasive issue.

Also, the particular platform being targeted here is social media, which comes with its own challenges. Social media is a setting where the context and emotions of an image are as important as the image description itself in giving BIV users enough information to understand the post. Another aspect I found interesting was the focus on scalability, which is extremely important in a social media setting.

I found the TweetTalk conversational workflow and the structured Q&A workflow interesting, as they provide a mixed approach, involving humans in the loop only when necessary. The intent of the conversational workflow is to understand the aspects that make a caption valuable to a BIV user. I feel this fundamental understanding is essential for building further systems that ensure user satisfaction.

It was good to see that the sample tweets were chosen from broad topic areas representing the various interests reported by blind users. An interesting insight from the study was that users preferred no caption over an inaccurate one, to avoid the cost of recovering from a misinterpretation based on an inaccurate caption.

  1. Despite being validated by 7 BIV people, the study largely involved simulating a BIV user’s behavior. Do the observations hold good for scenarios with actual BIV users or is the problem not captured via these simulations?
  2. Apart from the two new workflows used in this paper, what are some other techniques that can be used to improve the captioning of the images on social media that captures the essence of the post?
  3. Besides social media, what other applications or platforms have similar drawbacks from the perspective of BIV users? Can the workflows that were introduced in this paper be used to solve those problems as well?
