03/04/20 – Nan LI – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary:

The main objective of this paper is to investigate the feasibility of using crowd workers to locate and assess sidewalk accessibility problems in Google Street View imagery. To achieve this goal, the authors conducted two studies examining the feasibility of finding and labeling sidewalk accessibility problems. The paper uses the results of the first study to demonstrate that the labeling task is feasible, to define what good labeling performance looks like, and to provide verified ground truth labels that can be used to assess the performance of crowd workers. The paper then evaluates annotation correctness at two discrete levels of granularity: image level and pixel level. The former checks for the absence or presence of a label, and the latter examines labels more precisely, in a way that relates to image segmentation work in computer vision. Finally, the paper discusses quality control mechanisms, which include statistical filtering, an approach for revealing effective performance thresholds for eliminating poor-quality turkers, and a verification interface, which is a subjective approach to validating labels.

Reflection:

The most impressive point in this paper is the feasibility study, Study 1. This study not only investigates the feasibility of the labeling work but also provides a standard for good labeling performance and produces the validated ground truth labels that can be used to evaluate the crowd workers’ performance. This pre-study provides the clues, directions, and even the evaluation metrics for the later experiment. It provides the most valuable information for the early stage of the research with a very low workload and effort. I think it is a common research problem that we put a lot of effort into driving the project forward instead of preparing for it and investigating its feasibility. As a result, we get stuck on problems that we could have foreseen if we had conducted a pre-study.

However, I don’t think the pixel-level assessment is a good idea for this project. The labeling task does not require such high accuracy for the inaccessible area, and marking the inaccessible area in units of pixels is too precise. As the paper’s table of pixel-level agreement results indicates, the area overlap for both binary and multiclass classification is no more than 50%. Also, the authors state that even a 10-15% overlap agreement at the pixel level would be sufficient to localize problems in images, which makes me more confused about whether the authors want to make an accurate evaluation or not.

Finally, considering our final project, it is worth thinking about the number of crowd workers that we need for the task. We need to think about the accuracy of turkers per job. The paper makes the point that performance improves with turker count, but these gains diminish in magnitude as group size grows. Thus, we might want to figure out the trade-off between accuracy and cost so that we have a better idea of how many workers to hire.
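
The diminishing returns from adding turkers can be illustrated with a simple model. The sketch below assumes, purely for illustration, that every turker labels an image correctly with the same independent probability and that labels are combined by majority vote. The independence assumption, the function name `majority_accuracy`, and the value 0.8 are hypothetical and do not come from the paper.

```python
from math import comb


def majority_accuracy(n_workers: int, p_correct: float) -> float:
    """Probability that a strict majority of n workers is correct.

    Assumes an odd n_workers (so there are no ties) and identical,
    independent workers. Both are simplifying assumptions for this sketch.
    """
    if n_workers % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    needed = n_workers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_workers, k) * p_correct**k * (1 - p_correct) ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


if __name__ == "__main__":
    previous = None
    for n in (1, 3, 5, 7, 9):
        acc = majority_accuracy(n, 0.8)  # 0.8 is an illustrative value only
        gain = "" if previous is None else f" (gain {acc - previous:+.3f})"
        print(f"{n} workers: {acc:.3f}{gain}")
        previous = acc
```

Under these assumptions each additional pair of workers adds less accuracy than the previous pair, while the cost grows linearly with the number of workers, which is the trade-off to weigh when deciding how many to hire.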

Questions:

  • What do you think about the approach for this paper? Do you believe a pre-study is valuable? Will you apply this in your research?
  • What do you think about the metrics the authors used for evaluating the labeling performance? What other metrics would you apply in assessing the rate of overlap area?
  • Have you ever considered how many turkers you need to hire to meet your accuracy needs for the task? How do you estimate this number?

Word Count: 578


03/04/20 – Nan LI – Real-Time Captioning by Groups of Non-Experts

Summary:

In this paper, the authors focus on the main limitations of real-time captioning. They make the point that captioning with high accuracy and low latency requires expensive stenographers who must be booked in advance and who are trained to use specialized keyboards. The less expensive option is automatic speech recognition (ASR). However, its low accuracy and high error rate greatly degrade the user experience and cause many inconveniences for deaf people. To alleviate these problems, the authors introduce an end-to-end system called LEGION: SCRIBE, which enables multiple workers to provide simultaneous captioning in real time, and the system combines their input into a final answer with high precision, high coverage, and low latency. The authors experimented with crowd workers and other local participants and compared the results with CART, ASR, and individual workers. The results indicate that this end-to-end system with a group of workers can outperform both individuals and ASR in coverage, precision, and latency.

Reflection:

First, I think the authors make a good point about the limitations of real-time captioning, especially the inconvenience they bring to deaf and hard of hearing people. Thus, the greatest contribution of this end-to-end system is that it makes a cheap and reliable real-time captioning channel accessible. However, I have several concerns about it.

First, this end-to-end system requires a group of workers. Even if each person is paid a low wage, the total pay for all workers is still a significant overhead as the captioning time increases.

Second, to satisfy the coverage requirement, a high-precision, high-coverage, low-latency caption requires at least five workers working together. As mentioned in the experiment, the MTurk workers need to watch a 40-second video to understand how to use the system. Therefore, the system may not be able to find the required number of workers in time.

Third, the system only combines the work of all the workers. Thus, there is a coverage problem: if all of the workers miss a part of the information, the system output will be incomplete. Based on my experience, if one person misses part of the information, usually most people miss it as well. In the example presented in the paper, no worker typed “non-aqueous,” which was used in a clip about chemistry.

Finally, I am considering combining human correction with ASR captions. Humans have the strength of remembering previously mentioned knowledge, for example, an abbreviation, yet they cannot type fast enough to cover all the content. ASR, in contrast, usually does not miss any portion of the speech, yet it makes some unreasonable mistakes. Thus, it might be a good idea to let humans correct inaccurate ASR captions instead of trying to type all of the speech.

Question:

  • What do you think of this end-to-end system? Can you evaluate it from different perspectives, such as expense, accuracy?
  • How would you solve the problem of inadequate speech coverage?
  • What do you think of the idea of combining the work of humans and ASR? Do you think it will be more efficient or less efficient?

Word Count: 517


03/04/20 – Lee Lisle – Real-Time Captioning by Groups of Non-Experts

Summary

            Lasecki et al. present a novel captioning system for the deaf and hard of hearing population group entitled LEGION:SCRIBE. Their implementation involves crowdsourcing multiple people per audio stream to achieve low-latency as well as highly accurate results. They then detail that the competitors for this are professional stenographers (but they are expensive) and automatic speech recognition (ASR, which has large issues with accuracy). They then go over how they intend on evaluating SCRIBE, with their Multiple Sequence Alignment (MSA) approach that aligns the output from multiple crowdworkers together to get the best possible caption. Their approach also allows for changing the quality to improve coverage or precision, where coverage will provide a more complete caption and precision attains a lower word error rate. They then conducted an experiment where they transcribed a set of lectures using various methods including various types of SCRIBE (varying number of workers and coverage) and an ASR. SCRIBE outperformed the ASR in both latency and accuracy.

Personal Reflection

This work is pretty relevant to me as my semester project is on transcribing notes for users in VR. I was struck by how quickly they were able to get captions from the crowd, but also by how many errors were still present in the finished product. In Figure 11, the WER for CART was a quarter of that of their method, which got only slightly more than half of the words correct. And in Figure 14, none of the transcriptions seem terribly acceptable, though CART was close. I wonder if their WER was so poor due to the nature of the talks or because there were multiple speakers in each scene. I wish that they had discussed how much impact multiple speakers have on transcription services rather than giving the somewhat vague descriptions they had.
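
For readers less familiar with the metric, word error rate (WER) is commonly defined as the word-level edit distance (substitutions, deletions, and insertions) between a caption and a reference transcript, divided by the number of words in the reference. The sketch below implements that common definition. Whether the paper computes WER in exactly this way is not confirmed here, and the function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substituted word and one dropped word.
    print(word_error_rate("a b c d", "a x c"))
```

Note that WER can exceed 1.0 when the caption contains many inserted words, so it is not a simple "fraction of words wrong".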

It was interesting that they could get the transcriptions done through Mechanical Turk at the rate of $36 per hour. This is roughly 1/3 of their professional stenographer (at $1.75 per minute or $105 per hour). The cost savings are impressive, though the coverage could be a lot better.
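
The rate comparison above can be checked with a few lines of arithmetic. This sketch uses only the two figures quoted in this post ($36 per hour for the crowd and $1.75 per minute for the stenographer); the variable and function names are illustrative.

```python
STENOGRAPHER_USD_PER_MINUTE = 1.75  # figure quoted in this post
CROWD_USD_PER_HOUR = 36.0           # figure quoted in this post


def per_hour(usd_per_minute: float) -> float:
    """Convert a per-minute rate to a per-hour rate."""
    return usd_per_minute * 60


stenographer_usd_per_hour = per_hour(STENOGRAPHER_USD_PER_MINUTE)
cost_ratio = CROWD_USD_PER_HOUR / stenographer_usd_per_hour

print(f"stenographer: ${stenographer_usd_per_hour:.2f} per hour")
print(f"crowd cost as a fraction of stenographer cost: {cost_ratio:.2f}")
```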

Lastly, I was glad they included one of their final sections, “Leveraging Hybrid Workforces,” as it is particularly relevant to this class. They were able to increase their coverage and precision by including an ASR as one of the inputs into their MSA combiner, regardless if they were using one worker or ten. This indicates that there is a lot of value in human-AI collaboration in this space.

Questions

  1. If such low-latency wasn’t a key issue, could the captions get an even lower WER? Is it worth a 20 second latency? A 60 second latency? Is it worth the extra cost it might incur?
  2. Combined with our reading last week on acceptable false positives and false negatives from AI, what is an acceptable WER for this use case?
  3. Their MSA combiner showed a lot of promise as a tool for potentially different fields. What other ways could their combiner be used?
  4. Error checking is a problem in many different fields, but especially crowdsourcing as the errors can be caused in many different ways. What other ways are there to combat errors in crowdsourcing? Would you choose this way or another?


03/04/20 – Lee Lisle – Combining Crowdsourcing and Google Street View to Identify Street-Level Accessibility Problems

Summary

Hara, Le, and Froehlich developed an interface that uses Google Street View to identify accessibility issues in city sidewalks. They then performed a study using three researchers and three accessibility experts (wheelchair users) to evaluate their interface. This served both as a way to assess usability issues with their interface and as a ground truth to verify the results of their second study. That study involved launching crowdworking tasks to identify accessibility problems as well as to categorize what type each problem is. Over 7,517 Mechanical Turk HITs, they found that crowdworkers could identify accessibility problems 80.6% of the time and could correctly classify the problem type 78.3% of the time. Combining their approach with a majority voting scheme, they raised these values to 86.9% and 83.8%.

Personal Reflection

Their first step to see if their solution was even feasible seemed like an odd study. Their users were research members and experts, both of which are theoretically more driven than a typical crowdworker. Furthermore, I felt like internal testing and piloting would be more appropriate than a soft-launch like this. While they do bring up that they needed a ground truth to contextualize their second study, I initially felt that this should then be performed by only experts and not as a complete preliminary study. However, as I read more of the paper, I felt that the comparison between the groups (experts vs. researchers) was relevant as it highlighted how wheelchair bound people and able-bodied people can see situations differently. They could not have collected this data on Mechanical Turk alone as they couldn’t guarantee that they were recruiting wheelchair bound participants otherwise.

It was also good to see the human-AI collaboration highlighted in this study. Since they’re using the selections (and the images subsequently generated from those selections) as training data for a machine learning algorithm, this should lessen the need for future work.

Their pay level also seemed very low at 1-5 cents per image. Even assuming a selection and classification takes only 10 seconds, their total page loading only takes 5 seconds, and they always get 5 cents per image, that’s $12 an hour for ideal circumstances.
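
The back-of-the-envelope wage estimate above can be reproduced as follows. The 10-second labeling time and 5-second page load are the assumptions stated in this post, not measurements from the paper, and `hourly_wage` is an illustrative helper name.

```python
def hourly_wage(pay_per_image_usd: float, seconds_per_image: float) -> float:
    """Hourly earnings if every image takes the same time and pays the same."""
    images_per_hour = 3600 / seconds_per_image
    return images_per_hour * pay_per_image_usd


SECONDS_PER_IMAGE = 10 + 5  # assumed labeling time plus assumed page load

best_case = hourly_wage(0.05, SECONDS_PER_IMAGE)   # always paid 5 cents
worst_case = hourly_wage(0.01, SECONDS_PER_IMAGE)  # always paid 1 cent

print(f"best case:  ${best_case:.2f} per hour")
print(f"worst case: ${worst_case:.2f} per hour")
```

At the low end of the 1-5 cent range, the same assumptions give a fifth of the best-case rate, which supports the concern about the pay level.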

The good part of this research is that it cheaply identifies problems quickly. This can be used to identify a large amount of issues and save time in deploying people to fix issues that are co-located in the same area rather than deploying people to find issues and then solve them with lesser potential coverage. It also solves a public need for a highly vulnerable population which makes their solution’s impact even better.

Lastly, it was good to see how the various levels of redundancy impacted their results. The falloff from increasing past 5 workers was harsher than I expected, and the increase in identification is likely not worth doubling the cost of these tasks.

Questions

  1. What other public needs could a Google Street View/crowdsourcing hybrid solve?
  2. What are the tradeoffs for the various stakeholders involved in solutions like this? (The people who need the fixes, the workers who typically had to identify these problems, the workers who are deployed to maintain the identified areas, and any others)
  3. Should every study measure the impact of redundancy? How might redundant workers affect your own projects?


03/04/2020 – Vikram Mohanty – Combining crowdsourcing and google street view to identify street-level accessibility problems

Authors: Kotaro Hara, Vicki Le, and Jon Froehlich

Summary

This paper discusses the feasibility of using AMT crowd workers to label sidewalk accessibility problems in Google Street View. The authors created ground truth datasets with the help of wheelchair users and found that Turkers reached an accuracy of 81%. The paper also discusses some quality control and improvement methods, which were shown to be effective, i.e., they improved the accuracy to 93%.

Reflection

This paper reminded me of Jeff Bigham’s quote – “Discovery of important problems, mapping them onto computationally tractable solutions, collecting meaningful datasets, and designing interactions that make sense to people is where HCI and its inherent methodologies shine.” It’s a great example for two important things mentioned in the quote : a) discovery of important problems, and b) collecting meaningful datasets. The paper’s contribution mentions that the datasets collected will be used for building computer vision algorithms, and this paper’s workflow involves the potential end-users (wheelchair users) early on in the process. Further, the paper attempts to use Turkers to generate datasets that are comparable in quality to that of the wheelchair users, essentially setting a high quality standard for generating potential AI datasets. This is a desirable approach for training datasets, which can potentially help prevent problems in popular datasets as outlined here: https://www.excavating.ai/

The paper also proposed two generalizable methods for improving data quality from Turkers. Filtering out low-quality workers during data collection by seeding in gold standard data may require designing modular workflows, but the time investment may well be worth it. 
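
As a rough illustration of gold-standard seeding, the sketch below scores each worker only on items whose true label is already known and keeps workers above a threshold. The data layout, the function name `filter_workers`, the label names, and the 0.7 threshold are all hypothetical; the paper's actual filtering procedure may differ.

```python
def filter_workers(answers, gold, min_accuracy=0.7):
    """Keep workers whose accuracy on gold-standard items meets a threshold.

    answers: {worker_id: {item_id: label}} for all submitted labels
    gold:    {item_id: label} for the seeded items with known answers
    Returns {worker_id: accuracy_on_gold_items} for the workers kept.
    """
    kept = {}
    for worker, labels in answers.items():
        scored = [item for item in labels if item in gold]
        if not scored:
            continue  # worker has seen no gold items yet, so cannot be judged
        correct = sum(labels[item] == gold[item] for item in scored)
        accuracy = correct / len(scored)
        if accuracy >= min_accuracy:
            kept[worker] = accuracy
    return kept


if __name__ == "__main__":
    # Made-up example data: g1 and g2 are gold items, x9 is a regular item.
    gold = {"g1": "problem", "g2": "no_problem"}
    answers = {
        "w1": {"g1": "problem", "g2": "no_problem", "x9": "problem"},
        "w2": {"g1": "no_problem", "g2": "no_problem"},
        "w3": {"x9": "problem"},
    }
    print(filter_workers(answers, gold))
```

Running the filter during data collection, rather than afterwards, is what requires the modular workflow mentioned above: the task has to be able to stop assigning work to a worker once their gold-item accuracy drops.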

It’s great to see how this work evolved to now form the basis for Project Sidewalk, a live project where volunteers can map accessibility areas in the neighborhood.

Questions

  1. What’s your usual process for gathering datasets? How is it different from this paper’s approach? Would you be willing to involve potential end-users in the process? 
  2. What would you do to ensure quality control in your AMT tasks? 
  3. Do you think collecting more fine-grained data for training CV algorithms will come at a trade-off for the interface not being simple enough for Turkers?


03/04/2020- Bipasha Banerjee – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

The paper by Hara et al. attempts to address the problem of sidewalk accessibility by using crowd workers to label the data. The authors made several contributions beyond having crowd workers label images. They conducted two studies: a feasibility study and an online crowdsourcing study using AMT. The first study aims to find out how practical it is to label sidewalks using reliable crowd workers (experts). This study also gives an idea of the baseline performance and provides validated ground truth data. The second study aims to find out the feasibility of using Amazon Mechanical Turk workers for this task. They evaluated accuracy at the image level as well as the pixel level. The authors conducted a thorough background study of current sidewalk accessibility issues, current audit methods, and crowdsourcing and image labeling. They were successful in showing that untrained crowd workers could identify and label sidewalk accessibility issues correctly in Google Street View imagery.

Reflection

Combining crowdsourcing and Google Street View to identify street-level accessibility problems is essential and useful for people. The paper was an interesting read, and the authors described the system well. In the video [1], the authors show the instructions for the workers. The video gave a fascinating insight into how the task was designed for the workers, explaining every labeling task in detail.

The paper mentions accessibility, but the authors restricted their research to wheelchair users. This works for the first study, as those users are able to label the obstacles correctly, and this gives us the ground truth data for the next study as well as establishing the feasibility of using crowd workers to identify and label accessibility problems effectively. However, accessibility problems on sidewalks are also faced by other groups, such as people with reduced vision. I am curious to see how the experiments would differ if the user group and the need changed.

The experiments are based on Google Street View, which is not always known to be up to date. There are apps, such as Waze [2], that help people get real-time updates on traffic while driving. It would be beneficial if Google Maps or another app included dynamic updates for sidewalks. This would not only help people but also help the authorities determine which sidewalks are frequently used and which issues people most commonly face. The paper is a bit old, but newer technology would surely help users. The paper [3] by the same author is a massive advancement in collecting sidewalk accessibility data and is a good read based on the latest technology.

The paper mentions that active feedback to crowd workers would help improve labeling tasks. I think that dynamic, real-time feedback would be immensely helpful. I do understand that it is challenging to implement when using crowd workers, but an internal study could be conducted. For this, two or more people would need to work simultaneously, where one labels and the rest give feedback, or in some other combination.

Questions

  1. Sidewalk accessibility has been discussed for people with accessibility problems. They have considered people in wheelchairs for their studies. I do understand that such people would be needed for study 1, where labeling is a factor. However, how does the idea extend to people with other accessibility issues like reduced vision?
  2. This paper was published in 2013. The authors do mention in the conclusion section that improvements in GSV and computer vision will help overall. Has any further study been conducted? How much modification of the current system is needed to accommodate the advancement in GSV and computer vision in general? 
  3. Can dynamic feedback to workers be implemented? 

References 

[1] https://www.youtube.com/watch?v=aD1bx_SikGo

[2] https://www.waze.com/waze

[3] http://kotarohara.com/assets/Papers/Saha_ProjectSidewalkAWebBasedCrowdsourcingToolForCollectingSidewalkAccessibilityDataAtScale_CHI2019.pdf


03/04/2020 – Sushmethaa Muhundan – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

The popularity of social media has increased exponentially over the past few decades, and with this comes a wave of image content that is flooding social media. Amidst this growing popularity, people who are blind or visually impaired (BIV) often find it extremely difficult to understand such content. Although existing solutions offer limited capabilities to caption images and provide alternative text, these are often insufficient and have a negative impact on the experience of BIV users if inaccurate. This paper aims to provide a better platform to improve the experience of BIV users by combining crowd input with existing automated captioning approaches. As part of the experiments, numerous workflows with varying degrees of human and automated system involvement were designed and evaluated. The four workflows introduced as part of this study are a fully-automated captioning workflow, a human-corrected captioning workflow, a conversational assistant workflow, and a structured Q&A workflow. It was observed that though the workflows involving humans in the loop were time-consuming, they increased the user’s satisfaction by providing accurate descriptions of the images.

Throughout the paper, I really liked the focus on improving the experience of blind or visually impaired users while using social media and ensuring that accurate image description is provided so that the BIV users understand the context. The paper explores innovative means of leveraging humans in the loop to solve this pervasive issue.

Also, the particular platform being targeted here is social media which comes with its own challenges. Social media is a setting where the context and emotions of the images are as important as the image description itself to provide the BIV users sufficient information to understand the post. Another aspect that I found interesting was the focus on scalability which is extremely important in a social media setting.

I found the concepts of the TweetTalk conversational workflow and the Structured Q&A workflow interesting, as they provided a mixed approach by involving humans in the loop whenever necessary. The intent of the conversational workflow is to understand the aspects that make a caption valuable to a BIV user. I felt that this fundamental understanding is essential for building further systems that ensure user satisfaction.

It was good to see that the sample tweets were chosen based on broad topic areas that represented the various interests reported by blind users. An interesting insight that came out of the study was that having no caption was preferred to having an inaccurate caption, since this avoids the cost of recovering from a misinterpretation based on an inaccurate caption.

  1. Despite being validated by 7 BIV people, the study largely involved simulating a BIV user’s behavior. Do the observations hold good for scenarios with actual BIV users or is the problem not captured via these simulations?
  2. Apart from the two new workflows used in this paper, what are some other techniques that can be used to improve the captioning of the images on social media that captures the essence of the post?
  3. Besides social media, what other applications or platforms have similar drawbacks from the perspective of BIV users? Can the workflows that were introduced in this paper be used to solve those problems as well?


03/04/2020- Bipasha Banerjee – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary

The paper by Gurari et al. discusses the segmentation of images, and specifically when segmentation should be done by humans and when a machine-only approach is applicable. The work described in this paper is interdisciplinary, involving computer vision and human computation. The authors considered both fine-grained and coarse-grained segmentation approaches to determine where the human or the machine performs better. The PTP framework decides whether to pull the plug on humans or machines. The framework aims to predict whether the labeling of an image should come from humans or machines, as well as the quality of the labeled image. Their prediction framework is a regression model that captures the segmentation quality. The training data was populated with masks to reflect the quality of the segmentation. The three algorithms used are the Hough transform with circles, Otsu thresholding, and adaptive thresholding. For labels, the Jaccard index was used to indicate the quality of each instance. Nine features derived from the binary segmentation mask were proposed to catch failures. The final finding was that a mixed approach performed better than relying completely on humans or computers. 
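
The Jaccard index (intersection over union) scores a segmentation by dividing the number of pixels that are foreground in both masks by the number that are foreground in either mask. The sketch below computes it for two small binary masks stored as nested lists. The helper name `jaccard_index` and the example masks are made up for illustration and are not taken from the paper's code.

```python
def jaccard_index(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks.

    Masks are lists of rows of 0/1 values. Two empty masks are treated
    as a perfect match (1.0), which is a convention chosen for this sketch.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        if len(row_a) != len(row_b):
            raise ValueError("mask rows must have the same length")
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    return intersection / union if union else 1.0


if __name__ == "__main__":
    predicted = [[0, 1, 1],
                 [0, 1, 1],
                 [0, 0, 0]]
    truth = [[0, 0, 1],
             [0, 1, 1],
             [0, 1, 0]]
    print(jaccard_index(predicted, truth))  # 3 shared pixels out of 5 in total
```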

Reflection 

The use of machines vs. humans is a complex debate. Leveraging both machine and human capabilities is necessary for efficiency and dealing with “big data.” The paper aims to find when to use computers to create coarse-grained segments and when to replace with humans for fine-grained data. I liked that the authors published the code. This helps in the advancement of research and reproducibility.

The authors used three datasets, but all of them are image-based. In my opinion, identifying bounding boxes in images is a relatively simple task. I work with text, and I have observed that segmenting large amounts of text is not simple. Most of the available tools fail to segment long documents like ETDs effectively. Nonetheless, segmentation is an important task, and I am intrigued to see how this work can be extended to text. 

Using crowd workers can be tricky. Although Amazon Mechanical Turk allows requesters to specify the experience, quality, etc. of workers, the time taken by a worker can vary depending on various factors. Would humans familiar with the dataset or the domain annotate faster? This needs to be thought through well, in my opinion, especially when we are trying to compete against machines. Machines are faster and good at handling vast amounts of data, whereas humans are good at accuracy. This paper highlights the old problem of accuracy vs. speed.

Questions

  1. The segmentation has been done on datasets with images. How does this extend to text? 
  2. Would experts on the topic or people familiar with databases require less time to annotate?
  3. Although three datasets have been used, I wonder if the domain matters? Would complex images affect the accuracy of machines?


03/04/2020- Myles Frantz – Real-time captioning by groups of non-experts

Summation

Machine learning is at the forefront of most technologies, though it is still highly inaccurate; for example, YouTube’s auto-generated captions of recorded videos still show the infancy of the technology. To improve on this, the team at Rochester created a hybrid way to combine multiple crowd workers’ efforts in order to create captions more accurately and in a more timely manner. This methodology was set up in order to verify previous machine learning algorithms or to generate captions themselves. Throughout the experiment, the tie-breaker is a majority vote. The Scribe system is comparable to other captioning systems in precision and Word Error Rate, though at a lower cost.

Response

I could see how combining the two aspects, crowd workers and the initial baseline, could create a good and accurate process for generating captions. Using crowd workers to assess and verify the baseline caption generation ensures the quality of the captions generated and has the potential to improve the machine learning algorithm. Furthermore, more workers can be given jobs and the captioning system could ultimately improve, which benefits both the jobs available and the core machine learning algorithm itself.

Questions

  • I am not currently experienced in this specific field, but even disregarding the 2012 publication date, this combination of crowd workers verifying auto-generated captions does not seem particularly novel. Though their survey of the latest and greatest in the field didn’t include crowd workers in any capacity, this may have been due to their limited scope. In your opinion, does this research stand up to some of the more recent papers on auto-captioning, or is it just a product of its time?
  • Potentially a problem within the crowd working community: their techniques use a majority vote to confirm which words accurately represent the phrase. Though there may be some statistics for ensuring the mechanical turkers have sufficient experience and can be relied on, this area may be vulnerable to malicious actors outnumbering the non-malicious actors. Based on the phrases being interpreted and explicitly written, do you think a scenario similar to the Mountain Dew naming campaign (Dub The Dew – https://www.huffpost.com/entry/4chan-mountain-dew_n_1773076), in which a group of malicious actors overloaded the names, could happen to this type of system?
  • With this technology, the raw audio of a speech or some event would be fed directly to the mechanical turkers working on the Scribe program. Depending on the environment where the speech was given or the quality of the microphone, not even a majority of users may be able to hear the correct words (potentially regardless of the volume of the speech). Would there be a potential future for combining this kind of technology with some sort of machine learning algorithm that isolates and removes the white noise or smaller conversations around the main speakers of the event?


03/04/2020- Myles Frantz – Combining crowdsourcing and Google Street View to identify street-level accessibility problems

Summation

Taxes in the US are a very divisive topic, and unfortunately, infrastructure such as road maintenance takes the impact. Due to the lack of resources generally allocated, states typically only allocate the resources necessary to fix problems, while generally leaving out accessibility options. This team from the University of Maryland has taken it upon themselves to prototype the identification of these missing accessibility options via crowdsourcing. They developed a system that used information from Google Street View and in which users could identify problems with the road. Subsequent users could also confirm or deny the conclusions of previous users. They ran the experiment using 229 images manually collected through Google Street View, once with 3 wheelchair users and then with 185 mechanical turkers. They were able to achieve an accuracy of at least 78% compared to the ground truth. Trimming the lower-ranking turkers raised the accuracy by about 6% at the cost of filtering out 52% of the turkers.

Response

I can appreciate this approach since I believe (as was also stated in the paper) that a manual effort to identify the accessibility problems would cost a lot of money and time. Both of those are typically sticking points in government contracts. Though governments may not be ready to open this kind of work to crowd workers, the accuracy makes a stronger argument for it. The study also showed that dropping lower-performing workers to get better results was ultimately fruitful, which may align with the type of budget the government could provide.

Questions

  • Manually creating ground truth from experts would likely be unsustainable, since the cost of that specific requirement would increase. Since I don’t believe you can require a particular accessibility need in Amazon Mechanical Turk, if this kind of work were based solely on Mechanical Turk (or other crowdsourcing tools), would the ground truth ultimately suffer due to the potential lack of experience and expertise?
  • This may be an external perspective; however, there seems to be a definite split of ideas within the paper: creating a system for crowd workers to identify problems, and then creating a system to optimize the crowd workers working on the project. Do you think both ideas were equally weighted throughout the paper, or was the Google Street View system a means of applying the techniques for optimizing the crowd workers?
  • Since these images (solely from Google Street View) are likely only taken periodically (due to resource constraints of the Google Street View cars), they are highly likely to be older and out of date with respect to any recent construction. When there is a delay in the Google Street View pictures, structures and buildings may have changed without being updated in the system. Do you think there might be enough changes in the streets that the turkers’ work would become obsolete?
