03/04/20 – Lee Lisle – Combining Crowdsourcing and Google Street View to Identify Street-Level Accessibility Problems

Summary

Hara, Le, and Froehlich developed an interface that uses Google Street View to identify accessibility issues in city sidewalks. They then performed a study with three researchers and three accessibility experts (wheelchair users) to evaluate their interface. This served both as a way to assess usability issues with the interface and as a ground truth to verify the results of their second study. That study involved launching crowdworking tasks to identify accessibility problems and to categorize each problem by type. Across 7,517 Mechanical Turk HITs, they found that crowdworkers could identify accessibility problems 80.6% of the time and could correctly classify the problem type 78.3% of the time. Combining their approach with a majority voting scheme, they raised these values to 86.9% and 83.8%.
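
The majority voting step can be illustrated with a minimal sketch. It assumes each worker submits one label per image; the function name and the sample labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties resolve to the label seen first."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]

# Hypothetical labels from five workers for a single image.
worker_labels = ["problem", "no problem", "problem", "problem", "no problem"]
consensus = majority_vote(worker_labels)
```

With an odd number of workers and a binary label, a tie cannot occur, which is one reason redundancy levels such as 3 or 5 are convenient.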

Personal Reflection

Their first step to see if their solution was even feasible seemed like an odd study. Their users were research members and experts, both of which are theoretically more driven than a typical crowdworker. Furthermore, I felt like internal testing and piloting would be more appropriate than a soft-launch like this. While they do bring up that they needed a ground truth to contextualize their second study, I initially felt that this should then be performed by only experts and not as a complete preliminary study. However, as I read more of the paper, I felt that the comparison between the groups (experts vs. researchers) was relevant as it highlighted how wheelchair bound people and able-bodied people can see situations differently. They could not have collected this data on Mechanical Turk alone as they couldn’t guarantee that they were recruiting wheelchair bound participants otherwise.

It was also good to see the human-AI collaboration highlighted in this study. Because they’re using the selections (and the images subsequently generated from those selections) as training data for a machine learning algorithm, it should lessen the need for this kind of work in the future.

Their pay level also seemed very low at 1-5 cents per image. Even assuming a selection and classification takes only 10 seconds, their total page loading only takes 5 seconds, and they always get 5 cents per image, that’s $12 an hour for ideal circumstances.
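
The arithmetic behind that estimate can be written out as a short sketch. The timing values are the assumptions stated above (10 seconds to label, 5 seconds to load), not measurements from the paper, and the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600

def hourly_wage(pay_per_image, label_seconds, load_seconds):
    """Hourly earnings if every image takes the same time and pays the same."""
    seconds_per_image = label_seconds + load_seconds
    images_per_hour = SECONDS_PER_HOUR / seconds_per_image
    return pay_per_image * images_per_hour

# Best case from the paragraph above: 5 cents per image, 15 seconds per image.
best_case = hourly_wage(0.05, label_seconds=10, load_seconds=5)
# Same timing at the low end of the pay range: 1 cent per image.
worst_case = hourly_wage(0.01, label_seconds=10, load_seconds=5)
```

Under the same timing assumptions, the low end of the pay range works out to roughly $2.40 an hour.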

The good part of this research is that it identifies problems cheaply and quickly. This can be used to identify a large number of issues and save time by deploying people to fix issues that are co-located in the same area, rather than deploying people to find issues and then solve them with less potential coverage. It also solves a public need for a highly vulnerable population, which makes their solution’s impact even better.

Lastly, it was good to see how the various levels of redundancy impacted their results. The falloff from increasing past 5 workers was harsher than I expected, and the increase in identification is likely not worth doubling the cost of these tasks.

Questions

  1. What other public needs could a Google Street View/crowdsourcing hybrid solve?
  2. What are the tradeoffs for the various stakeholders involved in solutions like this? (The people who need the fixes, the workers who typically had to identify these problems, the workers who are deployed to maintain the identified areas, and any others)
  3. Should every study measure the impact of redundancy? How might redundant workers affect your own projects?


03/04/20 – Lee Lisle – Real-Time Captioning by Groups of Non-Experts

Summary

Lasecki et al. present a novel captioning system for the deaf and hard of hearing population called LEGION:SCRIBE. Their implementation involves crowdsourcing multiple people per audio stream to achieve low latency as well as highly accurate results. They then detail that the competitors for this are professional stenographers (who are expensive) and automatic speech recognition (ASR, which has large issues with accuracy). They then go over how they intend to evaluate SCRIBE, with their Multiple Sequence Alignment (MSA) approach that aligns the output from multiple crowdworkers to get the best possible caption. Their approach also allows for tuning the output toward coverage or precision, where coverage provides a more complete caption and precision attains a lower word error rate. They then conducted an experiment where they transcribed a set of lectures using various methods, including several configurations of SCRIBE (varying the number of workers and the coverage) and an ASR. SCRIBE outperformed the ASR in both latency and accuracy.

Personal Reflection

This work is pretty relevant to me as my semester project is on transcribing notes for users in VR. I was struck by how quickly they were able to get captions from the crowd, but also by how many errors were still present in the finished product. In figure 11, the WER for CART was a quarter of that of their method, which only got slightly better than half of the words correct. And in figure 14, none of the transcriptions seem terribly acceptable, though CART was close. I wonder if their WER was so poor due to the nature of the talks or because there were multiple speakers in each scene. I wish that they had discussed how much impact having multiple speakers has on transcription services rather than giving the somewhat vague descriptions they had.
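
For reference, word error rate (WER) is conventionally the word-level edit distance between a transcript and its reference, divided by the number of reference words. The sketch below assumes simple whitespace tokenization; the paper's own evaluation pipeline may differ in its details.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that a caption with many inserted words can have a WER above 1, so WER is not simply the complement of "percent of words correct".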

It was interesting that they could get the transcriptions done through Mechanical Turk at the rate of $36 per hour. This is roughly 1/3 of their professional stenographer (at $1.75 per minute or $105 per hour). The cost savings are impressive, though the coverage could be a lot better.

Lastly, I was glad they included one of their final sections, “Leveraging Hybrid Workforces,” as it is particularly relevant to this class. They were able to increase their coverage and precision by including an ASR as one of the inputs into their MSA combiner, regardless if they were using one worker or ten. This indicates that there is a lot of value in human-AI collaboration in this space.

Questions

  1. If such low-latency wasn’t a key issue, could the captions get an even lower WER? Is it worth a 20 second latency? A 60 second latency? Is it worth the extra cost it might incur?
  2. Combined with our reading last week on acceptable false positives and false negatives from AI, what is an acceptable WER for this use case?
  3. Their MSA combiner showed a lot of promise as a tool for potentially different fields. What other ways could their combiner be used?
  4. Error checking is a problem in many different fields, but especially crowdsourcing as the errors can be caused in many different ways. What other ways are there to combat errors in crowdsourcing? Would you choose this way or another?


03/04/2020 – Vikram Mohanty – Combining crowdsourcing and google street view to identify street-level accessibility problems

Authors: Kotaro Hara, Vicki Le, and Jon Froehlich

Summary

This paper discusses the feasibility of using AMT crowd workers to label sidewalk accessibility problems in Google Street View. The authors create ground truth datasets with the help of wheelchair users and find that Turkers reached an accuracy of 81%. The paper also discusses some quality control and improvement methods, which were shown to be effective, improving the accuracy to 93%.

Reflection

This paper reminded me of Jeff Bigham’s quote – “Discovery of important problems, mapping them onto computationally tractable solutions, collecting meaningful datasets, and designing interactions that make sense to people is where HCI and its inherent methodologies shine.” It’s a great example of two important things mentioned in the quote: a) discovery of important problems, and b) collecting meaningful datasets. The paper’s contribution mentions that the datasets collected will be used for building computer vision algorithms, and this paper’s workflow involves the potential end-users (wheelchair users) early on in the process. Further, the paper attempts to use Turkers to generate datasets that are comparable in quality to those of the wheelchair users, essentially setting a high quality standard for generating potential AI datasets. This is a desirable approach for training datasets, which can potentially help prevent problems in popular datasets as outlined here: https://www.excavating.ai/

The paper also proposed two generalizable methods for improving data quality from Turkers. Filtering out low-quality workers during data collection by seeding in gold standard data may require designing modular workflows, but the time investment may well be worth it. 

It’s great to see how this work evolved to now form the basis for Project Sidewalk, a live project where volunteers can map accessibility areas in the neighborhood.

Questions

  1. What’s your usual process for gathering datasets? How is it different from this paper’s approach? Would you be willing to involve potential end-users in the process? 
  2. What would you do to ensure quality control in your AMT tasks? 
  3. Do you think collecting more fine-grained data for training CV algorithms will come at a trade-off for the interface not being simple enough for Turkers?


3/4/20 – Jooyoung Whang – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

In this paper, the authors study the effectiveness of vision-to-language systems for automatically generating alt texts for images and the impact of human-in-the-loop for this task. The authors set up four methods for generating alt text. The first is a simple implementation of modern vision-to-language alt text generation. The second is a human-adjusted version of the first method. The third method is a more involved one, where a Blind or Vision Impaired (BVI) user chats with a non-BVI user to gain more context about an image. The final method is a generalized version of the third method, where the authors analyzed the patterns of questions asked during the third method to form a structured set of pre-defined questions that a crowd worker can answer directly without the need for a lengthy conversation. The authors conclude that current vision-to-language techniques can, in fact, harm context understanding for BVI users, and that simple human-in-the-loop methods significantly outperform them. They also found that the structured-questions method worked the best.

This was an interesting study that implicitly pointed out the limitation of computers at understanding social context which is a human affordance. The authors stated that the results of a vision-to-language system often confused the users because the system did not get the point. This made me wonder if the current limitation could be overcome in the future.

I was also concerned whether the authors’ proposed methods were even practical. Sure, the human-in-the-loop method involving Mturk workers greatly enhanced the description of a Twitter image, but based on their report, it’ll take too long to retrieve the description. The paper reports that to answer one of the structured questions, it takes on average, 1 minute. This is excluding the time it takes for a Mturk worker to accept a HIT. The authors suggested pre-generating alt texts for popular Tweets, but this does not completely solve the problem.

I was also skeptical about the way the authors performed validation with the 7 BVI users. In their validation, they simulated their third method (TweetTalk, a conversation between BVI and sighted users). However, they did not do it by using their application, but rather through a face-to-face conversation between the researchers and the participants. The authors claimed that they tried to replicate the environment as much as possible, but I think there still can be flaws since the researchers serving as the sighted user already had expert knowledge about their experiment. Also, as stated in the paper’s limitations section, the validation was performed with too few participants. This may not fully capture the BVI users’ behaviors.

These are the questions that I had while reading this paper:

1. Do you think the authors’ proposed methods are actually practical? What could be done to make them practical if you don’t think so?

2. What do you think were the human affordances needed for the human element of this experiment other than social awareness?

3. Do you think the authors’ validation with the BVI users is sound? Also, the validation was only done for the third method. How can the validation be done for the rest of the methods?


03/04/2020 – Dylan Finch – Pull the Plug?

Word count: 596

Summary of the Reading

The main goal of this paper is to make image segmentation more efficient. Image segmentation, as it is now, requires humans to help with the process. There are just some images that machines cannot segment on their own. However, there are many cases where an image segmentation algorithm can do all of the work on its own. This presents a problem: we do not know when we can use an algorithm and when we have to use a human, so we have to have humans review all of the segmentations. This is highly inefficient. This paper tries to solve this problem by introducing an algorithm that can decide when a human is required to segment an image. The process described in the paper involves scoring each segmented image done by machines, then giving humans the task of reviewing the lowest scoring images. Overall, the process was very effective and saved a lot of human effort.
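
The score-then-route idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the predicted quality scores are taken as given (the paper's scoring model is not reproduced here), and the function name, image IDs, and budget parameter are hypothetical.

```python
def route_segmentations(predicted_quality, human_budget):
    """Send the lowest-scoring machine segmentations to humans.

    predicted_quality maps an image ID to a predicted quality score,
    where higher means the machine segmentation is expected to be better.
    """
    ranked = sorted(predicted_quality, key=predicted_quality.get)
    to_humans = ranked[:human_budget]
    keep_machine = ranked[human_budget:]
    return to_humans, keep_machine

# Hypothetical scores for four images, with budget for two human reviews.
scores = {"img_a": 0.91, "img_b": 0.35, "img_c": 0.62, "img_d": 0.12}
to_humans, keep_machine = route_segmentations(scores, human_budget=2)
```

Here the two lowest-scoring images go to human annotators and the rest keep their machine segmentations, so the human effort is fixed by the budget rather than by the size of the dataset.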

Reflections and Connections

I think that this paper gives a great example of how humans and machines should interact, especially when it comes to humans and AIs interacting. Often times, we set out in research with the goal of creating a completely automated process that throws the human away and tries to create an AI or some other kind of machine that will do all of the work. This is often a very bad solution. AIs as they currently are, are not good enough to do most complex tasks all by themselves. In the cases of tasks like image segmentation, this is an especially big issue. These tasks are very easy for humans to do and very hard for AIs to do. So, it is good to see a researcher who is willing to use human strengths to make up for the weaknesses of machines. I think it is a good thing to have both things working together.

This paper also gives us some very important research, trying to answer the question of when we should use machines and when we should use humans. This is a very tough question and it comes up in a lot of different fields. Humans are expensive, but machines are often imperfect. It can be very hard to decide when you should use one or the other. This paper does a great job of answering this question for image segmentation, and I would love to see more similar research in other fields explain when it is best to use humans and machines in those fields.

While I like this paper, I do also worry that it is simply moving the problem, rather than actually solving it. Now, instead of needing to improve a segmentation algorithm, we need to improve the scoring algorithm for the segmentations. Have we really improved the solution or have we just moved the area that now needs further improvement? 

Questions

  1. How could this kind of technology be used in other fields? How can we more efficiently use human and machine strengths together?
  2. In general, when do you think it is appropriate to create a system like this? When should we not fully rely on AI or machines?
  3. Did this paper just move the problem, or do you think that this method is better than just creating a better image segmentation algorithm? 
  4. Does creating systems like this stifle innovation on the main problem?
  5. Do you think machines will one day be good enough to segment images with no human input? How far off do you think that is?


03/04/2020 – Dylan Finch – Real-Time Captioning by Groups of Non-Experts

Word count: 564

Summary of the Reading

This paper aims to help with accessibility of audio streams by making it easier to create captions for deaf listeners. The typical solution to this problem is to hire stenographers: expensive, highly trained professionals who require specialized keyboards. In other cases, people with less training create the captions, but these captions may take longer to write, creating a latency between what is said in the audio and the captions. This is not desirable, because it makes it harder for the deaf person to connect the audio with any accompanying video. This paper aims to marry cheap, easy-to-produce captions with the ability to have the captions created in real time and with little latency. The solution is to use many people who do not require specialized training. When working together, a group of crowd workers can achieve high caption coverage of audio with a latency of only 2.9 seconds.

Reflections and Connections

I think that this paper highlights one of the coolest things that crowdsourcing can do. It can take big, complicated tasks that used to require highly trained individuals and make them accomplishable by ordinary people. This is extremely powerful. It makes all kinds of technologies and techniques much more accessible. It is hard to hire one highly trained stenographer, but it is easy to hire a few normal people. This is the same idea that powers Wikipedia. Many people make small edits, using specialized knowledge that they know, and, together, they create a highly accessible and complete collection of knowledge. This same principle can and should be applied to many more fields. I would love to see what other professions could be democratized through the use of many normal people to replace one highly trained person. 

This research also shows how it is possible to break up tasks that may have traditionally been thought of as atomic. Transcribing audio is a very hard task to solve using crowd workers because there are no real discrete tasks that could be sent to crowd workers. The stream of audio is continuous and always changing. However, this paper shows that it is possible to break up this activity into manageable chunks that can be accomplished by crowd workers; the researchers just needed to think outside of the box. I think that this kind of thinking will become increasingly important as more and more work is crowdsourced. I think that as we learn how to solve more and more problems using crowdsourcing, the issue becomes less and less about whether we can solve a problem using crowdsourcing and much more about how we can break the problem up into manageable pieces that can be done by the crowd. This kind of research has applications elsewhere, too. I think that in the future this kind of research will be much more important.

Questions

  1. What are some similar tasks that could be crowdsourced using a method similar to the one described in the paper?
  2. How do you think that crowdsourcing will impact the accessibility of our world? Are there other ways that crowdsourcing could make our world more accessible?
  3. Do you think there will come a time when most professions can be accomplished by crowd workers? What do you think the extent of crowd expertise will be?


03/04/2020- Myles Frantz – Combining crowdsourcing and Google Street View to identify street-level accessibility problems

Summation

Taxes in the US are a very divisive topic, and unfortunately, infrastructure such as road maintenance is left to take the impact. Due to the lack of resources generally allocated, states typically allocate only the resources necessary to fix problems, while generally leaving out accessibility options. This team from the University of Maryland has taken it upon themselves to prototype the identification of these missing accessibility options via crowdsourcing. They developed a system that utilized information from Google Street View, in which users could identify problems with the road. Subsequent users could also confirm or deny the conclusions of previous users. They ran this experiment using 229 images manually collected through Google Street View, once with 3 wheelchair users and then with 185 Mechanical Turk workers. They were able to achieve an accuracy of at least 78% compared to the ground truth. Further trimming down the lower-ranking turkers raised the accuracy by about 6% at the cost of filtering out 52% of the turkers.

Response

I can appreciate this approach since I believe (as was also stated in the paper) that a manual effort to identify the accessibility problems would cost a lot of money and time. Both of those are typically sticking points in government contracts. Though governments may not be ready to open this kind of work to crowd workers, the accuracy is creating a stronger argument. The study also showed that reducing the pool of raw workers to the better-performing ones was ultimately fruitful, and this may align with the type of budget the government could provide.

Questions

  • Manually creating ground truth from experts would likely be unsustainable, since the cost of that specific requirement would increase. Since I don’t believe you can require a kind of accessibility handicap in Amazon Mechanical Turk, if this kind of work were based solely on Mechanical Turk (or other crowdsourcing tools), would the ground truth ultimately suffer due to the potential lack of experience and expertise?
  • This may be an external perspective; however, it seems there is a definitive split of ideas within the paper, creating a system for crowd workers to identify and then creating a system to optimize the potential crowd workers working on the project. Do you think both ideas were equally weighed and spread throughout the paper or the Google Street View system was a means of utilizing the techniques for optimizing the crowd workers?
  • Since these images (solely utilizing Google Street View) are likely only taken periodically (due to resource constraints of the Google Street View cars), the images are highly likely to be older and may not reflect recent construction. When there is a delay in the Google Street View pictures, structures and buildings may have changed without being updated in the system. Do you think there might be enough changes in the streets that the turkers’ work would become obsolete?


03/04/2020- Myles Frantz – Real-time captioning by groups of non-experts

Summation

Machine learning is at the forefront of most technologies, though it is still highly inaccurate; for example, YouTube’s auto-generated captions of recorded videos still show the infancy of the technology. To improve and go beyond this, the team at Rochester created a hybrid way to combine multiple crowd workers’ efforts in order to create captions more accurately and in a more timely manner. This methodology was set up in order to verify previous machine learning algorithms or to generate captions themselves. Throughout the experiment, the tie-breaker is a majority vote. Compared to other captioning systems, Scribe is similar in precision and Word Error Rate, though at a lower cost.

Response

I could see how combining the two aspects, crowd workers and the initial baseline, could create a good and accurate process for generating captions. Using crowd workers to assess and verify the baseline caption generation ensures the quality of the captions generated and has the potential to improve the machine learning algorithm. Furthering this, more workers can be given jobs and the captioning system could ultimately improve, benefiting both the jobs available and the core machine learning algorithm itself.

Questions

  • I am not currently experienced in this specific field, but disregarding the publishing date of 2012, this combination of crowd workers verifying the auto-generated captions does not seem ultimately novel. Though their study of the latest and greatest in the field didn’t include any crowd workers in any capacity, this may have been limited by their scope. In your opinion, does this research currently stand up to some of the more recent papers on auto-captioning, or is it just a product of its time?
  • As a potential problem within the crowd working community, their techniques utilize a majority vote to confirm which words accurately represent the phrase. Though there may be some statistics for ensuring the mechanical turkers have sufficient experience and can be relied on, this area may be vulnerable to malicious actors outnumbering the non-malicious actors. Based on the phrases being interpreted and explicitly written, do you think a scenario similar to the Mountain Dew naming campaign (Dub The Dew – https://www.huffpost.com/entry/4chan-mountain-dew_n_1773076), in which a group of malicious actors overloaded the names, could happen to this type of system?
  • With this technology, the raw audio of a speech or some event would be fed directly to the mechanical turkers working on the Scribe program. Depending on the environment where the speech was given or the quality of the microphone, not even a majority of users may be able to hear the correct words (potentially regardless of the volume of the speech). Would there be a potential future for combining this kind of technology with some sort of machine learning algorithm that isolates and removes the white noise or smaller conversations around the main speakers of the event?


03/04/2020- Bipasha Banerjee – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary

The paper by Gurari et al. discusses the segmentation of images, and when segmentation should be done by humans versus when a machine-only approach is applicable. The work described in this paper is interdisciplinary, involving computer vision and human computation. They have considered both fine-grained and coarse-grained segmentation approaches to determine where the human or the machine performs better. The PTP framework describes whether to pull the plug on humans or machines. The framework aims to predict whether the labeling of images should come from humans or machines, along with the quality of the labeled image. Their prediction framework is a regression model that captures the segmentation quality. The training data was populated with masks to reflect the quality of the segmentation. The three algorithms used are Hough transform with circles, Otsu thresholding, and adaptive thresholding. For labels, the Jaccard index was used to indicate the quality of each instance. Nine features derived from the binary segmentation mask were proposed to catch failures. It was finally concluded that a mixed approach performed better than relying completely on humans or computers.
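
As a reminder of what the quality label measures, the Jaccard index of two binary masks is the size of their intersection divided by the size of their union. The sketch below assumes flattened masks of equal length containing 0s and 1s; the function name and sample masks are illustrative and are not taken from the authors' published code.

```python
def jaccard_index(mask_a, mask_b):
    """Intersection over union of two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks agree completely, so treat that case as a perfect score.
    return intersection / union if union else 1.0

# Hypothetical 2x2 masks, flattened row by row.
predicted = [1, 1, 0, 0]
ground_truth = [1, 0, 1, 0]
quality = jaccard_index(predicted, ground_truth)
```

A score of 1 means the predicted mask matches the ground truth exactly, and a score of 0 means the two masks share no foreground pixels.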

Reflection 

The use of machines vs. humans is a complex debate. Leveraging both machine and human capabilities is necessary for efficiency and dealing with “big data.” The paper aims to find when to use computers to create coarse-grained segments and when to replace with humans for fine-grained data. I liked that the authors published the code. This helps in the advancement of research and reproducibility.

The authors have used three datasets, but all are based on images. In my opinion, identifying bounding boxes in images is a relatively simple task. I work with texts, and I have observed that segmenting large amounts of text is not simple. Most of the available tools fail to segment long documents like ETDs effectively. Nonetheless, segmentation is an important task, and I am intrigued to see how this work can be extended to text.

Using crowd workers can be tricky. Although Amazon Mechanical Turk allows requesters to specify the experience, quality, etc. of workers, the time taken by a worker can vary depending on various factors. Would humans familiar with the dataset or the domain annotate faster? This needs to be thought through well, in my opinion, especially when we are trying to compete against machines. Machines are faster and good at handling vast amounts of data, whereas humans are good at accuracy. This paper highlights the old problem of accuracy vs. speed.

Questions

  1. The segmentation has been done on datasets with images. How does this extend to text? 
  2. Would experts on the topic or people familiar with databases require less time to annotate?
  3. Although three datasets have been used, I wonder if the domain matters? Would complex images affect the accuracy of machines?


03/04/2020 – Sushmethaa Muhundan – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

The popularity of social media has increased exponentially over the past few decades, and with this comes a wave of image content that is flooding social media. Amidst this growing popularity, people who are blind or visually impaired (BIV) often find it extremely difficult to understand such content. Although existing solutions offer limited capabilities to caption images and provide alternative text, these are often insufficient and, if inaccurate, have a negative impact on the experience of BIV users. This paper aims to provide a better platform to improve the experience of BIV users by combining crowd input with existing automated captioning approaches. As part of the experiments, numerous workflows with varying degrees of human and automated system involvement were designed and evaluated. The four frameworks that were introduced as part of this study include a fully-automated captioning workflow, a human-corrected captioning workflow, a conversational assistant workflow, and a structured Q&A workflow. It was observed that though the workflows involving humans in the loop were time-consuming, they increased the user’s satisfaction by providing accurate descriptions of the images.

Throughout the paper, I really liked the focus on improving the experience of blind or visually impaired users while using social media and ensuring that accurate image description is provided so that the BIV users understand the context. The paper explores innovative means of leveraging humans in the loop to solve this pervasive issue.

Also, the particular platform being targeted here is social media which comes with its own challenges. Social media is a setting where the context and emotions of the images are as important as the image description itself to provide the BIV users sufficient information to understand the post. Another aspect that I found interesting was the focus on scalability which is extremely important in a social media setting.

I found the concepts of the TweetTalk conversational workflow and the Structured Q&A workflow interesting, as they provide a mixed approach by involving humans in the loop whenever necessary. The intent of the conversational workflow is to understand the aspects that make a caption valuable to a BIV user. I felt that this fundamental understanding is essential to building further systems that ensure user satisfaction.

It was good to see that the sample tweets were chosen based on broad areas of topics that represented the various interests reported by blind users. An interesting insight that came out of the study was that no captions were preferred to inaccurate captions to avoid the cost of recovery from misinterpretation based on an inaccurate caption.

  1. Despite being validated by 7 BIV people, the study largely involved simulating a BIV user’s behavior. Do the observations hold good for scenarios with actual BIV users or is the problem not captured via these simulations?
  2. Apart from the two new workflows used in this paper, what are some other techniques that can be used to improve the captioning of images on social media so that the captions capture the essence of the post?
  3. Besides social media, what other applications or platforms have similar drawbacks from the perspective of BIV users? Can the workflows that were introduced in this paper be used to solve those problems as well?
