03/04/20 – Nan LI – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary:

The main objective of this paper is to investigate the feasibility of using crowd workers to locate and assess sidewalk accessibility problems in Google Street View imagery. To achieve this goal, the authors conducted two studies to examine the feasibility of finding and labeling sidewalk accessibility problems. The paper uses the results of the first study to establish that the labeling task is possible, define what good labeling performance looks like, and provide verified ground truth labels that can be used to assess the performance of crowd workers. Then, the paper evaluates annotation correctness at two discrete levels of granularity: image level and pixel level. The former checks for the presence or absence of a label, while the latter examines agreement more precisely, in a way related to image segmentation work in computer vision. Finally, the paper discusses quality control mechanisms, including statistical filtering, an approach that reveals effective performance thresholds for eliminating poor-quality turkers, and a verification interface, a subjective approach to validating labels.
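To make the two levels of evaluation concrete, here is a minimal sketch (my own illustration, not the authors' code) of how pixel-level agreement might be computed as an area-overlap ratio, assuming a worker's label and the ground truth are available as same-sized binary NumPy masks:

import numpy as np

def pixel_overlap(worker_mask: np.ndarray, truth_mask: np.ndarray) -> float:
    """Area overlap (intersection over union) between two binary masks."""
    worker = worker_mask.astype(bool)
    truth = truth_mask.astype(bool)
    union = np.logical_or(worker, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(worker, truth).sum() / union

# Toy example: a 10x10 image where the worker's box is shifted by one pixel.
truth = np.zeros((10, 10), dtype=bool)
truth[2:6, 2:6] = True
worker = np.zeros((10, 10), dtype=bool)
worker[3:7, 3:7] = True
print("image-level hit:", worker.any() == truth.any())        # presence/absence only
print(f"pixel-level overlap: {pixel_overlap(worker, truth):.2f}")  # stricter measure

In this toy example the image-level check agrees perfectly while the pixel-level overlap is only about 0.39, which illustrates why pixel-level agreement numbers tend to look much lower than image-level ones.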

Reflection:

The most impressive part of this paper is the feasibility study, Study 1. This study not only investigates the feasibility of the labeling work but also establishes a standard for good labeling performance and produces validated ground truth labels that can be used to evaluate the crowd workers' performance. This pre-study provides the clues, the directions, and even the evaluation metrics for the later experiment. It delivers the most valuable information for the early stage of the research at a very low cost in workload and effort. I think a common research issue is that we put a lot of effort into driving a project forward instead of preparing and investigating feasibility first. As a result, we get stuck on problems we could have foreseen had we conducted a pre-study.

However, I don't think the pixel-level assessment is a good idea for this project. The labeling task does not require such high accuracy for the inaccessible area, and marking inaccessible areas at the granularity of individual pixels is more precise than necessary. As the paper's table of pixel-level agreement results indicates, the area overlap for both binary and multiclass classification is no more than 50%. Moreover, although the authors argue that even a 10-15% overlap agreement at the pixel level would be sufficient to localize problems in images, this leaves me confused about whether the authors intend the pixel-level evaluation to be a strict measure of accuracy or not.

Finally, considering our final project, it is worth thinking about the number of crowd workers that we need for the task. We need to think about the accuracy of turkers per job. The paper makes the point that performance improves with turker count, but the gains diminish in magnitude as group size grows. Thus, we should work out the trade-off between accuracy and cost so that we can make a better-informed choice when hiring workers.
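As a rough way to reason about that trade-off, here is a small sketch (my own illustration, not from the paper) that assumes each turker independently labels an image correctly with probability p and computes the accuracy of a simple majority vote for different group sizes; the diminishing returns the paper describes show up clearly.

from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent workers (each with accuracy p) is correct."""
    # Require strictly more than half of the n workers to be correct (n assumed odd).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range((n // 2) + 1, n + 1))

per_worker_accuracy = 0.80  # hypothetical single-turker accuracy
for n in (1, 3, 5, 7, 9):
    acc = majority_vote_accuracy(per_worker_accuracy, n)
    print(f"{n} workers: accuracy {acc:.3f}, cost {n}x")

With a hypothetical per-worker accuracy of 0.80, going from one to three workers gains almost ten points of accuracy, while going from seven to nine gains barely one, so each additional worker buys less accuracy per dollar.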

Questions:

  • What do you think about the approach of this paper? Do you believe a pre-study is valuable? Would you apply this in your own research?
  • What do you think about the metrics the authors used for evaluating labeling performance? What other metrics would you apply to assess the rate of overlap area?
  • Have you ever considered how many turkers you would need to hire to meet the accuracy requirement of a task? How would you estimate this number?

Word Count: 578


03/04/20 – Nan LI – Real-Time Captioning by Groups of Non-Experts

Summary:

In this paper, the authors focus on the main limitations of real-time captioning. They point out that captions with high accuracy and low latency require expensive stenographers, who must be booked in advance and are trained to use specialized keyboards. The less expensive option is automatic speech recognition; however, its low accuracy and high error rate greatly hurt the user experience and cause many inconveniences for deaf users. To alleviate these problems, the authors introduce an end-to-end system called LEGION: SCRIBE, which lets multiple workers caption simultaneously in real time and combines their input into a final answer with high precision, high coverage, and low latency. The authors experimented with crowd workers and local participants and compared the results against CART, ASR, and individual workers. The results indicate that this end-to-end system with a group of workers can outperform both individuals and ASR in terms of coverage, precision, and latency.
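As a rough illustration of the evaluation terms used here, the sketch below (my own simplification, not the paper's actual metric definitions, which work over aligned, timed word sequences) treats coverage as the fraction of the reference transcript's words that a merged caption captures, and precision as the fraction of the caption's words that appear in the reference:

from collections import Counter

def coverage_and_precision(caption: str, reference: str) -> tuple[float, float]:
    """Word-level coverage and precision of a caption against a reference transcript."""
    cap = Counter(caption.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum((cap & ref).values())  # multiset intersection of word counts
    coverage = matched / max(sum(ref.values()), 1)
    precision = matched / max(sum(cap.values()), 1)
    return coverage, precision

# Hypothetical clip: the combined workers' caption misses one technical term.
reference = "the reaction only proceeds in a non-aqueous solution at room temperature"
merged = "the reaction only proceeds in a solution at room temperature"
cov, prec = coverage_and_precision(merged, reference)
print(f"coverage {cov:.2f}, precision {prec:.2f}")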

Reflection:

First, I think the authors make a good point about the limitations of real-time captioning, especially the inconvenience it brings to deaf and hard-of-hearing people. Thus, the greatest contribution of this end-to-end system is access to a cheap and reliable real-time captioning channel. However, I have several concerns about it.

First, this end-to-end system requires a group of workers; even if each person is paid a low wage, as the captioning time increases, the total pay for all workers is still a significant overhead.

Second, to satisfy the coverage requirement, a high-precision, high-coverage, low-latency caption requires at least five workers working together. As mentioned in the experiment, the MTurk workers needed to watch a 40-second video to learn how to use the system. Therefore, the system may not be able to find the required number of workers in time.

Third, the system only combines the work from all workers, so there is a coverage problem: if all of the workers miss a piece of information, the system output will be incomplete. In my experience, if one person misses part of the information, most other people usually miss it as well. As in the example presented in the paper, no worker typed the word "non-aqueous" in a clip about chemistry.

Finally, I am considering combining human correction with ASR captions. Humans have the strength of remembering previously mentioned knowledge, for example an abbreviation, yet they cannot type fast enough to cover all the content. ASR, on the other hand, usually does not miss any portion of the speech, yet it makes some unreasonable mistakes. Thus, it might be a good idea to let humans correct the inaccurate captions produced by ASR instead of trying to type out all the speech content.

Questions:

  • What do you think of this end-to-end system? Can you evaluate it from different perspectives, such as expense and accuracy?
  • How would you solve the problem of inadequate speech coverage?
  • What do you think of the idea of combining human and ASR work? Do you think it will be more efficient or less efficient?

Word Count: 517


03/04/20 – Lee Lisle – Real-Time Captioning by Groups of Non-Experts

Summary

            Lasecki et al. present a novel captioning system for the deaf and hard of hearing population entitled LEGION:SCRIBE. Their implementation crowdsources multiple people per audio stream to achieve low-latency as well as highly accurate results. They detail that the competitors are professional stenographers (who are expensive) and automatic speech recognition (ASR, which has large issues with accuracy). They then go over how they evaluate SCRIBE, using their Multiple Sequence Alignment (MSA) approach to align the output from multiple crowdworkers into the best possible caption. Their approach also allows tuning the output to favor coverage or precision, where coverage provides a more complete caption and precision attains a lower word error rate. They then conducted an experiment in which they transcribed a set of lectures using various methods, including several configurations of SCRIBE (varying the number of workers and the coverage setting) and an ASR. SCRIBE outperformed the ASR in both latency and accuracy.
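For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis caption and the reference transcript, divided by the reference length. A minimal sketch with made-up example strings:

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("please open the lab manual to page ten",
                      "please open lab manual to page tan"))  # 2 errors / 8 words = 0.25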

Personal Reflection

This work is pretty relevant to me, as my semester project is on transcribing notes for users in VR. I was struck by how quickly they were able to get captions from the crowd, but also by how many errors were still present in the finished product. In Figure 11, the WER for CART was a quarter of that of their method, which only got slightly more than half of the words correct. And in Figure 14, none of the transcriptions seem terribly acceptable, though CART was close. I wonder whether the WER was so poor due to the nature of the talks or because there were multiple speakers in each scene. I wish they had discussed how much impact multiple speakers have on transcription services rather than giving the somewhat vague descriptions they did.

It was interesting that they could get the transcriptions done through Mechanical Turk at a rate of $36 per hour. This is roughly a third of the cost of their professional stenographer (at $1.75 per minute, or $105 per hour). The cost savings are impressive, though the coverage could be a lot better.

Lastly, I was glad they included one of their final sections, "Leveraging Hybrid Workforces," as it is particularly relevant to this class. They were able to increase their coverage and precision by including an ASR as one of the inputs to their MSA combiner, regardless of whether they were using one worker or ten. This indicates that there is a lot of value in human-AI collaboration in this space.

Questions

  1. If such low latency weren't a key issue, could the captions get an even lower WER? Is it worth a 20-second latency? A 60-second latency? Is it worth the extra cost it might incur?
  2. Combined with our reading last week on acceptable false positives and false negatives from AI, what is an acceptable WER for this use case?
  3. Their MSA combiner showed a lot of promise as a tool for potentially different fields. What other ways could their combiner be used?
  4. Error checking is a problem in many different fields, but especially crowdsourcing as the errors can be caused in many different ways. What other ways are there to combat errors in crowdsourcing? Would you choose this way or another?


03/04/20 – Lee Lisle – Combining Crowdsourcing and Google Street View to Identify Street-Level Accessibility Problems

Summary

Hara, Le, and Froehlich developed an interface that uses Google Street View to identify accessibility issues on city sidewalks. They then performed a study using three researchers and three accessibility experts (wheelchair users) to evaluate their interface. This served both as a way to assess usability issues with the interface and as a ground truth to verify the results of their second study. That study involved launching crowdworking tasks to identify accessibility problems and to categorize the type of each problem. Over 7,517 Mechanical Turk HITs, they found that crowdworkers could identify accessibility problems 80.6% of the time and could correctly classify the problem type 78.3% of the time. Combining their approach with a majority-voting scheme raised these values to 86.9% and 83.8%.
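A minimal sketch of the image-level majority-voting step (my own illustration with hypothetical labels, not the authors' pipeline) might look like this:

from collections import Counter

def majority_label(labels: list[str]) -> str:
    """Return the most common label among workers for one image (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labels from three turkers for two street-view images.
worker_labels = {
    "img_001": ["curb ramp missing", "curb ramp missing", "surface problem"],
    "img_002": ["no problem", "object in path", "object in path"],
}
for image_id, labels in worker_labels.items():
    print(image_id, "->", majority_label(labels))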

Personal Reflection

Their first step to see whether their solution was even feasible seemed like an odd study. Their users were research-team members and experts, both of which are theoretically more driven than a typical crowdworker. Furthermore, I felt like internal testing and piloting would be more appropriate than a soft launch like this. While they do bring up that they needed a ground truth to contextualize their second study, I initially felt that this should then be performed only by experts and not as a complete preliminary study. However, as I read more of the paper, I felt that the comparison between the groups (experts vs. researchers) was relevant, as it highlighted how wheelchair-bound people and able-bodied people can see situations differently. They could not have collected this data on Mechanical Turk alone, as they couldn't guarantee that they were recruiting wheelchair-bound participants otherwise.

It was also good to see human-AI collaboration highlighted in this study. Since they use the selections (and the images generated from those selections) as training data for a machine learning algorithm, this should lessen the amount of manual labeling needed in the future.

Their pay level also seemed very low at 1-5 cents per image. Even assuming a selection and classification takes only 10 seconds, page loading takes only 5 seconds, and the full 5 cents is always paid per image, that works out to $12 an hour (240 images at $0.05) only under ideal circumstances; the effective rate in practice would be lower.

The good part of this research is that it identifies problems quickly and cheaply. It can be used to find a large number of issues and to save time by deploying people to fix issues that are co-located in the same area, rather than deploying people to both find and fix issues with less potential coverage. It also addresses a public need for a highly vulnerable population, which makes the solution's impact even greater.

Lastly, it was good to see how the various levels of redundancy impacted their results. The falloff from increasing past 5 workers was harsher than I expected, and the increase in identification is likely not worth doubling the cost of these tasks.

Questions

  1. What other public needs could a Google Street View/crowdsourcing hybrid solve?
  2. What are the tradeoffs for the various stakeholders involved in solutions like this? (The people who need the fixes, the workers who typically had to identify these problems, the workers who are deployed to maintain the identified areas, and any others)
  3. Should every study measure the impact of redundancy? How might redundant workers affect your own projects?


03/04/2020- Bipasha Banerjee – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary

The paper by Gurari et al. discusses image segmentation and when segmentation should be done by humans versus when a machine-only approach is applicable. The work described in this paper is interdisciplinary, involving computer vision and human computation. The authors consider both fine-grained and coarse-grained segmentation to determine where humans or machines perform better. The PTP framework decides whether to "pull the plug" on humans or machines: it aims to predict whether the labeling of an image should come from a human or a machine, along with the quality of the labeled image. Their prediction framework is a regression model that captures segmentation quality. The training data consists of segmentation masks labeled to reflect segmentation quality. The three algorithms used are the Hough transform with circles, Otsu thresholding, and adaptive thresholding. For labels, the Jaccard index was used to indicate the quality of each instance. Nine features derived from the binary segmentation mask were proposed to catch failures. The authors conclude that a mixed approach performs better than relying entirely on either humans or computers.
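As a concrete, much-simplified illustration of the pipeline described above, the sketch below uses scikit-image's Otsu thresholding (one of the three algorithms mentioned) to produce a binary mask, computes a couple of simple mask features of the kind that could feed a quality-prediction regressor, and includes a Jaccard scoring helper. The sample image, the feature choices, and the library are my own assumptions for illustration; they are not the paper's actual nine features or datasets.

import numpy as np
from skimage import data, filters, measure

def jaccard(pred: np.ndarray, truth: np.ndarray) -> float:
    """Jaccard index (intersection over union) between two binary masks."""
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

# Otsu thresholding on a sample grayscale image bundled with scikit-image.
image = data.coins()
mask = image > filters.threshold_otsu(image)

# Illustrative mask features (foreground fraction, number of connected regions).
features = {
    "foreground_fraction": float(mask.mean()),
    "num_regions": int(measure.label(mask).max()),
}
print(features)

# If a ground-truth mask were available, a quality label could be computed like so:
# quality = jaccard(mask, ground_truth_mask)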

Reflection 

The use of machines vs. humans is a complex debate. Leveraging both machine and human capabilities is necessary for efficiency and for dealing with "big data." The paper aims to find when to use computers to create coarse-grained segmentations and when to bring in humans for fine-grained ones. I liked that the authors published their code; this helps the advancement of research and reproducibility.

The authors used three datasets, but all of them are image-based. In my opinion, identifying bounding boxes in images is a relatively simple task. I work with text, and I have observed that segmenting large amounts of text is not simple. Most of the available tools fail to segment long documents like ETDs effectively. Nonetheless, segmentation is an important task, and I am intrigued to see how this work could be extended to text.

Using crowd workers can be tricky. Although Amazon Mechanical Turk allows requesters to specify the experience, quality, and so on of workers, the time taken by a worker can vary depending on many factors. Would humans familiar with the dataset or the domain annotate faster? This needs to be thought through carefully, in my opinion, especially when we are trying to compete against machines. Machines are faster and better at handling vast amounts of data, whereas humans are better at accuracy. This paper highlights the old problem of accuracy vs. speed.

Questions

  1. The segmentation has been done on datasets with images. How does this extend to text? 
  2. Would experts on the topic or people familiar with databases require less time to annotate?
  3. Although three datasets have been used, I wonder if the domain matters? Would complex images affect the accuracy of machines?


03/04/2020- Myles Frantz – Real-time captioning by groups of non-experts

Summation

Machine learning is at the forefront of many technologies, though it is still highly inaccurate; YouTube's auto-generated captions for recorded videos, for example, show the infancy of the technology. To improve on this, the team at Rochester created a hybrid approach that combines the efforts of multiple crowd workers in order to create captions more accurately and with lower latency. The methodology can be used either to verify previous machine learning output or to generate captions outright. Throughout the experiment, ties are broken by a majority vote. Compared to other captioning systems, Scribe achieves similar precision and Word Error Rate, though at a lower cost.

Response

I can see how combining the two aspects, crowd workers and the initial baseline, could create an accurate process for generating captions. Using crowd workers to assess and verify the baseline captions ensures the quality of the generated captions and has the potential to improve the underlying machine learning algorithm. Going further, more workers could be given jobs and the captioning system could keep improving, benefiting both the jobs available and the core machine learning algorithm itself.

Questions

  • I am not experienced in this specific field, and setting aside the 2012 publication date, the combination of crowd workers verifying auto-generated captions does not seem especially novel. Although their survey of the state of the art did not include crowd workers in any capacity, this may have been a limitation of their scope. In your opinion, does this research still stand up to more recent work on auto-captioning, or is it a product of its time?
  • A potential problem within the crowd-working community: their technique uses a majority vote to confirm which words accurately represent a phrase. Though there may be statistics to ensure that the mechanical turkers have sufficient experience and can be relied on, this area may be vulnerable if malicious actors outnumber non-malicious ones. Given that the phrases are interpreted and written explicitly, do you think a scenario similar to the Mountain Dew naming campaign (Dub the Dew – https://www.huffpost.com/entry/4chan-mountain-dew_n_1773076), in which a group of malicious actors flooded the submissions, could happen to this type of system?
  • With this technology, the raw audio of a speech or event is fed directly to the mechanical turkers working with the Scribe program. Depending on the environment where the speech was given or the quality of the microphone, even a majority of workers may not be able to hear the correct words (potentially regardless of the speech's volume). Could this technology be combined in the future with machine learning algorithms that isolate and remove white noise or side conversations around the main speakers of the event?


03/04/2020- Myles Frantz – Combining crowdsourcing and Google Street View to identify street-level accessibility problems

Summation

Taxes in the US are a divisive topic, and unfortunately, public infrastructure such as road maintenance takes the hit. Because of limited funding, states typically allocate only enough resources to fix basic problems, generally leaving out accessibility. This team from the University of Maryland took it upon themselves to prototype the identification of these missing accessibility features via crowdsourcing. They developed a system that uses imagery from Google Street View, in which users identify problems with the sidewalk; subsequent users can then confirm or deny previous users' conclusions. They ran the experiment on 229 images manually collected from Google Street View, once with three wheelchair users and then with 185 mechanical turkers. They were able to achieve an accuracy of at least 78% compared to the ground truth. Further trimming away the lower-ranking turkers raised the accuracy by about 6% at the cost of filtering out 52% of the turkers.
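A minimal sketch of that kind of filtering (my own illustration with hypothetical numbers, not the authors' statistical filter or their actual threshold) is to score each turker against the ground-truth images and keep only those above a cut-off:

# Hypothetical per-turker accuracies measured on the ground-truth (gold) images.
turker_accuracy = {"t01": 0.91, "t02": 0.62, "t03": 0.85, "t04": 0.55, "t05": 0.79}

THRESHOLD = 0.75  # assumed cut-off; the paper derives its threshold empirically

kept = {t: a for t, a in turker_accuracy.items() if a >= THRESHOLD}
dropped_fraction = 1 - len(kept) / len(turker_accuracy)

print(f"kept workers: {sorted(kept)}")
print(f"filtered out {dropped_fraction:.0%} of turkers")
print(f"mean accuracy before {sum(turker_accuracy.values()) / len(turker_accuracy):.2f}, "
      f"after {sum(kept.values()) / len(kept):.2f}")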

Response

I can appreciate this approach, since I believe (as was also stated in the paper) that a manual effort to identify the accessibility problems would cost a lot of money and time; both are typically sticking points and stringent constraints in government contracts. Though governments may not be ready to open this kind of work to crowd workers, the accuracy results make a stronger argument for it. To further ensure better workers, the study also showed that dropping the lower-performing workers in exchange for better results was ultimately fruitful, and this may align with the kind of budget a government could provide.

Questions

  • Manually creating ground truth with experts would likely be unsustainable, since the cost of that requirement would only grow. Since I don't believe you can require a particular disability on Amazon Mechanical Turk, if this kind of work were based solely on Mechanical Turk (or other crowdsourcing tools), would the ground truth ultimately suffer due to the lack of experience and expertise?
  • This may be an outsider's perspective; however, there seems to be a definitive split of ideas within the paper: creating a system for crowd workers to identify problems, and then creating techniques to optimize which crowd workers work on the project. Do you think both ideas were weighed equally throughout the paper, or was the Google Street View system mainly a vehicle for the worker-optimization techniques?
  • Since these images (taken solely from Google Street View) are likely captured only periodically (due to the resource constraints of the Google Street View cars), they are likely to be outdated and affected by recent construction. When the Street View pictures lag behind, structures and buildings may have changed without being updated in the system. Do you think the streets might change enough that the turkers' work would become obsolete?


03/04/2020 – Dylan Finch – Real-Time Captioning by Groups of Non-Experts

Word count: 564

Summary of the Reading

This paper aims to improve the accessibility of audio streams by making it easier to create captions for deaf listeners. The typical solution to this problem is to hire stenographers: expensive, highly trained professionals who require specialized keyboards. In other cases, people with less training create the captions, but these captions take longer to write, creating latency between what is said in the audio and the captions. This is undesirable, because it makes it harder for a deaf person to connect the audio with any accompanying video. This paper aims to marry cheap, easy-to-produce captions with the ability to create captions in real time with little latency. The solution is to use many people who do not require specialized training. Working together, a group of crowd workers can achieve high caption coverage of the audio with a latency of only 2.9 seconds.

Reflections and Connections

I think that this paper highlights one of the coolest things that crowdsourcing can do. It can take big, complicated tasks that used to require highly trained individuals and make them accomplishable by ordinary people. This is extremely powerful. It makes all kinds of technologies and techniques much more accessible. It is hard to hire one highly trained stenographer, but it is easy to hire a few normal people. This is the same idea that powers Wikipedia: many people make small edits, using the specialized knowledge they have, and, together, they create a highly accessible and complete collection of knowledge. This same principle can and should be applied to many more fields. I would love to see what other professions could be democratized through the use of many normal people in place of one highly trained person.

This research also shows how it is possible to break up tasks that may traditionally have been thought of as atomic. Transcribing audio is a very hard task to solve using crowd workers because there are no natural discrete tasks that could be sent to them; the stream of audio is continuous and always changing. However, this paper shows that it is possible to break this activity into manageable chunks that crowd workers can accomplish; the researchers just needed to think outside the box. I think this kind of thinking will become increasingly important as more and more work is crowdsourced. As we learn how to solve more and more problems using crowdsourcing, the question becomes less about whether we can solve a problem with the crowd and much more about how we can break the problem into manageable pieces that the crowd can do. This kind of research has applications elsewhere, too, and I think it will become much more important in the future.

Questions

  1. What are some similar tasks that could be crowdsourced using a method similar to the one described in the paper?
  2. How do you think that crowdsourcing will impact the accessibility of our world? Are there other ways that crowdsourcing could make our world more accessible?
  3. Do you think there will come a time when most professions can be accomplished by crowd workers? What do you think the extent of crowd expertise will be?


03/04/2020 – Dylan Finch – Pull the Plug?

Word count: 596

Summary of the Reading

The main goal of this paper is to make image segmentation more efficient. Image segmentation, as it stands now, requires humans to help with the process; there are some images that machines simply cannot segment on their own. However, there are also many cases where an image segmentation algorithm can do all of the work itself. This presents a problem: we do not know when we can use an algorithm and when we have to use a human, so we end up having humans review all of the segmentations, which is highly inefficient. This paper tries to solve the problem by introducing an algorithm that decides when a human is required to segment an image. The process described in the paper involves scoring each machine-produced segmentation and then giving humans the task of reviewing the lowest-scoring images. Overall, the process was very effective and saved a lot of human effort.
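A minimal sketch of that allocation step (my own illustration, assuming quality scores have already been predicted by something like the paper's regression model) might look like this:

def allocate_to_humans(predicted_quality: dict[str, float], human_budget: int) -> list[str]:
    """Send the images with the lowest predicted segmentation quality to human workers."""
    ranked = sorted(predicted_quality, key=predicted_quality.get)  # worst first
    return ranked[:human_budget]

# Hypothetical predicted quality scores (e.g., predicted Jaccard index) per image.
scores = {"img_a": 0.92, "img_b": 0.41, "img_c": 0.77, "img_d": 0.35, "img_e": 0.88}
needs_human = allocate_to_humans(scores, human_budget=2)
print("route to humans:", needs_human)                              # ['img_d', 'img_b']
print("keep machine output:", [i for i in scores if i not in needs_human])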

Reflections and Connections

I think that this paper gives a great example of how humans and machines should interact, especially when it comes to humans and AI. Oftentimes, we set out in research with the goal of creating a completely automated process that removes the human and tries to create an AI or some other kind of machine that will do all of the work. This is often a bad solution. AI systems, as they currently are, are not good enough to do most complex tasks entirely on their own. For tasks like image segmentation, this is an especially big issue: these tasks are very easy for humans and very hard for AI. So, it is good to see researchers who are willing to use human strengths to make up for the weaknesses of machines. I think it is a good thing to have the two working together.

This paper also gives us some very important research, trying to answer the question of when we should use machines and when we should use humans. This is a very tough question, and it comes up in a lot of different fields. Humans are expensive, but machines are often imperfect, and it can be very hard to decide when you should use one or the other. This paper does a great job of answering this question for image segmentation, and I would love to see similar research in other fields explaining when it is best to use humans versus machines there.

While I like this paper, I do also worry that it is simply moving the problem, rather than actually solving it. Now, instead of needing to improve a segmentation algorithm, we need to improve the scoring algorithm for the segmentations. Have we really improved the solution or have we just moved the area that now needs further improvement? 

Questions

  1. How could this kind of technology be used in other fields? How can we more efficiently use human and machine strengths together?
  2. In general, when do you think it is appropriate to create a system like this? When should we not fully rely on AI or machines?
  3. Did this paper just move the problem, or do you think that this method is better than just creating a better image segmentation algorithm? 
  4. Does creating systems like this stifle innovation on the main problem?
  5. Do you think machines will one day be good enough to segment images with no human input? How far off do you think that is?


3/4/20 – Jooyoung Whang – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

In this paper, the authors study the effectiveness of vision-to-language systems for automatically generating alt text for images and the impact of human-in-the-loop approaches on this task. The authors set up four methods for generating alt text. The first is a straightforward application of modern vision-to-language alt text generation. The second is a human-adjusted version of the first. The third is a more involved method in which a Blind or Visually Impaired (BVI) user chats with a non-BVI user to gain more context about an image. The final method generalizes the third: the authors analyzed the patterns of questions asked during the third method to form a structured set of pre-defined questions that a crowd worker can answer directly, without the need for a lengthy conversation. The authors conclude that current vision-to-language techniques can, in fact, harm context understanding for BVI users and that simple human-in-the-loop methods significantly outperform them. They also found that the structured-question method worked best.

This was an interesting study that implicitly pointed out the limitation of computers in understanding social context, which is a human affordance. The authors stated that the output of a vision-to-language system often confused users because the system did not get the point. This made me wonder whether this limitation could be overcome in the future.

I was also concerned about whether the authors' proposed methods are even practical. Sure, the human-in-the-loop method involving MTurk workers greatly enhanced the descriptions of Twitter images, but based on their report, it takes too long to retrieve a description. The paper reports that answering one of the structured questions takes one minute on average, excluding the time it takes for an MTurk worker to accept a HIT. The authors suggest pre-generating alt text for popular Tweets, but this does not completely solve the problem.

I was also skeptical about the way the authors performed validation with the seven BVI users. In their validation, they simulated their third method (TweetTalk, a conversation between BVI and sighted users). However, they did not do it using their application, but rather through a face-to-face conversation between the researchers and the participants. The authors claimed that they tried to replicate the environment as closely as possible, but I think there can still be flaws, since the researchers serving as the sighted user already had expert knowledge of the experiment. Also, as stated in the paper's limitations section, the validation was performed with too few participants, which may not fully capture BVI users' behaviors.

These are the questions that I had while reading this paper:

1. Do you think the authors’ proposed methods are actually practical? What could be done to make them practical if you don’t think so?

2. What do you think were the human affordances needed for the human element of this experiment other than social awareness?

3. Do you think the authors’ validation with the BVI users is sound? Also, the validation was only done for the third method. How can the validation be done for the rest of the methods?
