03/04/2020 – Myles Frantz – Real-time captioning by groups of non-experts

Summation

Machine learning is at the forefront of many technologies, yet it remains notably inaccurate; YouTube's auto-generated captions for recorded videos, for example, still show the infancy of the technology. To go beyond this, the team at Rochester created a hybrid approach that combines the efforts of multiple crowd workers to produce captions more accurately and more quickly. The methodology can either verify the output of existing machine learning algorithms or generate the captions itself. Throughout the experiment, ties are broken by majority vote. In terms of precision and Word Error Rate, the resulting Scribe system is comparable to other captioning systems, but at a lower cost.
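As a rough illustration of that tie-breaking step, here is a minimal Python sketch of word-level majority voting over worker captions. It assumes the captions have already been aligned word-by-word (the hard part, which the paper's system handles separately); the function name and toy inputs are mine, not the paper's.

```python
from collections import Counter

def majority_vote(aligned_captions):
    """Pick the most common word at each aligned position.

    aligned_captions: one word list per crowd worker, pre-aligned so
    index i refers to the same spoken word in every list (None marks
    a word a worker missed).
    """
    merged = []
    for position in zip(*aligned_captions):
        votes = Counter(w for w in position if w is not None)
        if votes:
            merged.append(votes.most_common(1)[0][0])
    return " ".join(merged)

# Three workers transcribe the same short phrase.
workers = [
    ["crowd", "work", "can", "caption", "speech"],
    ["crowd", "work", "might", "caption", "speech"],
    ["crowd", "work", "can", None, "speech"],
]
print(majority_vote(workers))  # crowd work can caption speech
```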

Response

I can see how combining the two aspects, crowd workers and the initial baseline, could create an accurate process for generating captions. Using crowd workers to assess and verify the baseline captions ensures the quality of the generated captions and could ultimately feed back into improving the machine learning algorithm. Going further, more workers can be given jobs and the captioning system can keep improving, which expands both the jobs available and the core machine learning algorithm itself.

Questions

  • I am not experienced in this specific field, and setting aside the 2012 publication date, this combination of crowd workers verifying auto-generated captions does not seem especially novel. Their survey of the state of the art did not include crowd workers in any capacity, though this may simply have been outside their scope. In your opinion, does this research stand up to more recent auto-captioning papers, or is it a product of its time?
  • A potential problem within the crowd-working community: the technique uses a majority vote to confirm which words accurately represent a phrase. Although there may be statistics ensuring the Mechanical Turk workers have sufficient experience and can be relied on, this area may be vulnerable to malicious actors outnumbering non-malicious ones. Given that phrases are interpreted and written out explicitly, do you think a scenario like the Mountain Dew naming campaign (Dub The Dew – https://www.huffpost.com/entry/4chan-mountain-dew_n_1773076), in which a group of malicious actors flooded the submissions, could happen to this type of system?
  • With this technology, the raw audio of a speech or event is fed directly to the Mechanical Turk workers using the Scribe program. Depending on the environment where the speech was given or the quality of the microphone, even a majority of workers may be unable to make out the correct words, regardless of the playback volume. Could this technology be combined with machine learning algorithms that isolate and remove white noise or smaller conversations around the event's main speakers?


03/04/2020 – Myles Frantz – Combining crowdsourcing and Google Street View to identify street-level accessibility problems

Summation

Taxes in the US are a divisive topic, and unfortunately public infrastructure such as road maintenance is left to take the impact. With the limited resources generally available, states typically allocate only enough to fix outright problems, generally leaving out accessibility improvements. A team from the University of Maryland prototyped crowd-sourced identification of these missing accessibility features. They developed a system built on Google Street View imagery in which users identify problems with the streetscape, and subsequent users can confirm or deny the conclusions of previous users. They ran the experiment on 229 images manually collected from Google Street View, once with 3 wheelchair users and then with 185 Mechanical Turk workers, achieving an accuracy of at least 78% against the ground truth. Further trimming away the lower-ranking turkers raised accuracy by about 6%, at the cost of filtering out 52% of the turkers.
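To make the worker-filtering result concrete, here is a hedged sketch of threshold-based filtering: score each worker against a small gold-standard set and keep only labels from workers above a cutoff. The threshold value, function name, and toy data are my assumptions, not numbers from the paper.

```python
def filter_workers(worker_scores, labels, threshold=0.75):
    """Keep labels only from workers whose gold-set accuracy meets
    the threshold; also report what fraction of workers was dropped.

    worker_scores: dict of worker id -> accuracy on gold images
    labels: list of (worker_id, image_id, label) tuples
    """
    trusted = {w for w, score in worker_scores.items() if score >= threshold}
    kept = [(w, img, lab) for (w, img, lab) in labels if w in trusted]
    dropped = 1 - len(trusted) / len(worker_scores)
    return kept, dropped

scores = {"w1": 0.90, "w2": 0.60, "w3": 0.80, "w4": 0.50}
labels = [("w1", "img1", "no curb ramp"), ("w2", "img1", "no problem")]
kept, dropped = filter_workers(scores, labels)
print(kept, f"(dropped {dropped:.0%} of workers)")
```

As in the paper's result, raising the threshold trades away workers for accuracy.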

Response

I can appreciate this approach since I believe (as the paper also states) that a manual effort to identify accessibility problems would cost a great deal of money and time, and both are typically sticking points in government contracts. Though this kind of work may not be ready to hand over to crowd workers entirely, the accuracy results make an increasingly strong argument for it. The study also showed that trimming the pool of workers in exchange for better results was ultimately fruitful, and that may align well with the kind of budget a government could provide.

Questions

  • Manually creating ground truth from experts would likely be unsustainable, since the cost of that specific requirement would only increase. Since I don't believe you can require workers to have a particular disability on Amazon Mechanical Turk, if this kind of work were based solely on Mechanical Turk (or other crowdsourcing tools), would the ground truth ultimately suffer from the lack of lived experience and expertise?
  • This may be an outsider's perspective; however, there seems to be a definite split of ideas within the paper: creating a system for crowd workers to identify problems, and creating a system to optimize which crowd workers work on the project. Do you think both ideas were weighted equally throughout the paper, or was the Google Street View system mainly a vehicle for the worker-optimization techniques?
  • Since these images (solely from Google Street View) are taken only periodically (due to the resource constraints of the Street View cars), they are likely to be old and to predate recent construction. When there is a delay in Street View updates, structures and buildings may have changed without the system reflecting it. Do you think streets change enough that the turkers' work would become obsolete?


03/04/2020 – Dylan Finch – Real-Time Captioning by Groups of Non-Experts

Word count: 564

Summary of the Reading

This paper aims to improve the accessibility of audio streams by making it easier to create captions for deaf listeners. The typical solution is to hire stenographers: expensive, highly trained professionals who require specialized keyboards. Alternatively, people with less training can create captions, but their captions take longer to write, creating latency between what is said in the audio and the captions. This is undesirable because it makes it harder for a deaf person to connect the captions with any accompanying video. This paper aims to marry cheap, easy-to-produce captions with real-time creation and low latency. The solution is to use many people who do not require specialized training: working together, a group of crowd workers can achieve high caption coverage of audio with a latency of only 2.9 seconds.

Reflections and Connections

I think that this paper highlights one of the coolest things that crowdsourcing can do: it can take big, complicated tasks that used to require highly trained individuals and make them accomplishable by ordinary people. This is extremely powerful. It makes all kinds of technologies and techniques much more accessible. It is hard to hire one highly trained stenographer, but it is easy to hire a few normal people. This is the same idea that powers Wikipedia: many people make small edits, each contributing their own specialized knowledge, and together they create a highly accessible and complete collection of knowledge. This same principle can and should be applied to many more fields. I would love to see what other professions could be democratized by using many normal people to replace one highly trained person.

This research also shows how it is possible to break up tasks that were traditionally thought of as atomic. Transcribing audio is a hard task to solve with crowd workers because there are no obvious discrete tasks that could be sent out; the stream of audio is continuous and always changing. However, this paper shows that the activity can be broken into manageable chunks that crowd workers can handle; the researchers just needed to think outside the box. I think this kind of thinking will become increasingly important as more and more work is crowdsourced. As we learn to solve more problems with crowdsourcing, the question becomes less "can we solve this with the crowd" and more "how can we break this problem into manageable pieces the crowd can do". This kind of research has applications elsewhere too, and I expect it to matter even more in the future. A sketch of how a continuous stream might be chunked for workers appears below.
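Here is a minimal sketch of that chunking idea: a continuous stream split into short, overlapping windows that individual workers could transcribe. The window and overlap sizes are made up for illustration and are not the paper's actual parameters.

```python
def chunk_stream(samples, rate=16000, window_s=5.0, overlap_s=1.0):
    """Yield (start_time, chunk) pairs of overlapping audio windows.

    samples: a sequence of audio samples at `rate` Hz.
    """
    step = int((window_s - overlap_s) * rate)
    window = int(window_s * rate)
    for start in range(0, len(samples), step):
        yield start / rate, samples[start:start + window]

# 30 seconds of silent placeholder audio, chunked for workers.
audio = [0.0] * (30 * 16000)
for t, chunk in chunk_stream(audio):
    print(f"chunk starting at {t:.1f}s ({len(chunk)} samples)")
```

The overlap gives adjacent chunks shared audio, which is what lets a downstream merger align the partial transcripts back into one stream.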

Questions

  1. What are some similar tasks that could be crowdsourced using a method similar to the one described in the paper?
  2. How do you think that crowdsourcing will impact the accessibility of our world? Are there other ways that crowdsourcing could make our world more accessible?
  3. Do you think there will come a time when most professions can be accomplished by crowd workers? What do you think the extent of crowd expertise will be?


03/04/2020 – Nurendra Choudhary – Real-time captioning by groups of non-experts

Summary

In this paper, the authors discuss a collaborative real-time captioning framework called LEGION:SCRIBE. They compare their system against the previous approach, CART, and Automated Speech Recognition (ASR) systems. The authors open with the benefits of captioning, then explain the expensive cost of hiring stenographers. Stenographers are the fastest and most accurate captioners, with access to specialized keyboards and expertise in the area, but they are prohibitively expensive ($100–120 an hour). ASR is much cheaper, but its low accuracy makes it inapplicable in most real-world scenarios.

To alleviate these issues, the authors introduce the SCRIBE framework. In SCRIBE, crowd workers caption smaller parts of the speech, and the parts are merged by an independent framework to form the final sentence. The system's latency is 2.89 s, emphasizing its real-time nature, which is a significant improvement over CART's roughly 5 s.

Reflection

The paper introduces an interesting approach to collating data from multiple crowd workers for sequence learning tasks. The method has been applied before, for instance in Google Translate (translating small phrases) and ASR (recognizing speech segments). SCRIBE distinguishes itself by making the improvement in real time. However, the system relies on the availability of crowd workers, which may lead to unreliable behavior. Additionally, the hired workers are not professionals, so quality is affected by human factors such as mindset, emotion, and mental stamina. I believe the evolution of SCRIBE over time, and its dependence on such factors, needs to be analyzed.

Furthermore, I question the crowd management system. Amazon Mechanical Turk cannot guarantee real-time labor. Currently, given the supply of workers relative to tasks, workers are always available, but as more users adopt the system this need not hold true. Crowd management systems should therefore offer guarantees for such requirements, and the work provider needs a fallback that maintains real-time interaction in case the crowd fails. In SCRIBE's case, the authors could append an ASR module for situations of crowd failure; ASR may not give the best results, but it would ensure a smoother user experience.
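As a sketch of what such a fallback might look like, the snippet below routes an audio chunk to ASR whenever the crowd caption misses a real-time budget. Both captioner interfaces are hypothetical stand-ins of mine, not SCRIBE's actual API.

```python
import time

def caption_with_fallback(chunk, crowd_captioner, asr_captioner, budget_s=3.0):
    """Prefer the crowd caption, but fall back to ASR if the crowd
    is unavailable or too slow for the real-time budget."""
    start = time.monotonic()
    caption = crowd_captioner(chunk)  # may be slow or return None
    if caption and time.monotonic() - start <= budget_s:
        return caption, "crowd"
    return asr_captioner(chunk), "asr"  # lower quality, always available

# Toy stand-ins for the two services.
no_workers = lambda chunk: None                # simulate crowd failure
toy_asr = lambda chunk: "approximate asr text"
print(caption_with_fallback(b"audio-bytes", no_workers, toy_asr))
```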

The current development approach does not consider the volatility of crowd management systems, which makes them an external single point of failure. I think there should be a push toward adopting multiple management systems simultaneously to increase reliability. This would also improve efficiency, since the framework would have a more diverse set of results to choose from, benefiting the overall model structure and user adoption.

Questions

  1. Google Translate uses a similar strategy, asking its users to translate parts of sentences. Can this technique be applied globally to any sequence learning framework? Is there a way to divide sequences into independent segments? For dependent segments, can we reuse a similar merging module, or is the merge always problem-dependent?
  2. The system depends on the availability of crowd workers. Should there be a study of the availability aspect? What kinds of systems would benefit from it?
  3. Should there be a new crowd-work management system solely focused on providing real-time work?
  4. Should the responsibility of ensuring real-time nature be on the management system or the work provider? How will it impact the current development framework?

Word Count: 567


03/04/2020 – Palakh Mignonne Jude – Combining Crowdsourcing and Google Street View To Identify Street-Level Accessibility Problems

SUMMARY

The authors of this paper investigate the feasibility of recruiting MTurk workers to label and assess sidewalk accessibility problems as viewed through Google Street View. The authors conducted two studies: the first with 6 people (3 members of their research team and 3 wheelchair users), and the second investigating the performance of turkers. They created an interactive labeling interface as well as a validation interface (to let users accept or reject previous labels). The authors proposed different levels of annotation correctness along two spectra: a localization spectrum, covering image-level and pixel-level granularity, and a specificity spectrum, covering the amount of information evaluated for each label. They defined image-level correctness in terms of accuracy, precision, recall, and f-measure, and used Fleiss' kappa to compute inter-rater agreement at the image level. To evaluate the more challenging pixel-level agreement, they verified that pixel-level overlap was greater between labelers on the same image than across different images. The labels produced in Study 1 served as the ground-truth dataset for evaluating turker performance. The authors also proposed two quality-control approaches: filtering turkers based on a performance threshold, and filtering labels based on crowdsourced validations.
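For reference, below is a minimal implementation of the Fleiss' kappa statistic the authors use for image-level inter-rater agreement; the toy rating counts are invented for illustration.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-rater agreement.

    ratings: n x k matrix where ratings[i][j] is the number of raters
    who put item i in category j; every row sums to the rater count r.
    """
    n, k = len(ratings), len(ratings[0])
    r = sum(ratings[0])
    # Mean per-item agreement.
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings
    ) / n
    # Chance agreement from marginal category proportions.
    p_e = sum(
        (sum(row[j] for row in ratings) / (n * r)) ** 2 for j in range(k)
    )
    return (p_bar - p_e) / (1 - p_e)

# 4 images rated by 3 workers each; categories: [problem, no problem].
counts = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(fleiss_kappa(counts))  # 0.625
```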

REFLECTION

I really liked the motivation of this paper, especially given the large number of people with physical disabilities. I am very interested to know how something like this would extend to other countries such as India, where it would greatly aid people with physical disabilities, since many places have poor walking surfaces and no wheelchair support. Having such a system in place in India would definitely help disabled people be better informed about places they can visit.

I also liked the quality-control mechanisms of filtering turkers and filtering labels, since these appear to be good ways to improve the overall quality of the labels obtained. I thought it was interesting that the performance of the system improved with turker count, but that the gains diminished in magnitude as the group size grew. The design of the labeling and verification interfaces was good and made it easy for users to perform their tasks.

QUESTIONS

  1. As indicated in the limitations section, this work ‘ignored practical aspects such as locating the GSV camera in geographical space and selecting an optimal viewpoint’. Has any follow-up study been performed that takes into account these physical aspects? How complex would it be to conduct such a study?
  2. The authors mention that image quality can be poor in some cases due to a variety of factors. How much of an impact would this cause to the task at hand? Which labels would have been most affected if the image quality was very poor?
  3. The validation of labels was performed by crowd workers via the verification interface. Would there have been any change in the results obtained if experts had been used for the validation of labels instead of crowd workers (since they may have been able to identify more errors in the labels as compared to normal crowd workers)?


03/04/20 – Akshita Jha – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary:
“Pull the Plug? Predicting If Computers or Humans Should Segment Images” by Gurari et al. addresses image segmentation: the process of partitioning a single image into multiple segments in order to simplify it into something easier to analyze. The authors propose a resource allocation framework that predicts when it is best to use a computer to segment an image and when to switch to humans. They implement two systems that automatically decide whether a computer or a human should do the work: one for the initial coarse segmentation and one for the final fine-grained segmentation. Experiments show that this mixed human-computer model beats state-of-the-art systems for image segmentation. The framework, “Pull the Plug”, takes an image and predicts whether its annotation should come from a human or a computer. The authors evaluate the model using Pearson’s correlation coefficient (CC) and mean absolute error (MAE): CC indicates how strongly the predicted quality scores correlate with the actual scores given by the Jaccard index on the ground truth, and MAE is the average prediction error. The authors also experiment thoroughly with initializing segmentation tools and with reducing the human effort required for initialization.
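To make that evaluation concrete, here is a small sketch of the metrics involved: the Jaccard index for segmentation overlap, plus Pearson's CC and MAE for judging a quality predictor against it. The toy numbers are mine, not results from the paper.

```python
import numpy as np

def jaccard(pred_mask, true_mask):
    """Intersection over union of two boolean masks."""
    pred, true = np.asarray(pred_mask, bool), np.asarray(true_mask, bool)
    union = np.logical_or(pred, true).sum()
    return np.logical_and(pred, true).sum() / union if union else 1.0

def evaluate_quality_predictor(predicted_scores, true_jaccards):
    """Pearson CC and MAE between predicted segmentation quality
    and the ground-truth Jaccard scores."""
    p, t = np.asarray(predicted_scores), np.asarray(true_jaccards)
    cc = np.corrcoef(p, t)[0, 1]
    mae = np.abs(p - t).mean()
    return cc, mae

# A predictor that slightly over-estimates quality on four images.
pred = [0.90, 0.50, 0.70, 0.30]
true = [0.85, 0.40, 0.65, 0.20]
print(evaluate_quality_predictor(pred, true))
```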

Reflections:
This is an interesting work that successfully makes use of mixed modes involving both humans and computers to enrich the precision and accuracy of a task. The two methods the authors design for segmenting an image are particularly thoughtful. First, given an image, the authors design a system that predicts whether automatic segmentation will suffice or the image needs human attention. This is non-trivial, as the task requires the system to possess a certain level of “intelligence”. The authors use existing segmentation tools, but the design is meant to remain agnostic to any particular tool: the system ranks several segmentation tools using the authors' quality-prediction model and allocates the available human budget to create coarse segmentations where needed. The second system builds on the coarse segmentation from the first, refining it and allocating the available human budget to create fine-grained segmentations for the images with the lowest predicted quality. Both tasks rely on the authors' proposed system for predicting the quality of a candidate segmentation.

Questions:
1. The authors rely on their proposed system of predicting the quality of candidate segmentations. What kind of errors do you expect?
2. Can you think of a way to improve this system?
3. Can we replace the segmentation quality prediction system with a human? Do you expect the system to improve or would the performance go down? How would it affect the overall experience of the system?
4. In most such systems, humans are needed only for annotation. Can we think of more creative ways to engage humans while improving the system performance?


Subil Abraham – 03/04/2020 – Real-time captioning by groups of non-experts

This paper pioneers the use of crowd work for closed captioning. The scenario targeted is classes and lectures, where a student can hold up their phone, record the speaker, and have the sound transmitted to crowd workers. The audio is handed to workers in bite-sized pieces to transcribe, and the paper's implementation of a multiple sequence alignment algorithm combines those transcriptions. The focus is very much on real-time captioning, so the amount of time a crowd worker can spend on a portion of sound is limited. The authors design interfaces on the worker side to promote continuous transcription, and on the user side to let users correct received captions in real time, further enhancing quality. The authors had to deal with interesting challenges in resolving transcription errors, which they did by comparing transcriptions of the same section from different crowd workers and using bigram and trigram data to validate word ordering. Evaluations showed that precision stayed stable while coverage increased with the number of workers, with a lower error rate than automatic transcription and untrained individual transcribers.

One thing worth pointing out is that ASR has been rapidly improving and has made significant strides since this paper was published. From my own anecdotal experience, YouTube's automatic closed captions are getting very close to fully accurate (though, thinking back on our reading of the Ghost Work book at the beginning of the semester, I wonder if YouTube is cheating a bit and using crowd work for some of their videos to help its captioning AI along). I also find the authors' solution for merging the transcriptions of the different sound bites interesting. How they would solve that was the first thing on my mind, because it was never going to be a matter of simply aligning timestamps; those were definitely going to be imprecise. So I do like their clever multi-part solution. Finally, I was a little surprised and disappointed that the WER was around 45%, much higher than I expected; I had hoped the error rate would be much closer to that of professional transcribers. The software still has a way to go there.
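For context on that 45% figure, word error rate is the standard Levenshtein edit distance over words, normalized by the reference length. Here is a small sketch with invented sentences:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```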

  1. How could you get the error rate down to the professional transcriber’s level? What is going wrong there that is causing it to be that high?
  2. It’s interesting to me that they couldn’t just play isolated sound clips but instead had to raise and lower volume on a continuous stream for better accuracy. Where are the other places humans work better when they have a continuous stream of data rather than discrete pieces of data?
  3. Is there an ideal balance between choosing precision and coverage in the context of this paper? This was something that also came up in last week’s readings. Should the user decide what the balance should be? How would they do it when there can be multiple users all at the same location trying to request captioning for the same thing?


Subil Abraham – 03/04/2020 – Pull the Plug

The paper proposes a way to decide when a computer or a human should do the work of foreground segmentation of images. Foreground segmentation is a common computer vision task in which one element of an image is the focus and is what is needed for further processing. Automatic foreground segmentation is not always reliable, so sometimes humans must do it, and the important question is which images to send to humans, because hiring humans is expensive. The paper proposes a machine learning method that predicts the quality of a given coarse or fine-grained segmentation and decides whether a human needs to step in. The authors evaluate the framework across several segmentation algorithms and achieve quality equivalent to 100% human work using only 32.5% human effort for GrabCut, 65% for Chan-Vese, and 70% for Lankton.
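A minimal sketch of that allocation idea, assuming a quality predictor already exists: rank images by predicted quality and spend the fixed human budget on the worst ones. The function name and scores are made up, not the paper's interface.

```python
def allocate_human_budget(predicted_quality, budget):
    """Route the lowest-predicted-quality images to humans, up to a
    fixed budget; keep the computer's segmentation for the rest.

    predicted_quality: dict of image id -> predicted quality score
    budget: number of images humans can afford to segment
    """
    ranked = sorted(predicted_quality, key=predicted_quality.get)
    to_humans = set(ranked[:budget])
    return {img: ("human" if img in to_humans else "computer")
            for img in predicted_quality}

scores = {"a": 0.92, "b": 0.41, "c": 0.77, "d": 0.30}
print(allocate_human_budget(scores, budget=2))
# {'a': 'computer', 'b': 'human', 'c': 'computer', 'd': 'human'}
```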

The authors pursue a truly interesting idea: rather than trying to create a better automatic image segmenter, they create a way of determining whether the automatic segmentation is good enough. My initial thought was, couldn't something like this be used to just build a better automated segmenter? If you can tell the quality, you know how to improve it. But apparently that is a hard enough problem that it is far more helpful to defer to a human when the predicted segmentation quality is not where you want it. Interestingly, although they talk about pulling the plug on both computers and humans, the paper seems focused on pulling the plug on computers; the human workers are the backup plan when the computer cannot do quality work, not the other way around. This applies to both their cases, coarse-grained and fine-grained segmentation. I would like to see future work where the primary work is done by humans first, testing when pulling the plug on the human would be effective and where productivity would increase. That would have to be work that is purely in the human domain (regular office work wouldn't qualify, since it is easily automatable).

  1. What are examples of work where we would pull the plug on the human first, rather than pulling the plug on the computer?
  2. It’s an interesting turn around that we are using AI effort to determine quality and decide when to bring humans in, rather than improving the AI of the original task itself. What other tasks could you apply this, where there are existing AI methods but an AI way of determining quality and deciding when to bring in humans would be useful?
  3. How would you set up a segmentation workflow (or another application’s workflow) where when you pull the plug on the computer or human, you are giving the best case result to the other for improvement, rather than starting over from scratch?


03/04/2020 – Pull the Plug? Predicting If Computers or Humans Should Segment Images – Yuhang Liu

Summary:

This paper examines a new image segmentation method. Image segmentation, the process of partitioning a single image into multiple segments, is a key step in any image analysis task. Existing methods range from labor-intensive manual approaches to fully automated ones, but each has disadvantages. The authors therefore propose an allocation framework that predicts how best to assign a fixed human labor budget to collect higher-quality segmentations for a given image and automated method. Specifically, the authors implemented two systems that decide, for each image (a sketch of this coarse-to-fine flow follows the list):

  1. whether a computer or a human should create the rough segmentation needed to initialize the segmentation tool, and
  2. whether a computer or a human should create the final fine-grained segmentation.

The final experiments proved that this hybrid, interactive segmentation system can achieve faster and more efficient segmentation.
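As an illustration of that coarse-to-fine flow, here is a hedged sketch using OpenCV's GrabCut (one of the segmentation tools these reviews mention): a coarse bounding box, supplied by either a computer or a human, initializes the tool, which then produces the fine-grained mask. The image path and box coordinates are placeholders of mine.

```python
import cv2
import numpy as np

img = cv2.imread("example.jpg")            # replace with a real image
mask = np.zeros(img.shape[:2], np.uint8)
coarse_box = (50, 50, 200, 150)            # x, y, w, h around the foreground
bgd_model = np.zeros((1, 65), np.float64)  # GrabCut's internal GMM state
fgd_model = np.zeros((1, 65), np.float64)

# Coarse initialization -> iterative fine-grained segmentation.
cv2.grabCut(img, mask, coarse_box, bgd_model, fgd_model, 5,
            cv2.GC_INIT_WITH_RECT)

# Pixels marked (probably) foreground form the fine-grained mask.
fine = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
print("foreground pixels:", int(fine.sum()))
```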

Reflection:

Once, I did a related image recognition project: a railway turnout monitoring system based on computer vision, which detects the turnout of the railroad track from a picture. The most critical step is separating the outline of the railroad track. At the time we used only computer-based separation, and the main problem we encountered was that when the scene became complicated, we faced complex line segments that affected the detection results. As this paper notes, a combined human-machine method can greatly improve accuracy; I very much agree, and hope that one day I can try it myself. What I agree with most is that the system automatically assigns work rather than sending every photo through the same process: for one photo, only the machine participates, while another requires human processing. This variety of interaction is far more advantageous than a single method; it can greatly save workers' time without affecting accuracy, and, most importantly, complex interaction methods can adapt to a more diverse set of pictures. Finally, I think similar operations can be applied elsewhere. Assigning tasks through a system in this way can coordinate the working relationship between humans and machines in other fields, such as audio sentiment analysis and musical background separation, where humans have advantages machines cannot match but are slow and expensive. If we can generalize this idea, handling the common human-machine division of labor by giving complex cases to people and letting the machine do the rough pass first, then the cost of separation drops greatly without hurting accuracy. I believe this method has great application prospects, not only because image separation has many applications, but because we can borrow this idea to do more detailed analysis in many more fields.

Question:

  1. Is this idea of cooperation between humans and machines worth emulating?
  2. Because the system defines the working ranges of people and machines, could the machine's accuracy suffer from errors in the human work it builds on?
  3. Does human-machine cooperation pose new problems, such as increased cost?


03/04/20 – Akshita Jha – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Summary:
“Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind” by Salisbury et al. addresses the important problem of accessibility. The authors discuss the challenges that arise from automatic image captioning systems and how their imperfections may hinder a blind person’s understanding of social media posts with embedded imagery. They use mixed methods to evaluate, and subsequently modify, the captions an automated system generates for images embedded in social media posts, and they study how crowdsourcing can enhance existing workflows to provide scalable, useful alt text for the blind. The authors analyze the conversations they collected in detail in order to design user-friendly experiences that can effectively assist blind users, focusing on three research questions: (i) what value does a state-of-the-art vision-to-language API provide in assisting BVI users, and what are the areas for improvement? (ii) what are the trade-offs between alternative workflows for the crowd assisting BVI users? (iii) can human-in-the-loop workflows result in reusable content that can be shared with other BVI users? The authors study varying levels of human engagement and automation to arrive at a final system that better meets the requirements for creating good-quality alt text for blind and visually impaired users.

Reflections:
This is an interesting work, as it tackles the often-ignored problem of accessibility, focusing on images embedded in social media posts. Most of the time, the automatic captions produced by a machine-learned captioning system are inadequate and nondescriptive. This might not be much of a problem for sighted day-to-day users, but it can be a huge challenge for blind people. The authors' analysis is thoughtful and keeps accessibility in mind. They validate their approach with a follow-up study with seven blind and visually impaired users, who were asked to compare the uncorrected vision-to-language caption with the alt text produced by the authors' system. The findings showed that blind and visually impaired users would prefer the authors' conversational system for understanding images. However, taking feedback from the target user group while developing the system would have been more helpful than only asking users to test it afterwards. Also, the tweets used by the authors might not be representative of the kinds of tweets in the target users' timelines.

Questions:
1. What do you think about the approach taken by the authors to generate the alt-text?
2. Would it have been helpful to conduct a survey to understand the needs of the blind and visually impaired users before developing the system?
3. Don’t you think using a conversational agent to understand the images embedded in tweets is too cumbersome and time-consuming?
