03/04/20 – Nan LI – Real-Time Captioning by Groups of Non-Experts


In this paper, the authors focus on the main limitations of real-time captioning. They point out that captions with high accuracy and low latency require expensive stenographers, who must be booked in advance and are trained to use specialized keyboards. The less expensive option is automatic speech recognition (ASR). However, its low accuracy greatly degrades the user experience and causes many inconveniences for deaf people. To alleviate these problems, the authors introduce an end-to-end system called LEGION: SCRIBE, which enables multiple workers to provide simultaneous captioning in real time and combines their input into a final answer with high precision, high coverage, and low latency. The authors experimented with crowd workers and local participants and compared the results with CART, ASR, and individual workers. The results indicate that this end-to-end system, with a group of workers, can outperform both individuals and ASR in coverage, precision, and latency.


First, I think the authors made a good point about the limitations of real-time captioning, especially the inconvenience they cause for deaf and hard of hearing people. Thus, the greatest contribution of this end-to-end system is making a cheap and reliable real-time captioning channel accessible. However, I have several concerns about it.

First, this end-to-end system requires a group of workers. Even if each person is paid a low wage, the total pay for all workers becomes a significant overhead as the captioning time increases.

Second, a caption with high precision, high coverage, and low latency requires five or more workers to work together. As mentioned in the experiment, the MTurk workers also need to watch a 40s video to learn how to use the system. Therefore, the system may not be able to find the required number of workers in time.

Third, the system only combines the work of all workers. Thus, there is a coverage problem: if all of the workers miss a piece of information, the system output will be incomplete. In my experience, if one person misses part of the information, most other people usually miss it too. In the example presented in the paper, no worker typed “non-aqueous,” which was used in a clip about chemistry.

Finally, I am considering combining human correction with ASR captions. Humans have the strength of remembering previously mentioned knowledge, for example an abbreviation, yet they cannot type fast enough to cover all the content. ASR, in contrast, usually does not miss any portion of the speech, yet it makes some unreasonable mistakes. Thus, it might be a good idea to let humans correct inaccurate ASR captions instead of trying to type all of the speech content.


  • What do you think of this end-to-end system? Can you evaluate it from different perspectives, such as expense, accuracy?
  • How would you solve the problem of inadequate speech coverage?
  • What do you think of the idea of combining human and ASR work? Do you think it will be more efficient or less efficient?

Word Count: 517


03/04/20 – Lee Lisle – Real-Time Captioning by Groups of Non-Experts


Lasecki et al. present a novel captioning system for the deaf and hard of hearing population called LEGION:SCRIBE. Their implementation crowdsources multiple people per audio stream to achieve low latency as well as highly accurate results. They note that the competing options are professional stenographers (who are expensive) and automatic speech recognition (ASR, which has large issues with accuracy). They then go over how they intend to evaluate SCRIBE, with their Multiple Sequence Alignment (MSA) approach that aligns the output from multiple crowd workers to get the best possible caption. Their approach also allows the quality to be tuned toward coverage or precision, where coverage provides a more complete caption and precision attains a lower word error rate. They then conducted an experiment in which they transcribed a set of lectures using various methods, including several configurations of SCRIBE (varying the number of workers and the coverage) and an ASR. SCRIBE outperformed the ASR in both latency and accuracy.

Personal Reflection

This work is pretty relevant to me, as my semester project is on transcribing notes for users in VR. I was struck by how quickly they were able to get captions from the crowd, but also by how many errors were still present in the finished product. In Figure 11, the WER for CART was a quarter of that of their method, which got only slightly more than half of the words correct. And in Figure 14, none of the transcriptions seem terribly acceptable, though CART was close. I wonder if their WER was so poor because of the nature of the talks or because there were multiple speakers in each scene. I wish that they had discussed how much impact multiple speakers have on transcription services, rather than giving the somewhat vague descriptions they did.

It was interesting that they could get the transcriptions done through Mechanical Turk at a rate of $36 per hour. This is roughly 1/3 of the cost of their professional stenographer (at $1.75 per minute, or $105 per hour). The cost savings are impressive, though the coverage could be a lot better.
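To make the arithmetic behind that comparison explicit, here is a minimal Python sketch. It uses only the two rates quoted above ($36 per hour for the crowd, $1.75 per minute for the stenographer); the function and constant names are my own and purely illustrative.

```python
# Rates quoted in the reflection above; all names here are illustrative.
CROWD_RATE_PER_HOUR = 36.0         # Mechanical Turk cost, dollars per hour
STENOGRAPHER_RATE_PER_MIN = 1.75   # professional stenographer, dollars per minute


def hourly_rate(rate_per_minute: float) -> float:
    """Convert a per-minute rate into a per-hour rate."""
    return rate_per_minute * 60


def cost_ratio(crowd_per_hour: float, expert_per_hour: float) -> float:
    """Return the crowd cost as a fraction of the expert cost."""
    return crowd_per_hour / expert_per_hour


stenographer_per_hour = hourly_rate(STENOGRAPHER_RATE_PER_MIN)
ratio = cost_ratio(CROWD_RATE_PER_HOUR, stenographer_per_hour)

print(f"Stenographer: ${stenographer_per_hour:.2f} per hour")
print(f"Crowd cost as a fraction of stenographer cost: {ratio:.2f}")
```

The ratio comes out a little above one third, which matches the "roughly 1/3" estimate.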

Lastly, I was glad they included one of their final sections, “Leveraging Hybrid Workforces,” as it is particularly relevant to this class. They were able to increase their coverage and precision by including an ASR as one of the inputs into their MSA combiner, regardless of whether they were using one worker or ten. This indicates that there is a lot of value in human-AI collaboration in this space.


  1. If such low latency wasn’t a key issue, could the captions get an even lower WER? Is it worth a 20 second latency? A 60 second latency? Is it worth the extra cost it might incur?
  2. Combined with our reading last week on acceptable false positives and false negatives from AI, what is an acceptable WER for this use case?
  3. Their MSA combiner showed a lot of promise as a tool for potentially different fields. What other ways could their combiner be used?
  4. Error checking is a problem in many different fields, but especially crowdsourcing as the errors can be caused in many different ways. What other ways are there to combat errors in crowdsourcing? Would you choose this way or another?


03/04/2020- Myles Frantz – Real-time captioning by groups of non-experts


Machine learning is at the forefront of most technologies, though it is still highly inaccurate; for example, YouTube’s auto-generated captions of recorded videos still show the infancy of the technology. To improve on this, the team at Rochester created a hybrid method that combines multiple crowd workers’ efforts to create captions more accurately and in a more timely manner. This methodology was set up either to verify the output of previous machine learning algorithms or to generate captions directly. Throughout the experiment, ties are broken by majority vote. Compared to other captioning systems, Scribe is similar in precision and Word Error Rate, though at a lower cost.


I could see how combining the two aspects, crowd workers and the initial baseline, could create a good and accurate process for generating captions. Using crowd workers to assess and verify the baseline captions ensures the quality of the generated captions and has the potential to improve the machine learning algorithm. Furthermore, more workers can be given jobs and the captioning system could ultimately improve, benefiting both the jobs available and the core machine learning algorithm itself.


  • I am not currently experienced in this specific field, but disregarding the publishing date of 2012, this combination of crowd workers verifying auto-generated captions does not seem ultimately novel. Though their study of the latest and greatest in the field didn’t include any crowd workers in any capacity, this may have been a limit of their scope. In your opinion, does this research stand up to some of the more recent papers on auto-captioning, or is it just a product of its time?
  • Potentially a problem within the crowd working community: their technique uses a majority vote to confirm which words accurately represent the phrase. Though there may be some statistics for ensuring that Mechanical Turk workers have sufficient experience and can be relied on, this area may be vulnerable to malicious actors outnumbering the non-malicious actors. Since the phrases are interpreted and explicitly written by workers, do you think a scenario similar to the Mountain Dew naming campaign (Dub The Dew – https://www.huffpost.com/entry/4chan-mountain-dew_n_1773076), in which a group of malicious actors flooded the name submissions, could happen to this type of system?
  • With this technology, the raw audio of a speech or event would be fed directly to the Mechanical Turk workers working on the Scribe program. Depending on the environment where the speech was given or the quality of the microphone, not even a majority of workers may be able to hear the correct words (potentially regardless of the volume of the speech). Could there be a future in combining this kind of technology with machine learning algorithms that isolate and remove the white noise or smaller conversations around the main speakers of the event?


03/04/2020 – Dylan Finch – Real-Time Captioning by Groups of Non-Experts

Word count: 564

Summary of the Reading

This paper aims to improve the accessibility of audio streams by making it easier to create captions for deaf listeners. The typical solution to this problem is to hire stenographers: expensive, highly trained professionals who require specialized keyboards. In other cases, people with less training create the captions, but these captions may take longer to write, creating latency between what is said in the audio and the captions. This is not desirable, because it makes it harder for the deaf person to connect the audio with any accompanying video. This paper aims to marry cheap, easy-to-produce captions with the ability to have the captions created in real time and with little latency. The solution is to use many people who do not require specialized training. When working together, a group of crowd workers can achieve high caption coverage of audio with a latency of only 2.9 seconds.

Reflections and Connections

I think that this paper highlights one of the coolest things that crowdsourcing can do. It can take big, complicated tasks that used to require highly trained individuals and make them accomplishable by ordinary people. This is extremely powerful. It makes all kinds of technologies and techniques much more accessible. It is hard to hire one highly trained stenographer, but it is easy to hire a few normal people. This is the same idea that powers Wikipedia. Many people make small edits, using specialized knowledge that they know, and, together, they create a highly accessible and complete collection of knowledge. This same principle can and should be applied to many more fields. I would love to see what other professions could be democratized through the use of many normal people to replace one highly trained person. 

This research also shows how it is possible to break up tasks that may have traditionally been thought of as atomic. Transcribing audio is a very hard task to solve using crowd workers because there are no real discrete tasks that could be sent to crowd workers. The stream of audio is continuous and always changing. However, this paper shows that it is possible to break up this activity into manageable chunks that can be accomplished by crowd workers; the researchers just needed to think outside of the box. I think that this kind of thinking will become increasingly important as more and more work is crowdsourced. As we learn how to solve more and more problems using crowdsourcing, the issue becomes less and less “can we solve this using crowdsourcing?” and much more “how can we break up this problem into manageable pieces that can be done by the crowd?” This kind of research has applications elsewhere, too. I think that in the future this kind of research will be much more important.


  1. What are some similar tasks that could be crowdsourced using a method similar to the one described in the paper?
  2. How do you think that crowdsourcing will impact the accessibility of our world? Are there other ways that crowdsourcing could make our world more accessible?
  3. Do you think there will come a time when most professions can be accomplished by crowd workers? What do you think the extent of crowd expertise will be?


03/04/2020 – Nurendra Choudhary – Real-time captioning by groups of non-experts


In this paper, the authors discuss a collaborative real-time captioning framework called LEGION:SCRIBE. They compare their system against the previous approach, called CART, and against Automated Speech Recognition (ASR) systems. The authors begin the discussion with the benefits of captioning. They proceed to explain the high cost of hiring stenographers. Stenographers are the fastest and most accurate captioners, with access to specialized keyboards and expertise in the area. However, they are prohibitively expensive ($100-120 an hour). ASR is much cheaper, but its low accuracy renders it inapplicable in most real-world scenarios.

To alleviate these issues, the authors introduce the SCRIBE framework. In SCRIBE, crowd workers caption smaller parts of the speech. The parts are merged using an independent framework to form the final sentence. The latency of the system is 2.89s, emphasizing its real-time nature, which is a significant improvement over the ~5s of CART.


The paper introduces an interesting approach to collating data from multiple crowd workers for sequence learning tasks. The method has been applied before in cases such as Google Translate (translating small phrases) and ASR (voice recognition of speech segments). However, SCRIBE distinguishes itself by bringing real-time improvement to the system. The system does, however, rely on the availability of crowd workers, which may lead to unreliable behaviour. Additionally, the hired workers are not professionals. Hence, the quality is affected by human behavioral factors such as mindset, emotions, or mental stamina. I believe a study of how SCRIBE evolves over time, and of its dependence on such factors, needs to be conducted.

Furthermore, I question the crowd management system. Amazon Mechanical Turk cannot guarantee real-time labourers. Currently, given the supply of workers relative to the tasks, workers are always available. However, as more users adopt the system, this need not always hold true. So, crowd management systems should provide alternatives that guarantee such requirements. Also, the work provider needs to find alternatives to maintain real-time interaction in case the crowd fails. In the case of SCRIBE, the authors could fall back on an ASR module when the crowd fails. ASR may not give the best results, but it would ensure a smoother user experience.

The current development system does not consider the volatility of crowd management systems. This makes them an external single point of failure. I think there should be a push toward simultaneously adopting multiple management systems for the framework to increase its reliability. This would also improve system efficiency, because the framework would have a more diverse set of results to choose from, benefiting the overall model structure and user adoption.


  1. Google Translate uses a similar strategy by asking its users to translate parts of sentences. Can this technique be globally applied to any sequential learning framework? Is there a way we can divide sequences into independent segments? In case of dependent segments, can we just use a similar merging module or is it always problem-dependent?
  2. The system depends on the availability of crowd workers. Should there be a study on the availability aspect? What kind of systems would benefit from this?
  3. Should there be a new crowd work management system with a sole focus on providing real-time data provisions?
  4. Should the responsibility of ensuring real-time nature be on the management system or the work provider? How will it impact the current development framework?

Word Count: 567


Subil Abraham – 03/04/2020 – Real-time captioning by groups of non-experts

This paper pioneers the approach of using crowd work for closed captioning systems. The scenario they target is classes and lectures, where a student can hold up their phone and record the speaker, and the sound is then transmitted to the crowd workers. The sound is passed to the crowd workers in bite-sized pieces to transcribe, and the paper’s implementation of the multiple sequence alignment algorithms takes those transcriptions and combines them. The focus of the tool is very much on real-time captioning, so the amount of time a crowd worker can spend on a portion of sound is limited. The authors design interfaces on the worker side to promote continuous transcription, and on the user side to allow users to correct the received transcriptions in real time, enhancing the quality further. The authors had to deal with interesting challenges in resolving errors in the transcription, which they did by comparing transcriptions of the same section from different crowd workers and by using bigram and trigram data to validate the word ordering. Evaluations showed that precision was stable while coverage increased with the number of workers, with a lower error rate than automatic transcription and untrained transcribers.

One thing that needs to be pointed out about this work is that ASR is rapidly improving and has made significant strides since this paper was published. From my own anecdotal experience, YouTube’s automatic closed captions are getting very close to being fully accurate (however, thinking back on our reading of the Ghost Work book at the beginning of the semester, I wonder if YouTube is cheating a bit and using crowd work intervention for some of their videos to help their captioning AI along). I also find the authors’ solution for merging the transcriptions of the different sound bites interesting. How they would solve that was the first thing on my mind, because it was not going to be a matter of simply aligning the time stamps, which were definitely going to be imprecise. So I do like their clever multi-part solution. Finally, I was a little surprised and disappointed that the WER was at ~45%, which was a lot higher than I expected. I was expecting the error rate to be a lot closer to that of professional transcribers, but unfortunately it was not. The software still has a way to go in that respect.
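For readers unfamiliar with the metric, here is a minimal Python sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. This is the standard textbook definition, not code from the paper, and the paper’s exact normalization may differ. The function name and the example sentences are my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example: the captioner drops two of six reference words.
wer = word_error_rate(
    "the solvent is non-aqueous and stable",
    "the solvent is stable",
)
print(f"WER: {wer:.2f}")
```

Under this definition, a WER of ~45% means that the number of word-level edits needed to fix the caption is almost half the number of words actually spoken.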

  1. How could you get the error rate down to the professional transcriber’s level? What is going wrong there that is causing it to be that high?
  2. It’s interesting to me that they couldn’t just play isolated sound clips but instead had to raise and lower volume on a continuous stream for better accuracy. Where are the other places humans work better when they have a continuous stream of data rather than discrete pieces of data?
  3. Is there an ideal balance between choosing precision and coverage in the context of this paper? This was something that also came up in last week’s readings. Should the user decide what the balance should be? How would they do it when there can be multiple users all at the same location trying to request captioning for the same thing?


03/04/2020- Ziyao Wang – Real-time captioning by groups of non-experts

Traditional real-time captioning tasks are completed by professional captionists. However, hiring them is expensive. Alternatively, some automatic speech recognition systems have been developed, but these systems still perform badly when the audio quality is low or when multiple people are talking. In this paper, the authors developed a system that hires several non-expert workers to do the captioning task and merges their work to obtain a high-accuracy caption output. As the workers are paid significantly less than the experts, the cost is reduced even when multiple workers are hired. The system also performs well at collecting the workers’ output and merging it into a high-accuracy result with low latency.


When it comes to problems that require high accuracy and low latency, I have always held the view that only AI or experts can complete such tasks. However, in this paper, the authors show that non-experts can also complete this kind of task if a group of people works together.

Compared with professionals, hiring non-experts costs much less. Compared with AI, people can handle some complicated situations better. This system combines these two advantages and provides a cheap real-time captioning system with high accuracy.

This system certainly has many advantages, but we should still consider it critically. In terms of cost, it is true that hiring non-experts costs much less than hiring professional captionists. However, the system needs to hire 10 workers to get 80 to 90 percent accuracy. Even if each worker is paid a low wage, for example 10 dollars per hour, the total cost will reach 100 dollars per hour. Hiring experts costs only around 120 dollars per hour, which shows that the savings from the system are relatively low.

In terms of accuracy, there is a possibility that all 10 workers miss a part of the audio. As a result, even after merging all the results provided by the workers, the system will still miss the caption for that part. In contrast, though an AI system may produce captions with errors, it can at least provide something for every word in the audio.

For these two reasons, I think hiring fewer workers, for example three to five, to fix the errors in the system-generated caption would save more money while still maintaining high accuracy. With a caption already provided, the workers’ tasks would be easier, and they might provide more accurate results. Also, in circumstances where the AI system performs well, the workers would not need to spend time typing, and the latency of the system would be reduced.


What are the advantages of hiring non-expert humans to do the captioning compared with the experts or AI systems?

Will a system that hires fewer workers to fix the errors in the AI-generated caption be cheaper? Will this system perform better?

For the system mentioned in the second question, does it have any limitations or drawbacks?


03/04/20 – Fanglan Chen – Real-time Captioning by Groups of Non-experts


Lasecki et al.’s paper “Real-time Captioning by Groups of Non-experts” explores a new approach that relies on a group of non-expert captionists to provide speech captions of good quality, and presents an end-to-end system called LEGION: SCRIBE, which allows collective instantaneous captioning for live lectures on demand. In the speech captioning task, professional stenographers can achieve high accuracy. However, their manual effort is very expensive and must be arranged in advance. For effective captioning, the researchers introduce the idea of having a group of non-experts caption audio and merging their inputs to achieve more accurate captions. Their proposed SCRIBE has two components: an interface for real-time captioning designed to collect the partial captions from each crowd worker, and a real-time input combiner that merges the collective captions into a single output stream in real time. Their experiments show that the proposed solution is feasible and that non-experts can provide captioning of good quality and content coverage with short per-word latency. The proposed model can potentially be extended to allow dynamic groups to exceed the capacity of individuals in various human performance tasks.


This paper conducts an interesting study of how to achieve better performance of a single task via collaborative efforts of a group of individuals. I think this idea aligns with ensemble modeling in machine learning. The idea presented in the paper is to generate multiple partial outputs (provided by team members and crowd workers) and then use an algorithm to automatically merge all of the noisy partial inputs into a single output. Similarly, ensemble modeling is a machine learning method where multiple diverse models are developed to generate or predict an outcome, either by using multiple different algorithms or using different training data sets. Then the ensemble model aggregates the output of each base model and generates the final output. The motivation for relying on a group of non-expert captionists to achieve better performance beyond the capacity of each non-expert corresponds to the idea of using ensemble models to reduce the generalization error and get more reliable results. As long as the base models are diverse and independent, the performance of the model increases when the ensemble approach is used. This approach also seeks the collaborative efforts of crowds in obtaining the final results. In both approaches, even though the model has multiple human/machine inputs as its sources, it acts and performs as a single model. I would be curious to see how ensemble models perform on the same task compared with the crowdsourcing proposed in the paper.
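To illustrate the ensemble analogy, here is a minimal Python sketch of position-wise majority voting over several partial captions. This is not the paper’s MSA combiner: it assumes the hard part, aligning each worker’s words to common slots, has already been done, and it marks words a worker missed with None. The function name and the example data are my own.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(aligned: Sequence[Sequence[Optional[str]]]) -> List[str]:
    """Merge pre-aligned partial captions by voting on each word slot.

    aligned[k][i] is worker k's word for slot i, or None if that worker
    missed the slot. Slots that every worker missed are dropped, which is
    exactly the coverage gap discussed in these reflections.
    """
    if not aligned:
        return []
    n_slots = len(aligned[0])
    if any(len(row) != n_slots for row in aligned):
        raise ValueError("all workers must be aligned to the same slots")

    merged = []
    for slot in zip(*aligned):
        votes = Counter(word for word in slot if word is not None)
        if votes:
            merged.append(votes.most_common(1)[0][0])
    return merged


# Invented example: three workers each capture only part of the sentence,
# and one of them mistypes a word.
workers = [
    ["the", "quick", None, "fox"],
    ["the", "quack", "brown", None],
    [None, "quick", "brown", "fox"],
]
caption = majority_vote(workers)
print(" ".join(caption))
```

As with an ensemble of base models, no single worker produces the full sentence, but the individual gaps and the one typo are outvoted because the workers’ errors are largely independent.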

In addition, I think the framework proposed in the paper may work for general audio captioning. I am wondering how it would perform on domain-specific lectures. As we know, lectures in many domains, such as medical science, chemistry, and psychology, are expected to contain terminology that might be difficult for an individual without a professional background in the field to capture. There may be cases in which none of the crowd workers can type those terms correctly, which would result in an incorrect caption. I think the paper could be strengthened with a discussion of the kinds of situations in which the proposed method works best. To continue the point, another possibility is to leverage the advantages of pre-trained speech recognition models and crowd workers to develop a human-AI team that achieves the desired performance.


I think the following questions are worthy of further discussion.

  • Would it be helpful if the recruiting process of crowd workers involves the consideration on their backgrounds, especially for some domain-specific lectures?
  • Although ASR may not be reliable on its own, is it useful to leverage it as a contributor to the input of crowd workers?
  • Is there any other potential to add a machine-in-the-loop component in the proposed framework?
  • What do you think about the proposed approach compared with the ensemble modeling that merges the outputs of multiple speech recognition algorithms to get the final results?


03/04/2020 – Mohannad Al Ameedi – Real-Time Captioning by Groups of Non-Experts


In this paper, the authors propose a low-latency captioning solution for deaf and hard of hearing people that can work in a real-time setting. Although solutions are already available, they are either very expensive or of low quality. The proposed system allows people with hearing disabilities to request captioning at any time and get the result in a few seconds. The system depends on a combination of non-expert crowdsourcing workers and local staff to provide the captioning. Each request is handled by multiple people, and the result is a combination of all the participants’ input. The request is submitted as an audio stream and the result is returned as text. A crowdsourcing platform is used to submit the request, and the result is retrieved in seconds. The proposed system uses an algorithm that works in a streaming manner, where the input is processed as it is received and the results are aggregated at the end. The system outperforms all other available options in both coverage and accuracy. The proposed solution is feasible to apply in a production setting.


I found the idea of real-time captioning very interesting. My understanding was that crowdsourcing always involves latency and so cannot be applied in real-world scenarios. It will be interesting to know how the system works as the number of users increases.

I also found the concept of multiple people working on the same audio stream and combining the result very interesting. Collecting captions from multiple people and then trying to figure out what is unique and what is duplicate and producing a final sentence, paragraph, or script is a challenging task.

This work is like multiple people working on one task, or multiple developers writing code to implement a single feature. Normally the supervisor or development lead merges the results, but in this case the algorithm takes care of the merge.


  • The authors measured the system on a limited number of users. Do you think the system will continue to outperform other methods if it is deployed in a real-world setting?
  • Since we have an increasing amount of live streaming at work, school, and other places, can we use the same concept to pass in a URL and get instant captioning? What are the limitations of this approach?
  • What are the privacy concerns with this approach, especially if it is used in the medical field? Normally a limited number of people are hired to help with such tasks, while crowdsourcing is open to a wide range of people.
