Summary
Lasecki et al. present a novel captioning system for deaf and hard-of-hearing users called LEGION:SCRIBE. Their implementation crowdsources multiple workers per audio stream to achieve low-latency and highly accurate captions. They position SCRIBE against two competitors: professional stenographers, who are expensive, and automatic speech recognition (ASR), which has significant accuracy problems. They then explain how they evaluate SCRIBE and describe their Multiple Sequence Alignment (MSA) approach, which aligns the output from multiple crowd workers to produce the best possible caption. Their approach also allows tuning the output toward either coverage, which yields a more complete caption, or precision, which yields a lower word error rate. Finally, they conducted an experiment transcribing a set of lectures with several methods, including multiple configurations of SCRIBE (varying the number of workers and the coverage setting) and an ASR system. SCRIBE outperformed the ASR in both latency and accuracy.
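To make the combining idea concrete, here is a minimal sketch of how partial captions from several workers might be merged by alignment and voting. This is my own simplification, not the paper's actual MSA algorithm: it uses Python's difflib to align each caption against the longest one, and a hypothetical min_votes threshold to loosely mimic the coverage/precision trade-off (a low threshold keeps more words for coverage; a high one keeps only words several workers agree on for precision).

```python
# Minimal sketch of merging partial crowd captions by alignment and voting.
# Not the paper's MSA algorithm; min_votes is a made-up knob for the
# coverage/precision trade-off described in the summary above.
from collections import Counter
from difflib import SequenceMatcher


def merge_captions(captions: list[str], min_votes: int = 1) -> str:
    tokenized = [c.lower().split() for c in captions if c.strip()]
    if not tokenized:
        return ""
    # Use the longest caption as the alignment backbone.
    backbone = max(tokenized, key=len)
    votes = [Counter() for _ in backbone]
    for words in tokenized:
        matcher = SequenceMatcher(a=backbone, b=words, autojunk=False)
        for i, j, size in matcher.get_matching_blocks():
            for k in range(size):
                votes[i + k][words[j + k]] += 1
    kept = []
    for counter in votes:
        if counter:
            word, count = counter.most_common(1)[0]
            if count >= min_votes:
                kept.append(word)
    return " ".join(kept)


# Example: three noisy partial captions of the same utterance.
workers = [
    "the quick brown fox jumps over",
    "quick brown fox jumps over the lazy dog",
    "the quick brown fox over the lazy dog",
]
print(merge_captions(workers, min_votes=1))  # favors coverage
print(merge_captions(workers, min_votes=3))  # favors precision
```

With a low threshold the merged caption keeps every aligned word; with a high threshold it keeps only words that several workers independently produced, which is roughly the trade-off the authors expose.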
Personal Reflection
This work is quite relevant to me, as my semester project is on transcribing notes for users in VR. I was struck by how quickly they were able to get captions from the crowd, but also by how many errors were still present in the finished product. In Figure 11, the WER for CART was a quarter of that of their method, which only got slightly more than half of the words correct. And in Figure 14, none of the transcriptions seems terribly acceptable, though CART was close. I wonder whether the WER was so poor because of the nature of the talks or because there were multiple speakers in each scene. I wish they had discussed how much of an impact multiple speakers have on transcription services rather than offering the somewhat vague descriptions they did.
It was interesting that they could get the transcriptions done through Mechanical Turk at a rate of $36 per hour. This is roughly a third of the cost of their professional stenographer (at $1.75 per minute, or $105 per hour). The cost savings are impressive, though the coverage could be a lot better.
Lastly, I was glad they included one of their final sections, “Leveraging Hybrid Workforces,” as it is particularly relevant to this class. They were able to increase both coverage and precision by including an ASR as one of the inputs to their MSA combiner, regardless of whether they were using one worker or ten. This indicates that there is a lot of value in human-AI collaboration in this space.
Questions
- If low latency weren't such a key issue, could the captions achieve an even lower WER? Would that be worth a 20-second latency? A 60-second latency? Would it be worth the extra cost it might incur?
- Combined with our reading last week on acceptable false positives and false negatives from AI, what is an acceptable WER for this use case?
- Their MSA combiner showed a lot of promise as a tool for other fields. In what other ways could their combiner be used?
- Error checking is a problem in many fields, but especially in crowdsourcing, where errors can arise in many different ways. What other approaches are there to combating errors in crowdsourcing? Would you choose this approach or another?
Hi Lee,
Great comment! You made the point that the system should also incorporate ASR into the captioning process. That is exactly the topic our group plans to implement, because we think coverage is the main deficiency of this end-to-end system: the output can be incomplete if all of the workers miss a part of the information, and in my experience that happens a lot when that part of the speech is unclear. Regarding your first question, I think latency would have a great impact on meetings, conversations, or other situations that require participants to interact with each other. For a lecture or a video, we could review the recording and add subtitles later, but in the interactive situations latency significantly harms the quality of the ongoing event. For me, it is worth the extra cost to make sure an important meeting goes well.
I prefer a low-latency system with an acceptable error rate. This would be great for deaf people: if they join a seminar or a class and the system has a 20-second latency, the discussion topic may have already moved on, and even if they have thoughts or questions, they cannot raise them promptly. If the captioning system is real-time, deaf people can take part in the discussion.
As mentioned in the paper, WER should be considered alongside latency and the cases in which words are replaced by other words with similar meanings. Most errors made by humans replace words in the ground truth with similar words that leave the sentences understandable. Because of this, I think the actual accuracy of human workers is better than a judgment based only on WER would suggest.
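To illustrate that point, here is a small worked example (my own, not from the paper) of the standard word-level edit-distance WER, showing that a meaning-preserving substitution is penalized exactly like any other error.

```python
# Word error rate via word-level Levenshtein distance. The example sentences
# are invented; they show that swapping in synonyms still counts as errors.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Substitutions, insertions, and deletions all cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


reference = "the results were very large"
hypothesis = "the results were quite big"          # same meaning, two "errors"
print(word_error_rate(reference, hypothesis))      # 0.4, despite being readable
```

Even though the hypothesis is perfectly understandable, it scores a 40% WER, which supports the idea that raw WER understates the usefulness of human captions.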