03/04/20 – Nan LI – Real-Time Captioning by Groups of Non-Experts

Summary:

In this paper, the authors focus on the main limitations of real-time captioning. They point out that captions with high accuracy and low latency require expensive stenographers, who must be booked in advance and who are trained to use specialized keyboards. The less expensive option is automatic speech recognition (ASR); however, its low accuracy and high error rate greatly degrade the user experience and cause many inconveniences for deaf people. To alleviate these problems, the authors introduce an end-to-end system called LEGION: SCRIBE, which enables multiple workers to caption the same audio simultaneously in real time; the system combines their input into a final answer with high precision, high coverage, and low latency. The authors ran experiments with crowd workers and other local participants and compared the results against CART, ASR, and individual workers. The results indicate that this end-to-end system with a group of workers can outperform both individuals and ASR in coverage, precision, and latency.
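The paper's merging step aligns overlapping partial captions from several workers into one stream. As a rough illustration only (the real system uses a more sophisticated alignment algorithm; the function name, time-window bucketing, and vote threshold below are my own simplifications, not the paper's method), the idea can be sketched as majority voting over timestamped words:

```python
from collections import Counter, defaultdict

def combine_captions(partial, window=1.0, min_votes=1):
    """Merge timestamped partial captions from several workers by
    majority vote inside fixed time windows -- a toy stand-in for
    SCRIBE's alignment-based combiner.

    partial: list of worker streams, each a list of (seconds, word).
    """
    # Bucket every typed word by the time window it falls into.
    buckets = defaultdict(Counter)
    for worker in partial:
        for t, word in worker:
            buckets[int(t // window)][word.lower()] += 1

    # Within each window, keep words that enough workers agreed on,
    # most-agreed first.
    merged = []
    for slot in sorted(buckets):
        for word, votes in buckets[slot].most_common():
            if votes >= min_votes:
                merged.append(word)
    return " ".join(merged)
```

Raising `min_votes` trades coverage for precision: words only one worker caught are dropped, which mirrors the precision/coverage tension the paper measures.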

Reflection:

First, I think the authors made a good point about the limitations of real-time captioning, especially the inconvenience they bring to deaf and hard-of-hearing people. Thus, the greatest contribution of this end-to-end system is making cheap, reliable real-time captioning accessible. However, I have several concerns about it.

First, this end-to-end system requires a group of workers. Even if each worker is paid a low wage, the combined cost across all workers becomes a significant overhead as captioning time increases.

Second, to satisfy the coverage requirement, high-precision, high-coverage, low-latency captioning requires at least five or more workers working together. As mentioned in the experiment, the MTurk workers needed to watch a 40-second video to learn how to use the system. The system may therefore be unable to recruit the required number of workers in time.

Third, because the system only combines the workers' input, there is a coverage problem: if every worker misses a piece of information, the system's output will be incomplete. In my experience, if one person misses part of the information, most other people usually miss it too. As the example in the paper shows, no worker typed "non-aqueous" in a clip about chemistry.

Finally, I am considering combining human correction with ASR captions. Humans are good at recalling previously mentioned knowledge, such as an abbreviation, yet they cannot type fast enough to cover all the content. ASR, by contrast, rarely misses a portion of the speech, yet it makes some unreasonable mistakes. Thus, it might be a good idea to let humans correct ASR's inaccurate captions instead of trying to type out all of the speech content.
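To make this idea concrete, a minimal sketch of the human-corrects-ASR workflow could look like the following. This is purely my own illustration, not anything from the paper: the function name and the idea of a checker-supplied substitution table are assumptions.

```python
def correct_asr(asr_transcript, fixes):
    """Apply human-supplied fixes to an ASR transcript.

    fixes maps a misrecognized token to the human checker's
    correction; every other word passes through untouched, so the
    human only types the small fraction of words ASR got wrong.
    """
    return " ".join(fixes.get(w, w) for w in asr_transcript.split())
```

Here the human workload scales with the ASR error rate rather than with the full speaking rate, which is the efficiency gain I have in mind.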

Question:

  • What do you think of this end-to-end system? Can you evaluate it from different perspectives, such as expense and accuracy?
  • How would you solve the problem of inadequate speech coverage?
  • What do you think of the idea of combining human and ASR work? Do you think it would be more efficient or less efficient?

Word Count: 517

4 thoughts on “03/04/20 – Nan LI – Real-Time Captioning by Groups of Non-Experts”

  1. To answer your first question, I think the system can work well with a small number of users, but it cannot scale as the number of users increases. Responding to thousands of users with low latency is a challenging task, but the system can perform well on captioning tasks whose transcripts are not required immediately.

  2. For question 3, I think the rate of transcription will decrease but the accuracy will greatly improve. Because deaf users also need time to read, I think the overall efficiency is improved.

  3. Interesting questions, Nan! To answer your first question, I think that’s a valid concern—cost and accuracy. Certainly it needs to be evaluated not in a vacuum, but in comparison to existing ASR and MSR systems. Automated systems are cheap but inaccurate, while manual systems are *very expensive*. I think this work finds a great balance between the two. Overall, I really like the system, and it is an ingenious use of human computation.

  4. I like your questions, but I wanted to address your second reflection item. While you could not just “turn on” captioning with this approach, you could ameliorate the issue by either A) only using workers who had already done this HIT before, or B) informing the user that captions would begin in 60 seconds. For A, the workers have already been trained on the task, and if the software were deployed, the knowledgeable workforce would grow naturally. If you paid a good rate (above that of competitors), the now-skilled workers would prioritize your HITs. For B, the padding allows quikTurkit both to gather workers and to have them watch the training video. So I believe there are workarounds for your concern.
