Summary:
In this paper, the authors focus on the main limitation of real-time captioning: captions with high accuracy and low latency require expensive stenographers, who must be booked in advance and who are trained to use specialized keyboards. The less expensive option is automatic speech recognition (ASR). However, its high error rate greatly degrades the user experience and causes many inconveniences for deaf people. To alleviate these problems, the authors introduce an end-to-end system called LEGION: SCRIBE, which enables multiple workers to caption simultaneously in real time and combines their input into a final output with high precision, high coverage, and low latency. The authors experimented with crowd workers and local participants and compared the results with CART, ASR, and individual workers. The results indicate that this end-to-end system with a group of workers can outperform both an individual worker and ASR in coverage, precision, and latency.
Reflection:
I think the authors make a good point about the limitations of real-time captioning, especially the inconvenience they cause for deaf and hard of hearing people. Thus, the greatest contribution of this end-to-end system is an accessible, cheap, and reliable real-time captioning channel. However, I have several concerns about it.
First, this end-to-end system requires a group of workers. Even if each person is paid a low wage, the total pay for all workers grows with the captioning time and is still a significant overhead.
Second, a caption with high precision, high coverage, and low latency requires five or more workers working together. As mentioned in the experiment, the MTurk workers need to watch a 40-second video to learn how to use the system. Therefore, the system may not be able to find the required number of workers in time.
Third, the system only combines the input from its workers. This creates a coverage problem: if all of the workers miss a part of the speech, the system output will be incomplete. In my experience, if one person misses part of the information, most other people usually miss it too. In the example presented in the paper, no worker typed “non-aqueous”, which was used in a clip about chemistry.
Finally, I am considering combining human correction with ASR captions. Humans have the strength of remembering previously mentioned knowledge, for example an abbreviation, yet they cannot type fast enough to cover all the content. ASR, in contrast, usually does not miss any portion of the speech, yet it makes some unreasonable mistakes. Thus, it might be a good idea to let humans correct inaccurate ASR captions instead of trying to type all of the speech.
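To make the coverage concern above concrete, here is a minimal Python sketch. It is not the merging algorithm from the paper: it assumes a simple bag-of-words union of what the workers typed, and the reference sentence, the worker inputs, and the function name `coverage` are all hypothetical, made up for illustration.

```python
def coverage(reference_words, worker_inputs):
    """Return (fraction of reference words typed by at least one worker, missed words).

    Assumption: merging is modeled as a plain union of words, ignoring
    word order and timing, which a real system would have to handle.
    """
    typed = set()
    for words in worker_inputs:
        typed.update(w.lower() for w in words)
    ref = [w.lower() for w in reference_words]
    missed = [w for w in ref if w not in typed]
    return (len(ref) - len(missed)) / len(ref), missed


# Hypothetical reference sentence and partial inputs from three workers.
reference = "the reaction needs a non-aqueous solvent".split()
workers = [
    ["the", "reaction", "needs"],
    ["needs", "a", "solvent"],
    ["reaction", "a", "solvent"],
]

cov, missed = coverage(reference, workers)
print(f"coverage: {cov:.2f}, missed: {missed}")
```

In this toy case every worker skips the same unfamiliar term, so adding more workers who behave the same way would not raise the coverage. Only a different source, such as ASR output, could fill the gap.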
Questions:
- What do you think of this end-to-end system? Can you evaluate it from different perspectives, such as expense and accuracy?
- How would you solve the problem of inadequate speech coverage?
- What do you think of the idea of combining human and ASR work? Do you think it will be more efficient or less efficient?
Word Count: 517