Traditionally, real-time captioning is done by professional captionists, but hiring them is expensive. Automatic speech recognition systems have been developed as an alternative, but they still perform badly when the audio quality is low or when multiple people are talking. In this paper, the authors developed a system that hires several non-expert workers to do the captioning task and merges their work to obtain a high-accuracy caption. Because the workers are paid significantly less than experts, the cost is reduced even when multiple workers are hired. The system also performs well at collecting the workers’ partial captions and merging them into a high-accuracy output with low latency.
Reflections:
For problems that require both high accuracy and low latency, I had always held the view that only AI or experts could complete such tasks. However, in this paper, the authors showed that non-experts can also complete this kind of task if a group of people works together.
Compared with professionals, non-experts cost much less to hire. Compared with AI, people handle some complicated situations better. This system combines these two advantages to provide a cheap real-time captioning system with high accuracy.
This system certainly has many advantages, but we should still examine it critically. On cost, it is true that hiring one non-expert costs much less than hiring a professional captionist. However, the system needs to hire 10 workers to reach 80 to 90 percent accuracy. Even if each worker is paid a low wage, for example 10 dollars per hour, the total cost reaches 100 dollars per hour. Hiring an expert costs only around 120 dollars per hour, so the saving from using the system is relatively small.
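The cost comparison above can be checked with a short calculation. This is only a sketch using the figures in this reflection: the 10-dollar hourly wage is my illustrative assumption, the 120-dollar expert rate is a rough estimate, and the function and variable names are made up for this example.

```python
def crowd_cost_per_hour(num_workers: int, hourly_wage: float) -> float:
    """Total hourly cost of a group of workers paid the same wage."""
    return num_workers * hourly_wage


def savings_fraction(crowd_cost: float, expert_cost: float) -> float:
    """Fraction of the expert cost saved by using the crowd instead."""
    return (expert_cost - crowd_cost) / expert_cost


# Figures taken from the reflection; the wage and the expert rate are assumptions.
NUM_WORKERS = 10
WORKER_WAGE = 10.0   # dollars per hour, illustrative
EXPERT_RATE = 120.0  # dollars per hour, approximate

crowd = crowd_cost_per_hour(NUM_WORKERS, WORKER_WAGE)
saving = savings_fraction(crowd, EXPERT_RATE)

print(f"Crowd: ${crowd:.2f}/h, expert: ${EXPERT_RATE:.2f}/h, saving: {saving:.1%}")
```

Under these assumptions the crowd costs 100 dollars per hour, a saving of roughly one sixth compared with the expert. A different wage or group size would change the result proportionally.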
On accuracy, it is possible that all 10 workers miss the same part of the audio. In that case, even after merging all the workers’ results, the system will still have no caption for that part. By contrast, although an AI system may produce captions with errors, it at least produces something for every word in the audio.
For these two reasons, I think hiring fewer workers, for example three to five, to fix the errors in a system-generated caption would save more money while still maintaining high accuracy. With a draft caption provided, the workers’ task would be easier, and they might produce more accurate results. Also, in circumstances where the AI system performs well, the workers would not need to spend time typing, so the latency of the system would be reduced.
Questions:
What are the advantages of hiring non-expert humans to do the captioning compared with experts or AI systems?
Would a system that hires fewer workers to fix the errors in an AI-generated caption be cheaper? Would it perform better?
For the system mentioned in the second question, does it have any limitations or drawbacks?
The advantages of hiring non-professionals for captioning are low cost and fast response. Crowdsourcing platforms have a large pool of such workers, so there are enough people available to complete this kind of work.