Summary
Machine learning is at the forefront of many technologies but remains highly inaccurate; YouTube's auto-generated captions for recorded videos, for example, still show how immature the technology is. To improve on this, the team at Rochester created a hybrid approach that combines the efforts of multiple crowd workers to produce captions more accurately and in real time. The methodology can be used either to verify the output of existing machine learning algorithms or to generate captions outright, and whenever workers disagree, the tie-breaker is a majority vote. Compared with other captioning systems, Scribe achieves similar precision and Word Error Rate, but at a lower cost.
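Since the summary mentions majority voting and Word Error Rate without spelling out the mechanics, here is a minimal sketch of both ideas. This is not Scribe's published merging algorithm (which aligns partial, overlapping captions from many workers); it is a simplified illustration that assumes each worker submits a complete, already-aligned transcript, and all names and data below are hypothetical.

```python
# Toy per-position majority vote over worker transcripts, plus a standard
# word error rate (WER) computation. Illustrative sketch only.
from collections import Counter

def majority_vote(transcripts):
    """Pick the most common word at each position across workers.
    Assumes transcripts are equal-length and already word-aligned."""
    merged = []
    for words_at_pos in zip(*[t.split() for t in transcripts]):
        merged.append(Counter(words_at_pos).most_common(1)[0][0])
    return " ".join(merged)

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: three workers hear the same phrase.
workers = ["the quick brown fox", "the quick brawn fox", "a quick brown fox"]
merged = majority_vote(workers)
print(merged)                                          # "the quick brown fox"
print(word_error_rate("the quick brown fox", merged))  # 0.0
```

Even in this toy form, the vote smooths over individual hearing errors, which is the intuition behind combining multiple non-expert workers; a malicious majority, as raised in the questions below, would defeat it just as easily.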
Response
I can see how combining the two aspects, crowd workers and the initial baseline, could create an accurate process for generating captions. Using crowd workers to assess and verify the baseline captions ensures the quality of the output and could ultimately feed back into improving the machine learning algorithm. As the system grows, more workers can be given jobs, benefiting both the people looking for work and the core machine learning algorithm itself.
Questions
- I am not experienced in this specific field, and setting aside the 2012 publishing date, having crowd workers verify auto-generated captions does not seem especially novel. Although their survey of the state of the art did not include crowd workers in any capacity, that may simply reflect the scope of their review. In your opinion, does this research still stand up against more recent auto-captioning papers, or is it a product of its time?
- A potential problem within the crowd-working community: their technique relies on a majority vote to confirm which words accurately represent a phrase. Even if there are statistics ensuring the Mechanical Turk workers have sufficient experience and can be relied on, this approach may be vulnerable to malicious actors outnumbering the honest ones. Given that the phrases are interpreted and written out explicitly, do you think a scenario similar to the Mountain Dew naming campaign (Dub The Dew – https://www.huffpost.com/entry/4chan-mountain-dew_n_1773076), in which a group of malicious actors flooded the vote, could happen to this type of system?
- In this system, the raw audio of a speech or event is fed directly to the Mechanical Turk workers using the Scribe program. Depending on the environment where the speech was given or the quality of the microphone, a majority of workers might not be able to make out the correct words (potentially regardless of the speech's volume). Is there potential for combining this kind of technology with machine learning algorithms that isolate and remove background noise or side conversations around the main speakers of the event?
I think this work can stand on its own merit because it was the first to do crowd-based captioning in real time. Yes, auto-generated captions are improving, but they are not yet at the point where they can fully replace humans. And of course, professional captioners are very expensive and not accessible for regular use by the smaller organizations that need their talent, so this provides a good intermediary solution.
To answer your third question, I believe low-quality equipment will hinder all of the actors involved. The authors note in the introduction that ASR depends on high-quality microphones and speakers, and the complaints from crowd workers that they could not hear the input indicate that Scribe also needs hardware of a certain quality. I'm sure a stenographer would face similar issues, as they are human too.