03/04/20 – Lulwah AlKulaib- SocialAltText

March 4, 2020 Lulwah AlKulaib

Summary

The authors propose a system that uses crowd workers to generate alt text for images embedded in social media posts. Their goal is to give blind and visually impaired (BVI) users a better experience on social media. Existing tools provide imperfect descriptions, some through automatic caption generation and others through object recognition. These systems fall short because in many cases their output is not descriptive enough for BVI users. The authors study how crowdsourcing can be used for both:

Evaluating the value provided by existing automated approaches
Enabling workflows that provide scalable and useful alt text for BVI users

They use real-time crowdsourcing to run experiments that vary how deeply the crowd interacts with visually impaired users while assisting them. They show the shortcomings of existing AI image captioning systems and compare them with their method. The paper proposes two workflows:

TweetTalk: a conversational assistant workflow.
Structured Q&A: a workflow that builds upon and enhances state-of-the-art generated captions.

They evaluated the conversational assistant with 235 crowdworkers. For the baseline image captions, they evaluated 85 tweets; each tweet was evaluated 3 times, for a total of 255 evaluations.

Reflection

The paper presents a novel concept, and its approach is a different take on utilizing crowdworkers. I believe the experiment would have worked better if it had been tested with visually impaired users. Since the crowdworkers hired were not visually impaired, it is harder to say that BVI users would react the same way. Since the study targets BVI users, they should have been the pool of testers. People interact with the same element in different ways, and what the authors showed seemed too controlled. Also, the questions were not the same for all images, which makes the results harder to generalize. The presented model tries to solve a problem for social media photos, and not having a plan that can be repeated for each photo might make interpreting images difficult.

I appreciated the authors’ use of existing systems and their attempt at improving the AI generated captions. Their results obtain better accuracy compared to state of the art work.

I would have loved to see how different social media applications compare with each other, since applications vary in how they present photos. Twitter, for example, allows a limited character count, while Facebook can present more text, which might help BVI users understand the image better.

In the limitations section, the authors mention that human-in-the-loop workflows raise privacy concerns and that the alt text approach would generalize to friendsourcing and utilizing social network users. I wonder how that generalizes to social media applications in real time, and how reliable friendsourcing would be for BVI users.

Discussion

What improvements would you suggest for the TweetTalk experiment?
Do you know of any applications that use human in the loop in real time?
Would you have any privacy concerns if one of the social media applications integrated a human in the loop approach to help BVI users?

03/04/2020 – Ziyao Wang – Combining crowdsourcing and google street view to identify street-level accessibility problems

March 4, 2020 Ziyao Wang

In this paper, the authors focus on a mechanism that uses untrained crowdworkers to find and label accessibility problems in Google Street View (GSV) imagery. They give the workers GSV images and ask them to find, label, and assess sidewalk accessibility problems. They compare the results of this labeling task as completed by six dedicated labelers, including three wheelchair users, with the results from MTurk workers. The comparison shows that crowdworkers can determine the presence of an accessibility problem with high accuracy, which means the mechanism is promising for assessing sidewalk accessibility. However, the mechanism still has problems, such as locating the GSV camera in geographic space and selecting an optimal viewpoint, measuring sidewalk width, and the age of the images. In the experiments, the workers could not label some of the images because of the camera position, and some images may have been captured three years earlier. Additionally, there is no method to measure the width of the sidewalk, which wheelchair users need.

Reflections:

The authors combined Google Street View imagery and MTurk crowdsourcing to build a system that can detect accessibility challenges. This kind of hybrid system finds and labels such challenges with high accuracy. If the system can be used in practice, people with disabilities will benefit a lot from it.

However, there are some problems in the system. As mentioned in the paper, the images in Google Street View are old; some may have been captured years ago. If detection is based on these pictures, the labeled problems may not match the current state of the sidewalk. For this problem, I have a rough idea: let the users of the system update the image library. When they find differences between the images in the library and the actual sidewalk, they can upload recent pictures that they captured themselves. As a result, other users will not suffer from the age of the images. However, this solution would change the whole system. Google Street View imagery requires professional capture devices, which are not available to most users. As a result, Google Street View will not update its imagery using photos captured by users, and the system cannot update itself through that imagery. Instead, the system has to build its own image library, which is totally different from the system introduced in the paper. Additionally, the photos provided by users may have low resolution, which would make it difficult for MTurk workers to label the accessibility challenges.

Similarly, the problem that workers cannot measure the width of the sidewalk could be solved if users uploaded the width when using the system. However, this still runs into the lack of a dedicated database, and the system would need to be modified heavily.

Instead of detecting accessibility challenges, I think the system is more useful for tracking and labeling bike lanes. Compared with assessing sidewalk accessibility, detecting the existence of bike lanes suffers less from the age problem, because even if the bike lanes were built years ago, they can still work. Also, there is no need to measure the width of the lanes, as all lanes should have enough space for bikes to pass.

Question:

Is there any approach to solve the age problem, the camera viewpoint problem, and the width measurement problem in the system?

What do you think about applying such a system to track and label bike lanes?

What other kinds of street detection problems can this system be applied to?

03/04/2020- Ziyao Wang – Real-time captioning by groups of non-experts

March 4, 2020 Ziyao Wang

Traditionally, real-time captioning is done by professional captionists, but hiring them is expensive. Alternatively, some automatic speech recognition systems have been developed, but these systems perform badly when the audio quality is low or multiple people are talking. In this paper, the authors developed a system that hires several non-expert workers to do the captioning task and merges their work to obtain a high-accuracy caption output. Because the workers are paid significantly less than experts, the cost is reduced even when multiple workers are hired. The system also performs well at collecting the workers’ input and merging it into a high-accuracy output with low latency.

Reflections:

For problems that require high accuracy and low latency, I had always held the view that only AI or experts can complete such tasks. However, in this paper, the authors show that non-experts can also complete this kind of task if a group of people works together.

Compared with professionals, hiring non-experts costs much less. Compared with AI, people can handle some complicated situations better. This system combines these two advantages and provides a cheap real-time captioning system with high accuracy.

This system certainly has many advantages, but we should still consider it critically. As for cost, it is true that hiring non-experts costs much less than hiring professional captionists. However, the system needs 10 workers to reach 80 to 90 percent accuracy. Even if the workers are paid little, for example 10 dollars per hour, the total cost will reach 100 dollars per hour. Hiring an expert costs only around 120 dollars per hour, which shows that the saving from applying the system is relatively low.

As for accuracy, it is possible that all 10 workers miss the same part of the audio. In that case, even after merging all the workers’ results, the system will still lack a caption for that part. In contrast, although an AI system may produce captions with errors, it can at least provide something for every word in the audio.

For these two reasons, I think hiring fewer workers, for example three to five, to fix the errors in a system-generated caption would save more money while still maintaining high accuracy. With the caption provided, the workers’ tasks would be easier, and they might produce more accurate results. Also, in circumstances where the AI system performs well, the workers would not need to spend time typing, and the latency of the system would be reduced.
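The cost argument in this reflection can be worked through in a few lines. The sketch below uses only the illustrative hourly figures quoted in the reflection (10 dollars per worker, roughly 120 dollars for one expert); they are example values, not rates reported by the paper, and the function names are made up for illustration.

```python
# Hourly figures are the illustrative ones from the reflection,
# not measured rates from the paper.
WORKER_RATE = 10.0   # dollars per hour per non-expert (example value)
EXPERT_RATE = 120.0  # dollars per hour for one professional (approximate)


def crowd_cost(num_workers: int, rate: float = WORKER_RATE) -> float:
    """Total hourly cost of a crowd of equally paid workers."""
    return num_workers * rate


def saving_vs_expert(num_workers: int) -> float:
    """Fraction of the expert's hourly cost saved by using the crowd."""
    return 1.0 - crowd_cost(num_workers) / EXPERT_RATE


for n in (10, 5, 3):
    print(f"{n} workers: ${crowd_cost(n):.0f}/hour, "
          f"saving {saving_vs_expert(n):.0%} vs. one expert")
```

Under these assumed rates, ten workers save about 17 percent of the expert's hourly cost, while three workers save 75 percent, which is the gap the proposal to hire fewer workers relies on.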

Questions:

What are the advantages of hiring non-expert humans to do the captioning compared with the experts or AI systems?

Will a system hiring fewer workers to fix the errors in the AI-generated caption be cheaper? Will this system perform better?

For the system mentioned in the second question, does it have any limitations or drawbacks?

03/04/2020 – Palakh Mignonne Jude – Combining Crowdsourcing and Google Street View To Identify Street-Level Accessibility Problems

March 4, 2020 Palakh Mignonne Jude

SUMMARY

The authors of this paper aim to investigate the feasibility of recruiting MTurk workers to label and assess sidewalk accessibility problems as seen in Google Street View. The authors conducted two studies: the first with 6 people (3 from their team of researchers and 3 wheelchair users), and the second investigating the performance of turkers. The authors created an interactive labeling interface as well as a validation interface (to help users accept or reject previous labels). The authors proposed different levels of annotation correctness along two spectra: a localization spectrum, which includes image-level and pixel-level granularity, and a specificity spectrum, which covers the amount of information evaluated for each label. They defined image-level correctness in terms of accuracy, precision, recall, and f-measure. To compute inter-rater agreement at the image level, they used Fleiss’ kappa. To evaluate the more challenging pixel-level agreement, they aimed to verify the labeling by showing that pixel-level overlap was greater between labelers on the same image than across different images. The authors used the labels produced from Study 1 as the ground truth dataset to evaluate turker performance. The authors also proposed two quality control approaches: filtering turkers based on a performance threshold and filtering labels based on crowdsourced validations.
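For readers less familiar with the image-level measures named in the summary, here is a minimal sketch of how they are usually computed. It assumes binary labels per image (1 means a problem is present) and uses the textbook definitions; the function names and the toy data are made up for illustration and do not come from the paper.

```python
from typing import Dict, List, Sequence


def image_level_scores(truth: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F-measure for binary labels (1 = problem present)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(truth),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters who put image i in category j.

    Every image must be rated by the same number of raters (at least two), and
    the ratings must not all fall in a single category (that would divide by zero).
    """
    num_items = len(counts)
    raters = sum(counts[0])
    # Proportion of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (num_items * raters)
             for j in range(len(counts[0]))]
    # Observed agreement per image, averaged over images.
    p_obs = sum((sum(c * c for c in row) - raters) / (raters * (raters - 1))
                for row in counts) / num_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


# Toy data: ground truth vs. one hypothetical turker, then 3 raters on 4 images.
print(image_level_scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]]), 3))
```

Kappa is 1 when raters always agree and near 0 when agreement is no better than chance, which is why it is a common choice for comparing multiple labelers on the same images.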

REFLECTION

I really liked the motivation of this paper, especially given the large number of people who have physical disabilities. I am very interested to know how something like this would extend to other countries such as India, where it would greatly aid people with physical disabilities, since many places there have poor walking surfaces and no support for wheelchairs. I think that having such a system in place in India would definitely help disabled people be better informed about places that can be visited.

I also liked the quality control mechanisms of filtering turkers and filtering labels, since these appear to be good ways to improve the overall quality of the labels obtained. I thought it was interesting that the performance of the system improved with turker count, but the gains diminished in magnitude as the group size grew. I thought that the design of the labeling and verification interface was good and that it made it easy for users to perform their tasks.
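One way to build intuition for the diminishing gains with group size is a simple majority-vote model. This is not the paper's analysis: the sketch assumes an odd number of workers who label independently and are each correct with the same probability, and the value 0.7 is an arbitrary example.

```python
from math import comb


def majority_correct(num_workers: int, p: float) -> float:
    """Probability that a strict majority of independent workers is correct.

    Assumes an odd number of workers, each correct with probability p.
    """
    needed = num_workers // 2 + 1
    return sum(comb(num_workers, k) * p**k * (1 - p)**(num_workers - k)
               for k in range(needed, num_workers + 1))


previous = None
for n in (1, 3, 5, 7, 9):
    acc = majority_correct(n, 0.7)
    gain = "" if previous is None else f" (gain {acc - previous:+.3f})"
    print(f"{n} workers: {acc:.3f}{gain}")
    previous = acc
```

Under these assumptions each additional pair of workers still helps, but by less than the pair before it, which matches the pattern described above.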

QUESTIONS

As indicated in the limitations section, this work ‘ignored practical aspects such as locating the GSV camera in geographical space and selecting an optimal viewpoint’. Has any follow-up study been performed that takes into account these physical aspects? How complex would it be to conduct such a study?
The authors mention that image quality can be poor in some cases due to a variety of factors. How much of an impact would this cause to the task at hand? Which labels would have been most affected if the image quality was very poor?
The validation of labels was performed by crowd workers via the verification interface. Would there have been any change in the results obtained if experts had been used for the validation of labels instead of crowd workers (since they may have been able to identify more errors in the labels as compared to normal crowd workers)?

03/04/20 – Sukrit Venkatagiri – Pull the Plug?

March 4, 2020 Sukrit Venkatagiri

Paper: Danna Gurari, Suyog Jain, Margrit Betke, and Kristen Grauman. 2016. Pull the Plug? Predicting If Computers or Humans Should Segment Images. 382–391.

Summary:
This paper proposes a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher quality segmentations for a given batch of images and methods. The framework uses a “pull-the-plug” model, predicting when to use human versus computer annotators. More specifically, the paper first proposes a system that intelligently allocates computer effort to replace human effort for initial coarse segmentations. Second, it automatically identifies images for humans to re-annotate by predicting which images the automated methods did not segment well enough. This method could be used for a variety of use cases, and the paper tests it on three datasets and 8 segmentation methods. The findings show that this method significantly outperformed prior work across a variety of metrics, including quality prediction, initial segmentation, fine-grained segmentation, and cost.
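The allocation idea in the summary can be illustrated with a short sketch: given a predicted quality score for each computer-generated segmentation, the images with the lowest predicted scores are sent to humans until the budget runs out. The function name, the score range, and the image ids are hypothetical, and the paper's actual predictor and allocation rule may differ.

```python
from typing import Dict, List


def allocate_human_budget(predicted_quality: Dict[str, float], budget: int) -> List[str]:
    """Return the image ids to send to humans: the `budget` lowest predicted scores."""
    ranked = sorted(predicted_quality, key=predicted_quality.get)
    return ranked[:max(budget, 0)]


# Hypothetical predicted quality (0 = poor, 1 = perfect) of computer segmentations.
scores = {"img_a": 0.91, "img_b": 0.35, "img_c": 0.78, "img_d": 0.12}
to_humans = allocate_human_budget(scores, budget=2)
print(to_humans)  # ['img_d', 'img_b']
```

Everything not selected keeps its computer-generated segmentation, so the human budget is spent only where the prediction says the machine result is weakest.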

Reflection:
Overall, this was an interesting paper to read that is largely focused on performance and accuracy. The paper shows that the methods are superior to prior work and are now the state of the art for image segmentation when it comes to these three datasets, and for saving costs.

I wonder what this paper might have looked like if it was more focused on creativity and innovation, rather than performance and cost-savings. For example, in HCI there are studies of using crowds to generate ideas, solve mysteries, and critique designs. Perhaps this approach might be used in a way that humans and machines can provide suggestions and they build off of each other.

More specifically, related to this paper, I wonder how the results would generalize to datasets other than the three used here, or to real-world examples such as self-driving cars. Certainly, a lot more work needs to be done, and the system would need to be real-time, meaning human computation might not be a feasible method for self-driving cars. Still, it could certainly be used for generating training datasets for self-driving car algorithms.

This entire approach relies on the proposed prediction module, and it would be interesting to explore other edge cases where the predictions are better made by humans rather than through machine intelligence.

Finally, the finding that the computer segmented images more similarly to experts than crowd workers was interesting, and I wonder why—was it because the computer algorithms were trained on expert-generated training sets? Perhaps the crowd workers would perform better over time or with training. In that case, the results might have been better overall when combining the two.

Questions:

How might you use this approach in your class project?
Where does CV fail and where can humans augment it? What about the reverse?
What are the limitations of a “pull-the-plug” approach, and how can they be overcome?
Where else might this approach be used?

03/04/20 – Fanglan Chen – Real-time Captioning by Groups of Non-experts

March 4, 2020 Fanglan Chen

Summary

Lasecki et al.’s paper “Real-time Captioning by Groups of Non-experts” explores a new approach of relying on a group of non-expert captionists to provide speech captions of good quality, and presents an end-to-end system called LEGION:SCRIBE which allows collective instantaneous captioning for live lectures on demand. In the speech captioning task, professional stenographers can achieve high accuracy. However, their manual effort is very expensive and must be arranged in advance. For effective captioning, the researchers introduce the idea of having a group of non-experts caption audio and merging their inputs to achieve more accurate captions. Their proposed SCRIBE has two components: an interface for real-time captioning designed to collect the partial captions from each crowd worker, and a real-time input combiner that merges the collective captions into a single output stream. Their experiments show that the proposed solution is feasible and that non-experts can provide captioning of good quality and content coverage with short per-word latency. The proposed model could potentially be extended to allow dynamic groups to exceed the capacity of individuals in various human performance tasks.

Reflection

This paper conducts an interesting study of how to achieve better performance of a single task via collaborative efforts of a group of individuals. I think this idea aligns with ensemble modeling in machine learning. The idea presented in the paper is to generate multiple partial outputs (provided by team members and crowd workers) and then use an algorithm to automatically merge all of the noisy partial inputs into a single output. Similarly, ensemble modeling is a machine learning method where multiple diverse models are developed to generate or predict an outcome, either by using multiple different algorithms or using different training data sets. Then the ensemble model aggregates the output of each base model and generates the final output. The motivation for relying on a group of non-expert captionists to achieve better performance beyond the capacity of each non-expert corresponds to the idea of using ensemble models to reduce the generalization error and get more reliable results. As long as the base models are diverse and independent, the performance of the model increases when the ensemble approach is used. This approach also seeks the collaborative efforts of crowds in obtaining the final results. In both approaches, even though the model has multiple human/machine inputs as its sources, it acts and performs as a single model. I would be curious to see how ensemble models perform on the same task compared with the crowdsourcing proposed in the paper.
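The ensemble analogy can be made concrete with a toy merge. The sketch below assumes the partial captions are already aligned into word slots, with None where a worker missed a word, and picks the most common word per slot. This is a deliberate simplification for illustration: the function name and data are hypothetical, and the paper's combiner has to align the inputs itself.

```python
from collections import Counter
from typing import List, Optional


def merge_aligned_captions(partials: List[List[Optional[str]]]) -> List[str]:
    """Merge partial captions slot by slot using a majority vote.

    partials[w][i] is worker w's word for slot i, or None if they missed it.
    Slots that every worker missed are marked "[?]".
    """
    merged = []
    for slot in zip(*partials):
        votes = Counter(word for word in slot if word is not None)
        merged.append(votes.most_common(1)[0][0] if votes else "[?]")
    return merged


workers = [
    ["the", "quick", None, "fox", None],
    [None, "quick", "brown", "fix", None],
    ["the", None, "brown", "fox", None],
]
print(" ".join(merge_aligned_captions(workers)))  # the quick brown fox [?]
```

As with an ensemble of diverse models, individual gaps and typos are outvoted, but a word that no worker captured cannot be recovered by merging.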

In addition, I think the proposed framework in the paper may work for general audio captioning. I am wondering how it would perform on domain-specific lectures. As we know, lectures in many domains, such as medical science, chemistry, and psychology, are expected to include terminology that might be difficult to capture for an individual without a professional background in the field. There may be cases in which none of the crowd workers can type those terms correctly, which would result in an incorrect caption. I think the paper could be strengthened with a discussion of the kinds of situations in which the proposed method works best. To continue the point, another possibility is to leverage the advantages of pre-trained speech recognition models and crowd workers to develop a human-AI team that achieves the desired performance.

Discussion

I think the following questions are worthy of further discussion.

Would it be helpful if the recruiting process for crowd workers considered their backgrounds, especially for domain-specific lectures?
Although ASR may not be reliable on its own, is it useful to leverage it as a contributor to the input of crowd workers?
Is there any other potential to add a machine-in-the-loop component in the proposed framework?
What do you think about the proposed approach compared with the ensemble modeling that merges the outputs of multiple speech recognition algorithms to get the final results?

03/04/20 – Fanglan Chen – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

March 4, 2020 Fanglan Chen

Summary

Hara et al.’s paper “Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems” explores a crowdsourcing approach to locate and assess sidewalk accessibility issues by labeling Google Street View (GSV) imagery. Traditional approaches to sidewalk assessment rely on street audits, which are very labor intensive and expensive, or on calls from citizens reporting problems. The researchers propose their interactive user interface as an alternative that deals with this issue proactively. Specifically, they investigate the viability of labeling sidewalk issues with two groups of diligent and motivated labelers (Study 1) and then explore the potential of relying on crowd workers to perform the labeling task, evaluating performance at different levels of labeling accuracy (Study 2). In Study 1, the two groups are three members of the research team and three wheelchair users; its results provide ground truth labels for evaluating crowd workers’ performance and a baseline understanding of what labeling this dataset looks like. In Study 2, crowd workers’ performance is evaluated at both the image and pixel levels of labeling accuracy. The findings suggest that it is feasible to use crowdsourcing for the labeling and verification tasks, which leads to a final result of better quality.

Reflection

Overall, this paper proposes an interesting approach for sidewalk assessment. What I think about most is how feasibly we can use it to deal with real-world issues. In the scenario studied by the researchers, sidewalks in poor condition cause severe problems and relate to a larger accessibility issue in urban space. The proposed crowdsourcing approach is novel. However, if we take a close look at the data source, we may question to what extent it can support assessment in real time. It seems impossible to update the Google Street View (GSV) imagery on a daily basis. The images are historical rather than ones that reflect the current condition of the sidewalks.

I think image quality may be another big problem in this approach. The resolution of the GSV imagery is comparatively low, and some images were taken under poor lighting, which makes it challenging for crowd workers to make the correct judgement. It may be possible to use existing machine learning models to enhance image quality by increasing resolution or adjusting brightness. That could be a place to introduce the assistance of machine learning algorithms to achieve better results in the task.

In addition, the focal point of the camera is another issue that may reduce the scalability of the project. The GSV imagery is not collected solely for sidewalk accessibility assessment, so it usually contains a lot of noise (e.g. objects blocking the view). It would be interesting to conduct a study of what percentage of the GSV imagery is of good enough quality for the sidewalk assessment task.

Discussion

I think the following questions are worthy of further discussion.

Are there any other important accessibility issues that exist but were not considered in the study?
What improvements can you think of that the authors could make to their analysis?
What other potential human performance tasks can be explored by incorporating street view images?
How effectively do you think this approach can deal with urgent real-world problems?

03/04/2020 – Palakh Mignonne Jude – Pull the Plug? Predicting If Computers or Humans Should Segment Images

March 4, 2020 Palakh Mignonne Jude

SUMMARY

The authors of this paper aim to build a prediction system that is capable of determining whether the segmentation of images should be done by humans or computers, keeping in mind that there is a fixed budget of human annotation effort. They focus on the task of foreground object segmentation. To showcase the generalizability of their technique, they used image datasets from varied domains: the Biomedical Image Library with 271 grayscale microscopy images, Weizmann with 100 grayscale everyday object images, and Interactive Image Segmentation with 151 RGB everyday object images. They developed a resource allocation framework ‘PTP’ that predicts if it should ‘Pull The Plug’ on machines or humans. They conducted studies on both coarse segmentation and fine-grained segmentation. The ‘machine’ algorithms were selected from among the algorithms currently used for foreground segmentation, such as Otsu thresholding, Hough transform, etc. The prediction model was built using multiple linear regression. The 522 images from the 3 datasets mentioned earlier were given to crowd workers from AMT to perform coarse segmentation. The authors found that their proposed system was able to eliminate 30-60 minutes of human annotation time.
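To show what a multiple linear regression quality predictor looks like in practice, here is a minimal pure-Python sketch that fits one by ordinary least squares. The two features and all numbers are invented for illustration; the summary does not say which features the authors used or how they fit their model.

```python
from typing import List


def fit_linear_regression(features: List[List[float]], targets: List[float]) -> List[float]:
    """Least-squares fit of targets ~ intercept + weights . features.

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination.
    Returns [intercept, w1, w2, ...]. Assumes X^T X is invertible.
    """
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * y for r, y in zip(rows, targets))]
           for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for k in range(n):
            if k != col:
                factor = aug[k][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] for i in range(n)]


def predict(weights: List[float], feature_row: List[float]) -> float:
    """Predicted quality score for one segmentation."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Hypothetical features per segmentation (e.g. boundary smoothness, region contrast)
# and hypothetical quality scores used as training targets.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
y = [0.2, 0.7, 0.5, 1.0, 0.51]
weights = fit_linear_regression(X, y)
print(round(predict(weights, [0.8, 0.4]), 3))
```

A predictor of this kind gives each machine segmentation a score, and those scores are what a budgeted allocation step can then rank.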

REFLECTION

I liked the idea of the proposed system, which capitalizes on the strengths of both humans and machines and aims to identify when the skill of one or the other is more suited for the task at hand. It reminded me of reCAPTCHA (as highlighted by the paper ‘An Affordance-Based Framework for Human Computation and Human-Computer Collaboration’), which also utilized multiple affordances (both human and machine) in order to achieve a common goal.

I found it interesting to learn that this system was able to eliminate 30-60 minutes of human annotation time. I believe that if such a system were to be used effectively, it would enable developers to build systems faster and ensure that human efforts are not wasted in any way. I thought it was good that the authors attempted to incorporate variety when selecting their datasets. However, I believe it would have been interesting if the authors had combined these datasets with a few more that contained more complex images (ones with many objects that could have been in the foreground). I also liked that the authors have published their code as an open source repository for future extensions of their work.

QUESTIONS

As part of this study, the authors focus on foreground segmentation. Would the proposed system extend well in case of other object segmentation or would the quality of the segmentation and the performance of the system be hampered in any way?
While the authors have attempted to indicate the generalizability of their system by utilizing different data sets, the Weizmann and BU-BIL datasets were grayscale images with relatively clear foreground images. If the images were to contain multiple objects, would the amount of time that this system eliminated be as high? Is there any relation between the difficulty of the annotation task and the success of this system?
Have there been any new systems (since this paper was published) that attempt to build on top of the methodology proposed by the authors in this paper? What modifications/improvements could be made to this proposed system to improve it (if any improvement is possible)?

03/04/20 – Sukrit Venkatagiri – Toward Scalable Social Alt Text

March 4, 2020 Sukrit Venkatagiri

Paper: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind. In Fifth AAAI Conference on Human Computation and Crowdsourcing.

Summary:
This paper explores a variety of approaches for supporting blind and visually impaired (BVI) people with alt-text captions. The authors consider two baseline methods that use existing computer vision approaches: Vision-to-Language (V2L) and Human Corrected Captions (HCC). They also consider two workflows that do not depend on CV approaches: the TweetTalk conversational workflow and the Structured Q&A (SQA) workflow. Based on the questions asked in TweetTalk, they generated a set of structured questions to be used in the Structured Q&A workflow. They found that V2L performed the worst, and that overall, any approach with CV as a baseline did not perform well. Their TweetTalk conversational approach is more generalizable, but it is also difficult to recruit workers for it. Finally, they conducted a study of TweetTalk with 7 BVI people and learned that they found it potentially useful. The authors discuss their findings in relation to prior work, as well as the tradeoffs between human-only and AI-only systems, paid vs. volunteer work, and conversational assistants vs. structured Q&A. They also extensively discuss the limitations of this work.

Reflection:
Overall, I really liked this paper and found it very interesting. I think their multiple approaches to evaluating human-AI collaboration were interesting (AI alone, human-corrected, human chat, asynchronous human answers), in addition to the quality perception ratings that were obtained from third-party workers. I think this paper makes a strong contribution, but I wish the authors could have gone into more detail to clarify exactly how the system worked, the different experimental setups, and any other interesting findings. Sadly, there is an 8-page limit, which may have prevented them from going into more detail.

I appreciate the fact that they built on and used prior work in this paper, namely MacLeod et al. 2017, Mao et al. 2012, and Microsoft’s Cognitive Services API. This way, they did not need to build their own database, CV algorithms, or real-time crowdworker recruiting system. Instead, it allowed them to focus on more high-level goals.

Their findings were interesting, especially the fact that human-corrected CV descriptions performed poorly. It is unclear how satisfaction differs between the various conditions for first-party ratings. It may be that users gained context through conversation, but that context was not included in their ratings. The results also show that current V2L systems have worse accuracy than human-in-the-loop approaches. Sadly, there was no significant difference in accuracy between HCC and the descriptions generated after TweetTalk, but SQA improved significantly.

Finally, the validation with BVI users is welcome, and I believe more Human-AI work needs to actually work with real users. I wonder how the findings might differ if they were used in a real, social context, or with people on MTurk instead of the researchers-as-workers.

Overall, this was a great paper to read, and I hope others build on this work, similar to how the authors here have directly leveraged prior work to advance our understanding of human-AI collaboration for alt-text generation.

Questions:

Are there any better human-AI workflows that might be used that the authors did not consider? How would they work and why would they be better?
What are the limitations of CV that led to the findings in this paper that any approach with CV performed poorly?
How would you validate this system in the real world?
What are some other next steps for improving the state of the art in alt-text generation?

03/04/2020 – Mohannad Al Ameedi – Real-Time Captioning by Groups of Non-Experts

March 4, 2020 mohada4

Summary

In this paper, the authors propose a low-latency captioning solution for deaf and hard of hearing people that can work in a real-time setting. Although solutions already exist, they are either very expensive or of low quality. The proposed system allows people with hearing disabilities to request captioning at any time and get the result in a few seconds. The system depends on a combination of non-expert crowdsourcing workers and local staff to provide the captioning. Each request is handled by multiple people, and the result is a combination of all the participants’ input. The request is submitted as an audio stream, and the result is returned as text. A crowdsourcing platform is used to submit the request, and the result is retrieved in seconds. The proposed system uses an algorithm that works in a streaming manner: the input is processed as it is received, and the results are aggregated at the end. The system outperforms all other available options in both coverage and accuracy. The proposed solution is feasible to apply in a production setting.

Reflection

I found the idea of real-time captioning very interesting. My understanding was that crowdsourcing always involves latency and so cannot be applied in real-world scenarios. It will be interesting to know how the system works when the number of users increases.

I also found the concept of multiple people working on the same audio stream and combining the result very interesting. Collecting captions from multiple people and then trying to figure out what is unique and what is duplicate and producing a final sentence, paragraph, or script is a challenging task.

This work is like multiple people working on one task, or multiple developers writing code to implement a single feature. Normally the supervisor or development lead merges the results, but in this case the algorithm takes care of the merge.

Questions

The authors measured the system with a limited number of users. Do you think the system will continue to outperform other methods if it is deployed in a real-world setting?
Since live streaming is increasingly common at work, school, and other places, can we use the same concept to pass a URL and get instant captioning? What are the limitations of this approach?
What are the privacy concerns with this approach, especially if it is used in the medical field? Normally a limited number of people are hired to help with such tasks, while crowdsourcing is open to a wide range of people.