3/4/20 – Jooyoung Whang – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

March 4, 2020 Jooyoung Whang

In this paper, the authors study the effectiveness of vision-to-language systems for automatically generating alt text for images and the impact of adding a human in the loop for this task. The authors set up four methods for generating alt text. The first is a simple implementation of modern vision-to-language alt text generation. The second is a human-adjusted version of the first method. The third is a more involved one, where a Blind or Vision Impaired (BVI) user chats with a non-BVI user to gain more context about an image. The final method is a generalized version of the third, where the authors analyzed the patterns of questions asked during the third method to form a structured set of pre-defined questions that a crowdsource worker can answer directly without the need for a lengthy conversation. The authors conclude that current vision-to-language techniques can, in fact, harm context understanding for BVI users, and that simple human-in-the-loop methods significantly outperform them. They also found that the structured-questions method worked best.

This was an interesting study that implicitly pointed out the limitation of computers in understanding social context, which is a human affordance. The authors stated that the results of a vision-to-language system often confused the users because the system did not get the point. This made me wonder if the current limitation could be overcome in the future.

I was also concerned about whether the authors’ proposed methods were even practical. Sure, the human-in-the-loop method involving MTurk workers greatly enhanced the description of a Twitter image, but based on their report, it would take too long to retrieve the description. The paper reports that answering one of the structured questions takes, on average, 1 minute. This excludes the time it takes for an MTurk worker to accept a HIT. The authors suggested pre-generating alt texts for popular Tweets, but this does not completely solve the problem.

I was also skeptical about the way the authors performed validation with the 7 BVI users. In their validation, they simulated their third method (TweetTalk, a conversation between BVI and sighted users). However, they did not do it by using their application, but rather through a face-to-face conversation between the researchers and the participants. The authors claimed that they tried to replicate the environment as much as possible, but I think there can still be flaws, since the researchers serving as the sighted user already had expert knowledge about their experiment. Also, as stated in the paper’s limitations section, the validation was performed with too few participants. This may not fully capture the BVI users’ behaviors.

These are the questions that I had while reading this paper:

1. Do you think the authors’ proposed methods are actually practical? What could be done to make them practical if you don’t think so?

2. What do you think were the human affordances needed for the human element of this experiment other than social awareness?

3. Do you think the authors’ validation with the BVI users is sound? Also, the validation was only done for the third method. How can the validation be done for the rest of the methods?

03/04/2020 – Vikram Mohanty – Combining crowdsourcing and google street view to identify street-level accessibility problems

March 4, 2020 Vikram Mohanty

Authors: Kotaro Hara, Vicki Le, and Jon Froehlich

Summary

This paper discusses the feasibility of using AMT crowd workers to label sidewalk accessibility problems in Google Street View. The authors created ground truth datasets with the help of wheelchair users and found that Turkers reached an accuracy of 81%. The paper also discusses some quality control and improvement methods, which were shown to be effective, i.e., they improved the accuracy to 93%.

Reflection

This paper reminded me of Jeff Bigham’s quote – “Discovery of important problems, mapping them onto computationally tractable solutions, collecting meaningful datasets, and designing interactions that make sense to people is where HCI and its inherent methodologies shine.” It’s a great example of two important things mentioned in the quote: a) discovery of important problems, and b) collecting meaningful datasets. The paper’s contribution mentions that the datasets collected will be used for building computer vision algorithms, and this paper’s workflow involves the potential end-users (wheelchair users) early on in the process. Further, the paper attempts to use Turkers to generate datasets that are comparable in quality to those of the wheelchair users, essentially setting a high quality standard for generating potential AI datasets. This is a desirable approach for training datasets, which can potentially help prevent problems in popular datasets as outlined here: https://www.excavating.ai/.

The paper also proposed two generalizable methods for improving data quality from Turkers. Filtering out low-quality workers during data collection by seeding in gold standard data may require designing modular workflows, but the time investment may well be worth it.
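The gold-standard filtering idea mentioned above can be illustrated with a minimal sketch. The data layout, the function name `filter_workers_by_gold`, and the 0.8 threshold are all assumptions made for illustration; they are not taken from the paper.

```python
def filter_workers_by_gold(responses, gold, min_accuracy=0.8):
    """Keep only workers whose accuracy on seeded gold items meets a threshold.

    responses: {worker_id: {item_id: label}} (hypothetical layout)
    gold:      {item_id: known_correct_label} for the seeded items
    """
    kept = {}
    for worker, answers in responses.items():
        # Only the seeded gold items can be scored.
        scored = [item for item in answers if item in gold]
        if not scored:
            # A worker who saw no gold items cannot be vetted; drop them here.
            continue
        correct = sum(answers[item] == gold[item] for item in scored)
        if correct / len(scored) >= min_accuracy:
            kept[worker] = answers
    return kept


gold = {"img1": "curb_ramp_missing", "img2": "no_problem"}
responses = {
    "w1": {"img1": "curb_ramp_missing", "img2": "no_problem", "img3": "obstacle"},
    "w2": {"img1": "no_problem", "img2": "no_problem", "img3": "no_problem"},
}
trusted = filter_workers_by_gold(responses, gold)
```

In this toy run, only `w1` passes, so only that worker's labels for the unseeded item `img3` would be retained.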

It’s great to see how this work evolved to now form the basis for Project Sidewalk, a live project where volunteers can map accessibility areas in the neighborhood.

Questions

What’s your usual process for gathering datasets? How is it different from this paper’s approach? Would you be willing to involve potential end-users in the process?
What would you do to ensure quality control in your AMT tasks?
Do you think collecting more fine-grained data for training CV algorithms will come at a trade-off for the interface not being simple enough for Turkers?

03/04/2020- Bipasha Banerjee – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

March 4, 2020 bipashab

Summary

The paper by Hara et al. attempts to address the problem of sidewalk accessibility by using crowd workers to label the data. The authors made several contributions beyond simply having crowd workers label images. They conduct two studies: a feasibility study and an online crowdsourcing study using AMT. The first study aims to find out how practical it is to label sidewalks using reliable crowd workers (experts). This study also gives an idea of the baseline performance and provides validated ground truth data. The second study aims to find out the feasibility of using Amazon Mechanical Turk workers for this task. They evaluated accuracy at the image level as well as the pixel level. The authors conducted a thorough background study on current sidewalk accessibility issues, current audit methods, and crowdsourcing and image labeling. They were successful in showing that untrained crowd workers could identify and label sidewalk accessibility issues correctly in Google Street View imagery.

Reflection

Combining crowdsourcing and Google Street View to identify street-level accessibility problems is essential and useful for people. The paper was an interesting read and the authors described the system well. In the video [1], the authors show the instructions for the workers. The video gave a fascinating insight into how the task was designed for the workers, explaining every labeling task in detail.

The paper mentions accessibility, but the authors have restricted their research to wheelchair users. This works for the first study, as they are able to label the obstacles correctly, and this gives us the ground truth data for the next study as well as establishes the feasibility of using crowd workers to identify and label accessibility problems effectively. However, accessibility problems on sidewalks are also faced by other groups, such as people with reduced vision. I am curious to see how the experiments would differ if the user group and its needs changed.

The experiments are based on Google Street View, which is known to have shortcomings at times. Some apps, such as Waze [2], help people get real-time traffic updates while driving. I wonder whether it would be beneficial if Google Maps or another app provided similar dynamic updates for sidewalks. This would not only help people but also help the authorities determine which sidewalks are frequently used and which issues people face most often. The paper is a bit old, but newer technology would surely help users. The paper [3] by the same author is a massive advancement in collecting sidewalk accessibility data and is a good read based on the latest technology.

The paper mentions that active feedback to crowd workers would help improve labeling tasks. I think that dynamic, real-time feedback would be immensely helpful. I understand that this is challenging to implement with crowd workers, but an internal study could be conducted. For this, two or more people would need to work simultaneously, with one labeling and the rest giving feedback, or in some other combination.

Questions

Sidewalk accessibility has been discussed for people with accessibility problems. The authors considered people in wheelchairs for their studies. I understand that such people would be needed for study 1, where labeling is a factor. However, how does the idea extend to people with other accessibility issues, such as reduced vision?
This paper was published in 2013. The authors mention in the conclusion section that improvements in GSV and computer vision will help overall. Has any further study been conducted? How much modification of the current system is needed to accommodate the advancement in GSV and computer vision in general?
Can dynamic feedback to workers be implemented?

References

[1] https://www.youtube.com/watch?v=aD1bx_SikGo

[2] https://www.waze.com/waze

[3] http://kotarohara.com/assets/Papers/Saha_ProjectSidewalkAWebBasedCrowdsourcingToolForCollectingSidewalkAccessibilityDataAtScale_CHI2019.pdf

03/04/2020 – Nurendra Choudhary – Real-time captioning by groups of non-experts

March 4, 2020 Nurendra Choudhary

Summary

In this paper, the authors discuss a collaborative real-time captioning framework called LEGION:SCRIBE. They compare their system against the previous approach, called CART, and against Automated Speech Recognition (ASR) systems. The authors begin the discussion with the benefits of captioning. They proceed to explain the high cost of hiring stenographers. Stenographers are the fastest and most accurate captioners, with access to specialized keyboards and expertise in the area. However, they are prohibitively expensive ($100-120 an hour). ASR is much cheaper, but its low accuracy renders it inapplicable in most real-world scenarios.

To alleviate these issues, the authors introduce the SCRIBE framework. In SCRIBE, crowd workers caption smaller parts of the speech. The parts are merged using an independent framework to form the final sentence. The latency of the system is 2.89s, emphasizing its real-time nature, which is a significant improvement over the ~5s of CART.
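To make the idea of merging partial captions concrete, here is a deliberately simplified sketch. It is not SCRIBE's actual merging framework, which this summary does not describe; it only assumes that each worker produces `(timestamp_in_seconds, word)` pairs and drops repeats of the same word typed by different workers within a short window. The function name and the 0.5s window are hypothetical.

```python
def merge_partial_captions(worker_words, window=0.5):
    """Merge timestamped words from several workers into one word sequence.

    worker_words: list of per-worker lists of (timestamp_seconds, word).
    A word is treated as a duplicate if the same word (case-insensitive)
    was already kept within `window` seconds before it.
    """
    # Pool every worker's words and order them by time.
    pooled = sorted((t, w.lower()) for words in worker_words for t, w in words)
    merged = []
    for t, w in pooled:
        # Compare only against the most recent kept words.
        if any(w == kept_w and t - kept_t <= window for kept_t, kept_w in merged[-5:]):
            continue
        merged.append((t, w))
    return [w for _, w in merged]


workers = [
    [(0.0, "the"), (0.4, "quick")],
    [(0.45, "quick"), (0.9, "brown"), (1.3, "fox")],
    [(0.05, "The"), (1.35, "fox")],
]
caption = merge_partial_captions(workers)
```

No single worker typed the full phrase here, yet the merged output covers it, which is the intuition behind splitting the captioning work across non-experts.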

Reflection

The paper introduces an interesting approach to collating data from multiple crowd workers for sequence learning tasks. The method has been applied before in cases such as Google Translate (translating small phrases) and ASR (voice recognition of speech segments). However, SCRIBE distinguishes itself by bringing real-time improvement to the system. However, the system relies on the availability of crowd workers, which may lead to unreliable behaviour. Additionally, the hired workers are not professionals. Hence, the quality is affected by human behavioral features such as mindset, emotions, or mental stamina. I believe the evolution of SCRIBE over time and its dependence on such features needs to be studied.

Furthermore, I question the crowd management system. Amazon MT cannot guarantee real-time labourers. Currently, given the supply of workers with respect to the tasks, workers are always available. However, as more users adopt the system, this need not always hold true. So, crowd management systems should provide alternatives that guarantee such requirements. Also, the work provider needs to find alternatives to maintain real-time interaction in case the crowd fails. In the case of SCRIBE, the authors could fall back on an ASR module when the crowd fails. ASR may not give the best results, but it would ensure a smoother user experience.

The current development system does not consider the volatility of crowd management systems. This makes them an external single point of failure. I think there should be a push toward simultaneously adopting multiple management systems for the framework to increase its reliability. This would also improve system efficiency because there would be a more diverse set of results to choose from, thus benefiting the overall model structure and user adoption.

Questions

Google Translate uses a similar strategy by asking its users to translate parts of sentences. Can this technique be globally applied to any sequential learning framework? Is there a way we can divide sequences into independent segments? In the case of dependent segments, can we just use a similar merging module or is it always problem-dependent?
The system depends on the availability of crowd workers. Should there be a study on the availability aspect? What kind of systems would benefit from this?
Should there be a new crowd work management system with a sole focus on providing real-time data provisions?
Should the responsibility of ensuring real-time nature be on the management system or the work provider? How will it impact the current development framework?

3/4/20 – Jooyoung Whang – Pull the Plug? Predicting If Computers or Humans Should Segment Images

March 4, 2020 Jooyoung Whang

In this paper, the authors attempt to appropriately distribute human and computer resources for segmenting foreground objects in an image in order to achieve highly precise segmentations. They describe the segmentation process as roughly segmenting the image (initialization) and then going through another fine-grained iteration to come up with the final result. They repeat their study for both of the steps. To figure out where to allocate human resources, the authors proposed an algorithm that scores the acquired segmentations by detecting highly jagged edges on the boundary, non-compact segmentations, near-edge segmentation locations, and the ratio of the segmentation area to the full image. The authors find that a mix of humans and computers for image segmentation performs better than relying completely on one or the other.
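A few of the visual cues listed above can be computed from a binary mask in a handful of lines. The sketch below is only an illustration under assumptions of my own: it uses the isoperimetric quotient 4πA/P² as a stand-in for compactness and counts exposed pixel edges as the perimeter. The paper's exact feature definitions are not given in this summary, and the function name `shape_cues` is hypothetical.

```python
import numpy as np


def shape_cues(mask):
    """Compute simple shape cues for a binary segmentation mask (2D array)."""
    mask = np.asarray(mask, dtype=bool)
    area = int(mask.sum())

    # Pad with background so that rolling the array never wraps foreground around.
    padded = np.pad(mask, 1, constant_values=False)
    perimeter = 0
    for axis, shift in ((0, 1), (0, -1), (1, 1), (1, -1)):
        neighbor = np.roll(padded, shift, axis=axis)
        # Count foreground pixel sides that face a background pixel.
        perimeter += int((padded & ~neighbor).sum())

    # Assumed compactness measure: higher for blob-like regions, lower for
    # thin or jagged ones.
    compactness = 4 * np.pi * area / perimeter**2 if perimeter else 0.0
    area_ratio = area / mask.size
    touches_border = bool(
        mask[0, :].any() or mask[-1, :].any() or mask[:, 0].any() or mask[:, -1].any()
    )
    return {
        "compactness": compactness,
        "area_ratio": area_ratio,
        "touches_border": touches_border,
    }


square = np.zeros((10, 10), dtype=bool)
square[3:7, 3:7] = True   # compact 4x4 blob in the middle
line = np.zeros((20, 20), dtype=bool)
line[0, 2:18] = True      # thin 1x16 strip along the top border

square_cues = shape_cues(square)
line_cues = shape_cues(line)
```

The thin strip along the border scores lower on compactness than the centered square and is flagged as touching the image edge, which matches the intuition that such segmentations are more suspicious.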

I liked the authors’ proposed algorithm to detect when a segmentation fails. It was interesting to see that they focused on visible features and qualities that humans can see instead of relying on deep neural networks, whose internal workings are often hard to interpret. At the same time, I am a little concerned about whether the proposed visual features for failed segmentations are enough to generalize and scale to all kinds of images. For example, the authors note that failed segmentations often have highly jagged edges. What if the foreground object (or an animal in this case) was a porcupine? The score would be fairly low even when an algorithm correctly segments the creature from the background. Of course, the paper reports that the method generalized well for everyday images and biomedical images, so my concern may be a trivial one.

As I am not experienced in the field of image segmentation analysis, I wondered if there were any cases where an image contained more than one foreground object and only one of them was of interest to a researcher. From my limited knowledge of foreground and background separation, a graph search is done by treating the image as a graph of connected pixels to find pixels that stand out. It does not care about “objects of interest.” It made me curious whether it was possible to give additional semantic information in the process.

The following are the questions that I had while reading the paper:

1. Do you think the qualities that PTP looks for are enough to measure the quality of segmented images? What other properties would a failed segmentation have? One quality I can think of is that failed segmentations often have disjoint parts.

2. Can you think of some cases where PTP could fail? Would there be any case where a segmentation scores really low even if the segmentation was done correctly?

3. As I’ve written in my reflection, are there methods that allow segmentation algorithms to consider the “interest” for an object? For example, if an image contained a car and a cat both in the foreground and the researcher was interested in the cat, would the algorithm be able to only separate out the cat?

03/04/20 – Lulwah AlKulaib- SocialAltText

March 4, 2020 Lulwah AlKulaib

Summary

The authors propose a system to generate alt text for images embedded in social media posts by utilizing crowd workers. Their goal is to provide a better experience for blind and visually impaired (BVI) users of social media. Existing tools provide imperfect descriptions, some by automatic caption generation and others by object recognition. These systems are not enough, as in many cases their results aren’t descriptive enough for BVI users. The authors study how crowdsourcing can be used for both:

Evaluating the value provided by existing automated approaches
Enabling workflows that provide scalable and useful alt text for BVI users

They utilize real-time crowdsourcing to run experiments with varying depths of crowd interaction in assisting visually impaired users. They show the shortcomings of existing AI image captioning systems and compare them with their method. The paper suggests two experiences:

TweetTalk: a conversational assistant workflow.
Structured Q&A: a workflow that builds upon and enhances state-of-the-art generated captions.

They evaluated the conversational assistant with 235 crowdworkers. They evaluated 85 tweets for the baseline image caption; each tweet was evaluated 3 times, for a total of 255 evaluations.

Reflection

The paper presents a novel concept, and the approach is a different take on utilizing crowdworkers. I believe that the experiment would have worked better if they had tested it on some visually impaired users. Since the crowdworkers hired were not visually impaired, it is harder to say that BVI users would have the same reaction. Since the study targets BVI users, they should have been the pool of testers. People interact with the same element in different ways, and what the authors showed seemed too controlled. Also, the questions were not the same for all images, which makes this harder to generalize. The presented model tries to solve a problem for social media photos, and not having a plan to repeat for each photo might make interpreting images difficult.

I appreciated the authors’ use of existing systems and their attempt at improving the AI-generated captions. Their results achieve better accuracy compared to state-of-the-art work.

I would have loved to see how different social media applications compare with each other, since applications vary in how they present photos. Twitter, for example, limits the character count, while Facebook can present more text, which might help BVI users understand the image better.

In the limitations section, the authors mention that human-in-the-loop workflows raise privacy concerns and that the alt text approach would generalize to friendsourcing and utilizing social network users. I wonder how that generalizes to social media applications in real time, and how reliable friendsourcing would be for BVI users.

Discussion

What are improvements that you would suggest to better the TweetTalk experiment?
Do you know of any applications that use human in the loop in real time?
Would you have any privacy concerns if one of the social media applications integrated a human in the loop approach to help BVI users?

03/04/2020 – Ziyao Wang – Combining crowdsourcing and google street view to identify street-level accessibility problems

March 4, 2020 Ziyao Wang

In this paper, the authors focus on using untrained crowdworkers to find and label accessibility problems in Google Street View imagery. They provide workers with Google Street View images and ask them to find, label, and assess sidewalk accessibility problems. They compare the results of this labeling task as completed by six dedicated labelers, including three wheelchair users, and by MTurk workers. The comparison shows that crowdworkers can determine the presence of an accessibility problem with high accuracy, which means this mechanism is promising for assessing sidewalk accessibility. However, the mechanism still has problems, such as locating the GSV camera in geographic space and selecting an optimal viewpoint, measuring sidewalk width, and the age of the images. In the experiments, the workers could not label some of the images due to camera position, and the images may have been captured three years earlier. Additionally, there is no method to measure the width of the sidewalk, which wheelchair users need.

Reflections:

The authors combined Google Street View imagery and MTurk crowdsourcing to build a system that can detect accessibility challenges. This kind of hybrid system achieves high accuracy in finding and labeling such accessibility challenges. If this system can be used in practice, people with disabilities will benefit greatly from it.

However, there are some problems in the system. As mentioned in the paper, the images in Google Street View are old. Some of the images may have been captured years ago. If the detection is based on these pictures, some newer accessibility problems will go undetected. For this problem, I have a rough idea: let the users of the system update the image library. When they find a difference between the images in the library and the actual sidewalk, they can upload the latest pictures they have taken. As a result, other users will not suffer from the image age problem. However, this solution would change the whole system. Google Street View imagery requires professional capture devices, which are not available to most users. As a result, Google Street View will not update its imagery using photos captured by users, and the system cannot update itself using that imagery. Instead, the system would have to build its own image library, which is totally different from the system introduced in the paper. Additionally, the photos provided by users may have low resolution, making it difficult for the MTurk workers to label the accessibility challenges.

Similarly, the problem that the workers cannot measure the width of the sidewalk could be solved if users could upload the width when they are using the system. However, this still faces the problem of lacking its own database, and the system would need to be modified substantially.

Instead of detecting accessibility challenges, I think the system is more useful for tracking and labeling bike lanes. Compared with sidewalk accessibility, detecting the existence of bike lanes suffers less from the age problem, because even if the bike lanes were built years ago, they can still be used. Also, there is no need to measure the width of the lanes, as all lanes should have enough space for bikes to pass.

Question:

Is there any approach to solve the age problem, camera point problem and measuring width problem in the system?

What do you think about applying such a system to track and label bike lanes?

What other kinds of street detection problems can this system be applied to?

03/04/2020- Ziyao Wang – Real-time captioning by groups of non-experts

March 4, 2020 Ziyao Wang

Traditional real-time captioning is done by professional captionists, who are expensive to hire. Alternatively, some automatic speech recognition systems have been developed, but these systems perform badly when the audio quality is low or multiple people are talking. In this paper, the authors developed a system that hires several non-expert workers to do the captioning task and merges their work to obtain a high-accuracy caption. As the workers are paid significantly less than the experts, the cost is reduced even when multiple workers are hired. The system also performs well at collecting the workers’ partial captions and merging them into a high-accuracy output with low latency.

Reflections:

When it comes to problems that require high accuracy and low latency, I had always held the view that only AI or experts could complete such tasks. However, in this paper, the authors showed that non-experts can also complete this kind of task if a group of people works together.

Compared with professionals, hiring non-experts costs much less. Compared with AI, people can handle some complicated situations better. This system combines these two advantages and provides a cheap real-time captioning system with high accuracy.

This system certainly has many advantages, but we should still consider it critically. In terms of cost, it is true that hiring non-experts costs much less than hiring professional captionists. However, the system needs to hire 10 workers to get 80 to 90 percent accuracy. Even if the workers are paid a low wage, for example 10 dollars per hour, the total cost will reach 100 dollars per hour. Hiring an expert costs only around 120 dollars per hour, which shows that the savings from applying the system are relatively low.

In terms of accuracy, there is a possibility that all 10 workers miss the same part of the audio. As a result, even after merging all the results provided by the workers, the system will still miss the caption for this part. In contrast, though an AI system may produce captions with errors, it can at least provide something for every word in the audio.

For these two reasons, I think hiring fewer workers, for example three to five, to fix the errors in the system-generated caption would save more money while still maintaining high accuracy. With a caption already provided, the workers’ tasks would be easier, and they might produce more accurate results. Also, in circumstances where the AI system performs well, the workers would not need to spend time typing, and the latency of the system would be reduced.
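The cost argument in this reflection can be worked through explicitly. The figures below are the ones used in the reflection itself (10 workers, an example wage of 10 dollars per hour, roughly 120 dollars per hour for an expert); the function name is made up for illustration, and the hybrid estimate ignores any cost of running the ASR system, which is an assumption.

```python
def hourly_cost(num_workers, wage_per_hour):
    """Total hourly cost of a crowd of equally paid workers."""
    return num_workers * wage_per_hour


EXPERT_RATE = 120   # dollars per hour, the approximate figure used in the reflection
EXAMPLE_WAGE = 10   # dollars per hour, the reflection's example crowd wage

crowd_cost = hourly_cost(10, EXAMPLE_WAGE)      # ten workers captioning from scratch
savings = EXPERT_RATE - crowd_cost
savings_fraction = savings / EXPERT_RATE

# Proposed alternative: three to five workers correcting an ASR caption.
hybrid_costs = [hourly_cost(n, EXAMPLE_WAGE) for n in (3, 4, 5)]

print(f"Crowd only: ${crowd_cost}/h, saving ${savings}/h ({savings_fraction:.0%})")
print(f"Hybrid (3 to 5 workers): ${hybrid_costs[0]}-${hybrid_costs[-1]}/h")
```

Under these assumed numbers, the crowd-only setup saves about one sixth of the expert cost, while the smaller correction crew would cost less than half of it.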

Questions:

What are the advantages of hiring non-expert humans to do the captioning compared with the experts or AI systems?

Will a system that hires fewer workers to fix the errors in the AI-generated caption be cheaper? Will this system perform better?

For the system mentioned in the second question, does it have any limitations or drawbacks?

03/04/2020 – Palakh Mignonne Jude – Combining Crowdsourcing and Google Street View To Identify Street-Level Accessibility Problems

March 4, 2020 Palakh Mignonne Jude

SUMMARY

The authors of this paper aim to investigate the feasibility of recruiting MTurk workers to label and assess sidewalk accessibility problems as viewed through Google Street View. The authors conducted two studies: the first with 6 people (3 from their team of researchers and 3 wheelchair users) and the second investigating the performance of turkers. The authors created an interactive labeling interface as well as a validation interface (to help users accept or reject previous labels). The authors proposed different levels of annotation correctness comprising two spectra: a localization spectrum, which includes image-level and pixel-level granularity, and a specificity spectrum, which includes the amount of information evaluated for each label. They defined image-level correctness in terms of accuracy, precision, recall, and f-measure. In order to compute inter-rater agreement at the image level, they utilized Fleiss’ kappa. In order to evaluate the more challenging pixel-level agreement, they aimed to verify the labeling by showing that pixel-level overlap was greater between labelers on the same image than across different images. The authors used the labels produced in Study 1 as the ground truth dataset to evaluate turker performance. The authors also proposed two quality control approaches: filtering turkers based on a performance threshold and filtering labels based on crowdsourced validations.
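For readers unfamiliar with the image-level measures named above, the following sketch shows the standard textbook definitions of accuracy, precision, recall, f-measure, and Fleiss' kappa. The toy labels and function names are invented for illustration and do not come from the paper's data.

```python
def binary_metrics(truth, predicted):
    """Accuracy, precision, recall, and F-measure for binary (0/1) labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, predicted))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, predicted))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, predicted))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(truth)
    return accuracy, precision, recall, f_measure


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1:
        # Degenerate case (one category only); returned as 1.0 in this sketch.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Toy example: 1 = "accessibility problem present", 0 = "no problem".
metrics = binary_metrics(truth=[1, 1, 0, 0], predicted=[1, 0, 1, 0])
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])   # three raters fully agree
kappa_split = fleiss_kappa([[2, 1], [1, 2]])     # raters split on every image
```

Kappa corrects raw agreement for the agreement expected by chance, which is why the split example comes out negative even though two of three raters agree on each image.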

REFLECTION

I really liked the motivation of this paper, especially given the large number of people who have physical disabilities. I am very interested to know how something like this would extend to other countries such as India, where it would greatly aid people with physical disabilities, since there are many places that have poor walking surfaces and no support for wheelchairs. I think that having such a system in place in India would definitely help disabled people be better informed about places that can be visited.

I also liked the quality control mechanisms of filtering turkers and filtering labels, since these appear to be good ways to improve the overall quality of the labels obtained. I thought it was interesting that the performance of the system improved with turker count but the gains diminished in magnitude as the group size grew. I thought that the design of the labelling and verification interface was good and that it made it easy for users to perform their tasks.

QUESTIONS

As indicated in the limitations section, this work ‘ignored practical aspects such as locating the GSV camera in geographical space and selecting an optimal viewpoint’. Has any follow-up study been performed that takes into account these physical aspects? How complex would it be to conduct such a study?
The authors mention that image quality can be poor in some cases due to a variety of factors. How much of an impact would this cause to the task at hand? Which labels would have been most affected if the image quality was very poor?
The validation of labels was performed by crowd workers via the verification interface. Would there have been any change in the results obtained if experts had been used for the validation of labels instead of crowd workers (since they may have been able to identify more errors in the labels as compared to normal crowd workers)?

03/04/20 – Akshita Jha – Pull the Plug? Predicting If Computers or Humans Should Segment Images

March 4, 2020 Akshita Jha

Summary:
“Pull the Plug? Predicting If Computers or Humans Should Segment Images” by Gurari et al. talks about image segmentation. They propose a resource allocation framework that tries to predict when best to use a computer for segmenting images and when to switch to humans. Image segmentation is the process of “partitioning a single image into multiple segments” in order to simplify the image into something that is easier to analyze. The authors implement two systems that decide when to replace humans with computers to create fine-grained segments and when to replace computers with humans in order to get coarse segments. They demonstrate through experiments that this mixed model of humans and computers beats state-of-the-art systems for image segmentation. The authors use the resource allocation framework, “Pull the Plug”, on humans or computers. They do this by giving the system an image and trying to predict whether an annotation should come from a human or a computer. The authors evaluate the model using Pearson’s correlation coefficient (CC) and mean absolute error (MAE). CC indicates the strength of the correlation between the predicted scores and the actual scores given by the Jaccard index on the ground truth. MAE is the average prediction error. The authors thoroughly experiment with initializing segmentation tools and reducing human effort for initialization.
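The three evaluation quantities mentioned in this summary have standard definitions, sketched below with NumPy. The masks and scores are made-up toy values, and the helper names are my own; this is not the authors' evaluation code.

```python
import numpy as np


def jaccard_index(mask_a, mask_b):
    """Intersection over union of two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as identical in this sketch
    return float(np.logical_and(a, b).sum() / union)


def pearson_cc(predicted, actual):
    """Pearson's correlation coefficient between predicted and actual scores."""
    return float(np.corrcoef(predicted, actual)[0, 1])


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual scores."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(diff)))


# Actual quality of a candidate segmentation: Jaccard index against ground truth.
ground_truth = [[1, 1], [0, 0]]
candidate = [[1, 0], [0, 0]]
actual_quality = jaccard_index(candidate, ground_truth)

# Toy predicted vs. actual quality scores for three segmentations.
predicted_scores = [0.1, 0.5, 0.9]
actual_scores = [0.2, 0.6, 1.0]
cc = pearson_cc(predicted_scores, actual_scores)
mae = mean_absolute_error(predicted_scores, actual_scores)
```

The toy scores illustrate why both measures are reported: the predictions are perfectly correlated with the actual scores, yet they are still off by a constant amount that only MAE captures.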

Reflections:
This is an interesting work that successfully makes use of mixed modes involving both humans and computers to improve the precision and accuracy of a task. The two methods that the authors design for segmenting an image were particularly thoughtful. First, given an image, the authors design a system that tries to predict whether the image requires fine-grained segmentation or coarse-grained segmentation. This is non-trivial, as the task requires the system to possess a certain level of “intelligence”. The authors use segmentation tools, but the motivation of the system design is to remain agnostic to these particular segmentation tools. The system ranks several segmentation tools by using a tool designed by the authors to predict the quality of the segmentation. The system then allocates the available human budget to create coarse segmentations. The second system tries to capture whether an image requires fine-grained segmentation or not. It does this by building on the coarse segmentation given by the first system. The second system refines the segmentation and allocates the available human budget to create fine-grained segmentations for segmentations with low predicted quality. Both these tasks rely on the system proposed by the authors to predict the quality of candidate segmentations.

Questions:
1. The authors rely on their proposed system of predicting the quality of candidate segmentations. What kind of errors do you expect?
2. Can you think of a way to improve this system?
3. Can we replace the segmentation quality prediction system with a human? Do you expect the system to improve or would the performance go down? How would it affect the overall experience of the system?
4. In most such systems, humans are needed only for annotation. Can we think of more creative ways to engage humans while improving the system performance?