03/04/2020- Bipasha Banerjee – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

The paper by Hara et al. addresses the problem of sidewalk accessibility by using crowd workers to label street-level imagery. Beyond having crowd workers label images, the authors make several contributions. They conduct two studies: a feasibility study and an online crowdsourcing study using AMT. The first study aims to find out how practical it is to label sidewalks using reliable crowd workers (experts). This study also gives an idea of the baseline performance and provides validated ground truth data. The second study aims to find out the feasibility of using Amazon Mechanical Turk workers for this task. The authors evaluate accuracy at both the image level and the pixel level. They have also conducted a thorough background study of current sidewalk accessibility issues, current audit methods, and prior work on crowdsourcing and image labeling. They successfully show that untrained crowd workers can identify and label sidewalk accessibility issues correctly in Google Street View imagery.

Reflection

Combining crowdsourcing and Google Street View to identify street-level accessibility problems is essential and useful for people. The paper was an interesting read and the authors described the system well. In the video [1], the authors show the instructions for the workers. The video gave a fascinating insight into how the task was designed for the workers, explaining every labeling task in detail.

The paper mentions accessibility, but the authors have restricted their research to wheelchair users. This works for the first study, as the labelers are able to label the obstacles correctly, which gives us the ground truth data for the next study and establishes the feasibility of using crowd workers to identify and label accessibility problems effectively. However, accessibility problems on sidewalks are also faced by other groups, such as people with reduced vision. I am curious to see how the experiments would differ if the user group and its needs changed.

The experiments are based on Google Street View, which is not always reliable. Some apps, such as Waze [2], help people get real-time traffic updates while driving. It would be beneficial if Google Maps or another app offered similar dynamic updates for sidewalks. This would not only help people but also help the authorities determine which sidewalks are frequently used and which issues people face most often. The paper is a bit old, but newer technology would surely help users. The paper [3] by the same author is a massive advancement in collecting sidewalk accessibility data and is a good read based on the latest technology.

The paper mentions that active feedback to crowd workers would help improve labeling tasks. I think that dynamic, real-time feedback would be immensely helpful. I understand that it is challenging to implement with crowd workers, but an internal study could be conducted. For this, two or more people would need to work simultaneously, with one labeling and the rest giving feedback, or in some other combination.

Questions

  1. Sidewalk accessibility has been discussed for people with accessibility problems. They have considered people in wheelchairs for their studies. I do understand that such people would be needed for study 1, where labeling is a factor. However, how does the idea extend to people with other accessibility issues like reduced vision?
  2. This paper was published in 2013. The authors mention in the conclusion section that improvements in GSV and computer vision will help overall. Has any further study been conducted? How much modification of the current system is needed to accommodate the advancement in GSV and computer vision in general?
  3. Can dynamic feedback to workers be implemented? 

References 

[1] https://www.youtube.com/watch?v=aD1bx_SikGo

[2] https://www.waze.com/waze

[3] http://kotarohara.com/assets/Papers/Saha_ProjectSidewalkAWebBasedCrowdsourcingToolForCollectingSidewalkAccessibilityDataAtScale_CHI2019.pdf


03/04/2020 – Nurendra Choudhary – Real-time captioning by groups of non-experts

Summary

In this paper, the authors discuss a collaborative real-time captioning framework called LEGION:SCRIBE. They compare their system against the previous approach, called CART, and against Automated Speech Recognition (ASR) systems. The authors open the discussion with the benefits of captioning. They then explain the high cost of hiring stenographers. Stenographers are the fastest and most accurate captioners, with access to specialized keyboards and expertise in the area. However, they are prohibitively expensive ($100-120 an hour). ASR is much cheaper, but its low accuracy makes it inapplicable in most real-world scenarios.

To alleviate these issues, the authors introduce the SCRIBE framework. In SCRIBE, crowd workers caption smaller parts of the speech. The parts are merged using an independent framework to form the final sentence. The latency of the system is 2.89s, emphasizing its real-time nature, which is a significant improvement over the ~5s latency of CART.

Reflection

The paper introduces an interesting approach to collating data from multiple crowd workers for sequence learning tasks. The method has been applied before in cases such as Google Translate (translating small phrases) and ASR (voice recognition of speech segments). However, SCRIBE distinguishes itself by bringing real-time performance to the system. The system does rely on the availability of crowd workers, which may lead to unreliable behaviour. Additionally, the hired workers are not professionals, so quality is affected by human factors such as mindset, emotions, or mental stamina. I believe a study of how SCRIBE evolves over time and how it depends on such factors is needed.

Furthermore, I question the crowd management system. Amazon MT cannot guarantee that workers are available in real time. Currently, given the supply of workers relative to the tasks, workers are always available. However, as more users adopt the system, this need not always hold true. So, crowd management systems should provide alternatives that guarantee such requirements. The work provider also needs to find alternatives to maintain real-time interaction in case the crowd fails. In the case of SCRIBE, the authors could add an ASR module as a fallback for crowd failure. ASR may not give the best results, but it would ensure a smoother user experience.

The current development system does not consider the volatility of crowd management systems. This makes them an external single point of failure. I think there should be a push toward simultaneously adopting multiple management systems for the framework to increase its reliability. This would also improve system efficiency, because the framework would have a more diverse set of results to choose from, benefiting the overall model structure and user adoption.

Questions

  1. Google Translate uses a similar strategy by asking its users to translate parts of sentences. Can this technique be globally applied to any sequential learning framework? Is there a way we can divide sequences into independent segments? In case of dependent segments, can we just use a similar merging module or is it always problem-dependent?
  2. The system depends on the availability of crowd workers. Should there be a study on the availability aspect? What kind of systems would benefit from this?
  3. Should there be a new crowd work management system with a sole focus on providing real-time data provisions?
  4. Should the responsibility of ensuring real-time nature be on the management system or the work provider? How will it impact the current development framework?

Word Count: 567


03/04/20 – Akshita Jha – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary:
“Pull the Plug? Predicting If Computers or Humans Should Segment Images” by Gurari et al. talks about image segmentation. The authors propose a resource allocation framework that tries to predict when best to use a computer for segmenting images and when to switch to humans. Image segmentation is the process of “partitioning a single image into multiple segments” in order to simplify the image into something that is easier to analyze. The authors implement two systems that decide when to replace humans with computers to create fine-grained segments and when to replace computers with humans in order to get coarse segments. They demonstrate through experiments that this mixed model of humans and computers beats state-of-the-art systems for image segmentation. The resource allocation framework decides when to “pull the plug” on humans or computers: given an image, the system tries to predict whether an annotation should come from a human or a computer. The authors evaluate the model using Pearson’s correlation coefficient (CC) and mean absolute error (MAE). CC indicates how strongly the predicted scores correlate with the actual scores given by the Jaccard index against the ground truth. MAE is the average prediction error. The authors thoroughly experiment with initializing segmentation tools and with reducing human effort in initialization.
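
The three evaluation quantities mentioned above can be sketched in a few lines of Python. This is only an illustration of the standard definitions, not the authors' evaluation code; the function names and the representation of a segmentation as a set of foreground pixel coordinates are assumptions made for the example.

```python
import math

def jaccard(mask_a, mask_b):
    """Jaccard index of two segmentations, each given as a set of
    foreground pixel coordinates (an assumed representation)."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # two empty masks agree trivially
    return len(mask_a & mask_b) / len(union)

def pearson_cc(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_absolute_error(xs, ys):
    """Average absolute difference between predicted and actual scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

Here the "actual" score of a candidate segmentation would be its Jaccard index against the ground truth, and CC and MAE compare those actual scores with the predicted ones.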

Reflections:
This is an interesting work that successfully makes use of mixed modes involving both humans and computers to improve the precision and accuracy of a task. The two methods that the authors design for segmenting an image are particularly thoughtful. First, given an image, the authors design a system that tries to predict whether the image requires fine-grained segmentation or coarse-grained segmentation. This is non-trivial, as the task requires the system to possess a certain level of “intelligence”. The authors use specific segmentation tools, but the system design is intended to remain agnostic to them. The first system ranks several segmentation tools by using a tool designed by the authors to predict the quality of each segmentation. It then allocates the available human budget to create coarse segmentations. The second system tries to capture whether an image requires fine-grained segmentation or not. It does this by building on the coarse segmentation given by the first system. The second system refines the segmentation and allocates the available human budget to create fine-grained segmentations where the predicted quality is low. Both these tasks rely on the system proposed by the authors to predict the quality of candidate segmentations.

Questions:
1. The authors rely on their proposed system of predicting the quality of candidate segmentations. What kind of errors do you expect?
2. Can you think of a way to improve this system?
3. Can we replace the segmentation quality prediction system with a human? Do you expect the system to improve or would the performance go down? How would it affect the overall experience of the system?
4. In most such systems, humans are needed only for annotation. Can we think of more creative ways to engage humans while improving the system performance?


Subil Abraham – 03/04/2020 – Real-time captioning by groups of non-experts

This paper pioneers the approach of using crowd work for closed captioning systems. The scenario the authors target is classes and lectures, where a student can hold up their phone to record the speaker, and the sound is then transmitted to the crowd workers. The sound is passed to the crowd workers in bite-sized pieces to transcribe, and the paper’s implementation of multiple sequence alignment algorithms takes those transcriptions and combines them. The focus of the tool is very much on real-time captioning, so the amount of time a crowd worker can spend on a portion of sound is limited. The authors design interfaces on the worker side to promote continuous transcription, and on the user side to allow users to correct the received transcriptions in real time, enhancing the quality further. The authors had to deal with interesting challenges in resolving errors in the transcription, which they did by comparing transcriptions of the same section from different crowd workers and by using bigram and trigram data to validate the word ordering. Evaluations showed that precision was stable while coverage increased with the number of workers, and that the error rate was lower than that of automatic transcription and untrained transcribers.

One thing that needs to be pointed out about this work is that ASR is rapidly improving and has made significant strides since this paper was published. From my own anecdotal experience, Youtube’s automatic closed captions are getting very close to being fully accurate (however, thinking back on our reading of the Ghost Work book at the beginning of the semester, I wonder if Youtube is cheating a bit and using crowd work intervention for some of their videos to help their captioning AI along). I also find the authors’ solution for merging the transcriptions of the different sound bites interesting. How they would solve that was the first thing on my mind, because it was not going to be a matter of simply aligning the time stamps, since those were definitely going to be imprecise. So I do like their clever multi-part solution. Finally, I was a little surprised and disappointed that the WER was ~45%, which was a lot higher than I expected. I was expecting the error rate to be a lot closer to that of professional transcribers, but unfortunately it is not. The software still has a way to go there.
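
For readers unfamiliar with the metric, word error rate (WER) is usually computed as the word-level edit distance between a reference transcript and the produced transcript, divided by the number of reference words. The sketch below shows that standard definition; it is not the paper's evaluation script, and the function name is chosen just for this example.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of ~45% means that roughly 45 word-level edits are needed per 100 reference words.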

  1. How could you get the error rate down to the professional transcriber’s level? What is going wrong there that is causing it to be that high?
  2. It’s interesting to me that they couldn’t just play isolated sound clips but instead had to raise and lower volume on a continuous stream for better accuracy. Where are the other places humans work better when they have a continuous stream of data rather than discrete pieces of data?
  3. Is there an ideal balance between choosing precision and coverage in the context of this paper? This was something that also came up in last week’s readings. Should the user decide what the balance should be? How would they do it when there can be multiple users all at the same location trying to request captioning for the same thing?


Subil Abraham – 03/04/2020 – Pull the Plug

The paper proposes a way of deciding when a computer or a human should do the work of foreground segmentation of images. Foreground segmentation is a common task in computer vision, where the idea is that one element in an image is the focus of the image and is what is needed for actual processing. However, automatic foreground segmentation is not always reliable, so sometimes it is necessary to get humans to do it. The important question is deciding which images to send to humans for segmentation, because hiring humans is expensive. The paper proposes a machine learning method that estimates the quality of a given coarse or fine-grained segmentation and decides whether it is necessary to bring in a human to do the segmentation. The authors evaluate their framework by examining the quality of different segmentation algorithms and are able to achieve quality equivalent to 100% human work using only 32.5% human effort for GrabCut segmentation, 65% human effort for Chan-Vese, and 70% human effort for Lankton.
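
The allocation idea can be illustrated with a small sketch: given a predicted quality score for each automatic segmentation, send the lowest-scoring fraction of images to humans and keep the computer's result for the rest. This is an assumption-laden simplification, not the paper's implementation; the function name, the dictionary of scores, and the fixed-fraction budget are all hypothetical.

```python
def allocate_human_budget(predicted_quality, human_fraction):
    """Split images between humans and the computer.

    predicted_quality: dict mapping image id -> predicted quality score
                       (hypothetical scores, higher means better).
    human_fraction:    share of images the human budget can cover (0 to 1).
    Returns (to_humans, to_computer) as lists of image ids.
    """
    n_human = int(len(predicted_quality) * human_fraction)
    # Images whose automatic segmentation is predicted to be worst come first.
    ranked = sorted(predicted_quality, key=predicted_quality.get)
    return ranked[:n_human], ranked[n_human:]

# Usage with made-up scores:
scores = {"img_a": 0.9, "img_b": 0.2, "img_c": 0.6, "img_d": 0.4}
to_humans, to_computer = allocate_human_budget(scores, 0.5)
```

With a budget covering half the images, the two lowest-scoring images go to human annotators and the computer's segmentation is kept for the other two.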

The authors have pursued a truly interesting idea in that they are not trying to create a better way of automatic image segmentation, but rather a way of determining whether the automatic segmentation is good enough. My initial thought was: couldn’t something like this be used to just make a better automated image segmenter? I mean, if you can tell the quality, then you know how to make it better. But apparently that’s a hard enough problem that it is far more helpful to just defer to a human when you predict that your segmentation quality is not where you want it. It’s interesting that they talk about pulling the plug on both computers and humans, but the paper seems to focus on pulling the plug on computers, i.e. the human workers are the backup plan in case the computer can’t do quality work, and not the other way around. This applies to both their cases, coarse-grained and fine-grained segmentation work. I would like to see future work where the primary work is done by humans first, testing whether pulling the plug on the human work would be effective and where productivity would increase. This would have to be work that is purely in the human domain (i.e. it can’t be regular office work, because that is easily automatable).

  1. What are examples of work where we pull the plug on the human first, rather than pulling the plug on computers?
  2. It’s an interesting turn around that we are using AI effort to determine quality and decide when to bring humans in, rather than improving the AI of the original task itself. What other tasks could you apply this, where there are existing AI methods but an AI way of determining quality and deciding when to bring in humans would be useful?
  3. How would you set up a segmentation workflow (or another application’s workflow) where when you pull the plug on the computer or human, you are giving the best case result to the other for improvement, rather than starting over from scratch?


03/04/2020 – Pull the Plug? Predicting If Computers or Humans Should Segment Images – Yuhang Liu

Summary:

This paper examines a new approach to image segmentation. Image segmentation is a key step in any image analysis task. Many methods already exist, including inefficient manual methods and automated methods that can produce high-quality results, but each has certain disadvantages. The authors therefore propose a resource allocation framework that predicts how best to assign a fixed amount of human labor to collect higher-quality segmentations for a given image and automated method. Specifically, the authors implemented two systems, which can do the following when segmenting images:

  1. Use computers instead of humans to create the rough segmentation needed to initialize the segmentation tool.
  2. Use computers instead of humans to create the final fine-grained segmentation.

The final experiments also showed that relying on this hybrid, interactive segmentation system can achieve faster and more efficient segmentation.

Reflection:

I once worked on a related image recognition project: a railway turnout monitoring system based on computer vision, which detects the turnout of a railroad track from a picture. The most critical step was separating the outline of the railroad track. At that time, we used only automated separation, and the main problem we encountered was that when the scene became complicated, we faced complex line segments that affected the detection results. As this paper argues, combining humans and machines can greatly improve accuracy. I very much agree, and hope that one day I can try it myself. What I agree with most is that the system assigns work automatically instead of sending all photos through the same process: for a given photo, either the machine alone handles it or manual processing is required. This variety of interaction methods is far more advantageous than a single method. It can greatly save workers’ time without affecting accuracy and, most importantly, it can adapt to more diverse pictures. Finally, I think similar approaches can be applied elsewhere. Assigning tasks through the system can coordinate the working relationship between humans and machines in other fields, such as sentiment analysis of audio and musical background separation. In these areas, humans have advantages that machines cannot match and can achieve good results, but human work takes a long time and is very expensive. Therefore, if we can generalize this kind of thinking, handle the working relationship between humans and machines systematically, and either give complex cases to people or let the machine produce a rough pass first, then the separation cost will be greatly reduced without affecting accuracy. I believe this method has great application prospects, not only because image separation has many applications, but also because we can borrow this idea to complete more detailed analysis in more fields.

Question:

  1. Is this idea of cooperation between man and machine worth learning?
  2. Because the system defines the working range of people and machines, will the machine reduce the accuracy due to the results of human work?
  3. Does man-machine cooperation pose new problems, such as increasing costs?


03/04/20 – Akshita Jha – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Summary:
“Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind” by Salisbury et al. talks about the important problem of accessibility. The authors discuss the challenges that arise from an automatic image captioning system and how the imperfections in the system may hinder a blind person’s understanding of social media posts that have embedded imagery. The authors use mixed methods to evaluate and subsequently modify the captions generated by the automated system for images embedded in social media posts. They study how crowdsourcing can enhance the existing workflows and provide scalable and useful alt text for the blind. The authors analyze in detail the conversations they collected in order to design user-friendly experiences that can effectively assist blind users. The authors focus on three research questions: (i) What value is provided by a state-of-the-art vision-to-language API in assisting BVI users, and what are the areas for improvement? (ii) What are the trade-offs between alternative workflows for the crowd assisting BVI users? (iii) Can human-in-the-loop workflows result in reusable content that can be shared with other BVI users? The authors study varying levels of human engagement and automation to arrive at a final system that better understands the requirements for creating good-quality alt text for blind and visually impaired users.

Reflections:
This is an interesting work, as it addresses the often ignored problem of accessibility. The authors focus on images embedded in social media posts. Most of the time, the captions produced by an automated system trained with a machine learning algorithm are inadequate and non-descriptive. This might not be much of a problem for everyday users, but it can be a huge challenge for blind people. The authors’ analysis is thoughtful and keeps accessibility in mind. The authors validate their approach by running a follow-up study with seven blind and visually impaired users. The users were asked to compare the uncorrected vision-to-language caption and the alt text provided by the authors’ system. The findings showed that the blind and visually impaired users would prefer the conversational system designed by the authors to better understand the images. However, it would have been more helpful if the authors had gathered feedback from the target user group while developing the system, instead of just asking the users to test it. Also, the tweets used by the authors might not be representative of the kinds of tweets in the target users’ timelines.

Questions:
1. What do you think about the approach taken by the authors to generate the alt-text?
2. Would it have been helpful to conduct a survey to understand the needs of the blind and visually impaired users before developing the system?
3. Don’t you think using a conversational agent to understand the image embedded in tweets is too cumbersome and time consuming?


03/04/2020 – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind – Yuhang Liu

Summary:

The authors of this paper observe that visually impaired users are limited by the availability of suitable alternative text when accessing images in social media. The authors argue that the benefit to blind users of new tools that automatically generate captions is unknown. Through experiments, the authors studied how to use crowdsourcing to evaluate the value provided by existing automated methods, and how to provide a scalable and useful alternative text workflow for blind users. Using real-time crowdsourcing, the authors designed crowd-interaction experiments with varying depth of interaction. These experiments help explain the shortcomings of existing methods. They show that the shortcomings of existing AI image captioning systems often prevent users from understanding the images they cannot see, and that some conversations can even produce erroneous results, which greatly affects the user experience. The authors carried out a detailed analysis and proposed a design that is scalable, requires crowdsourced workers to help improve the displayed content, and can effectively help users without real-time interaction.

Reflection:

First of all, I very much agree with the authors’ approach. In a society where social networks play an increasingly important role, we really should strive to make social media serve more people, especially disadvantaged groups. Daily travel is inconvenient for blind people, and social media is their main way to understand the world, so designing a system that can help them is a very good idea. Secondly, the authors used crowdsourcing to study the existing methods, and the method they designed is also very effective. As an inexpensive source of human labor, crowdsourcing can test a large number of systems in a short time, but I think it also has some limitations. It may be difficult for crowd workers to think about the problem from the perspective of blind users, so their ideas, although similar, are not fully accurate, and there are some gaps between their results and those of blind users. Finally, I have some doubts about the system proposed by the authors. They propose a workflow that combines different levels of automation and human participation, which means the interaction requires the participation of another person. I think this has some disadvantages: not only does it cause a certain delay, but because it requires additional human resources, it may also require blind users to pay more. I think the ultimate direction of development should be freedom from human constraints, so we could compare the workers’ results with the original captions and use the crowd workers’ results as training data for machine learning. I think this could reduce the cost of the system while increasing its efficiency, and provide faster and better services for more blind users.

Question:

  1. Do you think there is a better way to implement these functions, such as studying the answers of workers, and achieving a completely automatic display system?
  2. Are there some disadvantages to using crowdsourcing platforms?
  3. Is it better to change text to speech for the visually impaired?


03/04/20 – Lulwah AlKulaib- CrowdStreetView

Summary

The authors try to assess the accessibility of sidewalks by hiring AMT workers to analyze Google Street View images. Traditionally, sidewalk assessment is conducted either in person via street audits, which are highly labor intensive and expensive, or through calls reported by citizens. The authors propose their system as a proactive alternative. They perform two studies:

  • A feasibility study (Study 1): examines the feasibility of the labeling task with six dedicated labelers including three wheelchair users
  • A crowdsourcing study (Study 2): investigates the comparative performance of turkers

In study 1, since labeling sidewalk accessibility problems is subjective and potentially ambiguous, the authors investigate the viability of labeling across two groups:

  • Three members of the research team
  • Three wheelchair users – accessibility experts

They use the results of study 1 to provide ground truth labels to evaluate crowdworkers performance and to get a baseline understanding of what labeling this dataset looks like. In study 2, the authors investigate the potential of using crowd workers to perform the labeling task. They evaluate their performance on two levels of labeling accuracy:

  • Image level: tests for the presence or absence of the correct label in an image 
  • Pixel level: examines the pixel level accuracies of the provided labels

They show that AMT workers are capable of finding accessibility problems with an accuracy of 80.6% and determining the correct problem type with an accuracy of 78.3%. Results improve to 86.9% and 83.9%, respectively, when majority voting is used as a labeling technique. The authors collected 13,379 labels and 19,189 verification labels from 402 workers. Their findings suggest that crowdsourcing both the labeling task and the verification task leads to a better quality result.
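
Majority voting over worker labels can be sketched as follows. This is a generic illustration rather than the authors' exact aggregation procedure; the function name, the tie-handling rule, and the example label strings are assumptions.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by most workers, or None if there are
    no votes or the top two labels are tied."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear majority (an assumed policy)
    return ranked[0][0]

# Usage with made-up worker labels for one image:
votes = ["curb ramp missing", "object in path", "curb ramp missing"]
label = majority_label(votes)
```

Returning `None` on a tie is one possible design choice; a real pipeline might instead request another worker's vote for that image.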

Reflection

The authors selected wheelchair users as the experts in the paper, whereas in real life the experts who assess sidewalks are civil engineers. I wonder how that would have changed their labels and results, since street accessibility is not only a concern for wheelchair users. It is worth investigating with a pool of multiple kinds of experts.

I also think that selecting the dataset of photos to work on was a requirement for this labeling system; otherwise, there would have been a tedious amount of work on “bad” images. I can’t imagine how this would scale to Google Street View as a whole. The dataset requires refinement before it can be labeled.

In addition, the focal point of the camera was not considered, which reduces the scalability of the project. Even though the authors suggest installing a camera angled towards sidewalks, until that is implemented I don’t see how this model could work well in the real world (as opposed to a controlled experiment).

Discussion

  • What are improvements that the authors could have done to their analysis?
  • How would their labeling system work for random Google Street View photos?
  • How would the focal point of the GSV camera affect the labeling? 
  • If cameras were angled towards sidewalks, and we were able to get a huge amount of photos for analysis, what would be a good way to implement this project?


3/4/20 – Jooyoung Whang – Pull the Plug? Predicting If Computers or Humans Should Segment Images

In this paper, the authors attempt to distribute human and computer resources appropriately when segmenting foreground objects in an image, in order to achieve highly precise segmentations. They explain that the segmentation process consists of roughly segmenting the image (initialization) and then going through another fine-grained iteration to come up with the final result. They repeat their study for both steps. To figure out where to allocate human resources, the authors propose an algorithm that scores the acquired segmentations by detecting highly jagged edges on the boundary, non-compact segmentations, near-edge segmentation locations, and the ratio of segmentation area to the full image. The authors find that a mix of humans and computers for image segmentation performs better than using only one or the other.
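
To make a few of these cues concrete, here is a rough sketch of how some of them could be computed from a binary mask. These are not the paper's actual feature definitions: the function name, the use of the isoperimetric quotient as a compactness measure, and the pixel-edge perimeter count are all assumptions made for illustration.

```python
import math

def mask_features(mask):
    """Compute simple shape cues for a binary mask given as a list of
    rows of 0/1 values (an assumed representation)."""
    rows, cols = len(mask), len(mask[0])
    area = 0
    perimeter = 0
    touches_border = False
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            area += 1
            if r in (0, rows - 1) or c in (0, cols - 1):
                touches_border = True
            # Count pixel edges that face background or the image boundary.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or not mask[nr][nc]:
                    perimeter += 1
    # Isoperimetric quotient: lower values suggest jagged or
    # non-compact shapes.
    compactness = 4 * math.pi * area / perimeter ** 2 if perimeter else 0.0
    return {
        "area_ratio": area / (rows * cols),
        "compactness": compactness,
        "touches_border": touches_border,
    }

# Usage: a 2x2 foreground block centered in a 4x4 image.
example = [[0, 0, 0, 0],
           [0, 1, 1, 0],
           [0, 1, 1, 0],
           [0, 0, 0, 0]]
features = mask_features(example)
```

A quality predictor could take features like these as input; a highly jagged boundary inflates the perimeter and so lowers the compactness value.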

I liked the authors’ proposed algorithm to detect when a segmentation fails. It was interesting to see that they focused on visible features and qualities that humans can see instead of relying on deep neural networks that are often hard to interpret the internal workings of. At the same time, I am a little concerned about whether the proposed visual features for failed segmentations are enough to generalize and scale for all kinds of images. For example, the authors note that failed segmentations often have highly jagged edges. What if the foreground object (or an animal in this case) was a porcupine? The score would be fairly low even when an algorithm correctly segments the creature from the background. Of course, the paper reports that the method generalized well for everyday images and biomedical images, so my concern may be a trivial one.

As I am not experienced in the field of image segmentation analysis, I wondered whether there are cases where an image contains more than one foreground object and only one of them is of interest to a researcher. From my limited knowledge of foreground and background separation, a graph search is done by treating the image as a graph of connected pixels to find pixels that stand out. It does not care about “objects of interest.” It made me curious whether it is possible to supply additional semantic information in the process.

The following are the questions that I had while reading the paper:

1. Do you think the qualities that PTP looks for are enough to score the quality of segmented images? What other properties would a failed segmentation have? One quality I can think of is that failed segmentations often have disjoint parts.

2. Can you think of some cases where PTP could fail? Would there be any case where the score for a segmentation is really low even if the segmentation was done correctly?

3. As I’ve written in my reflection, are there methods that allow segmentation algorithms to consider the “interest” for an object? For example, if an image contained a car and a cat both in the foreground and the researcher was interested in the cat, would the algorithm be able to only separate out the cat?
