03/04/2020 – Fanglan Chen – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

Hara et al.’s paper “Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems” explores a crowdsourcing approach to locating and assessing sidewalk accessibility issues by labeling Google Street View (GSV) imagery. Traditional approaches to sidewalk assessment rely either on street audits, which are very labor-intensive and expensive, or on reports from citizens. The researchers propose their interactive user interface as an alternative that deals with the issue proactively. Specifically, they investigate the viability of labeling sidewalk issues among two groups of diligent and motivated labelers (Study 1) and then explore the potential of relying on crowd workers to perform this labeling task, evaluating performance at different levels of labeling accuracy (Study 2). By investigating the viability of labeling across two groups (three members of the research team and three wheelchair users), Study 1 provides ground-truth labels for evaluating crowd workers’ performance and a baseline understanding of what labeling this dataset looks like. Study 2 explores the potential of using crowd workers to perform the labeling task; their performance is evaluated at both the image and pixel levels of labeling accuracy. The findings suggest that it is feasible to use crowdsourcing for both the labeling and verification tasks, which leads to final results of better quality.

Reflection

Overall, this paper proposes an interesting approach for sidewalk assessment. What I think about most is how feasibly it can be used to deal with real-world issues. In the scenario studied by the researchers, sidewalks in poor condition pose severe problems and relate to the larger issue of urban accessibility. The proposed crowdsourcing approach is novel. However, if we take a close look at the data source, we may question to what extent it can facilitate assessment in real time. It seems impossible to update the Google Street View (GSV) imagery on a daily basis, so the image sources are historical rather than reflective of the current conditions of road sidewalks.

I think image quality may be another big problem in this approach. Firstly, the resolution of GSV imagery is comparatively low, and images are sometimes captured under poor lighting conditions, which makes it challenging for crowd workers to make correct judgments. One possibility is to use existing machine learning models to enhance image quality by increasing resolution or adjusting brightness. That could be a good place to introduce machine learning algorithms to achieve better results on the task; a minimal preprocessing sketch follows below.
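As a rough illustration, the following sketch (assuming OpenCV is available; the function name and parameter choices are hypothetical) applies CLAHE contrast enhancement and bicubic upscaling. The upscaling is only a stand-in for a learned super-resolution model, which could be swapped in for better results.

```python
# Hypothetical preprocessing step for GSV images before showing them to
# crowd workers: boost contrast in dim panoramas, then upscale.
import cv2

def enhance_gsv_image(path, scale=2):
    img = cv2.imread(path)  # BGR image as loaded by OpenCV
    # Equalize lightness only, so colors are not distorted
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    # Bicubic upscaling; a learned super-resolution model would go here
    h, w = img.shape[:2]
    return cv2.resize(img, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)
```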

In addition, the focal point of the camera is another issue that may reduce the scalability of the project. The GSV imagery is not collected solely for sidewalk accessibility assessment, so it often contains a lot of noise (e.g., occluding objects). It would be interesting to conduct a study of what percentage of GSV imagery is of good quality with respect to the sidewalk assessment task.

Discussion

I think the following questions are worthy of further discussion.

  • Are there any other important accessibility issues that exist but are not considered in the study?
  • What improvements can you think of that the authors could make to their analysis?
  • What other potential human performance tasks can be explored by incorporating street view images?
  • How effectively do you think this approach can deal with urgent real-world problems?

03/04/2020 – Mohannad Al Ameedi – Real-Time Captioning by Groups of Non-Experts

Summary

In this paper, the authors propose a low-latency captioning solution for deaf and hard-of-hearing people that can work in a real-time setting. Although there are available solutions, they are either very expensive or of low quality. The proposed system allows people with hearing disabilities to request captioning at any time and get the result within a few seconds. The system depends on a combination of non-expert crowdsourcing workers and local staff to provide the captioning. Each request is handled by multiple people, and the result is a combination of all the participants’ input. The request is submitted as an audio stream, and the result is returned as text; a crowdsourcing platform is used to submit the request, and the result is retrieved within seconds. The proposed system uses an algorithm that works in a streaming manner, processing input as it is received and aggregating the results at the end. The system outperforms all other available options on both coverage and accuracy, and the proposed solution is feasible to apply in a production setting.

Reflection

I found the idea of real-time captioning very interesting. My understanding was that there is always latency when depending on crowdsourcing, so it could not be applied in real-world scenarios; it will be interesting to know how the system works as the number of users increases.

I also found the concept of multiple people working on the same audio stream and combining the results very interesting. Collecting captions from multiple people, figuring out what is unique and what is duplicated, and producing a final sentence, paragraph, or script is a challenging task.

This work is like multiple people working on one task, or multiple developers writing code to implement a single feature. Normally a supervisor or development lead would merge the results, but in this case the algorithm takes care of the merge, as the toy sketch below illustrates.
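As a toy illustration of such a merge (this is not the paper’s actual streaming alignment algorithm, and all names here are hypothetical), the sketch below greedily stitches each worker’s partial caption onto the growing transcript by finding the longest word overlap at the seam:

```python
# Toy caption merge: match the longest suffix of the merged transcript
# against a prefix of each new fragment, then append only the
# non-overlapping words.
def merge_captions(fragments):
    merged = fragments[0].split()
    for frag in fragments[1:]:
        words = frag.split()
        overlap = 0
        for k in range(min(len(merged), len(words)), 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        merged.extend(words[overlap:])
    return " ".join(merged)

# Two workers each caught an overlapping part of the same sentence
print(merge_captions(["the quick brown fox", "brown fox jumps over"]))
# -> "the quick brown fox jumps over"
```

A real system also has to handle typos, out-of-order arrival, and timing, which is what makes the merge genuinely hard.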

Questions

  • The authors measured the system on a limited number of users. Do you think the system will continue to outperform other methods if it gets deployed in a real-world setting?
  • Since live streaming is increasingly common at work, at school, and elsewhere, can we use the same concept to pass a URL and get instant captioning? What are the limitations of this approach?
  • What are the privacy concerns with this approach, especially if it gets used in the medical field? Normally a limited number of people are hired to help on such tasks, while crowdsourcing is open to a wide range of people.

03/04/2020 – Vikram Mohanty – Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind

Authors: Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris

Summary

This paper studies how crowdsourcing can be used to evaluate automated approaches for generating alt-text captions for BVI (Blind or Visually Impaired) users on social media. Further, the paper proposes an effective real-time crowdsourcing workflow to assist BVI users in interpreting captions. The paper shows that the shortcomings of existing AI image captioning systems frequently hinder a user’s understanding of an image they cannot see, to the extent that even clarifying conversations with sighted assistants cannot correct the misunderstanding. The paper finally proposes a detailed set of guidelines for future iterations of AI captioning systems.

Reflection

This paper is another example of people working with imperfect AI. Here, the imperfect AI is a result not of failing to collect meaningful datasets, but of building algorithms from constrained datasets without foresight of the application, i.e., alt-text for BVI users. The paper demonstrates a successful crowdsourcing workflow augmenting the AI’s suggestions, and it serves as motivation for other HCI researchers to design workflows that can integrate the strengths of interfaces, crowds, and AI together.

The paper shows an interesting finding: the simulated BVI users found it easier to generate a caption from scratch than from the AI’s suggestion. This shows how the AI’s suggestion can bias a user’s mental model in the wrong direction, from which recovery might be costlier than having no suggestion in the first place. This once again stresses the need to consider real-world scenarios and users in the evaluation workflow.

The solution proposed here is bottlenecked by the challenges of real-time deployment with crowd workers. Despite that, the paper makes an interesting contribution in the form of guidelines essential for future iterations of AI captioning systems. Involving potential end users and proposing systematic goals for an AI to achieve is a desirable goal in the long run.

Questions

  1. Why do you think people preferred to generate the captions from scratch rather than from the AI’s suggestions? 
  2. Do you ever re-initialize a system’s data/suggestions/recommendations to start from blank? Why or why not? 
  3. If you worked with an imperfect AI (which is more than likely), how do you envision mitigating the shortcomings when you are given the task to redesign the client app? 

03/04/2020 – Sushmethaa Muhundan – Pull the Plug? Predicting If Computers or Humans Should Segment Images

Summary

The paper proposes a resource allocation framework that intelligently distributes work between humans and an AI system in the context of foreground object segmentation. The advantages of using a mix of both humans and AI, rather than either alone, are demonstrated via the study conducted. The goal is to ensure that high-quality object segmentation results are produced while involving considerably less human effort. Two systems are implemented as part of this paper that automatically decide when to transfer control from the human to the AI component and vice versa, depending on the quality of segmentation encountered at each phase. The first system eliminates the need for human annotation effort by using computers to generate coarse object segmentations, which are then refined by segmentation tools. The second system predicts the quality of the annotations and automatically identifies the subset that needs to be re-annotated by humans; a minimal sketch of this routing idea follows below. Three diverse datasets were used to train and validate the system, representing visible, phase contrast microscopy, and fluorescence microscopy images.
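To make the allocation idea concrete, here is a minimal sketch of the routing step, assuming a learned quality predictor along the lines the paper describes; the function and budget-based policy here are illustrative, not the authors’ exact implementation.

```python
# Hypothetical "pull the plug" routing: score each machine-generated
# segmentation with a learned quality predictor and send only the
# lowest-scoring images back to human annotators.
def allocate_annotations(segmentations, predict_quality, human_budget):
    """segmentations: list of (image_id, mask) pairs.
    predict_quality: learned model mapping a mask to a quality score.
    human_budget: how many images humans can afford to re-annotate."""
    ranked = sorted(segmentations, key=lambda item: predict_quality(item[1]))
    human_queue = ranked[:human_budget]   # worst predicted quality
    machine_kept = ranked[human_budget:]  # predicted good enough to keep
    return machine_kept, human_queue
```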

Reflection

The paper explores leveraging the complementary strengths of humans and AI, allocating resources so as to reduce human involvement. I particularly liked the focus on quality throughout the paper. This mixed-approach system ensures that the quality of traditional systems, which relied heavily on human involvement, is still met. The resulting system was able to cut significant hours of human effort while maintaining the quality of the resulting foreground object segmentation, which is great.

Another aspect of the paper that I found impressive was the conscious effort to develop a single prediction model that is applicable across different domains; three diverse datasets were employed as part of this initiative. The paper discusses the disadvantages of other systems that do not work well on multiple datasets: in such cases, only a domain expert or computer vision expert would be able to predict when the system would succeed. This paper claims that this problem is altogether avoided in its system. Also, the decision to intentionally involve humans only once per image is good, as opposed to existing systems where human effort is required multiple times during the initial segmentation phase of each image.

Questions

  1. This paper primarily focuses on reducing human involvement in the context of foreground object segmentation. What other applications could extend the principles of this system to reduce the involvement of humans in the loop while ensuring that quality is not affected?
  2. The system predicts the quality of image segmentation outputs and involves humans to re-annotate only the lowest-quality ones. What other ideas could be employed to reduce human effort in such a system?
  3. The paper implies that the proposed system can be applied across images from multiple domains. Were the three datasets described varied enough to ensure that this is a generalized solution?

03/04/2020 – Nurendra Choudhary – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

In this paper, the authors discuss a crowdsourcing method utilizing Amazon Mechanical Turk workers to identify accessibility issues in Google Street View images. They utilize two levels of annotation: image-level and pixel-level. They evaluate intra- and inter-annotator agreement and conclude that an accuracy of 81% (increased to 93% with minor quality control additions) is feasible for real-world scenarios.

The authors open the paper with a discussion of the necessity of such approaches; the solution could lead to more accessibility-aware systems. The paper uses precision, recall, and F1-score to consolidate and evaluate image-level annotations. For pixel-level annotations, the authors use two sets of evaluation metrics: overlap between annotated pixels and precision-recall scores. The experiments show a level of inter-annotator agreement that makes the system feasible in real-world scenarios. The authors also use majority voting between annotators to improve accuracy further; a sketch of these evaluation and aggregation steps appears below.
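For concreteness, here is a minimal sketch of the two evaluation styles mentioned above (assuming NumPy, with each annotation as a boolean mask where True marks a reported problem; the paper’s exact formulations may differ):

```python
import numpy as np

def pixel_precision_recall(pred_mask, true_mask):
    # Pixel-level agreement of one annotator's mask against ground truth
    tp = np.logical_and(pred_mask, true_mask).sum()
    precision = tp / max(pred_mask.sum(), 1)
    recall = tp / max(true_mask.sum(), 1)
    return precision, recall

def majority_vote(masks):
    # Keep only pixels marked by more than half of the annotators
    votes = np.sum(masks, axis=0)
    return votes > (len(masks) / 2)
```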

Reflection

The paper introduces an interesting approach to utilizing crowd-sourced annotations for static image databases. This leads me to wonder about other, cheaper sources of images that could be utilized for this purpose. For example, Google Maps provides a more frequently updated set of images, and acquiring those images is more cost-effective. I think this would be a better alternative to the street-view images.

Additionally, the paper adopts majority voting to improve its results. Intuitively, this should push accuracy close to perfect, yet the method reaches 93% after the addition. I would like to see examples where the method fails; this would enable the development of better collation strategies in the future. I understand that in some cases the image might simply be too unclear, but examples of such failures would give us more data to improve the strategies.

Also, the images contain much more data than is currently being collected. We could build an interpretable representation of such images that captures all the world information they contain, though the computational effectiveness and validity of this is still questionable. If we are able to build better information systems, such representations might enable a huge leap forward in AI research (similar to ImageNet). We could also combine this data to build a profile of any place so that it helps users who want to access it in the future (e.g., the accessibility of restaurants or schools). Furthermore, given the time-sensitivity of accessibility, I think a dynamic model would be better than the proposed static approach. However, this would require a cheaper method of acquiring street-view data. Hence, we need to look for alternative sources of data that may provide comparable performance while limiting expenses.

Questions

  1. How well does this method generalize? Can it be applied to any static image database? The paper focuses on accessibility issues; can it be extended to other issues such as road repairs and emergency lane systems?
  2. Street view data collection requires significant effort and is also expensive. Could we utilize Google Maps to achieve reasonable results? What is a possible limitation of applying the same approach to Google satellite imagery?
  3. What about the time sensitivity of the approach? How will it track real-time changes? Does this approach require constant monitoring?
  4. The images contain much more information. How can we exploit it? Can we use it to detect infrastructural issues with government services such as parks, schools, and roads?

