VizWiz: Nearly Real-time Answers to Visual Questions

Bigham, Jeffrey P., et al. “VizWiz: Nearly Real-time Answers to Visual Questions.” Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. ACM, 2010.

Discussion Leader: Sanchit

Crowdsourcing Example: Couch Surfing

YouTube video for a quick overview: VizWiz Tutorial

Summary

VizWiz is a mobile application designed to answer visual questions for blind people in nearly real time by taking advantage of existing crowdsourcing platforms such as Amazon’s Mechanical Turk. Existing software and hardware that help blind people solve visual problems are either too costly or too cumbersome to use. OCR is not yet advanced or reliable enough to solve vision-based problems completely, and existing text-to-speech software addresses only the single task of reading text back to the user. The application interface takes advantage of Apple’s accessibility service, VoiceOver, which lets the operating system speak to the user and describe the currently selected option or view on the screen. Touch-based gestures are used to navigate the application so that users can easily take a picture, ask a question, and receive answers from remote workers in near real time.
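To make that end-to-end flow concrete, here is a minimal sketch of what the client-to-service exchange might look like: the phone uploads the photo and the recorded question, then polls for answers as workers submit them. The endpoint URLs, field names, and question-id scheme are my own placeholders, not the actual VizWiz API.

```python
import time
import requests

VIZWIZ_URL = "https://example.org/vizwiz"  # placeholder, not the real service


def ask_question(photo_path, audio_path):
    """Upload the photo and the spoken question, return a question id."""
    with open(photo_path, "rb") as photo, open(audio_path, "rb") as audio:
        resp = requests.post(
            f"{VIZWIZ_URL}/questions",
            files={"photo": photo, "question_audio": audio},
        )
    resp.raise_for_status()
    return resp.json()["question_id"]


def poll_answers(question_id, timeout_s=300, interval_s=10):
    """Yield new answers as crowd workers submit them, until the timeout."""
    deadline = time.time() + timeout_s
    seen = set()
    while time.time() < deadline:
        resp = requests.get(f"{VIZWIZ_URL}/questions/{question_id}/answers")
        for answer in resp.json().get("answers", []):
            if answer["id"] not in seen:
                seen.add(answer["id"])
                yield answer["text"]  # hand off to VoiceOver / text-to-speech
        time.sleep(interval_s)
```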

The authors also present an abstraction layer on top of the Mechanical Turk API called quikTurkit. It allows requesters to create their own website on which Mechanical Turk workers are recruited and can answer questions posed by users of the VizWiz application. A constant stream of HITs is posted on Mechanical Turk so that a pool of workers is already available as soon as a new question is posed by the user. While the user is taking a picture and recording their question, VizWiz sends a notification to quikTurkit, which starts recruiting workers ahead of time and thereby reduces the overall latency before an answer comes back.
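As a rough illustration of that pre-recruitment idea, the sketch below keeps a small pool of open HITs that point workers at an external answering page, and tops the pool up when the phone signals that a new question is being composed. It uses today’s boto3 MTurk client rather than the original quikTurkit stack, and the reward, worker-page URL, and pool size are illustrative assumptions.

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# External question pointing workers at a (hypothetical) answering website.
WORKER_PAGE = "https://example.org/vizwiz/answer"
EXTERNAL_QUESTION = f"""
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>{WORKER_PAGE}</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

POOL_SIZE = 5  # how many open HITs to keep available at any time


def open_hit_count():
    """Count HITs that are still assignable (a worker could accept them)."""
    hits = mturk.list_hits(MaxResults=100).get("HITs", [])
    return sum(1 for h in hits if h["HITStatus"] == "Assignable")


def maintain_pool():
    """Top up the pool so workers are already waiting when a question arrives."""
    for _ in range(max(0, POOL_SIZE - open_hit_count())):
        mturk.create_hit(
            Title="Answer a question about a photo",
            Description="Look at a photo and answer a short spoken question.",
            Keywords="photo, question, accessibility",
            Reward="0.05",
            MaxAssignments=1,
            LifetimeInSeconds=10 * 60,
            AssignmentDurationInSeconds=5 * 60,
            Question=EXTERNAL_QUESTION,
        )


def on_question_started():
    """Called when VizWiz signals that a user has begun composing a question."""
    maintain_pool()
```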

VizWiz also featured a second version that detected blurry or dark images and asked users to retake them in order to get more accurate results. The authors also developed a follow-on use case, VizWiz::LocateIt, which helps blind users locate an object in 3D space. The user takes a picture of the area where the desired object is located and then asks for the location of the specific object. Crowd workers highlight the object, and the application combines the camera properties, the user’s location, and the highlighted region to determine how much the user should turn and how far they should walk in order to reach the general vicinity of the object. The post-study surveys generated many favorable responses, which shows that this technology is in demand among blind users and may set up future research to automate the answering process without human involvement.
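The guidance step comes down to simple camera geometry: given where the crowd’s highlight falls in the frame and the camera’s horizontal field of view, the phone can estimate how far the user needs to turn. The sketch below shows one way such a turn angle could be computed under a pinhole camera assumption; the field-of-view value and highlight coordinates are made-up inputs, and the actual system combines this kind of estimate with additional sensor data and feedback.

```python
import math


def turn_angle_degrees(highlight_center_x, image_width_px, horizontal_fov_deg=60.0):
    """
    Estimate how far the user should turn so the highlighted object is
    directly in front of them, assuming a simple pinhole camera model.

    highlight_center_x : x coordinate (pixels) of the crowd-drawn highlight's center
    image_width_px     : width of the photo in pixels
    horizontal_fov_deg : camera's horizontal field of view (made-up default)

    Returns degrees; positive means turn right, negative means turn left.
    """
    # Focal length in pixel units for the pinhole model.
    focal_px = (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg / 2.0))
    # Offset of the highlight from the image center.
    offset_px = highlight_center_x - image_width_px / 2.0
    return math.degrees(math.atan2(offset_px, focal_px))


# Example: highlight centered at x=800 in a 1024-pixel-wide photo.
print(f"Turn about {turn_angle_degrees(800, 1024):.1f} degrees to the right")
```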

Reflections

I thought the concept in itself was brilliant. It is a problem that not many people think about in their daily lives, but when you sit down and really consider how trivial tasks, such as finding the right object in a space, can be nearly impossible for blind people, you realize the potential of such an application. The application design was very solid. Apple designed the VoiceOver API for vision-impaired people to begin with, so using it in such an application was the best choice. Employing large gestures for UI navigation is also smart, because it can be very difficult or impossible for vision-impaired people to tap a specific button or option on a touch-based screen.

quikTurkit was, in my opinion, a good foundation for the application’s backend. It could definitely be improved by not placing so much focus on speech recognition and by not bombarding Mechanical Turk with too many HITs. Finding the right balance between the number of active workers in the pool and the number of HITs posted would benefit both the system load and the cost the user has to incur in the long run.

A minor inconsistency I noticed was that the study initially reported 11 blind users, 5 of them female, but later in the paper there were 3 females. Probably a typo, but thoughts? Speaking of their study, I think the heuristics made a lot of sense and the survey results were generally favorable for the application. A latency of 2–3 minutes on average is not too bad considering how few alternatives a vision-impaired person has in such situations. Any additional information or answer the user can get will only be helpful. I honestly didn’t see the point of making speech recognition a focus of the product. If workers can just listen to the question, that should be sufficient to answer it; there is no need to introduce errors from failed speech recognition attempts.

In my opinion, VizWiz::LocateIt is too complicated a system, with too many external variables to worry about, for a visually impaired user to reliably find an object. The location detection and mapping are based solely on the picture taken by the user, which more often than not is imperfect. Although the authors have several algorithms and techniques to combat ineffective pictures, I still think there are potential hazards and accidents waiting to happen based on the direction cues provided by the application. I am not entirely convinced by this use case.

Overall, it was a solid concept and execution in terms of the mobile application. It looks like the software is public and is being used by over 5,000 blind people right now, which is pretty impressive.

Questions:

  1. One aspect of quikTurkit that confused me was who actually deploys the server or builds the website that Mechanical Turk workers use. Was it the VizWiz team who created the server, or can requesters build their own websites using this service as well? And who would the requesters be? Blind people?
  1. Besides human compassion and empathy, what is stopping workers from giving wrong answers? Also, who determines whether an answer was correct or not?
  1. If a handheld barcode scanner works fairly well for locating a specific product in an area, why couldn’t the authors just use a barcode scanning API on the iPhone along with the existing VoiceOver technology to help locate a specific product? Do you foresee any problems with this approach?

 
