Bringing semantics into focus using visual abstraction

Zitnick, C. Lawrence, and Devi Parikh. "Bringing Semantics into Focus Using Visual Abstraction." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Discussion Leader: Nai-Ching Wang

Summary

To address the problem of relating visual information to the linguistic semantics of an image, the paper proposes studying abstract images instead of real images, avoiding the complexity and low-level noise of real photographs. Abstract images make it possible to generate and reproduce the same or similar scenes as a study requires, which is nearly impossible with real images. The paper demonstrates this strength by recruiting crowd workers on Amazon Mechanical Turk to 1) create 1,002 abstract images, 2) describe the created images, and 3) generate 10 new images (from different crowd workers) for each description. Through this process, images with similar linguistic semantic meaning are produced because they are created from the same description. Because the parameters used to create each abstract image are known (or can be detected easily), the paper is able to measure the semantic importance of visual features derived from object occurrence, person attributes, co-occurrence, spatial location, and depth ordering of the objects in the scenes. The results show that these semantically important features achieve better recall than low-level image features such as GIST and SPM. The paper also shows that the visual features are highly related to the words used to describe the images.
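To make the recall evaluation more concrete, below is a minimal sketch (not the authors' code) of how one might compute retrieval recall from simple scene features. It assumes binary clip-art occurrence vectors and treats scenes generated from the same description as semantically equivalent; the feature representation, distance metric, and toy data are all illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: each abstract scene is a binary occurrence vector over
# the clip-art library (1 = object present). Scenes generated from the same
# description share a "semantic class"; recall@k asks how many of a scene's
# semantically equivalent scenes appear among its k nearest neighbors.

def recall_at_k(features, labels, k=10):
    """Average fraction of same-description scenes found in the k nearest neighbors."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    recalls = []
    for i in range(len(features)):
        dists = np.linalg.norm(features - features[i], axis=1)
        dists[i] = np.inf                           # exclude the query scene itself
        neighbors = np.argsort(dists)[:k]
        relevant = (labels == labels[i]).sum() - 1  # other scenes from the same description
        if relevant > 0:
            recalls.append((labels[neighbors] == labels[i]).sum() / relevant)
    return float(np.mean(recalls))

# Toy usage: 4 scenes from 2 descriptions, 5 possible clip-art objects.
occurrence = [[1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 1, 0, 1, 1]]
description_id = [0, 0, 1, 1]
print(recall_at_k(occurrence, description_id, k=1))
```

The same evaluation loop could be run with GIST or SPM descriptors in place of the occurrence vectors, which is the spirit of the paper's comparison between semantic and low-level features.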

Reflections

Even though crowdsourcing is not the main focus of the paper, it is very interesting to see how crowdsourcing can be applied and prove helpful in other research fields. I really like the idea of generating different images with similar linguistic semantic meaning in order to find the features that determine semantic similarity. It might be interesting to study the opposite direction as well, that is, generating different descriptions from the same or similar images.

For the crowdsourcing part, quality control is not discussed in the paper, probably because it is not the focus, but it would be surprising if there were no quality control of the crowd workers' results during the study. As we discussed in class, maximizing compensation within a limited amount of time is an important goal for workers on crowdsourcing markets such as Amazon Mechanical Turk, and it is easy to imagine workers pursuing that goal by submitting very short descriptions or randomly placed clip art. In addition, if multiple descriptions are collected for one image, how is the final description selected?

I can also see other crowdsourcing topics related to this study. It would be interesting to see how different workflows might affect the results: for example, having the same crowd worker complete all three stages, versus different crowd workers for different stages, versus different crowd workers working collaboratively. With such settings, we might be able to identify individual differences and/or social consensus in linguistic semantic meaning. The word-collection task in Section 6 also strikes me as somewhat similar to the ESP game, with the words constrained to certain types based on the needs of the research.

Overall, I think this paper is a very good example of how we can leverage human computation alongside algorithmic computation to understand human cognition.

Questions

  • Do you think that, in real images, viewers would be distracted by other complex features such that the importance of some features would decrease?
  • As for the workflow, what are the strengths and drawbacks of using the same crowd workers for all three stages versus using different crowd workers for different stages?
  • How would you do quality control on the produced images and descriptions? For example, how would you make sure a description legitimately matches the given image?
  • If we wanted to turn the crowdsourcing part into a game, how would you do it?
