Summary
In this paper, the authors address the problem of large-scale human evaluation through CrowdScape, a system built on interactive visualizations and mixed-initiative machine learning. It combines two major prior approaches to quality control: analyzing worker behavior and analyzing worker output.
The contributions of the paper include an interactive interface for exploring crowd worker results, visualizations of crowd worker behavior, techniques for exploring crowd worker products, and mixed-initiative machine learning for bootstrapping user intuitions. Previous work analyzed crowd worker behavior and output independently, whereas CrowdScape provides an interface for analyzing them together. CrowdScape uses mouse movements, scrolling, keypresses, focus events, and clicks to build worker profiles. The paper also points out its limitations, such as behaviors it cannot observe (e.g., where the worker's fovea is focused). Furthermore, it discusses the potential of CrowdScape in other experimental setups that are primarily offline or cognitive and therefore do not involve interaction with the system from which behavior could be analyzed.
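To make that behavioral instrumentation concrete, below is a minimal sketch (not the authors' actual code) of how such client-side event capture could be wired up in a task page. The event payload shape and the /logEvent endpoint are assumptions for illustration only.

```typescript
// Sketch of client-side behavioral logging in the spirit of CrowdScape's
// instrumentation (mouse movement, scrolling, keypresses, focus events, clicks).
// The payload shape and the "/logEvent" endpoint are illustrative assumptions.

interface TraceEvent {
  type: string;      // e.g. "mousemove", "scroll", "keydown", "focus", "click"
  timestamp: number; // milliseconds since the task page was loaded
  x?: number;        // pointer or scroll position, when applicable
  y?: number;
}

const taskStart = Date.now();
const buffer: TraceEvent[] = [];

function record(type: string, x?: number, y?: number): void {
  buffer.push({ type, timestamp: Date.now() - taskStart, x, y });
}

// Listeners covering the behaviors mentioned in the paper.
window.addEventListener("mousemove", (e) => record("mousemove", e.clientX, e.clientY));
window.addEventListener("scroll", () => record("scroll", window.scrollX, window.scrollY));
window.addEventListener("keydown", () => record("keydown")); // log the event, not the key, to limit privacy exposure
window.addEventListener("click", (e) => record("click", e.clientX, e.clientY));
window.addEventListener("focus", () => record("focus"));
window.addEventListener("blur", () => record("blur"));

// Periodically flush the buffered trace to a (hypothetical) collection endpoint.
setInterval(() => {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);
  void fetch("/logEvent", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(batch),
  });
}, 5000);
```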
Reflection
CrowdScape is a much-needed initiative as the number of workers involved in evaluation grows. Another interesting aspect is that it also expands developers' creativity and possibilities, as they can now evaluate more complex, focus-dependent algorithms. However, I feel the need for additional compensation here. The crowd workers are being tracked, which is an intrusion into their privacy. I understand that this is necessary for the process to function, but given that it makes focus an essential factor in how workers are assessed and paid, they should be rewarded fairly for it.
Also, the user behaviors tracked here cover many of the significant problems in the AI community fairly well, but more input signals would cover a better range of problems. Adding more features would not only increase problem coverage but also encourage more development: there are likely many instances in which a developer does not build something because of a lack of evaluation techniques or accepted measures, and more tracked features would remove that concern. For example, if we were able to track where a user's fovea is focused, developers could study the effect of different advertising techniques or build algorithms to predict and track interest in different kinds of videos (the business of YouTube).
Also, I am not sure how effective tracking the movements described in the paper is in general. The paper treats effectiveness as a combination of the worker's behavior and output, but several tasks rely on mental models that do not require the movements the paper tracks. In such cases, the output should carry more weight. I think the evaluator should be given the option to change the weights of the different signals, so that they could tune the platform for different problems and make it more broadly applicable; a rough sketch of this idea follows.
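As an illustration of that suggestion (this is my own idea, not something proposed in the paper), an evaluator-facing scoring function could expose the weighting directly. The score names and ranges below are assumptions.

```typescript
// Illustrative sketch of the weighting idea above (not part of CrowdScape):
// the evaluator chooses how much a worker's behavioral trace versus their
// output quality contributes to the overall rating for a given task type.

interface WorkerScores {
  behaviorScore: number; // e.g. derived from trace features, normalized to [0, 1]
  outputScore: number;   // e.g. gold-standard or peer agreement, normalized to [0, 1]
}

function combinedScore(scores: WorkerScores, behaviorWeight: number): number {
  const w = Math.min(Math.max(behaviorWeight, 0), 1); // clamp weight to [0, 1]
  return w * scores.behaviorScore + (1 - w) * scores.outputScore;
}

// A mostly "mental" task would down-weight behavior; an interaction-heavy one would raise it.
console.log(combinedScore({ behaviorScore: 0.2, outputScore: 0.9 }, 0.1)); // output-dominated: 0.83
console.log(combinedScore({ behaviorScore: 0.2, outputScore: 0.9 }, 0.7)); // behavior-dominated: 0.41
```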
Questions
- What kinds of privacy concerns could be a problem here? Should the analyzer have access to such behavior? Is it fair to ask users for this information? Should they be additionally compensated for such an intrusion into their privacy?
- What other kinds of user behavior are traceable? The paper mentions foveal focus. Can we also track listening focus or mental focus in other ways? Where would this information be useful in our problems?
- CrowdScape uses the platform’s interactive nature and visualization to improve the user experience. Should there be an overall focus on improving UX at the development level, or should we keep them separate processes?
- CrowdScape considers worker behavior and output to analyze human evaluation. What other aspects could be used to analyze the results?
Word Count: 582