This work addresses quality control in crowdsourcing and explores strategies for involving humans in the evaluation process via interactive visualizations and mixed-initiative machine learning. The authors propose CrowdScape, a tool that aims to support quality assessment even for complex or creative tasks by leveraging both the end products of work and workers’ behavioral patterns to develop insights about performance. CrowdScape is built on top of Mechanical Turk and draws on two data sources: the MTurk API, which provides the products of the work, and Rzeszotarski and Kittur’s Task Fingerprinting system, which captures worker behavioral traces. The tool combines these sources into an interactive data visualization platform; on the behavior side, it incorporates both raw event logs and aggregate worker features to drive a diverse set of interactive views. Four case studies are discussed, covering translation, a color-preference survey, writing, and video tagging.
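To make the summary above concrete, here is a minimal sketch (not the authors’ implementation) of how the two data sources could be joined per submission; the column names for the MTurk products and the behavioral features are my own assumptions.

```python
# Hypothetical sketch: join MTurk work products with aggregate behavioral
# features so each submission carries both its output and its behavior.
import pandas as pd

# Work products, e.g. as fetched via the MTurk API (fields are assumed)
products = pd.DataFrame([
    {"assignment_id": "A1", "worker_id": "W1", "answer": "translated text ..."},
    {"assignment_id": "A2", "worker_id": "W2", "answer": "another answer ..."},
])

# Aggregate behavioral features derived from client-side event logs
behavior = pd.DataFrame([
    {"assignment_id": "A1", "time_on_task_s": 312, "key_presses": 840, "paste_events": 0},
    {"assignment_id": "A2", "time_on_task_s": 45,  "key_presses": 12,  "paste_events": 3},
])

# One row per submission, combining output and behavior for visualization
combined = products.merge(behavior, on="assignment_id", how="left")
print(combined)
```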
In the context of creative work and complex tasks, where it is extremely difficult to evaluate results objectively, I feel that mixed-initiative approaches like the one described in the paper can be effective for gauging a worker’s performance.
I particularly liked the aggregation of worker behavioral trace features, where the user can dynamically query the visualization system to support data analysis. This gives users control over which features matter to them and lets them focus on those specific behavioral traces, rather than being handed static visualizations, which would have far more limited impact.
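As a rough illustration of what such aggregate features might look like, the sketch below reduces a raw client-side event log to a handful of summary features. The event names and fields are my own assumptions, not the Task Fingerprinting schema.

```python
# Hypothetical sketch: turn a raw event log into queryable aggregate features.
from collections import Counter

def aggregate_features(events):
    """events: list of dicts like {"t": ms_timestamp, "type": "keydown"}."""
    counts = Counter(e["type"] for e in events)
    times = [e["t"] for e in events]
    return {
        "total_time_ms": (max(times) - min(times)) if times else 0,
        "key_events": counts.get("keydown", 0),
        "mouse_moves": counts.get("mousemove", 0),
        "scrolls": counts.get("scroll", 0),
        "focus_changes": counts.get("focus", 0) + counts.get("blur", 0),
    }

log = [{"t": 0, "type": "focus"}, {"t": 1500, "type": "keydown"},
       {"t": 9000, "type": "scroll"}, {"t": 12000, "type": "blur"}]
print(aggregate_features(log))
```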
Another interesting feature is the ability to cluster submissions based on aggregate event features, which I feel would save the user considerable time and effort and thereby speed up the review process.
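For instance, such a clustering step could look like the sketch below. I use k-means from scikit-learn purely for illustration, with made-up feature values; I am not assuming this is the paper’s specific clustering method.

```python
# Hypothetical sketch: cluster submissions on aggregate event features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# rows: submissions; columns: e.g. time on task, key presses, scrolls
features = np.array([
    [312, 840, 25],
    [298, 910, 30],
    [45,  12,  2],
    [50,  8,   1],
])

X = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. separates engaged-looking from low-effort-looking submissions
```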
In the translation case study, it was interesting that one of the signals used to detect a lack of genuine effort was copy-paste keyboard activity. Intuitively, a paste suggests that the worker used third-party translation software. However, this alone is not sufficient proof, since the worker may have translated the text locally and simply pasted in their own work. This shows that while behavior tracking can provide insights, it may not be enough to draw conclusions on its own; coupling it with the output data and visualizing the two together helps draw more concrete conclusions.
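To illustrate that a paste event alone is weak evidence, a check like the hedged sketch below would only flag a submission for manual review, combining the behavioral signal with a crude comparison against the output. The field names and thresholds are hypothetical.

```python
# Hypothetical sketch: flag (not reject) submissions where pasting plus very
# little typing *might* indicate third-party translation, but could equally
# mean the work was done in a local editor and pasted in.
def needs_review(features, answer):
    pasted = features.get("paste_events", 0) > 0
    little_typing = features.get("key_events", 0) < len(answer) * 0.5
    return pasted and little_typing

print(needs_review({"paste_events": 1, "key_events": 10},
                   "a fairly long translated sentence"))
```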
- Apart from the techniques mentioned in the paper, what are some alternate techniques to gauge the quality of crowd workers in the context of complex or creative tasks?
- Apart from the case studies presented, what are some other domains where such systems can be developed and deployed?
- Given that the tool relies on workers’ behavior patterns, and that these may vary widely from worker to worker, are there situations in which the proposed tool would fail to produce reliable assessments of performance and quality?
I agree that it is possible the worker translated the task locally and then copied the work to the platform. I personally often write text in my own editor and then paste it into the submission site. This holds true for the class blog posts as well: I tend to draft my reflections in Google Docs and then “copy” them to the WordPress site. If someone were to monitor my behavior, it might suggest a number of things. Hence, yes, drawing conclusions from behavior alone is not a good approach.