Summary
Maintaining a public and open source website can be difficult since the website is supported by individuals that are not paid. This team investigated using a crowd sourcing platform to not only support the platform but create articles. These articles (or tasks) were broken down into micro tasks there were manageable and scalable by the crowd sourced workers. These tasks were integrated throughout other HITs and were given extra contributions in order to relieve any extra reluctance on editing other crowd workers work.
Reflection
I appreciate the competitive nature of comparing both the supervised learning (SL) and a reinforcement learning (RL) in the same type of game scenario of helping the human succeed by aiding the as best as it can. However as one of their contributions, I have issue with the relative comparison between the SL and RL bots. Within their contributions, they explicitly say they find “no significant difference in performance” between the different models. While they continue to describe the two methods performing approximately equally, their self-reported data describes a better model in most measurements. Within Table 1 (the comparison of humans working with each model), SL is reported as having a better (yet small) increase and decrease in Mean Rank and Mean Reciprocal Rank respectively (lower and then higher is better respectively). Within Table 2 (the comparison of the multitude of teams), there was only one scenario where the RL Model performed better than the SL Model. Lastly even in the participants self-reported perceptions, the SL Model only decreased performance in 1 of 6 different categories. Though it may be a small decrease in performance, they’re diction downplays part of the argument their making. Though I admit the SL model having a better Mean Rank by 0.3 (from Table 1 MR difference or Table 2 Human row) doesn’t appear to be a big difference, I believe part of their contribution statement “This suggests that while self-talk and RL are interesting directions to pursue for building better visual conversational agents…” is not an accurate description since by their own data it’s empirically disproven.
Questions
- Though I admit I focus on the representation of the data and the delivery of their contributions while they focus on the Human-in-the-loop aspect of the data, within the machine learning environment I imagine the decrease in accuracy (by 0.3 or approximately 5%) would not be described as insignificant. Do you think their verbiage is truly representative of the Machine Learning relevance?
- Do you think more Turk Workers (they used data from at least 56 workers) or adding requirements of age would change their data?
- Though evaluating the quality of collaboration is imperative between Humans and AI to ensure AI’s are made adequately, it seems common there is a disparity between comparing that collaboration and AI with AI. Due to this disconnect their statement on progress between the two collaboration studies seems like a fundamental idea. Do you think this work is more idealistic in its contributions or fundamental?
I think adding more workers or requirements might change the data but the concept will stay the same. Providing information to the workers to help them with knowing the big picture or the big goal the system tries to accomplish can make a positive change on the way the task is executed to accomplish the big goal.