Structuring, Aggregating, and Evaluating Crowdsourced Design Critique

K. Luther, J.-L. Tolentino, W. Wu, A. Pavel, B. P. Bailey, M. Agrawala, B. Hartmann, and S. P. Dow, “Structuring, Aggregating, and Evaluating Crowdsourced Design Critique,” in Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, New York, NY, USA, 2015, pp. 473–485.

Discussion leader: Will Ellis

Summary

In this paper, Luther et al. describe CrowdCrit, a system for eliciting and aggregating design feedback from crowdworkers. The authors motivate their work by explaining the peer and instructor feedback process often employed in design classrooms. By exposing their designs to high-quality constructive criticism, designers can improve both their designs and their craft. However, the authors point out that such essential feedback is difficult to find outside the classroom. Online design communities may provide some critique, but it is often too sparse and too shallow. To address this problem, the authors built CrowdCrit and tested it in three studies, which asked: How similar are crowd critiques to expert critiques? How do designers react to crowd critiques? And how does crowd critique impact the design process and results?

In Study 1, the authors had a group of 14 crowdworkers recruited through CrowdCrit and a group of 3 design experts evaluate 3 poster designs using the CrowdCrit interface. They found that while individual crowdworkers’ critiques matched the experts’ poorly, the crowd in aggregate matched 45% to 60% of the experts’ design critiques. The results suggest that recruiting even more workers would produce critiques that match the experts’ more closely.
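To make the aggregation idea concrete, here is a minimal sketch (my own illustration, not the authors’ code; all names and critique IDs are hypothetical) that pools each worker’s critique selections and measures what fraction of the experts’ critiques the pooled crowd covers.

    # Hypothetical sketch of Study 1's aggregation idea: pool each worker's
    # critique selections, then measure overlap with the experts' selections.

    def crowd_expert_overlap(crowd_selections, expert_selections):
        """Return the fraction of expert critiques raised by at least one worker.

        crowd_selections: list of sets, one set of critique IDs per crowdworker
        expert_selections: set of critique IDs chosen by the experts
        """
        pooled = set().union(*crowd_selections)
        return len(pooled & expert_selections) / len(expert_selections)

    # Example: three workers together cover 3 of the 5 expert critiques (60%),
    # in the same range as the 45-60% match reported in the paper.
    workers = [{"c1", "c4"}, {"c2"}, {"c4", "c9"}]
    experts = {"c1", "c2", "c3", "c4", "c5"}
    print(crowd_expert_overlap(workers, experts))  # 0.6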

In Study 2, the authors tested designers’ reactions to crowd critiques by way of a poster contest for a real concert. Designers spent an hour designing a poster according to client criteria. Both crowdworkers and the client then generated feedback using CrowdCrit, and the designers had the chance to review that feedback and revise their designs before the client chose a winning poster. In interviews, the designers said the majority of the crowd critiques were helpful and that they appreciated many of CrowdCrit’s features, including abundant feedback, decision affirmation, scaffolded responses, and anonymity.

In Study 3, the authors evaluated the impact of crowd critique on the design process using another design contest, this time hosted on 99designs.com. After the initial design stage, half of the participating designers were given crowd feedback through CrowdCrit, and the other half were given generic, unhelpful feedback. The final designs were evaluated by both the client and a group of crowdworkers who met a certain design expertise threshold. While the designers appreciated the crowd feedback more than the generic feedback, the results showed no significant difference in quality between the treatment and control groups’ designs.

The authors conclude with the implications of their work. They suggest that crowd feedback may make designers feel as though they are making major revisions when in fact they are only making minor improvements. Indeed, the nature of CrowdCrit seems to ensure that designers receive long lists of small changes that do not push them toward substantive design changes but, if implemented, contribute to busier, less simple designs.

Reflection

CrowdCrit is implemented on top of Amazon Mechanical Turk and thus has the benefit of being able to pull feedback from a large pool of design novices. This paper makes the case that such feedback, in aggregate, can approximate the feedback of design experts. I am concerned, however, about the amount of noise introduced by the aggregation approach discussed in Study 1. Yes, with enough crowdworkers, you will eventually have enough people clicking enough critique checkboxes that every critique an expert selected will also be selected by some crowdworker. But if we assume that the critiques an expert would have made are the most salient, the designer has no way to separate the salient from the inconsequential. I would hope that the most-selected critiques made by an army of crowdworkers would better approximate those of an actual expert, but the authors do not explore this strategy. I would also explore a weighting system that favors critiques from CrowdCrit’s design-experienced crowdworkers, not just by coloring them more boldly, but also by hiding novice critiques with low replication; a rough sketch of such a scheme follows.
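As an illustration of that weighting idea (my own sketch, not a feature of CrowdCrit; the weights and threshold are invented for the example), critiques could be scored by the experience level of the workers who select them, with low-replication novice critiques filtered out before they reach the designer.

    # Hypothetical weighting scheme: experienced workers' selections count more,
    # and critiques raised only by a single novice are hidden until replicated.

    from collections import defaultdict

    EXPERIENCED_WEIGHT = 2.0  # assumed weight for design-experienced workers
    NOVICE_WEIGHT = 1.0
    MIN_SCORE = 2.0           # hide critiques that score below this threshold

    def rank_critiques(votes):
        """votes: list of (critique_id, is_experienced) tuples, one per selection."""
        scores = defaultdict(float)
        for critique_id, is_experienced in votes:
            scores[critique_id] += EXPERIENCED_WEIGHT if is_experienced else NOVICE_WEIGHT
        visible = {c: s for c, s in scores.items() if s >= MIN_SCORE}
        return sorted(visible.items(), key=lambda kv: kv[1], reverse=True)

    votes = [("too many typefaces", True), ("too many typefaces", False),
             ("low contrast", False), ("crowded layout", False),
             ("crowded layout", False)]
    print(rank_critiques(votes))
    # [('too many typefaces', 3.0), ('crowded layout', 2.0)]
    # The lone novice critique ('low contrast') stays hidden until replicated.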

I am impressed by the effort and technique the authors employed to distill their seven design principles, which they derived by analyzing important works on design teaching. The scaffolding approach to teaching design to crowdworkers was novel, and I appreciated the explanation of the pilot studies they ran to arrive at their strategy. I wonder whether those who would use a system like CrowdCrit, the designers themselves, would also benefit from participating as workers in it. Much like in a design classroom, they could benefit from the scaffolded learning and application of design principles, which they may only know in part.

In Study 3, I’m sure the authors were disappointed to find no statistically significant improvement in design outcomes using crowd feedback. However, I think the most important goal of peer and expert critique, at least in the classroom, is not to improve the design, but to improve the designer. With that in mind, it would be interesting to see a longitudinal study evaluating the output of designers who use CrowdCrit over a significant period of time.

Questions

  • Study 1 shows that adding more workers produces more data, but also more “false positives.” The authors conjecture that these may not be false positives at all, but could in fact be critiques that the experts missed. Are the authors correct, or is this just more noise? Is the designer impaired by so many extra critiques?
  • CrowdCrit is designed to work with any kind of crowd, not just the Mechanical Turk community. Based on other papers we’ve read, how could we restructure CrowdCrit to fit within a community of practice like graphic design?
  • Study 3 seems to show that, for a single design, critique does not improve the design so much as simple iteration does. Is feedback actually an important part of the design process? If so, how do we know? If we accept that feedback is an important part of the design process, how might we design a study that evaluates CrowdCrit’s contribution?
  • The results of Study 2 show a lot of positive feedback from designers for CrowdCrit’s features and interface. Implied in the designers’ comments is their enthusiasm for mediated engagement with clients and users (crowdworker stand-ins in this case) over their designs. What are CrowdCrit’s most important contributions in this regard?

Something Cool

Moneyball, but for Mario—the data behind Super Mario Maker popularity
