Mitra, Tanushree, and Eric Gilbert. “The Language that Gets People to Give: Phrases that Predict Success on Kickstarter.” Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW). ACM, 2014.
In this paper, Mitra and Gilbert present an exciting analysis of text from 45K Kickstarter project pitches. The authors describe a method to clean the pitch text and then use 20K phrases drawn from the pitches, along with 59 control variables, to train a penalized logistic regression model that predicts each project’s funding outcome. The results suggest that the text of a pitch plays a large role in determining whether or not a project meets its crowdfunding goal. The authors go on to discuss how the top phrases (the strongest predictors) can be understood through social psychology principles such as reciprocity, scarcity, social proof, and social identity.
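To make the modeling setup concrete, here is a minimal sketch of this kind of pipeline in scikit-learn. It is an approximation, not the authors’ implementation: the variable names are hypothetical, the n-gram range is an assumption, and the L1 (LASSO-style) penalty is one common choice of penalization.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the cleaned pitch texts, the binary funding
# outcomes, and the 59 control variables (only two shown here, e.g.
# campaign duration and goal amount).
pitches = ["backers will also receive two signed prints ...",
           "help us fund our documentary ..."]
funded = np.array([1, 0])
controls = np.array([[30, 5000.0],
                     [60, 12000.0]])

# Phrase features; the n-gram range is an assumption, and min_df stands
# in for the paper's cutoff of phrases occurring at least 50 times.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=1)
X_phrases = vectorizer.fit_transform(pitches)

# Stack phrase counts with the control variables.
X = hstack([X_phrases, csr_matrix(controls)])

# An L1 penalty shrinks most phrase coefficients to zero, leaving the
# strongest predictor phrases with nonzero betas.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, funded)
```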
The approach is interesting and suggests that the “persuasiveness” of the language used to describe an artefact significantly impacts its prospects of being bought. Since there have now been close to 257K projects on Kickstarter, I feel this is a good time to validate the results presented in the paper. By validation, I mean assessing whether the top phrases found to predict a successfully funded project have appeared in the pitches of projects that met their funding goals since August 2012, and, conversely, whether they are absent from the pitches of projects that did not meet their goals. Additionally, repeating the study might be a good idea, as there are now considerably more projects (i.e., more data to fit) and, hopefully, better modeling techniques (deep neural nets?). Repeating the study might also give us insight into how the predictor phrases have changed over the last five years. A rough sketch of this validation idea follows.
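A minimal sketch of what this check could look like, assuming we had a post-2012 scrape of projects with known outcomes; the phrases below are illustrative stand-ins for the entries in the paper’s Table 3.

```python
# Illustrative stand-ins for the paper's top predictor phrases and a
# hypothetical scrape of newer projects with known funding outcomes.
top_funded_phrases = ["also receive two", "good karma", "has pledged"]
new_projects = [
    {"pitch": "backers will also receive two prints ...", "funded": True},
    {"pitch": "we need your help to get started ...", "funded": False},
]

def hit_rate(projects, phrases):
    """Fraction of pitches containing at least one of the phrases."""
    if not projects:
        return 0.0
    hits = sum(any(p in proj["pitch"] for p in phrases) for proj in projects)
    return hits / len(projects)

funded = [p for p in new_projects if p["funded"]]
failed = [p for p in new_projects if not p["funded"]]
print("hit rate among funded projects:", hit_rate(funded, top_funded_phrases))
print("hit rate among failed projects:", hit_rate(failed, top_funded_phrases))
```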
A fundamental idea that might be explored is a robust quantitative measure of “persuasiveness” for any general block of text, perhaps built from its linguistic features and the common English phrases present in it. We could then test whether this “persuasiveness score” for a project’s pitch is a significant predictor of crowdfunding success.

Additionally, I feel that information about crowdfunded projects spreads through a network much like news, memes, or contagions. Aspects such as homophily, word of mouth, celebrities, and influencer networks may play a big role in bringing backers to a project, and these phenomena belong to the realm of complex systems, with properties like nonlinearity, emergence, and feedback. This makes the spread of information a stochastic process: unless a “potential” backer learns that a project of interest to them exists, it is unlikely they would search through the thousands of active projects on all the crowdfunding websites. It may also be that, for certain projects, most of the “potential” backers belong to a particular social community, group, or clique, and that the key to successful funding is propagating news about the project to these target communities (say, on social media?).

Another interesting research direction might be to mine backer networks from a social network. For example, how many friends, friends of friends, and so on, of a project backer also pledged to the project? It might also be useful to look at a project’s comments page and examine how the sentiment of the comments evolves over time. Is there a pattern in the evolution of these sentiments that correlates with project success or failure? A toy sketch of this idea follows.
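A toy sketch of the sentiment-trajectory idea, assuming comments have been scraped with timestamps; the tiny word lists are placeholders for a real sentiment model.

```python
import re
from collections import defaultdict

# Placeholder word lists; a real analysis would use a proper sentiment model.
POSITIVE = {"great", "love", "awesome", "excited"}
NEGATIVE = {"delay", "refund", "disappointed", "scam"}

def score(comment):
    """Crude sentiment: positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", comment.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical (campaign_week, comment_text) pairs scraped from a
# project's comments page.
comments = [(0, "Love this, so excited!"),
            (3, "Another delay? Disappointed.")]

weekly = defaultdict(list)
for week, text in comments:
    weekly[week].append(score(text))

trajectory = {week: sum(s) / len(s) for week, s in sorted(weekly.items())}
print(trajectory)  # {0: 2.0, 3: -2.0} -- does a downturn precede failure?
```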
Another trend I have noticed (e.g., in one of the Kickstarter projects I backed) is that the majority of a project’s pitch is often presented as images and video. In such cases, how would a text-only technique for predicting a project’s success fare against one that also uses the project’s images and videos as features? Can we use these underused yet pervasive data to improve classifier accuracy (a rough sketch of such a feature set appears after this paragraph)? The authors discuss, in the “Design Implications” section, potential applications of this work to help both backers and project creators. Still, there is only so much money available to the average backer; even if every crowdfunded project had a highly persuasive pitch, serendipity might determine which projects succeed.
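A rough sketch of such a multimodal feature set, using simple media metadata (image counts, video length) as a stand-in for richer image and video features from, say, a pretrained vision model; the field names are hypothetical.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical project records; the media fields are placeholders for
# features extracted from a pitch page's images and video.
projects = [
    {"pitch": "a card game about cats", "n_images": 12, "video_seconds": 95},
    {"pitch": "a documentary about rivers", "n_images": 2, "video_seconds": 0},
]
funded = np.array([1, 0])

text_features = TfidfVectorizer().fit_transform(p["pitch"] for p in projects)
media_features = csr_matrix(
    [[p["n_images"], p["video_seconds"]] for p in projects], dtype=float
)

# Concatenate text and media features, then train any standard classifier.
X = hstack([text_features, media_features])
clf = LogisticRegression().fit(X, funded)
```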
Although the paper does a good job of explaining most of the domain-specific terms, there were a couple of places that were difficult for me to grasp. For example, is there a rationale for discarding all phrases that occur fewer than 50 times in the 9M-phrase corpus? I speculate that the phrase frequencies in this corpus follow a power-law distribution; if so, it might be interesting to experiment with the frequency threshold used to filter the corpus (a quick sketch follows as a postscript). Moreover, certain phrases like nv (β = 1.88), il (β = 1.99), and nm (β = −3.08), present among the top 100 phrases listed in Tables 3 and 4 of the paper, don’t really make sense (but the cats definitely do!). It might be interesting to trace the origins of these phrases and examine why they are such strong predictors. Also, it may be good to briefly discuss the Bonferroni correction. Other than these issues, I enjoyed reading the paper, and I especially liked the word tree visualizations.
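Postscript: a quick sketch of the threshold experiment suggested above, using an illustrative stand-in for the 9M-phrase corpus.

```python
from collections import Counter

# Illustrative stand-in for the ~9M-phrase corpus.
corpus_phrases = ["free shipping"] * 120 + ["thank you"] * 40 + ["nv"] * 3

counts = Counter(corpus_phrases)
for threshold in (1, 10, 50, 100):
    kept = sum(1 for c in counts.values() if c >= threshold)
    print(f"phrases occurring >= {threshold:>3} times: {kept}")

# Under a power law, frequency falls off polynomially with rank, so a
# log-log plot of rank vs. frequency should look roughly linear, and the
# cutoff (50 in the paper) trades vocabulary size against noise.
```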