Tanushree Mitra and Eric Gilbert. 2014. The language that gets people to give: phrases that predict success on kickstarter. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (CSCW ’14). ACM, New York, NY, USA, 49-61. DOI=http://dx.doi.org/10.1145/2531602.2531656
Summary
The authors of this paper aim to identify the kind of language in crowdfunding projects that leads to successful funding. The raw dataset comes from the crowdfunding website Kickstarter. The data span June to August 2012 and comprise 45K crowdfunded projects, from which the authors analyze 9M phrases and 59 other variables commonly present on crowdfunding sites. The response variable is whether the project was funded or not. The predictor variables are partitioned into two broad categories. First, control variables such as project goal, project duration and others. Second, the predictors of interest, which are phrases scraped from the text of each project homepage on Kickstarter. The statistical model the authors use is a penalized logistic regression that predicts whether the project was funded. The preferred penalty is the LASSO, on the grounds that it is parsimonious. The model achieves a cross-validation error of about 2.41% and a prediction error of 2.24%. The addition of the phrases decreases the predictive error by 15%. The authors therefore conclude that phrases have significant predictive power and proceed to rank the coefficients, or “weights”, from highest to lowest. Furthermore, they group the phrases into categories using the Linguistic Inquiry and Word Count program (LIWC). Then they compare the phrases with non-zero β coefficients against the Google 1T corpus data. Through a series of statistical tests, they arrive at a subset of 494 positive and 453 negative predictors. Finally, the authors discuss the theoretical implications of the results. They argue that projects whose phrases signal reciprocity, scarcity, social identity, liking and authority are more likely to be funded.
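For reference, the general form of an L1-penalized (LASSO) logistic regression can be written as below. This is the textbook objective, not necessarily the exact parameterization the authors used; the symbols are my own notation, with y_i the funded indicator of project i, x_i its vector of controls and phrase indicators, and λ the penalty strength chosen by cross-validation.

```latex
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\,\beta}\;
    -\frac{1}{N}\sum_{i=1}^{N}
      \Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
            - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
    \;+\; \lambda \,\lVert \beta \rVert_1
```

The first term is the average negative log-likelihood of the logistic model, and the second term is what pushes many phrase coefficients to exactly zero.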
Reflection
This paper demonstrates the power of big data for addressing research questions that researchers could not explore until a few years ago. Not only does it analyze a large amount of data, 9 million phrases, but it also selects a subset and groups them into meaningful categories. Finally, theories from social psychology are used to draw conclusions that could generalize the results. In addition, businesses that opt for crowdfunding could use more of these phrases to improve their chances of receiving funding. Interestingly, most of the limitations that the authors mention are inherent to big data problems, as discussed in the previous lecture.
I find the “all or nothing” funding principle interesting. I think it should be highlighted, because it means that businesses should choose their project goal and duration carefully to secure funding. As the literature review suggests, projects with longer durations and higher goals are less likely to be funded. Both project goal and project duration are controlled for in the model.
In addition, it should be noted that the projects belong to 13 distinct categories. It would be interesting to know the demographics of the people who fund the projects. This could answer a number of questions, such as whether every project is funded by a specific demographic category, or whether some phrases are more appealing to a specific demographic. Perhaps the businesses would prefer to have their funding from the same demographic category that they target as their future clients or customers.
Another interesting piece of information would be how “concentrated” the funds are among a small number of people. Was 90% of the funding for a given project from one person and the rest from hundreds of people? Furthermore, there is heterogeneity in the sources of funding that affects the dependent variable, whether the project is funded or not, and that the model does not capture.
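To make the idea of concentration concrete, here is a minimal sketch of two possible measures: the share of the largest backer and a Herfindahl-style index (the sum of squared funding shares). It assumes per-backer pledge amounts were available, which nothing in the paper summary indicates; the function names and the pledge figures are hypothetical.

```python
def top_backer_share(pledges):
    """Fraction of total funding contributed by the single largest backer."""
    total = sum(pledges)
    return max(pledges) / total


def herfindahl_index(pledges):
    """Sum of squared funding shares; 1.0 means one backer gave everything."""
    total = sum(pledges)
    return sum((p / total) ** 2 for p in pledges)


# Hypothetical project: one backer pledges 9000, 100 backers pledge 10 each.
concentrated = [9000] + [10] * 100
# Hypothetical project raising the same total, spread evenly over 100 backers.
even = [100] * 100

for name, pledges in [("concentrated", concentrated), ("even", even)]:
    print(name, top_backer_share(pledges), herfindahl_index(pledges))
```

Either number could enter the model as an extra control, so that two projects with the same total but very different backer structures are not treated as identical.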
The authors chose the LASSO because it is parsimonious. An additional advantage of the LASSO is that it yields a narrower subset of non-zero coefficients for further analysis, since it also works as a model selection technique. For example, if ridge regression were used, the authors would have to analyze more phrases, most of which would probably not be important. However, a drawback of penalized regression approaches is that their coefficients are harder to interpret. For instance, the coefficient of a classical logistic regression gives the change in the log-odds that a project is funded when a specific phrase is used, ceteris paribus, whereas a penalized coefficient is deliberately shrunk toward zero and is therefore a biased estimate of that effect. Still, LASSO is preferable to artificial neural networks, because the authors are interested not only in the predictive power of the model but ultimately in interpreting the results. Perhaps a decision tree approach would also be useful, because it also selects a subset of variables and allows for interpretation.
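The contrast between LASSO and ridge can be illustrated with a small sketch. It is not the authors' pipeline: the data are a made-up toy set with one strongly predictive phrase and one weakly predictive phrase, the solver is a plain proximal gradient loop written for this note, and the names `fit`, `CELLS`, `lam` and `lr` are my own assumptions.

```python
import math

# Toy data (an assumption for illustration, not the paper's dataset).
# Keys are (strong_phrase, weak_phrase) indicators; each cell holds 5 projects
# and the value is how many of those 5 were funded.
CELLS = {(1, 1): 4, (1, 0): 4, (0, 1): 2, (0, 0): 1}
X, y = [], []
for (strong, weak), funded in CELLS.items():
    for i in range(5):
        X.append((strong, weak))
        y.append(1 if i < funded else 0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit(X, y, penalty, lam=0.05, lr=0.5, steps=5000):
    """Minimize mean logistic loss plus an L1 ("l1") or L2 ("l2") penalty.

    The intercept is left unpenalized in both cases.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # residual = predicted probability minus observed label
        residuals = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                     for row, yi in zip(X, y)]
        grad_b = sum(residuals) / n
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                  for j in range(d)]
        b -= lr * grad_b
        if penalty == "l1":
            # Proximal step: gradient step on the loss, then soft-thresholding.
            w = [soft_threshold(wj - lr * gj, lr * lam)
                 for wj, gj in zip(w, grad_w)]
        else:
            # Ridge: the penalty (lam / 2) * w_j**2 adds lam * w_j to the gradient.
            w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
    return w, b


lasso_w, lasso_b = fit(X, y, "l1")
ridge_w, ridge_b = fit(X, y, "l2")
print("LASSO weights (strong, weak):", lasso_w)
print("ridge weights (strong, weak):", ridge_w)
```

On this toy set the LASSO fit keeps the strong phrase and sets the weak phrase's weight to exactly zero, while the ridge fit leaves the weak phrase with a small positive weight. With millions of candidate phrases, that difference is what makes the LASSO output short enough to inspect by hand.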
Questions
- Would using other statistical models improve the predictive performance?
- Can we find information about the demographics of the people who fund the projects? If so, we could link the phrases to demographics. For instance, are some phrases more effective depending on the gender of the backer?