Reflection #2 – [1/23] – [Meghendra Singh]

Mitra, Tanushree, and Eric Gilbert. "The Language that Gets People to Give: Phrases that Predict Success on Kickstarter." Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '14). ACM, 2014.

In this paper, Mitra and Gilbert present an exciting analysis of text from 45K Kickstarter project pitches. The authors describe a method to clean the text from the project pitches and subsequently use 20K phrases from these pitches, along with 59 control variables, to train a penalized logistic regression model that predicts the funding outcomes for these projects. The results presented in the paper suggest that the text of a project pitch plays a big role in determining whether or not the project will meet its crowdfunding goal. Later on, the authors discuss how the top phrases (the strongest predictors) can be understood using principles from social psychology, like reciprocity, scarcity, social proof, and social identity.
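
To make the modeling setup concrete, here is a minimal sketch of the kind of pipeline the paper describes, assuming scikit-learn: phrase (n-gram) counts from the pitch text are stacked with control variables and fed to an L1-penalized (LASSO-style) logistic regression. The toy pitches, control columns, and parameter choices below are my own assumptions for illustration, not the authors' code or data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real data: pitch texts, control variables
# (e.g., goal amount and campaign duration), and funding outcomes.
pitches = [
    "also receive two signed copies if you pledge today",
    "we need your help to finish this documentary",
    "pledgers will be mentioned in the credits",
    "funds go toward manufacturing and shipping",
]
controls = np.array([[5000, 30], [20000, 45], [1500, 30], [80000, 60]], dtype=float)
funded = np.array([1, 0, 1, 0])

# Unigram-to-trigram phrase counts, loosely mirroring the paper's phrase features.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=1)
X_phrases = vectorizer.fit_transform(pitches)

# Stack phrase features with the control variables
# (in practice one would standardize the controls first).
X = hstack([X_phrases, csr_matrix(controls)]).tocsr()

# L1-penalized logistic regression (the paper used a LASSO-style penalty).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, funded)

# Phrases with the largest positive coefficients would be the
# "top predictor" phrases of successful funding in this toy setup.
phrase_coefs = clf.coef_.ravel()[: X_phrases.shape[1]]
top = np.argsort(phrase_coefs)[::-1][:5]
print([vectorizer.get_feature_names_out()[i] for i in top])
```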

The approach is interesting and suggests that the "persuasiveness" of the language used to describe an artefact significantly impacts the prospects of it being bought. Since there have now been close to 257K projects on Kickstarter, I feel this is a good time to validate the results presented in the paper. By validation, I mean assessing whether the top phrases found to predict a successfully funded project have appeared in the pitches of projects that met their funding goals since August 2012, and whether the same holds for projects that didn't meet their funding goals. Additionally, repeating the study might be a good idea, as there are considerably more projects now (i.e., more data to fit) and "hopefully" better modeling techniques (deep neural nets?). Repeating the study might also give us insights into how the predictor phrases have changed in the last 5 years.

A fundamental idea that might be explored is coming up with a robust quantitative measure of "persuasiveness" for any general block of text, perhaps using the linguistic features and common English phrases present in it (a toy sketch of such a score follows below). We could then explore whether this "persuasiveness score" for a project's pitch is a significant predictor of success for a crowdfunding project.

Additionally, I feel that information about crowdfunded projects spreads much like news, memes, or contagions in a network. Aspects like homophily, word of mouth, celebrities, and influencer networks may play a big role in bringing backers to a crowdfunding project, and these phenomena belong to the realm of complex systems, with properties like nonlinearity, emergence, and feedback. I feel this makes the spread of information a stochastic process: unless a "potential" backer learns about the existence of a project of interest to them, it is unlikely they would search through the thousands of active projects across all the crowdfunding websites. It may also be possible that, for certain projects, most of the "potential" backers belong to a particular social community, group, or clique, and the key to successful funding might be to propagate news about the project to these target communities (say, on social media?). Another interesting research direction might be to mine backer networks from a social network. For example, how many friends, friends of friends, and so on, of a project backer also pledged to the project? It might also be useful to look at the project's comments page and examine how the sentiment of these comments evolves over time. Is there a pattern to the evolution of these sentiments that correlates with project success or failure?
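
Purely as a thought experiment, here is what a crude "persuasiveness score" could look like: the density of cue phrases loosely mapped to principles like reciprocity, scarcity, and social proof. The cue lists, the `persuasiveness_score` function, and the per-100-words normalization are all my own illustrative assumptions, not a validated measure and not anything from the paper.

```python
# Rough, illustrative "persuasiveness" proxy: cue-phrase hits per 100 words.
# The cue phrases below are invented for illustration only.
CUE_PHRASES = {
    "reciprocity": ["you will receive", "as a thank you", "in return"],
    "scarcity": ["limited edition", "only a few left", "while supplies last"],
    "social_proof": ["backers have", "join the", "others have pledged"],
}

def persuasiveness_score(text: str) -> float:
    """Count cue-phrase occurrences per 100 words of pitch text."""
    lowered = text.lower()
    n_words = max(len(lowered.split()), 1)
    hits = sum(lowered.count(p) for cues in CUE_PHRASES.values() for p in cues)
    return 100.0 * hits / n_words

pitch = "As a thank you, every backer will receive a limited edition print."
print(round(persuasiveness_score(pitch), 2))
```

One could then test whether such a score adds predictive power over the control variables alone, which would be one way of operationalizing the question raised above.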

Another trend that I have noticed (e.g., in one of the Kickstarter projects I had backed) is that the majority of a project's pitch is presented in the form of images and video. In such cases, how would a text-only technique for predicting the success of a project fare against one that also uses images and videos from the project as features? Can we use these underused yet pervasive data to improve classifier accuracy? The authors discuss in the "Design Implications" section the potential applications of this work to help both backers and project creators. However, I feel that there is only so much money available to the everyday Joe, and even if all the crowdfunded projects had highly persuasive pitches, serendipity might still determine which projects become successful.

Although the paper does a good job of explaining most of the domain-specific terms, there were a couple of places that were difficult for me to grasp. For example, is there a rationale behind throwing away all phrases that occur fewer than 50 times in the 9M-phrase corpus? I speculate that the phrase frequencies in this corpus follow a power-law distribution, in which case it might be interesting to experiment with the frequency threshold used to retain phrases. Moreover, certain phrases like nv (beta = 1.88), il (beta = 1.99), and nm (beta = -3.08), present in the top 100 phrases listed in Tables 3 and 4 of the paper, don't really make sense (but the cats definitely do!). It might be interesting to trace the origins of these phrases and examine why they are such important predictors. Also, it might have been good to briefly discuss the Bonferroni correction. Other than these issues, I enjoyed reading the paper, and I especially liked the word-tree visualizations.
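
As a small illustration of the threshold experiment suggested above, one could count phrase frequencies, inspect the rank-frequency curve (to check how heavy-tailed it really is), and see how the retained vocabulary shrinks as the minimum-count cutoff grows. The toy phrase list below is a stand-in for the real 9M-phrase corpus; the cutoff values are arbitrary except for 50, which is the paper's choice.

```python
from collections import Counter

# Toy phrase corpus standing in for the 9M phrases extracted from pitches.
phrases = ["receive two", "receive two", "the chance to", "receive two",
           "used in a", "the chance to", "project will be", "used in a"]
counts = Counter(phrases)

# Rank-frequency view: on the real corpus, plotting log(rank) vs. log(count)
# would show whether the distribution looks power-law-like.
for rank, (phrase, count) in enumerate(counts.most_common(), start=1):
    print(rank, phrase, count)

# How sensitive is the retained vocabulary to the minimum-count cutoff?
for min_count in (1, 2, 3, 50):
    kept = [p for p, c in counts.items() if c >= min_count]
    print(f"min_count={min_count}: {len(kept)} phrases retained")
```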


Reflection #1 – [1/18] – [Meghendra Singh]

  1. Data ex Machina: Introduction to Big Data – Lazer et al.
  2. Critical Questions for Big Data – Boyd et al.

Both papers focus on Big Data and its application to social science research. I feel that Boyd et al. take a more critical approach towards Big Data in social science: after an initial introduction to Big Data, they go on to discuss their six provocations about Big Data in the social media context. I find all of the questions the authors raise in this text to be very relevant to the subject matter, and each of them can be seen as a potential research question in the domain of Big Data.

Big Data analysis might cause other, more traditional approaches of analysis to be ignored, and this might not always be appropriate. We need to be wary of the fact that the 'volume' of the data doesn't imply it is the 'right' data: it might not represent the truth, might be biased, and might not be neutral. Consider the example of inferring the political opinions in a geographic region based on the Facebook posts by users in that region. What if most of the Facebook users there belong to a particular demographic segment? What if most of them are inactive and only read posts by other users? How do we account for bots and fake profiles? If we are to draw any inference from such data, we need to account for these problems with our dataset and for the fact that the digital world does not map one-to-one onto the real world. Potential research in this direction may include the development of techniques and metrics that can measure the disparity between real-world data and the data on social media. Techniques to bridge this divide between the real and digital worlds might also hold some value for future research.

Boyd et al. emphasize that the tools and systems used to create, curate, and analyze Big Data might impose restrictions on the amount of Big Data that can be analyzed. These aspects might also influence the temporal aspect of Big Data and the kinds of analysis that can be performed on it. For example, today's dominant social media platforms like Twitter and Facebook offer poor archiving and search functions, which makes it difficult to access older data and thereby biases research efforts towards recent events like elections or natural disasters. The limitations of these platforms therefore bias the artifacts and research that stem from such analysis. As a consequence, entire subject areas, knowledge bases, and digital libraries might become biased over time because of the technological shortcomings of these social media platforms. This opens up a slew of research problems at the intersection of Big Data and the social sciences. Can we measure the impact of these so-called 'platform limitations' on, say, the research topics that appear in the domain of Big Data mining? A more technological challenge that needs to be addressed is how to make the petabyte-scale data on these platforms accessible to a larger audience, and not just to researchers from top-tier universities and the companies that own these platforms.

The authors also emphasize that design decisions made while analyzing Big Data, such as the specific 'data cleaning' process applied, might bias it and hence make it an unsuitable candidate for generating authentic insights. Interpreting data is also often prone to spurious correlations and apophenia, and Big Data might magnify the opportunities for such misinterpretations to creep into our research, so we should be especially wary of these aspects while handling Big Data. Then there are issues relating to anonymity and privacy: Boyd et al. emphasize that data being accessible does not imply permission to use it. Another important aspect discussed, which can also serve as a potential research question, is: how do we validate research done using Big Data when the data itself is inaccessible to the reviewers?

I felt that Lazer et al. take a more balanced approach while discussing the challenges associated with the use of Big Data for social science research. They discuss the pros and cons of Big Data for social science instead of focusing only on the problems. I liked that Lazer et al. describe various studies, like the Copenhagen Networks Study and the Billion Prices Project, to emphasize specific problems associated with the use of Big Data for social science. The paper brings up specific facts, such as how little research in prominent sociology journals uses big data, and that most of the big data relevant to social science research is massive and passive. I also find their classification of big data sources into digital life, digital traces, and digitalized life comprehensive. I feel that there is a huge overlap between the challenges discussed in the two papers, although the terms used in each paper are different. I really like that Lazer et al. present these challenges in a structured way, and that each class of challenges (generalizability, too many big data, artifacts and reactivity, and the ideal user assumption) has a meaningful and distinct definition.
