Reflection #2 – [01/23] – [John Wenskovitch]

This paper examined a dataset of more than 45,000 Kickstarter projects to determine what properties make a successful Kickstarter campaign (in this case, defining success as driving enough crowd financial interest to meet the project’s funding goal).  More specifically, the authors used both quantitative control variables (e.g., project duration, presence of a video, number of updates) and common phrases scraped from each project’s home page as predictors.  Combining the two, they built a penalized logistic regression model that could predict whether or not a project would be successfully funded with only a 2.24% error rate.  The authors then connected the predictive phrases to persuasive techniques from the literature, such as reciprocity and scarcity, to better explain the persuasive power of some phrases.
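To make the modeling step concrete for myself, here is a minimal sketch (in Python, though the authors worked in R with cv.glmnet) of an L1-penalized logistic regression over phrase counts plus control variables, with the penalty strength chosen by cross-validation.  The toy data and variable names are my own illustration, not the authors’ code or features.

```python
# Minimal sketch (not the authors' code): an L1-penalized logistic regression
# over phrase counts plus quantitative controls, with the penalty strength
# chosen by cross-validation, roughly what cv.glmnet's lasso does in R.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy stand-ins for scraped project pages, control variables, and outcomes.
pages = [
    "funding will help us ship on time and backers will also receive rewards",
    "we need your help to reach our goal please donate",
    "hand made by our team given the chance to create",
    "this project will be funded and backers also receive credits",
]
funded = np.array([1, 0, 0, 1])                      # 1 = met its funding goal
controls = np.array([[30, 1, 5],                     # duration, has_video, n_updates
                     [60, 0, 1],
                     [45, 0, 2],
                     [30, 1, 8]])

# Phrase (unigram-to-trigram) counts, analogous to the paper's phrase predictors.
vectorizer = CountVectorizer(ngram_range=(1, 3))
phrases = vectorizer.fit_transform(pages)
X = hstack([csr_matrix(controls), phrases])

# Cross-validated L1 (lasso) logistic regression; the coefficients that survive
# the penalty are the controls and phrases the model finds predictive.
model = LogisticRegressionCV(Cs=10, cv=2, penalty="l1", solver="liblinear")
model.fit(X, funded)
print(model.score(X, funded))
```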

I thought that one of the most useful parts of this paper, relative to the upcoming course project, was the description of the tools the authors used.  Should my group’s course project attempt something similar, it is nice to know about tools such as Beautiful Soup, cv.glmnet, LIWC, and Many Eyes for data collection, preprocessing, analysis, and presentation.  Other techniques such as the Bonferroni correction and data repositories like the Google 1T corpus could also come in handy, and it is nice to know that they exist.  Has anyone else in the class ever used any of these tools?  Are they straightforward and user-friendly, or a nightmare to work with?
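For what it’s worth, here is roughly what the data-collection step might look like with Beautiful Soup; the URL and the CSS selector below are hypothetical placeholders rather than Kickstarter’s actual page structure.

```python
# Minimal Beautiful Soup sketch for pulling a project description out of a
# fetched HTML page. The URL and the "description" class are hypothetical
# placeholders, not Kickstarter's actual markup.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-project-page"   # placeholder URL
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
description_div = soup.find("div", class_="description")   # hypothetical selector
text = description_div.get_text(separator=" ", strip=True) if description_div else ""
print(text[:200])
```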

The authors aimed to find phrases that were common across all Kickstarter projects, and so they eliminated phrases that did not appear in all 13 project categories.  As a result, phrases such as “game credits” and “our menu” were removed from the Games and Food categories, respectively.  I can certainly understand this approach for an initial study into Kickstarter funding phraseology, but I would be curious to see whether any of these category-specific phrases (or their absence) were strong predictors of funding within each category.  I would speculate that a lack of phrases related to menus would be harmful to a funding goal in the Food category.  There might even be some common predictors that are shared across a subset of the 13 project categories; it would be interesting to see whether phrases in the Film & Video and Photography categories were shared, or in Music and Dance, to give another example.  How do you think some of the results from this study might have changed if the filtering steps were more or less restrictive?
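As a rough sketch of the filtering step I’m describing (and of the category-specific phrases it discards), assuming we already had a mapping from each category to the phrases observed in it:

```python
# Rough sketch of the cross-category filter described above: keep only phrases
# that appear in every project category. The toy dictionary is illustrative.
phrases_by_category = {
    "Games": {"game credits", "funding will help", "also receive two"},
    "Food":  {"our menu", "funding will help", "also receive two"},
    "Music": {"funding will help", "also receive two", "given the chance"},
    # ... one entry per category, 13 in the actual study
}

# Intersect the phrase sets across all categories.
common_phrases = set.intersection(*phrases_by_category.values())
print(common_phrases)   # phrases present in every category

# The category-specific phrases the filter throws away -- the ones I'd be
# curious to test as within-category predictors.
category_specific = {
    cat: phrases - common_phrases for cat, phrases in phrases_by_category.items()
}
print(category_specific)
```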

Even after taking machine learning and data analytics classes, I still treat the outputs of many machine learning models as computational magic.  As I glanced through Tables 3 and 4, a number of phrases surprised me in each group.  For example, the phrase “trash” was an indicator that a project was more likely to be funded, while “hand made by” was an indicator that a project would not be funded.  I would have expected each of these phrases to fall into the other group.  Further, I noted that very similar phrases appeared in opposite groups: “funding will help” indicated funding, whereas “campaign will help” indicated non-funding.  Did anyone else notice unexpected phrases that intuitively felt like they were placed in the wrong group?  Does the principle of keeping data in context that we discussed last week come into play here?  Similarly, I thought that the Authority persuasive argument ran counter to my own feelings.  I would tend to view phrases like “project will be” as cocky and would therefore have a negative reaction to them, rather than treating them as expert opinions.  Of course, that’s just my own view, and I’d have to read the referenced works to better understand the argument in the other direction.

I suspect that this paper didn’t get as much attention as Google Flu Trends (no offense, professor), but I’m curious to know if the phrasing in Kickstarter projects changed after this work was published.  Perhaps this could be an interesting follow-up study; have Kickstarter creators become more likely to use phrases that indicated funding and less likely to use phrases that indicated non-funding after the paper and datasets were released?  Another interesting follow-up study was hinted at in the Future Work section.  Since Kickstarter projects can be tied to Facebook, and because “Facebook Connected” was a positive predictor of a project being funded, a researcher could explore the methods by which these Kickstarter projects are disseminated via social media.  Are projects more likely to be funded based on number of posts?  Quality of posts (or phrasing in posts)?  The number of Facebook profiles that see a post related to the project?  That interact with a post related to the project?


Reflection #1 – [01/18] – [John Wenskovitch]

This pair of papers presents an overview of Big Data from a social science perspective, focused primarily on presenting the pitfalls, issues, and assumptions associated with using Big Data in studies.  The first paper that I read, Lazer and Radford’s “Data ex Machina,” provided a broader overview of Big Data topics, including data sources and opportunities provided by the use of Big Data, in addition to the issues (the paper refers to them as “vulnerabilities”).  The second paper, boyd and Crawford’s “Critical Questions for Big Data,” places greater emphasis on the issues through the use of six “provocations” to spark conversation.

One thing that stuck out to me immediately from both papers was the discussion regarding the definition (or lack thereof) of Big Data.  Both papers noted that the definition has shifted over time.  Even the most recent publications cited in those sections have differing criteria based on the size, the complexity, the variability, and even the mysticism surrounding certain data.  In a way, not having a fixed definition is a good thing due to advances in both computational power and algorithmic techniques.  If we were to limit Big Data to the terabyte or petabyte scale today, the term would be laughable in no more than a few decades.

The rest of this reflection focuses on the vulnerabilities sections of both papers, as I felt those to be the most interesting parts.  Do you agree?

To begin, I was intrigued by the idea of “Big Data hubris,” the belief that the sheer size of a dataset can solve all problems.  I recall reading other articles on Big Data claiming that patterns can be searched for exhaustively and that correlations can be assumed to be meaningful because we are dealing with population-scale data rather than sample-scale data.  However, both of these papers demolish that assertion.  As Lazer and Radford note, “big data often is a census of a particular, conveniently accessible social world.”  There are any number of reasons that a “population” of data isn’t actually a population: biases, missing values, and a lack of context regarding the data.  What are the best ways to prevent this problem in the first place?  In other words, how can we best ensure that our Big Data is actually representative of the whole population?

That segues nicely into the second point that caught my attention, boyd and Crawford’s Provocation #4.  Whenever data is taken out of context and researchers lose track of important contextual metadata, the data itself loses meaning.  As a result, the dimensions that a researcher examines (“tie strength,” for example) can provide misleading information and therefore misleading conclusions about the data.  The metadata is just as important as the data itself when trying to find correlations and meaningful results.  As someone who works with data, I found it a bit depressing that this vulnerability even needed to be included in the discussion rather than being taken as a given.  However, I am also aware that situations will exist in which data provenance is unknown, or is passed along inaccurately, or an employer simply asks an employee to analyze a dataset without asking any questions about it.  These situations should all be avoided, but in practice sometimes cannot be.  How can we best ensure that data and its provenance remain linked throughout the exploration and modeling processes?

Finally, backtracking to Provocation #2, there is a discussion about seeing patterns where none exist, simply because the size of the data permits patterns and correlations that could be either real or coincidental.  The example that boyd and Crawford give is a correlation between changes in the S&P 500 and butter production in Bangladesh, but a similar one that popped into my mind was the graph showing an inverse correlation between global sea-surface temperature and the number of pirates sailing those oceans.  That makes me wonder: what other correlations have others in the class seen in a dataset they were examining that were statistically significant but nonsensical in context?
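To convince myself how easily this happens, here is a toy simulation (my own, not from either paper): generate a pile of unrelated random walks and check how strong the best pairwise correlation is purely by chance.

```python
# Toy simulation of the "patterns where none exist" problem: generate many
# unrelated random walks and see how strong the best pairwise correlation is
# purely by chance.
import numpy as np

rng = np.random.default_rng(0)
n_series, n_points = 200, 50
walks = rng.normal(size=(n_series, n_points)).cumsum(axis=1)

corr = np.corrcoef(walks)               # pairwise correlations between series
np.fill_diagonal(corr, 0)               # ignore each series' self-correlation
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"Strongest spurious correlation: r = {corr[i, j]:.2f} between series {i} and {j}")
# With 200 unrelated walks, the strongest pair routinely shows |r| > 0.8,
# none of it meaningful.
```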
