Reflection #1 – [01/18] – [John Wenskovitch] | CS6724 Spring18: Computational Social Science

This pair of papers presents an overview of Big Data from a social science perspective, focused primarily on presenting the pitfalls, issues, and assumptions associated with using Big Data in studies. The first paper that I read, Lazer and Radford’s “Data ex Machina,” provided a broader overview of Big Data topics, including data sources and opportunities provided by the use of Big Data, in addition to the issues (the paper refers to them as “vulnerabilities”). The second paper, boyd and Crawford’s “Critical Questions for Big Data,” places greater emphasis on the issues through the use of six “provocations” to spark conversation.

One thing that stuck out to me immediately from both papers was the discussion regarding the definition (or lack thereof) of Big Data. Both papers noted that the definition has shifted over time. Even the most recent publications cited in those sections have differing criteria based on the size, the complexity, the variability, and even the mysticism surrounding certain data. In a way, not having a fixed definition is a good thing, given the continuing advances in both computational power and algorithmic techniques. If we were to limit Big Data to the terabyte or petabyte scale today, the term would be laughable within a few decades.

The rest of this reflection focuses on the vulnerabilities sections of both papers, as I felt those to be the most interesting parts. Do you agree?

To begin, I was interested in the idea of “Big Data hubris,” the belief that the size of the data can solve all problems. I recall reading other articles on Big Data which argued that patterns can be searched for exhaustively, and correlations taken as meaningful, because we are dealing with population-scale data rather than sample-scale data. However, both of these papers demolish that assertion. As Lazer and Radford note, “big data often is a census of a particular, conveniently accessible social world.” A “population” of data may not actually be a population for any number of reasons, including biases, missing values, and a lack of context regarding the data. What are the best ways to prevent this problem to begin with; or in other words, how can we best ensure that our Big Data is actually representative of the whole population?
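To make the “census of a conveniently accessible social world” point concrete for myself, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the synthetic population, the 30% "online" share, and the outcome values are hypothetical and do not come from either paper. It compares a large census of only the accessible group against a small random sample of everyone.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30% are "online" (conveniently accessible) and
# their outcome values are centered higher than everyone else's.
population = []
for _ in range(100_000):
    online = rng.random() < 0.3
    value = rng.gauss(60 if online else 40, 10)
    population.append((online, value))

true_mean = sum(v for _, v in population) / len(population)

# "Big Data": a complete census, but only of the accessible sub-world.
convenient = [v for online, v in population if online]
census_mean = sum(convenient) / len(convenient)

# "Small data": a simple random sample of 500 drawn from the whole population.
sample = rng.sample(population, 500)
sample_mean = sum(v for _, v in sample) / len(sample)

print(f"true mean:             {true_mean:.2f}")
print(f"census of online only: {census_mean:.2f} (n={len(convenient)})")
print(f"random sample:         {sample_mean:.2f} (n={len(sample)})")
```

In this toy setup the much larger "census" lands far from the true mean because it only covers one sub-world, while the small random sample lands close. Size alone does not buy representativeness.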

That segues nicely into the second point that caught my attention, boyd and Crawford’s Provocation #4. Whenever data is taken out of context and researchers lose focus on important contextual metadata, the data itself loses meaning. As a result, dimensions that a researcher examines (“tie strength” for example) can provide misleading information and therefore lead to misleading conclusions about the data. The metadata is just as important as the data itself when trying to find correlations and meaningful results. As someone who works with data, I found it a bit depressing that this vulnerability even needed to be included in the discussion rather than being assumed as a given. However, I am also aware that situations will exist in which data provenance is unknown, or is passed along inaccurately, or in which an employer just asks an employee to analyze a dataset without asking any questions about it. Ideally these situations would all be avoided, but in practice some may be unavoidable. How can we best ensure that data and its provenance remain linked throughout the exploration and modeling processes?

Finally, backtracking to Provocation #2, there is a discussion about seeing patterns where none exist, simply because the size of the data makes it easy to find patterns and correlations that could be either real or coincidental. The example that boyd and Crawford give is a correlation between changes in the S&P 500 and butter production in Bangladesh, but a similar one that popped into my mind was the graph showing an inverse correlation between global average temperature and the number of pirates sailing the seas. That makes me wonder, what other correlations have others in the class seen within a dataset that they’ve been examining that were statistically significant but nonsensical in context?
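The "patterns where none exist" effect is easy to reproduce. The sketch below is my own illustration, not anything from boyd and Crawford: it assumes purely random Gaussian noise for every series, and the function names `pearson` and `strongest_chance_correlation` are hypothetical helpers. It shows that the more unrelated candidate series we compare against a target, the stronger the best coincidental correlation becomes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_candidates, n_points=20, seed=0):
    """Largest |r| between a random target and n_candidates unrelated series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise, so any correlation is coincidence.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


few = strongest_chance_correlation(10)
many = strongest_chance_correlation(10_000)
print(f"best |r| among 10 random series:     {few:.2f}")
print(f"best |r| among 10,000 random series: {many:.2f}")
```

Because every series here is noise, whatever correlation wins the search is spurious by construction. Searching a bigger haystack simply turns up a more convincing-looking needle.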

John Wenskovitch
