Both of the assigned papers:
- “Critical Questions for Big Data” and
- “Data ex Machina”,
present a social-science perspective on Big Data and how it affects, or may come to affect, the social sciences. The former, “Critical Questions for Big Data”, focuses on the issues that big data might raise, which the authors group into six buckets, while the latter offers a more general overview of what big data is, its sources, and the resulting opportunities.
From both Boyd & Crawford’s and Lazer & Radford’s descriptions, I took away that big data is less about sheer size or volume and more about being insightful. It is named so because, given the humongous scale of the data, we now have enough computational power and good enough tools to draw meaningful insights that are closer to the ground truth than ever before.
I find Lazer & Radford’s mention of passive instrumentation quite appealing. Compared to techniques like self-reporting and surveys, which may include the subjects’ inherent biases or voluntary lies, big data offers the opportunity to passively collect data and generate insights that are closer to the ground truth of observed social phenomena. Borrowing words from the authors themselves: “First, it acts as an alternative method of estimating social phenomena, illuminating problems in traditional methods. Second, it can act as a measure of something that is otherwise unmeasured or whose measure may be disputed. Finally, nowcasting demonstrates big data’s potential to generate measures of more phenomena, with better geographic and temporal granularity, much more cost-effectively than traditional methods”.
For the rest of the reflection I’ll focus on the issues raised by big data, as they not only seem more involved, interesting, and challenging to tackle, but also raise more questions.
An interesting point that Boyd and Crawford make is that big data changes how we look at information and may very well change the definition of knowledge. They then go on to suggest that we should not be dismissive of older disciplines such as philosophy, because they might offer insights that the numbers are missing. However, they fail to give a convincing argument for why this should be so. If we are observing certain patterns of interaction or trends in populace behavior, is it really necessary to rely on philosophy to explain the phenomena? Or can the reason also be mined from the data itself?
Another issue that caught my attention in both articles is that one can get lost in numbers and start to see patterns where none exist; solely relying on the data, thinking it to be self-explanatory, is therefore naive. The techniques and tools employed may very well make the interpretation subjective rather than objective, reinforcing certain viewpoints even when they have no real basis. An example of this is the surprising correlation between the S&P 500 stock index and butter production in Bangladesh. This begs the question of whether there is a certain degree of personal skill or artistry involved in data analysis. I personally haven’t come across such examples but would love to know more if others in the class have.
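The butter-production example can be reproduced in spirit with a quick simulation (a sketch of my own, not from either paper): if you generate enough unrelated random-walk series, some of them will end up strongly correlated with any target series purely by chance, which is exactly the “patterns where none exist” problem.

```python
import random
import statistics

random.seed(0)

def random_walk(n):
    """A series built as a cumulative sum of random steps,
    a crude stand-in for things like stock indices or production figures."""
    total, walk = 0.0, []
    for _ in range(n):
        total += random.gauss(0, 1)
        walk.append(total)
    return walk

def corr(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = random_walk(100)                              # e.g. a stock index
candidates = [random_walk(100) for _ in range(1000)]   # 1,000 unrelated series

# The best match among many unrelated series is typically very strong.
best = max(abs(corr(target, c)) for c in candidates)
print(f"strongest |correlation| found purely by chance: {best:.2f}")
```

None of the series here have anything to do with each other, yet searching across enough of them reliably turns up an impressive-looking correlation, which is why “the data speaks for itself” is a dangerous stance.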
I like provocation #3 made by Boyd and Crawford: “Bigger data are not always better data”. It is all too common and easy to be convinced that a large enough dataset captures the true distribution, and being someone who works with machine learning quite often, I too am guilty of it. In actuality, the distribution might not be representative at all! If the methodology is flawed, suffers from unintended biases, or draws on a source that is inherently limited, then making claims about behaviors and trends can be misleading at best. The Twitter “firehose” vs. “gardenhose”, and “user” vs. “people” vs. “activity” examples given by Boyd and Crawford are a perfect demonstration of this. The consequences can be far-reaching, especially when learning algorithms are fed this biased or limited data: the decisions made by these systems will in turn reinforce the biases or provide flawed and limited insights, which is most certainly something we want to avoid! So the question becomes: how do we systematically find these biases and limitations in order to rectify them? Can all biases even be eliminated? This leads to an even more basic question: how does one even go about checking whether the data is representative?
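On that last question, one very basic (and admittedly crude) check I can imagine, sketched here with entirely made-up numbers, is to compare the category proportions in a collected sample against known population benchmarks such as census figures. Large gaps flag exactly the kind of Twitter-style skew Boyd and Crawford warn about:

```python
# Known population shares by age group (hypothetical census-style benchmarks).
population = {"18-29": 0.20, "30-49": 0.35, "50-64": 0.25, "65+": 0.20}

# Counts observed in a collected sample (made up, skewed toward the young,
# as a social-media sample plausibly would be).
sample_counts = {"18-29": 540, "30-49": 310, "50-64": 110, "65+": 40}

n = sum(sample_counts.values())
for group, pop_share in population.items():
    sample_share = sample_counts[group] / n
    gap = sample_share - pop_share
    print(f"{group}: sample {sample_share:.2f} vs population {pop_share:.2f} "
          f"(gap {gap:+.2f})")
```

This obviously does not catch every bias (it only checks the dimensions you thought to benchmark), which is partly why I suspect the answer to “can all biases be eliminated?” is no.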
Finally, I disagree (to a certain extent) with Boyd and Crawford’s discussion of ethics and their notion of the big-data rich and poor. I would certainly like to discuss these two points in class and hear what other people think.