Reflection #1 – [1/18] – [Meghendra Singh]

  1. Data ex Machina: Introduction to Big Data – Lazer et al.
  2. CRITICAL QUESTIONS FOR BIG DATA – Boyd et al.

Both papers focus on Big Data and its application to social science research. I feel that Boyd et al. take the more critical approach: after an initial introduction to Big Data, they go on to discuss their six provocations about Big Data in the social media context. I find all of the questions the authors raise to be very relevant to the subject matter. Each of them can also be seen as a potential research question in the domain of Big Data.

Big Data analysis might cause traditional approaches to analysis to be ignored, and this might not always be the right choice. We need to be wary of the fact that the ‘Volume’ of the data doesn’t imply it’s the ‘Right’ Data. It might not represent the truth, might be biased, and might not be neutral. Consider the example of inferring the political opinions in a geographic region from the Facebook posts of users in that region. What if most of the Facebook users in that region belong to a particular demographic segment? What if most of them are inactive and only read posts by other users? How do we account for bots and fake profiles? If we are to draw any inference from such data, we need to be sure that we account for these problems in our dataset, and for the fact that the digital world does not map one-to-one onto the real world. Potential research in this direction may include the development of techniques and metrics that measure the disparity between real-world data and social media data. Techniques to bridge this divide between the real and digital worlds might also hold value for future research.
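To make the idea of a disparity metric a little more concrete, here is a minimal sketch in Python. It is my own illustration, not something taken from either paper. It assumes we know the share of each demographic group both in the real population and among the platform's users, and the age-group numbers are made up. It computes the total variation distance between the two distributions (0 means identical, 1 means completely disjoint) and simple post-stratification weights that would rescale the platform sample to match the population.

```python
def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    p and q map category -> share; a missing category counts as 0.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def poststratification_weights(population, sample):
    """Per-category weight that makes the weighted sample match the population.

    Categories absent from the sample are skipped: no weight can recover
    a group that was never observed.
    """
    return {
        c: population[c] / sample[c]
        for c in population
        if sample.get(c, 0.0) > 0
    }


# Hypothetical age-group shares, invented purely for illustration.
census_share = {"18-29": 0.20, "30-49": 0.35, "50-64": 0.25, "65+": 0.20}
platform_share = {"18-29": 0.40, "30-49": 0.35, "50-64": 0.15, "65+": 0.10}

disparity = total_variation_distance(census_share, platform_share)
weights = poststratification_weights(census_share, platform_share)

print(f"disparity: {disparity:.2f}")
for group, weight in sorted(weights.items()):
    print(f"{group}: weight {weight:.2f}")
```

Reweighting like this only corrects for the attributes we can observe. It does nothing about lurkers, bots or fake profiles, so it is at best a partial bridge between the two worlds.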

Boyd et al. emphasize that the tools and systems used to create, curate and analyze Big Data might restrict how much of it can be analyzed. These tools might also influence the temporal aspect of Big Data and the kinds of analysis that can be performed on it. For example, today’s dominant social media platforms like Twitter and Facebook offer poor archiving and search functions, which makes older data hard to access and biases research efforts towards recent events like elections or natural disasters. The limitations of these platforms therefore bias the artifacts and research that stem from such analysis. As a consequence, entire subject areas, knowledge bases and digital libraries might become biased over time because of the technological shortcomings of these platforms. This opens up a slew of research problems in the domain of Big Data and social science. Can we measure the impact of these so-called ‘Platform Limitations’ on things like the research topics that appear in Big Data mining? A more technological challenge is how to make the petabyte-scale data on these platforms accessible to a larger audience, and not just to researchers from top-tier universities and the companies that own the platforms.

The authors also emphasize that design decisions made while analyzing Big Data, such as the specific ‘Data Cleaning’ process applied, might bias the data and make it unsuitable for generating authentic insights. Interpreting data is also prone to spurious correlations and apophenia. Big Data might magnify the opportunities for such misinterpretations to creep into our research, and we should be especially wary of this when handling it. Then there are issues relating to anonymity and privacy: Boyd et al. emphasize that data being accessible does not imply permission to use it. Another important aspect they discuss, which can also be a potential research question, is how to validate research done using Big Data when the data itself is inaccessible to the reviewers.

I felt that Lazer et al. take a more balanced approach while discussing the challenges associated with the use of Big Data for social science research. They discuss the pros and cons of Big Data instead of focusing only on the problems. I liked that Lazer et al. describe various studies, like the Copenhagen Network Study and the Billion Prices Project, to illustrate specific problems associated with the use of Big Data for social science. The paper brings to bear specific facts, such as how little research in prominent sociology journals uses big data, and that most of the big data relevant to social science research is massive and passive. I also find their classification of big data sources into digital life, digital traces and digitalized life comprehensive. I feel that there is a huge overlap between the challenges discussed in the two papers, although the terms used in each paper are different. I really like that Lazer et al. present these challenges in a structured way, and that each class of challenges (Generalizability, Too Many Big Data, Artifacts and Reactivity, and the Ideal User Assumption) has a meaningful and distinct definition.
