Reflection #1 – [1/18] – [Aparna Gupta]

Summary:

Both papers discuss what Big Data means in today’s world and how its definition changes with the field being studied. They focus on the vulnerabilities and pitfalls of Big Data, on how combining data from various sources poses a challenge to researchers in every field, and on the enormous institutional challenges it presents to sociology. The first paper, ‘Critical Questions for Big Data’, focuses on the significant questions emerging in the field and offers six provocations to spark conversations about its issues. The second paper, ‘Data Ex Machina’, focuses more on the evolution of Big Data and its vulnerabilities.

Reflection:

The part I found most interesting is how Big Data has evolved over time and proliferated across society. Its manifestation varies by field of study: in astronomy it takes the form of images, in the humanities it can take the form of digitized books, and in social science it takes the form of data from social media websites like Twitter and Facebook.

What caught my interest is how data from various social media platforms can be combined and applied in a field like sociology to analyze and understand human behavior. With the transition to digital life, the entire Internet may be viewed as an enormous platform for human expression. However, I wonder how accurately what happens on these platforms (likes, tweets, and posts) can be used in the social sciences to infer human experience.

What I found disappointing was the ‘Ideal User Assumption’ explained in Lazer’s article, which draws attention to how data that appears to come from distinct, unique people can in fact be fake, and how various organizations and government agencies use bots to achieve surreptitious goals, posing a further threat to validity. According to Xinhua (2016), the Chinese peer-to-peer lending platform Ezubao was revealed to be a Ponzi scheme in which 95% of the proposals were fake. Committees like IRBs exist to ensure ethics in particular lines of research inquiry (such as human-subjects research), but they cannot catch problems like these.

Questions:

  1. How can we ensure the data integrity across various social media platforms?
  2. Reiterating, who is responsible for making certain that individuals and communities are not hurt by the research process?
  3. How can we claim what our data represents?
  • Does it represent a group of individuals or just an individual?
  • Can insights drawn from data collected from an individual be generalized, especially in the field of social science?


Reflection #1 – [1/18] – [Pratik Anand]

First paper : CRITICAL QUESTIONS FOR BIG DATA

Second paper : Data ex Machina: Introduction to Big Data

Summary

Both papers start with an introduction to big data. One of the main points raised is the changing definition of big data: earlier it was about the size of the data, now it is about searching, aggregating, and processing it. The first paper jumps directly into a critical analysis of the big data phenomenon. The second paper, on the other hand, takes a more comprehensive and structured approach. Both papers discuss sources of data acquisition, the emergence of an entirely new set of digital data, data privacy concerns, analysis, and conclusions. They show how big data is changing the approach to research, and how an abundance of gathered data might not truly represent the whole problem sample. Even conclusions drawn from balanced, representative data can be subjective and biased. The mention of “Big Data Hubris” shows that having a large volume of data does not guarantee better results. While the first paper continues its critical analysis of big data, the second provides future trends in which the volume of data will grow in size and diversity of platforms and more generic data models will take over.

Reflection

The first paper does a really good job of raising important questions related to big data, starting with its definition. The second paper is more of an introduction and provides only a high-level view of the issues. For initial reading, the second paper can be used for an overview of the big data industry, followed by the first paper for discussion of its most important issues. Issues like privacy and unethical data collection are central to this debate, as individuals, corporations, or even governments can easily misuse the data. The contrast between the two papers is quite evident: the second describes the problems faced by the big data field today and its future trends, whereas the first discusses the problems caused by big data and critically analyzes its aspects. Aspects like uneven access to digital data and poorly understood analysis results can have ramifications for large sections of society, and the first paper raises the right kind of questions for civil discussion.

Questions

1) Since the volume of data generated will continue to grow, how can governments ensure its protection and ethical treatment?
2) Who should actually own digital footprint data: the individual, or the companies collecting it?
3) Is the data-driven approach the right one for all kinds of social problems? Will it lead to less focus on areas where it is inherently difficult to generate large volumes of data?


Reflection # 1 – [01/18] – [Jamal A. Khan]

Both of the assigned papers:

  • “Critical Questions for Big Data” and
  • “Data ex Machina”,

present a social-science perspective on Big Data and how it affects, or can and will affect, the social sciences. The former, “Critical Questions for Big Data”, focuses on the issues that big data might raise, which the authors have put into six buckets; the latter offers a more general overview of what big data is, its sources, and the resulting opportunities.

From both Boyd & Crawford’s and Lazer & Radford’s descriptions, I took away that big data is less about size or volume and more about being insightful. It is named so because given the humongous scale of data, we now have enough computational power and good enough tools to be able to draw meaningful insights that are closer to the ground truth than ever before.

I find Lazer & Radford’s mention of passive instrumentation quite appealing. Compared to techniques like self-reporting and surveys, which may include subjects’ inherent biases or voluntary lies, big data offers the opportunity to passively collect data and generate insights closer to the ground truth of observed social phenomena. Borrowing words from the authors themselves: “First, it acts as an alternative method of estimating social phenomena, illuminating problems in traditional methods. Second, it can act as a measure of something that is otherwise unmeasured or whose measure may be disputed. Finally, nowcasting demonstrates big data’s potential to generate measures of more phenomena, with better geographic and temporal granularity, much more cost-effectively than traditional methods”.

For the rest of the reflection I’ll focus on the issues raised by big-data as they not only seem to be more involved, interesting and challenging to tackle but also raise more questions.

An interesting point that Boyd and Crawford make is that big data changes how we look at information and may very well change the definition of knowledge. They go on to suggest that we should not be dismissive of older disciplines such as philosophy, because they might offer insights that the numbers are missing. However, they fail to give a convincing argument for why this should be so. If we are observing certain patterns of interaction or certain trends in populace behavior, is it really necessary to rely on philosophy to explain the phenomena? Or can the reason also be mined from the data itself?

Another issue that caught my attention in both articles is that one can get lost in numbers and start to see patterns where none exist; relying solely on the data, thinking it to be self-explanatory, is naive. The techniques and tools employed may well make the interpretation subjective rather than objective, reinforcing certain viewpoints even when they do not hold. An example is the surprising correlation between the S&P 500 stock index and butter production in Bangladesh. This begs the question of whether a certain degree of personal skill or artistry is involved in data analysis. I personally haven’t come across such examples myself but would love to hear more if others in the class have.
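This “patterns from nothing” effect is easy to reproduce. Here is a minimal sketch, using entirely synthetic data and a plain-Python Pearson correlation: generate a couple hundred independent random walks and scan every pair, and some pair will correlate strongly purely by chance, just as the stock index and the butter production did.

```python
import random
import statistics

# Synthetic demonstration: among many mutually unrelated series,
# some pair will correlate strongly purely by chance.
random.seed(42)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 200 independent random-walk "indicators", 20 observations each
series = []
for _ in range(200):
    level, walk = 0.0, []
    for _ in range(20):
        level += random.gauss(0, 1)
        walk.append(level)
    series.append(walk)

# Scan all ~20,000 pairs for the strongest (coincidental) correlation
best_r, best_pair = 0.0, None
for i in range(len(series)):
    for j in range(i + 1, len(series)):
        r = abs(pearson(series[i], series[j]))
        if r > best_r:
            best_r, best_pair = r, (i, j)

print(f"strongest |r| among unrelated series: {best_r:.2f}")
```

The winning pair is pure coincidence, yet its correlation looks impressive; scanning enough dimensions of a big dataset guarantees such “discoveries”, which is exactly why a correlation found this way needs an out-of-sample check before it is believed.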

I like provocation #3 made by Boyd and Crawford: “Bigger data are not always better data”. It is all too easy to be convinced that a large enough dataset captures the true distribution, and being someone who works with machine learning quite often, I too am guilty of it. In actuality, the distribution might not be representative at all! If the methodology is flawed, suffers from unintended biases, or draws on a source that is inherently limited, then claims about behaviors and trends can be misleading at best. The Twitter “firehose” vs. “gardenhose” and “user” vs. “people” vs. “activity” examples given by Boyd and Crawford are perfect demonstrations of this. The consequences can be far-reaching, especially when learning algorithms are fed this biased or limited data: the decisions made by these systems will in turn reinforce the biases or provide flawed and limited insights, which is most certainly something we want to avoid! So the question becomes: how do we systematically find these biases and limitations in order to rectify them? Can all biases even be eliminated? This leads to an even more basic question: how does one even go about checking whether the data is representative?

Finally, I disagree (to a certain extent) with Boyd and Crawford’s discussion on ethics and the notion of big-data rich and poor. I would certainly like to discuss these two in class and know what other people think.


Reflection #1 – [1/18] – [Ashish Baghudana]

Boyd, Danah, and Kate Crawford. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15.5 (2012): 662-679.
Lazer, David, and Jason Radford. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43 (2017).

Summary

Computational social science is still in its nascent stages. Until recently, research in this domain was conducted by computationally minded social scientists or socially inclined computer scientists, the latter drawn in by the emergence of the Internet and social media platforms like Facebook, Twitter, and Reddit. The digitization of our social life presents itself as Big Data. The two papers present a view of Big Data, its collection and analysis, from a social sciences perspective. Social scientists have long relied on surveys to collect data, but had to continually question the biases in people’s answers. Both papers value Big Data’s contribution to studying the behavior of a populace without the need for surveys. The challenge in Big Data is oriented more towards organizing and categorizing “data” than towards its size. The main focus of these two papers, however, is less on Big Data itself than on the vulnerabilities of such an approach.

Danah Boyd’s paper raises six concerns about the use of Big Data in the social sciences. The authors argue that not all questions in social science can be answered with numbers: the whys need a deeper understanding of social behavior. More importantly, the mere scale of Big Data does not automatically make it more reliable; there may still be inherent biases present. Finally, the paper raises concerns about data handling and access: who owns digital data, and how limited access can create divides in society similar to those created by education.

Lazer’s paper raises almost identical vulnerabilities with Big Data. As he very aptly puts –

The scale of big data sets creates the illusion that they contain all relevant information on all relevant people.
Lazer repeatedly stresses that generalizability in social science research does not necessarily require scale, but rather more representative data. He also reiterates Boyd’s concern about research ethics on social platforms. Lazer ends with future trends for computational social science, including more generalized models, qualitative analysis of Big Data, and multimodal sources of data.

Reflections

As a student and researcher in computer science, it is important to remember constantly that computational social science is not an extension of data science or machine learning. We often fall into the trap of thinking about methods before questions. It is best to think of computational social science as a derivative of social science rather than of computer science.

I enjoyed reading Boyd’s provocations #3 and #4. Big Data is often heralded as a messiah that can answer important behavioral questions about society. Even where this is true, it will not be because of the scale of the data but because of the ability to process and analyse it. As researchers, it feels increasingly important to consider multiple sources and to ask broader questions in social science than ones like “will a user make a purchase?”. For this, one can’t merely look for patterns in data; one must study why the patterns exist. Are these patterns an artifact of the dataset in question, or are they reflective of true societal behavior? While the two papers mention a trend toward generalization, especially in machine learning, I also see a trend of increased specialization, where computer science methods have decreasing applicability beyond the narrow data samples they were built for.

Finally, a major concern about Big Data is privacy and ethics. Unlike with challenges in data analysis, this concern does not have any correct answers. Universities, labs and industries will have to work more closely with IRBs to develop a good framework for the governance of Big Data.

Questions

  • To re-iterate, what are the big questions in social science, and where do we draw a balance between quantitative and qualitative analysis?
  • While computational social science helps debunk incorrect beliefs about group behavior, can it truly help us understand the cause of such behavior?
  • Should we change societal behavior, if it were possible?
    • While predictive analysis is non-intrusive, what constitutes ethical behavior when social science has more intrusive effects?
  • Finally, does research in computational social science encourage large-scale surveillance?


Reflection #1 – [01/18] – [John Wenskovitch]

This pair of papers presents an overview of Big Data from a social science perspective, focused primarily on presenting the pitfalls, issues, and assumptions associated with using Big Data in studies.  The first paper that I read, Lazer and Radford’s “Data ex Machina,” provided a broader overview of Big Data topics, including data sources and opportunities provided by the use of Big Data, in addition to the issues (the paper refers to them as “vulnerabilities”).  The second paper, boyd and Crawford’s “Critical Questions for Big Data,” places greater emphasis on the issues through the use of six “provocations” to spark conversation.

One thing that stuck out to me immediately from both papers was the discussion regarding the definition (or lack thereof) of Big Data.  Both papers noted that the definition has shifted over time.  Even the most recent publications cited in those sections have differing criteria based on the size, the complexity, the variability, and even the mysticism surrounding certain data.  In a way, not having a fixed definition is a good thing due to advances in both computational power and algorithmic techniques.  If we were to limit Big Data to the terabyte or petabyte scale today, the term would be laughable in no more than a few decades.

The rest of this reflection focuses on the vulnerabilities sections of both papers, as I felt those to be the most interesting parts.  Do you agree?

To begin, I was interested in the idea of “Big Data hubris,” the belief that the size of the data can solve all problems.  I recall reading other articles on Big Data arguing that patterns can be exhaustively searched for, and correlations concluded to be meaningful, because we are dealing with population-scale rather than sample-scale data.  However, these two articles demolish that assertion.  As Lazer and Radford note, “big data often is a census of a particular, conveniently accessible social world.”  There are any number of reasons that a “population” of data isn’t actually a population: biases, missing values, and lack of context surrounding the data.  What are the best ways to prevent this problem to begin with; or in other words, how can we best ensure that our Big Data is actually representative of the whole population?
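A toy simulation makes the representativeness problem concrete. All numbers below are invented: the idea is simply that if the chance of appearing in a “big” dataset rises with the very trait being measured, a sample of a hundred thousand can be more misleading than a random sample of a thousand.

```python
import math
import random
import statistics

# Hypothetical scenario: estimate a population's mean "attitude score"
# from (a) a huge convenience sample and (b) a small random sample.
random.seed(0)

# True population: 200,000 people with an attitude score ~ N(0, 1)
population = [random.gauss(0, 1) for _ in range(200_000)]

def on_platform(score):
    # Inclusion probability rises with the score (logistic curve),
    # e.g. more opinionated people post more often.
    return random.random() < 1 / (1 + math.exp(-score))

# Big but biased "census" of whoever happens to be on the platform
big_sample = [s for s in population if on_platform(s)]

# Small but properly random sample of 1,000 people
small_sample = random.sample(population, 1_000)

print(f"true population mean:  {statistics.mean(population):+.3f}")
print(f"big biased sample:     {statistics.mean(big_sample):+.3f} "
      f"(n={len(big_sample):,})")
print(f"small random sample:   {statistics.mean(small_sample):+.3f} (n=1,000)")
```

Despite being roughly a hundred times larger, the convenience sample overshoots the true mean of zero substantially, while the small random sample lands close to it; size never compensates for a selection mechanism correlated with the outcome.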

That segues nicely into the second point that caught my attention, boyd and Crawford’s Provocation #4.  Whenever data is taken out of context and researchers lose sight of important contextual metadata, the data itself loses meaning.  As a result, the dimensions a researcher examines (“tie strength,” for example) can provide misleading information and therefore misleading conclusions about the data.  The metadata is just as important as the data itself when trying to find correlations and meaningful results.  As someone who works with data, I found it a bit depressing that this vulnerability even needed to be included in the discussion rather than being taken as a given.  However, I am also aware that situations exist in which data provenance is unknown, or is passed along inaccurately, or an employer simply asks an employee to analyze a dataset without asking questions about it.  These situations should all be avoided, but may not always be avoidable.  How can we best ensure that data and its provenance remain linked throughout the exploration and modeling processes?

Finally, backtracking to Provocation #2, there is a discussion about seeing patterns where none exist, simply because the size of the data permits patterns and correlations that could either be real or coincidental.  The example that boyd and Crawford give is a correlation between changes in the S&P 500 and butter production in Bangladesh, but a similar one that popped into my mind was the graph showing an inverse correlation between global sea-surface temperature and the number of pirates sailing those oceans.  That makes me wonder, what other correlations have others in the class seen within a dataset that they’ve been examining that were statistically proven but nonsensical in context?


Guidelines for Reading Reflections

In this seminar-style class, you will investigate several academic readings and write reflections in which you not only summarize the papers but also think about what additional questions they enable and how they are relevant to modern digital social environments. Give examples, talk about your experiences if any, and be creative.

These are intended to facilitate and assess understanding of the reading materials. Reading reflections should be within one page (roughly within 600 words if you are using 12pt font). You won’t be penalized if you write more, but being succinct is another great writing skill which you should aim to cultivate in this course.

You do not need to summarize the full paper, but you do need to reflect on what additional questions the work enables. Does it help you think about your next big project? What will that be? What other questions does the paper make you think of? What else is the paper not answering, or what do you find concerning or simply intriguing?

Again, this is an individual assignment and the work submitted should be written solely by you. Most importantly, a reader glancing at your reflection should be able to easily spot your questions, so use bold, italics, bullet points, or other means of highlighting them. Here is a great example of a reflection written by my colleague, Prof. Kurt Luther.

NOTE: For days when you have two paper assignments, you are free to either combine them and write one reflection, or keep them separate and write two. For grading purposes, you MUST submit a single post.

Title of your post: Reflection [#No] – [Date mm/dd] – [Your Full Name as it appears on Canvas].

Example: Reflection #1 – [1/18] – [Tanushree Mitra]
