Reflection #1 – [1/18] – [Deepika Kishore Mulchandani]

[1]. Danah Boyd & Kate Crawford (2012) Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon, Information, Communication & Society, 15:5, 662-679, DOI: 10.1080/1369118X.2012.678878

Summary:

In this paper, the authors describe big data as a ‘capacity to search, aggregate, and cross-reference large datasets’. They go on to argue that the emergence of the big data era must be handled critically, as the choices made now will influence the future. They also discuss the dangers this big data phase poses to privacy, along with other concerning factors. They then examine the assumptions and biases of the phenomenon in detail through six points. The first is that big data has changed the definition of knowledge. Another is that having a large amount of data does not necessarily mean the data is good. The authors use examples to explain the ethics of the research being done and the lack of regulating techniques and policies, emphasizing why these matter. They also discuss how access to this data is limited to a few organizations and the divide this creates.

[2]. D. Lazer and J. Radford, “Data ex Machina: Introduction to Big Data”, Annual Review of Sociology, vol. 43, no. 1, pp. 19-39, 2017.

Summary:

In this paper, the authors review the value of big data research and the work that needs to be done for big data in sociology. They first define big data and then discuss the following three big data sources:

  1. Digital Life: Online behavior from platforms like Facebook, Twitter, Google, etc.
  2. Digital Traces: Call Detail Records, which are records of an action rather than the action itself.
  3. Digitalized Life: Google Books, phones that identify proximity using Bluetooth.

The authors believe that the availability of these forms of data, along with the tools and techniques required to access them, gives sociologists the opportunity to answer both age-old and new questions. To this end, they describe the opportunities available to sociologists in the form of massive behavioral data, data obtained through nowcasting, data from natural and field experiments, and data on social systems. The authors then discuss the pros and cons of sampling the available big data. They also mention the vulnerabilities that exist, such as the sheer volume of the data, its generalizability, its platform dependence, the failing ideal-user assumption, and ethical issues in big data research. In conclusion, the authors mention a few future trends, knowledge of which will help sociologists succeed in big data research.

Reflections:

In [1], the authors ask various questions around the theme ‘Will big data and the research that surrounds it help society?’ I like the definition of big data as a ‘socio-technical’ phenomenon, and I like the thought provoked by the use of the term ‘mythology’ in the formal big data definition. The big data paradigm and its rise to fame do revolve, in part, around the belief that sheer volume provides new, true, and accurate insights. This raises the question: do we sometimes try to find or justify false trends just because we have big data? I also like the example with which they illustrate the platform dependence of social data. A person’s social network on Facebook may not be the same as on Twitter, by virtue of the fact that the data is different; the most basic reason is that a user may not be present on both sites. This raises another question: what about the population that is not on any social site? That chunk of the population is not considered in any of these studies. The fact that ease of access to data is sometimes valued over data quality raises further concerns. I also like that the authors address the quantitative nature of big data research and the importance of context, and I appreciate the section in which they discuss the control of this big data by a few organizations and the ‘big data rich’ and ‘big data poor’ divide it creates. This has to be considered to facilitate successful big data research.

In [2], I appreciate the definition of big data provided by the authors. Big data is indeed a mix of computer science tools and social science questions. The authors note that sociologists need to learn how to leverage the tools and techniques of computer scientists to make breakthroughs in their research. This makes for an excellent collaboration, where computer scientists leverage the questions and research expertise of social scientists, and social scientists leverage the tools and techniques developed for drawing insights from big data. I like the way the authors describe big data archives as depicting actual behavior “in principle”. Although there are instances of positive results from studying behavior with such data, the question that arises is: how genuine is this online behavior? Many factors play a role in these studies, and the biases present in the data have to be considered. If data from social networks is used, one of the most basic examples of bias is the ideal-user assumption highlighted in the paper; the veracity of the data has to be considered as well. Another important bias mentioned in the paper arises from incorrect sampling. I realize that samples drawn from big data can provide valuable insights. However, this raises the question: what methods can be applied to sample data without bias? I appreciate the many case-study examples the authors provide to support the points in their review. These provoke thought about the vulnerabilities and the work that has to be done to make big data research as ethical and methodical as possible.


Reflection #1 – [1/18] – MD MOMEN BHUIYAN

Paper #1: Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon
Paper #2: Data ex Machina: Introduction to Big Data

Summary #1:
This paper discusses what the phenomenon of big data entails from the perspective of socio-technical research. The authors describe the Big Data phenomenon as an interplay between technology, analysis, and mythology, where mythology represents the uncertainty of inferring knowledge from big data. The theme of the paper is six provocations about issues in big data: the definition of knowledge, the claim of objectivity, methodological issues, the importance of context, ethics, and the digital divide due to access restriction. Big data introduces new tools and research methods for transforming information, which in turn changes the definition of knowledge. While quantifying information in big data might seem to provide objective claims, it is still subjective observation that initiates them; it is the design decisions made in interpreting the data that introduce subjectivity and its biases. One caveat of interpreting big data is seeing correlation where none exists (p. 11). The methodological limitation of big data lies in its source and collection procedure: it is necessary to understand how much the data was filtered and whether it generalizes to the research domain. One solution to this problem is combining information from heterogeneous sources, which, however, also amplifies noise. Such heterogeneous combination is also helpful in modeling the big picture. For example, Facebook alone can provide a user’s relationship graph, but this is not necessarily the overall representation, because communication channels like social networks, email, and phone each provide a different view. So context is very important in describing such information. Big data also raises the question of ethics in research: the sheer volume of big data could provide enough information to de-anonymize individuals, so information should be published carefully to protect the privacy of the entities involved. Finally, accessibility of big data divides the research community into two groups, one of which is more privileged. Computational background creates a similar division among big data researchers.

Summary #2:
This paper is similar to the previous one in that it discusses many of the same issues. The authors first discuss the data sources for big data by dividing them into three typological categories: digital life, digital traces, and digitalized life. Digital life refers to digitally mediated interactions like tweeting and searching. Digital traces are records that indirectly provide information about digital life, like Call Detail Records. Finally, digitalized life represents the capture of a nondigital portion of life in digital form, like constant video recording at an ATM. There is also the possibility of collecting specific behaviors, like certain types of tweets or visits to certain webpages. These data provide several opportunities for behavioral research. Big data provides large datasets from different sources, and combining these sources yields important insights, as in the Copenhagen Network Study. Big data also provides cost-effective solutions for some studies, like unemployment detection or disease propagation. Effects of external changes, like hurricanes or price hikes, can be captured through big data. By covering underrepresented populations, big data can be used to study problems like PTSD and suicidal ideation. The vulnerabilities of big data include the problem of generalizability of hypotheses, heterogeneous sources, errors in the source systems, and the ideal user assumption. Research on big data raises ethical issues in both the acquisition and the publishing of data. Finally, recent big data trends include new sources of data, generic models for research, and qualitative approaches in big data research.

Reflection:
Both papers discuss issues and applications of big data in identifying social phenomena. This reflection focuses on the generalizability issue in big data. The authors suggest that combining multiple sources for validation can address the generalizability issue. This is interesting given that the deep learning community has found that model generalization can be achieved both with more data and with transfer learning. A similar approach could be used in finding social phenomena in big data. For example, data from Twitter can provide information about the spreading of rumors by people with certain attributes. Although Facebook is quite different from Twitter, it may be possible to use the hypothesis and results from Twitter to initialize a learning model to apply to Facebook (a toy sketch of this idea follows below). What do you think?
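Concretely, the warm start might look like the following minimal sketch (my own illustration, not from either paper). Everything here is a synthetic placeholder: the feature matrices, the rumor labels, and the assumption that both platforms share a feature representation.

```python
# Minimal transfer-learning sketch: train on plentiful (hypothetical)
# Twitter labels, then fine-tune on a small Facebook sample.
# All data below is randomly generated for illustration only.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Hypothetical source domain: 10,000 labeled Twitter posts, 50 features.
X_twitter = rng.normal(size=(10_000, 50))
y_twitter = rng.integers(0, 2, size=10_000)   # 1 = rumor, 0 = not

# Hypothetical target domain: only 200 labeled Facebook posts.
X_facebook = rng.normal(size=(200, 50))
y_facebook = rng.integers(0, 2, size=200)

clf = SGDClassifier(loss="log_loss", random_state=0)

# Fit on the large source domain first...
clf.partial_fit(X_twitter, y_twitter, classes=np.array([0, 1]))

# ...then continue training on the small target domain, so the
# Twitter-trained weights replace a random initialization.
for _ in range(5):
    clf.partial_fit(X_facebook, y_facebook)

print(clf.score(X_facebook, y_facebook))
```

Whether the Twitter-learned weights actually help on Facebook is exactly the generalizability question the papers raise; with real data one would compare this warm start against a model trained on the Facebook sample alone.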


Reflection #1 – [01/18] – [Vartan Kesiz Abnousi]

Danah Boyd & Kate Crawford: Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society. https://doi.org/10.1080/1369118X.2012.678878

David Lazer and Jason Radford: Data ex Machina: Introduction to Big Data. Annual Review of Sociology. https://doi.org/10.1146/annurev-soc-060116-053457

Summary

The articles aim to provide a critical analysis of “Big Data”. There is a brief historical account of how the term “Big Data” was born, and of its definition. The authors stress that Big Data is more than large datasets; it is about a capacity to search, aggregate, and cross-reference large data sets. They underline the great potential of Big Data in solving problems, but they also warn about the dangers it brings. One danger is that it could be perceived as a panacea for all research problems and reduce the significance of other fields, like the humanities. It is argued that the tools of Big Data do not necessarily offer an “objective” narration of reality, as some of its champions boast. For instance, samples of Big Data are often not representative of the entire population, so making generalizations about all groups of people solely on the basis of Big Data is erroneous. As the authors argue, “Big Data and whole data are also not the same”. It is not necessary to have more data to get better insight into a particular issue. In addition, there are many steps in the analysis of raw Big Data, such as “data cleaning”, where human judgement is required; as a result, the research outcome could be dramatically different because of the subjective choices made in the process of data analysis.

Concerns regarding privacy are also raised. Human subjects may neither be aware of nor have consented to the collection of their data, and what counts as private versus public information has become more obscure. State institutions might use the data to curtail individual civil liberties, a phenomenon known as Big Brother. A particularly important problem is that of research inequality, which takes numerous forms. For example, companies do not provide public research institutions with full access to the collected data; as a result, the privileged few within the companies who have complete access can find different, more accurate, results. In addition, those companies usually partner with, or provide access to their data to, specific elite universities, so the students of those schools gain a comparative advantage in their skills over the rest. This sharpens both social and research inequalities. The very definition of knowledge is changed: people now get large volumes of epidemiological data without even designing an experiment. As the authors argue, “it is changing the objects of knowledge”.

The authors also argue that Big Data is vast and heterogeneous. They classify the data into three sources: digital life, digital traces, and digitalized life. By digital life they refer to platforms such as Twitter, Facebook, and Wikipedia, where the behaviors are entirely online. The authors argue that these platforms can be viewed either as generalizable microcosms of society or as distinctive realms in which much of the human experience now resides. Digital traces include information collected from sources such as phone calls, while an example of digitalized life is video recordings of individuals.

Reflections

Both articles are very well written, and I agree with the points they raise. However, I am particularly cautious about the notion of viewing digital life as a microcosm of our society. Such a generalization is more than an abstract, subjective idea; it is rigorously defined in probability theory, and there are mathematical rules for determining whether a sample is representative or not. A famous example is the 1948 US presidential election, which Truman won even though all the election polls had predicted otherwise, because of sampling errors. I am also worried that some of these digital platforms bolster a form of herd behavior that renders individuals less rational. This herd behavior, studied by social scientists such as Freud and Jung, among many others, has been argued to be one of the causes of the rise of Fascism.

Finally, I have some questions that could develop into research ideas such as:

  1. Does the nature of a digital platform, e.g., Twitter, change an individual’s behavior? If yes, then how?
  2. Is the increasing polarization in the United States related to these digital platforms?
  3. Does digital anonymity alter someone’s behavior?
  4. Do people behave the same way across different digital platforms?
  5. Can we, as researchers, develop a methodology to render digital platforms and traces representative of the population?


Reflection #1 – [1/18] – [Meghendra Singh]

  1. Data ex Machina: Introduction to Big Data – Lazer et al.
  2. Critical Questions for Big Data – Boyd et al.

Both papers focus on Big Data and its application to social science research. I feel that Boyd et al. take a more critical approach towards Big Data in social science: after the initial introduction, they go on to discuss their six provocations about Big Data in the social media context. I find all of the questions the authors raise to be very relevant to the subject matter. Also, each of these questions can be seen as a potential research question in the domain of Big Data.

Big Data analysis might cause other, traditional approaches to analysis to be ignored, and this might not always be correct. We need to be wary of the fact that the ‘volume’ of the data does not imply it is the ‘right’ data: it might not represent the truth, might be biased, might not be neutral. Consider the example of inferring the political opinions in a geographic region based on the Facebook posts of users in that region. What if most of the Facebook users there belong to a particular demographic segment? What if most of them are inactive and only read posts by other users? How do we account for bots and fake profiles? If we are to draw any inference from such data, we need to account for these problems and for the fact that the digital world does not map one-to-one onto the real world. Potential research in this direction may include the development of techniques and metrics that measure the disparity between real-world data and social media data (a toy sketch of one such metric follows below). Techniques to bridge this divide between the real and digital worlds might also hold value for future research.
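As a minimal, hypothetical illustration of such a disparity metric (my own sketch, not from either paper), one could compare the demographic make-up of a platform’s users against a census-style population distribution; all the numbers below are invented.

```python
# Toy disparity metric: total variation distance between the age
# distribution of a (hypothetical) platform's users and that of the
# underlying population. 0 = identical distributions, 1 = disjoint.
import numpy as np

age_groups = ["18-29", "30-44", "45-64", "65+"]

# Made-up shares of each age group in the real population...
population = np.array([0.20, 0.25, 0.35, 0.20])
# ...and among the platform's active users (skewed toward the young).
platform = np.array([0.45, 0.30, 0.20, 0.05])

tv_distance = 0.5 * np.abs(population - platform).sum()
print(f"demographic disparity: {tv_distance:.2f}")  # 0.30 for these numbers
```

A metric like this only captures disparity along the dimensions we thought to measure; the harder research question is which dimensions actually matter for a given inference.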

Boyd et al. emphasize that the tools and systems used to create, curate, and analyze Big Data might impose restrictions on how much of it can be analyzed. These tools might also influence the temporal aspect of Big Data and the kinds of analysis that can be performed on it. For example, today’s dominant social media platforms like Twitter and Facebook offer poor archiving and search functions, which makes accessing older data nearly impossible and biases research efforts towards recent events like elections or natural disasters. The limitations of these platforms therefore bias the artifacts and research that stem from such analysis. As a consequence, entire subject areas, knowledge bases, and digital libraries might become biased over time because of the technological shortcomings of these platforms. This opens up a slew of research problems in the domain of Big Data and the social sciences. Can we measure the impact of these so-called ‘platform limitations’ on, say, the research topics that arise in Big Data mining? A more technological challenge is how to make the petabyte-scale data on these platforms accessible to a larger audience (and not just researchers from top-tier universities and the companies owning the platforms).

The authors also emphasize that design decisions made while analyzing Big Data, such as a specific ‘data cleaning’ process, might bias it and hence make it an unsuitable candidate for generating authentic insights. Interpreting data is also prone to spurious correlations and apophenia, and Big Data might magnify the opportunities for such misinterpretations to creep into our research; we should be especially wary of these aspects when handling Big Data. Then there are issues relating to anonymity and privacy. Boyd et al. also emphasize that data accessibility does not imply permission to use. Another important aspect discussed, which can also be a potential research question, is: how do we validate research done using Big Data when the data itself is inaccessible to the reviewers?

I felt that Lazer et al. take a more balanced approach while discussing the challenges associated with using Big Data for social science research: they discuss its pros and cons instead of focusing only on the problems. I liked that they describe various studies, like the Copenhagen Network Study and The Billion Prices Project, to illustrate specific problems associated with the use of Big Data in social science. The paper brings up specific facts, such as how little research in prominent sociology journals uses big data, and that most of the big data relevant to social science research is massive and passive. I also find their classification of big data sources into digital life, digital traces, and digitalized life comprehensive. There is a large overlap between the challenges discussed in the two papers, although the terminology differs. I really like that Lazer et al. present these challenges in a structured way, with each class of challenges (Generalizability; Too Many Big Data; Artifacts and Reactivity; the Ideal User Assumption) given a meaningful and distinct definition.


Reflection #1 – [01/18] – [Patrick Sullivan]

Boyd and Crawford look into the benefits and downfalls that come with the usage of Big Data. This is a worthy discussion because the public holds high expectations and makes wild assumptions about Big Data. While Big Data can lead to some previously unattainable answers, this means we should be more wary of its results: reaching those answers through other methods might be impossible, possibly leaving Big Data’s results both unverified and unverifiable, a very unscientific characteristic.

Issues that plague a discipline or research area are not a new concept. Statistics has sampling bias and Simpson’s paradox. Logic has the slippery slope and the false cause fallacies. Psychology has suggestion and confirmation bias as well as the hindsight, primacy, and recency effects. Big Data, however, has method and misinterpretation errors that can be compounded with nearly all of the issues listed above. This leads to huge problems whenever the purely quantitative appearance of Big Data is championed and accepted without closer investigation.

There might be ways of combating this if Big Data can adopt the same defenses other disciplines use. Extensive training of the individuals at the forefront to challenge and counteract fallacies can be seen in Logic, through lawyers and judges. Statistics has developed data reporting standards that either avoid or reveal issues, as well as explicitly reporting the uncertainty and precision of measurements. Psychology actually integrates what could be a pitfall into its experimental design, for example when using the placebo effect or hiring actors to test social situations, and then shows how the results change compared to a control group. Big Data researchers should adopt these defenses or invent new ones to give more authority to their assertions.

Lazer and Radford support many of these same concerns, but also point out a more recent change: intelligent hostile actors. This is one of the largest threats to Big Data research, since it is a counteracting force that naturally evolves both to survive and to do more damage. As bots and manipulation cause more destruction and chaos, every piece of Big Data research built on that data becomes less trustworthy. Interestingly, positive outcomes can come from simply revealing the presence of hostile actors within Big Data sources: this calls into question the validity of findings that may previously have been undoubted, thanks to the tendency of quantitative results to be viewed as objective and factual.

Questions:

  • Should Big Data be more publicly scrutinized for hostile actors’ data manipulation in order to keep expectations more realistic?
  • Should Big Data research findings be automatically assigned more doubt since misleading or misunderstood results can be so well hidden behind a veil of objectivity?
  • Would more skepticism towards Big Data research slow or damage the research efforts until it causes a net negative impact on society? Could we find this limit?


Reflection #1 – [1/18] – [Jiameng Pu]

  • D. Boyd and K. Crawford. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.
  • D. Lazer and J. Radford. Data ex Machina: Introduction to Big Data.

Summary:

These two papers summarize and discuss critical questions about big data, ranging from its definition to societal controversy, and demonstrate that researchers in social science need to consider many easily ignored questions before conducting their research. In Critical Questions for Big Data, the authors summarize six crucial points regarding big data. 1) Big data produces a new understanding of knowledge. 2) All researchers are interpreters of data; where is the line between being subjective and being objective? 3) Big quantities of data do not necessarily mean better data, because data sources and data size tend to offer researchers a skewed view. 4) The context of big data cannot be neglected; for example, today’s social networks do not necessarily reflect sociograms and kinship networks among people. 5) Research ethics must always be considered when using data. 6) Access to data is not equal across research groups, which creates a digital divide in the research community. In Data ex Machina, the authors explicitly lay out the definition, sources, opportunities, and vulnerabilities of big data. By reviewing particular projects in the literature, e.g., The Billion Prices Project, EventRegistry, and GDELT, it offers a convincing view of the existing problems in big data. For example, it examines three aspects, i.e., data generation, data sources, and data validation, to illustrate the vulnerabilities in big data research. The authors conclude by discussing future trends: more data in expected standard forms, and newly enabled research approaches.

Reflection:

In social science, researchers utilize large amounts of data from different platforms and analyze it to prove their assumptions or explore potential laws. Just as Fordism produced a new understanding of labor and the human relationships at work in the twentieth century, big data today changes people’s understanding of knowledge and of human networks and communities. These two papers cover many viewpoints I had never considered, even though I already knew about big data and had done some simple related tasks. Instances like the “firehose” and bots on social media trigger my interest in how to improve the scientific environment around big data. They also prompt readers to think in depth, and with a dialectical perspective, about the research data they are using. Data collection and preprocessing are more basic and critical than I had ever thought. Is quantity bound to represent objectivity? Can data in large numbers give us the whole data we need to analyze in our specific context? Are data platforms themselves unbiased? The truth is that there are data controllers in the world, i.e., some authorities, organizations, and companies have the power to control data subjectivity and accessibility; there are data interpreters, since all researchers can be considered interpreters in some way; and there are booming data platforms and sources for researchers to choose from.

In general, the papers enlighten me about big data in the context of social science in two ways. 1) Researchers should always avoid using data in a way that would obviously affect the rigor of research, e.g., using one specific platform like Twitter to analyze kinship networks; it is necessary to step outside individual subjectivity to interpret data. 2) Both organizations and researchers should put effort into constructing a healthy and harmonious big-data community, improving the accessibility and validation of data and formulating scientific usage standards and sampling frames for big data. Authorities, networks, and individuals alike should dedicate themselves to work that can benefit the whole big data community. In this way, scientific researchers will have more faith and courage to face the coming era of big data, with more challenges but also more valuable knowledge.

Questions:

  1. What’s the definition of knowledge in the twentieth century? How about now?
  2. How can we analyze people’s relationship networks without distortion? How many data platforms do we have to use? E.g., email, Twitter, Facebook, Instagram… what combination is the best choice?
  3. To what extent do we have to consider the vulnerabilities of accessible data? For example, if we can use currently available datasets to solve a practical problem, we may be able to ignore some of the vulnerabilities and limitations a little.
  4. How much can systematic sampling frames help us in analyzing a specific assumption?
  5. What are the uppermost questions for researchers to think when collecting/processing data?
  6. What are the situations that would be best to avoid for researchers when collecting/preprocessing data?


Reflection #1 – [1/18] – [Aparna Gupta]

Summary:

Both papers talk about what Big Data means in today’s world and how its definition changes with the field being studied. They focus on the vulnerabilities and pitfalls of Big Data and on how combining data from various sources poses a challenge to researchers in every field. They also present the enormous institutional challenges to sociology. The first paper, ‘Critical Questions for Big Data’, focuses more on the significant questions emerging in the Big Data field and on six provocations meant to spark conversations about its issues. The second paper, ‘Data ex Machina’, focuses more on the evolution of Big Data and its vulnerability issues.

Reflection:

The part I liked most is how Big Data has evolved over time and how it has proliferated across society. Its manifestation varies by field of study: in astronomy it takes the form of images, in the humanities it can take the form of digitized books, and in social science it takes the form of data from social media websites like Twitter and Facebook.

What caught my interest is how data from various social media platforms can be combined and applied in a field like sociology to analyze and understand human behavior, as well as the transition to digital life and how the entire Internet may be viewed as an enormous platform for human expression. However, I wonder how accurately what happens on these platforms (tweets and posts, for instance) can be used in the social sciences to infer human experiences.

What I found troubling was ‘The Ideal User Assumption’ explained in Lazer’s article, which raises concern about how data assumed to come from specific, unique people can be fake, and about how various organizations and government agencies use bots to achieve surreptitious goals, which further poses a validity threat. According to Xinhua (2016), the Chinese peer-to-peer lending platform Ezubao was revealed to be a Ponzi scheme in which 95% of the proposals were fake. Committees like IRBs exist to ensure ethics in particular lines of research inquiry (like human-subjects research), but such oversight does not straightforwardly extend to big data.

Questions:

  1. How can we ensure data integrity across various social media platforms?
  2. Reiterating, who is responsible for making certain that individuals and communities are not hurt by the research process?
  3. How can we claim what our data represents?
  • Does it represent a group of individuals or just an individual?
  • Can insights drawn from data collected on an individual be generalized, especially in the field of social science?


Reflection #1 – [1/18] – [Pratik Anand]

First paper: Critical Questions for Big Data

Second paper: Data ex Machina: Introduction to Big Data

Summary

Both papers start with an introduction to big data. One of the main points raised is the changing definition of big data: earlier it was about the size of the data, now it is about searching, aggregating, and processing it. The first paper jumps directly into a critical analysis of the big data phenomenon. The second paper, on the other hand, takes a more comprehensive and structured approach. Both papers discuss the sources of data acquisition, the emergence of entirely new kinds of digital data, data privacy concerns, analysis, and conclusions. They show how research is changing because of big data and how an abundance of gathered data might not truly represent the whole problem sample; even conclusions drawn from balanced, representative data might be subjective and biased. The mention of “Big Data Hubris” shows that a large volume of data does not guarantee better results. While the first paper continues its critical analysis of big data, the second paper sketches future trends in which data will grow in volume and platform diversity and more generic data models will take over.

Reflection

The first paper does a really good job of raising important questions related to big data, starting with its definition. The second paper is more of an introduction and provides only a high-level view of the issues. For an initial reading, the second paper can be used for an overview of the big data landscape, followed by the first paper for discussions of its most important issues. Issues like privacy and unethical data collection are central to this debate, as individuals, corporations, or even governments can easily misuse the data. The contrast between the two papers is quite evident: the second paper describes the problems the big data field faces today and its future trends, whereas the first paper examines the problems caused by big data and critically analyzes its aspects. Aspects like uneven access to digital data and poorly understood analysis results can have ramifications for large sections of society, and the first paper raises the right kind of questions for civil discussion.

Questions

1) Since the volume of data generated will continue to grow, how can governments ensure its protection and ethical treatment?
2) Who should actually own digital footprint data: the individual, or the respective companies collecting it?
3) Is the data-driven approach the right approach for all kinds of social problems? Will it lead to less focus on areas where it is inherently difficult to generate large volumes of data?


Reflection #1 – [01/18] – [Jamal A. Khan]

Both of the assigned papers:

  • “Critical Questions for Big Data” and
  • “Data ex Machina”,

present a social-science perspective on Big Data and how it affects, or can and will affect, the social sciences. The former, “Critical Questions for Big Data”, focuses more on the issues big data might raise, which the authors have put into six buckets; the latter offers a more general overview of what big data is, its sources, and the resulting opportunities.

From both Boyd & Crawford’s and Lazer & Radford’s descriptions, I took away that big data is less about size or volume and more about being insightful. It is named so because given the humongous scale of data, we now have enough computational power and good enough tools to be able to draw meaningful insights that are closer to the ground truth than ever before.

I find Lazer & Radford’s mention of passive instrumentation quite appealing.  As compared to techniques like self-reporting and surveys which may include the subjects’ inherent biases or voluntary lies, big-data offers the opportunity to passively collect data and generate insights that are closer the ground truth of observed social phenomena. Borrowing words from the authors themselves “First, it acts as an alternative method of estimating social phenomena, illuminating problems in traditional methods. Second, it can act as a measure of something that is otherwise unmeasured or whose measure may be disputed. Finally, now casting demonstrates big data’s potential to generate measures of more phenomena, with better geographic and temporal granularity, much more cost-effectively than traditional methods”.

For the rest of the reflection, I’ll focus on the issues raised by big data, as they not only seem more involved, interesting, and challenging to tackle, but also raise more questions.

An interesting point Boyd and Crawford make is that big data changes how we look at information and may very well change the definition of knowledge. They go on to suggest that we should not be dismissive of older disciplines such as philosophy, because they might offer insights that the numbers miss. However, they fail to give a convincing argument for why this should be so. If we are observing certain patterns of interaction or certain trends in populace behavior, is it really necessary to rely on philosophy to explain the phenomena, or can the reason also be mined from the data itself?

Another issue that caught my attention in both articles is that one can get lost in numbers and start to see patterns where none exist; solely relying on the data, thinking it to be self-explanatory, is naive. The techniques and tools employed may well make the interpretation subjective rather than objective, which may reinforce certain viewpoints even though they have no real basis. An example is the surprising correlation between the S&P 500 stock index and butter production in Bangladesh (a quick numerical demonstration of how easily such correlations arise follows below). This begs the question of whether there is a certain degree of personal skill or artistry involved in data analysis. I personally haven’t come across such examples myself, but would love to know more if others in the class have.
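To see how cheaply such “patterns” come, here is a small demonstration of my own (not from the papers): two completely independent random walks, standing in for, say, a stock index and butter production, frequently show a large Pearson correlation.

```python
# Count how often two independent random walks look strongly
# "correlated" purely by chance. No real data involved.
import numpy as np

rng = np.random.default_rng(7)
trials, length = 1_000, 500
strongly_correlated = 0

for _ in range(trials):
    walk_a = np.cumsum(rng.normal(size=length))  # independent walk #1
    walk_b = np.cumsum(rng.normal(size=length))  # independent walk #2
    if abs(np.corrcoef(walk_a, walk_b)[0, 1]) > 0.5:
        strongly_correlated += 1

print(f"{strongly_correlated / trials:.0%} of pairs have |r| > 0.5")
```

Running this prints a substantial fraction, even though every pair of series is independent by construction, which is exactly the apophenia trap both papers warn about.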

I like provocation #3 made by Boyd and Crawford: “Bigger data are not always better data”. It is far too common and easy to be convinced that a large enough dataset captures the true distribution, and being someone who works with machine learning quite often, I too am guilty of it. In actuality, the distribution might not be representative at all! If the methodology is flawed, suffers from unintended biases, or draws on a source that is inherently limited, then making claims about behaviors and trends can be misleading at best. The Twitter “firehose” vs. “gardenhose” and “users” vs. “people” vs. “activity” examples given by Boyd and Crawford are a perfect demonstration of this. The consequences can be far-reaching, especially when learning algorithms are fed this biased or limited data: the decisions such systems make will in turn reinforce the biases, or provide flawed and limited insights, which is most certainly something we want to avoid! So the question becomes: how do we systematically find these biases and limitations in order to rectify them? Can all biases even be eliminated? This leads to an even more basic question: how does one even go about checking whether the data is representative or not? (A simple sketch of one standard check follows below.)
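One standard starting point, offered here as my own sketch rather than anything from the papers, is to compare the sample’s composition against known population shares with a goodness-of-fit test; all counts and shares below are invented.

```python
# Chi-square goodness-of-fit check: does the age make-up of a
# (hypothetical) social media sample match census-style shares?
import numpy as np
from scipy.stats import chisquare

# Observed counts per age group in a made-up sample of 1,000 users.
observed = np.array([620, 250, 100, 30])

# Expected counts if the sample mirrored the population's shares.
census_shares = np.array([0.20, 0.25, 0.35, 0.20])
expected = census_shares * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")
# A tiny p-value means the sample's make-up differs significantly from
# the population's, i.e., evidence of non-representativeness. Matching
# marginals alone, of course, do not prove the sample is representative.
```

This only tests the dimensions we thought to measure, which is precisely why the “users vs. people vs. activity” distinction is so hard to close completely.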

Finally, I disagree, to a certain extent, with Boyd and Crawford’s discussion of ethics and with the notion of the big-data rich and poor. I would certainly like to discuss these two points in class and hear what other people think.


Reflection #1 – [1/18] – [Ashish Baghudana]

Boyd, Danah, and Kate Crawford. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.” Information, Communication & Society 15.5 (2012): 662-679.
Lazer, David, and Jason Radford. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43 (2017): 19-39.

Summary

Computational social science is still in its nascent stages. Until recently, research in this domain was conducted by computationally minded social scientists or socially inclined computer scientists, the latter driven by the emergence of the Internet and social media platforms like Facebook, Twitter, and Reddit. The digitization of our social life presents itself as Big Data. The two papers present a view of Big Data, its collection and analysis, from a social sciences perspective. Social scientists have long relied on surveys to collect data, but have had to continually question the biases in people’s answers; both papers value Big Data’s contribution to studying the behavior of a populace without the need for surveys. The challenge in Big Data is oriented more towards organizing and categorizing the “data” than towards its size. The main focus of these two papers, however, is less on Big Data itself than on the vulnerabilities of such an approach.

Danah Boyd’s paper raises six concerns about the use of Big Data in social sciences.  They argue that not all questions in social science can be answered using numbers – the whys need a deeper understanding of social behavior. More importantly, the mere scale of Big Data does not automatically make it more reliable – there still might be inherent biases present. Finally, it raises concerns about data handling and access – who owns digital data and how limited access can create similar divides in society as education.

Lazer’s paper raises almost identical vulnerabilities with Big Data. As he very aptly puts –

The scale of big data sets creates the illusion that they contain all relevant information on all relevant people.
Lazer repeatedly stresses that generalizability of social science research does not necessarily need scale, but more representative data. He also iterates Boyd’s concern about research ethics on social platforms. Lazer ends with future trends for computational social science that include – more generalized models, qualitative analysis of Big Data and multimodal sources of data.

Reflections

As a student and researcher in computer science, it is important to constantly remember that computational social science is not an extension of data science or machine learning. We often fall into the trap of thinking about methods before questions. It is best to think of computational social science as a derivative of social science rather than of computer science.

I enjoyed reading Boyd’s provocations #3 and #4. Big Data is often heralded as a messiah that can answer important behavioral questions about the society. Even if this were true, this will not be because of the scale of the data, but the ability to process and analyse this data. As researchers, it feels increasingly important to consider multiple sources and ask broader questions in social science than ones like – “will a user make a purchase?”. For this, one can’t merely look for patterns in data, but study why the patterns exist. Are these patterns because of the dataset in question, or is it reflective of true societal behavior? While the two papers mention a trend of generalization, especially in the machine learning field, I also see a trend where there is increased specialization. Methods in computer science have decreased applicability to a large enough data sample.

Finally, a major concern about Big Data is privacy and ethics. Unlike with challenges in data analysis, this concern does not have any correct answers. Universities, labs and industries will have to work more closely with IRBs to develop a good framework for the governance of Big Data.

Questions

  • To re-iterate, what are the big questions in social science, and where do we draw a balance between quantitative and qualitative analysis?
  • While computational social science helps debunk incorrect beliefs about group behavior, can it truly help us understand the causes of such behavior?
  • Should we change societal behavior, if it were possible?
    • While predictive analysis is non-intrusive, what constitutes ethical behavior when social science has more intrusive effects?
  • Finally, does research in computational social science encourage large-scale surveillance?
