Reflection #2 – [1/23] – [Jamal A. Khan]

Well, let me start out by saying that this was a fun read, and the very first takeaway is that I now know how to “take away money” from people!

Anyhow, moving on to a more serious note: since the prime motive of the paper is to analyze the language used to pitch products/ideas, and since videos (or their content) are a good indicator of funded vs. not funded, what effect does the implicit racial bias of the crowd have? More concretely:

  • What effect does the race of both the person pitching and the crowd have?
  • Do people tend to fund people of the same racial group more?

Another aspect that I would like to investigate is the crowd itself: what do its statistics look like for each funded project, and how do they vary across projects? Can we find some trend there?

The paper more or less gives evidence for the intuitive insights, both from the literature and from common sense. For example, people/contributors don’t stand to make a profit or reap monetary benefit from the project, but given some form of “reciprocation” there’s added incentive for them to contribute beyond simply liking the project. Sometimes this takes the form of something tangible like a free t-shirt, and at other times it’s merely a mention in the credits, but the important point is that people are getting something in return for their funding. Another prominent one is “scarcity”, i.e. the desire to have something that is unique and limited to only a few people. Tapping into that emotion of exclusivity and adding in personalization is a good way to secure some funding.

However, not all is well! As some other people also noticed, there are some spurious phrases in Tables 3 and 4 that seem like they should belong to the other category, e.g.:

  • “trash” was in funded with β = 2.75
  • “reusable” was in not funded with β = -2.53

There were also some phrases which made no sense in either category, e.g. “girl and” was in funded with β = 2.0? I suspect that this highlights a flaw, or a poor choice of classifier. What would be a better classifier? Something like word embeddings, where the embeddings can be ranked?

Moving on to the model summaries provided:

It’s quite evident that the phrases provide quite a big boost in terms of capturing the distribution of the dataset, so this makes me wonder: how would a phrases-only model perform? My guess is that its performance should be closer to the phrases + controls model than to the controls-only model. Though I’m going off on a tangent, let’s say we don’t use logistic regression and opt for something a bit more advanced, e.g. sequence models or LSTMs, to predict the outcome; would that model turn out to be better than the phrases + controls model? Also, will this model stand the test of time? That is, as language or marketing trends evolve, will this model hold true, say, 6-10 years from now? Since the paper is from 2014 and the data from 2012-2014, does the model hold true right now?

Another thing that the authors mentioned and that caught my attention is the use of social media platforms, and it raised quite a few questions:

  • How does linking to Facebook affect funding? Does it build more trust among backers because it provides a vague sense of legitimacy?
  • Does the choice of social media platform matter, i.e. Facebook vs. Instagram?
  • Does the language of the posts have similar semantics, or is it more clickbait-ish?
  • What effect does the frequency of posts have?
  • Does the messaging service of Facebook pages help convince wary people to contribute?

This might make for a good term project.

I would also like to raise a few technical questions regarding the techniques used in the paper:

  • Why penalized logistic regression? Why not more modern deep learning techniques, or even other statistical models, e.g. multi-kernel-based Naïve Bayes or SVMs?
  • What is penalized in penalized logistic regression; does it refer to a regularizer added to the RSS or to the likelihood?
  • I understand that Lasso results in automatic feature selection, but a comparison with other shrinkage/regularization techniques is missing. Hence, the choice of regularization method seems more forced than justified. (See the sketch below.)
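Since my second question above is about what exactly gets penalized, here is a minimal, hedged sketch of an L1-penalized (lasso) logistic regression in Python. This is not the authors’ cv.glmnet pipeline; the data is synthetic, and the scikit-learn setup is just one way to illustrate where the penalty enters and why it zeroes out coefficients:

    # Minimal sketch of L1-penalized (lasso) logistic regression, analogous in
    # spirit to the paper's cv.glmnet model but not the authors' actual code.
    # The penalty is added to the negative log-likelihood, not to an RSS term:
    #     minimize  -loglik(beta) + lambda * sum_j |beta_j|
    # The L1 term drives many coefficients to exactly zero, which is the
    # "automatic feature selection" behavior mentioned above.
    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 200))   # stand-in for phrase/control features
    y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

    model = LogisticRegressionCV(
        penalty="l1",
        solver="liblinear",   # liblinear (or saga) supports the L1 penalty
        Cs=10,                # grid of inverse regularization strengths (1/lambda)
        cv=10,                # 10-fold cross-validation, as in the paper
    )
    model.fit(X, y)
    print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))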

Finally, and certainly most importantly, I’m glad that this paper recognizes that:

“Cats are the Overlords of the vast Internets, loved and praised by all and now boosters of Kick Starter Fundings”

Reflection #2 – [1/23] – [Ashish Baghudana]

Mitra, Tanushree, and Eric Gilbert. “The language that gets people to give: Phrases that predict success on Kickstarter.” Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. ACM, 2014.

Summary

The language that gets people to give is a fascinating study that attempts to answer two questions: can we predict which crowdfunding campaigns get funded, and what features of a campaign determine its success? Analyzing over 45K Kickstarter campaigns, Mitra et al. build a penalized regression model with 59 control features such as project goal, duration, number of pledge levels, etc. Using this as a baseline, they build another model with textual features extracted from the project description. To ensure generalizability, they only chose words and phrases that appear in all 13 categories of Kickstarter campaigns. The control-only model has an error rate of roughly 17%. The use of language features (~20K phrases) reduces the error rate to 2.24%, an improvement that is unlikely to be due to chance. The paper then relates the top features for both the funded and not-funded cases to social psychology and the theories of persuasion. Campaigns that display reciprocity (the tendency to return a favor), scarcity (limited availability of the product), social proof (others like the product too), authority (an expert designing or praising the product), or positive sentiment (how positive the description is) tend to be funded more.
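To make that modeling setup concrete, here is a hedged sketch (not the authors’ actual pipeline) of how control variables and phrase counts could be combined in a single penalized logistic regression. The column names, the two toy campaigns, and the scikit-learn choices are illustrative stand-ins; the real model used 59 controls, roughly 20K phrases, and 10-fold cross-validation:

    import pandas as pd
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Two toy campaigns; the real dataset has ~45K rows and 59 control columns.
    campaigns = pd.DataFrame({
        "description": ["back us and get a reusable bottle as a thank you",
                        "our campaign will help us finish the project"],
        "goal": [5000, 20000],
        "duration_days": [30, 60],
        "num_pledge_levels": [8, 3],
        "funded": [1, 0],
    })

    # Unigrams through trigrams, mirroring the paper's use of short phrases.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    phrase_features = vectorizer.fit_transform(campaigns["description"])
    control_features = campaigns[["goal", "duration_days", "num_pledge_levels"]].values

    X = hstack([control_features, phrase_features])
    y = campaigns["funded"]

    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, y)

    # Rank phrases by coefficient, loosely analogous to the paper's Tables 3 and 4.
    n_controls = control_features.shape[1]
    phrase_coefs = clf.coef_[0][n_controls:]
    vocab = vectorizer.get_feature_names_out()
    top = sorted(zip(vocab, phrase_coefs), key=lambda t: t[1], reverse=True)[:5]
    print(top)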

Reflection

An exciting aspect of this paper is the marriage of social psychology, statistical modeling, and natural language processing. The authors address a challenging question about what features, language or otherwise, encourage users to invest in a campaign. The paper borrows heavily from theories of persuasion to describe the effects of certain linguistic features. While project features like the number of pledge levels are positively correlated with increased chances of funding, I was surprised to see phrases such as “used in a” or “project will be” influencing successful funding. I am equally interested in how these phrases relate to specific aspects of persuasion – in this case, reciprocity and liking/authority. The same phrases can be used in different contexts to imply different meanings. I am curious to know whether the subjectivity index [1] of project descriptions makes any contribution to a fund or no-fund decision.

I would expect that another important aspect of successful campaigns would be the usefulness of a product to the average user. While this is hard to measure objectively, I was surprised to find no reference to this in any of the top predictors. Substantial research in sales and marketing seems to indicate a growing emphasis on product design for successful marketing campaigns [2].

A final aspect that I find intriguing is the deliberate choice of treating all products on Kickstarter equally. How valid is this assumption when one considers funding a documentary vs. earphones? It is likely that one would focus much more on content and vivid descriptions, while the other would focus more on technical features and benchmarks.

The paper throws open the entire field of social psychology and offers a great starting point for me to read and understand the interplay of psychology and linguistics.

Questions

  • Do different categories of campaigns experience different funding patterns?
    • Are certain types of projects more likely to be funded as compared to others?
  • While social psychology is an important aspect of successful campaigns, perhaps it makes sense only in conjunction with what the product really is?

[1] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

[2] Why the Product is the Most Important Part of the Marketing Mix. http://bxtvisuals.com/product-important-part-marketing-mix/


Reflection #2 – [01/23] – [John Wenskovitch]

This paper examined a dataset of more than 45,000 Kickstarter projects to determine what properties make a successful Kickstarter campaign (in this case, defining success as driving sufficient crowd financial interest to meet the project’s funding goal).  More specifically, the authors used both quantitative control variables (e.g., project duration, video present, number of updates) as predictors, as well as scraping the language in each project’s home page for common phrases.  By combining both components, the authors created a penalized logistic regression model that could predict whether or not a project would be successfully funded with only a 2.24% error rate.  The authors extended their discussion of the phrases to common persuasive techniques from literature such as reciprocity and scarcity to better explain the persuasive power of some phrases.

I thought that one of the most useful parts of this paper, relative to the upcoming course project, was the collection of descriptions and uses of the tools used by the authors. Should my group’s course project attempt something similar, it is nice to know about the existence of tools such as Beautiful Soup, cv.glmnet, LIWC, and Many Eyes for data collection, preprocessing, analysis, and presentation. Other techniques such as Bonferroni correction and data repositories like the Google 1T corpus could also come in handy, and it is nice to know that they exist. Has anyone else in the class ever used any of these tools? Are they straightforward and user-friendly, or a nightmare to work with?
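For anyone who has not used Beautiful Soup, here is a minimal sketch of the kind of scraping step the authors describe. The URL and the CSS class below are hypothetical placeholders rather than Kickstarter’s actual markup, and any real scraping would need to respect the site’s terms of service:

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical project URL; the real page structure would need to be inspected.
    url = "https://www.kickstarter.com/projects/example/example-project"
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Pull the visible text of the project description (the class name is a guess).
    description_div = soup.find("div", class_="full-description")
    if description_div is not None:
        description_text = description_div.get_text(separator=" ", strip=True)
        print(description_text[:200])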

The authors aimed to find phrases that were common across all Kickstarter projects, and so they eliminated phrases that did not appear in all 13 project categories.  As a result, phrases such as “game credits” and “our menu” were removed from the Games and Food categories respectively.  I can certainly understand this approach for an initial study into Kickstarter funding phraseology, but I would be curious to see if any of these specific phrases (or lack of them) were strong predictors of funding within each category.  I would speculate that a lack of phrases related to menus would be harmful to a funding goal in the Food category.  There might even be some common predictors that are shared across a subset of the 13 project categories; it would be interesting to see if phrases in the Film & Video and Photography categories were shared, or Music and Dance for another example.  How do you think some of the results from this study might have changed if the filtering steps were more or less restrictive?
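As a side note, the cross-category filter itself is simple to express. Below is a hedged sketch, on toy data rather than the authors’ code, of keeping only phrases that occur in projects from every one of the 13 categories:

    from collections import defaultdict

    NUM_CATEGORIES = 13

    # (category, phrase) occurrences extracted from project descriptions (toy data).
    project_phrases = [
        ("Games", "game credits"),
        ("Games", "thank you"),
        ("Food", "our menu"),
        ("Food", "thank you"),
        # ... one entry per phrase occurrence, across all projects and categories
    ]

    categories_seen = defaultdict(set)
    for category, phrase in project_phrases:
        categories_seen[phrase].add(category)

    # A phrase survives only if it appears in all 13 categories, so category-specific
    # phrases like "game credits" and "our menu" are dropped, as in the paper.
    kept_phrases = {phrase for phrase, cats in categories_seen.items()
                    if len(cats) == NUM_CATEGORIES}
    print(kept_phrases)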

Even after taking machine learning and data analytics classes, I still treat the outputs of many machine learning models as computational magic.  As I glanced through Tables 3 and 4, a number of phrases surprised me in each group.  For example, the phrase “trash” was an indicator that a project was more likely to be funded, while “hand made by” was an indicator that a project would not be funded.  I would have expected each of these phrases to fall into the other group.  Further, I noted that very similar phrases also existed across categories:  “funding will help” indicated funding, whereas “campaign will help” indicated non-funding.  Did anyone else notice unexpected phrases that intuitively felt like they were placed in the wrong group?  Does the principle of keeping data in context that we discussed last week come into play here?  Similarly, I thought that the Authority persuasive argument went counter to my own feelings.  I would tend to view phrases like “project will be” as cocky and therefore would have a negative reaction to them, rather than treating them as expert opinions.  Of course, that’s just my own view, and I’d have to read the referenced works to better understand the argument in the other direction.

I suspect that this paper didn’t get as much attention as Google Flu Trends (no offense, professor), but I’m curious to know if the phrasing in Kickstarter projects changed after this work was published.  Perhaps this could be an interesting follow-up study; have Kickstarter creators become more likely to use phrases that indicated funding and less likely to use phrases that indicated non-funding after the paper and datasets were released?  Another interesting follow-up study was hinted at in the Future Work section.  Since Kickstarter projects can be tied to Facebook, and because “Facebook Connected” was a positive predictor of a project being funded, a researcher could explore the methods by which these Kickstarter projects are disseminated via social media.  Are projects more likely to be funded based on number of posts?  Quality of posts (or phrasing in posts)?  The number of Facebook profiles that see a post related to the project?  That interact with a post related to the project?


Reflection #2 – [1/23] – [Pratik Anand]

The paper poses an interesting research question: how much does the success of a Kickstarter campaign depend on the campaign’s presentation, pitch, and other factors which have no relation to the product itself? It is interesting because, unlike other kinds of media such as advertisements, the direct impact of such influences can be measured in terms of donations to Kickstarter projects.

Tanushree Mitra et al. list a number of factors which influence the viewers, positively or negatively. These factors, or control variables, are: project goal, project duration, whether a video or animation is used for the pitch, category of the product, Facebook connectivity, etc. The impact of a video or an animation is well understood, as they provide the information in a short amount of time and keep viewers engaged compared to a large block of text. Project duration is also a key factor. I can understand why a longer project duration is seen negatively and why such projects are less likely to reach their funding goal. Viewers have little interest in, or trust about, paying for a product whose result they may only see after a long time. Products which take longer to develop are tell-tale signs of complexity and can lead to disastrous failures. Such a trust deficit can only be offset by strong brands, which Kickstarter creators usually don’t have.

Tanushree Mitra et al. built a logistic regression model for predicting the success of Kickstarter campaigns with these control variables. It resulted in a 17.03% error rate under 10-fold cross-validation.
The authors then factor in the phrases used in the Kickstarter campaigns, and the error rate drops to 2.24%, which shows a strong correlation between the language of the pitch and the success of the product. They try to explain the phrases as triggers for one of these phenomena: Reciprocity, Scarcity, Social Proof, Social Identity, Liking, and Authority.
Many of these phenomena, like scarcity, social proof, social identity, and authority, are well-studied psychological phenomena, especially in the retail and entertainment industries, which employ all kinds of techniques, from loyalty bonuses and exclusive cards to ad campaigns that instill FOMO (Fear of Missing Out) among users [1]. Every other advertisement has an “expert” who claims that the given product/service is the best. Tanushree Mitra et al. reference these as part of the theory of persuasion. Since these are old tricks from the classic marketing and advertisement books, it is debatable how effective they are in Kickstarter campaigns; correlation does not imply causation.
Reciprocity, on the other hand, stands out as an effective technique. Kickstarter campaigns, by their nature, do not give anything back to the backers except the promise that the product will eventually come out for consumers. If a Kickstarter campaign gives back something tangible to its backers, it is a very visible add-on for them.
The paper shows that by adding phrases to the control variables in their model, the authors achieve a high degree of accuracy in predicting the success of a campaign. If platforms emerge that let Kickstarter creators tune their pitches based on these suggestions, will the effect subside from overuse?
This study was performed in 2014, more than 3 years ago. Kickstarter is now a very different and more diverse platform, with newer options and a long list of high-profile successes and failures (Pebble was acquired after a string of losses, the Ubuntu phone was a failed campaign, Oculus is a major player in VR, etc.). Product-discovery portals like Product Hunt also influence the popularity of campaigns. Do these conclusions hold up for the Kickstarter of 2017?

Reference:
[1] https://www.salesforce.com/blog/2016/10/customer-loyalty-program-examples-tips.html


Reflection #1 – [1/18] – [Deepika Kishore Mulchandani]

[1]. Danah Boyd & Kate Crawford (2012). Critical Questions for Big Data. Information, Communication & Society, 15:5, 662-679. DOI: 10.1080/1369118X.2012.678878

Summary:

In this paper, the authors describe big data as a ‘capacity to search, aggregate, and cross-reference large datasets’. They then go on to describe the importance of handling the emerging era of big data critically, as it will influence the future. They also discuss the dangers to privacy in this big data phase and other concerning factors that exist. They then discuss in detail the assumptions and biases of this phenomenon using six points. The first point is how big data has changed the definition of knowledge. Another is that having a large amount of data does not necessarily mean that the data is good. The authors explain, with examples, the ethics of the research being done and the lack of regulating techniques and policies, to emphasize their importance. They also discuss how access to this data is limited to a few organizations and the divide this creates.

[2]. D. Lazer and J. Radford, “Data ex Machina: Introduction to Big Data”, Annual Review of Sociology, vol. 43, no. 1, pp. 19-39, 2017.

Summary:

In this paper, the authors review the value of big data research and the work that needs to be done for big data in sociology. They first define big data and then discuss the following three big data sources:

  1. Digital Life: Online behavior from platforms like Facebook, Twitter, Google, etc.
  2. Digital Traces: Call Detail Records which are only records of the action and not the action itself.
  3. Digitalized Life: Google books, phones that identify proximity using Bluetooth.

The authors believe that the availability of these forms of data, along with the tools and techniques required to access such data, provides sociologists with the opportunity to answer various age-old or new questions. To this end, the authors mention the opportunities available to sociologists in the form of massive behavioral data, data obtained through nowcasting, data obtained through natural and field experiments, and data available on social systems. The authors then discuss the pros and cons of sampling the available big data. They also mention the vulnerabilities that exist, such as too large a volume of data, the generalizability of the data, platform dependence of the data, the failing ideal-user assumption, and ethical issues in big data research. In conclusion, the authors mention a few future trends, knowledge of which will help sociologists succeed in big data research.

Reflections:

In [1], the authors ask various questions with the theme ‘Will big data, and the research that surrounds it, help society?’ I like the definition of big data as a ‘socio-technical’ phenomenon. I also like the thought provoked by the use of the term ‘mythology’ in the formal big data definition. The big data paradigm and its rise to fame do somewhat revolve around the belief that the volume of the data provides new, true, and accurate insights. This gives rise to the question ‘Do we sometimes try to find or justify false trends just because we have big data?’ I like the example with which they illustrate the platform dependence of social data. The social network of a person on Facebook may not be the same as on Twitter, by virtue of the fact that the data is different. This could be for a lot of reasons, the most basic one being that a user may not be present on both social sites. This gives rise to another question: ‘What about the population that is not on any social site?’ That chunk of the population is not being considered in any of the studies. Also, the very fact that ease of accessibility of data is sometimes valued over the quality of data raises concerns. I also like that the authors address the quantitative nature of big data research and the importance of context. I appreciate the section in which they discuss how this big data is held by a few organizations and the ‘Big Data Rich’ and ‘Big Data Poor’ divide that this creates. This is something which has to be considered to facilitate successful big data research.

In [2], I appreciate the definition of big data that the authors provide. Big data is indeed a mix of computer science tools and social science questions. The authors mention that sociologists need to learn how to leverage the tools and techniques provided by computer scientists to make breakthroughs in their research. This makes for an excellent collaboration, where computer scientists leverage the questions and research expertise of social scientists, and social scientists leverage the tools and techniques developed for providing insights into big data. I like the way the authors describe big data archives as depicting actual behavior “in principle”. Although there are instances which show positive results when studying behaviors using such big data, the question that arises is ‘How genuine is this online behavior?’ Many factors play a role in these studies. The biases present in the data have to be considered. If data from social networks is being considered, one of the most basic examples of bias is the ideal-user assumption, as highlighted in the paper. Moreover, the veracity of the data has to be considered as well. Another important bias mentioned in the paper arises due to incorrect sampling of data. I realize that a sample of the big data can provide valuable insights. However, this raises the question ‘What methods can be applied to sample data without bias?’ I appreciate the effort that the authors have invested in providing many case-study examples to emphasize the points that they make in the review. This provokes thought about the vulnerabilities and the work that has to be done to make big data research as ethical and methodical as possible.


Reflection #1 – [1/18] – MD MOMEN BHUIYAN

Paper #1: Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon
Paper #2: Data ex Machina: Introduction to Big Data

Summary #1:
This paper discusses what is entailed by the phenomenon of big data from the perspective of socio-technical research. The authors describe the Big Data phenomenon as an interplay between technology, analysis, and mythology. Here, mythology represents the uncertainty of inferring knowledge from big data. The theme of the paper is six provocations about issues in big data: the definition of knowledge, claims of objectivity, methodological issues, the importance of context, ethics, and the digital divide due to access restriction. Big data introduces new tools and research methods for transforming information, which in turn changes the definition of knowledge. While quantifying information in big data might seem to provide objective claims, it is still subjective observation that initiates it; it is the design decisions in the interpretation of data that incorporate subjectivity and its biases. One caveat of interpreting big data is seeing correlation where none exists (p. 11). The methodological limitation of big data lies in the source and collection procedure: it is necessary to understand how much the data was filtered and whether it generalizes to the research domain. One solution to this problem is combining information from heterogeneous sources, although that also amplifies noise. This heterogeneous combination is also helpful in modeling the big picture. For example, Facebook alone can provide a relationship graph of a user, but it is not necessarily the overall representation, because multi-dimensional communication methods like social networks, email, phone, etc. each provide a different representation. So context is very important in describing such information. Big data also raises the question of ethics in research. The sheer volume of big data could provide enough information to de-anonymize individuals, so information should be carefully published to protect the privacy of the entities involved. Finally, accessibility of big data divides the research community into two groups, where one group is more privileged. Computational background also creates a similar division among big data researchers.

Summary #2:
This paper discusses issues similar to those in the previous paper. The authors first discuss the sources of big data by dividing them into three categories: digital life, digital traces, and digitalized life. Digital life refers to digitally mediated interactions like tweeting, searching, etc. Digital traces are records that indirectly provide information about digital life, like call detail records. Finally, digitalized life represents the capture of a nondigital portion of life in digital form, like constant video recording at an ATM. There is also the possibility of collecting specific behaviors, like certain types of tweets or visits to certain webpages. These data provide several opportunities for behavioral research. Big data provides large datasets from different sources, and combining these sources yields important insights, as in the Copenhagen Network Study. Big data also provides cost-effective solutions for some studies, like unemployment detection or disease propagation. The effect of external changes, like hurricanes or price hikes, can also be captured with big data. By focusing on underrepresented populations, big data has been used to study problems like PTSD and suicidal ideation. The vulnerabilities of big data include the problem of generalizability of hypotheses, heterogeneous sources, errors in the source systems, and the ideal-user assumption. Research on big data includes ethical issues in both the acquisition and the publishing of data. Finally, recent big data trends include new sources of data, generic models for research, qualitative approaches in big data research, etc.

Reflection:
Both of the papers discuss issues and applications of big data in identifying social phenomena. This reflection focuses on the generalizability issue in big data. The authors suggest that combining multiple sources for validation can address the generalizability issue. This seems interesting, given that the deep learning community has recently found that a model can be made to generalize by using more data as well as by using transfer learning. A similar approach could be used for finding social phenomena in big data. For example, data from Twitter can provide information about the spreading of rumors by people with certain attributes. Although Facebook is quite different from Twitter, it may be possible to use the hypothesis and the results from Twitter to initialize a learning model applied to Facebook. What do you think?


Reflection #1 – [01/18] – [Vartan Kesiz Abnousi]

Danah Boyd & Kate Crawford: Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society. https://doi.org/10.1080/1369118X.2012.678878

David Lazer and Jason Radford: Data ex Machina: Introduction to Big Data. Annual Review of Sociology. https://doi.org/10.1146/annurev-soc-060116-053457

Summary

The articles aim to provide a critical analysis of “Big Data”. There is also a brief historical account of how the term “Big Data” was born, and of its definition. The authors stress that Big Data is more than large datasets; it is about a capacity to search, aggregate, and cross-reference large data sets. They underline the great potential of Big Data in solving problems. However, they also warn about the dangers that it brings. One danger is that it could be perceived as a panacea for all research problems and reduce the significance of other fields, like the humanities. It is argued that the tools of Big Data do not necessarily offer an “objective” narration of reality, as some of its champions boast. For instance, the samples of Big Data are often not representative of the entire population; therefore, making generalizations about all groups of people based solely on Big Data is erroneous. As the authors argue, “Big Data and whole data are also not the same”. It is not necessary to have more data to get better insight into a particular issue. In addition, there are many steps in the analysis of raw Big Data, such as “data cleaning”, where human judgement is required. As a result, the research outcome could be dramatically different due to the subjective choices made in the process of data analysis. Concerns regarding privacy are also raised: the human subjects may not be aware of, or may not have consented to, their data being collected, and the line between private and public information has become more obscure. State institutions might use the data in order to curtail individual civil liberties, a phenomenon known as Big Brother. A particularly important problem is that of research inequality, which takes numerous forms. For example, companies do not provide full access to the collected data to public research institutions. As a result, the privileged few who are within the companies and have complete access can find different, more accurate results. In addition, those companies usually partner with, or provide access to their data to, specific elite universities; as a result, the students of these schools will have a comparative advantage in their skills compared to the rest. This sharpens both social and research inequalities. The very definition of knowledge is changed: people now get large volumes of epidemiological data without even designing an experiment. As the authors argue, “it is changing the objects of knowledge”. The authors also argue that Big Data is vast and heterogeneous. They classify the data into three sources: digitalized life, digital traces, and digital life. By digital life they refer to Twitter, Facebook, and Wikipedia, which are all platforms where behaviors are entirely online. The authors argue that these platforms can either be viewed as generalizable microcosms of society or as distinctive realms in which much of the human experience now resides. Digital traces include information collected from sources such as phone calls, while an example of digitalized life is video recordings of individuals.

Reflections

Both articles are very well written, and I agree with the points that they raise. However, I am particularly cautious about the notion of viewing digital life as a microcosm of our society. Such a generalization is more than just an abstract, subjective idea; it is rigorously defined in probability theory, and there are mathematical rules for whether a sample is representative or not. A famous example is the 1948 US presidential election, which Truman won; at the time, all the election polls were wrong because of sampling errors. I am also worried that some of these digital platforms bolster a form of herd behavior that renders individuals less rational. This herd behavior, studied by social scientists such as Freud and Jung among many others, has been argued to be one of the causes of the rise of Fascism.

Finally, I have some questions that could develop into research ideas such as:

  1. Doesn’t the nature of the digital platform (e.g., Twitter) change an individual’s behavior? If yes, then how?
  2. Is the increasing polarization in the United States related to these digital platforms?
  3. Does digital anonymity alter someone’s behavior?
  4. Do people behave the same way across different digital platforms?
  5. Can we, as researchers, develop a methodology to render digital platforms and traces representative of the population?


Reflection #1 – [1/18] – [Meghendra Singh]

  1. Data ex Machina: Introduction to Big Data – Lazer et al.
  2. Critical Questions for Big Data – Boyd et al.

Both papers focus on Big Data and its application to social science research. I feel that Boyd et al. take a more critical approach towards Big Data in social science: after the initial introduction to Big Data, they go on to discuss their six provocations about Big Data applied in the social media context. I find all of the questions the authors raise in this text to be very relevant to the subject matter. Each of these questions can also be seen as a potential research question in the domain of Big Data.

Big Data analysis might cause other, traditional approaches of analysis to be ignored, and this might not always be correct. We need to be wary of the fact that the ‘Volume’ of the data doesn’t imply it is the ‘Right’ data: it might not represent the truth, might be biased, and might not be neutral. Consider the example of inferring the political opinions in a geographic region based on the Facebook posts by users in that region. What if most of the Facebook users in that region belong to a particular demographic segment? What if most of them are inactive and only read posts by other users? How do we account for bots and fake profiles? If we are to draw any inference from such data, we need to be sure that we account for such problems with our dataset and for the fact that the digital world does not map one-to-one onto the real world. Potential research in this direction may include the development of techniques and metrics that can measure the disparity between real-world data and that on social media. Techniques to bridge this divide between the real and digital worlds might also hold some value for future research.

Boyd et al. emphasize that the tools and systems used to create, curate, and analyze Big Data might impose restrictions on the amount of Big Data that can be analyzed. These aspects might also influence the temporal aspect of Big Data and the kinds of analysis that can be performed on it. For example, today’s dominant social media platforms like Twitter and Facebook offer poor archiving and search functions, which makes it nearly impossible to access older data, thereby biasing research efforts towards recent events like elections or natural disasters. Therefore, the limitations of these platforms bias the artifacts and research that stem from such analysis. As a consequence, entire subject areas / knowledge bases / digital libraries might get biased over time because of the technological shortcomings of these social media platforms. This opens up a slew of research problems in the domain of Big Data and social science. Can we measure the impact of these so-called ‘platform limitations’ on things like the research topics that occur in the domain of Big Data mining? A more technological challenge that needs to be addressed is how we can make the petabyte-scale data on these platforms accessible to a larger audience (and not just researchers from top-tier universities and the companies owning these social media platforms).

The authors also emphasize that design decisions made while analyzing Big Data, like the specific ‘data cleaning’ process applied, might make it biased and hence an unsuitable candidate for generating authentic insights. Also, interpreting data is often prone to spurious correlations and apophenia; Big Data might magnify the opportunities for such misinterpretations to creep into our research, and we should be especially wary of these aspects while handling Big Data. Then there are issues relating to anonymity and privacy. Boyd et al. also emphasize that data accessibility does not imply permission to use. Another important aspect discussed, which can also be a potential research question, is: how do we validate research done using Big Data when the data itself is inaccessible to the reviewers?

I felt that Lazer et al. take a more balanced approach while discussing the challenges associated with the use of Big Data for social science research. They discuss the pros and cons of Big Data for social science research instead of focusing only on the problems. I liked that Lazer et al. describe various studies, like the Copenhagen Network Study and the Billion Prices Project, to emphasize specific problems associated with the use of Big Data for social science. The paper brings to bear specific facts, like that there is very little research that uses big data in prominent sociology journals, and that most of the big data relevant to social science research is massive and passive. I also find their classification of big data sources into digital life, digital traces, and digitalized life comprehensive. I feel that there is a huge overlap between the challenges discussed in the two papers, although the terms used in each paper were different. I really like that Lazer et al. present these challenges in a structured way, and that each class of challenges (generalizability, too many big data, artifacts and reactivity, and the ideal-user assumption) has a meaningful and distinct definition.


Reflection #1 – [01/18] – [Patrick Sullivan]

Boyd and Crawford are looking into the benefits and downfalls that come from the usage of Big Data. This is a worthy discussion because the public holds high expectations and wild assumptions about Big Data. While Big Data can lead to some previously unattainable answers, that is exactly why we should be more wary of its results: reaching those answers through other methods might be impossible, possibly leaving Big Data’s results both unverified and unverifiable, a very unscientific characteristic.

Issues that plague a discipline or research area are not a new concept. Statistics has sample bias and Simpson’s paradox. Logic has the slippery slope and false cause fallacies. Psychology has suggestion and confirmation bias as well as hindsight, primacy, and recency effects. However, Big Data has method and misinterpretation errors that can be compounded with nearly all of the previous issues listed. This leads to huge problems whenever the purely quantitative appearance of Big Data is championed and its results accepted without closer investigation.

There might be ways of combating this if Big Data can adopt the same defenses other disciplines use. Extensive training for the individuals at the forefront to challenge and counteract fallacies can be seen in Logic through lawyers and judges. Statistics has developed data-reporting standards that either avoid or reveal issues, as well as explicitly reporting the uncertainty and precision of measurements. Psychology actually integrates what could be a pitfall into its experimental design, for example when using the placebo effect or hiring actors to test social situations, and then shows how the results change compared to a control group. Big Data researchers should adopt these defenses or invent new ones to lend more authority to their assertions.

Lazer and Radford share many of these same concerns, but also point out a more recent change: intelligent hostile actors. This is one of the largest threats to Big Data research, since it is a counteracting force that naturally evolves to both survive and do more damage. As bots and manipulation cause more destruction and chaos, any Big Data research on that data becomes less trustworthy. Interestingly, positive outcomes can come from simply revealing the presence of hostile actors within Big Data sources. This would call into question the validity of findings that previously may have gone undoubted thanks to the tendency to view quantitative results as objective and factual.

Questions:

  • Should Big Data be more publicly scrutinized for hostile actors’ data manipulation in order to keep expectations more realistic?
  • Should Big Data research findings be automatically assigned more doubt since misleading or misunderstood results can be so well hidden behind a veil of objectivity?
  • Would more skepticism towards Big Data research slow or damage the research efforts until it causes a net negative impact on society? Could we find this limit?


Reflection #1 – [1/18] – [Jiameng Pu]

  • D Boyd et al. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.
  • D Lazer et al. Data ex Machina: Introduction to Big Data.

Summary:

These two papers summarize and discuss critical questions about big data, ranging from its definition to societal controversy, and demonstrate that researchers in social science need to consider many easily ignored questions before conducting their research. In Critical Questions for Big Data, the authors summarize six crucial points regarding big data. 1) Big data produces a new understanding of knowledge. 2) All researchers are interpreters of data; what is the line between being subjective and being objective? 3) Big quantities of data do not necessarily mean better data, because data sources and data size tend to offer researchers a skewed view. 4) The context of big data cannot be neglected; for example, today’s social networks do not necessarily reflect sociograms and kinship networks among people. 5) Research ethics always has to be considered when using data. 6) Access to data is not equal across research groups, which creates a digital divide in the research community. In Data ex Machina, the authors explicitly illustrate the definition, sources, opportunities, and vulnerabilities of big data. By reviewing some particular projects in the literature, e.g., The Billion Prices Project, EventRegistry, and GDELT, it offers a convincing view of the existing problems in big data. For example, it examines three aspects, i.e., data generation, data sources, and data validation, to illustrate vulnerabilities in big data research. The authors conclude by discussing the future trends of big data: more data, with expected standard forms and newly enabled research approaches, will come.

Reflection:

In social science, researchers utilize large amounts of data from different platforms and analyze it to prove their assumptions or explore potential laws. Just as Fordism produced a new understanding of labor and of human relationships at work in the twentieth century, big data today changes people’s understanding of knowledge and of human networks/communities. These two papers cover a lot of viewpoints I had never thought about, even though I already knew about big data and had done some simple related tasks. Instances like the “firehose” and “bots on social media” trigger my interest in how to improve the scientific environment around big data. They also prompt readers to think in depth, and with a dialectical perspective, about the research data they are using. Data collection and preprocessing are more fundamental and critical than I had thought. Is quantity bound to represent objectivity? Can data in large numbers give us all the data we need to analyze in our specific context? Are data platforms themselves unbiased? The truth is that there are data controllers in the world, i.e., some authorities/organizations/companies have the power to control data subjectivity and accessibility; there are data interpreters, since all researchers can be considered interpreters in some way; and there are booming data platforms/sources for researchers to choose from.

In general, the papers enlighten me about big data in the context of social science in two ways: 1) researchers should always avoid using data in a way that would obviously affect the rigor of the research, e.g., using one specific platform like Twitter to analyze a kinship network; researchers need to step outside their individual subjectivity when interpreting data. 2) Both organizations and researchers should put effort into constructing a healthy and harmonious big-data community, to improve the accessibility and validation of data and to formulate scientific usage standards and sampling frames for big data. Whether as authorities, networks, or individuals, we should dedicate ourselves to work that can potentially benefit the whole big data community. In this way, scientific researchers will have more faith and courage to face the coming era of big data, with more challenges but also more valuable knowledge.

Questions:

  1. What’s the definition of knowledge in the twentieth century? How about now?
  2. How can we analyze people’s relationship networks without distortion? How many data platforms do we have to use, e.g., email, Twitter, Facebook, Instagram… and what combination is the best choice?
  3. To what extent do we have to consider the vulnerabilities of accessible data? For example, if we can use currently available datasets to solve a practical problem, we may be able to ignore some of the vulnerabilities and limitations a little bit.
  4. How much can systematic sampling frames help us in analyzing a specific assumption?
  5. What are the uppermost questions for researchers to think about when collecting/processing data?
  6. What situations are best avoided by researchers when collecting/preprocessing data?
