Reflection #3 – [1/24] – Hamza Manzoor

Justin Cheng, Cristian Danescu-Niculescu-Mizil, Jure Leskovec. “Antisocial Behavior in Online Discussion Communities” 

 

Summary:

In this paper, Cheng et al. present a study of the antisocial behavior of users from the moment they join a community up to the moment they are banned. The paper addresses a very important topic: cyberbullying, and the identification of cyberbullies is one of the most relevant problems of the current digital age. The authors study user behavior on three online discussion communities (CNN, Breitbart, and IGN) and characterize antisocial behavior by analyzing users who were banned from these communities. The analysis of these banned users reveals that, over time, they write worse than other users and that the community's tolerance toward them decreases.

 

Reflection:

Overall, I felt that the paper was well organized and showed all the steps of the analysis, from data preparation to findings, along with visualizations. However, the correctness of the analysis is questionable, because the entire analysis rests on the number of deleted posts, and the authors did not consider all the reasons a post might be deleted. Some posts are deleted because they are written in a different language, or, on controversial topics such as politics, because a reported post does not conform to the moderators' opinions. Sometimes users engage in off-topic discussion, and those posts are deleted to keep the comments relevant to the article. The biases of moderators should therefore be considered.

The paper does not mention the population size for some analyses, which makes me question whether the sample sizes were significant. For example, when the authors analyze whether excessive censorship causes users to write worse, one population consists of users who had four or more of their first five posts deleted, which, unless stated otherwise, I believe would be a negligible group. Moreover, the entire analysis depends more or less on the first five or ten posts, which is also questionable because those posts could all come from a single thread on a single day. This approach has two caveats. First, since the authors did not analyze the text, it is unfair to ban a user based on their first few posts; the user may simply have held a conflicting opinion rather than being a troll. Second, the paper itself shows that many NBUs initially wrote negative posts and improved over time, so banning users based on their first few deleted posts denies them the opportunity to improve.

The strongest features in the statistical analysis are the moderator features, and without them the results drop significantly. These features require human moderators, whereas the purpose of the analysis was to automate the process of identifying FBUs; the heavy dependence on moderator features makes the contribution look less significant. A simple feature-ablation experiment, sketched below, illustrates the concern.
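To make this concrete, here is a minimal sketch of the kind of ablation I have in mind: train the same classifier with and without moderator-derived features and compare performance. The data, feature names, and classifier settings below are illustrative placeholders, not the authors' actual dataset or pipeline.

```python
# Hypothetical feature ablation: how much do moderator features contribute
# to predicting future-banned users (FBUs)? All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_users = 1000

# Features available without any moderator intervention (e.g., readability, post rate).
content_features = rng.normal(size=(n_users, 5))
# Moderator features (e.g., fraction of early posts deleted) -- these presuppose
# the very human labor the automated system is supposed to replace.
moderator_features = rng.normal(size=(n_users, 2))
y = rng.integers(0, 2, size=n_users)  # 1 = eventually banned (FBU), 0 = not banned

clf = RandomForestClassifier(n_estimators=200, random_state=0)

with_mod = cross_val_score(
    clf, np.hstack([content_features, moderator_features]), y,
    cv=5, scoring="roc_auc").mean()
without_mod = cross_val_score(
    clf, content_features, y, cv=5, scoring="roc_auc").mean()

print(f"AUC with moderator features:    {with_mod:.3f}")
print(f"AUC without moderator features: {without_mod:.3f}")
```

If the gap between the two scores is large, the "automated" detector is mostly repackaging moderator judgments rather than replacing them.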

Finally, my take on the analysis in this paper is that relying on the number of deleted posts alone is too simplistic; the text of the posts should be analyzed before automating any process that bans users from posting.

 

Questions:

Are posts deleted only because of inflammatory language, or because of differences of opinion as well?

Everyone might agree that judging users on their first few posts is unfair, but what should the threshold be? Can we produce a solid analysis without topic modeling or analyzing the text?

What kinds of biases do moderators display? Do those biases play a role in post deletions and, ultimately, in user bans?


Reflection #2 – [1/24] – Hamza Manzoor

Mitra, Tanushree, and Eric Gilbert. “The language that gets people to give: Phrases that predict success on Kickstarter.” 

Summary:

In this paper, Mitra et al. study how the language used in a pitch gets people to fund a project. The authors analyze the text of 45K Kickstarter project pitches, keep the phrases that appear across all 13 project categories, and ultimately use 20K phrases along with 59 control variables (such as project goal, duration, and number of pledge levels) to train a penalized logistic regression model that predicts whether a project will be funded. Including the phrases in the model decreases the error rate from 17.03% to 2.4%, which shows that the text of the pitch plays a vital role in getting funded. The paper compares the features of funded and non-funded projects and explains that campaigns exhibiting reciprocity (giving something in return), scarcity (limited availability), and social proof have a higher tendency of getting funded.
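To make the modeling setup concrete, here is a minimal sketch of what a phrases-plus-controls penalized logistic regression looks like. The pitches, control variables, and penalty settings are made up for illustration; this is not the authors' data or exact pipeline.

```python
# Illustrative sketch: L1-penalized (LASSO) logistic regression on phrase
# counts plus control variables. The real study used ~20K phrases from
# 45K pitches and 59 controls; everything below is toy data.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pitches = [
    "pledge now and we will give you early access",  # reciprocity
    "only 100 copies will ever be made",             # scarcity
    "thousands of backers already support us",       # social proof
    "we might make this someday if we feel like it",
]
funded = np.array([1, 1, 1, 0])

# Unigram/bigram counts stand in for the paper's phrase features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
phrase_features = vectorizer.fit_transform(pitches)

# Hypothetical controls: log(goal), duration in days, number of pledge levels.
controls = np.array([[9.2, 30, 5], [8.1, 45, 8], [10.5, 30, 10], [7.0, 60, 3]])

X = hstack([phrase_features, controls])
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X, funded)

print("Predicted funding probability:", model.predict_proba(X)[:, 1].round(2))
```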

Reflection:

The authors address the question of which features, and which language, help a project raise more funds. The insights the paper provides are very realistic: people generally tend to give when they see a benefit for themselves, perhaps because they get something in return. The paper also offers useful advice to startups looking for funding: focus on the pitch and demonstrate reciprocity, scarcity, and social proof. Still, the results are somewhat astonishing to me, because the top 100 predictors all come from the language of the pitch, which makes me question whether language alone is sufficient to predict whether a project will be funded.

There are also a few phrases that do not make sense when taken out of context. For example, 'trash' has a very high beta score, but does that make sense? Unless we look at the entire sentence, we cannot say.

The authors show that using phrases in the model significantly decreases the error rate, but the choice of model is not justified. Why penalized logistic regression? Even though penalized logistic regression (LASSO) makes sense, a comparison with other models should have been provided. Ensemble methods such as a random forest classifier should work well on this type of data, so a comparison of the models tested (along the lines sketched below) would have given more insight into the choice of model.
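Here is a minimal sketch of the comparison I would have liked to see, using synthetic data in place of the Kickstarter features; the model settings are illustrative assumptions, not results from the paper.

```python
# Hypothetical model comparison: LASSO logistic regression vs. random forest
# on synthetic high-dimensional "pitch" features. Not the paper's data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=50, random_state=0)

models = {
    "LASSO logistic regression": LogisticRegression(penalty="l1",
                                                    solver="liblinear", C=0.5),
    "Random forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

A table of cross-validated scores like this, on the real features, would have made the case for LASSO (or against it) explicit.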

Furthermore, treating every campaign equally is another questionable assumption in this paper. How can a product asking for $1M and meeting its goal be equivalent to a product with a $1,000 goal? And is every category of campaign equivalent?

Finally, this paper was about the language used in pitches, but it also raises new research questions. Is there a difference between the types of people funding different projects? Do most backers come from wealthy societies? Another interesting question is whether we can process the text within video pitches to perform a similar analysis. Do infographics help? And can we measure the usefulness of a product and use it as a predictor?

 

Questions:

Is language sufficient to predict whether a project will be funded?

Why use penalized logistic regression instead of other models?

Is every category of campaign equivalent?

Is there a difference between types of people funding different projects?

Can we process the text within video pitches to perform a similar analysis?

Can we measure the usefulness of a product and use it as a predictor?


Reflection #1 – [1/24] – Hamza Manzoor

[1] Danah Boyd & Kate Crawford (2012). “Critical Questions for Big Data”

Summary:

In this paper, the authors describe Big Data as a cultural, technological, and scholarly phenomenon. They explain that how we handle the emergence of the Big Data era is critical, because current decisions about how we define its use will shape the future. They describe several pitfalls and discuss six provocations about the issues of Big Data. In these six points, they argue that Big Data has created a radical shift in how we think about research and has changed the definition of knowledge. They also challenge the common myth among researchers that data solves all problems, and point out that restricting data access to a privileged few is creating a new divide. Furthermore, they explain that Big Data, especially social media data, can be misleading because it does not necessarily represent the entire population. Finally, they discuss the ethics of using Big Data in research and the lack of regulation of ethical research practices.

[2] D. Lazer and J. Radford. “Data ex Machina: Introduction to Big Data”

Summary:

In this paper, the authors define Big Data and the institutional challenges it presents to sociology. They touch on three types of Big Data sources and enumerate the promises and pitfalls common across them. The authors argue that cutting across these three types of Big Data is the opportunity for sociologists to study human behavior. They also discuss the opportunities available to sociologists through the huge amounts of data generated by social systems, natural and field experiments, and other digital traces, and they explain how targeted samples drawn from large datasets can be used to study the behavior of minorities. They further discuss the vulnerabilities of Big Data, including the assumption that the data represents the entire population, fake data generated by bots, and data sources with different levels of accessibility, along with the issues these vulnerabilities present.

Reflections:

From both Boyd & Crawford's and Lazer & Radford's descriptions, my takeaway is that Big Data should be used carefully, with ethical issues kept in mind. Furthermore, the key takeaway from these papers for me is that Big Data is not just about size but about how we manipulate the data to generate insights about human behavior.

I particularly liked Boyd & Crawford's provocation #3: bigger data is not necessarily better data. We computer scientists commonly believe that more data can solve all problems, but this is not necessarily true, because the data at hand, no matter how big, might not be representative at all. For example, a trillion rows of Twitter data still represent only a small portion of Twitter users, so generalizing and making claims about behaviors and trends from such data can be misleading, and predictions made with it will carry inherent biases. Since social media is the biggest source of Big Data, the question that comes to mind is: how do we know whether the data is truly representative? And if it is not, where do we get data that truly represents the entire population?

I have concerns about Lazer & Radford's solution to generalizability, which is to merge data from different systems. Is that even possible for an ordinary sociology researcher? Will companies provide access to their entire datasets? Boyd & Crawford's paper explains that people with different privileges have different levels of access to data. Even if we assume an ideal world where we have access to data from all sources, how would we link data across them, for example, a Twitter handle to a Facebook profile and a Snapchat username? The Facebook users available in one dataset may not be the same users who appear in the Twitter data. Will Facebook provide access to its entire dataset?

Nonetheless, the papers prompted me to think about how Big Data can be used in the context of social science and about the ethical vulnerabilities associated with it.

 

Questions:

 

How do we know whether data is truly representative? Where do we get data that truly represents the entire population?

Is it possible to link data from different sources?

How do we know whether what companies are doing on the back end is ethical?

Do people behave in the same way on different digital platforms?

Can computational social science correctly explain human behavior with the data we currently have, given that the papers suggest the data is not truly representative unless merged?
