Reflection #3 – [1/23] – [Jiameng Pu]

Cheng, Justin, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. “Antisocial Behavior in Online Discussion Communities.” ICWSM. 2015.


User contributions, e.g., posts, comments, and votes, are an important part of many kinds of social platforms. While most users are civil, a few antisocial users can greatly contaminate the environment of the internet. By studying users who were banned from specific communities and comparing two user groups, FBUs (Future-Banned Users) and NBUs (Never-Banned Users), the authors try to characterize antisocial behavior, e.g., how FBUs write and how FBUs generate activity around themselves. The "Evolution over time" analysis shows that FBUs write worse than other users over time and tend to exacerbate their antisocial behavior when they face stronger criticism from the community. By designing features based on these observations and then categorizing them, the work can potentially relieve community moderators of heavy manual labor. Besides, it proposes a typology of antisocial users based on post deletion rates. Finally, a system is introduced to identify undesirable users early in their community life.


The paper offers extensive discussion and analysis of antisocial behavior; I highlight the points that impressed or inspired me most. First, the analysis of how to measure undesired behavior in the data preparation section is a useful one. It reminds me that down-voting activity cannot be interpreted as undesirable in the context of "antisocial behavior," which is a much narrower concept. Personally, I don't use the down-vote functionality much when I browse Q&A websites like Quora, Zhihu, and Stack Overflow, and it turns out many people share the same habit, which is a good instance where considering fewer features/data, i.e., report records and post deletion rates, makes more sense. Second, instead of predicting whether a particular post or comment is malicious, the authors focus on individual users and their whole community life, which is harder to analyze but brings more convenience to community moderators, since they can act like real community police rather than simply cleaners. Third, the four feature categories properly cover all the feature classes, but the authors don't mention some potentially important features in Table 3, e.g., post comments, which could be categorized as post features, and a user's followings and followers, which could be categorized as community features. Intuitively, these two features are strong indicators of a user's properties: people of one mind fall into the same group, and harsh criticism would show up in the comment area of malicious posts.
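To make the feature discussion concrete, the four classes could be pictured as a grouped per-user feature dictionary. The field names and groupings below are my own shorthand for illustration, not the paper's exact schema:

```python
# Illustrative grouping of per-user signals into the four feature classes
# (post, activity, community, moderator). Field names are invented.

def extract_features(user):
    """Map a raw user record to feature values grouped by class."""
    n_posts = max(user["num_posts"], 1)
    return {
        "post": {
            "readability": user["avg_readability"],
            "similarity_to_thread": user["avg_similarity"],
        },
        "activity": {
            "posts_per_day": user["num_posts"] / max(user["days_active"], 1),
            "thread_concentration": user["max_posts_in_thread"] / n_posts,
        },
        "community": {
            "mean_votes": user["total_votes"] / n_posts,
            "replies_received": user["num_replies"],
        },
        "moderator": {
            "delete_rate": user["num_deleted"] / n_posts,  # key FBU signal
            "num_reports": user["num_reports"],
        },
    }

example = {
    "avg_readability": 62.0, "avg_similarity": 0.31,
    "num_posts": 40, "days_active": 10, "max_posts_in_thread": 12,
    "total_votes": -35, "num_replies": 88,
    "num_deleted": 9, "num_reports": 4,
}
feats = extract_features(example)
print(feats["moderator"]["delete_rate"])  # 0.225
```

Laying features out this way also makes it obvious where my suggested additions (comment content under post features, follower/following counts under community features) would slot in.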

I notice that the authors perform the above task on a balanced dataset of FBUs and NBUs (N=18,758 for CNN, 1,164 for IGN, 1,138 for Breitbart), suggesting that the learned models generalize to multiple communities. Though the numbers of FBUs and NBUs are balanced, would the very different numbers of user samples from the three platforms influence the generalization of the resulting classifier? In my view, it would be more rigorous for the authors to rebalance the lopsided samples or add more discussion of how the data can be properly sampled.
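One way to probe this sampling concern is to report accuracy per community rather than only in aggregate: with CNN dominating the pool, an overall number can mask weak performance on IGN or Breitbart. A minimal sketch on fabricated data (the delete-rate feature, the class distributions, and the sample sizes are all invented for illustration):

```python
import random

random.seed(0)

# Synthetic stand-in users: (delete_rate, is_fbu, community).
# Sample sizes mirror the imbalance noted above (one community dominates);
# feature values are fabricated.
def make_users(n, community):
    users = []
    for _ in range(n):
        is_fbu = random.random() < 0.5                      # balanced classes
        rate = random.gauss(0.25 if is_fbu else 0.05, 0.08)  # FBUs deleted more
        users.append((max(0.0, rate), is_fbu, community))
    return users

data = make_users(1000, "CNN") + make_users(60, "IGN") + make_users(60, "Breitbart")

# A trivial global classifier: predict FBU when delete rate exceeds a threshold.
THRESHOLD = 0.15
def predict(delete_rate):
    return delete_rate > THRESHOLD

def accuracy(users):
    return sum(predict(r) == y for r, y, _ in users) / len(users)

# The overall score can hide per-community weaknesses, so report both.
print("overall:", round(accuracy(data), 3))
for c in ("CNN", "IGN", "Breitbart"):
    subset = [u for u in data if u[2] == c]
    print(c, round(accuracy(subset), 3), f"(n={len(subset)})")
```

If the per-community scores diverge sharply from the overall score, that would support the worry that the classifier is mostly fitting the largest community.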

Questions & thoughts:

  1. What is the proper line between antisocial and non-antisocial? We should avoid conflating merely unpleasant users with antisocial users.
  2. Compared to the last paper, there is less description of the implementation tools used throughout the different phases of the research. I'm curious how specific procedures are carried out in practice, e.g., data collection, feature categorization, and the investigation of the evolution of user behavior and of community response.
  3. I think the choice of classifier probably makes a difference in prediction accuracy, so it might be better to compare the performance of several classifiers to find the most suitable one for this task.
  4. Although we can roughly see the contribution of each feature category from Table 4, I think a more extensive and quantitative analysis would complete the research.
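Questions 3 and 4 could both be probed with a small cross-validation harness that scores candidate classifiers head to head; comparing a learned delete-rate threshold against a learned report-rate threshold is a crude stand-in for comparing classifiers or ablating feature categories. Everything below is fabricated for illustration and is not the paper's actual setup:

```python
import random

random.seed(1)

# Fabricated examples: (delete_rate, reports_per_post, is_fbu).
def sample(n):
    out = []
    for _ in range(n):
        y = random.random() < 0.5
        dr = max(0.0, random.gauss(0.25 if y else 0.05, 0.10))
        rp = max(0.0, random.gauss(0.10 if y else 0.02, 0.05))
        out.append((dr, rp, y))
    return out

data = sample(600)

def fit_threshold(train, idx):
    """Pick the accuracy-maximizing threshold for feature `idx` on the train split."""
    best_t, best_acc = 0.0, 0.0
    for t in (i / 100 for i in range(51)):
        acc = sum((x[idx] > t) == x[2] for x in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def cv_accuracy(idx, data, k=5):
    """k-fold cross-validation for a single-feature threshold classifier."""
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i, test in enumerate(folds):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        t = fit_threshold(train, idx)
        accs.append(sum((x[idx] > t) == x[2] for x in test) / len(test))
    return sum(accs) / k

for name, idx in [("delete rate", 0), ("reports per post", 1)]:
    print(name, round(cv_accuracy(idx, data), 3))
```

The same fold structure would carry over directly to comparing real classifiers (logistic regression, random forests, etc.) or to dropping one feature category at a time to quantify each one's contribution.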


Reflection #1 – [1/18] – [Jiameng Pu]

  • D. Boyd et al. "Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon."
  • D. Lazer et al. "Data ex Machina: Introduction to Big Data."


These two papers summarize and discuss critical questions about big data, ranging from its definition to societal controversies, and demonstrate that researchers in social science need to consider many easily ignored questions before conducting their research. In Critical Questions for Big Data, the authors summarize six crucial points: 1) Big data produces a new understanding of knowledge. 2) All researchers are interpreters of data; what is the line between being subjective and being objective? 3) Large quantities of data do not necessarily mean better data, because data sources and data size tend to offer researchers a skewed view. 4) The context of big data cannot be neglected; for example, today's social networks do not necessarily reflect sociograms and kinship networks among people. 5) Research ethics should always be considered when using data. 6) Access to data is not equal across research groups, which entails a digital divide in the research community. In Data ex Machina, the authors explicitly illustrate the definition, sources, opportunities, and vulnerabilities of big data. By reviewing particular projects in the literature, e.g., The Billion Prices Project, EventRegistry, and GDELT, it offers a convincing view of existing problems in big data. For example, it discusses vulnerabilities in big data research from three aspects: data generation, data sources, and data validation. The authors conclude by discussing future trends of big data: more data in expected standard forms, enabling new research approaches.


In social science, researchers utilize large amounts of data from different platforms and analyze it to test their hypotheses or explore potential laws. Just as Fordism produced a new understanding of labor and human relationships at work in the twentieth century, big data today changes people's understanding of knowledge and human networks/communities. These two papers cover many viewpoints I had never considered, even though I already knew about big data and had done some simple related tasks. Instances like the "firehose" and "bots on social media" trigger my interest in how to improve the scientific environment of big data. Besides, they prompt readers to think in depth, and with a dialectical perspective, about the research data they are using. Data collection and preprocessing are more fundamental and critical than I had ever thought. Is quantity bound to represent objectivity? Can a large amount of data give us all the data we need to analyze in our specific context? Are data platforms themselves unbiased? The truth is that there are data controllers, i.e., authorities/organizations/companies with the power to control data subjectivity and accessibility; there are data interpreters, since all researchers can be considered interpreters in some way; and there are booming data platforms/sources for researchers to choose from.

In general, the papers enlighten me about big data in the context of social science in two ways: 1) Researchers should always avoid using data in ways that would obviously affect the rigor of their research, e.g., using one specific platform like Twitter to analyze kinship networks. It is also necessary for researchers to step outside their individual subjectivity when interpreting data. 2) Both organizations and researchers should put effort into constructing a healthy and harmonious big-data community, improving the accessibility and validation of data and formulating scientific usage standards and sampling frames for big data. Whether as authorities, networks, or individuals, we should dedicate ourselves to work that can benefit the whole big-data community. In this way, researchers will have more faith and courage to face the coming era of big data, with its greater challenges but also its more valuable knowledge.


  1. What was the definition of knowledge in the twentieth century? How about now?
  2. How can we analyze people's relationship networks without distortion? How many data platforms do we have to use, e.g., email, Twitter, Facebook, Instagram, and what combination is the best choice?
  3. To what extent do we have to consider the vulnerabilities of accessible data? For example, if currently available datasets can solve a practical problem, perhaps we can tolerate some of their vulnerabilities and limitations.
  4. How much can systematic sampling frames help us in testing a specific hypothesis?
  5. What are the most important questions for researchers to consider when collecting/processing data?
  6. What situations should researchers avoid when collecting/preprocessing data?
