Reflection #14 – [04/17] – [Nuo Ma]

1. Lelkes, Y., Sood, G., & Iyengar, S. (2017). The hostile audience: The effect of access to broadband internet on partisan affect. American Journal of Political Science, 61(1), 5-20.

In this article, the authors identify the impact of access to broadband Internet on affective polarization. They exploit differences in broadband availability brought about by variation in state right-of-way (ROW) regulations, which significantly affect access to content. The authors conclude that access to broadband Internet increases partisan hostility, and that this effect is stable across levels of political interest. They also find that access to broadband Internet boosts partisans’ consumption of partisan media.

The authors identified the impact of broadband access on affective polarization by exploiting differences in broadband availability brought about by variation in state right-of-way (ROW) regulations, which significantly affect the cost of building Internet infrastructure and thus the price and availability of broadband access. Assuming broadband availability is reflected by the number of service providers, a regression model was used to show that this is related to the ROW score. Other potential causes, like terrain and weather, were also briefly discussed. I liked the methodology in this part, and I was surprised by the number of dial-up connections shown in the study. A future study might also consider cellular data consumption, which could perhaps be measured by the average sale price of smartphones in an area. And for your group project, I can’t exactly remember: how did you acquire the Internet speed data?
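As a rough illustration of that step, the providers-vs-ROW relationship could be checked with a simple least-squares regression; the variable names and numbers below are invented for the sketch, not taken from the paper:

```python
import numpy as np

# Hypothetical data: ROW regulation score per state and the
# number of broadband service providers observed there.
row_score = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n_providers = np.array([2.0, 3.0, 3.0, 5.0, 6.0, 7.0])

# Fit n_providers = a * row_score + b by ordinary least squares.
X = np.column_stack([row_score, np.ones_like(row_score)])
(a, b), *_ = np.linalg.lstsq(X, n_providers, rcond=None)

print(f"slope={a:.3f}, intercept={b:.3f}")
```

A positive, significant slope would be consistent with the claim that friendlier ROW regulation goes with more providers; the real analysis would of course add controls (terrain, weather) rather than a single predictor.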

This also reminded me of the related topic of net neutrality, where you pay to get faster access to certain websites. Since Internet speed really has an impact on what kinds of websites you visit and apps you use, it is painful to think that we can be so easily manipulated. Or companies like Facebook could pay ISPs to give users higher traffic priority to their websites, with our personal data sold in some way to pay for it?



Reflection #10 – [03/22] – [Nuo Ma]

1. Kumar, Srijan, et al. “An army of me: Sockpuppets in online discussion communities.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.


Sockpuppets, multiple accounts created and controlled by a single underlying user, may mislead or have a negative impact on online discussion communities. In this paper, the authors present a study of sockpuppets in online discussion communities. The data come from nine online discussion communities and consist of 2.9 million users. The authors first identify sockpuppets using features like similar usernames and posts from the same IP address within close time proximity. They then analyze the posting behavior and linguistic features of these sockpuppets. As a result, they find that the behavior of sockpuppets differs from that of ordinary users: sockpuppets tend to write more posts than ordinary users, and their posts are shorter and use more first-person pronouns.
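The same-IP, close-in-time heuristic could be sketched roughly like this; the post log, time window, and matching rule below are my simplified assumptions, not the paper's exact procedure:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical post log: (user, ip_address, timestamp in minutes).
posts = [
    ("alice",  "10.0.0.1", 5),
    ("alice2", "10.0.0.1", 7),    # same IP, 2 minutes apart
    ("bob",    "10.0.0.2", 10),
    ("carol",  "10.0.0.1", 500),  # same IP, but far apart in time
]

WINDOW = 15  # minutes; this threshold is an assumption, not from the paper

def candidate_pairs(posts, window=WINDOW):
    """Return user pairs that posted from the same IP within `window` minutes."""
    by_ip = defaultdict(list)
    for user, ip, t in posts:
        by_ip[ip].append((user, t))
    pairs = set()
    for entries in by_ip.values():
        for (u1, t1), (u2, t2) in combinations(entries, 2):
            if u1 != u2 and abs(t1 - t2) <= window:
                pairs.add(tuple(sorted((u1, u2))))
    return pairs

print(candidate_pairs(posts))  # {('alice', 'alice2')}
```

The paper additionally uses username similarity; a real pipeline would intersect this time/IP filter with a string-similarity check before labeling a pair as puppets.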


I like this paper, and there are some points worth discussing. In the data selection, I can see that on average each puppetmaster owns 2 sockpuppet accounts. This is oddly consistent across all nine communities; might there be a reason why? One level deeper into this question: what is the motivation behind such sockpuppets? As we can see in Figure 6, some topics (usa, world, politics, justice, opinion) have significantly higher numbers of sockpuppets than other topics, so I think it is safe to assume that people use sockpuppets mainly for politically oriented discussions. But in the data statistics, political sites still average 2 accounts per puppetmaster, just like MLB and allkpop, and there is not even a significant difference in the ratio of sockpuppets to users. What would be a better way to verify the results of such detection methods? There can also be multiple purposes for such sockpuppets: PR companies may use them to post positive comments about a celebrity, a certain event, or a product. In fact, this is a really common practice in some regions, since the image of a celebrity can greatly affect the revenue of related movies and products. But I would say it is almost impossible to get a dataset from the PR companies that own large numbers of such sockpuppets.


Reflection #9 – [02/22] – [Nuo Ma]

In [1], the authors study the attitudes of people who are against vaccines by analyzing the participants in the vaccination debate on Twitter. They gathered 315,240 tweets matching certain phrases from 144,817 users over a three-year period. Users were then classified into pro-vaccine, anti-vaccine, and joining-anti-vaccine groups, and the groups were compared by linguistic style, topics of interest, and social characteristics. The authors found that long-term anti-vaccination supporters hold conspiratorial views, mistrust the government, and are resolute, and that they use more direct language and express more anger than their pro-vaccine counterparts. The “joining-anti” users share similar conspiratorial thinking but tend to be less assured and more social in nature.

I am curious whether, when they first started the analysis, they identified “typical Twitter users” and did a manual analysis first. I would assume that a persistent anti-vaccine person’s tweets will be consistently very aggressive, but here a user’s tweets are not considered as a consistent whole. By using the user ID that comes with the raw tweet data, we might be able to find users with conflicting tweets and filter out some noisy data, and it would be interesting to see why such conflicts in attitude exist. Also, I would consider users who constantly tweet anti-vaccine content to be extreme, because most people simply do not tweet about this. It would be interesting to see how anti-vaccine tweets spread during flu season and how people view the issue then. The spread pattern of tweets can show us which opinion leaders can make an impact. In some sense, this can be viewed as a form of fake news detection, since those conspiracy stories can be defined as fake news.
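The linguistic-style comparison the authors describe could be sketched in miniature as a per-user word-rate feature; the word list and tweets below are invented for illustration and are far cruder than the LIWC-style measures the paper actually uses:

```python
# Toy linguistic-feature comparison: fraction of a user's tokens that are
# "anger" words, mirroring the kind of style comparison the paper makes.
# Word list and tweets are hypothetical examples.
ANGER_WORDS = {"hate", "angry", "outrage", "furious"}

tweets = {
    "anti_user": ["I hate this vaccine mandate",
                  "such outrage over this cover-up"],
    "pro_user":  ["got my flu shot today",
                  "vaccines save lives"],
}

def anger_rate(texts):
    """Fraction of tokens, across all of a user's tweets, that are anger words."""
    tokens = [w.lower() for t in texts for w in t.split()]
    return sum(w in ANGER_WORDS for w in tokens) / len(tokens)

rates = {user: anger_rate(ts) for user, ts in tweets.items()}
print(rates)
```

Group-level claims like “anti users express more anger” would then come from comparing the distribution of such rates between the pro and anti cohorts, with proper statistical testing rather than two toy users.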



[1] Mitra, Tanushree, Scott Counts, and James W. Pennebaker. “Understanding Anti-Vaccination Attitudes in Social Media.” ICWSM, 2016.

[2] De Choudhury, Munmun, et al. “Predicting Depression via Social Media.” ICWSM, 2013.


Reflection #6 – [02-08] – [Nuo Ma]

Danescu-Niculescu-Mizil, C., Sudhof, M., Jurafsky, D., Leskovec, J., & Potts, C. (2013) “A computational approach to politeness with application to social factors”.

This paper focuses on linguistic aspects of politeness on social platforms like Wikipedia and Stack Exchange. The authors conduct a linguistic analysis of politeness using two classifiers, a bag-of-words classifier and a linguistically informed classifier, and report results close to human performance. Their analysis of the relationship between politeness and social power shows a negative correlation between the two on both Stack Exchange and Wikipedia.

I found this paper convincing. The illustration of how common phrases (strategies) relate to politeness matches how we intuitively make this judgment from words alone. However, a bag of words may not entirely capture politeness. Perhaps a bag of phrases, together with an analysis of the grammatical structure of sentences, would help, since the use of formal and complete grammar may indicate politeness. Another possible drawback: when building the two classifiers to predict politeness, the authors tested them both in-domain (training and testing data from the same source) and cross-domain (training on Wikipedia and testing on Stack Exchange, and vice versa). To me these are communities with distinctive characteristics: Wikipedia is more formal and its word use more precise, while Stack Exchange is more casual in its use of language. A short yet precise answer on Stack Exchange would be viewed by a human as polite, but not by the classifier. The domain transfer here is worth some discussion. What about combining the two data sources and performing leave-one-out cross-validation (LOOCV)?
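The in-domain vs. cross-domain protocol could be sketched with a tiny bag-of-words Naive Bayes standing in for the paper's classifier; the labeled requests below are invented examples, not data from the paper:

```python
import math
from collections import Counter

# Minimal bag-of-words Naive Bayes (Laplace smoothing) to illustrate
# training on one domain and testing on another.
def train(examples):
    counts = {True: Counter(), False: Counter()}
    labels = Counter()
    for text, polite in examples:
        labels[polite] += 1
        counts[polite].update(text.lower().split())
    return counts, labels

def predict(model, text):
    counts, labels = model
    vocab = set(counts[True]) | set(counts[False])
    scores = {}
    for c in (True, False):
        total = sum(counts[c].values())
        score = math.log(labels[c] / sum(labels.values()))
        for w in text.lower().split():
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return scores[True] > scores[False]

def accuracy(model, examples):
    return sum(predict(model, t) == y for t, y in examples) / len(examples)

# Invented "polite request" data for two domains.
wikipedia = [("could you please review this edit", True),
             ("please explain your revert thanks", True),
             ("stop vandalizing this page", False),
             ("your edit is nonsense", False)]
stackexchange = [("thanks could you add a code example please", True),
                 ("this answer is wrong and useless", False)]

model = train(wikipedia)
print("in-domain:", accuracy(model, wikipedia))
print("cross-domain:", accuracy(model, stackexchange))
```

With realistic data, comparing the two printed numbers is exactly where a domain-transfer gap would show up; pooling both sources and cross-validating, as suggested above, is one way to probe it further.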


Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., … & Eberhardt, J. L. (2017). Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 201702413.

This paper studies disparities in the respect level of officer language across race, gender, nationality, etc., using transcribed data from Oakland Police Department body camera videos. The authors extracted the respectfulness of police officer language by applying computational linguistic methods to the transcripts, and found disparities in speech directed toward black versus white community members. The first thing I think about is the prior context of such events. In the case of traffic stops: whether the target complies with the officer’s instructions, whether the car is clean. These are small cues that will affect an officer’s actions, and I am sure officers are not so polite after a highway chase. So audio transcription alone seems insufficient to establish this correlation.



Reflection #3 – [1/23] – [Nuo Ma]

Cheng, Justin, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. “Antisocial Behavior in Online Discussion Communities.” ICWSM. 2015.


In this paper, Cheng et al. present a study of users banned for antisocial behavior, sampled from three online communities (CNN, IGN, and Breitbart). The authors characterize antisocial behavior by studying two specific groups from these communities: FBUs (Future-Banned Users) and NBUs (Never-Banned Users). They also present an analysis of evolution over time, indicating that FBUs write worse than other users over time and that community tolerance of them tends to decline. Finally, the authors propose an approach for extracting features to predict antisocial behavior, potentially automating and standardizing this process.


I think there are several noteworthy points in this paper. First, these are three communities with different characteristics: Breitbart is far-right according to Google, IGN does not have a clear tendency, and personally I consider CNN to lean left. The nature of a community attracts a certain user group and might result in different user behaviors, and specific topics can lead to different results. But in the analysis I only see “measuring undesired behavior”. This is a rather blurry description, though the term antisocial is itself hard to define clearly. It makes me curious because different communities have different banning rules, and how those rules are carried out can vary accordingly; in this article users are simply categorized as banned and non-banned. The banning rules also differ across communities, yet some of the data is treated as a single entity. To me this may not be completely solvable given the nature of the question, but it could definitely be further elaborated or discussed. Also, the numbers of data samples are not consistent (18,758 for CNN, 1,164 for IGN, 1,138 for Breitbart).

As for the proposed features and classifier for predicting antisocial behavior, I like the idea. Using a bag of words can measure literal trolling and abuse; however, a lot of antisocial behavior online goes one step further and is not limited to literal words, e.g. sarcasm. When sarcasm goes to an extreme, it can be antisocial. Identifying such specific antisocial behavior can be easy within an interest group, and when there is agreement within such a group, the post is likely to get reported or deleted. But subjectively deleted or reported posts should not be the only metric for measuring antisocial behavior. Objective features, such as down votes, might reduce the influence of subjective moderator behavior, though this needs further clarification. When you downvote in some communities, you are given options to choose the reason for the vote: disagree, irrelevant, or trolling. This would give the classifier a clearer signal about the reasons behind downvotes.
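The downvote-reason idea could be turned into classifier features roughly like this; the reason categories and the vote log are hypothetical, following the suggestion above rather than anything in the paper:

```python
from collections import Counter

# Hypothetical vote log: each downvote carries a voter-chosen reason.
votes = [
    ("user_a", "trolling"), ("user_a", "trolling"), ("user_a", "disagree"),
    ("user_b", "disagree"), ("user_b", "irrelevant"),
]

def reason_features(votes):
    """Per-user fraction of received downvotes in each reason category."""
    per_user = {}
    for user, reason in votes:
        per_user.setdefault(user, Counter())[reason] += 1
    return {
        user: {reason: n / sum(c.values()) for reason, n in c.items()}
        for user, c in per_user.items()
    }

features = reason_features(votes)
print(features["user_a"])  # trolling-heavy downvote profile
```

A user whose downvotes are mostly “trolling” rather than “disagree” is a very different signal, which is exactly what a flat banned/non-banned label hides.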


This paper studies banned posts from three large communities, but different communities have different guidelines; what kind of guideline can be generalized across all communities?

Is antisocial behavior or language, as the main criterion for banning users, consistent across all the cases discussed here? How can it be verified or pre-processed?

For CNN, I have the impression that users tend to visit the website based on their political background. We also see a higher percentage of reported posts compared to websites like IGN, whose users are less “categorized”. Will the nature of the website influence how users behave? (Surely it will, but this might be something noteworthy.)
