Reflection #10 – [03/22] – [Meghendra Singh]

  1. Kumar, Srijan, et al. “An army of me: Sockpuppets in online discussion communities.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  2. Lee, Kyumin, James Caverlee, and Steve Webb. "Uncovering social spammers: Social honeypots + machine learning." Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2010.

In the first paper, Kumar et al. study "sockpuppets" in nine discussion communities (a majority of them news websites). The study shows that sockpuppets behave differently from ordinary users, and that these differences can be used to identify them. Sockpuppet accounts exhibit distinctive posting behavior: they don't usually start discussions, and their posts are generally short and contain certain linguistic markers (e.g., greater use of first-person pronouns such as "I"). The authors begin by automatically labeling 3,665 users across the nine communities as sockpuppets, using their IP addresses and user session data. They identify two types of sockpuppets: pretenders (who pretend to be legitimate users) and non-pretenders (who are easily identifiable as fake by the discussion community). The two main results of the paper are distinguishing sockpuppet user pairs from ordinary user pairs (ROC AUC = 0.91) and predicting whether an individual user is a sockpuppet (ROC AUC = 0.68), using activity, community, and linguistic (post) features. The paper does a good job of explaining how sockpuppets behave in the comments sections of typical news websites and how these behaviors can be used to detect them, thereby helping maintain healthy and unbiased online discussion communities. The paper references a lot of prior work, and I really appreciate that most of the decisions about features, parameters, and other assumptions made in the study are grounded in past literature. While reading the paper, a fundamental question came to mind: if we can already identify sockpuppets using IP addresses and the temporal features of their comments, what is the point of using predictive modeling to differentiate sockpuppets from ordinary users? In essence, if we already have a high-precision, rule-based approach to detect sockpuppets, why rely on predictive modeling that performs only modestly better than random chance (ROC AUC = 0.68)?
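To make the prediction setup concrete, here is a minimal sketch (my own, not the authors' code) of a random forest trained on a few hypothetical activity and linguistic features and evaluated with ROC AUC, as in the paper. All feature names and data below are made up for illustration.

```python
# Minimal sketch of sockpuppet prediction, assuming hypothetical features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per user. Columns might be
# posts_per_thread, fraction_of_threads_started, mean_post_length,
# and first_person_pronoun_rate (stand-ins for the paper's features).
n_users = 1000
X = rng.random((n_users, 4))
y = rng.integers(0, 2, n_users)  # 1 = sockpuppet, 0 = ordinary (toy labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# The paper reports ROC AUC, so evaluate the same way.
scores = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, scores))
```

With random labels this prints an AUC near 0.5; the point is only to show the shape of the pipeline, not to reproduce the paper's numbers.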

I found the sockpuppet-ordinary user conversation example at the end of Section 2 really funny, and I feel that the first comment itself is rather suspicious. The example also seems to indicate that the puppetmaster (S2) is the author of the article on which these comments are being posted. This raises a question: given that a puppetmaster has multiple sockpuppet accounts, would their main account be considered an ordinary user? If not, does this mean that some of the articles themselves are being written by sockpuppets? A research question in this context could be: "detecting news articles written by sockpuppets on popular news websites." Another question I had was why the authors used cosine similarity between users' feature vectors, and what the statistics for this metric look like (e.g., the mean and standard deviation of cosine similarities between sockpuppet and ordinary-user feature vectors). Additionally, could a bag-of-words model be used here, instead of numeric features like LIWC and ARI computed from users' posts? Moreover, there is potential to experiment with other classification techniques and see whether they can perform better than Random Forest.
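To pin down the two quantities I am asking about, here is a small sketch, with made-up inputs, of cosine similarity between two users' feature vectors and of the Automated Readability Index (ARI). The ARI formula is the standard one; the character and sentence counting is a rough approximation, not the paper's exact procedure.

```python
# Sketch of cosine similarity and ARI, assuming toy inputs.
import numpy as np

def cosine_similarity(u, v):
    # Angle-based similarity in [-1, 1]; insensitive to vector magnitude,
    # which may be why it was preferred over a raw distance measure.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def automated_readability_index(text):
    # Standard formula: ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43
    words = text.split()
    chars = sum(len(w.strip(".,!?")) for w in words)
    sentences = max(1, text.count(".") + text.count("!") + text.count("?"))
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43

u = np.array([0.2, 3.0, 15.0])  # hypothetical LIWC/activity feature vectors
v = np.array([0.3, 2.5, 14.0])
print(cosine_similarity(u, v))
print(automated_readability_index("I said it. I meant it. I wrote it."))
```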

Lastly, as the authors suggest in the discussion and conclusion, it would be interesting to repeat this experiment on big social platforms like Facebook and Twitter. This is especially important today, when online social communities are rife with armies of sockpuppets, spambots, and astroturfers hell-bent on manipulating public opinion, scamming innocent users, and enforcing censorship.

The second paper, by Lee et al., addresses the related problem of detecting spammers on MySpace and Twitter using social honeypots and classifiers. The study presents an elegant infrastructure for capturing potential spammer profiles, extracting features from these profiles, and training popular classifiers to detect spammers with high accuracy and a low false positive rate. The most interesting findings for me were the most discriminative features for separating spammers from legitimate users (i.e., the "About Me" text and the number of URLs per tweet), and the fact that ensemble classifiers (Decorate, etc.) performed best. Given that deep learning was not really popular in 2010, it would be interesting to apply state-of-the-art deep learning techniques to the classification problem discussed in this paper. Since the discriminative features that separate spammers from regular users vary from one platform/domain to another, it would also be interesting to see whether there exist common cross-platform, cross-domain (universal) features that are equally discriminative. Although MySpace may not be dead, it would be interesting to redo this study on Instagram, which is much more popular now and has a very real spammer problem. Based on personal experience, I have observed legitimate users on Instagram become spammers once they have enough followers. Would a social-honeypot-based approach work for detecting such users? Another challenge with detecting spam (or spammers) on a platform like Instagram is that most of the spam comes in the form of stories (posts that automatically disappear after 24 hours), while the profiles themselves may look completely ordinary.
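As a toy illustration of one of the discriminative features mentioned above, here is a short sketch (my own, not from the paper) computing the average number of URLs per tweet; the example tweets and the URL pattern are made up, and in practice this value would be one column in a larger feature matrix fed to the classifiers.

```python
# Sketch of the "number of URLs per tweet" feature, assuming toy tweets.
import re

URL_RE = re.compile(r"https?://\S+")

def urls_per_tweet(tweets):
    # Spammers tend to pack links into posts, so a high ratio is suspicious.
    return sum(len(URL_RE.findall(t)) for t in tweets) / max(1, len(tweets))

legit = ["good morning everyone", "great game last night!"]
spammy = ["free followers http://x.example http://y.example",
          "click now http://z.example"]

print(urls_per_tweet(legit))   # 0.0
print(urls_per_tweet(spammy))  # 1.5
```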
