This pair of papers describes aspects of those who ruin the Internet for the rest of us. Kumar’s “An Army of Me” paper discusses the characteristics of sockpuppets in online discussion communities (as an aside, the term “sockpuppet” never really clicked for me until I saw its connection with “puppetmaster” in the paper’s introduction). Looking at nine discussion communities, the authors evaluate the posting behavior, linguistic features, and social network structure of sockpuppets, eventually using those characteristics to build a classifier that achieves moderate success in identifying sockpuppet accounts. Lee’s “Uncovering Social Spammers” paper uses a honeypot technique to identify social spammers (spam accounts on social networks). The authors deploy honeypots on both MySpace and Twitter and capture the profiles of the spammers they attract, examining some of the same characteristics as Kumar’s paper (social network structure and posting behavior). They then build classifiers for both MySpace and Twitter using the features uncovered by their honeypots.
Given the discussion that we had previously when reading the Facebook papers, the first thing that jumped out at me in the results of the “Army of Me” paper was the small effect sizes, especially in the linguistic traits subsection. Again, many of these results carry strong p-values (p < 0.001) yet reflect minute differences, such as the rates of using words like “I” (0.076 vs. 0.074) and “you” (0.017 vs. 0.015). Though the authors don’t specifically call out their effect sizes, they do provide the means for each class and should be applauded for that. (They also reminded me to leave a note in my midterm report to discuss effect sizes.)
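To make that point concrete, here is a quick back-of-the-envelope check. The word rates are the paper’s; the per-group sample size is my own assumption, chosen to mimic a large comment corpus. It shows how a large n yields p < 0.001 even when the standardized effect size is tiny:

```python
import math

p1, p2 = 0.076, 0.074      # per-word rates of "I" (sockpuppet vs. ordinary), from the paper
n = 1_000_000              # hypothetical number of words per group (my assumption)

# Two-proportion z-test: |z| > 3.29 corresponds to p < 0.001 (two-tailed)
p_pool = (p1 + p2) / 2
se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se

# Cohen's h, a standard effect size for comparing two proportions
h = 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

print(f"z = {z:.2f}, Cohen's h = {h:.4f}")
```

With a million words per group, the z-statistic clears the p < 0.001 bar while Cohen’s h sits far below the conventional 0.2 cutoff for even a “small” effect; this is exactly the pattern that makes reporting effect sizes alongside p-values so important.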
One limitation of “Army of Me” that was not discussed is that all nine communities they evaluated use Disqus as their commenting platform. While this made it easier for the authors to acquire their (anonymized) data, Disqus may have safety checks or other mechanisms built in that bias the characteristics of the sockpuppets that appear there. Pursuing some of their proposed future work, such as studying the Facebook and 4chan communities, would have strengthened their results by covering more than one platform.
“Army of Me” also reminded me of the drama from several years ago around the reddit user unidan, the “excited biologist,” who was banned from the community for vote manipulation. He used sockpuppet accounts to upvote his own posts and downvote other responses, thereby inflating his own reputation on the site.
Besides identifying MySpace as a “growing community” in 2010, I thought that the “Uncovering Social Spammers” paper was a mostly solid and concise piece of research. The use of a human-in-the-loop approach, in which humans validate spam candidates to improve the SVM classifier, appealed to the human-in-the-loop researcher in me. Some of the findings from their honeypot data collection were interesting, such as the fact that Midwesterners are popular spamming targets and that California is a popular profile location. I wonder whether these patterns indicate some bias in the data collection (is the social honeypot technique biased toward picking up spammers from California?) or an actual tendency of spam accounts to pick California as a profile location. This wasn’t particularly clear to me; the observation was simply stated and then ignored.
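As a sketch of what that feedback loop looks like (the toy threshold “classifier,” the scores, and the simulated validator here are all my own invention; the paper actually trains an SVM over profile features):

```python
# Human-in-the-loop sketch: classifier flags candidates -> human validates ->
# confirmed labels are added to the training set and the classifier is refit.

def fit_threshold(examples):
    """Toy stand-in for SVM training: learn a score threshold separating
    labeled spam (1) from legitimate (0) profiles."""
    spam = [score for score, label in examples if label == 1]
    ham = [score for score, label in examples if label == 0]
    return (min(spam) + max(ham)) / 2

def human_validate(candidate_score):
    """Simulated human reviewer; in the real pipeline this is manual review."""
    return 1 if candidate_score > 0.7 else 0  # assumed ground truth

training = [(0.9, 1), (0.8, 1), (0.2, 0), (0.3, 0)]  # (spam score, label)
unlabeled = [0.75, 0.55, 0.65]

threshold = fit_threshold(training)
for score in unlabeled:
    if score > threshold:                    # classifier flags a candidate
        label = human_validate(score)        # human confirms or rejects
        training.append((score, label))      # feed the verdict back in
        threshold = fit_threshold(training)  # retrain with the new label
```

The design point is simply that human verdicts feed back into training, so the decision boundary tightens as candidates are reviewed.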
I really liked their use of both MySpace and Twitter, as the two social networks enabled the collection of different features (e.g., F-F ratio for Twitter, number of friends for MySpace) and showed that the approach works on multiple datasets. It’s almost midnight and I haven’t slept enough this month, but I’m still puzzled by the confusion matrix they present in Table 1. Did they intend to leave variables in that table? If so, it doesn’t really add much to the paper, as it just restates the standard definitions of precision, recall, and false positive rate. They don’t present any other confusion matrices in the paper, which makes it seem even more out of place.
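For reference, the quantities that table seems to define symbolically fall straight out of any concrete confusion matrix; the counts below are invented purely for illustration:

```python
# Invented counts for a hypothetical spam classifier's confusion matrix
tp, fp = 90, 10    # accounts flagged as spam: truly spam / actually legitimate
fn, tn = 5, 895    # accounts passed as legitimate: missed spammers / correct passes

precision = tp / (tp + fp)  # of flagged accounts, the fraction that are truly spam
recall    = tp / (tp + fn)  # of true spammers, the fraction that were caught
fpr       = fp / (fp + tn)  # of legitimate users, the fraction wrongly flagged

print(f"precision = {precision:.3f}, recall = {recall:.3f}, FPR = {fpr:.3f}")
```

With actual counts in the cells, the matrix would tell the reader something about the classifier’s behavior; with variables, it only restates the definitions.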