Kumar, Srijan, et al. “An Army of Me: Sockpuppets in Online Discussion Communities.” Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017.
Summary:
Sockpuppets, in reference to this paper, are online discussion accounts belonging to the same user, who is referred to here as the “puppetmaster.” Kumar et al.’s study of sockpuppets in online communities examines the posting behavior, interactions, and linguistic characteristics of sockpuppets, culminating in a predictive model. They observe that sockpuppets tend to comment more, write shorter posts, use more personal pronouns such as “I,” and are more likely to interact with each other. Finally, in the predictive task, the authors find that activity, community, and post features are the most relevant for detecting sockpuppets, reaching 0.68 AUC. Here are some thoughts on the data collection, method, and impact of this work:
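To make the prediction setup concrete, here is a minimal sketch of what such a feature-based detector might look like; the specific feature columns, the toy data, and the random-forest choice are my assumptions for illustration, not the paper’s exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-account features, loosely following the paper's
# activity / community / post feature groups; the exact columns are
# my guesses: posts_per_day, mean_post_length, frac_replies,
# frac_downvoted, frac_first_person_pronouns
X = np.array([
    [12.0,  45.0, 0.80, 0.30, 0.09],  # sockpuppet-like toy account
    [ 1.5, 120.0, 0.20, 0.05, 0.03],  # ordinary-looking toy account
    [ 9.0,  60.0, 0.70, 0.25, 0.08],
    [ 2.0, 150.0, 0.15, 0.02, 0.02],
])
y = np.array([1, 0, 1, 0])  # 1 = sockpuppet, 0 = ordinary

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# On real labeled data one would report cross-validated ROC AUC
# (sklearn.metrics.roc_auc_score) to compare with the paper's 0.68.
print(clf.predict_proba(X)[:, 1])  # per-account sockpuppet probability
```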
Why is this research even important? Even though this paper has excellent technical and analytical aspects, I believe there should have been more emphasis on why sockpuppetry is harmful in the first place.
“In 2011, a California company called Ntrepid was awarded a $2.76 million contract from US Central Command for ‘online persona management’ operations to create ‘fake online personas to influence net conversations and spread US propaganda’ in Arabic, Persian, Urdu and Pashto.” (Wikipedia)
I found some more reasons that I think are important for situating this research in the context of community betterment:
- Bypassing a ban by creating another account (mentioned in the paper)
- Sockpuppeting during an online poll to submit multiple votes in favor of the puppetmaster
- Endorsing a product by writing multiple favorable reviews
- Manufacturing apparent public support for a policy or candidate by sheer numbers
How to build a better ground truth? One obvious point of contention with this paper is the way the data is collected and labeled as sockpuppet accounts. There is no solid validation that the selected accounts are actually sockpuppets. The authors mention that they applied conservative filters when selecting sockpuppet accounts, but this also means they may have missed a significant number of true sockpuppets. So what can be done to build a better ground truth?
- Building a strong “anti-ground truth”. There are performance comparisons between sockpuppets and ordinary users throughout the paper. If the sampled list of ordinary accounts had been vetted more rigorously (i.e., if there were a stronger “anti” group), the comparisons would have been more telling. One way to do this is to collect pairs of accounts that posted from different IPs or locations at the exact same time, since such pairs almost certainly belong to different people (see the sketch after this list).
- Asking the discussion communities themselves to identify sockpuppets. Even though this seems harder, it could form a very strong ground truth and validation point.
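As a concrete illustration of the IP/timestamp filter suggested above, here is a minimal sketch; the post record schema and the five-second simultaneity window are my assumptions.

```python
from itertools import combinations

# Hypothetical post records: (account_id, ip_address, unix_timestamp).
posts = [
    ("alice", "203.0.113.7",  1_600_000_000),
    ("bob",   "198.51.100.4", 1_600_000_002),
    ("carol", "203.0.113.7",  1_600_000_900),
]

SIMULTANEITY_WINDOW = 5  # seconds; an assumed threshold

def non_sockpuppet_pairs(posts, window=SIMULTANEITY_WINDOW):
    """Pairs of accounts that posted from *different* IPs at (nearly)
    the same time; such pairs are very unlikely to share a puppetmaster,
    so they form a strong 'anti-ground truth' of ordinary-user pairs."""
    pairs = set()
    for (a1, ip1, t1), (a2, ip2, t2) in combinations(posts, 2):
        if a1 != a2 and ip1 != ip2 and abs(t1 - t2) <= window:
            pairs.add(tuple(sorted((a1, a2))))
    return pairs

print(non_sockpuppet_pairs(posts))  # {('alice', 'bob')}
```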
Lastly, there are several comparisons between pairs of sockpuppets and pairs of ordinary users. I am not sure whether the ordinary-user measure was a normalized aggregate over all ordinary pairs. In any case, instead of comparing sockpuppet-pair activity with generic pairwise activity, it would be better to compare against pairs of ordinary users that have some prior likelihood of interaction (e.g., same discussion, location, or posting time); a sketch of such matched sampling follows. Also, when comparing pretenders with non-pretenders, it would be beneficial to include ordinary users as a ground-truth baseline.
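To illustrate the matched-comparison idea, here is a minimal sketch that samples ordinary pairs sharing at least one discussion thread; the record format and the matching key are my assumptions, and the same idea extends to location or posting-time buckets.

```python
import random
from itertools import combinations

# Hypothetical activity records: account_id -> set of discussion ids.
discussions_by_account = {
    "u1": {"d1", "d2"},
    "u2": {"d1"},
    "u3": {"d3"},
    "u4": {"d2", "d3"},
}

def matched_ordinary_pairs(discussions_by_account, n_samples=2, seed=0):
    """Sample ordinary-user pairs that co-occur in at least one
    discussion, so the baseline pairs have a plausible chance of
    interacting (unlike two users picked uniformly at random)."""
    candidates = [
        (a, b)
        for a, b in combinations(sorted(discussions_by_account), 2)
        if discussions_by_account[a] & discussions_by_account[b]
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_samples, len(candidates)))

print(matched_ordinary_pairs(discussions_by_account))
```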
In the discussion, the authors note that not all sockpuppets are malicious. Further research could focus on finding the characteristics of specifically malicious sockpuppets, the online deception “artists”!