An Army of Me: Sockpuppets in Online Discussion Communities
People interact on the Internet mainly through the discussion mechanisms provided by social networks such as Facebook and Reddit. However, sockpuppets created by malicious users degrade these communities by engaging in undesired behavior such as deceiving others or manipulating discussions. Kumar et al. study sockpuppetry across nine discussion communities. They first identify sockpuppets using multiple signals indicating that accounts might share the same user, and then characterize the behavior of these accounts from several angles. They find that sockpuppets differ from ordinary users in many ways: for example, they start fewer discussions, write shorter posts, and use more personal pronouns such as “I”. By presenting a data-driven view of deception in online communities, the study contributes towards the automatic detection of sockpuppets.
The strategy for identifying sockpuppets is inspired by Wikipedia administrators, who identify sockpuppets by finding accounts that make similar edits to the same Wikipedia article at nearly the same time and from the same IP address, which makes sense. But for the hyper-parameter, the top percentage (5%) of most-used IP addresses, is there a better strategy that sets the percentage numerically rather than intuitively? When measuring the linguistic traits of sockpuppets, LIWC word categories are used to measure the fraction of each type of word across all posts, and VADER is used for the sentiment of posts. So far, my impression is that LIWC is powerful and heavily used in social science research; I have never used VADER before. In the double-life experiment, although they match sockpuppets with ordinary users who have similar posting activity and participate in similar discussions, I feel there is too much uncertainty in the linguistic features of ordinary users, since different users have different writing styles. The cosine similarity between the feature vectors of each account would then be less convincing.
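To make the cosine-similarity comparison concrete for myself, here is a minimal sketch of how word-category fractions could be turned into a feature vector per account and then compared. It is not the authors' pipeline: the `CATEGORIES` dictionary is a tiny hypothetical stand-in for LIWC (whose real lexicon is far larger), and the function names and tokenization are my own assumptions.

```python
import math

# Hypothetical stand-in for LIWC-style word categories (assumption, not the real lexicon).
CATEGORIES = {
    "first_person": {"i", "me", "my", "mine", "myself"},
    "second_person": {"you", "your", "yours"},
    "negation": {"no", "not", "never"},
}


def category_fractions(posts):
    """Return, for each category, the fraction of all words in `posts` that belong to it."""
    # Naive whitespace tokenization with surrounding punctuation stripped.
    words = [w.strip(".,!?;:\"'").lower() for post in posts for w in post.split()]
    words = [w for w in words if w]
    if not words:
        return [0.0] * len(CATEGORIES)
    return [sum(w in vocab for w in words) / len(words) for vocab in CATEGORIES.values()]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy usage: two accounts, each represented by its list of posts.
account_a = ["I never said that.", "My view is not yours."]
account_b = ["You should check your sources."]
similarity = cosine_similarity(category_fractions(account_a), category_fractions(account_b))
```

With such short vectors, a single stylistic habit of one ordinary user can move the similarity a lot, which is the source of my concern about the matched ordinary-user baseline.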
Uncovering Social Spammers: Social Honeypots + Machine Learning
Both web-based social networks (e.g., Facebook, MySpace) and online social media sites (e.g., YouTube, Flickr) rely on their users as primary contributors of content, which makes them prime targets of social spammers. Social spammers engage in undesirable behavior such as phishing attacks and the dissemination of malware and commercial spam messages, which seriously degrades the user experience. Lee et al. propose a honeypot-based approach for uncovering social spammers in online social systems by harvesting deceptive spam profiles from social networking communities and creating spam classifiers to actively filter out existing and new spammers. The machine learning based classifier is able to identify previously unknown spammers with high precision and a low rate of false positives.
The section on the machine learning based classifier impressed me a lot, since it shows how to investigate the discrimination power of individual classification features rather than only evaluating the effectiveness of the classifiers; the ROC curve plays an important role in this. Also, AMContent, the text-based features modeling user-contributed content in the “About Me” section, shows me how to use more complex text features besides simple data such as age, marital status, and gender. I had never heard of MySpace before, but there is also a Twitter experiment; otherwise I would have thought this a weird choice of dataset. For Twitter spam classification, we can clearly see differences in how account features are collected, since Twitter accounts are noted for their short posts, activity-related features, and limited self-reported user demographics. This is a reminder that feature design must vary with the subject of study.
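As a note to myself on what "discrimination power of an individual feature" can mean in practice, the sketch below scores each feature on its own by the area under its ROC curve, computed as the probability that a randomly chosen spam profile has a higher feature value than a randomly chosen legitimate one. This is only one plausible way to do it and not necessarily the procedure used in the paper; the feature names and toy values are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve for one feature, via pairwise comparison.

    Equals the probability that a random positive (label 1) has a higher score
    than a random negative (label 0), counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_features(features, labels):
    """Sort feature names by how far their single-feature AUC is from chance (0.5)."""
    scored = {name: auc(values, labels) for name, values in features.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1] - 0.5), reverse=True)


# Toy data (hypothetical): 1 = spam profile, 0 = legitimate profile.
labels = [0, 0, 1, 1]
features = {
    "num_friends": [10, 12, 300, 400],
    "age": [25, 30, 25, 30],
}
ranking = rank_features(features, labels)
```

An AUC near 0.5 means the feature alone separates the classes no better than chance, while values near 0 or 1 mean it separates them well in one direction or the other, which is why the ranking uses the distance from 0.5.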