Reflection #13 – [04/10] – [John Wenskovitch]

This pair of papers evaluates the prediction of product sales based on both linguistic and quantitative aspects of online content.  In the Pryzant et al. paper, the authors looked specifically at product descriptions.  They obtained more than 93,000 health and chocolate product descriptions from the website Rakuten in order to evaluate these descriptions linguistically, expanding on previous studies that examined only summary statistics.  Using a neural network that controls for confounding features (pricing, brand loyalty, etc.), they identify a set of words and writing styles that have a high impact on sales outcomes.  In contrast, the Hu et al. paper examines the influence of online reviews.  Rather than performing a linguistic analysis, they instead examine features such as the quality of reviewers and the age of an item (number of reviews).

I did really enjoy reading through the Pryzant paper.  The thorough explanation of the neural network mathematics really helped to make it clear what the authors were doing, and the experiments section was clear and well-written.  I think my biggest criticism of the paper is that, if you strip away all of this explanation, it doesn’t feel like the authors did all that much.  They extended a neural network to meet their feature selection goals, tokenized two different datasets, ran the model, and reported a few results.  This area of research is certainly not my area of expertise, but this feels like a single-research-question workshop paper or class project.  The class project my group is building has 3 (arguably 4) distinct research goals.

Beyond that, the authors don’t spend much time discussing the lack of general cultural applicability of their findings.  They note the extensibility of the project to a general lexicon near the very end of the conclusion, and that’s about it.  There is no indication of how these results are applicable to any language/culture outside of Japanese/Japan.  Additionally, their “seasonality” result seems to me to be too close to some of the confounding variables that the authors wanted to eliminate.  Is there really that big of a difference between marketing a product with “free shipping!” in the description and marketing a seasonal item with “great Christmas gift!” in the same place?

Two stylistic criticisms for the Hu paper:  (1) I think it could have been better organized by grouping the hypothesis and result of each research question together, rather than having separate hypothesis and result sections (and I feel the same way about our class project final report).  I frequently found myself paging back and forth between results, hypotheses, and the background that led to those hypotheses.  (2) I was intrigued by the tabular related-work approach.  I can see it being useful in well-developed fields and for survey papers.  However, in more recent and novel research, this approach makes it difficult to understand the novelty of the work performed by the authors.  It’s more of a list of past results than an explanation of the authors’ contributions.


Reflection #12 – [04/05] – [John Wenskovitch]

This pair of papers returns our class discussions to linguistic analyses, including both sentiment detection using emoji (Felbo et al.) and classifying online communities (Nguyen et al.).  The emoji paper (not to be confused with the Emoji Movie) authors build a “DeepMoji” supervised learning model to classify the emotional sentiment conveyed in tweets with embedded emoji.  Using an immense multi-billion tweet dataset (curated down to just over a billion), the authors build and experiment with their classifier, finding that the rich diversity of emotional labels in the dataset yields performance improvements over previous emotion supervised learning studies.  The depression paper examined linguistic features of mental health support communities on Live Journal, seeking to understand some of the relationships present between distinct communities (such as the Depression and Suicide groups).  In addition to very detailed results, the authors clearly discuss their results and the limitations of their study.

The emoji paper was a tad difficult for me to read, in part because it focused so heavily on the ML approaches used to address this emotional sentiment challenge, and in part because I’m just not a person who uses emoji.  From my limited understanding, much of their motivation appeared sound.  The one thing that I wasn’t certain about was their decision to take tweets with multiple instances of the same emoji and reduce them to a single instance of that emoji.  I have seen tweets that use a single cry-smile to convey a slightly different but still related emotion than tweets that use twelve cry-smiles.  In the text communication world, I see it as the difference between “lol” and “hahahahahaha” replies.  I’m curious how the performance of their classifier would have changed if they had taken the semantics of multiple emoji into account.
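To make the preprocessing question concrete, here is a toy sketch (my own, not the paper’s code) contrasting the collapse-to-one-instance step as I understand it with an alternative that also records the repetition count as a feature.  The single emoji and the regex are simplifications.

```python
import re

EMOJI = "😂"  # single example emoji; a real pipeline would cover the full emoji range

def collapse_repeats(text):
    """Roughly what the paper describes: keep one instance per run of the emoji."""
    return re.sub(f"{EMOJI}+", EMOJI, text)

def repeat_count(text):
    """An alternative feature: the length of the longest run of that emoji."""
    runs = re.findall(f"{EMOJI}+", text)
    return max((len(run) for run in runs), default=0)

tweet = "that meeting 😂😂😂😂😂😂"
print(collapse_repeats(tweet))  # "that meeting 😂"
print(repeat_count(tweet))      # 6: the "hahahahaha" signal the collapse discards
```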

That said, their dendrogram (Fig. 3) showing the clustering of the DeepMoji model predictions contained some interesting relationships between pairs and sets of emoji.  For example, the various heart emoji at the right end appear in several different subgroups, with a few “bridge” emoji in between to connect those subgroups.  That isn’t an outcome that I was expecting.  For the most part, though, happy emoji were self-contained in their own group, as were clusters that I’ll call sad emoji, celebratory emoji, and silly emoji.

My biggest criticism of the depression paper is the same theme that I’ve been suggesting all semester – getting all of your data from a single source introduces implicit biases into the results that you may not be aware of.  In the case of this study, all of the data came from Live Journal communities.  Having never belonged to that website, I cannot speak for what features could cause problematic biases.  However, I can suggest possibilities like comment moderation as being one dimension that could cause the linguistic features of these communities to differ between Live Journal and other community hubs.  Though the authors provided a page of limitations, this was not one of them.

I also liked that the authors compared their Lasso classification against three other classifiers (Naïve Bayes, SVM, and Logistic Regression) and reported results across all four.  I’m a big proponent of trying multiple classification techniques and determining which one works best (and then going back to the data and trying to understand why).
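As a minimal sketch of that try-several-classifiers workflow (not the paper’s actual features or code), something like the following scikit-learn comparison would do, with synthetic data standing in for the linguistic feature vectors and an L1-penalized logistic regression standing in for the Lasso classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic placeholder data standing in for the paper's linguistic feature vectors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "Lasso-style (L1 logistic)": LogisticRegression(penalty="l1", solver="liblinear"),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```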


Reflection #11 – [03/27] – [John Wenskovitch]

This pair of papers falls under the topic of censorship in Chinese social media.  King et al.’s “Reverse-Engineering Censorship” article takes an interesting approach towards evaluating censorship experimentally.  Their first stage was to create accounts on a variety of social media sites (100 total) and send messages worldwide to see which messages were censored and which were untouched.  Accompanying this analysis are interviews with confidential sources, as well as the creation of their own social media site by contracting with Chinese firms and then reverse-engineering their software.  Using their own site gave the authors the ability to understand more about posts that are reviewed and censored and accounts that are permanently blocked, which could not be done through typical observational studies.  In contrast, in the “Algorithmically Bypassing Censorship” paper, the authors make use of homophones of censored keywords in order to get around detection by keyword-matching censorship algorithms.  Their process, a non-deterministic algorithm, still allows native speakers to recover the meaning behind almost all of the original untransformed posts, while also allowing the transformed posts to survive 3x longer than their censored counterparts.

Regarding the “Reverse-Engineering” paper, one choice in their first stage that puzzled me was the decision to submit all posts between 8AM and 8PM China time.  While it wasn’t the specific goal of their research, submitting some after-hours posts could generate interesting information about just how active the censorship process is in the middle of the night.  That includes all of the potential branches – censored after posting, censored after being held for review, and accounts blocked.

From their results, I’m not sure which part surprised me more:  that 63% of submissions that go into review are censored, or that 37% that go into review are not censored and eventually get posted.  I guess I need more experience with Chinese censorship before settling on a final feeling.  It seems reasonable that automated review will capture a fair number of innocuous posts that will later be approved, but 37% feels like a high number.  Their note that a variety of technologies are used in this automated review process would imply high variability in the accuracy of the automated review system, and so a large number of ineffective solutions could explain why 37% of submissions are released for publication after review.  On the other hand, the authors chose to make a number of posts about hot-button (“collective action”) issues, which is the source of my surprise regarding the 63% number.  Initially I would have expected a higher number, because despite the fact that the authors submit both pro- and anti-government posts, I would suspect that additional censorship might be added in order to un-hot-button these issues.  Again, I need more experience with Chinese social media to get a better feeling of the results.

Regarding the “Algorithmically Bypassing” paper, I really enjoyed the methodology of taking an idea that activists are already using to evade censorship and automating it so that it can be used at scale by more users.  Without being particularly familiar with Mandarin, I suspect that creating such a solution is easier in Chinese than it would be in a language like English, which has fewer homophones.  However, it did remind me of the images that are shared frequently on Facebook that say something like “fi yuo cna raed tihs yuo aer ni teh tpo 5% inteligance” (generally seen with better scrambled letters in longer words, in which the first and last letters are kept in the correct position).
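Just to illustrate the general shape of the idea (this is my own toy sketch, not the authors’ algorithm, and the homophone table is invented and in English), a non-deterministic keyword-to-homophone substitution can be as simple as:

```python
import random

# Invented toy homophone table; the real system maps censored Mandarin keywords
# to characters with the same or similar pronunciation.
HOMOPHONES = {
    "censored": ["sensored", "senzored"],
    "protest": ["proh-test", "pro-tessed"],
}

def transform(post, p_swap=0.8):
    """Non-deterministically swap flagged keywords for a random homophone."""
    words = []
    for word in post.split():
        if word.lower() in HOMOPHONES and random.random() < p_swap:
            words.append(random.choice(HOMOPHONES[word.lower()]))
        else:
            words.append(word)
    return " ".join(words)

print(transform("the protest was censored"))
```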

I felt that the authors’ stated result that posts typically live 3x longer than an untransformed equivalent censored post was impressive until I saw the distribution in Figure 4.  A majority of the posts do appear to have survived with that 3x longer time statistic.  However, the relationship is much more prevalent for surviving 3 hours rather than 1, while many fewer posts exist in the part of the curve where a post survives for 15 hours rather than 5.  A case of giving a result that is accurate but also a bit misleading.


Reflection #10 – [03/22] – [John Wenskovitch]

This pair of papers describes aspects of those who ruin the Internet for the rest of us.  Kumar’s “An Army of Me” paper discusses the characteristics of sockpuppets in online discussion communities (as an aside, the term “sockpuppet” never really clicked for me until seeing its connection with “puppetmaster” in the introduction of this paper).  Looking at nine different discussion communities, the authors evaluate the posting behavior, linguistic features, and social network structure of sockpuppets, eventually using those characteristics to build a classifier which achieved moderate success in identifying sockpuppet accounts.  Lee’s “Uncovering Social Spammers” paper uses a honeypot technique to identify social spammers (spam accounts on social networks).  They deploy their honeypots on both MySpace and Twitter, capturing information about social spammer profiles in order to understand their characteristics, using some of the same characteristics as Kumar’s paper (social network structure and posting behavior).  These authors also build classifiers for both MySpace and Twitter using the features that they uncovered with their honeypots.

Given the discussion that we had previously when reading the Facebook papers, the first thing that jumped out at me when reading through the results of the “Army of Me” paper was the small effect sizes, especially in the linguistic traits subsection.  Again, these included strong p-values of p<0.001 in many cases, but also showed minute differences in the rates of using words like “I” (0.076 vs 0.074) and “you” (0.017 vs 0.015).  Though the authors don’t specifically call out their effect sizes, they do provide the means for each class and should be applauded for that.  (They also reminded me to leave a note in my midterm report to discuss effect sizes.)
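For context, a back-of-the-envelope effect size calculation shows how tiny these gaps are.  The means below are the ones reported for “I” usage; the standard deviations and group sizes are hypothetical placeholders, since I don’t have those values from the paper:

```python
import math

def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Cohen's d with a pooled standard deviation."""
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Rates of "I" usage reported in the paper; the SDs and group sizes are
# hypothetical placeholders chosen only to illustrate the scale of the gap.
d = cohens_d(0.076, 0.074, sd_a=0.05, sd_b=0.05, n_a=10000, n_b=10000)
print(f"Cohen's d ≈ {d:.3f}")  # a tiny effect, despite p < 0.001 at large n
```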

One limitation of “Army of Me” that was not discussed was the fact that all nine communities that they evaluated use Disqus as a commenting platform.  While this made it easier for the authors to acquire their (anonymized) data for this study, there may be safety checks or other mechanisms built into Disqus that bias the characteristics of sockpuppets that appear on that platform.  Some of their proposed future work, such as studying the Facebook and 4chan communities, might have made their results stronger.

“Army of Me” also reminded me of the drama from several years ago around the reddit user unidan, the “excited biologist,” who was banned from the community for vote manipulation.  He used sockpuppet accounts to upvote his own posts and downvote other responses, thereby inflating his own reputation on the site.

Besides identifying MySpace as a “growing community” in 2010, I thought that the “Uncovering Social Spammers” paper was a mostly solid and concise piece of research.  The use of a human-in-the-loop approach to obtain human validation of spam candidates to improve the SVM classifier appealed to the human-in-the-loop researcher in me.  Some of the findings from their honeypot data acquisition were interesting, such as the fact that Midwesterners are popular spamming targets and that California is a popular profile location.  I’m wondering if the fact that these patterns were seen is indicative of some bias in the data collection (is the social honeypot technique biased towards picking up spammers from California?), or if there actually is a trend in spam accounts to pick California as a profile location.  This wasn’t particularly clear to me; instead, it was just stated and then ignored.

I really liked their use of both MySpace and Twitter, as the two different social networks enabled the collection of different features (e.g., F-F ratio for Twitter, number of friends for MySpace) in order to show that the classifier can work on multiple datasets.  It’s almost midnight and I haven’t slept enough this month, but I’m still puzzled by the confusion matrix that they presented in Table 1.  Did they intend to leave variables in that table?  If so, it doesn’t really add much to the paper, as they’re just describing the standard definitions of precision, recall, and false positive rate.  They don’t present any other confusion matrices in the paper, so it seems even more out of place.
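For the record, those definitions amount to a few lines of arithmetic over the confusion-matrix cells; the counts in this sketch are invented, not taken from the paper:

```python
# Standard confusion-matrix metrics; these counts are invented for illustration.
tp, fp, fn, tn = 90, 10, 15, 885

precision = tp / (tp + fp)            # fraction of flagged accounts that are spammers
recall = tp / (tp + fn)               # fraction of spammers that get flagged
false_positive_rate = fp / (fp + tn)  # fraction of legitimate accounts flagged

print(f"precision={precision:.2f}, recall={recall:.2f}, FPR={false_positive_rate:.3f}")
```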


Reflection #9 – [02/22] – [John Wenskovitch]

This pair of papers examines the role of social media on aspects of healthcare, both attitudes toward vaccination and predicting depression.  In Mitra et al., the authors look at Twitter data to understand linguistic commonalities in users who are consistently pro-vaccine, consistently anti-vaccine, and those who transition from pro- to anti-vaccine.  They found that consistently anti-vaccine users are (my wording) conspiracy nutjobs who distrust government and also communicate very directly, whereas users who transition to anti-vaccine seem to be actively looking for that information, being influenced more by concerns about vaccination than by general conspiracy-mindedness.  The De Choudhury et al. paper also uses Twitter (along with mTurk) to measure social media behavioral attributes of depression sufferers.  They analyze factors such as engagement, language, and posting time distributions to understand which social media factors can be used to separate depressed and non-depressed populations.  Following this analysis, they build a ~70% accurate predictor for depression from social media signals.

My biggest surprise with the Mitra et al. paper was the authors’ decision to exclude a cohort of users who transition from anti-vaccine to pro-vaccine.  I understand the goals and motivations that the authors have presented, but it feels to me that research focused on understanding how best to bring these misguided fools back to reality is just as important as the other way around.  Understanding how to prevent others from diving into the anti-vaccine pit is also clearly useful research, but I’d be more interested in reading a study that gives recommendations for rehabilitation rather than prevention, as well as one that simply displays what topics are commonly found in discussions around the time that these users return to sanity.  I guess it’s a bit late to propose a new class project now, but this really interests me.

Going beyond the linguistic and topical analysis, I’d also be curious to run a network analysis study on this dataset.  Twitter affords unidirectional relationships, where an individual follows another user with no guarantee of reciprocation.  This leads to interesting research questions such as (1) if a prominent member of the anti-vaccine community follows me back, am I more influenced to join the community?  (2) Is the interconnectedness of follow relationships within the anti-vaccine community stronger than in the general population?  (3) How long does it take for an incoming member of the anti-vaccine group to be indistinguishable from a long-time member with respect to the number/strength of these follow relationships?
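Question (2) in particular would be straightforward to operationalize; a sketch of the comparison (with a randomly generated follow graph standing in for real Twitter data, and an arbitrary node set standing in for the community) might look like:

```python
import networkx as nx

# Randomly generated directed "follow" graph standing in for real Twitter data.
G = nx.gnp_random_graph(200, 0.03, directed=True, seed=42)

# Suppose the first 40 nodes were identified as the anti-vaccine community.
community = set(range(40))
subgraph = G.subgraph(community)

print("overall follow density:", nx.density(G))
print("within-community follow density:", nx.density(subgraph))
# Question (2): is the community's internal follow density higher than baseline?
```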

As a depression sufferer myself, the De Choudhury et al. paper was also a very interesting read.  I paused in reading the paper to score myself on the revised version of the CES-D test cited, and the result was pretty much what I expected.  So there’s one more validation point to demonstrate that the test is accurate.

I thought it was interesting that the authors acquired their participants via mTurk instead of going through more “traditional” routes like putting up flyers in psychiatrist offices.  There’s certainly an advantage to getting a large number of participants easily through computational means, and the authors did work hard to ensure that they restricted their study to quality participants, but I’m still a bit wary about using mTurkers for a study.  This is especially true in this case, where the self-reporting nature of mTurk is going to stack with the self-reporting nature of depression.  Using public Twitter data from these users clearly helps firm up their analysis and conclusions, but my wariness about taking this route in a study of my own hasn’t faded since reading through the paper.


Reflection #8 – [02/20] – [John Wenskovitch]

This pair of papers explores how the behavior of users can propagate to other users through social media.  In the Bond et al. study, the authors measured the influence of voting information and social sharing across a Facebook friend network.  Users were assigned to one of three groups:  a control group with no additional voting information, an informational message group with links to polling places and a global voting tally, and a social message group who received the same information as the informational message group plus profile pictures of friends who said they had voted.  The researchers found that both the informational message group and the social message group outperformed the control group in voting influence when measured only by clicks of the “I voted” button.  When examining validated voting records, the difference between the social message and control groups persisted, while the difference between the informational message and control groups disappeared.  Further, the social message group greatly outperformed the informational message group under both measures.  In total, this experiment generated thousands of additional votes.  In the Kramer et al. study, the authors manipulated the news feeds of Facebook users to change the amount of positive-leaning or negative-leaning posts that a user sees, measuring whether that user is likely to be influenced by the biased mood of their news feed.  The researchers found that emotionally-biased news feeds are contagious, and that text-only communication of emotional state is possible; non-verbal cues are not necessary to influence mood.

I was glad to see that the Bond study validated their findings with public voting records, as it’s certainly reasonable to assume that a Facebook user might see many of their friends voting and click the “I voted” button as well for the social credibility.  It was certainly interesting to see the change in results, from a 2% boost between the social message group and control group when measuring button clicks vs. the 0.39% boost through voter record validation.  I also didn’t expect that the informational message would have no influence in the voting-validated data; I would expect at least some increase in voting rate, but that’s not what the researchers found.

I took some issue with the positive/negative measurement of posts in the Kramer study.  The authors noted that a post was determined to be positive or negative if it contained at least one positive or negative LIWC word.  However, this doesn’t seem to take into account things like sarcasm.  For example, “I’m so glad that my friends care about me” contains two words that I expect to be positive (“glad” and “care”), but the post itself could certainly be negative overall if the intent was sarcastic.  I would expect this to affect some posts; obviously not enough of them to change the statistical significance of their results, but the amount of sarcasm and cynicism that I see from friends on Facebook can often be overwhelming.  Could the authors have gotten even stronger results with a better model to gauge whether a post is positive or negative?
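To make the concern concrete, the labeling rule as described reduces to a word-presence check, something like the sketch below.  The word lists here are tiny stand-ins of my own, not the actual LIWC dictionaries:

```python
# Tiny stand-in word lists; the real study uses the LIWC dictionaries.
POSITIVE = {"glad", "care", "love", "happy"}
NEGATIVE = {"sad", "hate", "angry", "alone"}

def label_post(text):
    """Label a post positive/negative if it contains at least one such word."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {
        "positive": bool(words & POSITIVE),
        "negative": bool(words & NEGATIVE),
    }

# Sarcasm slips straight through a word-presence rule:
print(label_post("I'm so glad that my friends care about me"))
# -> {'positive': True, 'negative': False}, even if the intent is bitter.
```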

I had never heard of Poisson regression before reading the Kramer paper, so I decided to look into it a bit further.  I presume that the authors chose this regression model because they hypothesized (or knew) that Facebook users’ post rates follow a Poisson distribution.  My understanding of the Poisson distribution is that it assumes the events being recorded are independent and occur at a constant rate; however, I feel that my own Facebook postings violate both of those assumptions.  My posts are often independent, but occasionally I’ll post several things on the same theme (like calling for gun control following a school shooting) in rapid succession.  Further, I’ll occasionally go a week or more without posting because of how busy my schedule is, whereas other times I’ll make multiple posts in a day.  My post distribution seems to be more bimodal than Poisson.  Can anyone fill in the gap in my understanding of why the authors chose Poisson regression?
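One simple sanity check, if I wanted to test this on my own posting history, would be to compare the mean and variance of daily post counts, since a Poisson process implies they should be roughly equal.  The counts below are hypothetical, invented just to illustrate the bursty pattern I described:

```python
import numpy as np

# Hypothetical daily post counts over a month: bursty days followed by silence.
daily_posts = np.array([0, 0, 0, 5, 4, 0, 0, 0, 0, 1, 0, 0, 6, 3, 0,
                        0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 5, 0, 0, 0])

mean, var = daily_posts.mean(), daily_posts.var(ddof=1)
print(f"mean={mean:.2f}, variance={var:.2f}, dispersion={var / mean:.2f}")
# A Poisson process implies variance ≈ mean; a dispersion ratio well above 1
# (as here) suggests overdispersion, i.e. the bursty pattern described above.
```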


Reflection #7 – [02/13] – [John Wenskovitch]

This paper describes a study that uses the online game Diplomacy to learn whether betrayal can be detected in advance through the choice of wording used in interactions between the betrayer and the victim.  After explaining the game and its interactions, the authors describe their methodology and begin listing their findings.  Of most interest to me were the findings that the betrayer was more likely to express positive sentiments before the betrayal, that an imbalance in the number of exchanged messages also plays a role, that future betrayers don’t plan as far ahead as their victims (based on linguistic analysis of future plans in-game), and that it was possible to computationally predict a betrayal in advance more accurately than humans could.

Much like some of the earlier papers in this course, I appreciated that the authors included descriptions of the tools that they used, including the Stanford Sentiment Analyzer and the Stanford Politeness classifier.  I don’t anticipate using either of those in our course project, but it is still nice to know that they exist for potential future projects.

The authors don’t argue that their findings are fully generalizable, but they do make a claim that their framework can be extended to a broad range of social interaction.  I didn’t find that claim well substantiated.  In Diplomacy, a betrayal is a single obvious action in which a pair of allies is suddenly no longer allied.  However, betrayals in many human relationships are often more nuanced than a single action, and often take place over longer timescales.  I’m not certain how well this framework will apply to such circumstances when much more than a few lines of text precede the betrayal.

I appreciated the note in the conclusion that the problem of identifying a betrayal is not a task that the authors expect to be solvable with high accuracy, as that would necessitate the existence of a “recipe” for avoiding betrayal in relationships.  I hadn’t thought about it that way when reading through their results, but it makes sense.  I wonder how fully that logic could be extended to other problems in the computational social science realm – how many problems are computationally unsolvable simply because solving them would violate some common aspect of human behavior?


Reflection #6 – [02/08] – [John Wenskovitch]

This pair of papers discusses computational mechanisms and studies for determining politeness, both in online communities (Danescu-Niculescu-Mizil et al.) and in interactions with law enforcement (Voigt et al.).  In the DNM et al. paper, the authors use requests from both Wikipedia and Stack Exchange to build a politeness corpus, using Mechanical Turkers to annotate these requests on a spectrum of politeness.  They then build two classifiers to predict politeness, testing the classifiers both in-domain (training and testing data from the same source) and cross-domain (training on Wikipedia and testing on Stack Exchange, and vice versa).  The Voigt et al. paper used transcribed data from Oakland Police Department body cameras for several studies regarding racial disparity in officer behavior.  These studies included measuring respectfulness and politeness perceptions, identifying linguistic features to model respect, and measuring racial disparities in respect.

I have generalizability concerns about both papers because of the choices made in data collection.  In the DNM paper, both the bag-of-words classifier and the linguistically informed classifier performed worse than the human reference classification accuracy.  This was true both in-domain and cross-domain, though cross-domain is more relevant to this concern.  As a result, I suspect that any new corpus that is added as a source will have classification rates similar to the cross-domain accuracy.  Additionally, their use cases that focus on requests introduce a further bias – I suspect that introducing any new corpus not focused on requests will have similar, if not worse, performance.  I wonder if their classifier accuracies might improve if they considered more than two sentences of text with each request, to acquire additional contextual information.
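For anyone unfamiliar with the setup, the cross-domain evaluation is simply “train on one community, test on the other.”  The sketch below uses a generic bag-of-words pipeline with toy requests and labels of my own invention; it is not the paper’s actual classifier or data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the annotated requests; 1 = polite, 0 = impolite.
wiki_requests = ["Could you please take a look at this edit when you have a moment?",
                 "Fix this page now."]
wiki_labels = [1, 0]
se_requests = ["Would you mind sharing a minimal example of the error?",
               "Your question is useless, rewrite it."]
se_labels = [1, 0]

# Cross-domain evaluation: train on one community, test on the other.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(wiki_requests, wiki_labels)
print("cross-domain accuracy:", clf.score(se_requests, se_labels))
```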

The Voigt paper used only transcribed body camera audio from a single city’s police department, in a city well known for its crime rate.  As a result, their results may not generalize to interactions with law enforcement in rural communities (where crime rates are different), near the national borders (where demographics are different), and in safer communities (where criminal behavior is less prevalent).  Further, the behavior of the officers may differ with the knowledge that they are wearing body cameras.  I’m curious to know if patterns found in transcribed audio from police cruiser dashboard cameras (in situations when the officers aren’t wearing body cameras) are any better or worse than the results shown in this study.

In general, I also felt that the discussion sections of the papers were the most interesting parts.  The DNM paper looks at specific cases within the corpus, such as changes in the behavior of Wikipedia moderators when they become administrators and no longer have to be as polite (and who have to be particularly polite in the timeframe leading up to their election).  The Voigt paper discussion notes that while their work demonstrates that racial disparities in levels of officer respect exist, the root causes of those disparities are less clear, perhaps an ideal target for a follow-up study on a broader range of interaction transcriptions. 

 Another potential follow-up study to the Voigt paper could consider the effect of seasons on officer politeness.  All of the data from the Oakland Police Department was from interactions that occurred in April.  Are officers more likely to be polite when the weather is nicer, or less polite in the depths of winter?  And if there are seasonal or weather-related changes, does the racial disparity grow or shrink?

I found the distributions from Figure 1 of the DNM paper to be intriguing.  I’m curious why the Stack Exchange politeness spectrum seems to mimic the Gaussian distribution that you would expect to see, but the Wikipedia politeness spectrum seems to plateau just above the mean.  Trying to understand the difference in these distributions would be yet another interesting follow-up study – is the difference a result of inflated semi-polite interaction frequency because of the moderators trying to become administrators, or is it a result of the language in interactions on Wikipedia being more formal than the informal Stack Exchange, or some other reason entirely?  I’m curious to hear the thoughts of anyone else in the class.
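If the underlying politeness scores were available, a first step toward answering that question could be a two-sample Kolmogorov–Smirnov test on the two distributions.  The scores in this sketch are simulated to mimic the shapes I described, not the paper’s data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated politeness scores: Stack Exchange roughly Gaussian, Wikipedia with a
# plateau above the mean (modeled here as a mixture). Not the paper's data.
stack_exchange = rng.normal(0.0, 1.0, 2000)
wikipedia = np.concatenate([rng.normal(0.0, 1.0, 1500),
                            rng.uniform(0.2, 1.5, 500)])

statistic, p_value = stats.ks_2samp(stack_exchange, wikipedia)
print(f"KS statistic={statistic:.3f}, p={p_value:.2e}")
```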


Reflection #5 – [02/06] – [John Wenskovitch]

Both of these papers examine the trend towards “selective exposure” and away from diverse ideological exposure when viewing news electronically.  At a high level, both papers are touching on the same big idea – that users seem to be creating insular “echo chambers” of polarized news sources based on their ideals, ignoring the viewpoints of the opposing ideology either by their own conscious choice or algorithmically, based on their past behavior.  The Garrett paper looks at general web browsing for news sources and focuses on the area of opinion reinforcement.  The paper details a web-administered behavioral study in which participants were shown a list of articles (and their summaries) and were given the choice of which ones they wanted to view.  The findings supported the author’s hypotheses, including that users prefer to view news articles that are more opinion-reinforcing and that users will spend more time viewing those opinion-reinforcing articles.  The Bakshy et al. study was Facebook-centered, examining how users interact with shared news articles on that platform.  Among their findings were that ideologically cross-cutting exposure depended on both the spectrum of friend ideologies and how often those friends shared, but that there was some evidence of ideological isolation in both liberal and conservative groups.

Both of these studies had notable limitations that were discussed by the authors, but I felt that each was addressed insufficiently.  The Garrett study made use of both a liberal and a conservative online news outlet to obtain participants, which obviously will not ensure that the sample is representative of the population.  Garrett justifies this by supposing that if selective reinforcement is common in these groups, then it is likely the same among mainstream news readers; however, (1) no attempt is made to justify that statement (the brief mention in the Limitations section even contradicts this assertion), and (2) my intuition is that the opposite is true: that if selective reinforcement is common among centrists, then it almost certainly will be true at the ideological extremes.  In my opinion, the results from this study do not generalize, and this is a killer limitation of the paper.

Bakshy’s study has a similar limitation that the authors point out: that they are limited to recording engagement based on clicks to interact with articles.  As a result, individuals might spend some time reading the displayed summaries of some articles but never click to open the source, and such interactions are not logged.  To use the authors’ phrasing, “our distinction between exposure and consumption is imperfect.”  This surprised me – there was no way to record the amount of time that a summary was displayed in the browser, to measure the amount of time a viewer may have thought about that summary and decided whether or not to engage?  I know in my experience, my newsfeed is so full and my time is so limited that I purposefully limit the number of articles that I open, though I often pause to read summaries in making that decision.  I do occasionally read the summaries of ideologically-opposing articles, but I rarely if ever engage by clicking to read the full article.  Tracking exposures based on all forms of interaction would be an interesting follow-up study.

Despite the limitations, I thought that both studies were well-performed and well-reported with the data that the authors had gathered.  Garrett’s hypotheses were clearly stated, and the results were presented clearly to back up those hypotheses.  I wish the Bakshy paper had been longer so that more of their results could be presented and discussed, especially with such a large set of users and exposures under study.


Reflection #3 – [01/25] – [John Wenskovitch]

This paper describes a study regarding antisocial behavior in online discussion communities, though I feel that labeling the behavior as “negative” rather than “antisocial” may be more accurate.  In this study, the authors looked at the comment sections of CNN, Breitbart, and IGN, identifying users who created accounts and were banned during the 18-month study window.  Among other findings, the authors noted that these negative users write worse than other users, they both create new discussions and respond to existing discussions, and they come in a variety of forms.  The authors also found that the response from the rest of the community has an influence on the behavior of these negative users, and also that they are able to predict whether or not a user will be banned in the future with great accuracy just by evaluating a small number (5-10) of the user’s posts.

Overall, I felt that this paper was very well organized.  I saw the mapping pattern discussed during Tuesday’s class linking the data analysis process to the sections of the paper.  The data collection, preprocessing, and results were all presented clearly (though I had a visualization/data presentation gripe with many of the subfigures being rendered far too small with extra horizontal whitespace between them).  Their results in particular were neatly organized by research finding, so it was clear what was being discussed from the bolded introductory text.

One critique that I have which was not well addressed by the authors was the fact that all three of the discussion communities that they evaluated used the Disqus commenting platform.  In a way, this works to the authors’ advantage by having a standard platform to evaluate.  However, near the end of the results, the authors note that “moderator features… constitute the strongest signals of deletion.”  It would be interesting to run a follow-up study with websites that use different commenting platforms, as moderators may have access to different moderation tools.  I would be interested to know if the specific actions taken by moderators have a similar effect to the community response, and whether these negative users respond differently to gentler moderation steps like shadowbanning or muting than to harsher moderation steps like post deletion and temporary or permanent bans.  From research like this, commenting platform creators could modify their tools to support actions that mitigate negative behavior.

In a similar vein, the authors have no way of knowing precisely how moderators located comments from these negative users to begin the punishment process.  I would be interested to know if there is a cause-and-effect relationship between the community response and the moderator response (e.g., the moderators look for heavily downvoted comments to delete and ban users), or if the moderators simply keep track of problem users and evaluate every comment made by those users.  Unfortunately, this information is something that would likely require moderator interviews or further knowledge of moderation tools and tactics, rather than something that could be scraped or easily provided by Disqus.

The “factors that help identify antisocial users” and “predicting antisocial behavior” sections were quite interesting in my opinion, because problem users could be identified and moderated early on instead of after they begin causing severe problems within the discussion communities.  The authors’ use of inferential statistics here was well written and easy to follow.  Their discussion at the end of these sections regarding the generalizability of these classifiers was also pleasing to see included in the paper, showing that negative users share enough features that a classifier trained on CNN trolls could be used elsewhere.

Finally, I wanted to make note of the discussion under Data Preparation regarding the various ways that undesired behavior could be defined.  The discussion was helpful both from an explanatory perspective, describing negative tactics like baiting users, provoking arguments, and derailing discussions, as well as from a methodological perspective, to understand what behaviors were being measured and included throughout the rest of the study.  However, I’m curious if there are cases that the authors did not measure, or if there were false negative bans that may have been introduced into the data.  For example, several reddit communities are known for banning users who simply comment with different political views.  Though I don’t want to visit Breitbart myself, second-hand information that I’ve heard about the community makes me suspect that a similar approach might exist there.  It was not clear to me if the authors would have removed comments and banned users from consideration in this study if, for example, they simply expressed unwanted content (liberal views) in polite ways on a conservative website.  It still counts as “undesired behavior,” but I wouldn’t count it in the same tier as some of the other behaviors noted.
