Paper-
A Parsimonious Language Model of Social Media Credibility Across Disparate Events – Tanushree Mitra, Graham P. Wright, Eric Gilbert.
Summary-
The paper attempts to use the language of a Twitter post to detect and predict the credibility of a text thread. It focuses on a dataset of around 1300 event streams drawn from Twitter. It relies solely on theory-guided linguistic category types, both lexicon-based (like conjunctions, positive and negative emotion words, subjectivity, and anxiety words) and non-lexicon-based (like hashtags and question marks). Using these linguistic categories, it creates variables corresponding to the words belonging to each category. It then fits a penalized ordered logistic regression model to the dataset (CREDBANK), which contains perceived-credibility labels for each event thread, as determined by Amazon Mechanical Turk workers. From this model, it predicts the credibility of a thread, determines which linguistic categories are strong predictors of credibility and which are weak indicators, and identifies which words within these categories are positively or negatively correlated with credibility.
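As a concrete illustration of this modeling setup, the sketch below fits a penalized ordinal logistic regression in Python using the mord package's L2-penalized LogisticAT. This is not the authors' implementation (they use their own penalized ordered logistic regression), and the feature matrix and labels here are invented placeholders standing in for the per-event linguistic-category variables and the ordinal Turker credibility classes.

    # Minimal sketch, NOT the paper's code: penalized ordinal logistic regression.
    # X stands in for the event-by-feature matrix of linguistic-category variables;
    # y stands in for the ordinal perceived-credibility class of each event.
    import numpy as np
    import mord  # ordinal regression models with a scikit-learn-style API

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1300, 100))    # placeholder: ~1300 events x feature columns
    y = rng.integers(0, 5, size=1300)   # placeholder: 5 ordinal credibility levels

    model = mord.LogisticAT(alpha=1.0)  # alpha sets the L2 penalty strength
    model.fit(X, y)
    print(model.coef_[:5])              # signed weights per linguistic feature

The sign of each fitted coefficient is what allows reading off whether a linguistic category is positively or negatively associated with perceived credibility.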
Reflection-
The paper is thorough in its choice of linguistic categories and acknowledges that there may be further confounding variables, but some of the chosen variables do not intuitively seem like they would influence credibility assessments, e.g., question marks and hashtags. Indeed, the results show that these variables do not correlate with credibility judgements. Moreover, I fail to understand why the paper uses both the average length of tweets and the number of words in the tweets as control variables. This seems strange, as the two are very obviously correlated, and thus one of them is redundant; a quick check of this is sketched below.
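If the two controls really are near-duplicates, that is easy to verify before fitting the model: a correlation (or variance inflation factor) close to 1 would confirm the redundancy. The sketch below uses hypothetical column names and synthetic data purely to illustrate the check, not the paper's actual measurements.

    # Hypothetical redundancy check between the two control variables.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    avg_len = rng.normal(100, 20, size=500)               # synthetic avg. tweet length
    num_words = avg_len / 5 + rng.normal(0, 2, size=500)  # word count tracks length closely

    df = pd.DataFrame({"avg_tweet_length": avg_len, "num_words": num_words})
    r = df["avg_tweet_length"].corr(df["num_words"])
    print(f"Pearson r = {r:.2f}")   # a value near 1 means one control is redundant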
The appendix mentions that the Turkers were instructed to be knowledgeable about the topic. However, this strategy seems to make the credibility judgements susceptible to the biases of the individual labeler. A knowledgeable Turker will have preconceived notions about the event and its credibility, and there is no guarantee that they can separate these from their assessment of perceived credibility. This is a problem, since the study extracts information only from linguistic cues, without considering any external variables. For example, a labeler who believes global warming is a myth will be biased toward labeling a thread about global warming as less credible. This could perhaps be improved by assigning Turkers topics toward which they are neutral, or of which they are unaware.
The paper uses a logistic regression classifier, which is, of course, a fairly simple model that cannot capture a very complex function of the features. Using penalized logistic regression makes sense given that the number of features was almost nine times the number of event threads, but a more complex model, like a shallow neural network (sketched below), could be used if more data were collected.
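To make this suggestion concrete, here is a hypothetical sketch of such a shallow (single hidden layer) network using scikit-learn. The shapes are invented stand-ins for a larger dataset than the one actually available; this is my illustration, not anything from the paper.

    # Hypothetical sketch: a shallow neural network as an alternative model,
    # assuming far more labeled events than the ~1300 actually collected.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 500))   # invented: events x linguistic features
    y = rng.integers(0, 5, size=5000)  # invented: 5 ordinal credibility classes

    clf = MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-3, max_iter=200)
    clf.fit(X, y)
    print(clf.score(X, y))             # training accuracy; use held-out data in practice

One caveat: a plain classifier ignores the ordering of the credibility levels, so an ordinal loss (or a regression head) might suit the labels better than standard cross-entropy.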
The paper has many interesting findings about the correlation of words and linguistic categories with credibility. I found it fascinating that subjective phrases associated with newness/uniqueness and complexity/weirdness, and even certain strongly negative words, were positively correlated with credibility. It was also surprising that boosters (expressions of assertiveness) were negatively correlated with credibility when they appeared in the original tweet, while hedges (expressions of uncertainty) were positively correlated when in the original tweet. The inversion in the correlation of the same category of words, depending on whether they appeared in the original tweet or in the replies, speaks to a fundamental truth of communication: different expectations are placed on the initiator of a communication than on the responder.
Finally, the paper states that this system would be useful for early detection of the credibility of content, whereas other systems need time for the content to spread so that they can analyze user behavior before making predictions. I believe that in today's world, where information reaches billions of users within minutes, the time advantage gained by using only linguistic cues would not be enough to offset the drawbacks of ignoring information-dissemination and user-behavior patterns. Nevertheless, the paper has a lot of insights to offer social scientists and linguistics researchers.