CREDBANK: A Large-Scale Social Media Corpus with Associated Credibility Annotations

Article: CREDBANK: A Large-Scale Social Media Corpus with Associated Credibility Annotations: https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10582/10509

 

Summary

This is the whitepaper for Credbank, a system (which the authors refer to as a “corpus”) for systematically studying the phenomenon of social media as a news source. Credbank specifically investigates Twitter: it relies on real-time tracking and “intelligent routing” of tweets to crowdsourced human annotators to determine tweet credibility. The trial of Credbank assessed in the paper took place over three months and comprised “more than 60M tweets grouped into 1049 real-world events, each annotated by 30 Amazon Mechanical Turk workers for credibility (along with their rationales for choosing their annotations)” (258). Tanushree Mitra and Eric Gilbert correctly note that credibility assessment has received a great deal of attention in recent years, and the paper pays due respect to the other work done in this arena. Toward the end of their “related work” section they note that their contribution is unique in its mobilization of real-time analysis.

The bulk of the paper describes their method of collecting and analyzing Twitter data in real time, beginning with a pre-processing pipeline that screens tweets through tokenization, stop-word removal, and spam filtering. This is key to their use of Latent Dirichlet Allocation (LDA), which finds patterns of word co-occurrence across documents (in this case, tweets) and inductively generates topic models from them. Humans intervene in the process soon after: MTurk workers confirm whether the gathered tweets actually relate to a newsworthy event, since, as the authors note, purely computational approaches often lead to false positives (261). The authors explain what they count as measurably “credible” (262) before disclosing that, in the process of running these trials, they also discovered the number of Turkers necessary to approximate an expert’s judgment: 30 per event (263). Through statistical analyses of events annotated by 1,736 Turkers, they conclude that credibility consensus on Twitter is alarmingly weak: while 95% of events managed at least 50% annotator agreement on “certain accuracy,” only 55% of events reached the 80% agreement mark (264).
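To make the grouping step concrete, here is a minimal sketch of a tokenize, stop-word-removal, and LDA topic-modeling stage using the gensim library. This is my own illustration, not CREDBANK’s actual configuration: the example tweets, the stop-word list, and the choice of two topics are assumptions, and the paper’s spam-filtering step is omitted.

```python
# Minimal sketch of a tokenize -> stop-word removal -> LDA pipeline (illustrative;
# the paper does not publish its exact parameters or stop-word list).
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

raw_tweets = [
    "Breaking: earthquake reported near the coast, officials confirm damage",
    "Huge earthquake just hit, stay safe everyone",
    "Final score tonight: home team wins the championship game",
]

# Tokenize each tweet and drop stop words.
docs = [[tok for tok in simple_preprocess(t) if tok not in STOPWORDS]
        for t in raw_tweets]

# Build a dictionary and bag-of-words corpus, then fit an LDA model.
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=42)

# Each tweet receives a distribution over latent topics; tweets sharing a
# dominant topic can be grouped into a candidate "event."
for tweet, bow in zip(raw_tweets, bow_corpus):
    print(tweet[:40], lda.get_document_topics(bow))
```

In CREDBANK’s workflow, clusters like these are then routed to Turkers, who first confirm whether the cluster actually describes a newsworthy event before any credibility rating takes place.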

The authors conclude with a macroscopic assessment of factors implicated in current and future research on this topic: temporal dynamics (recurring events, such as sports games, seem to have lower overall credibility), the role of social networks and mass media in shaping credibility ratings, the viability of modeling credibility as a normal distribution, other strategies for confirming credibility, and the role that supplementary data may play.

 

Analysis

The authors ostensibly use the term “corpus” because they believe that the major contribution of Credbank is the dataset. Although the dataset is perhaps the more obviously practical offering, their methodology (the combination of theory and practice, well explicated in the steps they take to arrive at their data) seems the most instructive for those interested in advancing knowledge about crowdsourcing and social-media-as-news credibility in a more general sense. To me, Credbank is not so much a dataset as an example of theory in practice. Its shortcomings suggest that assumptions about human expression on social media may require more careful consideration before being operationalized in systems like the one seen here.

Their use of topic modeling/LDA seems notable to me, and it is a place where we can use the outcomes (evidently, tweets aren’t very credible) to tweak the theoretical assumptions. I think the authors may want to revisit their use of tokenization and stop words in order to account for “the nuances associated with finding a single unique credibility label for an item,” a problem that they believe affects whether credibility can be modeled along a normal curve.
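As a toy illustration of this concern (my example, not the authors’), common English stop-word lists include negations and hedges, so filtering can strip exactly the cues an annotator, or any downstream credibility signal, would need:

```python
# Toy illustration (mine, not from the paper): a typical English stop-word list
# often contains negations and hedges, so filtering can erase the very cues
# that signal uncertainty about an event.
STOP_WORDS = {"a", "an", "the", "but", "there", "been", "have", "yet",
              "not", "may"}  # small sample of words common stop lists include

tweet = "Not confirmed yet, but there may have been an explosion downtown"
tokens = [t.strip(",.").lower() for t in tweet.split()]
filtered = [t for t in tokens if t not in STOP_WORDS]

print(tokens)
# ['not', 'confirmed', 'yet', 'but', 'there', 'may', 'have', 'been', 'an',
#  'explosion', 'downtown']
print(filtered)
# ['confirmed', 'explosion', 'downtown']  <- hedging and negation are gone
```

If the “not yet confirmed” signal disappears before tweets are grouped into events, the resulting clusters flatten out precisely the nuance the authors later struggle to capture in a single credibility label.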

 

Questions

  1. Given our prior discussions about social media credibility as a news source, what is different about Credbank? Is there anything specific to its functionality that makes you think it more or less trustworthy?
  2. What do we think about the functions used in the data preprocessing (their methods of spam removal, tokenization, and stop-word filtering)? Can we identify any ways in which these choices might harm the system?
  3. To return to a prior issue, since it comes up in this paper: what do we think about the use of financial incentives here? Could this taint the annotations?
  4. They frequently discuss the use of “experts” here, but do not identify who they are. Do we see this as a weakness of the paper — and perhaps more interestingly, are there any real experts in this arena?
  5. Is there a way to crowdsource credibility annotation of tweets that does not rely on inductive preprocessing? I would suggest that tokenization, stop words, and other filters distort the assessment of tweets to the point where this system can never be functionally practical for the purposes of real social science research.