- Mitra, G. P. Wright, and E. Gilbert, “A Parsimonious Language Model of Social Media Credibility Across Disparate Events”
For centuries, written language, which grew out of word of mouth, has been the primary mode of discourse and of transporting ideas. Its structure and capacity have shaped our view of the world. But changing social landscapes are testing written language’s efficacy. The emergence of social media, the preference for short blobs of text, citizen journalism, the emergence of the cable reality show (I mean, the NEWS) and various related developments are changing the way we are informed about our surroundings. These changes not only affect our ability to quantify credibility but also inundate us with more information than one can wade through. In this paper, the authors explore whether the language used on one social media website, Twitter, can be a good indicator of the perceived credibility of the text written.
The authors try to predict the credibility of news by building a parsimonious model (one with a low number of input parameters) using penalized ordinal regression over the credibility classes “low”, “medium”, “high” and “perfect.” They use the CREDBANK corpus along with other linguistic repositories and tools to build the model. As linguistic features they pick modality, subjectivity, hedges, evidentiality, negation, exclusion and conjunction, anxiety, positive and negative emotions, boosters and capitalization, quotations, questions and hashtags. As control variables they use the number of original tweets, retweets and replies, the average length of original tweets, retweets and replies, and the number of words in original tweets, retweets and replies. They use a penalized version of ordered logistic regression to handle multicollinearity and sparsity. The authors then rank and compare the input variables listed above by their explanatory power.
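To make the modelling idea concrete, here is a minimal sketch of an ordered logistic model with a penalty term. It is an illustration under assumptions, not the authors’ implementation: the proportional-odds form, the L1 (lasso-style) penalty, the two toy features, and all weights and cutpoints are hypothetical, and the paper’s exact model form and penalty may differ.

```python
import math

LEVELS = ["low", "medium", "high", "perfect"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_probabilities(x, weights, cutpoints):
    """Proportional-odds model: P(y <= k) = sigmoid(cutpoint_k - w.x).

    `cutpoints` must be increasing, one fewer than the number of levels.
    """
    score = sum(w * xi for w, xi in zip(weights, x))
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


def penalized_loss(data, weights, cutpoints, lam=0.1):
    """Negative log-likelihood plus an L1 penalty on the weights.

    The L1 term pushes weak features toward zero, which is one way to
    obtain a sparse (parsimonious) model; it is an assumption here.
    """
    nll = 0.0
    for x, label in data:
        p = class_probabilities(x, weights, cutpoints)[LEVELS.index(label)]
        nll -= math.log(p)
    return nll + lam * sum(abs(w) for w in weights)


# Hypothetical features: [hedge count, booster count] for one event.
example_weights = [0.8, -0.5]
example_cutpoints = [-1.0, 0.0, 1.0]
probs = class_probabilities([1.0, 2.0], example_weights, example_cutpoints)
for level, p in zip(LEVELS, probs):
    print(f"{level}: {p:.3f}")
```

Fitting would minimise `penalized_loss` over the weights and cutpoints; features whose weights stay at or near zero contribute little explanatory power.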
One thing I was unsure of after reading the paper is whether the authors accounted for long tweets, where a user replies to their own tweet as a means of extending it. Filtering out such self-replies could make the number of replies a more credible feature. The authors also do not seem to account for spelling mistakes and similar noise; a preprocessing step that handles them could improve the performance and reliability of the model.
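As a rough sketch of what the self-reply filtering could look like, the function below folds a user’s replies to their own tweets back into the tweet they extend, so that only replies from other users remain as separate items. The record fields (`id`, `user`, `in_reply_to`, `text`) are hypothetical and not taken from CREDBANK, and the code assumes a reply always has a larger id than its parent.

```python
def merge_self_replies(tweets):
    """Fold same-author reply chains into the tweet they extend.

    Assumes each tweet is a dict with hypothetical keys 'id', 'user',
    'in_reply_to' (parent id or None) and 'text', and that a reply
    always has a larger id than its parent.
    """
    by_id = {t["id"]: t for t in tweets}
    merged = {}   # root tweet id -> combined record
    root_of = {}  # tweet id -> id of the record it was folded into
    for t in sorted(tweets, key=lambda t: t["id"]):
        parent = by_id.get(t.get("in_reply_to"))
        if parent is not None and parent["user"] == t["user"]:
            # Self-reply: append its text to the record it continues.
            root = root_of[parent["id"]]
            merged[root]["text"] += " " + t["text"]
            root_of[t["id"]] = root
        else:
            # Original tweet, or a reply written by someone else.
            merged[t["id"]] = dict(t)
            root_of[t["id"]] = t["id"]
    return list(merged.values())


thread = [
    {"id": 1, "user": "a", "in_reply_to": None, "text": "part one"},
    {"id": 2, "user": "a", "in_reply_to": 1, "text": "part two"},
    {"id": 3, "user": "b", "in_reply_to": 1, "text": "really?"},
    {"id": 4, "user": "a", "in_reply_to": 2, "text": "part three"},
]
for record in merge_self_replies(thread):
    print(record["user"], "|", record["text"])
```

After merging, the reply count for the first tweet would include only the reply from user `b`.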
It would be interesting to test whether the method the authors describe carries over to other languages, especially ones that are linguistically very different from English.
Language has been evolving ever since its inception. New slang and dialects add to this evolution. Social struggles and changes also affect language use, and vice versa. In such a setting, is inferring credibility from language use a reliable method? It would be an interesting project to see whether these underlying linguistic features have remained the same over time. One could pick out texts of past discourse and see whether, and how, the reliability of the authors’ model changes. Such a study would need to account for data imbalance.
When a certain behaviour is penalized, the repressed always find a way back. This may also apply to purveyors of fake news: they could game the system by using particular language constructs and words to evade it. Because of the way the authors built the system, it could be susceptible to such manipulation. To counter this, one could automate the feature selection: the model could routinely recalculate the importance of its features while also adding new words to its dictionary.
Can a deep learning model be built to improve credibility measurement? One could try building a sequential model, be it an LSTM or, even better, a TCN [2], that takes as input the word2vec [3] vectors of the words in a tweet, combined with an attention mechanism or even the approach in [4] to keep the model interpretable. Care has to be taken that models in this area remain interpretable, so that the system does not lack accountability.
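To illustrate why attention helps with interpretability, here is a minimal sketch of dot-product attention pooling over word vectors. It leaves out the LSTM or TCN encoder entirely, and the three-dimensional vectors and the query vector are made-up stand-ins for trained word2vec embeddings and a learned parameter.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(word_vectors, query):
    """Weight each word vector by softmax(query . vector) and sum them.

    Returns the pooled tweet vector and the per-word attention weights;
    the weights show which words the pooled vector leans on most.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, word_vectors))
        for i in range(len(query))
    ]
    return pooled, weights


# Hypothetical embeddings; real ones would come from a trained model.
words = ["reportedly", "explosion", "downtown"]
vectors = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.1, 0.2, 0.7]]
query = [1.0, 0.0, 0.0]  # stand-in for a learned attention query

pooled, weights = attention_pool(vectors, query)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.3f}")
```

In a full model the pooled vector would feed a classifier, and the attention weights could be inspected per tweet to see which words drove a credibility prediction.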
[2] Colin Lea et al., “Temporal Convolutional Networks for Action Segmentation and Detection”
[3] T. Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”
[4] Tian Guo et al., “An interpretable LSTM neural network for autoregressive exogenous model”