Summary
This article analyzes the language style of fake news articles and compares it to that of real and satirical pieces. To do this, the researchers assembled three datasets: a BuzzFeed collection of high-impact articles, a set of 75 hand-picked articles in each category, and the Burfoot and Baldwin collection of “non-newsy” satirical articles. From there, the researchers extracted natural language features measuring the style, complexity, and psychology of the language used in each article. Using these features, they ran ANOVA and Wilcoxon tests to detect differences between the three groups of articles. These tests showed that fake news articles were clearly distinguishable from real news, and were in many ways closer to satire. Finally, the researchers built a linear SVM model to classify articles and found that their features could predict fake and satirical news with accuracies of 71% and 91%, respectively.
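As a rough illustration, the pipeline described here boils down to comparing per-article stylistic features across groups with a statistical test and then training a linear SVM on the same features. The sketch below is not the authors' code or data: the feature values are invented placeholders, and it uses SciPy and scikit-learn defaults only to show the shape of the analysis.

```python
# Minimal sketch: group comparison on one stylistic feature, then a linear SVM.
# The feature matrices below are hypothetical stand-ins, not the paper's data.
import numpy as np
from scipy.stats import ranksums
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical per-article features: [noun count, TTR, word count, quote count]
fake_features = rng.normal(loc=[45, 0.55, 600, 2], scale=[10, 0.05, 150, 1], size=(75, 4))
real_features = rng.normal(loc=[60, 0.65, 900, 5], scale=[10, 0.05, 150, 2], size=(75, 4))

# Wilcoxon rank-sum test on one feature (word count) between fake and real news
stat, p_value = ranksums(fake_features[:, 2], real_features[:, 2])
print(f"word count: statistic={stat:.2f}, p={p_value:.4f}")

# Linear SVM separating fake (1) from real (0), scored with cross-validation
X = np.vstack([fake_features, real_features])
y = np.concatenate([np.ones(75), np.zeros(75)])
accuracy = cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean()
print(f"mean CV accuracy: {accuracy:.2f}")
```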
Reflection
First, I found the data collection methodology much more prudent than that of the last article. All three datasets are valid but flawed, so it was commendable that the researchers chose to measure all of them separately to account for their individual shortcomings. This level of depth is impressive, since it would have been sufficient to simply use the second, hand-picked dataset: it captures a strong list of well-known sources for each category, and the results would remain statistically significant so long as the articles were picked at random. The fact that the researchers felt they needed all three datasets shows how complex distinguishing and sampling fake news really is.
Second, I found the list of the top four features they extracted very interesting: number of nouns, lexical redundancy (TTR), word count, and number of quotes. Out of hundreds of features measuring the style, complexity, and psychological impact of an article, the most important factors are essentially how many quotes, nouns, and unique words it uses. This is an interesting finding, and I’m left wondering why the researchers didn’t include a Principal Component Analysis or a feature-importance analysis to tell us how much of the variance or accuracy these four features can explain. It would be an insight in itself to learn that, say, 40% of the variance could be explained by word count alone.
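To make that suggestion concrete, here is a minimal sketch of how such an analysis could look: PCA explained-variance ratios for the variance question, and linear-SVM coefficient magnitudes as a rough importance measure. The data and feature names are invented for illustration and are not taken from the paper.

```python
# Sketch of the suggested follow-up: variance explained by principal components,
# and feature weights from a linear classifier. X, y, and the names are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

feature_names = ["noun_count", "ttr", "word_count", "quote_count"]

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))          # stand-in feature matrix
y = rng.integers(0, 2, size=150)       # stand-in fake/real labels

X_scaled = StandardScaler().fit_transform(X)

# Share of total variance explained by each principal component
pca = PCA().fit(X_scaled)
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))

# Magnitude of linear-SVM coefficients as a rough feature-importance measure
svm = LinearSVC(dual=False).fit(X_scaled, y)
for name, coef in zip(feature_names, svm.coef_[0]):
    print(f"{name}: |w| = {abs(coef):.3f}")
```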
Finally, the concluding remarks in the article open up an interesting question. The researchers describe how difficult it is to obtain non-noisy ground truth for fake and real news, especially as real news becomes increasingly opinion-based. Could the researchers repeat this experiment with real vs. opinion-based, biased, or editorial pieces? And if the amount of bias present is a scale, could they instead fit a regression to estimate the level of bias present in modern news sources?
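If bias were scored on a continuous scale, the classification setup could be swapped for a regression. The sketch below is purely hypothetical: the features and 0–1 bias scores are invented, and Ridge regression is just one reasonable choice to illustrate the idea.

```python
# Hypothetical sketch of the regression idea: predict a continuous bias score
# from stylistic features instead of a binary fake/real label.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))            # stand-in stylistic features
bias = rng.uniform(0.0, 1.0, size=200)   # stand-in bias scores in [0, 1]

# Ridge regression scored by cross-validated R^2 as a first sanity check
r2 = cross_val_score(Ridge(alpha=1.0), X, bias, cv=5, scoring="r2").mean()
print(f"mean CV R^2: {r2:.2f}")
```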