This Just In: Fake News Packs a Lot in the Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News
Summary:
The main question this paper set out to address was: are there systematic stylistic and other content differences between fake and real news? The authors used three datasets in this study. The first came from an analysis of real and fake news coverage of the 2016 election. The second contained articles about US politics from real, fake, and satire sources. The third contained real and satire news articles. The authors show that “fake news articles tend to be shorter in terms of content, but use repetitive language and fewer punctuation.” They then show that the features identified in the paper can be used to predict whether an article is real, fake, or satire news with an SVM.
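For a concrete sense of what that general approach looks like, here is a minimal sketch using scikit-learn: a few hand-rolled stylistic features (length, punctuation rate, repetitiveness) feeding a linear SVM. The feature set and the toy articles are my own illustration, not the authors' actual pipeline.

    import numpy as np
    from sklearn.svm import SVC

    def stylistic_features(text):
        # A few crude style cues: length, punctuation rate, repetitiveness.
        words = text.split()
        n_words = max(len(words), 1)
        return [
            n_words,                                     # article length
            sum(c in ".,;:!?" for c in text) / n_words,  # punctuation per word
            len(set(words)) / n_words,                   # lexical diversity (low = repetitive)
        ]

    # Hypothetical labeled articles: 0 = real, 1 = fake, 2 = satire.
    texts = [
        "The senate passed the measure on Tuesday after a lengthy debate.",
        "SHOCKING you will not believe this you will not believe this at all",
        "Area man declares total victory over his lawn; nation celebrates.",
    ]
    labels = [0, 1, 2]

    X = np.array([stylistic_features(t) for t in texts])
    clf = SVC(kernel="linear").fit(X, labels)
    print(clf.predict(X))

In practice the paper uses a much richer feature set, but the shape of the method is the same: map each article to a style vector, then train a classifier on those vectors.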
Reflection:
I thought this paper was better structured and more focused than the last paper we read. The authors had a much clearer sense of the question they wanted to answer, and they stated early on how and why this paper differed from the studies that came before it: “The inclusion of satire as a third category of news is a unique contribution of our work.” In the results section, the authors also expanded on what their statistical findings could mean and why the findings turned out the way they did.
While, again, I do not find the results too shocking, it was cool to see that a model can be built from the features the authors found to predict whether an article is real or fake. (I also really enjoyed how the authors structured the title.)
Questions Brought Up:
Why just SVM?
One of the main questions I had throughout the paper was why the authors didn't try multiple types of machine learning algorithms to see which one would best predict fake news articles. Maybe they were leaving this for future papers, but it would have been neat to see how different types of classifiers perform on this kind of prediction, as in the sketch below.
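Running that comparison is straightforward with scikit-learn. A sketch, assuming a feature matrix X and labels y already exist (e.g., from the earlier stylistic-features snippet) with enough samples per class for 5-fold cross-validation:

    # Sketch: compare several off-the-shelf classifiers on the same features.
    # X and y are assumed to exist (see the earlier feature-extraction sketch).
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB

    models = {
        "SVM (linear)": SVC(kernel="linear"),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Random forest": RandomForestClassifier(n_estimators=200),
        "Naive Bayes": GaussianNB(),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")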
Would an evenly distributed dataset cause problems for the prediction models the authors use?
I noticed that two of their three datasets were evenly distributed between real and fake/satire news sources. I wonder whether a 50/50 split would skew the prediction algorithm. Assuming the real-world ratio of real to fake news stories is not one to one, would the algorithm come to believe that fake news stories are more (or less) prevalent than they actually are? I know the authors could randomize the order in which they fed in the fake and real stories, but I wonder whether it is better to have datasets that model the real world more closely; one common mitigation is sketched below.
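One standard way to address a mismatch between the training distribution and the real-world class ratio is to reweight the classes rather than rebuild the dataset. A sketch with scikit-learn, again assuming X and y exist; the 0.9/0.1 weighting is a made-up illustration of a real-world prior, not an estimate from the paper:

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Keep the class ratio consistent between the train and test splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # class_weight="balanced" reweights by inverse class frequency; an
    # explicit dict like this one can instead encode an estimated real-world
    # prior, e.g. if real news (label 0) is believed to dominate in the wild.
    clf = SVC(kernel="linear", class_weight={0: 0.9, 1: 0.1})
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))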
Personally, in the walking-prediction training I have done in the past, we had to vary how the data was pre-processed because the algorithm would learn the specific walking course before it learned to predict a person's steps. We needed a more random walking course, and we took just the delta of the steps to get a model that better suited the real world.
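For reference, "taking the delta" just means differencing the signal so the model sees relative step-to-step motion instead of absolute positions on one particular course. A one-liner with NumPy (the position values are made up):

    import numpy as np

    positions = np.array([0.0, 0.4, 1.1, 1.9, 2.4])  # made-up foot positions
    deltas = np.diff(positions)  # step-to-step changes: [0.4, 0.7, 0.8, 0.5]
    print(deltas)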
Future Questions/topics:
Could a browser extension house the prediction model the authors developed and alert users when an article is potentially fake?
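One plausible design would keep the model server-side and have the extension POST the article text to a small API. A minimal sketch with Flask; the endpoint, the saved model file, and the feature function are all hypothetical stand-ins for whatever the training step produced:

    from flask import Flask, request, jsonify
    import joblib

    app = Flask(__name__)
    clf = joblib.load("fake_news_svm.joblib")  # hypothetical saved SVM

    def stylistic_features(text):
        # Same crude features as the training sketch above.
        words = text.split() or [""]
        return [
            len(words),
            sum(c in ".,;:!?" for c in text) / len(words),
            len(set(words)) / len(words),
        ]

    @app.post("/classify")
    def classify():
        text = request.get_json()["text"]
        label = int(clf.predict([stylistic_features(text)])[0])
        return jsonify({"label": ["real", "fake", "satire"][label]})

    if __name__ == "__main__":
        app.run()

The extension itself would only need to scrape the page text, call this endpoint, and show a warning badge based on the returned label.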