[Reflection #2] – [1/31] – [Matthew Fishman]

Quick Summary

Why is this study important? With the sheer quantity of news stories shared across social media platforms, news consumers must rely on quick heuristics to decide whether a news source is trustworthy. Horne et al. set out to find characteristics that distinguish real news from fake news, to help news consumers better tell the two apart.

How did they do it? They studied stylistic, complexity, and psychological features of real news, fake news, and satire articles to help distinguish the three categories from one another. Horne et al. then used ANOVA tests on normally distributed features and Wilcoxon rank-sum tests on non-normally distributed features to identify which features differ between the categories of news. Finally, they selected the top 4 distinguishing features for both the body text and the title text of the articles to build an SVM with a linear kernel, evaluated with 5-fold cross-validation.
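To make that modeling step concrete, here is a minimal sketch (not the authors' actual code) of how a linear-kernel SVM with 5-fold cross-validation might be trained on a small set of pre-computed features using scikit-learn. The feature matrix and labels below are placeholder assumptions; in the study they would come from the article corpus after the ANOVA/Wilcoxon feature-selection step.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder data: rows are articles, columns are the top-4 selected
# features (e.g., title length, stop-word ratio, ...); labels are
# 0 = real, 1 = fake. These values are synthetic stand-ins.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

# Linear-kernel SVM on standardized features, scored with
# 5-fold cross-validation, mirroring the setup described above.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean 5-fold CV accuracy: {scores.mean():.2f}")
```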

What were the results? Their classifier achieved 71% to 91% cross-validation accuracy over a 50% baseline when distinguishing between real news and either satire or fake news. Some of the major differences they found between fake news and real news are that real news has a much more substantial body, while fake news has much longer titles with simpler words. This suggests that fake news writers are attempting to squeeze as much substance into the titles as possible.

Reflection

Again, I was not as surprised by the outcomes of this study as I had hoped. It seems obvious that, for example, fake news articles would have a less substantial body and use simpler words in their titles than real news. However, the length of fake news titles and their lack of stop words did surprise me; I had always associated fake news with click-bait, which usually has very short titles.
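For illustration, the kinds of title features discussed here (word count, stop-word ratio, average word length) could be computed with something like the sketch below. The tiny stop-word list and the example headline are my own assumptions, not the paper's exact feature definitions.

```python
# Simple title features of the kind discussed above: word count,
# stop-word ratio, and average word length. The stop-word list is
# an illustrative assumption, not the one used in the study.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for"}

def title_features(title: str) -> dict:
    words = title.lower().split()
    n = len(words)
    stop_ratio = sum(w in STOP_WORDS for w in words) / n if n else 0.0
    avg_word_len = sum(len(w) for w in words) / n if n else 0.0
    return {"word_count": n,
            "stop_word_ratio": stop_ratio,
            "avg_word_length": avg_word_len}

print(title_features("You Will Not Believe What This Senator Just Did"))
```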

A few problems I had with the study included:

  • The data sets used. If the real enemy here is fake news, then why would the researchers use a total of only 110 fake news sources (compared with 308 satire news sources and 4111 real news sources)? No wonder the classifier had an easier time distinguishing real news from the other two.
  • The features extracted. The researchers could have used credibility features or user-interaction metrics such as shares or clicks to better distinguish fake news from real news. If the study had drawn on more than just linguistic features, the classifier might have been considerably more accurate.

Going Forward:

Some improvements to the study (beyond increasing the dataset size and broadening the features extracted) could come from user research:

  • How much time do users spend reading a fake news article in comparison to real news?
  • What characteristics of a news consumer are correlated with sharing, liking, or believing fake news?
  • What is the ratio of fake news articles clicked or shared to that of real news?

Questions Raised:

  • Can we predict the type of user to be more susceptible to fake news?
  • How have the linguistics of fake news changed over the years? What can we learn from this in predicting how they might change in the future?
  • Should real news sources change the format of their titles to give lazy consumers as much information as possible without needing to read the article? Or would this hurt their bottom line, since their articles might not get as many clicks if all the information is in the title?
  • Should news aggregators like Facebook and Reddit be using similar classifiers to mark how potentially “fake” a news article is?
