Reading Reflection #2 – [01/31] – [Alon Bendelac]

Summary:

The issue of fake news has received a lot of attention lately due to the 2016 US Presidential Election. This research paper compares real, fake, and satire news sources using three datasets. The study finds that fake news is more similar to satire than to real news, and that it relies on heuristics rather than arguments. It also finds that article titles differ significantly between real and fake news. Three categories of features were studied: stylistic, complexity, and psychological. The study uses the one-way ANOVA test and the Wilcoxon rank sum test to determine whether the news categories (real, fake, and satire) show statistically significant differences in any of the features studied. Support Vector Machine (SVM) classification was used to demonstrate that the strong differences between real, fake, and satire news can be used to predict the category of previously unseen articles.

Reflection:

Punctuation: In Table 3(c), one of the stylistic features is “number of punctuation.” The study treats all punctuation types as a whole and only considers the total count of punctuation marks. I think it would be interesting to look at specific punctuation types separately. For example, fake news articles might be more likely than real news articles to have an ellipsis in the title; an ellipsis could be a common technique used by fake news organizations to attract readers. Similarly, question marks in titles might be more common in fake news articles. A rough feature extractor along these lines is sketched below.
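
As a minimal sketch of this idea (the chosen punctuation types and the example title are my own, not from the paper), per-type punctuation counts could be extracted like this:

```python
from collections import Counter

# Punctuation types to count separately, instead of the paper's single
# aggregate "number of punctuation" feature (this particular set of
# types is my own assumption).
PUNCTUATION_TYPES = {
    "ellipsis": ["...", "\u2026"],  # three dots or the single ellipsis char
    "question_mark": ["?"],
    "exclamation_mark": ["!"],
    "quote": ['"', "\u201c", "\u201d"],
}

def punctuation_features(title: str) -> dict:
    """Count each punctuation type separately in an article title."""
    features = {}
    for name, symbols in PUNCTUATION_TYPES.items():
        features[name] = sum(title.count(s) for s in symbols)
    return features

# Hypothetical example title, for illustration only.
print(punctuation_features("You Won't Believe What Happened Next..."))
# {'ellipsis': 1, 'question_mark': 0, 'exclamation_mark': 0, 'quote': 0}
```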

Neural networks: The study used a Support Vector Machine (SVM) to predict whether an article is real or fake. I wonder how a neural network, which is more flexible than an SVM, would perform. In Table 6, only two of the three categories (real, fake, and satire) are tested at a time, because an SVM is designed for binary classification. A neural network could be designed to classify articles into one of the three categories directly. This would make more sense than pairwise SVMs, since in practice we usually can’t eliminate one of the three categories in advance and then test for just the other two.
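
A minimal sketch of this three-way setup, using scikit-learn's MLPClassifier on a hypothetical feature matrix (the feature values and training labels below are placeholders, not the paper's data):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical feature matrix: each row is one article described by a few
# stylistic/complexity/psychological feature values (placeholder numbers).
X = np.array([
    [0.2, 1.0, 3.0],
    [0.8, 4.0, 1.0],
    [0.5, 2.0, 2.0],
    [0.1, 0.5, 3.5],
    [0.9, 5.0, 0.5],
    [0.6, 2.5, 1.5],
])
# All three classes at once, instead of pairwise SVM comparisons.
y = np.array(["real", "fake", "satire", "real", "fake", "satire"])

# A small feedforward network; multiclass output comes for free.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, y)

# Predict the category of a previously unseen article.
print(clf.predict([[0.3, 1.2, 2.8]]))
```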

Processing embedded links: This study only looks at the bodies of articles as plain text, without considering links embedded within the text. I think examining where embedded links direct readers could help detect fake news. For example, if an article links to another article known to be fake news, the first article is more likely to be fake news as well. The research question could be: Can embedded links be used to predict whether a news article is fake or real? A rough sketch of this check appears below.
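
A minimal sketch of this idea, assuming we have the article HTML and a list of domains already labeled as fake-news sources (the domain list and the example HTML are hypothetical):

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Hypothetical set of domains previously labeled as fake-news sources.
KNOWN_FAKE_DOMAINS = {"example-fake-news.com", "totally-real-stories.net"}

def fake_link_fraction(article_html: str) -> float:
    """Fraction of embedded links that point to known fake-news domains."""
    soup = BeautifulSoup(article_html, "html.parser")
    domains = [
        urlparse(a["href"]).netloc.lower()
        for a in soup.find_all("a", href=True)
    ]
    if not domains:
        return 0.0
    flagged = sum(d in KNOWN_FAKE_DOMAINS for d in domains)
    return flagged / len(domains)

# Hypothetical article body, for illustration only.
html = '<p>See <a href="http://example-fake-news.com/story">this report</a>.</p>'
print(fake_link_fraction(html))  # 1.0
```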

Number of citations and references: I believe real news stories are more likely than fake news stories to contain citations, references, and quotes. Number of quotes was one of the stylistic features, but number of references was not studied. A reference could point to a study or to another news article related to the one in question; it could also refer to a past event. A simple pattern-based count is sketched below.
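
As a minimal sketch (the phrase patterns below are my own guesses at what reference-like text looks like, not a definition from the paper):

```python
import re

# Rough, hand-picked patterns for reference-like phrases (my own
# assumptions about what counts as a reference).
REFERENCE_PATTERNS = [
    r"according to",
    r"a study (?:by|from|published)",
    r"reported by",
    r"as reported in",
]

def count_references(body: str) -> int:
    """Count reference-like phrases in an article body."""
    text = body.lower()
    return sum(len(re.findall(p, text)) for p in REFERENCE_PATTERNS)

# Hypothetical article text, for illustration only.
body = "According to a study by researchers, turnout rose, as reported in the Times."
print(count_references(body))  # 3
```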
