[Reading Reflection 2] – [01/31] – [Numan, Khan]

This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News

Summary:

This paper’s overall goal is to determine whether fake news differs systematically from real news in style and language use. The authors’ motivation is to challenge the assumption that fake news is written to look similar to real news, in other words, that it fools readers who do not check the credibility of the sources and arguments in the article. To test this assumption, the authors study three data sets and their features using the one-way ANOVA test and the Wilcoxon rank sum test. The first data set comes from Buzzfeed’s analysis of real and fake news from the 2016 US Elections. The second data set contains news articles on US politics from real, fake, and satire news sources. The third data set contains real and satire articles from a previous study. The paper includes satire articles in order to differentiate its analysis from other papers on fake news. The authors conclude that fake news articles have less content, more repetitive language, and less punctuation than real news. Furthermore, fake news titles are longer, use fewer stop words, and contain fewer nouns than real news titles. When comparing fake news to satire, Horne and Adali were able to conclude that fake news is more similar to satire than to real news, disproving the assumption stated at the beginning of the paper.
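To get a feel for what this kind of hypothesis testing looks like in practice, here is a rough sketch I put together (not the authors’ code, and the numbers are made up) using SciPy’s one-way ANOVA and Wilcoxon rank sum functions on a single feature such as title length:

    import numpy as np
    from scipy.stats import f_oneway, ranksums

    # Hypothetical title lengths (words per title) for each class of article.
    real_titles = np.array([9, 11, 8, 10, 12, 9, 7, 10])
    fake_titles = np.array([15, 18, 14, 16, 13, 17, 19, 15])
    satire_titles = np.array([14, 16, 12, 15, 13, 17, 14, 16])

    # One-way ANOVA across all three classes: does mean title length differ?
    f_stat, anova_p = f_oneway(real_titles, fake_titles, satire_titles)

    # Wilcoxon rank sum test (non-parametric) comparing fake vs. real directly.
    w_stat, ranksum_p = ranksums(fake_titles, real_titles)

    print(f"ANOVA: F={f_stat:.2f}, p={anova_p:.4f}")
    print(f"Rank sum (fake vs. real): W={w_stat:.2f}, p={ranksum_p:.4f}")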

Reflection:

The assumption that this paper sets out to disprove is a belief that I held myself. When fake news became more prevalent during the 2016 Presidential Election, I viewed those articles as trying to appear like real news while lacking credibility in the sources and arguments they used. Initially, I found it interesting that the authors were trying to disprove this assumption, because my own view was that fake and real news look similar. However, I became fascinated by their inclusion of satire news in the data sets. I had never thought of comparing fake news to satire news, partly because I agree with the way the paper defines satire news as “…explicitly produced for entertainment”. Why would fake news, whose purpose is to deceive, be similar to news that is read for entertainment and mockery? But thinking about it more and looking at the bigger picture, I have a much different view of fake news after reading this paper. These fake news articles are not only trying to deceive people but are also written with parody-like qualities, since satirical news can easily grab people’s attention too. Therefore, it makes a lot of sense that fake news is similar to satire news.

While I don’t have much experience with Natural Language Processing (NLP) or Natural Language Understanding (NLU), the features defined in this paper for the data sets made sense to me. In other words, there seem to be no unnecessary or overlooked features. Being able to gauge word syntax, sentence- and word-level complexity, and sentiment all makes sense given the paper’s goal of determining whether fake news differs systematically from real news in style and language use. These features capture language at both a high and a low level of analysis. Personally, given my inexperience in this field, I would be eager to learn how to analyze natural language in Python in the future.
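To make this concrete for myself, here is a small sketch (my own illustration, not the feature code from the paper) of how a few stylistic features of the kind they describe, such as word count, stop-word usage, lexical redundancy, and punctuation, could be computed in plain Python:

    import string

    # A tiny illustrative stop-word list; a real analysis would use a fuller one.
    STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "is", "that"}

    def style_features(text):
        """Compute a few simple stylistic features of a piece of text."""
        words = [w.strip(string.punctuation).lower() for w in text.split()]
        words = [w for w in words if w]
        n = max(len(words), 1)
        return {
            "word_count": len(words),
            "avg_word_length": sum(len(w) for w in words) / n,
            "stop_word_ratio": sum(w in STOP_WORDS for w in words) / n,
            # Type-token ratio: lower values suggest more repetitive language.
            "type_token_ratio": len(set(words)) / n,
            "punctuation_count": sum(ch in string.punctuation for ch in text),
        }

    print(style_features("BREAKING: You Will Not Believe What Happened Next!"))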

Something else I appreciated was that Horne and Adali acknowledged that a statistical test alone says nothing about predicting classes in the data. Therefore, they used the statistical tests as a form of feature selection for a Support Vector Machine (SVM) model that classifies news articles based on small feature subsets. It was impressive that these feature subsets significantly improved the prediction of fake and satire news, achieving between 71% and 91% accuracy in separating them from real news stories. One question I was curious about is Horne and Adali’s selection of features for their SVM. Why did they choose specifically the top 4 features from their hypothesis testing? Is it because they are trying to avoid over-fitting? Would we see a difference if they used Principal Component Analysis as a way of telling which features would be best for classifying fake versus real news?
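Out of curiosity about how such a classifier might be set up, here is a minimal sketch using scikit-learn (my own placeholder data and code, not the authors’ model) that trains a linear-kernel SVM on a small feature matrix and scores it with 5-fold cross-validation:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Placeholder data: rows are articles, columns are a few selected stylistic
    # features (e.g., title length, stop-word ratio, proper nouns, redundancy).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = rng.integers(0, 2, size=100)  # 0 = real, 1 = fake (made-up labels)

    # Linear-kernel SVM, similar in spirit to the classifier described in the paper.
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

    # 5-fold cross-validated accuracy.
    scores = cross_val_score(model, X, y, cv=5)
    print("Mean accuracy:", scores.mean())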

The discussion of the statistical test results based on their defined features was clear and made sense. The paper often reflects effectively on the results it finds; however, some of those reflections are not surprising. For example, I already expected that fake news articles tend to have less content and longer titles than real news. My belief, even before reading this paper, was that fake news intends to grab readers’ attention through those long titles. Therefore, it makes sense that the writers of fake news articles pack as much information into the titles as possible. This leads me to think that readers tend to be more interested in the titles of fake news than those of real news, which raises another question that could become a project idea. What can we do to help the general public easily detect real versus fake news, given that readers are clearly drawn to the titles of fake news articles? Should real news articles adjust their titles? While I found some of the reflection obvious, many of the results from the syntax features were very interesting, such as fake news body text using less punctuation and fake news titles using fewer stop words but more proper nouns. Overall, this paper was very well written. It could have included more analysis of its results, but I really appreciate that the authors built a working SVM model for classifying fake and real news.
