Reading Reflection #3 – 2/5/2019 – Bright Zheng

Early Public Responses to the Zika-Virus on YouTube: Prevalence of and Differences Between Conspiracy Theory and Information Video

Summary

This paper compares viewer responses to informative and conspiracy videos about an epidemic in its early phase (the Zika virus). The authors collected the 35 top-ranked YouTube videos and studied user activity as well as the comments and replies. The results show no statistically significant difference in user activity between the informative and conspiracy video types, and no difference in the sentiment of user responses to the two types. The only significant result is that neither type of video content promotes additional responding per unique user.

Reflection

This is a very well-written paper. It gives a detailed definition of conspiracy theories and lays out the problem’s background. The authors carefully explain how they obtained the data set and in which scenarios each of the statistical tests used in the research is appropriate.
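As a minimal sketch of the kind of group comparison the paper reports, the snippet below contrasts a per-video activity metric between the two video types with a nonparametric test. The counts are invented, and the choice of a Mann-Whitney U test is my assumption, not necessarily the test the authors used.

    # Sketch: compare a per-video activity metric between informative and
    # conspiracy videos. The counts are hypothetical; the Mann-Whitney U test
    # is an assumed choice, not necessarily the paper's exact procedure.
    from scipy.stats import mannwhitneyu

    informative_comments = [512, 88, 230, 1045, 67, 340, 150, 29, 410, 77]
    conspiracy_comments = [460, 120, 310, 900, 40, 205, 99, 510, 61, 88]

    stat, p_value = mannwhitneyu(informative_comments, conspiracy_comments,
                                 alternative="two-sided")
    print(f"U = {stat:.1f}, p = {p_value:.3f}")
    # A large p-value is consistent with the paper's finding of no significant
    # difference in user activity between the two video types.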

Unlike the previous papers we read for this class, this paper does not draw any significant insights from its tests: the final conclusion is that there are no significant differences in the selected features between informational and conspiracy-theory videos.

This conclusion definitely surprised me. The 12 collected conspiracy videos look very much like fake news: they are mostly negative and mention many proper nouns. So the fact that these conspiracy videos drew the “same” reactions as the informative ones does not mirror what is known about the comparison between real and fake news.

I initially thought the size of the data set (only 35 videos) was a limitation of this work. However, this research focuses solely on the first phase of the Zika-virus outbreak, so it might be difficult to collect a larger sample of videos. I also realized that people are very unlikely to browse past the top 35 YouTube results for any search query, let alone interact with the videos (like, dislike, comment, etc.).

One way to address this limitation could be to survey videos from different phases of the Zika-virus outbreak. This future work raises the following questions:

  • Does the topic weighting of the comments shift over time for both types of videos?
  • Does user activity on informative and conspiracy videos change across different phases?

In the first phase of any new epidemic outbreak, scientists know only a few facts. However, as more studies are conducted, scientists should gain a better understanding of the epidemic, and informative videos may shift in topic from consequences to causes. This topic shift in informative videos might also be reflected in the comments. We can already see that the comments on informative videos weigh more heavily on “affected stakeholders” and “consequences of Zika” than on “causes of Zika”, while comments on conspiracy videos focus more on the causes.
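If comments were collected for several outbreak phases, one rough way to look for such a shift is to fit a small topic model and compare the average topic weights per phase. Everything below (the toy comments, the phase split, the number of topics) is hypothetical and only meant to illustrate the idea.

    # Sketch: compare mean topic weights of comments across two outbreak phases.
    # The comments, phase labels, and topic count are all hypothetical.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    phase1_comments = [
        "so many babies were affected by this",
        "who will pay for the consequences",
        "the affected families need support",
    ]
    phase2_comments = [
        "mosquitoes are spreading the virus",
        "researchers found the cause of the outbreak",
        "vaccine studies explain how infection happens",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(phase1_comments + phase2_comments)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)

    n1 = len(phase1_comments)
    print("Phase 1 mean topic weights:", doc_topics[:n1].mean(axis=0))
    print("Phase 2 mean topic weights:", doc_topics[n1:].mean(axis=0))
    # A clear change in the mean topic distribution between phases would hint
    # at the consequences-to-causes shift described above.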

There might also be changes in user activity on the two types of videos. The content of conspiracy videos might stay consistent across phases, since conspiracy theories are all “explanations for important events that involve secret plots by powerful and malevolent groups.” The content of informative videos will certainly change as known facts and evidence accumulate, so people might become more willing to share and view informative videos.

We could also extend this study by surveying the first phase of other epidemics; however, YouTube might not be the best social platform if we want to survey a wide range of epidemics.

Automated Hate Speech Detection and the Problem of Offensive Language

Summary

This paper addresses automated hate speech detection, using a classifier to identify hate-speech and offensive-language tweets. The authors collected tweets from 33,458 Twitter users and selected 25k random tweets out of a total of 85.4 million. These tweets were then manually labeled as “hate speech”, “offensive language”, or “neither” to train supervised learning algorithms and models. The paper then discusses the output of the trained classifier.
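As a simplified stand-in for this kind of labeling-plus-classification pipeline (not the paper’s actual feature set or model), a bag-of-words classifier over the three labels might look like the sketch below; the example tweets and labels are invented.

    # Simplified stand-in for the labeling-plus-supervised-classifier pipeline.
    # The tweets and labels are invented; the paper's real features are richer.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    tweets = [
        "people like them should not be allowed to exist",   # hypothetical hate speech
        "that referee is a complete idiot",                   # hypothetical offensive language
        "great game last night, see everyone next week",      # neither
    ]
    labels = ["hate speech", "offensive language", "neither"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(tweets, labels)
    print(clf.predict(["what an idiot move by the coach"]))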

Reflection

The distinction between hate speech and offensive language is something I had never thought about, and I am surprised that there is research that systematically studies the difference between the two categories of language. I like how this paper demonstrates the difficulty of the topic by listing errors in previous studies. I do not like how brief the Model section is: it does not explain in detail why the authors chose these particular algorithms or which features the models use.

Because of the need for large amounts of crowdsourcing and outside resources (hatebase.org), there are many inaccuracies in both the dataset and the results. One obvious solution to the imprecision of the Hatebase lexicon is to find an outside resource that identifies hate-speech vocabulary more accurately. However, it might be difficult to find a “perfect” outside resource, since there is no formal definition of “hate speech”.

Figure 1 in the paper shows that the classifier is more accurate on the “Offensive” and “Neither” categories than on the “Hate” category. I wonder whether this is because of the strict definition of “hate speech”.

Future work on this topic may include applying other methods to the dataset. For example,

  • Can natural language processing techniques, such as named entity recognition/disambiguation, help the classifier determine what the triggering entity or event is before making a decision? (A rough sketch of this idea follows the list.)
  • Can non-textual features, such as location and registered race, help identify the context?
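The following sketch illustrates the first question: use named-entity recognition to tag what a tweet refers to and append the entity types as extra features for a downstream classifier. The use of spaCy and this particular feature scheme are my own choices for illustration, not something proposed in the paper.

    # Sketch: augment a tweet's text with named-entity-type tags that a
    # downstream classifier could use to reason about the target of the speech.
    # spaCy and this feature scheme are illustrative assumptions.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def with_entity_tags(text):
        """Append tokens such as ENT_GPE or ENT_ORG for each detected entity."""
        doc = nlp(text)
        tags = [f"ENT_{ent.label_}" for ent in doc.ents]
        return text + " " + " ".join(tags)

    print(with_entity_tags("Some people in New York really hate the Yankees"))
    # The added tags indicate whether the tweet targets a place, an organization,
    # or a group of people, which may help disambiguate the context.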
