Reading Reflection #4 – [2/07/2019] – [Sourav Panth]

Analyzing Right-wing YouTube Channels: Hate, Violence and Discrimination

Summary:

The basis of this paper was to observe issues related to hate, violence, and discriminatory bias in a data set containing more than 7,000 videos and 17 million comments. The authors compared right-wing channels to a baseline set using a three-layered approach: analyzing the lexicon, topics, and implicit biases present in the text to find differences between users’ comments and video content. Their results show that right-wing channels tend to contain a higher proportion of words from “negative” semantic fields, raise more topics related to war and terrorism, and demonstrate more discriminatory bias against Muslims in videos and towards LGBT people in comments.
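To make the lexical layer concrete, here is a minimal sketch of how counting words from hand-picked semantic fields might look. This is not the authors' implementation, and the two tiny word lists are illustrative placeholders, not the lexicon the paper actually used.

```python
# Minimal sketch of a lexical-layer analysis: count hits from hand-picked
# semantic fields in a piece of text. The tiny word lists below are
# illustrative placeholders, not the lexicon the authors actually used.
SEMANTIC_FIELDS = {
    "aggression": {"destroy", "fight", "attack"},
    "sympathy": {"support", "care", "help"},
}

def field_scores(text: str) -> dict:
    words = text.lower().split()
    total = max(len(words), 1)
    # Normalize by text length so long captions and short comments compare.
    return {
        field: sum(w in vocab for w in words) / total
        for field, vocab in SEMANTIC_FIELDS.items()
    }

print(field_scores("they want to destroy and attack us"))
# {'aggression': 0.2857..., 'sympathy': 0.0}
```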

Reflection:

Something that I found very interesting was that the paper talks about YouTube’s recommendation algorithm not being neutral during the 2016 United States presidential election. This was actually very surprising to me because I thought YouTube would be neutral, since its content is mostly user-submitted. I figured the recommendation algorithm would rely entirely on the videos you watched in the past, without YouTube having a say in what shows up in the sidebar. Something interesting to look at would be the peak viewership of videos recommended by YouTube versus similar videos that were not promoted by the algorithm.

One possible source of inaccuracy is that they used only 15 categories to associate comments with negative feelings and 5 categories to associate comments with positive feelings. While I think this is probably fine to start, it seems like the results would be skewed towards the negative side with the current category set. Something that would concern me is that comments may be directed at the YouTube video itself rather than spewing hate in general. The authors also note that right-wing channels raise more topics related to war and terrorism, which isn’t necessarily a bad thing: discussions about war and terrorism don’t automatically mean hate or discrimination, and I think with a smaller list of categories some comments may be misidentified. In the future it may help to add more words for association and to even out the number of categories in the negative and positive sections.
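To illustrate the skew concern with made-up numbers: if a naive score simply sums hits across categories, the 15 negative categories get three times as many chances to fire as the 5 positive ones. Averaging per category is one possible correction.

```python
# Sketch of why an unbalanced category set skews a summed score, and one
# possible fix: average per category instead of summing. Hit counts are
# hypothetical.
negative_hits = {f"neg_{i}": 1 for i in range(15)}  # 15 negative categories
positive_hits = {f"pos_{i}": 1 for i in range(5)}   # 5 positive categories

summed = sum(negative_hits.values()) - sum(positive_hits.values())
print(summed)  # 10: looks strongly negative although every category fired once

mean_neg = sum(negative_hits.values()) / len(negative_hits)
mean_pos = sum(positive_hits.values()) / len(positive_hits)
print(mean_neg - mean_pos)  # 0.0: balanced once normalized per category
```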

Future Work:

There are a couple of ways that you could go about improving upon the research already done. First, I think it would be interesting to see a similar analysis done on left-wing YouTube channels, comparing the hateful topics of the left wing to those of the right wing. Second, I would check the number of right-wing videos recommended by YouTube versus the number of left-wing and neutral videos recommended. This would help show YouTube’s bias, and by looking at views you could also see what kind of influence YouTube has on its audience. Finally, one of the last things I would improve to get better results is the number of categories used to find comments associated with negative or positive reactions. In the paper they use 15 categories related to negative feelings and 5 categories related to positive feelings; looking through the words that they used, the associations seemed very narrow. I think using a larger number of categories to relate these feelings could result in more accurate sentiment analysis. Another thing to consider is whether comments that exhibit anger are directed at a certain group of people or at the YouTube video and its creator.


Reading Reflection #3 – [2/05/2019] – [Sourav Panth]

Automated Hate Speech Detection and the Problem of Offensive Language

Summary:

In this article the authors used crowd-sourcing to label a sample of tweets into three categories: hate speech, only offensive language, and neither. They then trained a multi-class classifier to distinguish between these categories. One of the key challenges is separating hate speech from other instances of offensive language.
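As a rough illustration of the classification step (not the authors' exact pipeline or feature set), a TF-IDF plus logistic-regression baseline over three classes might look like this, with placeholder tweets standing in for the crowd-labeled data:

```python
# Hedged sketch of a three-class tweet classifier (hate speech / offensive /
# neither), in the spirit of the paper but not its exact pipeline. The
# example tweets and labels below are placeholders, not real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LABELS = ["hate_speech", "offensive", "neither"]

tweets = ["placeholder hateful tweet", "placeholder rude tweet", "nice day today"]
labels = [0, 1, 2]  # indices into LABELS, from crowd-sourced annotation

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram + bigram features
    LogisticRegression(max_iter=1000),    # handles multi-class natively
)
model.fit(tweets, labels)
print(LABELS[model.predict(["another placeholder tweet"])[0]])
```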

Reflection:

Something that was very interesting to me was that both Facebook and Twitter have gotten a lot of criticism for not doing enough to prevent hate speech on their sites. They responded by instituting policies that prohibit attacking people based on characteristics like race, ethnicity, gender, and sexual orientation. Now, I know a major way of getting these tweets or Facebook posts taken down is based on user input, like reporting a post. I wonder if either of these platforms will take an autonomous route and remove posts automatically once they reach some quota for hate speech or offensive language. Something that would concern me about implementing this feature is that it could go too far. For example, a game that I play, Rainbow Six Siege, has very strict text limitations for in-game chat: you can get banned for saying words that are not offensive but register as hate speech or offensive language to the machine.

Overall, this was a very insightful article. Looking at it now, I realize that distinguishing hate speech from offensive language is difficult, but it’s not something that I would have thought about before reading the article.

Future Work:

There are two major factors that I think would contribute to distinguishing between hate speech, offensive language, and text containing neither. One would be seeing whether the language was quoted from song lyrics, and whether it was genuinely offensive or just someone appreciating a song; a rough sketch of such a check appears below. The second would be whether a cultural difference in language is what triggers some of the offensive-language or hate-speech words.
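The sketch below is purely hypothetical: it fuzzy-matches a tweet against a reference set of known lyric lines, and both the lyric snippets and the similarity threshold are made-up placeholders.

```python
# Hypothetical sketch of the lyric check: fuzzy-match a tweet against a
# reference set of known lyric lines. Lyrics and threshold are placeholders.
from difflib import SequenceMatcher

KNOWN_LYRIC_LINES = [
    "placeholder lyric line one",
    "placeholder lyric line two",
]

def quotes_lyrics(tweet: str, threshold: float = 0.8) -> bool:
    tweet = tweet.lower().strip()
    return any(
        SequenceMatcher(None, tweet, line).ratio() >= threshold
        for line in KNOWN_LYRIC_LINES
    )
```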

Early Public Responses to the Zika-Virus on YouTube: Prevalence of and Differences Between Conspiracy Theory and Informational Videos

Summary:

In this paper the authors wanted to examine the extent to which informational and conspiracy-theory videos differed in terms of user activity. They also wanted to check the sentiment and content of the user responses. They collected their data from the most popular videos posted on YouTube in the first phase of the Zika virus outbreak in 2016. The results show that 12 of the 35 videos in the data set focused on conspiracy theories; however, there were no statistically significant differences between the two types of videos.
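The core comparison reduces to a two-sample test on a user-activity metric per group. Here is a minimal sketch with made-up view counts; the paper's actual metrics and choice of test may differ.

```python
# Sketch of the group comparison: test whether an engagement metric differs
# between conspiracy and informational videos. View counts are made up.
from scipy.stats import mannwhitneyu

conspiracy_views = [1200, 800, 15000, 300, 2200]
informational_views = [900, 5000, 700, 11000, 450]

stat, p = mannwhitneyu(conspiracy_views, informational_views)
print(f"U={stat}, p={p:.3f}")  # a large p-value -> no significant difference
```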

Reflection:

Upon first reading, this paper seems fairly insignificant: many times the authors said that there were no statistically significant findings. Something interesting the authors did say was that YouTube videos often contain misinformation on health-related issues. I’d be curious to see what percentage of the YouTubers who released videos on the Zika virus were actually informed enough to make a video for the general public. I know the authors wanted to see if this could be generalized to other YouTube video categories as well, and I believe there will be a lot of similar issues, at least within the health field. Speaking from my own sample of videos I watch, there is a great deal of misinformation in videos related to powerlifting or bodybuilding. Often there aren’t enough facts to back a statement; it just comes down to personal preference and what works best for you as an individual. When it comes to information that varies from person to person, I wonder how they would decide what exactly counts as “misleading.”

Future Work:

Something that could further this research is a sample size greater than 35 videos, as well as seeing whether there are any additional features that could help distinguish between informational videos and misleading videos.


Reading Reflection #2 – [1/31/2019] – [Sourav Panth]

This Just In: Fake News Packs a Lot in the Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News

Summary:

The premise of this article was: “is there any systematic stylistic and other content differences between fake and real news?” To conduct their study, Benjamin D. Horne and Sibel Adalı used three different data sets. The first had been featured by BuzzFeed through its analysis of real and fake news items from the 2016 US election. The second was collected by the authors and contains news articles on US politics from real, fake, and satire news sources. The third contains real and satire articles from a past study. This allowed them to draw some conclusions, primarily that fake news is shorter and uses a lot of repetitive content.

Reflection:

First, I really enjoyed this paper a lot more than the first article we read; it was much more structured and organized in the questions it asked, and the authors didn’t go too broad. I also thought it was really interesting how they used feature engineering to detect whether an article is fake or real.

The authors talk about how real news articles are significantly longer than fake news articles, and how fake news articles use fewer technical words and quotes. This is not surprising at all to me; it’s hard to be technical and use quotes when the article is fake. I believe a lot of it also has to do with the fact that fake news is primarily used as clickbait, commonly as a source of revenue just from getting consumers to view ads on the site. While real news also gets ad revenue, one of its main goals is to inform the public. Because of this, real news is probably backed up with more data and facts, and may even have updates on the same page.

Another thing they found that was not surprising to me was that fake news articles often had longer titles than real news articles. This goes back to what I was talking about in the previous paragraph, where fake news publishers are just trying to catch the eye of the consumer and get them to click on the link. An example that they gave is that fake news will often use more proper nouns in the title; this fits the clickbait theory, because people will click on a link if it’s related to a celebrity or public figure they have an interest in. I’m not sure if this is what they were going for, but it kind of seems like their own title was longer than it needed to be as well.

Finally, the authors talk about fake news being more closely related to satire than to real news. Again, this doesn’t surprise me at all, because satire is essentially fake news that doesn’t advertise itself as real; its consumers know it’s fake and just for entertainment. One thing that really did surprise me was that they were able to distinguish whether an article was satire or fake news over 50% of the time.

Future Work:

I think the first thing that I would work on is figuring out whether the top four features they use, namely the number of nouns, lexical redundancy (TTR), word count, and number of quotes, are really the best features, and whether additional features could be added to increase accuracy. A sketch of how these four features might be computed follows.
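This is one possible way to compute the four features, not the authors' code; it assumes NLTK with the "punkt" tokenizer and "averaged_perceptron_tagger" data downloaded, and uses the type-token ratio as the redundancy measure.

```python
# Sketch of the paper's top four text features. Assumes NLTK with the
# "punkt" and "averaged_perceptron_tagger" data downloaded.
import nltk

def top_four_features(text: str) -> dict:
    tokens = nltk.word_tokenize(text)
    words = [t for t in tokens if t.isalpha()]
    tags = nltk.pos_tag(tokens)
    return {
        "word_count": len(words),
        # Type-token ratio: lower means more lexical redundancy.
        "ttr": len({w.lower() for w in words}) / max(len(words), 1),
        "num_nouns": sum(tag.startswith("NN") for _, tag in tags),
        # Crude proxy for quoted material: pairs of double-quote characters.
        "num_quotes": text.count('"') // 2,
    }
```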

Another thing that could be very interesting is seeing when fake news publishing peaks and whether that correlates with any important events, like the US election. This could help show whether fake news is a recent trend or has always been around but just not been publicly known.
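The timing check could start as simply as bucketing publication dates by week and looking for spikes near known events. A sketch with made-up dates:

```python
# Sketch of the timing check: bucket fake-news publication dates by ISO week
# so spikes (e.g., around an election) stand out. Dates are made up.
from collections import Counter
from datetime import date

published = [date(2016, 11, 1), date(2016, 11, 4), date(2016, 11, 7), date(2017, 2, 1)]
by_week = Counter((d.isocalendar()[0], d.isocalendar()[1]) for d in published)
print(by_week.most_common(3))  # e.g., [((2016, 44), 2), ...]
```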


Reading Reflection #1 – [1/29/2019] – [Sourav Panth]

Summary:

This paper talks about how a classifier was trained to sort Twitter accounts into different categories: organizations, journalists/bloggers, and consumers. The authors characterized the differences between these groups using Welch’s t-test and the Kolmogorov-Smirnov test. Their findings were very interesting: organizations tended to be more professional, while journalists use more of a personal style. Organizations also tend to share far more links than journalists.
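For context, Welch's t-test compares group means without assuming equal variances, while the Kolmogorov-Smirnov test compares entire distributions. Here is a minimal sketch applying both to one hypothetical per-account feature, the fraction of tweets containing links; the sample values are made up.

```python
# Sketch of the two tests named above applied to one per-account feature,
# e.g., fraction of tweets containing links. Sample values are made up.
from scipy.stats import ttest_ind, ks_2samp

org_link_rates = [0.61, 0.72, 0.55, 0.68, 0.70]
journalist_link_rates = [0.20, 0.31, 0.18, 0.25, 0.22]

t_stat, t_p = ttest_ind(org_link_rates, journalist_link_rates, equal_var=False)  # Welch's t-test
ks_stat, ks_p = ks_2samp(org_link_rates, journalist_link_rates)
print(f"Welch t p={t_p:.4f}, KS p={ks_p:.4f}")
```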

Reflection:

One of the first things that popped out at me was the fact that organizations seem to be a lot more reserved than journalists. If you think about it, it makes a lot of sense. I’m going to use BuzzFeed as an example because it is a website I follow closely. BuzzFeed as an organization probably does not want to be tied to certain political views or ideologies that could deter potential consumers from using its site or purchasing merchandise. A journalist for BuzzFeed, on the other hand, can tweet their opinion much more freely without drawing attention to the company as a whole, as long as they do so on their private Twitter account. Similarly, the statistic that organizations post three times as many links as journalists is likely tied to this as well. BuzzFeed wants to advertise the company on platforms like Twitter but doesn’t care about using Twitter itself as a form of information distribution: it posts links directing consumers to its official website, where information, blogs, and videos are posted for the user to see. Journalists, on the other hand, will probably use sites like Twitter to publish their personal opinion on a subject without linking anything.

Another big discrepancy between journalists and organizations is that they use different mediums to publish their tweets. Journalists seem to primarily use their phones, which does not surprise me at all. Going back to my previous example about a BuzzFeed journalist: if most of the time it is just a quick statement of opinion on a matter, then there is no need to use anything other than a phone, and with a phone they also have the capability to tweet wherever and whenever they want. Organizations use special Twitter applications a lot more than journalists do. This also makes sense to me; from my experience at the biomedical high-performance computing lab at Virginia Tech, the lab would often schedule its tweets so that there would be daily tweets at common peak times. As an organization, they must make sure that they’re putting out steady content to keep their consumers engaged. This would also explain why journalists tend to reply to their readers more than organizations do: while a journalist has their phone on them almost all the time, organizations often don’t check their replies; they just use Twitter as a platform to advertise the company and recent posts on their official website.

Future Work:

This article is really interesting to me because I am hoping to use data analytics to find misinformation within the news. I wasn’t exactly sure where to start, so it’s helpful to know that we can distinguish different types of users based on their posting and replying habits.

I do have a few additional questions that this work could answer in the future.

  • First, what features would we be able to use to distinguish different types of organizations, for example Wendy’s from CNN, given a big dataset of organization information? Word association seems to be a good start, but are there better solutions?
  • How could we differentiate organizations like Wendy’s, which reply to their readers and do not post links often, without mislabeling them as journalists? (A rough sketch of one direction follows this list.)
  • How would features be discovered for finding misinformation from these sources, whether they are journalists or organizations?
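On the second question, one rough direction would be to combine several behavioral signals rather than relying on reply rate or link rate alone, so a high-reply brand account can still be identified by other organization-like behavior such as scheduled posting. This is a toy sketch; the tweet fields and weights are hypothetical placeholders.

```python
# Toy sketch for the Wendy's-style edge case: combine several behavioral
# signals instead of relying on reply rate or link rate alone. The tweet
# fields and weights are hypothetical placeholders.
def org_score(tweets: list) -> float:
    n = max(len(tweets), 1)
    reply_rate = sum(t["is_reply"] for t in tweets) / n
    link_rate = sum(t["has_link"] for t in tweets) / n
    scheduled_rate = sum(t["via_scheduler"] for t in tweets) / n
    # Higher score -> more organization-like; weights would need tuning.
    return 0.4 * link_rate + 0.4 * scheduled_rate - 0.2 * reply_rate

# An account like Wendy's could still score organization-like via its
# scheduled-posting behavior even with a journalist-like reply rate.
```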

