Reflection #12 – [04/05] – [Ashish Baghudana]

Felbo, Bjarke, et al. “Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm.” arXiv preprint arXiv:1708.00524 (2017).
Nguyen, Thin, et al. “Using linguistic and topic analysis to classify sub-groups of online depression communities.” Multimedia Tools and Applications 76.8 (2017): 10653-10676.

Summary 1

In the first paper, Felbo et al. build an emoji predictor on an extremely large dataset of 1.2 billion tweets and obtain state-of-the-art performance on sentiment, emotion, and sarcasm detection tasks. Since the dataset is not labeled, the authors use the emoji occurring in each tweet as noisy labels, i.e., distant supervision. The authors demonstrate the success of their DeepMoji model on the emoji-prediction task and then transfer this knowledge to the target tasks. Transfer learning is achieved through a new approach they name “chain-thaw” that fine-tunes the network one layer at a time. The experiments section shows DeepMoji (with d=1024) achieving a top-5 accuracy of 43.8%, a 5% increase over fastText’s classification module. The benchmarking experiments also show DeepMoji (chain-thaw) outperforming the state-of-the-art techniques on each dataset and task.
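
To make the chain-thaw idea concrete, here is a minimal sketch of how such stage-wise fine-tuning could look in PyTorch. This is my own illustration, not the authors’ released code; `model`, `layers`, `train_loader`, and `train_epoch` are hypothetical stand-ins.

```python
import torch

def set_trainable(module, trainable):
    # Freeze or unfreeze every parameter in a module
    for p in module.parameters():
        p.requires_grad = trainable

def chain_thaw(model, layers, train_loader, epochs_per_stage=1):
    # Stage order: the new softmax layer first, then each earlier layer
    # from the bottom up, then the entire model (None marks the final stage).
    stages = [layers[-1]] + layers[:-1] + [None]
    for stage in stages:
        set_trainable(model, stage is None)   # all frozen unless final stage
        if stage is not None:
            set_trainable(stage, True)        # thaw only the current layer
        params = [p for p in model.parameters() if p.requires_grad]
        optimizer = torch.optim.Adam(params, lr=1e-4)
        for _ in range(epochs_per_stage):
            train_epoch(model, optimizer, train_loader)  # hypothetical training loop
```

The appeal of such a schedule is that each layer adjusts to the target task individually before the whole network is fine-tuned together.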

Critique 1

The paper does numerous things really well. Firstly, their dataset is huge! This could indeed be one of the reasons for the success of their model. While the approach seems nearly perfect, I would love to know how long training a model on such a large dataset takes. Secondly, they built a website around their model – https://deepmoji.mit.edu/ – and I really liked the way users can type in a sentence to obtain the emojis associated with it. It is interesting to note that the dataset was collected before Twitter raised its character limit from 140 to 280, and the DeepMoji website refuses to process anything over 140 characters. I am not sure if this is a limitation of the front-end, or if the model’s accuracy diminishes beyond this limit. Finally, the paper is definitely more in the machine learning space than the computational social science space, at least in its current form. A good follow-up paper would be to use the DeepMoji model to detect bullying or trolls on Twitter (if they are associated more with specific emojis). It is also nice to see the code and the model open-sourced and easily available for other researchers to use.

Summary 2

In the second paper, Nguyen et al. use linguistic and topic-analysis features to classify sub-groups of online depression communities. They choose to study the online social network LiveJournal (LJ). LJ is divided into multiple communities, and each community has several users posting about topics related to that community. The authors select a final cohort of 24 communities with 38,401 posts, which they group into 5 subgroups – depression, bipolar disorder, self-harm, grief/bereavement, and suicide. Their features include LIWC categories and weights from the corpus-topic and topic-word distributions. Using these features, they build four different classifiers and find that Lasso performs the best.
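
As a rough sketch of how these pieces fit together, one could concatenate the LIWC scores and topic proportions into a single feature matrix and fit an L1-penalized (“Lasso-style”) logistic regression with scikit-learn. The arrays below are random placeholders, not the paper’s actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
liwc = rng.random((1000, 68))               # placeholder: 68 LIWC scores per post
topics = rng.dirichlet(np.ones(50), 1000)   # placeholder: 50 topic proportions per post
y = rng.integers(0, 2, size=1000)           # placeholder: depression vs. other subgroup

X = np.hstack([liwc, topics])

# Lasso-style classification: the L1 penalty drives the coefficients of
# uninformative features to exactly zero, giving built-in feature selection.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
print(cross_val_score(clf, X, y, cv=10).mean())
```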

Critique 2

I had several problems with this paper. The motivation was confusing – the authors wish to analyze characteristics of depression, yet they immediately deviate from this problem statement. They then categorize five kinds of communities – depression, bipolar disorder, self-harm, grief, and suicide – but do not say why there are five categories rather than more or fewer. The dataset collected is small, and the authors do not provide any information about how it was labeled. If the authors labeled the communities themselves, they may have introduced bias into the training data, which could easily have been alleviated by using Amazon Mechanical Turk.

From my understanding of the features used, the authors run a LIWC analysis to get 68 psycholinguistic features and then compute the topic distribution for each post. They subsequently run a feature selection technique and show which features were important for four binary classifiers, i.e., depression vs. each of bipolar disorder, self-harm, grief, and suicide. Running feature selection and building four separate binary classifiers makes it difficult to interpret the coefficients of the model; the five communities could have been compared more directly with a single multi-class classifier. Furthermore, I could not understand the semantic meaning of the topics, or why they had higher weights for some classifiers, without looking at the topics themselves. The authors also do not provide any justification for running LDA with 50 topics; they should have plotted perplexity against the number of topics and chosen the number by the elbow method, as sketched below. Finally, I did not find any information about their train-test split or cross-validation process.
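
For reference, here is a small sketch of the perplexity-vs.-topics check I have in mind, using scikit-learn’s LDA on a synthetic document-term matrix (a placeholder for the actual corpus):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
dtm = rng.poisson(1.0, size=(200, 500))  # placeholder document-term counts

train_dtm, test_dtm = train_test_split(dtm, test_size=0.2, random_state=0)

for k in [10, 20, 30, 50, 75, 100]:
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(train_dtm)
    # Plot k against held-out perplexity and pick the elbow (lower is better)
    print(k, lda.perplexity(test_dtm))
```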

Overall, it feels like this paper could do with a reworked dataset and more discussion. I was not left with any sense of what distinguishes depression from self-harm, bipolar disorder, and so on.


Reflection #12 – [04/05] – [John Wenskovitch]

This pair of papers returns our class discussions to linguistic analyses, including both sentiment detection using emoji (Felbo et al.) and classifying online communities (Nguyen et al.). The emoji paper’s authors (not to be confused with the Emoji Movie) build a “DeepMoji” supervised learning model to classify the emotional sentiment conveyed in tweets with embedded emoji. Using an immense multi-billion-tweet dataset (curated down to just over a billion), the authors build and experiment with their classifier, finding that the rich diversity of emotional labels in the dataset yields performance improvements over previous supervised emotion-learning studies. The depression paper examines linguistic features of mental health support communities on LiveJournal, seeking to understand some of the relationships between distinct communities (such as the Depression and Suicide groups). In addition to very detailed results, the authors clearly discuss their results and the limitations of their study.

The emoji paper was a tad difficult for me to read, in part because it focused so heavily on the ML approaches used to address this emotion-detection challenge, and in part because I’m just not a person who uses emoji. From my limited understanding, much of their motivation appeared sound. The one thing that I wasn’t certain about was their decision to take tweets with multiple instances of the same emoji and reduce them to a single instance of that emoji. I have seen tweets that use a single cry-smile that are trying to convey a slightly different but still related emotion than tweets that use twelve cry-smiles. In the text-communication world, I see it as the difference between “lol” and “hahahahahaha” replies. I’m curious how the performance of their classifier would have changed if they had taken the semantics of repeated emoji into account; a toy version of the two preprocessing choices is sketched below.
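
To make the distinction concrete, here is a toy illustration of the collapse-to-one rule versus keeping a coarse repetition marker. This is purely my own sketch, not the authors’ tokenizer, and the regex range only covers part of the emoji block.

```python
import re

# Matches a run of the same emoji repeated two or more times
EMOJI_RUN = re.compile(r"([\U0001F300-\U0001FAFF])\1+")

def collapse(text):
    # Paper-style preprocessing: "😂😂😂" -> "😂"
    return EMOJI_RUN.sub(r"\1", text)

def collapse_with_marker(text):
    # Alternative: "😂😂😂" -> "😂 <rep>" keeps an intensity signal
    return EMOJI_RUN.sub(r"\1 <rep>", text)

print(collapse("so funny 😂😂😂"))              # so funny 😂
print(collapse_with_marker("so funny 😂😂😂"))  # so funny 😂 <rep>
```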

That said, their dendrogram (Fig. 3) showing the clustering of the DeepMoji model’s predictions contained some interesting relationships between pairs and sets of emoji. For example, the various heart emoji at the right end appear in several different subgroups, with a few “bridge” emoji in between to connect those subgroups. That isn’t an outcome I was expecting. For the most part, though, happy emoji were self-contained in their own group, as were clusters that I’ll call sad emoji, celebratory emoji, and silly emoji.
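
Out of curiosity, here is roughly how such a dendrogram can be built: hierarchically cluster the emoji by how correlated the model’s prediction scores for them are across tweets. The prediction matrix below is random noise standing in for real model output, so this is only a sketch of the technique.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
preds = rng.random((1000, 64))  # placeholder: per-tweet scores for 64 emoji

# Correlation distance between emoji columns: emoji that the model predicts
# together get a small distance and end up in the same subtree.
dist = 1.0 - np.corrcoef(preds.T)
condensed = dist[np.triu_indices(64, k=1)]  # condensed pairwise-distance form
dendrogram(linkage(condensed, method="average"))
plt.show()
```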

My biggest criticism of the depression paper is the same theme I’ve been suggesting all semester – getting all of your data from a single source introduces implicit biases into the results that you may not be aware of. In the case of this study, all of the data came from LiveJournal communities. Having never belonged to that website, I cannot speak to which features could cause problematic biases. However, I can suggest possibilities like comment moderation as one dimension that could cause the linguistic features of these communities to differ between LiveJournal and other community hubs. Though the authors provide a page of limitations, this is not one of them.

I also liked that the authors compared their Lasso classification with three other classifiers (Naïve Bayes, SVM, and Logistic Regression) and reported results across all four. I’m a big proponent of trying multiple classification techniques, determining which one works best, and then going back to the data to understand why; a quick sketch of such a comparison appears below.
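
As an illustration of that workflow, a comparison like theirs takes only a few lines with scikit-learn. The features, labels, and classifier settings below are placeholders of my own choosing, not the paper’s configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 118))        # placeholder: e.g., 68 LIWC + 50 topic features
y = rng.integers(0, 2, size=500)  # placeholder binary labels

classifiers = {
    "Lasso (L1 logistic)": LogisticRegression(penalty="l1", solver="liblinear"),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(),
}

# Evaluate every classifier on identical cross-validation splits
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```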
