Summary 1
In the first paper, Felbo et al. build an emoji predictor on an extremely large dataset of 1.2 billion tweets and obtain state-of-the-art performance on sentiment, emotion, and sarcasm detection tasks. Since the dataset is not already labeled, the authors use distant supervision as an alternative, treating the emojis occurring in the tweets as noisy labels. The authors demonstrate the success of their DeepMoji model on this dataset and transfer the learned representations to other target tasks. Transfer learning is achieved through a new approach they call "chain-thaw" that fine-tunes one layer at a time. The experiments section shows DeepMoji (with d = 1024) achieving a top-5 accuracy of 43.8%, roughly a 5% improvement over fastText's classifier. The benchmarking experiments also show DeepMoji (chain-thaw) outperforming state-of-the-art techniques on each specific dataset and task.
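To make the chain-thaw idea concrete, here is a minimal sketch of the procedure as I understand it (my own illustration, not the authors' implementation; it assumes a PyTorch model exposed as an ordered list of layers and a generic one-pass training loop):

```python
import torch

def train_pass(model, loader, loss_fn, lr=1e-4):
    # One pass over the data, updating only the currently unfrozen parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

def chain_thaw(model, layers, loader, loss_fn):
    # Freeze all parameters to start.
    for p in model.parameters():
        p.requires_grad = False
    # Fine-tune the new output layer first, then each layer from first
    # to last, one at a time.
    for layer in [layers[-1]] + list(layers):
        for p in layer.parameters():
            p.requires_grad = True
        train_pass(model, loader, loss_fn)
        for p in layer.parameters():
            p.requires_grad = False
    # Finally, unfreeze everything and fine-tune the whole model.
    for p in model.parameters():
        p.requires_grad = True
    train_pass(model, loader, loss_fn)
```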
Critique 1
The paper does numerous things really well. Firstly, their dataset is huge! This could indeed be one of the reasons for the success of their model. While the approach seems nearly perfect, I would love to know how long training a model on such a large dataset takes. Secondly, they built a website around their model (https://deepmoji.mit.edu/), and I really liked the way users can type in a sentence to obtain the emojis associated with it. It is interesting to note that this dataset was collected before Twitter shifted from 140-character to 280-character tweets, and the DeepMoji website refuses to process anything over 140 characters. I am not sure whether this is a limitation of the front-end or whether the model's accuracy diminishes beyond this limit. Finally, the paper is definitely more in the machine learning space than the computational social science space, at least in its current form. A good follow-up paper would be to use the DeepMoji model to detect bullying or trolls on Twitter (if they are more strongly associated with specific emojis). It is also nice to see the code and the model being open-sourced and easily available for other researchers to use.
Summary 2
In the second paper, Nguyen et al. use linguistic and topic-analysis features to classify subgroups of online depression communities. They study the social media platform LiveJournal (LJ), which is divided into multiple communities, each with several users posting about topics related to that community. The authors select a final cohort of 24 communities with 38,401 posts, which they group into five subgroups: depression, bipolar disorder, self-harm, grief/bereavement, and suicide. Their features include LIWC categories and weights from the corpus-topic and topic-word distributions. Using these features, they build four different classifiers and find that Lasso performs the best.
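For concreteness, a minimal sketch of what such a classifier over these features could look like, under my own assumptions (reading "Lasso" as L1-regularized logistic regression over concatenated LIWC and topic features; the function name, feature matrices, and hyperparameters are hypothetical, not taken from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def lasso_style_classifier(liwc_features, topic_features, labels):
    # liwc_features: (n_posts, 68) LIWC scores; topic_features: (n_posts, n_topics)
    # per-post topic proportions; labels: subgroup label for each post.
    X = np.hstack([liwc_features, topic_features])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    scores = cross_val_score(clf, X, labels, cv=5)  # 5-fold CV accuracy
    return clf.fit(X, labels), scores.mean()
```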
Critique 2
I had several problems with this paper. The motivation is confusing: the authors wish to analyze characteristics of depression, yet they immediately deviate from this problem statement. They then categorize five kinds of communities (depression, bipolar disorder, self-harm, grief, and suicide) but do not say why there are five categories rather than more or fewer. The dataset collected is small, and the authors do not provide any information about how it was labeled. If the authors labeled the communities themselves, this could have introduced bias into the training data, which could easily have been alleviated by using Amazon Mechanical Turk.
From my understanding of the features used, the authors run an LIWC analysis to obtain 68 psycholinguistic features and then compute a topic distribution for each post. They then run a feature-selection technique and report which features were important for four binary classifiers (depression vs. bipolar disorder, self-harm, grief, and suicide, respectively). Running feature selection and building four separate binary classifiers makes it difficult to interpret the model coefficients; the five communities could have been compared more directly with a single multi-class classifier. Furthermore, without inspecting the topics themselves, I could not understand their semantic meaning or why some topics received higher weights in certain classifiers. The authors also provide no justification for running LDA with 50 topics; they should have plotted perplexity against the number of topics and chosen the topic count by the elbow method, along the lines of the sketch below. Finally, I did not find any information about their train/test split or cross-validation procedure.
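As a sketch of the kind of check I have in mind (hypothetical code using scikit-learn rather than whatever toolkit the authors used; the document collection and candidate topic counts are placeholders), one could fit LDA for several topic counts and inspect held-out perplexity for an elbow:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def topic_count_elbow(docs, topic_counts=(10, 20, 30, 50, 75, 100)):
    # Return (k, held-out perplexity) pairs; plot them and pick the elbow.
    X = CountVectorizer(stop_words="english", min_df=5).fit_transform(docs)
    X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
    results = []
    for k in topic_counts:
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X_train)
        results.append((k, lda.perplexity(X_test)))
    return results
```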
Overall, it feels like this paper could do with a reworked dataset and more discussion. I was not left with any sense of what distinguishes depression from self-harm, bipolar disorder, and so on.