Reflection #12 – [04/05] – [John Wenskovitch]

This pair of papers returns our class discussions to linguistic analyses, including both sentiment detection using emoji (Felbo et al) and classifying online communities (Nguyen et al).  The emoji paper (not to be confused with the Emoji Movie) authors build a “DeepMoji” supervised learning model to classify the emotional sentiment conveyed in tweets with embedded emoji.  Using an immense multi-billion tweet dataset (curated down to just over a billion tweets), the authors build and experiment with their classifier, finding that the rich diversity of emotional labels in the dataset yields performance improvements over previous supervised emotion-detection studies.  The depression paper examined linguistic features of mental health support communities on Live Journal, seeking to understand some of the relationships between distinct communities (such as the Depression and Suicide groups).  In addition to presenting very detailed results, the authors clearly discuss those results and the limitations of their study.

The emoji paper was a tad difficult for me to read, in part because it focused so heavily on the ML approaches used to address this emotion-classification challenge, and in part because I’m just not a person who uses emoji.  From my limited understanding, much of their motivation appeared sound.  The one thing I wasn’t certain about was their decision to take tweets containing multiple instances of the same emoji and reduce them to a single instance of that emoji.  I have seen tweets with a single cry-smile that convey a slightly different but still related emotion than tweets with twelve cry-smiles.  In the text communication world, I see it as the difference between “lol” and “hahahahahaha” replies.  I’m curious how the performance of their classifier would have changed if they had more fully accounted for the semantics of repeated emoji.
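
To make that concern concrete, here is a minimal sketch – not the authors’ preprocessing code, and using a made-up example tweet and a rough emoji character range – contrasting the paper’s collapse-duplicates step with an alternative that also keeps the length of each emoji run as a feature:

```python
import re

# Runs of the same emoji character (a rough pictographic range; an assumption,
# not the paper's tokenizer).
EMOJI_RUN = re.compile(r"([\U0001F300-\U0001FAFF])\1*")

def collapse_repeats(tweet: str) -> str:
    """Reduce each run of a repeated emoji to a single instance (the paper's choice)."""
    return EMOJI_RUN.sub(r"\1", tweet)

def collapse_with_counts(tweet: str):
    """Alternative: collapse runs, but keep each run length so the 'lol' vs.
    'hahahahahaha' intensity signal is not thrown away."""
    counts = [len(m.group(0)) for m in EMOJI_RUN.finditer(tweet)]
    return EMOJI_RUN.sub(r"\1", tweet), counts

tweet = "that meeting \U0001F602\U0001F602\U0001F602\U0001F602"
print(collapse_repeats(tweet))      # "that meeting" plus a single cry-smile
print(collapse_with_counts(tweet))  # same text, plus the run length [4]
```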

That said, their dendrogram (Fig 3) showing the clustering of the DeepMoji model’s predictions contained some interesting relationships between pairs and sets of emoji.  For example, the various heart emoji at the right end appear in several different subgroups, with a few “bridge” emoji in between to connect those subgroups.  That isn’t an outcome I was expecting.  For the most part, though, happy emoji were self-contained in their own group, as were clusters that I’ll call sad emoji, celebratory emoji, and silly emoji.
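
For context, a dendrogram like their Figure 3 can be produced by hierarchically clustering the correlation structure of per-emoji prediction scores.  The sketch below is not the DeepMoji code – it uses a stand-in (tweets × emoji) probability matrix and placeholder labels – but it shows the general technique:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
emoji_labels = ["joy", "cry", "heart", "party", "silly"]  # placeholder label set
probs = rng.random((1000, len(emoji_labels)))             # stand-in for real model predictions

# Emoji whose predicted scores rise and fall together are "similar";
# turn the correlation matrix into a distance matrix and cluster on it.
corr = np.corrcoef(probs, rowvar=False)
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

tree = dendrogram(Z, labels=emoji_labels, no_plot=True)
print(tree["ivl"])  # leaf order places correlated emoji next to each other
```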

My biggest criticism of the depression paper is the same theme I’ve been raising all semester – getting all of your data from a single source introduces implicit biases into the results that you may not be aware of.  In the case of this study, all of the data came from Live Journal communities.  Having never belonged to that website, I cannot speak to which features could introduce problematic biases.  However, I can suggest possibilities like comment moderation as one dimension that could cause the linguistic features of these communities to differ between Live Journal and other community hubs.  Though the authors provided a page of limitations, this was not one of them.

I also liked that the authors compared their Lasso classification against three other classifiers (Naïve Bayes, SVM, and Logistic Regression) and reported results across all four.  I’m a big proponent of trying multiple classification techniques, determining which one works best, and then going back to the data to understand why.
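
As a sketch of that workflow – with synthetic data standing in for the paper’s Live Journal features, and an L1-penalized logistic regression standing in for their Lasso setup – the comparison itself is only a few lines of scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and community labels.
X, y = make_classification(n_samples=500, n_features=40, random_state=0)

models = {
    # An L1 penalty gives Lasso-style sparse feature selection in a classifier.
    "Lasso-style (L1 logistic)": LogisticRegression(penalty="l1", solver="liblinear"),
    "Naive Bayes": GaussianNB(),
    "SVM": LinearSVC(max_iter=5000),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:28s} mean accuracy = {scores.mean():.3f}")
    # The interesting follow-up is going back to the data to ask *why* one
    # model wins (e.g., inspecting which L1 coefficients stay nonzero).
```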

