Reflection #11 – [03/27] – [Jamal A. Khan] | CS6724 Spring18: Computational Social Science

Hiruncharoenvate, Chaya, Zhiyuan Lin, and Eric Gilbert. “Algorithmically Bypassing Censorship on Sina Weibo with Nondeterministic Homophone Substitutions.”
King, Gary, Jennifer Pan, and Margaret E. Roberts. “Reverse-engineering censorship in China: Randomized experimentation and participant observation.”

Both of the papers assigned for the next class are about Chinese censorship and, in a sense, share a heroic writing tone in how the idea is put forward. I didn’t quite like the way the ideas were staged, but that is irrelevant and subjective.

Regardless of the tone of the writing, or my likes and dislikes for that matter, I like the first paper’s idea of using the properties of the Chinese language itself as a deterrent against censorship. The complexity of the language has come as a blessing in disguise. Before I get into the critical details of the paper, I would say that the authors’ approach is sound and has been well demonstrated. Therefore, the reflection will focus on what can be done (or undone, in my case) using the paper as a base.

Since the title itself states that the purpose of the paper is to “bypass” the censorship, a natural question is “Does this method still work?”. A naive approach to breaking this scheme, or at the very least majorly cutting down the human cost that the authors talk about, would be to build a homophone replacement method. This is very much possible with the recent advances in word embedding schemes (referring to the works in [1], [2] and especially [3]) and their ability to detect similarity in the usage of words. These embeddings per se do not look at what a word is but rather at how it occurs, in order to deduce importance, similarity and, in the case of [3], hierarchy as well when mapping to a vector space of arbitrary dimension (meaning that they could deduce what a random smiley means as well). Hence, to these embeddings the homophones are very similar words IF they are used in the same context (which they are!). Since the solution proposed in the paper relies on the reader being able to deduce the meaning of the sentences from the context of the article or the situation/news trends, embeddings will be able to do so as well, if not better, and hence the system would censor the posts. So, I guess my point is that this method might be outdated now; the only overhead the censoring system would have to bear is the training of a new embedding model every day or so.
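To make the embedding argument a bit more concrete, here is a minimal sketch of the idea. It is not the method of either paper, and it does not use a real embedding model: plain co-occurrence counts stand in for the trained embeddings of [1] and [2], and the corpus, the keyword `protest` and its made-up substitute `protezt` are hypothetical English placeholders rather than actual Weibo data. The only point is that a substitute used in the same contexts as a blocked keyword ends up with a nearly identical context vector.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each token to counts of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def near_keyword(vectors, keyword, threshold):
    """Tokens whose context vector is at least `threshold`-similar to the keyword's."""
    target = vectors[keyword]
    return sorted(
        token for token, vec in vectors.items()
        if token != keyword and cosine(target, vec) >= threshold
    )


# Hypothetical toy corpus: "protezt" plays the role of a homophone substitute.
corpus = [
    "people join the protest in the square today",
    "people join the protezt in the square today",
    "police watch the protest near the square",
    "police watch the protezt near the square",
    "i cook the noodles at home tonight",
    "we eat the noodles at home",
]

vectors = cooccurrence_vectors(corpus)
sim_substitute = cosine(vectors["protest"], vectors["protezt"])
sim_unrelated = cosine(vectors["protest"], vectors["noodles"])
flagged = near_keyword(vectors, "protest", threshold=0.9)

print("substitute:", round(sim_substitute, 3))
print("unrelated:", round(sim_unrelated, 3))
print("flagged:", flagged)
```

In this toy setting the substitute is flagged while the unrelated word is not. Whether the same holds on real posts, with a real embedding model and far noisier contexts, is exactly the open question raised above.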

The other question is “Do the censorship practices still function the same way?”. Now that NLP tasks are being dominated by sequence models (deep learning models based on bi-directional RNNs, for example), it might be possible to automatically detect such posts even better now. I feel that this question is one that needs further exploration, and there is no direct answer.

Another natural question to ask would be: does this homophonic approach extend to other languages as well? For Urdu (Punjabi as well), English and to some extent Arabic, the languages which I know myself, I’m not too sure that such a variety of homophones exists. If it doesn’t, then a straightforward follow-up question is: can we develop language-invariant censorship avoidance schemes? I feel that this could be some very exciting work. Maybe some inspiration can be drawn from schemes such as [4].

The second paper, by King et al., is, I must say, pretty impressive. The amount of detail in the experiment design, the considerations taken into account and the way the results are presented are pretty much on point. Now, I’m not too familiar with Chinese censorship and its effects, so I can’t make much of the results. The thing that is surprising to me is that posts with collective action potential are banned while those critiquing the government are not. Why? Another surprising finding was the absence of a centralized method of censorship, and this leads me back to my original question: with newer NLP techniques powered by deep learning emerging, will the censor hammer come down harder? Will these digital terminators be more efficient at their job? In the unfortunate case that this dystopian scenario were to come true, how are we to deal with it?

I guess with both papers combined, an ethical question needs to be discussed: is censorship ethical? If no, then why? If yes, then under what circumstances and to what extent? It would be nice to hear other people’s opinions on this in class.

[1] Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/pdf/1301.3781.pdf

[2] GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/pubs/glove.pdf

[3] Poincaré Embeddings for Learning Hierarchical Representations: https://arxiv.org/pdf/1705.08039.pdf

[4] Unobservable communication over fully untrusted infrastructure: https://www.cs.utexas.edu/~sebs/papers/pung_osdi16_tr.pdf
