Paper 1 : Reverse-engineering censorship in China: Randomized experimentation and participant observation
Paper 2 : Algorithmically Bypassing Censorship on Sina Weibo with Nondeterministic Homophone Substitutions
Together, the two papers cover two parts of a larger process: observation and counter-action.
The first paper follows the researchers as they try to understand how Chinese censorship works. In the absence of any official documentation, they use experimentally controlled anecdotal evidence to build a hypothesis about how “The Great Firewall of China” functions. They take well-known topics with a history of censorship and post discussions about them on various Chinese forums. They report some interesting observations: for example, posts about corruption cases, and posts criticising as well as praising the government, are censored more heavily than posts on sensitive topics like Tibet and border disputes. This could reflect the government's priorities in censorship, and it could help in bypassing censorship for the majority of sensitive cases.
Not being familiar with the language makes it difficult to understand the nuances that separate the surviving posts from the banned ones. Censorship mostly seems to depend on keyword matching, with other techniques playing a subsidiary role. That raises a question: could global advances in NLP research turn out to be more harmful than useful in this case? With the help of advanced NLP, censorship tools could go beyond individual words and infer context from the statements.
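To make the keyword-matching idea concrete, here is a minimal sketch. It assumes a plain substring match against a fixed blocklist; the blocklist contents and the name `is_flagged` are hypothetical, and English words stand in for Chinese terms. Real keyword lists are not public, and real systems very likely do more than this.

```python
# Hypothetical blocklist; English words stand in for Chinese terms.
BLOCKLIST = {"censor", "rights"}


def is_flagged(post: str, blocklist=BLOCKLIST) -> bool:
    """Return True if any blocklisted keyword occurs as a substring of the post."""
    text = post.lower()
    return any(keyword in text for keyword in blocklist)


if __name__ == "__main__":
    print(is_flagged("They censor everything"))  # contains "censor"
    print(is_flagged("The sensor failed"))       # same sound, different spelling
```

A filter like this sees only surface strings, so a post that sounds the same but is spelled differently passes through. A context-aware NLP model would not necessarily share that blind spot.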
This brings us to the second paper, on bypassing censorship. The authors use homophone substitutions to fool automated censorship tools. Language is again a barrier to fully grasping the effect of these substitutions. However, it can be inferred that only a limited number of substitutions are possible for each word being replaced. This becomes a problem if those substitutions grow popular, because the censorship tools can then easily ban them too. A recent example is the removal of the two-term limit on the presidency in China. People started criticising it using the mathematical notation 1, 2, ……, N to stand for an unbounded count, with messages like “Congratulating Xi Jinping for getting selected as President for the Nth time”. The censorship tools not only recognised the context of this very subtle joke but also blocked the letter N for some time. This shows that no matter how robust and covert a system is, once it gains enough traction it will come into focus and get banned. There is a need to find methods that cannot be countered even after they have been exposed.
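The substitution idea, and the way it runs out of options, can be sketched roughly as follows. This is not the authors' algorithm: the homophone table uses English words as stand-ins for Chinese characters, and the names `HOMOPHONES`, `substitute` and `remaining_variants` are hypothetical. The only assumptions are that each sensitive word has a small, fixed set of homophones and that one is picked at random each time.

```python
import random

# Hypothetical homophone table; English stand-ins for Chinese terms.
HOMOPHONES = {
    "censor": ["sensor", "censer"],
    "rights": ["rites", "writes"],
}


def substitute(post, homophones=HOMOPHONES, rng=None):
    """Replace each sensitive word with a randomly chosen homophone.

    Words are split on whitespace only, so attached punctuation is not handled.
    """
    rng = rng or random.Random()
    out = []
    for word in post.split():
        options = homophones.get(word.lower())
        out.append(rng.choice(options) if options else word)
    return " ".join(out)


def remaining_variants(word, banned, homophones=HOMOPHONES):
    """List the homophones of `word` that the censor has not yet banned."""
    return [h for h in homophones.get(word, []) if h not in banned]


if __name__ == "__main__":
    print(substitute("they censor our rights"))
    # Once popular variants are banned, the pool shrinks toward empty.
    print(remaining_variants("censor", banned={"sensor"}))
    print(remaining_variants("censor", banned={"sensor", "censer"}))
```

Random choice means the same post can come out differently each time, which makes any single variant harder to catch. The pool per word is still finite, though, so a censor that bans each variant as it becomes popular eventually leaves nothing to substitute.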