On different sides of China’s internet firewall: Sina Weibo and Twitter

Center for Data Science fellow Bruno Gonçalves tackles trending topics on both sides of China’s firewall

Just how much does China’s internet firewall affect what Chinese users inside and outside of the wall discuss on social media? Center for Data Science fellow Bruno Gonçalves, along with Northeastern University postdoctoral researcher Qian Zhang, has been investigating this very question by comparing trending Chinese topics on Sina Weibo (China’s most popular microblogging platform) and Twitter. After collecting 216.8 million Weibo messages and 12.3 million tweets in simplified and traditional Chinese sent during 2012, Gonçalves and Zhang wrote an algorithm that identifies significant keywords in the messages so that they can be grouped by topic.

But how exactly does their algorithm work? Key to their research process is combining two measures: term frequency (TF) and inverse document frequency (IDF). TF measures how often a term appears in a message, while IDF measures how rare that term is across all messages, giving more weight to terms that appear in only a few of them. Tempering TF’s results with IDF’s calculations is crucial, Zhang explains, because using TF alone would cause words like ‘a’ or ‘this’ to be perceived as ‘topics’ even though they are simply common words that would naturally be found in all messages. Similarly, using IDF without TF would mean that every single infrequent term would become a topic of its own, even though multiple terms could refer to the same topic. “For example, nowadays ‘#debate’ and ‘#debatenight’ refer to the exactly same topic, and ‘#nastywomen2016’ apparently also refers to the final presidential debate,” says Zhang.
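To make the interplay between TF and IDF concrete, here is a minimal sketch of the textbook TF-IDF score. It is not the researchers’ actual code: the function name `tf_idf`, the toy English messages, and the whitespace tokenization are all assumptions for illustration (real Chinese text would first need word segmentation), and the exact weighting used in the study may differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Textbook formulation (an assumption, not the study's exact formula):
    score = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy messages, split on whitespace for simplicity.
messages = [
    "this is a debate".split(),
    "this is a game".split(),
    "this is a debate night".split(),
]

scores = tf_idf(messages)

# The highest-scoring term in each message is its most distinctive keyword.
top_terms = [max(s, key=s.get) for s in scores]
```

In this toy corpus, words such as ‘this’ and ‘a’ appear in every message, so their IDF factor is log(1) = 0 and they drop out as keyword candidates, while rarer words such as ‘game’ receive the highest scores.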

Weighing how frequent a word is within a message against how rare it is across all messages means that the algorithm can cluster messages that refer to the same topic even if they use different hashtags, or contain no hashtags at all. The accuracy and nuance of Gonçalves and Zhang’s approach are especially compelling because previous studies clustered messages based on hashtags alone.

After crafting this powerful algorithm, Gonçalves and Zhang clustered the Chinese messages on Sina Weibo and Twitter according to their topics. Unsurprisingly, they found that top Chinese topics on Weibo and Twitter were drastically different. Sina Weibo focused on entertainment, such as singers, actors and games; the number one topic was a game titled “三国来了,” or “The Three Kingdoms Have Arrived.” But on Chinese Twitter, the number one topic was “Chen Guangcheng,” a Chinese civil rights activist, and the top ten was dominated by political topics like “Free Tibet” and “Wukan Protest.”

As Zhang explains, the very fact that Twitter is restricted in China might have attracted people who wished to “rebel against the system,” which might in turn have shaped the political culture of Chinese Twitter. Interestingly, messages concerning political topics were “practically nonexistent” on Weibo. Their results support the general belief that users who share the same linguistic background but do not live in the same place will display different interests on social media because they are exposed to different cultural influences.

Originally published at cds.nyu.edu on November 8, 2016.