The making of “How Chinese state media covered news in the early days of the epidemic”

Szu Yu
Follow the Breadcrumbs
7 min read · May 1, 2020

I finally published my new project a few days ago, which analyzed 3,222 tweets from the People’s Daily to get a sense of how the Chinese state media covered news in the early days of the epidemic.

Check it out here!

Here’s how I made the graphics.

Background

Initially, I used Python to crawl daily Chinese news from the People’s Daily website. I got thousands of news stories and realized that there was no way I could translate them all by myself.

Eventually, I decided to go with the “English posts” published by People’s Daily on Twitter.

This analysis reveals that, during the COVID-19 outbreak, the People’s Daily avoided associating words like “deaths” or “died” with China. The newspaper also covered far more news about other countries than about China, shifting the story away from its shores.

(If you want to jump directly to the website, click here!)

Data Collection and Cleaning

I used Tweepy to access the Twitter API.
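For reference, fetching a user’s timeline with Tweepy (the v3-era API, current when this was written) looks roughly like the sketch below. The credentials are placeholders, and the screen name and 3,300-tweet cap are my assumptions, not details from the project:

```python
import tweepy  # third-party: pip install tweepy

# Placeholder credentials -- get your own from the Twitter developer portal.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_timeline(api, screen_name, limit=3300):
    """Page through a user's timeline, newest first, as (date, text) pairs."""
    cursor = tweepy.Cursor(api.user_timeline,
                           screen_name=screen_name,
                           tweet_mode="extended",  # untruncated 280-char text
                           count=200)              # tweets per API page
    return [(status.created_at, status.full_text)
            for status in cursor.items(limit)]

# tweets = fetch_timeline(api, "PDChina")  # assumed People's Daily handle
```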

In Python, I removed punctuation, usernames and URLs, tokenized each sentence into individual words, and removed stop words like the, a, is, am and are.

Google “text cleaning in Python,” and you will see a variety of similar data cleaning processes. Although they may oversimplify the original context of the texts and even distort some semantics, they can help us view the data from a different perspective.
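A minimal version of that pipeline might look like this; the stop-word list here is truncated for illustration (a real one, such as NLTK’s, runs to well over a hundred words):

```python
import re

# Tiny illustrative stop-word list; use a full list (e.g. NLTK's) in practice.
STOP_WORDS = {"the", "a", "an", "is", "am", "are", "to", "of", "in", "and"}

def clean_tweet(text):
    """Lower-case, strip URLs/@usernames/punctuation, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # @usernames
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation, digits, '#'
    return [word for word in text.split() if word not in STOP_WORDS]

print(clean_tweet("The number of confirmed cases in #Wuhan is rising https://t.co/x @PDChina"))
# -> ['number', 'confirmed', 'cases', 'wuhan', 'rising']
```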

Data Analysis

Usually, I list the questions I am most curious about before I dig into a dataset.

For example:

  • What were the top words? Can we categorize tweets into different groups based on the top five or 10 words?
  • Trends: Did any of the top words get mentioned a lot in January, but then vanish in March? Or vice versa?
  • Sentiment analysis: Did tweets referencing specific words have more positive or negative sentiments?

By adding up the number of uses of each word, I got the top words across all tweets.
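With tokenized tweets in hand, the summing is a few lines with Python’s `collections.Counter`:

```python
from collections import Counter

def top_words(token_lists, n=10):
    """Add up the uses of each word across all tweets; return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)

tweets = [["china", "cases"], ["china", "deaths"], ["cases", "china"]]
print(top_words(tweets, 2))  # -> [('china', 3), ('cases', 2)]
```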

The trends of words tied to the outbreak, such as cases, total, number and deaths, caught my attention.

China’s COVID-19 outbreak began in January — but these words were not used often in People’s Daily tweets at that time.

It was not until after mid-February, when the epidemic hit other nations, that mentions of these words surged.

(Using raw “counts” to represent word frequency is not accurate enough, since the number of tweets varies from day to day. In the official article, I calculated the “percentage” of uses of a word per day.)
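That per-day share can be computed like this (a sketch with made-up dates and tokens; the real dates would come from each tweet’s timestamp):

```python
from collections import Counter

def daily_share(dated_tokens, word):
    """Percentage of a day's total words accounted for by one word."""
    word_uses, totals = Counter(), Counter()
    for day, tokens in dated_tokens:
        totals[day] += len(tokens)
        word_uses[day] += tokens.count(word)
    return {day: 100 * word_uses[day] / totals[day] for day in totals}

data = [("2020-01-05", ["wuhan", "cases", "cases"]),
        ("2020-02-20", ["cases", "deaths"])]
print(daily_share(data, "cases"))  # "cases": 2 of 3 words, then 1 of 2
```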

This discovery led me to my next question: How do People’s Daily tweets about China differ from those referencing other countries? Would the most mentioned words change if we divided the tweets into different groups?

Visualizing Data

Finally, here comes my favorite part!

I will focus on the triangular scatterplot, which went through numerous iterations.

My first idea was to use a word tree to illustrate words that are frequently used before or after a country name.

For example, in the above chart, I put “China” as the center node. The left and right sides of the center node show the most common words before and after “China.” I created a visual prototype with the word tree tool built by Jason Davies.

However, one big problem emerged.

Although China was the most mentioned word across all tweets, there were too many different words before and after “China.” The visualization turned out to be a chart with multiple long sentences.

So I started experimenting with other visual prototypes.

I came up with a series of ideas to present the top words and their follow-up words. The top line chart shows the confirmed cases since Jan. 1, while the bottom line chart displays the number of uses of a selected word.

I decided to go further with the third graph, the one with lots of unlinked circles, because it demonstrates the spatial distance among words.

Each circle represents a word. The distance between two circles reflects how often those words were used together.

Transforming text into vectors is nothing new. Laniakea, a beautiful project by Fathom, utilizes similar concepts.

Screenshot from Laniakea

So my next question was: how do I calculate the distance among words based on their co-occurrence?

I used word2vec from Gensim to convert the tokenized words into vectors, giving each word an x-y position, but the result was not what I expected. (Here is a great Medium post by Jayesh Bapu Ahire introducing word vectors.)

Word2vec tends to group similar words under one category. For example, all the numbers were clustered in one area, and all the nations’ names in another.
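For the record, the word2vec attempt looked roughly like the sketch below (gensim 4 syntax; the 2020-era gensim 3 named the `vector_size` parameter `size`). The PCA step is my assumption about one typical way to flatten the vectors to x-y positions:

```python
from gensim.models import Word2Vec     # third-party: pip install gensim
from sklearn.decomposition import PCA  # third-party: pip install scikit-learn

# Toy corpus of tokenized tweets; the real input was the cleaned tweet tokens.
sentences = [["china", "confirmed", "cases"],
             ["italy", "confirmed", "deaths"],
             ["china", "wuhan", "lockdown"]]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)

# Flatten each 50-d word vector down to an x-y position for plotting.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])
for word, (x, y) in zip(words, coords):
    print(word, round(float(x), 3), round(float(y), 3))
```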

I tried to calculate the distances among the words myself but failed. (And I still don’t know how to do so, LOL. If anyone has ideas on how to do that, please send me a note!)

In the end, I pivoted to calculating the distance between “each word” and “each tweet group.”

I collected tweets containing one or more country names and divided them into three groups: tweets referencing only China (or its provinces); tweets referencing other countries but not China; and tweets mentioning both China and one or more other countries.
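The grouping can be sketched as below; both word sets are short stand-ins for the much longer lists of country and province names the real split would need:

```python
# Illustrative word sets -- stand-ins for full country/province name lists.
CHINA_TERMS = {"china", "wuhan", "hubei", "beijing"}
OTHER_COUNTRIES = {"italy", "iran", "japan", "korea"}

def classify(tokens):
    """Sort a tokenized tweet into one of the three country groups."""
    has_china = any(word in CHINA_TERMS for word in tokens)
    has_other = any(word in OTHER_COUNTRIES for word in tokens)
    if has_china and has_other:
        return "china_and_others"
    if has_china:
        return "china_only"
    if has_other:
        return "others_only"
    return None  # no country mentioned -- left out of this comparison

print(classify(["wuhan", "lockdown"]))        # -> china_only
print(classify(["cases", "surge", "italy"]))  # -> others_only
print(classify(["china", "aids", "italy"]))   # -> china_and_others
```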

To compare the proportion of uses of each word in three categories, I decided to go with a triangle plot.

I’ve seen triangle plots being used by some well-known publications. Reuters used it in a story about the impact of Brexit. The Pudding included it here to present how news channels cover news differently.

Taking the temporal dimension into account, I colored each circle based on the month the tweet was posted.

I referred to the triangular scatterplot created by Chris Given in d3.js, tweaked the code a bit and added the gradient lines connecting circles representing the same word.

In case you’re wondering how I positioned the circles, here’s the explanation.

I calculated the monthly percentage of uses of the top 300 words in each tweet group, normalizing the numbers based on the total number of words used by each tweet group.

For example, in January, “confirmed” showed up 18 times in tweets referencing only China, 4 times in tweets mentioning other countries but not China, and 2 times in tweets with both China and one or more other countries. Its unnormalized percentage in the China-only group is 75% (18 / (18 + 4 + 2)).

In January, the total word count in the tweet group referencing only China was 5,113; in the group mentioning countries other than China, 865; and in the group with both China and one or more other countries, 314. I calculated the normalized percentage of the same word in the China-only group as:

18*865*314 / (18*865*314 + 4*5113*314 + 2*865*5113) = 24.3%

(Multiplying each count by the other two groups’ totals is the same as dividing each count by its own group’s total, then taking that rate’s share of the three rates.)

The normalized percentage of “confirmed” in the group mentioning countries other than China is 31.9%, and in the group with both China and one or more other countries, 43.9%.
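A quick sketch that reproduces the January figures for “confirmed”, computing each group’s rate (count divided by its own group’s total) and taking each rate’s share of the sum:

```python
def normalized_share(counts, totals):
    """Rate of a word in each group (count / group total), expressed as
    each rate's percentage share of the three rates combined."""
    rates = [count / total for count, total in zip(counts, totals)]
    return [100 * rate / sum(rates) for rate in rates]

# "confirmed" in January: per-group counts and per-group total word counts.
shares = normalized_share(counts=[18, 4, 2], totals=[5113, 865, 314])
print([round(share, 1) for share in shares])  # -> [24.3, 31.9, 43.9]
```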

To show how a circle’s position changes from month to month, I connected the circles representing the same word. When users hover over a circle, they can see the word’s trajectory as it moves from one position to another across the months.
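As for how three shares become one x-y position: a standard way to plot a ternary scatterplot (not necessarily the exact code behind the published chart) is a barycentric mix of the triangle’s corner coordinates, weighted by the shares:

```python
import math

# Corners of an equilateral triangle, one per tweet group (illustrative layout).
VERTICES = {
    "china_only":       (0.0, 0.0),
    "others_only":      (1.0, 0.0),
    "china_and_others": (0.5, math.sqrt(3) / 2),
}

def ternary_point(shares):
    """Map group shares (percentages summing to 100) to an (x, y) in the triangle."""
    x = sum(shares[group] / 100 * vx for group, (vx, vy) in VERTICES.items())
    y = sum(shares[group] / 100 * vy for group, (vx, vy) in VERTICES.items())
    return x, y

x, y = ternary_point({"china_only": 24.3, "others_only": 31.9, "china_and_others": 43.8})
print(round(x, 3), round(y, 3))  # -> 0.538 0.379
```

A word sitting at a corner was used exclusively in that group; a word at the center was used equally across all three.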

Last but not leaaaaaast

A huge thank you to Steven Braun for offering me guidance and inspiration along my design process.

I would also like to thank Matthew Carroll and Aleszu Bajak for their edits and valuable feedback, and John Wihbey for his helpful advice.

And special thanks to Felippe Rodrigues for answering questions regarding my very inefficient code!
