[Preview] Developing Modern Chinese NLP Models

Step 1: Character-Level ULMFiT Models with Movie Review Sentiment Analysis Task

Ceshine Lee
Aug 20, 2018 · 10 min read

20190425 Update: This project has lost its purpose since the moment BERT released its multilingual and Chinese versions of pretrained models. Readers are advised to check out those models and other similar projects (e.g. Universal Sentence Encoder).

This is an early preview of ongoing work on open-source modern Chinese NLP models that can be easily transferred to a wide range of tasks.

(Because most of the techniques used here also apply to other languages that usually require word segmentation (e.g. Japanese, Thai), I chose to write in English to reach a broader audience.)


As introduced in my previous post ([Notes] Improving Language Understanding by Generative Pre-Training), transfer learning in NLP has become a really exciting field of study. A recent blog post by the feedly team shows that transfer learning can achieve good performance with a very limited amount of labeled data:

We find that with only 1000 examples the model is able to match the accuracy score obtained by training a FastText model from scratch on the full dataset, as reported on the Kaggle project home page. With 100 labeled examples only, the model is still able to get a good performance.

This is actually very good news for believers in liberal democracy. With much less data required to train a decent NLP algorithm (often simply referred to as “AI” these days), it’ll be harder for big corporations and governments to control the market and its data. According to Yuval Noah Harari, this kind of democratization of technology is very important in the fight against rising authoritarianism:

So what can we do to prevent the return of fascism and the rise of new dictatorships? The number one question that we face is: Who controls the data? If you are an engineer, then find ways to prevent too much data from being concentrated in too few hands.

That’s what motivated me to open source this project which tries to provide the tools and tutorials necessary to do modern Chinese NLP with transferable models:


This project is still at a very early stage of development. This post describes the first working prototype, which uses off-the-shelf models and training functions from the Fast.ai library. (That’s why this post has a [Preview] label in the title.) I plan to remove the Fast.ai dependency and experiment with other model structures, such as the transformer decoder covered in the previous post.

In this prototype, two datasets were used:

  1. Chinese Wikipedia articles (from the official data dumps.)
  2. Douban movie reviews (scraped in a way similar to the one in this Zhihu post.)

Training can be decomposed into three stages:

  1. Universal language model pre-training using Wikipedia data.
  2. Language model fine-tuning using Douban data.
  3. Sentiment classification/regression fine-tuning using Douban data.

The movie reviews come with a 1 to 5 star rating. One way to frame the problem is to simplify it into a 3-class classification problem, as in this Zhihu post (【实战NLP】豆瓣影评情感分析); another is to treat it as a regression problem. The prototype implements both approaches, but we currently only provide pre-trained weights for the regression model.

The ULMFiT Approach

Why Character-level Model

The prototype uses character-level models; that is, we treat each character as an individual token. The advantages include:

  1. We no longer need to do word segmentation. Word segmentation, especially when done poorly, introduces additional errors and noise that can hurt the training of downstream models.
  2. It handles rare words more easily. Even rare words usually consist of common characters, and their meaning can sometimes be inferred from the meanings of those characters.

But it also has some problems:

  1. Requires longer memory. LSTM models may have trouble keeping track of long-term dependencies.
  2. Puts more burden on the model. The character-level approach essentially folds the word segmentation task into the model objective, which adds complexity to the model.

We also plan to provide word-level and sub-word-level models in the next few development iterations.

Universal Language Model Pre-training

Preparing the Dataset

Download the article dump (with the filename pattern of zhwiki-YYYYMMDD-pages-articles.xml.bz2) from here:

Use gensim to extract article sections from the XML dump into a (compressed) JSON file:

python -m gensim.scripts.segment_wiki -w 4 -f zhwiki-20180801-pages-articles.xml.bz2 -o zhwiki-latest.json.gz

Then we use opencc-python to convert the traditional Chinese characters in the corpus to simplified Chinese characters. One simplified character can correspond to multiple traditional characters, so converting in the other direction is much harder and can easily introduce noise.
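
To see why the direction matters, here is a toy illustration. It uses a tiny hand-picked lookup table of my own (not the OpenCC dictionaries, and the function names are hypothetical): traditional-to-simplified is a plain per-character lookup, while the reverse lookup can return several candidates that only context can disambiguate.

```python
# Toy table for illustration only; the real conversion is done by OpenCC.
T2S = {
    "發": "发",  # as in 發展 (to develop)
    "髮": "发",  # as in 頭髮 (hair)
    "後": "后",  # as in 以後 (afterwards)
    "后": "后",  # as in 皇后 (empress)
    "頭": "头",
}


def to_simplified(text):
    """Traditional -> simplified: every character has exactly one target."""
    return "".join(T2S.get(ch, ch) for ch in text)


def traditional_candidates(ch):
    """Simplified -> traditional: one character may have several sources."""
    return sorted(t for t, s in T2S.items() if s == ch) or [ch]


print(to_simplified("頭髮"), to_simplified("發展"))
print(traditional_candidates("发"))  # two candidates: 發 and 髮
```

Picking between 發 and 髮 for a given 发 requires looking at the surrounding word, which is where a simplified-to-traditional converter can go wrong.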

The next step is to clean the corpus, which involves removing some MediaWiki markup, extra spaces, text inside parentheses, etc. This part is somewhat subjective; you can choose your own way of cleaning according to your use case. Here are some examples:

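import re
# Strip MediaWiki bold/italic quote markers ('' and ''').
text = re.sub(r"'''?", "", text)
# Drop parenthesized text (non-greedy, so text between two separate pairs is kept).
text = re.sub(r"\(.*?\)", "", text)
# Drop -{...}- language-variant conversion markup (also non-greedy).
text = re.sub(r"\-\{.*?\}\-", "", text)
# Drop empty book-title marks.
text = re.sub(r"《》", "", text)
# Drop leftover link=... attributes and File:...| prefixes.
text = re.sub(r"link=\w+\s", " ", text)
text = re.sub(r"File:.+\|", " ", text)
# Collapse runs of whitespace into a single space.
text = re.sub(r"\s+", " ", text)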
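
As a quick sanity check, the substitutions can be wrapped in a helper and tried on a made-up sentence. This is only a sketch: the function name clean_text is hypothetical, the bracket patterns are written in their non-greedy form (my assumption about the intended behavior), and the trailing strip() is an addition of mine.

```python
import re


def clean_text(text):
    """Apply the cleaning substitutions in order."""
    text = re.sub(r"'''?", "", text)          # bold/italic quote markers
    text = re.sub(r"\(.*?\)", "", text)       # parenthesized text (non-greedy)
    text = re.sub(r"-\{.*?\}-", "", text)     # -{...}- conversion markup
    text = re.sub(r"《》", "", text)           # empty book-title marks
    text = re.sub(r"link=\w+\s", " ", text)   # leftover link= attributes
    text = re.sub(r"File:.+\|", " ", text)    # File:...| prefixes
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


sample = "'''北京'''(Beijing)是中国的首都(capital)  人口众多"
print(clean_text(sample))  # → 北京是中国的首都 人口众多
```

With a greedy `\(.*\)` the same sample would lose everything between the first "(" and the last ")", including 是中国的首都.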

The final step is to count and tokenize the characters (using Python’s built-in collections.Counter). Each section gets its own row. The full script is located here:
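
The linked script is the reference; what follows is only a minimal sketch of the idea. The function names and the min_freq parameter are hypothetical, and reserving index 0 for padding and 1 for unknown characters is an assumption inferred from the pad_idx=0 and UNK-1 values that appear later in this post.

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved indices; real characters start at 2


def build_vocab(sections, min_freq=1):
    """Map each sufficiently frequent character to an integer index."""
    counts = Counter(ch for text in sections for ch in text)
    chars = [ch for ch, n in counts.most_common() if n >= min_freq]
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text, mapping):
    """Turn a string into a list of indices, one per character."""
    return [mapping.get(ch, UNK) for ch in text]


sections = ["我爱电影", "我爱看书"]
mapping = build_vocab(sections)
print(encode("我爱你", mapping))  # 你 is unseen, so it becomes UNK
```

Raising min_freq is one simple way to keep very rare characters out of the vocabulary; they then fall back to UNK at encoding time.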

Train the Language Model

We tried both QRNN and LSTM models and found that the LSTM models generally work better (maybe I need to tune the hyper-parameters of the QRNN models more carefully).

The tokens from all sections are concatenated together, and I use bptt=100 and batch_size=128:

bptt = 100
batch_size = 128
n_tok = int(np.max([np.max(x) for x in tokens]) + 1)
trn_loader = LanguageModelLoader(
    np.concatenate(trn_tokens), batch_size, bptt)
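
To make bptt and batch_size more concrete, here is a simplified pure-Python sketch of how one long token stream is typically turned into language-model batches. It is not the Fast.ai LanguageModelLoader itself (which has additional details I leave out), and the function names are hypothetical.

```python
def batchify(stream, batch_size):
    """Split one long token stream into `batch_size` parallel streams."""
    n = len(stream) // batch_size  # tokens per stream; the remainder is dropped
    return [stream[i * n:(i + 1) * n] for i in range(batch_size)]


def lm_batches(stream, batch_size, bptt):
    """Yield (inputs, targets); targets are the inputs shifted by one token."""
    rows = batchify(stream, batch_size)
    n = len(rows[0])
    for start in range(0, n - 1, bptt):
        end = min(start + bptt, n - 1)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y


stream = list(range(20))  # stand-in for the concatenated character indices
batches = list(lm_batches(stream, batch_size=2, bptt=4))
print(batches[0])
```

Each window covers at most bptt characters per stream, which is also how far back gradients are propagated in one step. That is one reason a character-level model needs a reasonably large bptt.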

The initialization and the hyper-parameters of the model:

path = Path("../data/cache/lm/")
path.mkdir(parents=True, exist_ok=True)
model_data = LanguageModelData(
    path, pad_idx=0, n_tok=n_tok, trn_dl=trn_loader,
    val_dl=val_loader, test_dl=tst_loader)
drops = np.array([0.1, 0.1, 0.05, 0, 0.1])
learner = model_data.get_model(
    partial(Adam, betas=(0.8, 0.999)),
    emb_sz=300, n_hid=500, n_layers=3,
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
    dropoute=drops[3], dropouth=drops[4], qrnn=False)

The full notebook (which currently very much needs some cleaning) is located here:

The notebook also comes with simple evaluation functions. An example:

Example: Next Character Predictions

There is also a function to generate text from the trained model. It samples the next character stochastically, so the results will be slightly different every time. As you can see, the generated text, although grammatically correct, doesn’t make much sense in context. This might have something to do with the long-term dependency issue mentioned earlier. All in all, there is still much room for improvement:

Example: Conditional Text Generation
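
For readers unfamiliar with stochastic generation, the sampling step can be sketched as below. This is a generic softmax-sampling sketch and not the notebook's actual function; the temperature parameter and the name sample_next are my own assumptions.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Draw one index with probability softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)      # fixed seed so the demo is repeatable
logits = [2.0, 1.0, 0.1]    # made-up scores for a 3-character vocabulary
draws = [sample_next(logits, rng=rng) for _ in range(1000)]
print([draws.count(i) for i in range(3)])
```

A temperature close to zero makes the choice nearly deterministic (always the highest-scoring character), while higher temperatures give more varied but less coherent text.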

Language Model Fine-tuning Using Douban Data

Preparing the Dataset

Sharing the scraped movie reviews would very likely violate Douban’s TOS, so I’ll point you to the code instead (we used a modified version of the scraper inside the spider folder):

We’ve extracted the review and the rating into a CSV file (ratings.csv):

We introduce a new special token BEG(=1) to mark the beginning of a review, so every existing token index has to be shifted up by 1. The tokenization code (note that we did not redefine/refit the vocabulary):

BEG = 1
UNK = 2
results = []
tokens_train, tokens_val, tokens_test = [], [], []
for df, tokens in zip((df_train, df_val, df_test),
                      (tokens_train, tokens_val, tokens_test)):
    for i, row in tqdm_notebook(df.iterrows(), total=df.shape[0]):
        # Unseen characters get the old unknown index (UNK - 1) before the shift.
        tokens.append(np.array(
            [BEG] + [mapping.get(x, UNK - 1) + 1 for x in row["comment"]]))
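
A tiny worked example of the index shift, using a made-up two-character vocabulary (the helper name encode_review is hypothetical):

```python
BEG, UNK = 1, 2      # indices after the shift, as defined in the post
OLD_UNK = UNK - 1    # unknown-character index in the original vocabulary


def encode_review(comment, mapping):
    """Prefix BEG and move every original index up by one slot."""
    return [BEG] + [mapping.get(ch, OLD_UNK) + 1 for ch in comment]


mapping = {"好": 2, "看": 3}  # toy original vocabulary (0 = pad, 1 = unknown)
print(encode_review("好看吗", mapping))  # → [1, 3, 4, 2]
```

The unseen character 吗 ends up at the new UNK index 2, and the known characters keep their relative order, just one slot higher.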

Load and Fine-tune the Model

Because of the new BEG token, we need to expand the embedding matrix in the pre-trained weights:

emb_dim = weights['0.encoder.weight'].shape[1]
# Row 0 stays all-zero; pre-trained row i is copied to row i + 1.
new_weights = np.zeros((n_toks, emb_dim))
new_weights[1:, :] = weights['0.encoder.weight']
assert np.array_equal(
    new_weights[2, :], weights['0.encoder.weight'][1, :])
weights['0.encoder.weight'] = T(new_weights)
weights['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_weights))
weights['1.decoder.weight'] = T(np.copy(new_weights))
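
The point of the row shift is that every character's new index still points at its pre-trained vector. The same bookkeeping with plain lists, as a sketch with a made-up 2-dimensional embedding (expand_embedding is a hypothetical name):

```python
def expand_embedding(old_rows, emb_dim):
    """Prepend one all-zero row so that old row i becomes new row i + 1."""
    return [[0.0] * emb_dim] + [list(row) for row in old_rows]


old = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # toy rows: pad, unknown, one character
new = expand_embedding(old, emb_dim=2)

old_index = 2                 # index of the character before the shift
new_index = old_index + 1     # index used by the Douban tokenization
print(new[new_index] == old[old_index])  # → True
```

Note that under this scheme the BEG token (index 1) starts from whatever vector the old index 0 had, and learns its own representation during fine-tuning.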

Then we can initialize the model as usual, load the model and fine-tune from the final layer:

learner = model_data.get_model(
    opt_fn, emb_dim, 500, 3,
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
    dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]
lrs = lr
learner.fit(lrs/2, 1, wds=1e-7, use_clr=(32,2), cycle_len=1)

We chose to unfreeze and fine-tune the whole model as in the Fast.ai IMDB example.

Sentiment Classification/Regression Fine-tuning using Douban Data

As a 3-Class Classification Problem

Convert to the 3-class label:

for df in (df_train, df_val, df_test):
    # 1-2 stars -> 0 (negative), 3 stars -> 1, 4-5 stars -> 2
    df["label"] = (df["rating"] >= 3) * 1
    df.loc[df.rating == 3, "label"] = 1
    df.loc[df.rating > 3, "label"] = 2
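
The same mapping written as a plain function, which makes the three buckets easy to check by hand (the function name is hypothetical):

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to one of three classes."""
    if rating < 3:
        return 0  # 1-2 stars
    if rating == 3:
        return 1  # 3 stars
    return 2      # 4-5 stars


print([rating_to_label(r) for r in range(1, 6)])  # → [0, 0, 1, 2, 2]
```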

Label distribution:

Prepare the dataset:

bs = 64
trn_ds = TextDataset(tokens_train, df_train.label.values)
val_ds = TextDataset(tokens_val, df_val.label.values)
trn_samp = SortishSampler(
    tokens_train, key=lambda x: len(tokens_train[x]), bs=bs//2)
val_samp = SortSampler(
    tokens_val, key=lambda x: len(tokens_val[x]))
trn_dl = DataLoader(
    trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=0,
    sampler=trn_samp)
val_dl = DataLoader(
    val_ds, bs, transpose=True, num_workers=1, pad_idx=0,
    sampler=val_samp)
model_data = ModelData(path, trn_dl, val_dl)

Get the Classifier:

model = get_rnn_classifier(
    bptt, bptt*2, 3, n_toks, emb_sz=emb_dim,
    n_hid=500, n_layers=3, pad_token=0,
    layers=[emb_dim*3, 50, 3], drops=[dps[4], 0.1],
    dropouti=dps[0], wdrop=dps[1], dropoute=dps[2],
    dropouth=dps[3])
learn = RNN_Learner(
    model_data, TextModel(to_gpu(model)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.metrics = [accuracy]

We achieved 63.6% accuracy on the validation set:

And 61.70% accuracy on the test set:

The precisions/recalls of the test set:
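
For anyone who wants to compute the same kind of table for their own predictions, per-class precision and recall can be sketched in a few lines of plain Python. The function name and the toy labels below are mine and are not taken from the notebook.

```python
def precision_recall(y_true, y_pred, n_classes):
    """Return a (precision, recall) pair for each class index."""
    stats = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        actual = sum(1 for t in y_true if t == c)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        stats.append((precision, recall))
    return stats


y_true = [0, 0, 1, 2, 2, 2]  # made-up labels
y_pred = [0, 1, 1, 2, 2, 0]  # made-up predictions
for label, (p, r) in enumerate(precision_recall(y_true, y_pred, 3)):
    print(label, round(p, 2), round(r, 2))
```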

Balanced 3-class Classification with 45,000 Training Examples

Here we sample 15,000 rows from each class in the train dataset, and 5,000 rows from each class in the validation dataset, as in the Zhihu post:

df_train_small = pd.concat([
], axis=0)
df_val_small = pd.concat([
], axis=0)
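
The per-class frames are not shown in the snippet above, so here is only a rough sketch of the balanced-sampling idea using the standard library instead of pandas. The function name, the seed, and the toy rows are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(rows, n_per_class, seed=42):
    """Pick `n_per_class` rows per label, without replacement."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        # rng.sample raises ValueError if a class has fewer rows than requested
        sampled.extend(rng.sample(by_label[label], n_per_class))
    return sampled


rows = [{"comment": f"review {i}", "label": i % 3} for i in range(30)]
small = sample_balanced(rows, n_per_class=5)
print(Counter(row["label"] for row in small))
```

With n_per_class=15000 on the training rows and 5000 on the validation rows, this would give the 45,000 and 15,000 row subsets described above.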

We achieved 58.75% accuracy in the validation set (15,000 rows):

And 59.55% accuracy in the full test set (176,209 rows):

The Zhihu article reports 54+% with 45,000 training examples, and 57+% with 90,000 training examples. Although we did not use exactly the same dataset, and the dataset is very noisy, we can still see a clear advantage for the ULMFiT models.

As a Regression Problem

All we need to do is change the loss function of the model and the number of units in the final layer:

class RNN_RegLearner(RNN_Learner):
    def __init__(self, data, models, **kwargs):
        super().__init__(data, models, **kwargs)

    def _get_crit(self, data):
        return lambda x, y: F.mse_loss(x[:, 0], y)

model = get_rnn_classifier(
    bptt, bptt*2, 3, n_toks, emb_sz=emb_dim, n_hid=500,
    n_layers=3, pad_token=0, layers=[emb_dim*3, 50, 1],
    drops=[dps[4], 0.1], dropouti=dps[0], wdrop=dps[1],
    dropoute=dps[2], dropouth=dps[3])
learn = RNN_RegLearner(
    model_data, TextModel(to_gpu(model)), opt_fn=opt_fn)

The validation MSE is 0.7000:

The confusion matrix of the validation set, if we round each prediction to the closest rating:

And finally the precisions/recalls:

The regression problem seems to be much more difficult. The model will very likely have a hard time with review comments that correspond to a 1 or 5 star rating (i.e. extreme values).
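
For reference, turning a continuous prediction into the closest star rating, and computing the MSE, can be sketched as follows. The helper names and the example numbers are mine; I use floor(x + 0.5) rather than Python's round() because the latter rounds halves to the nearest even number.

```python
import math


def mse(preds, targets):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def to_rating(pred, low=1, high=5):
    """Round to the nearest integer rating and clip to the valid range."""
    return min(high, max(low, math.floor(pred + 0.5)))


preds = [0.4, 2.5, 3.2, 4.9, 5.8]      # made-up model outputs
print([to_rating(p) for p in preds])   # → [1, 3, 3, 5, 5]
print(mse([1.0, 3.0], [2.0, 3.0]))     # → 0.5
```

A prediction only maps to 1 star if it falls below 1.5, and to 5 stars if it reaches 4.5 or more, so a model that hedges toward the middle will rarely hit the extreme ratings.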

TODO: A Manually Refined Labeled Dataset

As mentioned, this dataset is very noisy and contains a lot of rows where a human labeler probably wouldn’t agree with the assigned rating. For example, some comments are off-topic or sarcastic:

Off-topic example (which also involves a personal attack on the actress). It was actually labeled 0 (negative).
Sarcastic example. It was actually labeled 0 (negative).

This noise can critically undermine the usefulness of the model in the real world. One way to mitigate the problem is to review the comments one by one and pick the ones that are clearly aligned with their assigned ratings. We can still do the language model fine-tuning with the whole Douban dataset. It’ll be interesting to see how many reviewed examples are enough for the model to generalize well.


There is still much work to be done. But I hope this post is already helpful for some who want to start training their own NLP models with non-conventional datasets. And please feel free to ask any kind of questions or leave (constructive) feedback. Thanks!


Towards human-centered AI. https://veritable.pw
