Building a Summarization System in Minutes

An Up-to-date OpenNMT-py Tutorial for Beginners

(This is sort of a sequel to, or an update of, “Building a Translation System in Minutes,” published a year ago. This time we use a publicly available dataset, a different NLP task, and some task-specific evaluation metrics.)

Summarization is the task of producing a shorter version of one or several documents that preserves most of the input’s meaning. [1]

These days the text summarization task is mostly tackled with variants of the seq2seq architecture [2]. A seq2seq model is considerably more complicated than a plain RNN, which makes implementing one from scratch a rather daunting task. Luckily, the OpenNMT project [3] provides ready-to-use implementations of seq2seq models that are close to the state of the art, so we can use it as a starting point.

In this post we use OpenNMT-py, the PyTorch port of OpenNMT, to train a baseline model on the Gigaword summarization dataset. The official tutorial is somewhat outdated and did not work out of the box for me, so here is an up-to-date walkthrough that you can follow step by step.

Step 1: Install Required Packages

PyTorch 0.4.1:
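A minimal example using pip; the exact command depends on your platform and CUDA version, so check pytorch.org for the wheel that matches your setup:

pip install torch==0.4.1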

OpenNMT-py:

  • Clone the repository to your project directory: git clone https://github.com/OpenNMT/OpenNMT-py.git.
  • Optionally, check out the commit from Oct 11, 2018 that this tutorial was tested with (run this inside the OpenNMT-py directory): git checkout a0095fa2.
  • Install all the required packages: pip install -r OpenNMT-py/requirements.txt.

files2rouge (this is needed to calculate the evaluation metrics):
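The steps below follow the files2rouge README at the time of writing; the tool wraps the ROUGE-1.5.5 Perl script, so a working Perl installation is also required. Double-check the pltrdy/files2rouge project page in case the instructions have changed:

git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
python setup_rouge.py
python setup.py install
cd ..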

Step 2: Download the Gigaword Dataset

Follow the link here (harvardnlp/sent-summary) to download the archived dataset. Extract the contents into the data/gigaword subfolder of your project directory. You should end up with two folders in data/gigaword: train (the train and validation sets) and Giga (the test set).
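For reference, here is one way to put the files in place; the archive and folder names below are assumptions, so adjust them to whatever the download actually produces:

mkdir -p data/gigaword
tar -xzf summary.tar.gz                          # assumed archive name
mv sumdata/train sumdata/Giga data/gigaword/     # assumed name of the extracted folder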

At this point you should have two folders in your project directory: OpenNMT-py and data.

Step 3: Cleanse the Dataset

OpenNMT-py recently added a feature that prevents the dataset from containing any special tokens. Unfortunately, the Gigaword dataset does contain a special token, <unk>, in its train and validation sets, and it also expects models to output <unk> in the predicted summaries. Moreover, <unk> is inconsistent with the token used in the test set, UNK.

One simple solution is to replace <unk> with UNK. Run these commands in your bash prompt:

sed 's/<unk>/UNK/g' data/gigaword/train/train.article.txt > data/gigaword/train/train.article.cleaned.txt
sed 's/<unk>/UNK/g' data/gigaword/train/train.title.txt > data/gigaword/train/train.title.cleaned.txt
sed 's/<unk>/UNK/g' data/gigaword/train/valid.article.filter.txt > data/gigaword/train/valid.article.filter.cleaned.txt
sed 's/<unk>/UNK/g' data/gigaword/train/valid.title.filter.txt > data/gigaword/train/valid.title.filter.cleaned.txt

The test set contains some lines that consist of a single token, UNK. These lines can actually break the inference process of OpenNMT-py, and it makes little sense to keep them.

Run a script like the one below to remove those lines:
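Here is a minimal bash sketch. It assumes the test files are data/gigaword/Giga/input.txt and data/gigaword/Giga/task1_ref0.txt (matching the *_cleaned.txt paths used in the later steps) and that neither file contains tab characters; it drops every source/reference pair whose source line is exactly UNK so that the two files stay aligned:

# Join the source and reference files line by line, drop pairs whose source is just "UNK",
# and split the surviving pairs back into the two *_cleaned.txt files.
paste data/gigaword/Giga/input.txt data/gigaword/Giga/task1_ref0.txt | \
  awk -F'\t' '$1 != "UNK" {
    print $1 > "data/gigaword/Giga/input_cleaned.txt"
    print $2 > "data/gigaword/Giga/task1_ref0_cleaned.txt"
  }'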

Step 4: Preprocess the Dataset

We need to let OpenNMT-py scan the raw texts, build a vocabulary, tokenize and truncate the texts if necessary, and finally save the results to disk.

Pick a shard_size that fits your available memory; it mainly affects the training process. In my experience it does not help much with the preprocessing step itself, because a large enough dataset can run the system out of memory regardless of the shard_size value.

python OpenNMT-py/preprocess.py \
-train_src data/gigaword/train/train.article.cleaned.txt \
-train_tgt data/gigaword/train/train.title.cleaned.txt \
-valid_src data/gigaword/train/valid.article.filter.cleaned.txt \
-valid_tgt data/gigaword/train/valid.title.filter.cleaned.txt \
-save_data data/gigaword/PREPROCESSED \
-src_seq_length 10000 \
-dynamic_dict \
-share_vocab \
-shard_size 200000
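Once preprocessing finishes, the serialized data shards and the shared vocabulary are written under the -save_data prefix. A quick sanity check (the exact file names vary between OpenNMT-py versions):

ls -lh data/gigaword/PREPROCESSED*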

Step 5: Train a Model

This command trains a model similar to the pointer-generator network [4]:

python -u OpenNMT-py/train.py \
-save_model data/gigaword/models_v2/ \
-data data/gigaword/PREPROCESSED \
-copy_attn \
-global_attention mlp \
-word_vec_size 128 \
-rnn_size 512 \
-layers 2 \
-encoder_type brnn \
-train_steps 2000000 \
-report_every 100 \
-valid_steps 10000 \
-valid_batch_size 32 \
-max_generator_batches 128 \
-save_checkpoint_steps 10000 \
-max_grad_norm 2 \
-dropout 0.1 \
-batch_size 16 \
-optim adagrad \
-learning_rate 0.15 \
-start_decay_steps 100000 \
-decay_steps 50000 \
-adagrad_accumulator_init 0.1 \
-reuse_copy_attn \
-copy_loss_by_seqlength \
-bridge \
-seed 919 \
-gpu_ranks 0 \
-log_file train.v2.log
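Because -log_file is set, you can watch the training progress (and the periodic validation runs) from another terminal:

tail -f train.v2.log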

I’m almost certain that the learning-rate decay I configured was too aggressive (decay_steps was too small), which is why I stopped the training after just 2 epochs (~480,000 steps). You can try significantly increasing the value of decay_steps.
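To put a rough number on it: assuming OpenNMT-py’s default learning_rate_decay of 0.5, and that the rate is decayed at step 100,000 (start_decay_steps) and every 50,000 steps (decay_steps) thereafter, the learning rate has been halved 8 times by step 480,000, leaving it at about 0.15 × 0.5^8 ≈ 0.0006.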

Please refer to the official documentation for the meaning of each argument.

The training took about 12 hours with a single GTX 1070.

The model used here:

NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 128, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(128, 256, num_layers=2, dropout=0.1, bidirectional=True)
    (bridge): ModuleList(
      (0): Linear(in_features=512, out_features=512, bias=True)
      (1): Linear(in_features=512, out_features=512, bias=True)
    )
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 128, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.1)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.1)
      (layers): ModuleList(
        (0): LSTMCell(640, 512)
        (1): LSTMCell(512, 512)
      )
    )
    (attn): GlobalAttention(
      (linear_context): Linear(in_features=512, out_features=512, bias=False)
      (linear_query): Linear(in_features=512, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
      (linear_out): Linear(in_features=1024, out_features=512, bias=True)
    )
  )
  (generator): CopyGenerator(
    (linear): Linear(in_features=512, out_features=50004, bias=True)
    (linear_copy): Linear(in_features=512, out_features=1, bias=True)
    (softmax): Softmax()
    (sigmoid): Sigmoid()
  )
)

Step 6: Summarize the Documents from the Test Set

python OpenNMT-py/translate.py -gpu 1 \
-batch_size 1 \
-beam_size 5 \
-model data/gigaword/models_v2/_step_480000.pt \
-src data/gigaword/Giga/input_cleaned.txt \
-share_vocab \
-output data/gigaword/Giga/test.pred \
-min_length 6 \
-verbose \
-stepwise_penalty \
-coverage_penalty summary \
-beta 5 \
-length_penalty wu \
-alpha 0.9 \
-block_ngram_repeat 3 \
-ignore_when_blocking "." \
-replace_unk

Most of the arguments are directly copied from the official tutorial (targeted at the CNNDM dataset).

Again, please refer to the official documentation for the meaning of each argument.
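Before computing the evaluation metrics, it is worth eyeballing a few predictions next to their sources and reference summaries:

head -n 3 data/gigaword/Giga/input_cleaned.txt
head -n 3 data/gigaword/Giga/test.pred
head -n 3 data/gigaword/Giga/task1_ref0_cleaned.txt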

An example result:

Document: the four candidates in algeria 's first free presidential election held final rallies monday amid tight security as some voters began casting their ballots three days ahead of the main poll .

Predicted Summary: algeria holds final rallies amid tight security

Reference Summary: algerian presidential candidates wind up campaign by richard palmer

(Note that the reference summary mentions a name, “richard palmer”, that is impossible to derive from the source document.)

Step 7: Calculate the Evaluation Metrics

files2rouge data/gigaword/Giga/test.pred data/gigaword/Giga/task1_ref0_cleaned.txt > eval.v2.log

Outputs:

Preparing documents...
Running ROUGE...
---------------------------------------------
1 ROUGE-1 Average_R: 0.29550 (95%-conf.int. 0.28513 - 0.30603)
1 ROUGE-1 Average_P: 0.39232 (95%-conf.int. 0.37976 - 0.40575)
1 ROUGE-1 Average_F: 0.32755 (95%-conf.int. 0.31699 - 0.33892)
---------------------------------------------
1 ROUGE-2 Average_R: 0.13540 (95%-conf.int. 0.12617 - 0.14445)
1 ROUGE-2 Average_P: 0.18223 (95%-conf.int. 0.17091 - 0.19378)
1 ROUGE-2 Average_F: 0.15004 (95%-conf.int. 0.14042 - 0.15956)
---------------------------------------------
1 ROUGE-L Average_R: 0.27718 (95%-conf.int. 0.26704 - 0.28754)
1 ROUGE-L Average_P: 0.36748 (95%-conf.int. 0.35500 - 0.38036)
1 ROUGE-L Average_F: 0.30714 (95%-conf.int. 0.29659 - 0.31820)
Elapsed time: 6.515 seconds

So we achieved ROUGE-1: 0.32755, ROUGE-2: 0.15004, and ROUGE-L: 0.30714 after just two epochs. These scores are still significantly lower than those of the official pre-trained models (the best of which reaches ROUGE-1: 0.3551, ROUGE-2: 0.1735, ROUGE-L: 0.3317).

What’s Next

We can further tune the hyper-parameters to improve the final evaluation scores. We can also try the newer Transformer models (I’ll add a working training command here later if time permits).

What I’m really interested in doing is some transfer learning with the pre-trained models. The main obstacle is adapting the preprocessing used by OpenNMT to work on a new dataset.

Customizing the model structure is another natural next step. However, we first need to understand how the original model works. The next post will probably be an analysis of the source code, along with a brief introduction to the related research papers.

In the end I might build a system from scratch, but with a lot of components coming from OpenNMT-py. The plan is to incorporate the code into my Modern Chinese NLP project so it’ll support seq2seq applications as well.

References:

  1. Tracking Progress in Natural Language Processing (sebastianruder/NLP-progress on GitHub).
  2. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks.
  3. Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2018). OpenNMT: Neural Machine Translation Toolkit.
  4. See, A., Liu, P. J., & Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks.