How to Run Text Summarization with TensorFlow

Pavel Surmenok
Oct 16, 2016

Text summarization has many useful applications. If you run a website, you can create titles and short summaries for user-generated content. If you want to read a lot of articles and don’t have time to do so, your virtual assistant can summarize the main points of these articles for you.

It is not an easy problem to solve. There are multiple approaches, including various supervised and unsupervised algorithms. Some algorithms rank the importance of sentences within the text and then construct a summary out of the important sentences; others are end-to-end generative models.

End-to-end machine learning algorithms are interesting to try. After all, end-to-end algorithms demonstrate good results in other areas, like image recognition, speech recognition, language translation, and even question-answering.

Image credit: https://research.googleblog.com/2015/11/computer-respond-to-this-email.html

Text summarization with TensorFlow

In August 2016, Peter Liu and Xin Pan, software engineers on the Google Brain team, published a blog post, “Text summarization with TensorFlow”. Their algorithm extracts interesting parts of the text and creates a summary from these parts, allowing for rephrasings to make the summary more grammatically correct. This approach is called abstractive summarization.

Peter and Xin trained a text summarization model to produce headlines for news articles, using Annotated English Gigaword, a dataset often used in summarization research. The dataset contains about 10 million documents. The model was trained end-to-end with a deep learning technique called sequence-to-sequence learning.

Code for training and testing the model is included in the TensorFlow Models GitHub repository. The core model is a sequence-to-sequence model with attention. When training, the model takes the first two sentences of the article as input and generates a headline.

When decoding, the algorithm uses beam search to find the best headline among the candidate headlines generated by the model.
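To make the idea of beam search concrete, here is a minimal sketch in plain Python. It is not the repository's decoder: the scoring table TOY_MODEL, the function names, and the start/end tokens are all made up for illustration, and a real model would compute next-token probabilities with the trained network instead of a lookup table.

```python
import math


def beam_search(next_log_probs, start, end, beam_size=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(tokens) must return a dict that maps each candidate
    next token to its log-probability given the tokens so far.
    """
    beams = [(0.0, [start])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Hypothetical stand-in for a trained model: next-token probabilities
# that depend only on the previous token.
TOY_MODEL = {
    "<s>": {"snow": math.log(0.6), "storm": math.log(0.4)},
    "snow": {"hits": math.log(0.55), "falls": math.log(0.45)},
    "storm": {"hits": math.log(0.9), "</s>": math.log(0.1)},
    "hits": {"city": math.log(0.9), "</s>": math.log(0.1)},
    "falls": {"</s>": math.log(1.0)},
    "city": {"</s>": math.log(1.0)},
}


def toy_next_log_probs(tokens):
    return TOY_MODEL.get(tokens[-1], {})


best = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=2)
greedy = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=1)
print(" ".join(best))    # <s> storm hits city </s>
print(" ".join(greedy))  # <s> snow hits city </s>
```

With a beam of one the search is greedy: it commits to “snow” because it is the most likely first word, and ends up with a less likely headline overall. A beam of two keeps “storm” alive long enough to find the better sequence.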

The GitHub repository doesn’t include a trained model. The dataset is not publicly available; a license costs $6000 for organizations that are not members of the Linguistic Data Consortium. However, the repository does include a toy dataset, which is enough to run the code.

How to run

You will need TensorFlow and Bazel as prerequisites for training the model.

The toy dataset included in the repository contains two files in the “data” directory: “data” and “vocab”. The first one contains a sequence of serialized tensorflow.core.example.example_pb2.Example objects. An example of code to create a file with this format:
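The framing of such a file can be sketched with the standard library alone. The sketch below assumes that each record is stored as an 8-byte length followed by the serialized bytes, which is my understanding of how the repository's reader consumes the file; check the data reading code before relying on it. The function names and the sample payloads are hypothetical. In a real conversion each payload would be the result of calling SerializeToString() on an Example that holds the article text and its headline.

```python
import os
import struct
import tempfile


def write_records(path, serialized_examples):
    """Write each byte string preceded by its length as an 8-byte integer."""
    with open(path, "wb") as writer:
        for record in serialized_examples:
            writer.write(struct.pack("q", len(record)))
            writer.write(record)


def read_records(path):
    """Yield the byte strings back, one per record, in file order."""
    with open(path, "rb") as reader:
        while True:
            header = reader.read(8)
            if not header:
                return  # end of file
            (length,) = struct.unpack("q", header)
            yield reader.read(length)


# Stand-in payloads; real ones would be serialized Example protos.
payloads = [b"first serialized example", b"second serialized example"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data")
    write_records(path, payloads)
    restored = list(read_records(path))

print(restored == payloads)  # True
```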

The “vocab” file is a text file with the frequency of words in the vocabulary. Each line contains a word, a space character, and the number of occurrences of that word in the dataset. The list is used to vectorize texts.

Running the code on the toy dataset is really simple. The readme in the GitHub repo lists the sequence of commands to run the training and testing code.

You can run TensorBoard to monitor the training process:
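As a rough illustration of what vectorizing with such a list means, the sketch below turns a sentence into a list of integer ids. It assumes that ids are assigned in the order the words appear in the file and that the file contains an “<UNK>” entry for unknown words; the helper names and the sample lines are invented for this example and are not taken from the repository.

```python
UNKNOWN_TOKEN = "<UNK>"


def load_vocab(lines):
    """Map each word to an integer id; every line is 'word count'."""
    word_to_id = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)
    return word_to_id


def vectorize(text, word_to_id):
    """Replace every word with its id, falling back to the <UNK> id."""
    unk_id = word_to_id[UNKNOWN_TOKEN]
    return [word_to_id.get(word, unk_id) for word in text.split()]


# Made-up vocabulary lines in the 'word count' format.
vocab_lines = ["<UNK> 120", "the 95", "snow 12", "hit 7", "city 5"]
word_to_id = load_vocab(vocab_lines)

ids = vectorize("the snow hit the harbor", word_to_id)
print(ids)  # [1, 2, 3, 1, 0]
```

The word “harbor” is not in the list, so it is mapped to the id of “<UNK>”.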

When running the “decode” code, note that it will loop over the entire dataset indefinitely, so you will have to stop execution manually at some point. You can find the results of decoding in the log_root/decode folder. It will contain a few files: those with the prefix “ref” contain the original headlines from the test set, and those with the prefix “decode” contain the headlines generated by the model.

Troubleshooting

You may encounter an error when running the “eval” or “decode” code with TensorFlow 0.10 or later:

“ValueError: Could not flatten dictionary. Key had 2 elements, but value had 1 elements.”

There is an open issue on GitHub for this error. One workaround is to downgrade TensorFlow to 0.9, which worked for me. Another workaround requires changing the code of the model: adding “state_is_tuple=False” to the instantiations of LSTMCell in seq2seq_attention_model.py.

If you run training and decoding on the toy dataset, you will notice that decoding generates nonsense. Here are a few examples of generated headlines:

<UNK> to <UNK> <UNK> <UNK> <UNK> <UNK> .

<UNK> <UNK> <UNK> <UNK> of <UNK> <UNK> from <UNK> <UNK> .

in in <UNK> <UNK> <UNK> .

One reason for the poor performance on the toy set could be the incompleteness of the vocabulary file. The vocabulary file is truncated and doesn’t contain many of the words used in the “data” file. This leads to too many “<UNK>” tokens, which represent unknown words.
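One way to check this suspicion is to measure what fraction of the tokens in the data falls outside the vocabulary. The sketch below is a hypothetical helper with made-up sample data; on a real dataset the tokens would come from the decoded articles and the vocabulary from the “vocab” file.

```python
from collections import Counter


def unknown_rate(tokens, vocabulary):
    """Return the fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


# Made-up sample: a tiny truncated vocabulary and one tokenized sentence.
vocabulary = {"the", "to", "of", "in", "."}
tokens = "heavy snow hit the city in january .".split()

rate = unknown_rate(tokens, vocabulary)
missing = Counter(token for token in tokens if token not in vocabulary)

print(rate)            # 0.625
print(sorted(missing)) # ['city', 'heavy', 'hit', 'january', 'snow']
```

A rate this high means that most of the input the model sees is “<UNK>”, which would go a long way toward explaining headlines like the ones above.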

How to run on another dataset

A toy dataset is, well, a toy. To create a useful model you should train it on a large dataset. Ideally, the dataset should be specific to your task: summarizing news articles may be different from summarizing legal documents or job descriptions.

As I don’t have access to the GigaWord dataset, I tried to train the model on smaller news article datasets that are free: CNN and DailyMail. I found the code to download these datasets in the DeepMind/rcdata GitHub repo, and slightly modified it to add the title of the article as the first line of each output file. See modified code here.

There are 92570 articles in the CNN dataset and 219503 articles in the Daily Mail dataset. There could have been a few more, but the code from the DeepMind repo could not download all URLs. Together that is about 312k articles, far fewer than the 10 million in GigaWord, so I would expect lower performance from a model trained on these datasets.

After you run the code to download the dataset, you will have a folder with lots of files, one HTML file for every article. To use it with the TextSum model you will need to convert it to the binary format described above. You can find my code to convert CNN/DailyMail articles into the binary format in the textsum_data_convert.py file in my “TextSum” repo on GitHub. An example of running the code for the CNN dataset:

Then you can copy the train/validation/test sets and vocabulary files into the “data” directory and start training the model:

Training with default parameters doesn’t go very well. Here is a graph of running_avg_loss:

Decoding results are also disappointing:

“your your <UNK>”

“We’ll the <UNK>”

“snow hit hit hit <UNK>”

Either the dataset is too small, or the hyperparameters need to be changed for this dataset.

When running the code I found that the training code doesn’t use the GPU, even though I have the correct configuration: GeForce 980Ti, CUDA, cuDNN, and TensorFlow compiled with GPU support. While training, python.exe consumes 100–300+% CPU and appears in the list of processes when running nvidia-smi, but GPU utilization stays at 0%.

I guess this may be related to the fact that the authors of the model ran the code on multiple GPUs, one of which had a special purpose. A fragment of the seq2seq_attention_model.py file:

The decoding code uses the GPU quite well. It consumes almost all 6 GB of GPU memory and keeps utilization over 50%.

Conclusion

Using the code from this article you can easily run a text summarization model on your own dataset. Let me know if you find something interesting!

If you happen to have a license for the GigaWord dataset, I would be happy if you shared a trained TensorFlow model with me. I would like to try it on some proprietary data that does not come from news articles.

Do you use any other text summarization algorithms? What works best?

