This tutorial is the 10th installment of the Abstractive Text Summarization made easy tutorial series. Today we will build a Hindi text summarizer using the approach introduced by Google Research that applies curriculum learning (scheduled sampling) to seq2seq. We also combine attention and pointer-generator models in our text summarization model. All the code for this series can be found here, and the code for this tutorial can be found here; it is written in TensorFlow and runs seamlessly on Google Colab.
Today we will learn how to build a text summarization model for the Hindi language, from:
- collecting/scraping data
- building word embedding model
- training seq2seq scheduled sampling model
This blog is based on the work of these amazing projects: we used news-please for scraping data, our model is based on the scheduled sampling work of Bengio et al. from Google, and it is implemented using yaserkl’s amazing library.
I would truly like to thank all of them for their amazing work.
This tutorial series helps you get to know the latest methods for building an abstractive text summarizer. Here is the list of tutorials in the series. We have also provided text summarization as a free-to-use API through eazymind.
Now let’s begin :D
1- Collect/Scrape the dataset and process it to CSV
To build our text summarization model, we need a dataset in the required language. This data should be in the form of text with a title, so that we can train the model to summarize the text into the title. One of the most effective datasets for our goal is news, as each news item has a long article with a title that summarizes it.
So we need a method to collect online news in Hindi. One of the most useful methods that I have found is the amazing scraper news-please. This scraper lets you state the websites you want to scrape, and it then recursively crawls them and saves the data in JSON format.
I suggest scraping on Google Colab so that you don’t waste your own bandwidth; you then only download the resulting JSON files, which are much smaller than the full HTML pages.
1-A Run this notebook (google colab)
1-B Set the configurations (google colab)
In Google Colab, in the Files tab, go up one level, then under the root directory create a directory called news-please-repo, and under it create a config directory.
Here you would create 2 files (their contents can be found in the notebook). The first file (config.cfg) sets the directory where the JSON files are saved. I like to save them to Google Drive, but feel free to set your own path (this option is the variable called working_path).
The second file (sites.hjson) sets the websites to scrape from. I used about 9 websites (their names are in the notebook); feel free to add or change them to your own news websites.
I suggest that you modify sites.hjson to contain only a couple of sites per Google Colab session, so that each session scrapes a couple of sites rather than all of them at the same time.
1-C Download from Google Drive to process locally (google colab)
After running the news-please command for a couple of hours (or a couple of Google Colab sessions) and saving the resulting JSON files to your Google Drive, I suggest downloading the data from Google Drive to your local computer and processing the files there, as accessing the files from Google Colab would be quite slow (I think it has something to do with slow file I/O between Google Colab and Google Drive).
Download the data by simply selecting the folder in Google Drive and downloading it; Google Drive zips it automatically.
1-D Process the downloaded zip (locally)
After the zip has been downloaded, unzip it and install langdetect:
pip install langdetect
Then run this script (locally on your computer for fast file access; don’t forget to change the location of your extracted zip in the script). The script loops through all the scraped JSON files, checks whether they are in Hindi, and saves the Hindi ones to a CSV.
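As a rough illustration of what such a processing script might look like (a minimal sketch, not the script linked above), the code below walks a folder of JSON files, keeps the articles detected as Hindi, and writes title/text pairs to a CSV. The function names, the CSV column names, and the JSON field names ("title", plus "maintext" or "text" for the article body) are assumptions made for this example; check them against your own scraped files.

```python
import csv
import json
from pathlib import Path


def is_hindi(text):
    """Return True when langdetect labels the text as Hindi ('hi')."""
    from langdetect import detect  # imported lazily; installed with pip above
    try:
        return detect(text) == "hi"
    except Exception:
        # langdetect raises when the text has no usable features (e.g. only digits)
        return False


def json_dir_to_csv(json_dir, csv_path, keep=is_hindi):
    """Walk json_dir recursively and write the kept (title, text) rows to csv_path.

    Returns the number of articles written.
    """
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["title", "text"])  # assumed column names
        for path in sorted(Path(json_dir).rglob("*.json")):
            try:
                article = json.loads(path.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                continue  # skip corrupt or partially written files
            if not isinstance(article, dict):
                continue
            title = (article.get("title") or "").strip()
            # Assumed field names for the article body; adjust to your JSON files.
            text = (article.get("maintext") or article.get("text") or "").strip()
            if title and text and keep(text):
                writer.writerow([title, text])
                rows_written += 1
    return rows_written
```

A call such as json_dir_to_csv("extracted_zip_folder", "hindi_news.csv") would then produce the CSV to upload in the next step (both paths are placeholders). The keep argument makes it easy to swap in a different language check.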
1-E Upload the resulting CSV
Once the script has finished, upload the resulting CSV to your Google Drive.
2- Generate word embedding model and vocab
For our text summarization model, we need a word embedding model built specifically for Hindi; this helps our model understand the language (learn more about word embedding for text summarization).
We can either use an already built word embedding, or we can build our very own word embedding model.
So let’s choose the option of building our own model.
2-A Build Word Embedding
This notebook uses the gensim Python package to build the word embedding model that we will use in our text summarization model. We build the embeddings with a vector length of 150. (Run this notebook on Google Colab, connected to Google Drive, to read your recently uploaded CSV.)
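To give an idea of what training such embeddings with gensim can look like, here is a minimal sketch (not the notebook's actual code). Only the vector length of 150 comes from this tutorial; the simple tokenizer, the window and min_count values, the function names, and the output path are assumptions for illustration. Note that gensim 4.x names the dimension parameter vector_size, while older 3.x releases call it size.

```python
# Punctuation to strip before splitting, including the Hindi danda (।) and double danda (॥).
PUNCT = "।॥,.!?;:\"'()[]{}<>|/\\-"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCT})


def tokenize(text):
    """Replace punctuation with spaces and split on whitespace (a deliberately simple tokenizer)."""
    return text.translate(_PUNCT_TABLE).split()


def train_embeddings(texts, dim=150, out_path="hindi_w2v.model"):
    """Train a Word2Vec model on an iterable of raw text strings and save it."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim 4.x

    sentences = [tokenize(t) for t in texts]
    # window and min_count are assumed values, not taken from the tutorial.
    model = Word2Vec(sentences, vector_size=dim, window=5, min_count=2, workers=4)
    model.save(out_path)
    return model
```

For example, train_embeddings(df["text"]) could be called on the text column of the uploaded CSV once it is loaded into a pandas DataFrame named df (a hypothetical variable here).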
2-B Build Vocab Dict
In the same notebook we also build a vocabulary file, which contains all the distinct words (about 200k words), each with its count. This generated vocab file is essential for our text summarization model.
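A vocabulary file of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the notebook's code: the one "word count" pair per line layout, the 200,000-word cap, and the function names are chosen for the example, so check the format your summarization code expects.

```python
from collections import Counter


def build_vocab(token_lists, max_size=200_000):
    """Count every token and return the most frequent (word, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(max_size)


def write_vocab(vocab, path):
    """Write one 'word count' pair per line (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in vocab:
            f.write(f"{word} {count}\n")
```

Here token_lists is any iterable of tokenized articles, for example the same tokenized sentences used to train the word embedding.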
3- Build Text Summarization model
Now that we have
- scraped the data,
- processed it into a CSV,
- built the word embedding model,
- built the vocab,
we are able to start building our text summarization model. There are actually multiple approaches to building such a model (you can learn more here). We choose to build seq2seq with scheduled sampling (learn more about scheduled sampling here). This concept was introduced by Bengio et al. from Google, and is implemented using yaserkl’s amazing library.
This notebook is a modification of yaserkl’s amazing library: we made the code run on Python 3, converted it to ipynb so it runs seamlessly on Google Colab, and enabled the dataset to be read in a simple CSV format without the need for complicated binary preprocessing.
The model is built on top of an attention model and a pointer-generator model, with scheduled sampling added to address the problem of exposure bias. Exposure bias arises because during training the model is fed the reference sentence, while in testing there is no reference sentence, so the model has to rely on its own outputs, which it never saw during training.
Scheduled sampling is a simple yet smart way to solve this problem: it exposes the model to its own output during training, so that the model learns from its mistakes. You can learn more about this research here (thanks to Bengio et al. for this amazing research).
So in every training step we flip a coin: sometimes we feed the model the reference sentence, and other times we train from the model’s own output (learn more here).
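The coin flip can be sketched in plain Python. This is a toy illustration of the idea only, not the TensorFlow code used in the notebook: step_fn stands in for one decoder step of a real model, the coin is flipped per decoding step, and the linear schedule with a 0.5 cap is an assumed choice.

```python
import random


def linear_sampling_prob(step, total_steps, max_prob=0.5):
    """Probability of feeding the model its own output, growing linearly with training."""
    return min(max_prob, max_prob * step / total_steps)


def decode_with_scheduled_sampling(step_fn, reference, start_token, sampling_prob, rng=random):
    """Run one training-time decode and return the tokens the model produced.

    step_fn(prev_token) -> predicted_token is a hypothetical stand-in for one decoder step.
    """
    prev = start_token
    outputs = []
    for ref_token in reference:
        predicted = step_fn(prev)
        outputs.append(predicted)
        # The coin flip: feed back the model's own prediction or the reference token.
        prev = predicted if rng.random() < sampling_prob else ref_token
    return outputs
```

With sampling_prob set to 0 this reduces to ordinary teacher forcing (the model always sees the reference), and with 1 the model always continues from its own output, which is the situation it faces at test time.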
Some results for text summarization (the “summary” entries are the generated summaries):
More about the series
This is a series of tutorials that helps you build an abstractive text summarizer using TensorFlow with multiple approaches. We call it abstractive because we teach the neural network to generate words, not merely to copy them.
So far we have covered (code for this series can be found here):
0. Overview on the free ecosystem for deep learning (how to use google colab with google drive)
1. Overview of the text summarization task and the different techniques for the task
2. Data used and how it could be represented for our task (prerequisites for this tutorial)
3. What is seq2seq for text summarization and why
4. Multilayer Bidirectional LSTM/GRU
5. Beam Search & Attention for text summarization
6. Building a seq2seq model with attention & beam search
7. Combination of Abstractive & Extractive methods for Text Summarization
8. Teach seq2seq models to learn from their mistakes using deep curriculum learning
9. Deep Reinforcement Learning (DeepRL) for Abstractive Text Summarization made easy
I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials is found here. You can simply use Google Colab to run it. Please review the tutorial and the code and tell me what you think about it. Don’t forget to try out eazymind for free text summarization generation. Hope to see you again.