Hindi Abstractive Text Summarization (Tutorial 10)

amr zaki
Dec 7, 2019 · 7 min read
Text Summarization in Hindi

This tutorial is the 10th installment of the Abstractive Text Summarization Made Easy tutorial series. Today we will build a Hindi text summarizer using the scheduled sampling approach introduced by Google Research, a curriculum learning strategy for seq2seq models. We also combine attention and pointer-generator models in our text summarization model. All the code for this series can be found here, and the code for this tutorial can be found here; it is written in TensorFlow and runs seamlessly on Google Colab.

Today we will learn how to build a text summarization model for the Hindi language, starting from:

  1. collecting/scraping data
  2. building a word embedding model
  3. training a seq2seq scheduled sampling model

This blog is based upon the work of these amazing projects: we used news-please for scraping data, our model is based on the work of Bengio et al. from Google, and it is implemented using yaserkl’s amazing library.

I would truly like to thank all of them for their amazing work.

This tutorial series helps you get to know the latest methods for building an abstractive text summarizer; here is the list of tutorials in the series. We have also provided text summarization as a free-to-use API through eazymind.

EazyMind free text summarization and obj detection

Now let’s begin :D

1- Collect/Scrape dataset and process to csv

Scrape data on Google Colab and save it to Google Drive

To build our text summarization model, we need a dataset in the required language. The data should be in the form of text with a title, so that we can train the model to summarize the text into the title. One of the most effective datasets for this goal is news, as each news article has a long body with a title that summarizes it.

So we need a method to collect online news in Hindi. One of the most useful methods I have found is the amazing scraper news-please. It lets you list the websites you want to scrape, then recursively scrapes them and stores the data in JSON format.

I suggest scraping on Google Colab so that you don’t waste your own bandwidth: you only download the resulting JSON files, which are much smaller than the full HTML pages.

Learn more about how to copy from GitHub to Google Colab here. This notebook installs the news-please Python package and scrapes data to your Google Drive.

In Google Colab, in the Files tab, go up one level, then under the root directory create a directory called news-please-repo, and under it create a config directory.

Here you will create 2 files (their contents can be found in the notebook). The first file (config.cfg) sets the directory where the JSON files are saved. I like to save them to Google Drive, so feel free to set your own path (this option is the variable called working_path).

The second file (sites.hjson) sets the names of the websites to scrape from. I used about 9 websites (their names are found in the notebook); feel free to add and modify entries to use your own news websites.

I suggest that you modify sites.hjson to contain only a couple of sites per Google Colab session, so that each session scrapes a couple of sites rather than all of them at the same time.

After running the news-please command for a couple of hours (or a couple of Google Colab sessions) and saving the resulting JSON files to your Google Drive, I suggest downloading the data from Google Drive to your local computer and processing the files locally, as accessing the files from Google Colab is quite slow (I think it has something to do with slow file I/O between Google Colab and Google Drive).

Download the data by simply selecting the folder in Google Drive and downloading it; Google Drive zips it automatically.

After the zip has been downloaded, unzip it, then install langdetect:

pip install langdetect

Then run this script (locally on your computer for fast file access; don’t forget to modify the location of your extracted zip in the script). The script loops through all the scraped JSON files, checks whether they are in Hindi, and saves them to a CSV.
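As a rough idea of what such a script does, here is a minimal sketch; it is not the script linked above. It assumes each news-please JSON file stores the headline under `title` and the article body under `maintext` (field names differ between news-please versions, so check one of your own files). The function names, the `title_key`/`body_key` parameters, and the CSV column names are made up for illustration.

```python
import csv
import json
from pathlib import Path


def is_hindi(text):
    """Return True if langdetect classifies the text as Hindi ('hi')."""
    # Imported lazily so the rest of the sketch works without the package.
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException
    try:
        return detect(text) == "hi"
    except LangDetectException:
        # Raised when the text has no features to detect a language from.
        return False


def json_articles_to_csv(json_dir, csv_path, is_target_lang=is_hindi,
                         title_key="title", body_key="maintext"):
    """Walk json_dir recursively and write matching articles to a CSV.

    Returns the number of article rows written.
    """
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["title", "text"])  # hypothetical column names
        for path in sorted(Path(json_dir).rglob("*.json")):
            try:
                article = json.loads(path.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                continue  # skip corrupted or partially written files
            if not isinstance(article, dict):
                continue
            title = article.get(title_key)
            body = article.get(body_key)
            if not isinstance(title, str) or not isinstance(body, str):
                continue
            if not title.strip() or not body.strip():
                continue
            if is_target_lang(body):
                writer.writerow([title.strip(), body.strip()])
                rows_written += 1
    return rows_written
```

Calling `json_articles_to_csv("path/to/extracted_zip", "hindi_news.csv")` (both paths are placeholders) would then produce a single CSV with one row per Hindi article.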

After you run the script, upload the resulting CSV to your Google Drive.

2- Generate word embedding model and vocab

Hindi Word Embedding

For our text summarization model, we need a word embedding model built specifically for Hindi, which helps our model understand the language. Learn more about word embedding for text summarization.

We can either use an already built word embedding, or we can build our very own word embedding model.

So let’s choose the option of building our own model.

This notebook uses the gensim Python package to build the word embedding model that we will use in our text summarization model; we build the embeddings with a vector length of 150. (Run this notebook on Google Colab, connecting it to Google Drive to read your recently uploaded CSV.)
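A minimal sketch of this step might look like the following. It is not the notebook’s code: the simple tokenizer, the function names, and every hyperparameter other than the vector length of 150 are assumptions. The `vector_size` argument is the gensim 4 name (gensim 3 calls it `size`).

```python
import re


def tokenize(text):
    """Rough tokenizer: replace common punctuation (including the
    Devanagari danda) with spaces, then split on whitespace."""
    cleaned = re.sub(r"[।,.!?;:()\[\]-]", " ", text)
    return cleaned.split()


def train_word_embeddings(texts, vector_size=150, min_count=1):
    """Train a Word2Vec model on an iterable of raw text strings."""
    # Imported lazily; install with `pip install gensim`.
    from gensim.models import Word2Vec
    sentences = [tokenize(t) for t in texts]
    return Word2Vec(sentences=sentences, vector_size=vector_size,
                    window=5, min_count=min_count, workers=2)
```

After training, `model.wv.save_word2vec_format(...)` can write the vectors to a file that the summarization notebook loads later.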

In this same notebook, we also build a vocabulary file, which contains all the distinct words (about 200k words), each with its count. This generated vocab file is essential for our text summarization model.
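The vocabulary step can be sketched with the standard library alone. The one-word-and-count-per-line layout below is an assumption based on what pointer-generator style code typically reads, so compare it with the vocab file the notebook produces; the function name and the 200,000 cap are illustrative.

```python
from collections import Counter


def build_vocab(tokenized_texts, vocab_path, max_size=200_000):
    """Count tokens and write '<word> <count>' lines, most frequent first.

    tokenized_texts: iterable of token lists.
    Returns the list of (word, count) pairs that were written.
    """
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(tokens)
    most_common = counts.most_common(max_size)
    with open(vocab_path, "w", encoding="utf-8") as f:
        for word, count in most_common:
            f.write(f"{word} {count}\n")
    return most_common
```

Capping the vocabulary keeps the output layer of the model at a manageable size; rarer words can still be copied from the source text by the pointer mechanism.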

3- Build Text Summarization model

Now that we have

  1. scraped the data,
  2. processed it into a CSV,
  3. built the word embedding model,
  4. built the vocab,

we are now able to start building our text summarization model. There are actually multiple approaches to building the model (you can learn more here). We will build a seq2seq model with scheduled sampling (learn more about scheduled sampling here). This concept was introduced by Bengio et al. from Google, and is implemented using yaserkl’s amazing library.

This notebook is a modification of yaserkl’s amazing library: we made the code run on Python 3, converted it to ipynb so that it runs seamlessly on Google Colab, and enabled the dataset to be read in a simple CSV format without the need for complicated binary preprocessing.

Scheduled sampling model:

This model is built on top of the attention model and the pointer-generator model, and helps solve the problem of exposure bias. Exposure bias arises because we train the model using reference sentences, while at test time there is no reference sentence, so the model has to rely on its own previous outputs, which it never saw during training.

Scheduled sampling is a simple yet smart way to solve this problem: it exposes the model to its own output during training, so that the model learns from its mistakes. You can learn more about this research here (thanks to Bengio et al. for this amazing research).
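In the scheduled sampling paper, the probability of feeding the reference token starts near 1 and decays as training progresses, which is the curriculum part of the approach. Below is a small sketch of the three decay shapes described there (linear, exponential, inverse sigmoid). The function names and default constants are illustrative and are not the settings used in the notebook.

```python
import math


def linear_decay(step, start=1.0, slope=1e-4, floor=0.0):
    """Probability of using the reference token, decreasing linearly."""
    return max(floor, start - slope * step)


def exponential_decay(step, k=0.9999):
    """k ** step, with k slightly below 1."""
    return k ** step


def inverse_sigmoid_decay(step, k=1000.0):
    """k / (k + exp(step / k)), with k >= 1."""
    x = step / k
    if x > 700:  # exp() would overflow; the value is effectively 0 here
        return 0.0
    return k / (k + math.exp(x))
```

Early in training the model is weak, so it mostly sees reference tokens; later it is fed more of its own predictions, which is closer to how it is used at test time.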

coin animation borrowed from google search results

So in every training step we flip a coin: sometimes we choose the reference sentence, and other times we train from the model’s own output (learn more here).
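The coin flip can be sketched in plain Python, independent of TensorFlow. This is a simplified illustration rather than the library’s implementation: it flips the coin once per decoder token, as in the original paper, and the `model_step` callable is a hypothetical stand-in for running the decoder for one step.

```python
import random


def decode_with_scheduled_sampling(reference, model_step, p_reference,
                                   rng=None, start_token="<s>"):
    """Decode one training example, choosing each next input by coin flip.

    reference: list of ground-truth target tokens.
    model_step: callable(prev_token) -> predicted next token (hypothetical).
    p_reference: probability of feeding the ground-truth token next.
    Returns (inputs_fed, predictions).
    """
    rng = rng or random.Random()
    prev = start_token
    inputs_fed, predictions = [], []
    for target in reference:
        inputs_fed.append(prev)
        predicted = model_step(prev)
        predictions.append(predicted)
        # Coin flip: feed the reference token with probability p_reference,
        # otherwise feed the model's own prediction.
        prev = target if rng.random() < p_reference else predicted
    return inputs_fed, predictions
```

With `p_reference=1.0` this reduces to ordinary teacher forcing, and with `p_reference=0.0` the decoder only ever sees its own outputs, which matches the test-time setting.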

Some text summarization results (summary is the generated summary):

Example 1

Example 2

I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials can be found here. You can simply use Google Colab to run it. Please review the tutorial and the code and tell me what you think about it. Don’t forget to try out eazymind for free text summarization generation. Hope to see you again.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
