Bengali Abstractive Text Summarization Using Sequence to Sequence RNNs

Abu Kaisar
Analytics Vidhya

--

People express their thoughts and moods through text, so understanding the meaning of text is important. Reading long texts to extract that meaning can be hard and time-consuming, and machines can help solve this problem. As a part of machine learning, text summarization is a large field of research in natural language processing, and building an automatic text summarizer is its central goal: a summarizer produces the gist of a large document in a short time.

Automatic text summarizers have been built for other languages, but not for Bengali. The main goal of this research is to expand the tools and technology available for the Bengali language. In this work, we have tried to build an automatic text summarizer for Bengali. Working with the Bengali language was the most challenging part of the research, but in the end we have laid a foundation for an automatic Bengali text summarizer.

The dataset was collected from online social media, and a deep learning model was used to build the summarizer. Reducing the loss during training directly affects the experimental results, and we were able to reduce the training loss of our summarization model, which can generate a short summary for Bengali text.

Introduction

Text summarization falls into two categories: abstractive and extractive. An abstractive summary is an abstract of the text document: it represents the main idea of the text, but the summarizer does not repeat the original sentences. Finding the gist of a text in this way is a core challenge in natural language processing. Most research has focused on extractive summarization, whose main idea is to extract keywords and find the most frequent words in the text. Generating new words or sentences from the text is much more challenging, because a word in an abstractive summary does not have to appear in the original context. Much abstractive summarization research has been done for other languages. In this work, we have tried to build an abstractive text summarizer for the Bengali language using deep learning algorithms.
Bengali is one of the most widely used languages in the world, so expanding its tools and technology is important and the Bengali research area needs to grow. For any automatic system, the text first needs to be processed, and NLP tools and libraries help greatly with processing any kind of text. Building an automatic system for Bengali is harder than for many other languages, because some NLP libraries do not support Bengali and many techniques therefore had to be implemented from scratch. Our work can produce an abstractive summary for Bengali text. No machine gives a 100% accurate result every time, but most of the time a satisfactory result can be obtained. Our automatic abstractive summarizer behaves the same way: not every generated summary is accurate, but most machine summaries are satisfactory for Bengali text summarization.

Research Summary

In this research, we introduce a methodology for Bengali abstractive text summarization and build a deep learning model using our own dataset, collected from social media. We first gathered Bengali statuses, comments, and page and group posts from Facebook, then wrote a summary for each text. The dataset therefore has two columns, the Bengali text and its corresponding summary, and contains two hundred text-summary pairs in total. Before building the deep learning model we preprocessed the Bengali text: we split the text, expanded Bengali contractions, and removed stop words. After preprocessing we counted the vocabulary of the whole dataset. Word embeddings are important for a deep learning model; a word vector file stores related vocabulary as numeric values, and we used a pre-trained word vector file for Bengali that is available online. We built a sequence-to-sequence model based on attention, with an encoder and a decoder that use bidirectional LSTM cells. Word vectors are the input of the encoder, and the relevant word vectors produced by the decoder are the output of the model. To pass sequences through the encoder and decoder, special tokens such as PAD, UNK, and EOS are needed. After declaring and defining all functions and libraries, we trained the model for more than three hours and got a good response from the machine.

Text preprocessing
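
To make the preprocessing steps concrete, here is a minimal Python sketch of the cleaning and vocabulary counting described above. The contraction map and stop-word list are placeholders, not the actual resources used in the project.

```python
import re

# Placeholder resources: the project's Bengali contraction map and
# stop-word list are not reproduced here.
CONTRACTIONS = {}   # maps a shortened form to its full form
STOP_WORDS = set()  # common Bengali function words

def clean_text(text):
    """Split the text, expand contractions, and drop stop words."""
    # Keep Bengali characters (U+0980-U+09FF) and spaces only.
    text = re.sub(r"[^\u0980-\u09FF\s]", " ", text)
    words = text.split()
    words = [CONTRACTIONS.get(w, w) for w in words]
    return [w for w in words if w not in STOP_WORDS]

def count_vocab(texts):
    """Count how often each word appears across the whole dataset."""
    counts = {}
    for t in texts:
        for w in clean_text(t):
            counts[w] = counts.get(w, 0) + 1
    return counts
```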

View of Model

Since the arrival of machine translation, deep learning has reached a great milestone in the field of artificial intelligence. Deep learning models produce strong results on many text-related problems, and the RNN is the most widely used deep learning algorithm for them; it works efficiently on text-related problems. An RNN is built from LSTM cells, which act like a short-term memory, and both the encoder and the decoder use LSTM cells. The input text is passed to the encoder, where each input is a sequence of word vectors. The decoder takes the encoded sequence and generates the output text from the relevant text sequence.
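
The following is a simplified tf.keras sketch of this encoder-decoder idea with a bidirectional LSTM encoder. It leaves out the attention layer, and the vocabulary size, embedding size, and hidden units are illustrative values, not the project's actual hyperparameters.

```python
import tensorflow as tf

# Illustrative sizes, not the project's real hyperparameters.
vocab_size, embed_dim, units = 5000, 300, 128

# Encoder: embedded input words pass through a bidirectional LSTM.
enc_inputs = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(enc_inputs)
enc_out, fh, fc, bh, bc = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
)(enc_emb)
state_h = tf.keras.layers.Concatenate()([fh, bh])
state_c = tf.keras.layers.Concatenate()([fc, bc])

# Decoder: starts from the encoder states and predicts the summary words.
dec_inputs = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_out, _, _ = tf.keras.layers.LSTM(
    2 * units, return_sequences=True, return_state=True
)(dec_emb, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(dec_out)

model = tf.keras.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The bidirectional encoder reads the Bengali input in both directions, and its concatenated states initialize the decoder that generates the summary words.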

Graphical View of Model

Sequence to Sequence Learning

A Seq2Seq model is built from LSTM cells. First, the input words are formed from the vector file, where each related word has an embedded value; these embedded values act as the input of the encoder. The encoder stores the sequence values in short-term memory, which is the LSTM. Each sequence uses tokens to mark its start and end points. In the program we defined special tokens such as <PAD>, <EOS>, <GO>, and <UNK>, all of which are used for handling the sequences in the encoder and decoder. <EOS> marks the end of the input sequence: when the input sequence ends, the <EOS> token closes it in the encoder, and the sequence then goes to the decoder, which decodes it by producing the related output; when the output sequence ends, the <EOS> token stops the decoder as well. After encoding ends, the sequence needs an instruction to enter the decoder, so we use the <GO> token to pass the encoded sequence into the decoder. Some words in a text sequence cannot be replaced by a vocabulary entry and need to be identified, so we use the special <UNK> (unknown) token: whenever an unknown word is found, the <UNK> token is inserted into the sequence. At training time the sequences are divided into batches, and sequences of similar length need to be grouped together in a batch, so we pad them with the <PAD> token.
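
Below is a small sketch of how these special tokens might be applied when preparing the sequences; the token ids are illustrative, since the real vocabulary assigns its own values.

```python
# Illustrative special-token ids; the real vocabulary assigns its own values.
PAD, UNK, GO, EOS = 0, 1, 2, 3
word2id = {"<PAD>": PAD, "<UNK>": UNK, "<GO>": GO, "<EOS>": EOS}

def encode(words, add_go=False):
    """Map words to ids, fall back to <UNK>, and close with <EOS>."""
    ids = [word2id.get(w, UNK) for w in words]
    if add_go:                       # decoder inputs start with <GO>
        ids = [GO] + ids
    return ids + [EOS]

def pad_batch(sequences):
    """Pad every sequence in the batch to the longest one with <PAD>."""
    max_len = max(len(s) for s in sequences)
    return [s + [PAD] * (max_len - len(s)) for s in sequences]
```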


Sample Output

Sample output 1
Sample output 2

Conclusion and Future Work

This model has some limitations: it works only for limited sequence lengths, and the dataset is not large enough. But the model is built for future development, since any research work is a continuous process, and it will keep being developed for the Bengali language. Finding a proper solution to a specific problem needs more research, and future work depends on the limitations of the previous work; solving those limitations helps build an efficient system. In this work, the future tasks are to enlarge the Bengali text dataset and to update the model so that it can handle text of any length, meaning the model will no longer depend on text length. The model is complex and currently runs on TensorFlow 1.15, so the code also needs to be converted to newer versions. After the research is complete, the model needs to be deployed, so building web and mobile applications is important for the future of artificial intelligence. To that end, we have developed an application for automatic Bengali abstractive text summarization.
