Text Summarization, Extractive, T5, Bahasa Indonesia, Huggingface’s Transformers

Imanuel Drexel
Published in Analytics Vidhya
5 min read · Jun 16, 2020
image from: https://www.kdnuggets.com/2019/11/getting-started-automated-text-summarization.html

Yes, the title is just a bunch of keywords I used when I searched Google to do this experiment. Yet you are here, wanting to know how to leverage Hugging Face's Transformers to make your own Bahasa Indonesia (Indonesian) text summarizer, don't you?

The Data

In this experiment, I'm using Kata.ai's IndoSum dataset; you can go to the link there to get it. I'm using only the first fold of their dataset (17k training samples), and the text is kept as a plain paragraph instead of tokenized, for both the full article and the summary.

Snapshot of data used.
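If you want to follow along in code, here's a rough sketch of how one fold could be flattened into paragraph strings. The JSONL field names ("paragraphs", "summary") and the filename are assumptions based on the IndoSum format, so adjust them to whatever your copy of the dataset actually uses.

```python
import json

def flatten(nested_sentences):
    """Join nested token lists (sentences of tokens) into one paragraph string."""
    return " ".join(" ".join(tokens) for tokens in nested_sentences)

def load_indosum(path):
    """Read one IndoSum fold (JSONL) and return (article, summary) string pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            # "paragraphs" is assumed to be a list of paragraphs,
            # each a list of tokenized sentences.
            article = " ".join(flatten(paragraph) for paragraph in row["paragraphs"])
            summary = flatten(row["summary"])
            pairs.append((article, summary))
    return pairs

train_pairs = load_indosum("train.01.jsonl")  # filename for fold 1 is a guess
```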

The Model

Naturally, for a text summarization task, we want a model with an encoder-decoder architecture (sequence in, sequence out; full text in, summary out). So, in the repo, we can choose a model with this architecture. Based on its documentation, only 3 models support it at this time: BART, T5, and MarianMT. Since I wanted to do this task for Bahasa, there's only one user, huseinzol, who has uploaded a model in this language (though I'm pretty sure he trained it on Bahasa Melayu, not Indonesian), but still, his T5 model for summarization works well (and his ALBERT, for classification). So, I chose his model to be fine-tuned.

Initiate the tokenizer and the model.
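Something like the snippet below; the exact checkpoint name is an assumption, so check huseinzol's Hugging Face page for the actual summarization checkpoint.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# The checkpoint name is an assumption -- look up the exact summarization
# checkpoint on huseinzol's Hugging Face profile.
MODEL_NAME = "huseinzol05/t5-base-bahasa-summarization-cased"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```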

The Dataset Class

Dataset class.
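Roughly, the class looks like the sketch below, using the variable names (encoding_paragraphs, encoding_summary) I refer to next. The summary max_length of 150 is just a placeholder, and pad_to_max_length is the older tokenizer argument; newer versions use padding="max_length" and truncation=True.

```python
import torch
from torch.utils.data import Dataset

class SummarizationDataset(Dataset):
    """Wraps (article, summary) string pairs for T5 fine-tuning."""

    def __init__(self, pairs, tokenizer, max_input_len=512, max_target_len=150):
        self.pairs = pairs
        self.tokenizer = tokenizer
        self.max_input_len = max_input_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        sentence_text, summary_text = self.pairs[idx]

        # T5 is text-to-text: prefix the task and close with the EOS token </s>.
        encoding_paragraphs = self.tokenizer.encode_plus(
            "summarize: " + sentence_text + " </s>",
            max_length=self.max_input_len,
            pad_to_max_length=True,          # newer versions: padding="max_length"
            return_attention_mask=True,
            return_tensors="pt",
        )

        # Use the pad token as the decoder start (<sos>) and close with </s>.
        encoding_summary = self.tokenizer.encode_plus(
            self.tokenizer.pad_token + " " + summary_text + " </s>",
            max_length=self.max_target_len,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors="pt",
        )

        return {
            "sentence_text": sentence_text,
            "summary_text": summary_text,
            "input_ids": encoding_paragraphs["input_ids"].squeeze(0),
            "attention_mask": encoding_paragraphs["attention_mask"].squeeze(0),
            "lm_labels": encoding_summary["input_ids"].squeeze(0),
        }
```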

Here is the important part of using the T5 model. Because it casts every NLP task into a text-to-text format, instead of a special token like BERT's, the input must start with "summarize: ". You can see this in the encoding_paragraphs variable in the snippet above. Then, the model needs to know where the full article ends, so we add an EOS token (end-of-sequence, </s> in the tokenizer's vocab). After this token, the tokenizer pads the input (the pad_to_max_length argument) until it reaches max_length (512, because T5 by default also uses that length for its encoder input). The padding itself is useful so we can have a batch size > 1, if the device can handle it. return_attention_mask is also set to True because of that padding: it tells the model which tokens are padding, so it won't attend to them. And with that, we're set for the sequence input.

Now, for the sequence target. T5 uses teacher forcing, which means we have to shift the sequence target right by one token, usually with a start-of-sentence <sos> token. However, the documentation says we can use the pad token as the <sos> token, hence in my encoding_summary I add it to the summary text alongside the <eos>, like in the input. max_length should be smaller than the input's, because it's the summary we're after. pad_to_max_length is still set to True, for the same reason as in the sequence input.

In the return of __getitem__, I put the full text (sentence_text), the summary text (summary_text), the full text's input_ids (input_ids), the full text's attention_mask (attention_mask), and the summary's input_ids (lm_labels). You can use a different configuration here, but don't forget the two primary outputs: the full text's and the summary's input_ids.
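As a quick sanity check, the dataset can be wired into a DataLoader like this (the batch size is arbitrary):

```python
from torch.utils.data import DataLoader

train_dataset = SummarizationDataset(train_pairs, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

batch = next(iter(train_loader))
print(batch["input_ids"].shape)   # e.g. torch.Size([4, 512])
print(batch["lm_labels"].shape)   # e.g. torch.Size([4, 150])
```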

The Training

Training routine
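Roughly like this, continuing from the snippets above. The optimizer, learning rate, and filenames are placeholders rather than exact settings, and note that newer transformers versions pass the target as labels instead of lm_labels.

```python
from torch.optim import AdamW

# Validation data: assuming the dev split of the same fold (filename is a guess).
val_pairs = load_indosum("dev.01.jsonl")
val_loader = DataLoader(SummarizationDataset(val_pairs, tokenizer), batch_size=4)

optimizer = AdamW(model.parameters(), lr=1e-4)  # learning rate is an assumption
best_val_loss = float("inf")

for epoch in range(3):
    # ---- train ----
    model.train()
    train_loss = 0.0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        lm_labels = batch["lm_labels"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=lm_labels,  # older transformers versions called this lm_labels
        )
        loss = outputs[0]  # the loss is the first output when labels are given

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # ---- validate: no gradient updates, just the loss ----
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            outputs = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                labels=batch["lm_labels"].to(device),
            )
            val_loss += outputs[0].item()
    val_loss /= len(val_loader)

    print(f"epoch {epoch}: train_loss={train_loss / len(train_loader):.4f}, "
          f"val_loss={val_loss:.4f}")

    # keep the checkpoint whenever validation improves
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        model.save_pretrained("t5-bahasa-summarization-finetuned")  # hypothetical path
```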

Here, the training itself is just like a regular PyTorch training routine:

model.train() -> define the dataset -> define the dataloader -> iterate through it -> move the data to the device (CPU/CUDA) -> forward the model -> get the output -> get the loss value -> add the loss per batch to the running train_loss -> backpropagate the loss -> step the optimizer.

However, the model itself needs 3 specific inputs: input_ids (the sequence input tokens), attention_mask (whether each input token is a pad or not), and lm_labels (the right-shifted sequence target; newer transformers versions call this argument labels).

The validation is similar; the difference is that we run the model without updating its gradients, so we just calculate the loss. If the loss is better than in the previous epochs, we save the current model.

The Inferencing

Inferencing the model

To do the inference (i.e., generate the summary itself), Transformers already provides a generate function in the model class. We only need to pass input_ids and attention_mask, and that's pretty much it; the other arguments are optional. The return value is the token ids of the summary, so we need to decode them with the tokenizer, which is also already there. Very convenient.
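Something like this; the beam-search settings are just an example, not something you have to use.

```python
def summarize(text, max_summary_len=150):
    # Same preprocessing as training: task prefix plus the EOS token.
    encoding = tokenizer.encode_plus(
        "summarize: " + text + " </s>",
        max_length=512,
        pad_to_max_length=True,
        return_attention_mask=True,
        return_tensors="pt",
    )
    summary_ids = model.generate(
        input_ids=encoding["input_ids"].to(device),
        attention_mask=encoding["attention_mask"].to(device),
        max_length=max_summary_len,  # optional; the defaults also work
        num_beams=2,                 # optional beam search
        early_stopping=True,
    )
    # generate() returns token ids, so decode them back to text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summarize(train_pairs[0][0]))
```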

The Result

Full text is the original text, summary is from the data, and Generated summary is fresh from the model.

The generated summaries, I think, are pretty similar to the first few chunks of the full text. I guess that happens because the model's input is news articles, and the reference summaries fed in are themselves extractive.

The metric usually used for summarization tasks is ROUGE, but this time I'm not calculating it and rely on the loss instead. After 3 epochs, I got a train loss and a val loss of around 0.35 on that first fold of the data.

You can see the full notebook here, but don't forget to change "cuda" to the device you have, and also the paths for reading/writing the docs.

Cheers.
