
Sequence to Sequence learning for Hebrew abstractive summarization

12 min read · Jul 9, 2023


“A robot draws a picture ‘a summary’ in the Pre-Raphaelite style” by DALL·E

Introduction

Recent work on pre-training Transformers with large text corpora has shown great success when such models are fine-tuned on downstream NLP tasks, including text summarization. However, abstractive text summarization for Hebrew has not been explored: first, because of the absence of available generative models for Hebrew, and second, because there is no publicly available summarization dataset in Hebrew.
So, the main focus of this article is a set of simple experiments in training sequence-to-sequence models (mT5 models) for Hebrew summarization.

Task

Our main task is abstractive summarization, which involves going beyond the extraction of existing sentences and generating new, concise summaries that capture the essence of the original text. While abstractive approaches have been extensively explored for English and other widely spoken languages, applying them to Hebrew presents additional complexities due to the language’s unique morphological and syntactic structure. Researchers are actively developing novel approaches that address these challenges, combining neural networks, natural language generation models, and semantic analysis to generate meaningful summaries. However, for the purposes of this simple article, we will explore a straightforward and readily available approach widely used in summarization tasks for other languages, which is training a seq2seq model. It’s worth mentioning that there are also long-document summarization and multi-document summarization techniques, but they are beyond the scope of this article.

Extractive summarization, in contrast, focuses on selecting a subset of sentences that effectively represent a summary of the document, and it is beyond the scope of this article. That task closely aligns with textual matching, making it possible to utilize text similarity models. In the case of Hebrew, it is feasible to train such models on top of Hebrew pre-trained language models such as AlephBERT, HeBERT, or any other suitable model. Here is an example of a bi-encoder model (and I plan to publish a cross-encoder model soon), with a usage sketch below.
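For illustration, here is a minimal usage sketch with sentence-transformers (the model name below is just a placeholder for a Hebrew bi-encoder built on top of AlephBERT; substitute the linked model or your own):

from sentence_transformers import SentenceTransformer, util

# placeholder name for a Hebrew bi-encoder (AlephBERT-based); swap in your own model
model = SentenceTransformer("imvladikon/sentence-transformers-alephbert")

document_sentences = [
    "ראש הממשלה נשא הערב נאום בכנסת.",
    "מזג האוויר מחר צפוי להיות חם ויבש.",
]
query = "נאום של ראש הממשלה"

sentence_embeddings = model.encode(document_sentences, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# rank the document sentences by cosine similarity to the query
scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
for sentence, score in sorted(zip(document_sentences, scores), key=lambda pair: -float(pair[1])):
    print(f"{float(score):.3f}  {sentence}")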

Dataset

Indeed, the absence of publicly available datasets for abstractive summarization in Hebrew poses a significant challenge. A couple of solutions came to mind:

  1. Scraping news sites
    Web scraping news articles may seem like a straightforward solution for gathering data, but it often comes with challenges and limitations. While there are various tools and Python packages available for web scraping and data cleansing, it's important to consider the specifics. I wrote several scraping scripts with metadata extraction and information extraction from linked JSON (JSON-LD), and put them in the GitHub repository. But this approach had to be abandoned due to impracticality.
    There are several reasons why web scraping news articles may not be ideal. Firstly, many news websites implement protection mechanisms that can make scraping difficult or time-consuming, requiring the use of tools like Selenium. However, investing time in overcoming these obstacles may not be desirable for certain projects.
    Furthermore, the quality of the data obtained through scraping methods can be compromised. Advertisements, Twitter links, and iframe content often clutter the article text. Additionally, the article descriptions intended for use as labels may be trivial, contain repetitive parts, or be truncated versions of the article’s beginning.
    These issues result in a high level of overlapping text, which is not ideal as the goal is to generate concise and non-repetitive summaries rather than simply rewriting existing segments.
    Given these challenges, alternative approaches or data collection methods may need to be considered for obtaining high-quality data for abstractive summarization in Hebrew.
  2. Usage of a machine-translated corpus
    The chosen approach for experimentation involved translating a summarization corpus from English to Hebrew. For this purpose, the CNN/DailyMail Dataset, a well-known and classic dataset for summarization, was utilized. This dataset comprises approximately 300,000 news articles in English. You can access the dataset itself and review its statistics at the CNN/DailyMail Dataset link. And here I published a translated version of the dataset.
    To perform the translation, Google Cloud Translation can be utilized. While the service is not free, it is relatively straightforward to use its Python API SDK. Alternatively, a free option is to use Facebook's NLLB model, which is known to provide decent translation quality; a minimal usage sketch is shown below the dataset statistics.
dataset statistics
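For the NLLB option, a minimal translation sketch with the transformers library could look like this (the distilled 600M checkpoint and the example sentence are illustrative choices):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# English -> Hebrew translation with NLLB; the distilled 600M checkpoint keeps it lightweight
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "The prime minister gave a speech in the parliament this evening."
inputs = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("heb_Hebr"),  # target language code
    max_length=512,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])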

3. Trying to synthesise a corpus using an LLM (e.g. ChatGPT)
Indeed, ChatGPT has shown the capability to generate summaries effectively in both English and Hebrew, with the added advantage of being able to use prompts and provide specific instructions (a rough API sketch is shown after the dataset example below). But sometimes, like other LLMs, ChatGPT hallucinates.
However, it’s important to acknowledge that using ChatGPT for generating a large-scale corpus is not a free option.
Creating a corpus of comparable size to the CNN/DailyMail dataset is a challenging task.
A small dataset (~1673 samples) that I made is available here:

from datasets import load_dataset

ds = load_dataset("imvladikon/he_sum_chatgpt")
print(ds["train"][0])
Example of a news summary generated by ChatGPT
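For reference, this is roughly how such summaries can be requested programmatically, using the pre-1.0 openai Python client that was current at the time of writing (the prompt wording is illustrative, not the exact one I used):

import openai

openai.api_key = "..."  # your API key

article = "..."  # a Hebrew news article

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You summarize Hebrew news articles in 2-3 sentences, in Hebrew."},
        {"role": "user", "content": article},
    ],
    temperature=0.3,
)
print(response["choices"][0]["message"]["content"])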

I believe that using LLMs to generate datasets, combined with careful curation of such datasets, is a promising direction; the success of Alpaca shows it. And apparently a task-oriented Hebrew LLM will appear soon; as far as I know, Yam Peleg is working on one.

Metrics

Indeed, because generative tasks have an infinite space of plausible outputs, their metrics are often more challenging than those of discriminative tasks.

  • What do we want to measure in general?
  • How do we want to measure it?

What we want to measure seems obvious in our case: the similarity between the generated summary and the reference summary. (It is also possible to measure the similarity between the source text and the generated summary, which is especially interesting when reference summaries/highlights are absent, but since we are using a machine-translated dataset with labels, we skip this case.)
How we want to measure such similarity is a good question. When it comes to evaluating the quality of generated summaries, there are certain criteria that can be measured:

1. Lexical similarity

First of all, we are interested in lexical similarity, which is why we are going to use the classical ROUGE metrics:
- ROUGE-N (n-gram overlap); since we are interested in unigrams and bigrams, this means ROUGE-1 and ROUGE-2.
- ROUGE-L (longest common subsequence)

More details about these metrics can be found in the original article (see also the implementation by Google and the huggingface wrapper).

A couple of notes regarding the huggingface wrapper:

  • the stemmer needs to be disabled (at least for Hebrew)
  • a Hebrew-specific tokenizer needs to be passed instead of the default one; otherwise we could get strange metric values.

Let’s use a UDPipe model as the tokenizer:

import os

import spacy_udpipe
from spacy_udpipe import UDPipeModel


class UDPipeTokenizer:

    def __init__(self, language):
        if not self._model_exists(language):
            # default Hebrew model is "hebrew-htb-ud-2.5-191206.udpipe"
            # but if you have your own udpipe model, you could provide a path to it
            spacy_udpipe.download(language)
        self.language = language
        self.nlp = spacy_udpipe.load(language)
        self.udpipe_model = None

    def _model_exists(self, language):
        return spacy_udpipe.utils.LANGUAGES[language] in os.listdir(spacy_udpipe.utils.MODELS_DIR)

    def tokenize(self, text):
        doc = self.nlp(text)
        for token in doc:
            yield token.text

    def sentencize(self, text):
        if self.udpipe_model is None:
            self.udpipe_model = UDPipeModel(self.language)

        for sentence in self.udpipe_model(text):
            yield sentence.getText()

    def __call__(self, text):
        return list(self.tokenize(text))


import evaluate

hebrew_tokenizer = UDPipeTokenizer(language="he")
metric = evaluate.load("rouge")
# decoded_preds / decoded_labels are the generated and reference summaries
result = metric.compute(predictions=decoded_preds,
                        references=decoded_labels,
                        use_stemmer=False,
                        tokenizer=hebrew_tokenizer)

In addition, I recommend checking the paper SummEval: Re-evaluating Summarization Evaluation, A. R. Fabbri et al., 2021, which gives a great explanation of evaluation metrics for text summarization tasks.

2. Semantic similarity

The main problem with lexical-based methods is that they often fail to robustly match paraphrases when semantically similar phrases have low lexical overlap.

Regarding semantic similarity metrics, a good question is which metric to use. In different text generation tasks, different metrics such as MoverScore, BERTScore, BARTScore, InfoLM, T5Score, Mutual Implication Score, etc., are more relevant. However, the concept remains more or less the same: measuring semantic closeness or relatedness to the initial text. So feel free to use your own metrics/models, such as cross-encoders trained on the NLI task with AlephBERT as the backbone model.
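For example, here is a minimal sketch with the huggingface evaluate wrapper around BERTScore (the multilingual backbone below is just one reasonable choice for Hebrew, not necessarily the best one):

import evaluate

# toy pair of a generated and a reference summary; in practice pass
# decoded_preds / decoded_labels as in the ROUGE example above
predictions = ["ראש הממשלה נאם הערב בכנסת"]
references = ["ראש הממשלה נשא הערב נאום בכנסת"]

bertscore = evaluate.load("bertscore")
results = bertscore.compute(predictions=predictions,
                            references=references,
                            model_type="xlm-roberta-large")
print(results["f1"])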

Training

So, we have collected the data; let's train the mT5 models using the free tier of Colab (GPU: T4).

Regarding training pipeline explanations and tutorial code, I recommend first of all checking this part of the huggingface “NLP Course”, where there is a great tutorial on training mT5 for text summarization.

mT5 models are not the only multilingual models that include Hebrew during the pre-training stage. There are also mBART-50 by Facebook and GPT-like models trained by Doron Adler. Benchmark results for different models and architectures can be checked on the paperswithcode site. T5 is not the best model on the CNN/DailyMail benchmark, but it is still a good model for downstream tasks. Note that we are planning to use the mT5 model, and according to the paper (repo), it is similar to T5 with a similar training process, except for several differences: it is based on T5.1.1 and trained on mC4, with probabilistic sampling of examples per language during training to boost lower-resource languages. Additionally, the vocabulary has been increased to approximately ~250k wordpieces, following the XLM-R model.

Benchmark results on CNN / Daily Mail according to paperswithcode.com

And the increased vocabulary, together with the corresponding embedding layer, significantly increases the number of parameters and the model size.
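To get a feel for the proportions, here is a small sketch (the exact numbers may vary between transformers versions, but the embedding matrices clearly dominate):

from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

total_params = sum(p.numel() for p in model.parameters())
# 250112 sentencepiece tokens x 512 hidden dims for mT5-small;
# in T5.1.1-style models the lm_head is a separate matrix of the same shape
embedding_params = model.get_input_embeddings().weight.numel()

print(f"total parameters:     {total_params / 1e6:.0f}M")
print(f"embedding parameters: {embedding_params / 1e6:.0f}M "
      f"({embedding_params / total_params:.0%} of the model)")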

Because of the limitations of the free Colab, only mT5-small could be trained without any issues. I assume that with some tweaks it is possible to train mT5-base too (in my experience, there were cases when accelerate could help, but I didn't check it in these experiments).
But it is simpler to prune the embeddings, because we only need the Hebrew and English languages for our experiments. This was done according to the brilliant article “How to adapt a multilingual T5 model for a single language” by David Dale (he was my YData project mentor). A similar idea was also suggested in “Load What You Need: Smaller Versions of Multilingual BERT”, A. Abdaoui et al., 2020.
Also, while writing this article I found these papers, which use a vocabulary trimming method similar to the one in David's article:

Similarly to the “How to adapt a multilingual T5 model for a single language” article, I used the “Leipzig Corpora Collection” for frequency statistics (a rough sketch of this counting step is shown after the dataset examples below), and I put this dataset on huggingface here:

from datasets import load_dataset

heb_wikipedia = load_dataset("imvladikon/leipzig_corpora_collection", "heb_wikipedia_2021_1M", split="train")

Another option is using the Wikipedia dataset itself; I put the dataset here (version: 2023/06/01), where the clean_text column contains the text without wiki markup.
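As a rough sketch of the frequency-counting step from that recipe (the corpus column name and the frequency threshold below are assumptions; slicing the embedding matrices and rebuilding the sentencepiece vocabulary then follow David's article):

from collections import Counter

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
corpus = load_dataset("imvladikon/leipzig_corpora_collection",
                      "heb_wikipedia_2021_1M", split="train")

# count which sentencepiece tokens actually occur in the Hebrew corpus
# ("sentence" is an assumed column name; an English corpus should be counted too)
token_counter = Counter()
for row in corpus.select(range(100_000)):  # a sample is enough for frequency statistics
    token_counter.update(tokenizer(row["sentence"])["input_ids"])

kept_ids = sorted(token_id for token_id, freq in token_counter.items() if freq >= 3)
print(f"{len(kept_ids)} of {len(tokenizer)} tokens survive the frequency filter")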

So, I have placed the checkpoints with reduced sizes here:

The training pipeline code is on GitHub. Example of running it:

python3 -m hebrew_summarizer.cli \
--model_name_or_path "google/mt5-small" \
--tokenizer_name "google/mt5-small" \
--do_train \
--do_eval \
--num_train_epochs 3 \
--dataset_name "imvladikon/he_cnn_dailymail" \
--output_dir "./models" \
--text_column "article" \
--summary_column "highlights" \
--per_device_train_batch_size=8 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate \
--save_total_limit 3 \
--save_strategy "steps" \
--evaluation_strategy "epoch"

If you encounter GPU memory issues, you can try reducing the batch size. I trained the models with relatively low batch sizes (per_device_train_batch_size=2) due to the limitations of the free version of Colab. However, please note that training with smaller batch sizes may result in longer training times. In my case, the training process took approximately 24 hours, and I had to rerun the training multiple times due to the time limits and termination of the Colab environment.

Based on the validation loss, it looked like it would have been feasible to continue training further.

Results and analysis

Final summarization checkpoints:

which can be used through the huggingface pipeline, as shown below:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, SummarizationPipeline

model = AutoModelForSeq2SeqLM.from_pretrained("imvladikon/het5_small_summarization")
tokenizer = AutoTokenizer.from_pretrained("imvladikon/het5_small_summarization")
summarizer = SummarizationPipeline(model=model, tokenizer=tokenizer) # device=0 if you're using CUDA
print(summarizer(text)) # additionally, you may need to adjust params like min/max length, etc.

Results on the validation set:

mT5-small:

eval_gen_len = 101.9809
eval_loss = 1.9887
eval_rouge1 = 34.2573
eval_rouge2 = 13.4089
eval_rougeL = 24.9806
eval_rougeLsum = 31.8807
eval_runtime = 1:07:01.85
eval_samples = 13368
eval_samples_per_second = 3.324
eval_steps_per_second = 0.831

mT5-base(pruned):

eval_gen_len = 92.287
eval_loss = 1.8404
eval_rouge1 = 35.61
eval_rouge2 = 14.5098
eval_rougeL = 25.9692
eval_rougeLsum = 33.3598
eval_runtime = 0:53:11.74
eval_samples = 13368
eval_samples_per_second = 4.188
eval_steps_per_second = 0.524

And here are predicted samples on unseen data in the spreadsheets:

which I generated using these params:

print(summarizer(record["article"],
                 min_length=20,
                 max_length=120,
                 num_beams=10,
                 repetition_penalty=2.5,
                 length_penalty=1.0,
                 early_stopping=True,
                 no_repeat_ngram_size=2,
                 use_cache=True,
                 do_sample=True,
                 temperature=0.8,
                 top_k=50,
                 top_p=0.95)[0]["summary_text"])

example 1:

original news article
mT5-small predictions
mT5-base(pruned) predictions

example 2:

another example
mT5-small prediction
mT5-base(pruned) predictions

Sometimes the models fail and generate nonsense at the end of the predictions:

Cross-lingual summarization

It is interesting to check the cross-lingual summarization ability of the mT5 model for Hebrew, since our original dataset is in English.
In general, this setup resembles the NMT (Neural Machine Translation) task: our articles are in Hebrew and the summaries are in English.
Considering the limitations I faced with Colab and the GPU quota, it is unfortunate that the training crashed just before the validation, so I trained a model for only 1 epoch.
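To build such a cross-lingual training set, the Hebrew articles can be paired back with the original English highlights, roughly like this (just a sketch; the id and column names of the translated dataset are assumptions, so check the actual schema):

from datasets import load_dataset

he_ds = load_dataset("imvladikon/he_cnn_dailymail", split="train")
en_ds = load_dataset("cnn_dailymail", "3.0.0", split="train")

# map article id -> original English highlights
english_highlights = {row["id"]: row["highlights"] for row in en_ds}

# keep the Hebrew article, replace the summary with the original English one
cross_lingual = he_ds.map(lambda row: {"highlights": english_highlights[row["id"]]})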

Even though training for only one epoch may not be sufficient to achieve optimal results, it’s interesting to note that the summaries generated by the model still have some meaning:

initial article
predicted summary in English

Limitations and further directions

The provided article demonstrates a simple case of abstractive summarization using the basic training pipeline with the huggingface library, without hyperparameter tuning. While this provides a starting point, it’s important to note the limitations and consider further directions to enhance the summarization task. Some of these limitations and potential directions include:

  • Long-document summarization: this could probably be done using more sophisticated algorithms, e.g. finding the most important places in the document (similarly to extractive summarization and/or clustering the content, where the cluster centres are such important points) and using those places as generation anchors.
  • Improving data quality instead of using Google-translated data (important note: for simplicity's sake I didn't check how named entities were projected during translation, but it is a crucial thing).
  • Making the model more robust to noise in the data: extending the data and adding data augmentation in terms of different word/phrase orders in Hebrew.
  • Checking factual consistency: this is not a trivial task, and a lot of work is dedicated to the problem. A good question is also how to measure it; some work, like “Evaluating Factual Consistency of Texts with Semantic Role Labeling”, proposes a specific metric for it.
  • Thinking about how to prevent accidental translation, especially when the initial text contains different scripts.
  • Checking multi-task learning on different tasks (translation, paraphrasing, sentence-gap filling as in the PEGASUS paper) and datasets (parallel corpora, for example) to mitigate the problem of not having enough summarization datasets.

Conclusion

As we have seen, mT5 models are still useful for different tasks, and for text summarization in particular, across different languages, including Hebrew.

Also, using the huggingface libraries helps a lot: it was easy to train seq2seq models, and a similar pipeline and code could be used for other text generation tasks: paraphrasing, translation, text simplification/compression, or even training one model on multiple tasks, like the original T5 model with different prefixes.

The post was written by Vladimir Gurevich, a data scientist, NLP engineer and software developer.
Feel free to contact me with any questions, comments, or suggestions on twitter, linkedin, github or huggingface
