Summarization with Hugging Face and Blurr

Alexander Rofail
8 min read · Mar 12, 2022


In a digital landscape increasingly centered around text data, two of the most popular and important tasks we can use machine learning for are summarization and translation. Hugging Face plays a vital role in enabling virtually anyone with an internet connection and some ML/DL/SWE experience to build models for summarization and translation tasks. Hugging Face gives us access to a repository of text corpora and even a variety of pre-trained models to use as a jumping-off point for NLP development.

Blurr is an awesome library built by Wayde Gilliam that brings together my favorite framework, Fast.ai, and Hugging Face transformers. Blurr’s high-level API is extremely developer-friendly and streamlines the process of working with three crucial pieces of building any ML/DL model: the DataBlock, DataLoaders, and Learner.

This blog is a result of the ongoing Fastai + Hugging Face study group being graciously put together by Weights and Biases’ Sanyam Bhutani and Wayde Gilliam.

I write blogs like this mostly for my own benefit: retaining information and giving myself an easy place to go back and check how to do something when I inevitably forget. So let’s jump right into it.

Main NLP Tasks

The list below lays out the main tasks (in order) that need to be considered and worked through when approaching any NLP task.

  • Retrieve a dataset.
  • Get Hugging Face Object (in this context).
  • Pre-process data.
  • Construct a datablock.
  • Define a Learner and Metrics.
  • Train the model.
  • Inference on unseen data.

Retrieve a dataset.

Since we are working with notebooks, let’s start off with our imports. It’s important to note that since BLURR is under heavy development at the time of this writing, Wayde advises we install from the dev branch: pip install git+https://github.com/ohmeow/blurr.git@dev-2.0.0

We also need to have nltk installed: pip install nltk.

Ok, imports.
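Something like the following, assuming blurr’s dev-2.0.0 module layout (the wildcard imports follow the usual Fast.ai convention):

    import pandas as pd

    from fastai.text.all import *
    from transformers import *

    # blurr's 2.0 branch nests its text modules under blurr.text
    from blurr.text.data.all import *
    from blurr.text.modeling.all import *

    import nltk
    nltk.download("punkt", quiet=True)  # sentence tokenizer used by ROUGE scoring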

To retrieve our dataset from Hugging Face’s amazing repositories of data, we are going to use the super helpful load_dataset() function they provide, and then stick it in a Pandas DataFrame for ease of use.
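A minimal sketch, assuming the CNN/DailyMail dataset we end up summarizing later in this post (the split slice is an arbitrary choice to keep things fast):

    from datasets import load_dataset

    # Grab a manageable slice of CNN/DailyMail and drop it into pandas
    raw_ds = load_dataset("cnn_dailymail", "3.0.0", split="train[:1000]")
    cnndm_df = pd.DataFrame(raw_ds)
    cnndm_df.head()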


Hugging Face Objects

Now that we have a dataset, we want to utilize the pre-trained models that Hugging Face gives us access to. We use BLURR’s get_hf_objects() method, passing in a pretrained model and the specific model class we want to utilize via the model_cls parameter. In this case we are going to use the BartForConditionalGeneration model from Hugging Face Transformers. If you are feeling curious, I highly suggest you check out the documentation linked there for how this class is set up.
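Roughly like so; “facebook/bart-large-cnn” is my assumption for the checkpoint, a BART model fine-tuned for news summarization:

    from transformers import BartForConditionalGeneration

    # A BART checkpoint fine-tuned for summarization (assumed here)
    pretrained_model_name = "facebook/bart-large-cnn"

    hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
        pretrained_model_name, model_cls=BartForConditionalGeneration
    )
    hf_arch, type(hf_tokenizer), type(hf_model)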


When using pretrained models and all the other great capabilities Hugging Face gives us access to, it’s easy to just plug and play; if it works, it works. But it’s a lot more fun and intellectually stimulating to actually understand what’s going on under the hood of these hf_objects and pretrained models. That’s why we print out the details of the objects we just instantiated. You can see we are using the Bart architecture, the tokenizer that was trained to work with that architecture, and the PyTorch subclass BartForConditionalGeneration that we passed into the model_cls parameter.

I would argue the most important thing to explore here is the config piece. This BartConfig configuration establishes all the hyperparameters that will be used in the model: for example, activation_dropout, activation_function, vocab_size, and many more. Check the link to the documentation if you are curious. We can even print out the hf_config piece and see what we are working with. It’s too long to include the entire output here, but a few fields give the flavor.
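These attribute names come straight from BartConfig:

    # A few of the hyperparameters mentioned above
    hf_config.activation_dropout, hf_config.activation_function, hf_config.vocab_size

    # ...or dump the whole configuration
    print(hf_config)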

Preprocess Data

Depending on the domain you are working in and the data you are working with, preprocessing can be a nitty-gritty part of the ML workflow. Luckily, with NLP tasks like summarization it is not too heavy a lift, and utilizing BLURR’s SummarizationPreprocessor makes that lift even lighter.

I will spare the reader a walk through each of these parameters, but some noteworthy aspects of this preprocessor are the parameters we can pass integers to. This is where we can control the truncation of the input and target lengths, as well as the minimum summary length.
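A sketch of wiring it up; the attribute names assume the CNN/DailyMail column schema, and the length values are illustrative choices, not gospel:

    preprocessor = SummarizationPreprocessor(
        hf_tokenizer,
        id_attr="id",                  # CNN/DailyMail column names (assumed)
        text_attr="article",
        target_text_attr="highlights",
        max_input_tok_length=256,      # truncate inputs
        max_target_tok_length=130,     # truncate targets
        min_summary_char_length=30,    # drop examples with tiny summaries
    )

    proc_df = preprocessor.process_df(cnndm_df)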

Now that we’ve (easily) preprocessed our data, we utilize BLURR’s BatchTokenizeTransform class to handle everything required to assemble a mini-batch of inputs and targets, including decoding the products of the tokenization that occurs in the encodes method. We are going to use BLURR’s Seq2SeqBatchTokenizeTransform and pass in some default text generation keyword arguments that will be utilized in the hf_model.generate method. These text_gen_kwargs establish parameters for how we want our summaries to be generated: for example, a min/max length, a beginning-of-sentence token id, “bad” word ids, and penalties for things like repetition and length.
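blurr ships a default_text_gen_kwargs helper for exactly this; if your version lacks it, you can pass a plain dict of generate() arguments instead. The max lengths here mirror the preprocessing step:

    # Derive sensible generation defaults (min/max length, beams, penalties, etc.)
    text_gen_kwargs = default_text_gen_kwargs(hf_config, hf_model, task="summarization")

    batch_tokenize_tfm = Seq2SeqBatchTokenizeTransform(
        hf_arch, hf_config, hf_tokenizer, hf_model,
        max_length=256,
        max_target_length=130,
        text_gen_kwargs=text_gen_kwargs,
    )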

DataBlock

The final line of the code snippet above brings us to the steps of constructing our DataBlock and DataLoaders, which is where Fast.ai comes into play. I did a post exploring Fast.ai’s DataBlock API here.
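The block setup looks roughly like this; the proc_* column names come from the preprocessing step and are assumptions on my part:

    blocks = (Seq2SeqTextBlock(batch_tokenize_tfm=batch_tokenize_tfm), noop)

    dblock = DataBlock(
        blocks=blocks,
        get_x=ColReader("proc_article"),     # preprocessed input column (assumed)
        get_y=ColReader("proc_highlights"),  # preprocessed target column (assumed)
        splitter=RandomSplitter(),
    )

    dls = dblock.dataloaders(proc_df, bs=2)  # small batch size for long documents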

Since we are dealing with text data, a lot of text data, it is advisable to use small batch sizes to speed up training. It is also advisable to check the shapes of our data in the dataloaders. We can see that there are two elements in “one” (so probably not the best variable name; my bad) and that the input_ids are of shape 2x257 (two items of 257 tokens each).
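Checking that looks something like this (the 257 reflects this particular batch):

    one = dls.one_batch()
    len(one), one[0]["input_ids"].shape
    # => (2, torch.Size([2, 257]))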

Defining a Learner and Metrics

Now that we have our data all set up to actually put into a model, we have to define some metrics for measuring how well (or poorly) the model is performing. In this case we are going to use two metrics that are popular for summarization: ROUGE and BERTScore. Understanding these and other metrics is critical to constructing effective summarization models. For now I will spare the reader the details, but I have linked the sources for both metrics, and I will probably do an entire post on that topic later.

For now we will just set up a dictionary of dictionaries that contains the necessary strings for us to plug into a callback system. Within this callback we can also decide how often we want to compute these metrics: after every epoch, every other epoch, or just after the last epoch.
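That dictionary looks roughly like this; the compute_kwargs mirror what the rouge and bertscore metrics expect:

    seq2seq_metrics = {
        "rouge": {
            "compute_kwargs": {
                "rouge_types": ["rouge1", "rouge2", "rougeL"],
                "use_stemmer": True,
            },
            "returns": ["rouge1", "rouge2", "rougeL"],
        },
        "bertscore": {
            "compute_kwargs": {"lang": "en"},
            "returns": ["precision", "recall", "f1"],
        },
    }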

We use BLURR’s BaseModelWrapper to conveniently plug in our pre-trained model; this wrapper is the same as nn.Module but without the need for subclasses to call super().__init__. We use the Seq2SeqMetricsCallback to pass in ROUGE and BERTScore, with the decision to only calculate those metrics after the last epoch. Then we create a classic Fast.ai Learner, passing in our dataloaders, model, an optimization function, loss function, callbacks, and splitter.
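Putting that together looks roughly like this; the splitter helper and the calc_every value are from memory of the blurr API, so double-check them against your version:

    model = BaseModelWrapper(hf_model)

    learn_cbs = [BaseModelCallback]
    fit_cbs = [Seq2SeqMetricsCallback(custom_metrics=seq2seq_metrics, calc_every="last_epoch")]

    learn = Learner(
        dls,
        model,
        opt_func=partial(Adam),
        loss_func=PreCalculatedCrossEntropyLoss(),  # the loss HF computes for us
        cbs=learn_cbs,
        splitter=partial(blurr_seq2seq_splitter, arch=hf_arch),
    )

    learn.create_opt()
    learn.freeze()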

So if we grab a batch and run it through the model, we get back an object that contains our prediction data. Printing it shows us the numerical representations the model works with.
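In code, that is just:

    b = dls.one_batch()
    preds = learn.model(b[0])
    preds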


We look at shapes a lot in ML, so let us look at the shapes of different aspects of this preds output.
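For example (the exact numbers reflect this particular batch):

    preds.loss, preds.logits.shape
    # => (None, torch.Size([2, 36, 50264])) on this batch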

We can see there’s no value for loss because we have not actually run a training step yet. Our logits tell us we have two examples, each looking at 36 tokens, with a vocabulary of over 50k that we are trying to predict.

Training

Training is another area where Fast.ai comes in. We can use Fast.ai’s awesome learning rate finder to pick a sensible learning rate.
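That is a one-liner:

    learn.lr_find()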

We can then do some training using Fast.ai’s fit_one_cycle() method. We will just do one epoch for the sake of time; this is also where we pass in that metrics callback we covered above.
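The learning rate here is an assumption; pick one off the lr_find plot:

    learn.fit_one_cycle(1, lr_max=3e-5, cbs=fit_cbs)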

Now let’s check some results:
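blurr wires up Fast.ai’s show_results for seq2seq models; the truncation arguments just keep the output readable (exact kwargs may vary by version):

    learn.show_results(learner=learn, input_trunc_at=500, target_trunc_at=250)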

Inference

For better precision we will start by setting our model back to 32-bit floating point. We can also export this model, so we can deploy it somewhere (using something like AWS) later. We can pull a test article from our DataFrame as well, so that we can read it for ourselves before we let the machine summarize it for us.
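Something like the following; the export filename and row index are arbitrary choices:

    learn.to_fp32()
    learn.export(fname="summarize_export.pkl")

    # Pull an article to read for ourselves before the model summarizes it
    test_article = cnndm_df["article"].iloc[0]
    print(test_article[:500])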

If we exported the model (to a .pkl file), we can bring it back in and call BLURR’s blurr_summarize() method on our model, passing in that test article from above.
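Bringing it back for inference looks like this; blurr_summarize’s exact return shape may differ across versions, so I just print whatever comes back:

    inf_learn = load_learner(fname="summarize_export.pkl")

    outputs = inf_learn.blurr_summarize(test_article)
    print(outputs)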

There you have it: summarizing news from a dataset of CNN text articles using BLURR, Hugging Face, and Fast.ai. There is a lot we can tweak and play around with throughout this process, like the dataset, the model and model config decisions, and the metrics we want to use.

Our next task is to use a similar framework (same tools) to do text translation.
