Master NLP with Hugging Face

Pipelines for performant inference with Hugging Face 🤗

Take good care of memory usage

Alex Litvinov
Geek Culture


Two transformer toys. Photo by Jeffery Ho on Unsplash

We all love transformers! In recent years they have completely revolutionized natural language processing, enabling more accurate and efficient language understanding and generation than the previously used methods, and they are starting to do the same with image and audio processing. And Hugging Face, in turn, has made transformer-based models more accessible and easier to use for researchers and developers. It is as easy as going to the model’s card page and copying the boilerplate code provided there. Or is it…?

Lots of models have sample code for inference on their page. Copy-pasting it is fine when you just want to try the model on a couple of samples. But if you want to run inference on a decent amount of data, this approach most likely won’t work, and it certainly won’t be the most efficient one. In fact, depending on the model, a “decent amount” can be as little as 100 samples before inference becomes impossible.

If you want to use a Hugging Face transformer model for inference on more than a few examples, you should use a pipeline. And below we’re gonna see how to apply this suggestion in practice.

The plan for today is the following:

  1. Take a Hugging Face model and use the inference code provided on the model card.
  2. Run it in a Colab notebook, profiling memory usage and the time the inference takes.
  3. Repeat with a bigger sample and (Spoiler Alert!) fail.
  4. Update the original code to use a pipeline and witness a significant improvement.

The Model

We’ll use the Climate Fact-Checking Model for our experiments. I’m using this example because it pertains to the subject matter of an NLP project I recently participated in, but what matters most is that the principle is applicable to any model from Hugging Face.
The model takes as input two climate-related sentences, one of which is a claim and the other potential evidence. The model is a classifier that predicts the presence of entailment or contradiction. In other words, it labels an input pair as either SUPPORTS, REFUTES, or NOT ENOUGH INFORMATION.
This model makes a good example because it is a slightly more advanced and potentially more helpful case: the model’s input is a pair of sentences rather than a single one.

Links to the Notebook

On Colab or GitHub

Note: Below we’ll look at execution on CPU, but the notebook is ready to be executed on GPU as well if one is available.

Environment setup

To be able to run a Hugging Face model in Colab, we’ll need to install the transformers library first. Also, we’ll install datasets to get the data to run inference on, and memory_profiler to monitor the improvements.
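In Colab this boils down to a single cell along these lines (a minimal sketch; pin specific versions if you need reproducibility):

!pip install transformers datasets memory_profiler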

Inference using the original code

First, let’s take a look at the code from the model card. I’ll modify it a bit right away, extracting the inference part into a function to make it easy to change the inputs.
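Roughly, it ends up looking like the sketch below. This is not the exact notebook code: the model ID is a placeholder to be taken from the model card, and the tokenizer settings follow the usual sequence-pair classification pattern.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: put the exact ID from the climate fact-checking model card here
MODEL_NAME = "<climate-fact-checking-model-id>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def predict_using_sample_code(claims, evidences):
    # Tokenizes all claim/evidence pairs at once and runs a single forward pass,
    # which is exactly what blows up memory when the input gets large
    features = tokenizer(claims, evidences,
                         padding="max_length", truncation=True, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(**features).logits
        label_ids = torch.argmax(logits, dim=1)
    return [model.config.id2label[label_id.item()] for label_id in label_ids]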

To be able to profile execution time and memory usage, one needs to use the %%time and %memit magics, like this:
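The %%time magic is built into IPython; %memit comes from memory_profiler, whose extension has to be loaded once per session:

# Makes the %memit magic available in the notebook
%load_ext memory_profiler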

With profiling set up, we run the inference on the provided sample

%%time
%memit predict_using_sample_code(sample_claim, sample_evidence)

peak memory: 1400.44 MiB, increment: 11.71 MiB
CPU times: user 1.24 s, sys: 140 ms, total: 1.38 s
Wall time: 1.7 s

Memory and run time profile on one sample using the original code

And the result looks kind of acceptable.

Inference using the original code on several samples

Now let’s load more samples and try running the inference again.
The dataset’s test split contains 1535 pairs, which is not a particularly huge number.

Loads the dataset that the model was tested on
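A sketch of that step is shown below; the dataset ID and column names are placeholders to be replaced with the actual ones from the model card.

from datasets import load_dataset

# Placeholder dataset ID: use the test set referenced on the model card
test_dataset = load_dataset("<climate-fact-checking-dataset-id>", split="test")

# Column names are assumed here; adjust them to the actual schema
input_claims = list(test_dataset["claim"])
input_evidences = list(test_dataset["evidence"])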

The time I ran it, I happened to get an environment with an exorbitant amount of RAM (it must have been some glitch on Colab’s side), and these are the shocking numbers I got. 52 GB of RAM is no joke!

%%time
%memit labels = predict_using_sample_code(input_claims, input_evidences)

peak memory: 70648.12 MiB, increment: 52950.69 MiB
CPU times: user 12min 46s, sys: 6min 37s, total: 19min 23s
Wall time: 3min 48s

Memory and run time profile on a bunch of samples using the original code. Source: Image by the author.

But normally it would just kill the notebook with an Out Of Memory error.

Your session crashed. Automatically restarting
Source: Image by the author.

That’s certainly not acceptable and we need to fix it.

Inference using transformers.pipeline

To create a pipeline, we need to specify the task at hand, which in our case is “text-classification”. There are a couple of dozen other tasks available; you can check out the list here. Also, we’re passing the model and the tokenizer, along with the tokenizer’s additional parameters. And lastly, we’re specifying the device to run the pipeline on.
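Put together, the pipeline setup looks roughly like this (a sketch: model and tokenizer are the objects loaded earlier, and the tokenizer settings are assumed to mirror the original code):

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    padding="max_length",  # tokenizer settings, assumed to match the original code
    truncation=True,
    device=-1,             # -1 means CPU; pass a GPU index (e.g. 0) to run on GPU
)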

To execute the pipeline, we need to pass it the data, which can come in two flavors: either a Dataset or a generator. In this instance we could’ve used a dataset, since we have one available, but I decided to use a generator to show how to deal with data that does not come as a Dataset, for example data arriving from HTTP requests or a database.
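A minimal generator for our claim/evidence pairs could look like this; the text/text_pair dictionary is the format the text-classification pipeline accepts for sentence pairs:

def claim_evidence_pairs(claims, evidences):
    # Yields one pair at a time, so nothing forces the whole dataset into memory
    for claim, evidence in zip(claims, evidences):
        yield {"text": claim, "text_pair": evidence}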

batch_size is set to 1 for the sake of comparison and, most importantly, because in this example we’re using a CPU. When running your inferences on a GPU, make sure to play with the batch_size parameter to see which setting performs best on the hardware you’re using; it can make a huge difference.
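Putting the pieces together, here is a sketch of the wrapper called in the profiling cell below (the function name matches that cell; label and score are the standard keys of the pipeline’s output):

def predict_using_pipelines(claims, evidences, batch_size=1):
    labels, probs = [], []
    # The pipeline consumes the generator lazily, yielding one prediction per pair
    for prediction in pipe(claim_evidence_pairs(claims, evidences), batch_size=batch_size):
        labels.append(prediction["label"])
        probs.append(prediction["score"])
    return labels, probs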

Now we’re ready to run the inference with a pipeline:

pred_labels, pred_probs = predict_using_pipelines(input_claims, input_evidences)

peak memory: 1434.24 MiB, increment: 9.80 MiB
CPU times: user 4min, sys: 557 ms, total: 4min
Wall time: 4min 7s

Memory and run time profile using a pipeline. Source: Image by the author.

We can notice a significant improvement right away:

  • First of all, the inference finishes successfully
  • Turns out it’s possible to run inference on the full set using approximately the same amount of RAM as a single sample required with the original code.

I bet you can improve it even more by switching to a GPU and using a proper batch_size for your setup. Please refer to the documentation on pipeline batching for more information.

Conclusion

We saw how to utilize pipeline for inference using transformer models from Hugging Face. This approach not only makes such inference possible but also significantly enhances memory efficiency.

I definitely recommend reading through the “Pipelines for inference” tutorial on Hugging Face for a deeper dive and further adaptation to your needs.

That’s all for today. Happy inferring!
