Data Science

Plug-and-Play ML Models with ‘Accelerated Inference API’ from Hugging Face

With a 4-step guide to a Python implementation

Dr. Dharini R


Hugging Face is an Artificial Intelligence developer community that helps by contributing resources like tools, models, datasets, and solutions. This enables fellow developers to code and deploy projects related to the broad spectrum of Data Science.

The Hugging Face Accelerated Inference API lets us use a model by sending it input and receiving the model’s output. All of this can be done with just a simple HTTP request. Yes, as simple as that!

The Accelerated Inference API enables ‘plug-and-play’ use of machine learning models by means of API calls.

What are the benefits of the Inference API?

  • Utilize ML models without the hassle of building one.
  • Access models that are built for a wide range of NLP tasks (like Summarization, Translation, Classification, Question-Answering, Zero-Shot classification, Named Entity Recognition), image processing tasks (Computer Vision), audio processing tasks, statistical data processing, etc.
  • Infer from the generously available Transformer models (like GPT-2, T5, BERT, XLM-RoBERTa, and many more).
  • Run large models with ease, which might otherwise be difficult to deploy.
  • Upload a model securely and manage it privately.
  • Choose from a range of plans for both contributors and organizations that provide CPU-Accelerated Inference and GPU-Accelerated Inference.
  • Track the used and remaining number of characters, the number of requests made from an account, and much more on the API Usage Dashboard.
  • Extensive documentation helps understand the models in Accelerated Inference and how to utilize them.
  • Get “Accelerated” inference: up to 10 times faster inference.
  • Options are available for streaming your data to the API through parallelism and batch jobs.

Who can use the Inference API?

  • An entrepreneur who wants to understand how ML can help the company’s purpose.
  • A student who wants to build an ML project with less coding knowledge.
  • Researchers who want to combine their own feature engineering or visualization with State-Of-The-Art (SOTA) ML models.
  • A developer who wants to get a proof of concept done for a task in a short time and showcase the result to clients.
  • A company that is satisfied with the proof of concept and wants to go ahead and build its app on top of the SOTA Transformer models.
  • And the list goes on. In short, anyone interested in tasting the fruit of advanced developments in the field of Artificial Intelligence.

How to use the Inference API?

The 4-step process to create a Python implementation:

1. Select a model for a specific task

2. Identify the model’s Inference API

3. Create Access Token

4. Build the Python project using the Inference API

1. Select a model for text summarization from Hugging Face

Out of the many models prevailing in the ocean of the Hugging Face community, let us pick one model for the task of Summarization. Kindly go through the following link, which shows the countless models available for various tasks like Classification, Sentiment analysis, Question and Answer generation, etc.

  • As a first step, let us use the model facebook/bart-large-cnn, introduced by Lewis et al. from Facebook. [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension]
  • The information related to the model, like the definition, description, parameters, and usage instructions, can be found in this link.

2. Identify the model’s Inference API

  • To use the Inference API, select the Accelerated Inference option under the Deploy button as shown below.
  • Clicking that will lead us to a tab containing a Python script that can be used for the inference, as shown below.
  • The API_URL and headers are the important parts to consider: the former gives us the URL of the model, and the latter lets us access the model using the Access Token.
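The shape of these two values can be sketched as follows. This is a minimal sketch, assuming the hosted Inference API uses a fixed base URL followed by the model id; the helper names build_api_url and build_headers are hypothetical, hf_xxx is a placeholder rather than a real token, and the script shown on the model page remains the authoritative source for the exact URL.

```python
# Assumed base URL of the hosted Inference API; copy the exact value
# from the script shown on the model page.
API_BASE = "https://api-inference.huggingface.co/models/"


def build_api_url(model_id: str) -> str:
    """Return the endpoint for a model id such as 'facebook/bart-large-cnn'."""
    return API_BASE + model_id


def build_headers(access_token: str) -> dict:
    """Return the HTTP headers that authenticate a request."""
    return {"Authorization": f"Bearer {access_token}"}


API_URL = build_api_url("facebook/bart-large-cnn")
headers = build_headers("hf_xxx")  # placeholder, not a real token
```

Switching to another model then only requires changing the model id passed to build_api_url.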

3. Create Access Token

  • To get the Access Token, create a profile in Hugging Face and create a new access token by following the steps Profile -> Settings -> Access Tokens tab.
  • Once we get an access token, we can use it as the value for Bearer in headers.
  • Now let’s move on to the next part of our project: building the Python script.
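Before moving on, a note on handling the token: it is a secret, so one option is to read it from an environment variable instead of pasting it into the script. The sketch below assumes a variable named HF_API_TOKEN and the helper names load_token and auth_headers, all of which are hypothetical choices made for illustration.

```python
import os


def load_token(env_var: str = "HF_API_TOKEN") -> str:
    """Read the access token from an environment variable (the name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token


def auth_headers(token: str) -> dict:
    """Use the token as the Bearer value in the request headers."""
    return {"Authorization": f"Bearer {token}"}
```

With the variable set in the shell, auth_headers(load_token()) gives the headers dictionary used when calling the model.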

4. Build a summarization project by utilizing the Inference API

The documentation about the summarization task, its parameters, and its usage is given in this link.

  • As a prerequisite, kindly install the requests library (for example, with pip install requests).
  • The initial step in the code is to import the requests library.
  • The next step is to use the API URL and the access token to call the model.
  • The summarization model takes an input text and generates an abstractive summary.
  • As shown in the snippet below, we can get the text input from the user. The length of the generated output can be bounded using min_len and max_len, which are passed to the API as the min_length and max_length parameters (measured in tokens rather than words).
  • Now comes the best part, sending the values in our input parameters to the model and storing the result in the output variable. The code snippet for the same can be seen below.
  • That's it! The Python implementation for utilizing the Inference API is complete. The inputs given and the output generated can be seen below.
  • The full code is displayed below and is also available on GitHub.
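The steps above can be put together roughly as follows. This is a minimal sketch rather than the exact script from the repository: the helper names (build_payload, query, extract_summary, summarize), the optional post argument, and the default lengths are assumptions made for illustration, and the token is passed in as an argument. The min_length and max_length parameter names and the summary_text field follow the Inference API's summarization format.

```python
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"


def build_payload(text, min_len=30, max_len=120):
    """Package the input text and the length limits (in tokens) for the API."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    return {
        "inputs": text,
        "parameters": {"min_length": min_len, "max_length": max_len},
    }


def query(payload, token, post=None):
    """POST the payload to the model and return the decoded JSON response.

    `post` is a hypothetical hook for swapping in a stand-in during offline
    testing; by default requests.post is used.
    """
    if post is None:
        import requests  # third-party: pip install requests
        post = requests.post
    headers = {"Authorization": f"Bearer {token}"}
    response = post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # surface HTTP errors such as an invalid token
    return response.json()


def extract_summary(output):
    """Pull the summary out of a response shaped like [{"summary_text": "..."}]."""
    return output[0]["summary_text"]


def summarize(text, token, min_len=30, max_len=120, post=None):
    """Run the full flow: build the payload, call the model, read the summary."""
    output = query(build_payload(text, min_len, max_len), token, post=post)
    return extract_summary(output)
```

A call such as print(summarize(input("Enter the text to summarize: "), token)), with token holding the access token from step 3, sends the user's text to the model and prints the generated summary.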


Hugging Face provides valuable resources to the AI community, and one such resource is the Accelerated Inference API. In this article, we explored the concept of the “Hugging Face Accelerated Inference API,” its advantages, and a small demo of how to build a project using the API. This demo Python script can be extended in many ways, like working on a different task or model or adding more parameters.



Dr. Dharini R

Holds a doctorate in the field of Natural Language Processing. Passionate about writing and learning about NLP, AI, and ML.