How We Used AWS Inferentia to Boost PyTorch NLP Model Performance by 4.9x for the Autodesk Ava Chatbot

Binghui Ouyang
Published Apr 7, 2021


© Autodesk Inc.

Autodesk is a multinational software company with world-renowned products in areas such as Architecture, Engineering, & Construction, Manufacturing, and Media & Entertainment. Among Autodesk’s best-known products are AutoCAD, Revit, Maya, and Fusion 360. The company has millions of customers around the world, and many of them need support to make the best use of their products.

As part of the process of improving the customer support experience, the company developed AVA, the Autodesk Virtual Agent, which is Autodesk’s customer support chatbot. The front end consists of a dynamic web component, which can be embedded in different sites and applications.

Six NLP models form part of the AVA backend, which decides the best response or next action to present to the customer based on their input. For example, one of the NLP models is the Intent Model, which classifies a customer’s natural language input into tasks such as ‘introducing product information’, ‘initiating product downloads’, and ‘helping manage subscriptions’. AVA answers over 100,000 customer questions per month by applying natural language understanding (NLU). Therefore, both the speed and the cost of model inference are important to ensure a good customer experience with AVA.

AWS Inferentia is the first Machine Learning chip by AWS, which promises to achieve the highest throughput at almost half the cost per inference when compared with GPUs. Given the need for a high-quality, efficient service, we decided to test and benchmark the performance of Inferentia on the Intent Model.

Designing the Benchmark

The AVA Intent Model is a BERT sequence classification model built with PyTorch [1] and the Hugging Face Transformers library (version 3.4.0) [2]. AWS Neuron is a software development kit (SDK) for running machine learning inference on AWS Inferentia chips, and it integrates with PyTorch. With Neuron, ML developers can compile a pretrained BERT model and use the Neuron runtime and profiling tools to benchmark inference performance.

BERT models are a popular example of a Transformer model. These models are large, with hundreds of millions of parameters or more, and are generally built in two stages. The first is the training of the base language model, and the second is the creation of a task-specific fine-tuned model.

Inference in transformer models is computationally expensive, and generally inefficient and memory-intensive on CPU-based architectures. While GPU-based systems deliver performance, they can be costly. It is possible to use quantization, distillation, or other approaches to create a smaller, less costly model, but in the AVA case a different hardware architecture was the more appealing option.

One of the key factors that led to our interest in Inferentia is that chatbots need predictable, low-latency responses. Chatbot requests usually arrive independently rather than in groups, so it is difficult to build batches or queues of inputs, and inference needs to perform well at small batch sizes. As models are chained or called in parallel, stable, scalable throughput is also essential. This need for stability, throughput, cost-efficiency, and small batch sizes made Inferentia a potentially attractive choice.

Compiling a Model using Neuron

Historically, there was a large separation between training and inference infrastructure in many deep learning models. For example, before the projects merged, many models trained in PyTorch were ported to Caffe2 for inference. One lesson from this era is that any new approach to inference must keep the process of cross-compilation as short and simple as possible — it is no longer feasible to devote significant time purely to re-engineering models for production.

For Inferentia, the process is described below, and is largely automatic. The procedure is to take the conventionally-trained model, and perform a ‘trace’, cross-compiling it for the new hardware. There are a few lines of code to create that traced model, and then the remaining inference code is as normal. The same code can be used for inference on custom hardware, with different model files. This is a great advantage for testing and engineering new models quickly.

The model compilation was done within SageMaker. We first loaded the fine-tuned Intent Model in a SageMaker notebook using AutoModelForSequenceClassification from the Hugging Face transformers library.

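import torch.neuron  # torch-neuron package; also makes `torch` available
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)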

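We then use torch.neuron.trace from AWS Neuron to generate a TorchScript module that is optimized by AWS Neuron, and save it for later use. The trace call needs example_inputs, a tuple of sample input tensors, which is built from a tokenized sentence as shown in the snippet further down.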

model_neuron = torch.neuron.trace(model, example_inputs, compiler_args=['-O2'], verbose=1, compiler_workdir='./compile')
# Save the TorchScript for later use
model_neuron.save('')

We can now test the model inference within the SageMaker notebook. The example input we used for the test was “change company address”, which belongs to the intent of changing customer information.

# Set up some example inputs
docs = "change company address"
tensors = tokenizer.encode_plus(docs, max_length=512, pad_to_max_length=True, return_tensors="pt")
example_inputs = tensors['input_ids'], tensors['attention_mask'], tensors['token_type_ids']
# Get the prediction result for the example input
classification_logits_neuron = model_neuron(*example_inputs)
# Softmax over the class dimension turns the logits into probabilities
logits = torch.nn.functional.softmax(classification_logits_neuron[0], dim=-1).detach().cpu().numpy()
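Turning the model output into a predicted intent comes down to a softmax followed by an argmax. The following is a minimal, dependency-free sketch of that step. The label names in INTENT_LABELS and the logit values are hypothetical, chosen only for illustration; the real label set of the Intent Model is not listed in this article.

```python
import math

# Hypothetical label set, for illustration only.
INTENT_LABELS = [
    "product_information",
    "product_download",
    "manage_subscription",
    "change_customer_information",
]


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent(logits, labels=INTENT_LABELS):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits standing in for one row of the model output.
label, prob = predict_intent([0.2, -1.3, 0.1, 3.4])
print(label, round(prob, 3))
```

With a real model, the list passed to predict_intent would come from the first row of the logits tensor, and the labels from the mapping used during fine-tuning.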

After testing, we found that the model optimized by AWS Neuron returns the correct predicted intents, so we moved on to deploying it. To deploy the compiled model, we upload the compiled TorchScript file to an S3 bucket. From there, we create an Amazon EC2 Inf1 instance and copy the TorchScript file onto it.


Before we test model inference on an Inferentia chip, we need to create an ML Deep Learning Inf1 instance. This webpage shows the process of how to create the instance.

Next, we use the following command to SSH into the Inf1 instance from the command line. Note that your AWS pem key file needs to be in your working directory when you run the command. You also need to make sure that the inbound rules of your Inf1 instance allow SSH connections from your IP address. You can set this up in the security settings of your Inf1 instance.

ssh -i [pem key file] ec2-user@[IP address of your Inf1 instance]

The following code runs the benchmark process for the model inference and saves the benchmark result to a CSV file.

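import os
import time
import numpy as np
import pandas as pd

# Excerpt from the benchmark script: num_infer, latency_list, args, and live
# are defined elsewhere in the script and updated by the inference workers.
pids = []
current_num_infers = []
throughputs = []
p50s = []
p90s = []
last_num_infer = num_infer
for _ in range(args.throughput_time // args.throughput_interval):
    current_num_infer = num_infer
    throughput = (current_num_infer - last_num_infer) / args.throughput_interval
    p50 = 0.0
    p90 = 0.0
    if latency_list:
        p50 = np.percentile(latency_list[-args.latency_window_size:], 50)
        p90 = np.percentile(latency_list[-args.latency_window_size:], 90)
    print('pid {}: current infers {} throughput {:.3f}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), current_num_infer, throughput, p50, p90))
    # Record this interval so that it ends up in the CSV dump
    pids.append(os.getpid())
    current_num_infers.append(current_num_infer)
    throughputs.append(throughput)
    p50s.append(p50)
    p90s.append(p90)
    last_num_infer = current_num_infer
    time.sleep(args.throughput_interval)  # wait one sampling interval
global live
live = False  # signal the inference workers to stop
df_dump = pd.DataFrame({'pid': pids, 'current_num_infer': current_num_infers,
                        'throughput': throughputs, 'latency_p50': p50s, 'latency_p90': p90s})
df_dump.to_csv('benchmark_dump_neuron_v3.csv', index=False)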
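The excerpt above relies on state that other parts of the benchmark script maintain. As a rough, self-contained illustration of how such a harness can fit together, the sketch below runs worker threads that count inferences and record latencies while the main thread samples throughput at a fixed interval. The names run_benchmark and fake_infer are hypothetical, and fake_infer only sleeps for about 2 ms as a stand-in for the real model call, so the numbers it produces say nothing about Inferentia.

```python
import threading
import time


def fake_infer():
    """Stand-in for a real model call; sleeps for about 2 ms."""
    time.sleep(0.002)


def run_benchmark(infer, num_threads=2, duration=0.3, interval=0.1):
    """Call `infer` from worker threads and sample throughput every `interval` seconds."""
    state = {"num_infer": 0, "live": True}
    latencies = []
    lock = threading.Lock()

    def worker():
        while state["live"]:
            start = time.perf_counter()
            infer()
            elapsed = time.perf_counter() - start
            with lock:
                state["num_infer"] += 1
                latencies.append(elapsed)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()

    samples = []
    last = 0
    for _ in range(round(duration / interval)):
        time.sleep(interval)
        with lock:
            current = state["num_infer"]
        samples.append((current - last) / interval)  # inferences per second
        last = current

    state["live"] = False  # tell the workers to stop
    for t in threads:
        t.join()
    return samples, latencies


samples, latencies = run_benchmark(fake_infer)
print("throughput per interval:", [round(s, 1) for s in samples])
```

Replacing fake_infer with a function that calls the compiled model on a fixed example input would turn this into a real measurement; the thread count and sampling interval are tuning choices rather than values taken from this article.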

Running Similar Benchmark Steps for the Same Model Deployed in a G4 Instance

We loaded the model in a SageMaker notebook and uploaded it to S3. After creating an EC2 G4 instance (a g4dn.xlarge in this proof of concept), we copied the model files from S3 to the instance, connected over SSH, and ran inference and benchmark scripts similar to those used for the Inf1 instance.

We chose the g4dn as the comparison point because it is one of the most popular instance types for GPU inference.

Benchmark Result

Using Inferentia, we were able to obtain a 4.9x higher throughput over g4dn for the Intent Model for AVA.

The following table shows the throughput and latency of model inference on the Inf1 instance with a batch size of one. Throughput is the number of inferences per second, and latency is the number of seconds a single inference takes. latency_p50 is the 50th percentile of the model latency, and latency_p90 is the 90th percentile.

The following table shows the throughputs and latencies of model inference in G4 instance.
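The metrics reported for both instances can be made concrete with a small sketch. It computes p50 and p90 with linear interpolation between ranks (the default behavior of np.percentile) and derives throughput for a single worker that sends requests back to back. The latency values are made up for illustration and are not measurements from this benchmark.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between ranks."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Made-up per-request latencies in seconds.
latencies = [0.010, 0.012, 0.011, 0.030, 0.013]

latency_p50 = percentile(latencies, 50)
latency_p90 = percentile(latencies, 90)
# One worker, requests sent back to back: inferences per second.
throughput = len(latencies) / sum(latencies)

print(f"p50={latency_p50:.4f}s p90={latency_p90:.4f}s throughput={throughput:.1f}/s")
```

The p90 value is pulled up by the single slow request while the p50 barely moves, which is why both percentiles are worth reporting for a latency-sensitive chatbot.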


The benchmark results show an almost five-fold increase in the throughput of the intent model inference in an Inf1 instance compared to model inference in a G4 instance, while having approximately half of the latency. This successful proof of concept encourages us to deploy more models in production on Inferentia in the future.

Benchmark examples from various other NLP applications report cost reductions of 30% to 45%, so we are looking forward to testing Inferentia on AVA NLP models in production. Once we have benchmark results there, we will have more information about the cost savings.
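How throughput translates into cost can be sketched with simple arithmetic: the cost per inference is the hourly instance price divided by the number of inferences served in an hour. The prices and throughputs below are placeholders, not AWS list prices and not our measured values. The function name cost_per_million is also just an illustrative choice.

```python
def cost_per_million(price_per_hour, throughput_per_second):
    """Dollars needed to serve one million inferences at full utilization."""
    inferences_per_hour = throughput_per_second * 3600
    return price_per_hour / inferences_per_hour * 1_000_000


# Placeholder inputs: same hourly price, 4.9x throughput on the second instance.
gpu_price, gpu_throughput = 1.00, 100.0
inf1_price, inf1_throughput = 1.00, 490.0

gpu_cost = cost_per_million(gpu_price, gpu_throughput)
inf1_cost = cost_per_million(inf1_price, inf1_throughput)
saving = 1 - inf1_cost / gpu_cost

print(f"GPU: ${gpu_cost:.2f}  Inf1: ${inf1_cost:.2f}  saving: {saving:.0%}")
```

This assumes the instance is fully utilized around the clock. Real savings depend on actual instance prices and on how much of the available throughput production traffic uses, which is why production benchmarks are needed before quoting a number.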

From a more general perspective, the simplicity of the process makes Inferentia an attractive option for models whose expected traffic suits Inf1. The high, stable throughput and lower cost are especially helpful where small or fixed batches and always-on availability are required. In addition, the nature of the Neuron SDK cross-compilation means that deployment can be easily automated: compiling a model for the custom inference hardware can be done as part of a standard deployment approach with only a few extra steps.