Published in PyTorch

How We Used AWS Inferentia to Boost PyTorch NLP Model Performance by 4.9x for the Autodesk Ava Chatbot

© Autodesk Inc.

Designing the Benchmark

Compiling a Model using Neuron

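import torch
import torch_neuron  # registers torch.neuron; requires the AWS Neuron SDK on an Inf1 instance
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# model_path points to the fine-tuned intent-classification model
tokenizer = AutoTokenizer.from_pretrained(model_path)
# return_dict=False returns plain tuples instead of dict-like outputs, which keeps the model traceable
model = AutoModelForSequenceClassification.from_pretrained(model_path, return_dict=False)
# Set up some example inputs, padded to the fixed length the compiled model will expect
docs = "change company address"
tensors = tokenizer.encode_plus(docs, max_length=512, padding='max_length', truncation=True, return_tensors="pt")
example_inputs = tensors['input_ids'], tensors['attention_mask'], tensors['token_type_ids']
# Compile for Inferentia, then save the TorchScript for later use
model_neuron = torch.neuron.trace(model, example_inputs, compiler_args=['-O2'], verbose=1, compiler_workdir='./compile')
model_neuron.save('intent_neuron.pt')
# Get the prediction result for the example input
classification_logits_neuron = model_neuron(*example_inputs)
logits = torch.nn.functional.softmax(classification_logits_neuron[0], dim=-1).detach().cpu().numpy()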
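The final step turns the raw logits into class probabilities. The sketch below reproduces that post-processing in plain Python so it can be checked without an Inf1 instance. The intent labels and logit values are hypothetical; in a real pipeline the index-to-label mapping would come from the model configuration (for example `model.config.id2label`).

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most likely intent label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical labels and logits, for illustration only
id2label = {0: "billing", 1: "change_address", 2: "password_reset"}
label, prob = predict_intent([0.2, 3.1, -1.0], id2label)
print(label, round(prob, 3))
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large logits, which is the same reason framework softmax implementations do it.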

Benchmarking

ssh -i [pem key file] ec2-user@[IP address of your Inf1 instance]
import os
import sys
import time
import numpy as np
import pandas as pd

# Monitoring loop of the benchmark script. num_infer, latency_list, live and args
# are shared state and command-line arguments defined elsewhere in the script (not shown).
pids = []
current_num_infers = []
throughputs = []
p50s = []
p90s = []
last_num_infer = num_infer
for _ in range(args.throughput_time // args.throughput_interval):
    current_num_infer = num_infer
    # inferences completed during the last sampling interval, per second
    throughput = (current_num_infer - last_num_infer) / args.throughput_interval
    p50 = 0.0
    p90 = 0.0
    if latency_list:
        # latency percentiles over the most recent window of requests
        p50 = np.percentile(latency_list[-args.latency_window_size:], 50)
        p90 = np.percentile(latency_list[-args.latency_window_size:], 90)
    pids.append(os.getpid())
    current_num_infers.append(current_num_infer)
    throughputs.append(throughput)
    p50s.append(p50)
    p90s.append(p90)
    print('pid {}: current infers {} throughput {:.3f}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), current_num_infer, throughput, p50, p90))
    sys.stdout.flush()
    last_num_infer = current_num_infer
    time.sleep(args.throughput_interval)
# signal that the benchmark run is over
global live
live = False
df_dump = pd.DataFrame({'pid': pids, 'current_num_infer': current_num_infers,
                        'throughput': throughputs, 'p50': p50s, 'p90': p90s})
df_dump.to_csv('benchmark_dump_neuron_v3.csv', index=False)
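The monitoring loop only samples shared state: the running count `num_infer` and the per-request `latency_list`. The code that fills them in is not part of the excerpt, so the following is a minimal sketch of one way worker threads could maintain that state. `fake_infer` is a hypothetical stand-in for `model_neuron(*example_inputs)` that just sleeps, and the thread count, interval and window size are illustrative assumptions. The `percentile` helper uses linear interpolation, which matches the default behaviour of `np.percentile`.

```python
import threading
import time
from typing import List, Tuple

# Hypothetical shared state mirroring the names the monitoring loop reads
num_infer = 0                    # total completed inferences
latency_list: List[float] = []   # per-request latency in seconds
live = True                      # workers keep running while this is True
lock = threading.Lock()


def fake_infer() -> None:
    """Hypothetical stand-in for model_neuron(*example_inputs)."""
    time.sleep(0.005)


def worker() -> None:
    global num_infer
    while live:
        start = time.perf_counter()
        fake_infer()
        elapsed = time.perf_counter() - start
        with lock:  # `num_infer += 1` is a read-modify-write, so guard it
            num_infer += 1
            latency_list.append(elapsed)


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation (numpy.percentile's default)."""
    if not values:
        return 0.0
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def run(num_threads: int = 4, interval: float = 0.5, window: int = 100) -> Tuple[float, float, float]:
    """Start workers, sample one interval, stop workers; return throughput, p50, p90."""
    global live
    live = True
    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    last = num_infer
    time.sleep(interval)
    with lock:
        current = num_infer
        recent = latency_list[-window:]
    live = False  # workers exit after finishing their current request
    for t in threads:
        t.join()
    throughput = (current - last) / interval  # inferences per second
    return throughput, percentile(recent, 50), percentile(recent, 90)


if __name__ == "__main__":
    tp, p50, p90 = run()
    print(f"throughput {tp:.1f}/s, p50={p50 * 1000:.2f} ms, p90={p90 * 1000:.2f} ms")
```

Throughput is computed from the difference between two counter readings rather than from the latencies. That way it reflects how many requests all workers finished concurrently, while the windowed percentiles describe what an individual request experienced.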

Running Similar Benchmark Steps for the Same Model Deployed on a G4 Instance
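Assuming the G4 run reuses the same monitoring loop, both runs produce CSV dumps with the same columns (`pid`, `current_num_infer`, `throughput`, `p50`, `p90`), so comparing them comes down to aggregating each file and taking a ratio. The sketch below uses only the standard library. The G4 file name in the usage comment is hypothetical, and a plain mean over all sampled rows is an assumption about how to summarize a run, not necessarily how the article's reported figures were computed.

```python
import csv
from statistics import mean
from typing import Dict, Iterable


def summarize(rows: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Average the sampled throughput and latency columns of one benchmark dump."""
    rows = list(rows)
    return {
        "throughput": mean(float(r["throughput"]) for r in rows),
        "p50": mean(float(r["p50"]) for r in rows),
        "p90": mean(float(r["p90"]) for r in rows),
    }


def load(path: str) -> Dict[str, float]:
    """Read one CSV dump written by the monitoring loop and summarize it."""
    with open(path, newline="") as f:
        return summarize(csv.DictReader(f))


def speedup(accel: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Throughput ratio of the accelerated run over the baseline run."""
    return accel["throughput"] / baseline["throughput"]


# Usage (the G4 file name is hypothetical):
# neuron = load("benchmark_dump_neuron_v3.csv")
# g4 = load("benchmark_dump_g4.csv")
# print(f"throughput speedup: {speedup(neuron, g4):.1f}x")
```

The first sampled rows of a run often show low throughput while the workers warm up, so it may be worth dropping them before averaging.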

Benchmark Result

Conclusion
