Improving Human-AI Interactions with More Accessible Deep Learning
Democratizing Chatbot Technology with Dedicated Deep Learning Features in CPUs
Artificial intelligence (AI) chatbot technology is becoming increasingly popular among businesses and organizations as a way to interact with customers and improve customer service. However, despite the advances in chatbot technology, several challenges still need to be addressed. Additionally, building, optimizing, and maintaining chatbots for specific use-cases is expensive and can be financially prohibitive for many organizations.
Luckily, 4th Generation Intel Xeon Scalable processors (4th Gen Xeon) offer improved data management and efficient computations through the Intel Advanced Matrix Extensions (AMX). Furthermore, when combined with the Auto Mixed Precision (AMP) functionality available through the Intel Extension for PyTorch, this technology stack becomes quite competitive for workloads like transfer learning and training small/medium-sized models from scratch (Figure 1).
This optimized stack can help AI chatbot technology in the customer service sector in several ways:
- Scaling Training Data: Chatbots can learn from larger, more complex data sets. This will help chatbots better understand user needs and intent over time.
- Multitasking: Chatbots can handle multiple tasks and intents simultaneously, allowing them to provide more efficient and personalized customer service.
- Faster Responses: Faster inference times enable chatbots to respond to queries more quickly, providing a better user experience and reducing customer wait times.
- Personalization: Chatbots can analyze large amounts of user data and generate personalized recommendations, offers, and services.
- Handling Complex Queries: Improved computing power can also help chatbots handle more complex queries, which require more computational resources to process the data and understand the intent behind the question.
What is AMX?
AMX is a built-in accelerator that optimizes deep learning training and inference workloads. The architecture consists of two components (Figure 2):
- Tiles — Eight two-dimensional registers, each one kilobyte in size, that store large chunks of data.
- Tile Matrix Multiplication (TMUL) — An accelerator engine attached to the tiles that performs matrix-multiply computations for AI.
What is AMP?
AMP mixes single-precision (32-bit) and half-precision (16-bit) floating-point representations. By keeping precision-sensitive operations in 32-bit and running the rest in 16-bit, the training phase of a model becomes quicker and consumes less memory.
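As a quick illustration (not part of the reference kit), casting a tensor to bfloat16 halves its memory footprint while keeping FP32's dynamic range:

import torch

x_fp32 = torch.randn(1024, 1024)                    # single precision: 4 bytes per element
x_bf16 = x_fp32.to(torch.bfloat16)                  # half precision: 2 bytes per element

print(x_fp32.element_size() * x_fp32.nelement())    # 4194304 bytes
print(x_bf16.element_size() * x_bf16.nelement())    # 2097152 bytes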
Transfer Learning a BERT Model for an Airline Chatbot
We will explore an AI system that understands the intent and the entities involved in a query, looks up the relevant information, and returns the appropriate response to the customer in a reasonable amount of time. We leverage the Intel Extension for PyTorch to fine-tune a foundational BERT model from the Hugging Face Transformers library, training and deploying a fast, accurate AI system that predicts the intent and entities of a user requesting information about airline travel.
The implementation in this article is based on the Customer Chatbot Intel AI Reference Kit. Visit the AI Reference Kit page to discover other open-source implementations of popular industry AI workloads.
Setting Up Your Environment
Feel free to use the conda environment .yml configuration below to set up your environment. You’ll need to install conda on your machine — if you’ve never set up conda, here is a helpful tutorial.
name: chatbot
channels:
  - pytorch
  - intel
dependencies:
  - intel::intelpython3_core
  - python=3.9
  - pip
  - pytorch::pytorch==1.11.0
  - cpuonly
  - conda-forge::scikit-learn
  - pip:
    - intel_extension_for_pytorch==1.11.200
    - psutil
    - transformers
    - torchserve
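Assuming you save the configuration above as chatbot.yml (any filename works), you can create and activate the environment with:

conda env create -f chatbot.yml
conda activate chatbot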
Airline Travel Information Systems (ATIS) Dataset
This demo will use the Airline Travel Information Systems (ATIS) dataset. The dataset consists of ~5000 queries of customer requests for flight-related details. Each of these queries is annotated with the intent and the entities involved within the query. For example, the phrase
I want to fly from Orlando to Houston round trip.
would be classified with the intent atis_flight, corresponding to a flight reservation, and the entities would be Orlando (fromloc.city_name), Houston (toloc.city_name), and round_trip (round_trip).
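To make the labeling scheme concrete, here is a hypothetical sketch (the field names are illustrative, not the exact ATIS file format) of how that annotated query could be represented:

example = {
    "text": "I want to fly from Orlando to Houston round trip",
    "intent": "atis_flight",                       # sequence-level (intent) label
    "tokens": ["I", "want", "to", "fly", "from", "Orlando",
               "to", "Houston", "round", "trip"],
    "slots":  ["O", "O", "O", "O", "O", "B-fromloc.city_name",
               "O", "B-toloc.city_name", "B-round_trip", "I-round_trip"],
}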
Using the Hugging Face Uncased BERT Base Model
BERT is a Transformers model pretrained in a self-supervised fashion on a large English corpus. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts.
BERT was released initially in base and large variations for cased and uncased input text. The uncased models also strip out accent markers.
The “bert-base-uncased” (Figure 3) model is smaller with ~110M learned parameters. It’s important to pick the right model for your use-case because bigger is not always better. Models with more trained parameters will be inherently more data-hungry. We start by loading a pretrained version of BERT from Hugging Face.
self.bert = BertModel.from_pretrained("bert-base-uncased")
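As an aside, queries need to be tokenized with the matching uncased tokenizer before they reach the model. The reference kit handles this in its data pipeline, but a minimal example looks like this:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Produces input_ids, token_type_ids, and attention_mask tensors for the model
encoding = tokenizer("I want to fly from Orlando to Houston round trip.", return_tensors="pt")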
We add two custom linear classification heads to our base model architecture to adapt it to our intent and entity classification tasks. They return logits for the token (entity) and sequence (intent) labels.
self.token_classifier = torch.nn.Linear(768, self.num_token_labels)
self.sequence_classifier = torch.nn.Linear(768, self.num_sequence_labels)
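Putting these pieces together, a minimal sketch of the model class might look like the following (the forward signature and the use of pooler_output are my assumptions; see the reference kit for the exact implementation):

import torch
from transformers import BertModel

class IntentAndTokenClassifier(torch.nn.Module):
    def __init__(self, num_token_labels, num_sequence_labels):
        super().__init__()
        self.num_token_labels = num_token_labels
        self.num_sequence_labels = num_sequence_labels
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Entity (token-level) and intent (sequence-level) heads on the 768-dim BERT outputs
        self.token_classifier = torch.nn.Linear(768, self.num_token_labels)
        self.sequence_classifier = torch.nn.Linear(768, self.num_sequence_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        token_logits = self.token_classifier(outputs.last_hidden_state)    # one label per token
        sequence_logits = self.sequence_classifier(outputs.pooler_output)  # one label per query
        return token_logits, sequence_logits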
Transfer Learning with Intel Extension for PyTorch, AMX, and AMP
First, check whether your machine has AMX enabled by running lscpu | grep amx in your terminal.
The response should include “amx” in a bold color (Figure 4). If it’s missing, it may not be supported on your processor or it may not be enabled. If the latter, ask your sysadmin to enable it.
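If you prefer checking from Python, a small helper (assuming Linux, where CPU flags are exposed in /proc/cpuinfo) can look for the AMX flags directly:

def has_amx(cpuinfo_path="/proc/cpuinfo"):
    # AMX appears as the amx_tile / amx_bf16 / amx_int8 flags on Linux
    with open(cpuinfo_path) as f:
        flags = f.read()
    return "amx_tile" in flags and "amx_bf16" in flags

print("AMX available:", has_amx())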
At the time of writing this article, you can access 4th Gen Xeon nodes on the Intel Developer Cloud Beta or request preview access on AWS.
Once you have access to the right compute resources and features, you only need two lines of code to enable AMP with Intel Extension for PyTorch:
- The first line of code is model, optimizer = ipex.optimize(model, optimizer, dtype=torch.bfloat16). It optimizes the model and optimizer objects and enables bfloat16 half-precision training.
- The second line of code is with torch.cpu.amp.autocast():, which wraps the forward and backward pass portion of our model training function. It enables AMP and tells the model that it's okay to train with both FP32 and bfloat16 data types.
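As a rough, simplified stand-in for the reference kit's train function (its real signature and internals differ, and the batch keys and combined loss here are illustrative), the two lines might sit inside a training loop like this:

import torch
import intel_extension_for_pytorch as ipex

def train(loader, model, optimizer, loss_fn, epochs=3):
    # Line 1: enable IPEX/AMX-aware optimizations and bfloat16 half precision
    model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            # Line 2: run the forward pass and loss computation under AMP autocasting
            with torch.cpu.amp.autocast():
                token_logits, sequence_logits = model(batch["input_ids"],
                                                      attention_mask=batch["attention_mask"])
                loss = loss_fn(token_logits, sequence_logits, batch)
            loss.backward()
            optimizer.step()
    return model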
Next, we will instantiate our model object IntentAndTokenClassifier and feed it into our train function.
# Create a model and prepare for training
model_bf16_wAMX = IntentAndTokenClassifier(
    num_token_labels=len(dataset['train'].tag2id),
    num_sequence_labels=len(dataset['train'].class2id))

# Train the model
print("Training the model...")
model_bf16_wAMX_trained = train(train_loader, model_bf16_wAMX,
    epochs=EPOCHS, max_grad_norm=MAX_GRAD_NORM, amx=True, dataType='bf16')
Once our model is finished training, we can pass it to our evaluation function evaluate_accuracy.
# Evaluate accuracy on the test set
accuracy_ner_bf16_wAMX, accuracy_class_bf16_wAMX = evaluate_accuracy(
    test_loader, model_bf16_wAMX_trained)
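For reference, here is a simplified sketch of what an accuracy evaluation like evaluate_accuracy might compute (the batch keys are assumptions and the actual function in the repository may differ):

import torch

def evaluate_accuracy(loader, model):
    model.eval()
    token_correct = token_total = seq_correct = seq_total = 0
    with torch.no_grad():
        for batch in loader:
            token_logits, sequence_logits = model(batch["input_ids"],
                                                  attention_mask=batch["attention_mask"])
            # Entity (token-level) accuracy, ignoring padded positions
            token_preds = token_logits.argmax(dim=-1)
            mask = batch["attention_mask"].bool()
            token_correct += (token_preds[mask] == batch["token_labels"][mask]).sum().item()
            token_total += mask.sum().item()
            # Intent (sequence-level) accuracy
            seq_preds = sequence_logits.argmax(dim=-1)
            seq_correct += (seq_preds == batch["sequence_labels"]).sum().item()
            seq_total += len(seq_preds)
    return token_correct / token_total, seq_correct / seq_total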
I encourage you to clone this tutorial’s GitHub repository and try the solution out for yourself. If you want to run it on similar hardware, visit the Intel Developer Cloud page and sign up for an account.
Summary and Discussion
AMX and AMP half-precision training with bfloat16 are powerful techniques that make CPUs a competitive choice for fine-tuning large language models. By leveraging optimized software like Intel Extension for PyTorch in tandem with 4th Gen Xeon processors, developers can achieve faster training times with minimal loss of accuracy, all while reducing the cost of hardware infrastructure.
This has significant implications for the chatbot industry, as CPUs can now be considered a viable alternative to GPUs for chatbot training and deployment. Furthermore, with the availability of AMX and AMP, developers can take advantage of CPU strengths such as larger memory capacity, higher core counts, and TMUL engines for matrix computations.
In short, these techniques have the potential to revolutionize the chatbot industry, making it easier and more cost-effective for teams of all sizes to develop and deploy chatbots that deliver exceptional customer experiences.
Don’t forget to follow my profile for more articles like this!