
Simplify Your Custom Chatbot Deployment

Deploy Your Chatbot within Minutes on Intel Platforms

Intel(R) Neural Compressor
Intel Analytics Software
5 min read · Jun 29, 2023


Liang Lv, Wenjiao Yue, Haihao Shen, and Huma Abidi, Intel Corporation

In our previous blog, Create Your Own Custom Chatbot, we described how to do parameter-efficient fine-tuning on Intel processors. In this blog, we are happy to introduce Neural Chat, an end-to-end solution that simplifies everything from chatbot creation to deployment. With Neural Chat, you can deploy your custom chatbot on Intel platforms within minutes and then open it to the public!

Neural Chat

Before diving into the Neural Chat components, let's take a quick look at how to deploy a chatbot. There are three steps:

  1. Deploy the backend on AWS using Docker
  2. Create the frontend with Hugging Face Spaces
  3. Deploy the chatbot by incorporating the backend URL into app.py

Now, let’s look at the key components of the Neural Chat architecture as shown below:

In our previous blog, we leveraged the parameter-efficient fine-tuning (PEFT) library from Hugging Face and used LoRA for model fine-tuning. We recently added an instruction template and provided scripts to generate the instruction samples. We use Intel Neural Compressor to quantize the model to 8-bit integers or prune it for sparsity. We also use Intel Extension for PyTorch or the default lightweight runtime in Intel Extension for Transformers to further optimize the model. Finally, we provide a Docker-based software package that lets you easily set up the environment for both the backend and the frontend (e.g., hosting a live demo on Hugging Face Spaces or your personal website).
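To make the fine-tuning piece concrete, here is a minimal LoRA setup sketch using the Hugging Face PEFT library. The base model name and the hyperparameters are illustrative placeholders, not the exact values used in Neural Chat:

    # Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative only).
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_name = "your-org/your-base-llm"  # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # LoRA injects small trainable low-rank matrices into selected layers,
    # so only a tiny fraction of the parameters is updated during fine-tuning.
    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank updates
        lora_alpha=16,                        # scaling factor
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # confirm only the adapters are trainable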

The source code is available in Intel Extension for Transformers, empowering developers to easily create and deploy their own chatbots on Intel platforms.

Neural Chat Deployment

Once the chatbot is ready, it’s time to deploy it for public use. Let’s go through the steps below.

Optimize the Model Performance

To improve the chat experience, we first accelerate inference using the Intel oneAPI AI Analytics Toolkit (AI Kit). The example below shows how to use Intel Extension for PyTorch to enable BF16 automatic mixed-precision inference, which leverages the Intel Advanced Matrix Extensions (AMX) on 4th Gen Intel Xeon Scalable Processors.
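A minimal sketch of this flow, assuming a fine-tuned model saved locally (the model path and prompt are placeholders):

    # BF16 automatic mixed-precision inference with Intel Extension for PyTorch.
    # On 4th Gen Xeon, eligible BF16 operations are executed by the AMX units.
    import torch
    import intel_extension_for_pytorch as ipex
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "./your-finetuned-model"  # placeholder path
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model.eval()

    # Apply IPEX operator optimizations for bfloat16 inference.
    model = ipex.optimize(model, dtype=torch.bfloat16)

    prompt = "Tell me about Intel Xeon Scalable processors."
    inputs = tokenizer(prompt, return_tensors="pt")

    # Run generation under CPU autocast so eligible ops run in BF16.
    with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
        outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))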

We may also enable INT8 inference using Intel Neural Compressor or leverage the default runtime of Intel Extension for Transformers to further improve performance. To test the model, follow the instructions in the README.
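For the INT8 path, a hedged post-training quantization sketch using the Intel Neural Compressor 2.x API looks like the following; the calibration dataloader is an assumption you would supply from your own data:

    # Post-training static INT8 quantization with Intel Neural Compressor.
    from neural_compressor import PostTrainingQuantConfig, quantization

    conf = PostTrainingQuantConfig(approach="static")  # static PTQ with calibration
    q_model = quantization.fit(
        model=model,                        # the FP32 PyTorch model from above
        conf=conf,
        calib_dataloader=calib_dataloader,  # placeholder: representative inputs
    )
    q_model.save("./int8_model")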

Deploy the Model

To interact with the chatbot, we need to launch the model on a backend server with the necessary configurations. We select an Amazon EC2 instance (r7iz.metal-16xl) powered by 4th Gen Intel Xeon Scalable Processors and enable the network configurations as follows (see the sketch after this list):

  • Enable a public IP address for the EC2 instance
  • Open port 22 to allow SSH access for setup and debugging
  • Open TCP port 80 for HTTP or port 443 for HTTPS service
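For illustration, the inbound rules above could be opened with the AWS CLI as follows; the security-group ID is a placeholder:

    # Illustrative AWS CLI calls for the inbound rules above (placeholder group ID).
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 22 --cidr 0.0.0.0/0   # SSH
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 80 --cidr 0.0.0.0/0   # HTTP
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 443 --cidr 0.0.0.0/0  # HTTPS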

Once the EC2 instance is up, you can connect using SSH to set up the environment. You can use the provided Dockerfile to build a Docker image and launch the model. (See the README for more details.) Once environment setup is complete, just execute the commands below to launch the chatbot server:
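The exact commands are in the README; as a rough sketch, with the image tag and port mapping as placeholders, they take this shape:

    # Build the serving image from the provided Dockerfile (placeholder tag).
    docker build -f Dockerfile -t neuralchat:latest .
    # Launch the container, exposing the HTTP port the frontend will call.
    docker run -d -p 80:80 neuralchat:latest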

When the backend is deployed and running, a corresponding URL like http://xx.xx.xx.xx:80 is exposed to listen for requests from the frontend. Note that if you want to deploy on port 443 with HTTPS service, additional steps are required as described here.

Create the Frontend on Hugging Face Spaces

Hugging Face Spaces makes amazing ML applications more accessible to the community. Inspired by this, we created a new Space to host the custom chatbot:

A new Space works like a new project with GitHub-style code repository management. We recommend using Gradio as the Space SDK and keeping the default values for the other settings. We also recommend starting from the frontend code released in Intel Extension for Transformers, which is the reference implementation we deployed on Hugging Face Spaces.

Deploy the Chatbot

We can now connect the backend and frontend by supplying the backend URL to app.py. This allows the frontend to access the deployed backend directly through HTTP requests.
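A minimal sketch of such an app.py, assuming the backend accepts a JSON prompt and returns a JSON response (the URL and field names are placeholders; the released frontend code is the reference implementation):

    # Minimal Gradio frontend that forwards prompts to the deployed backend.
    import requests
    import gradio as gr

    BACKEND_URL = "http://xx.xx.xx.xx:80"  # placeholder: your EC2 backend URL

    def chat(prompt):
        # Send the user prompt to the backend and return the generated text.
        # The JSON field names here are assumptions for illustration.
        response = requests.post(BACKEND_URL, json={"prompt": prompt}, timeout=60)
        return response.json().get("response", "")

    demo = gr.Interface(fn=chat, inputs="text", outputs="text", title="Neural Chat")
    demo.launch()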

Note that for each change made to the frontend, Hugging Face Spaces rebuilds the code and updates the frontend. The status "Running" indicates that the frontend is ready for use. If an error occurs, refer to the "Logs" tab for troubleshooting.

Neural Chat Demo

If everything goes well, the chatbot demo has been deployed, as shown below. Let’s start conversing with the chatbot. The chatbot allows users to submit a prompt and shows the response and elapsed time:

The chatbot provides additional settings such as "Max output tokens" for longer responses and "TOP K" for more creative ones, allowing users to customize the chatbot's behavior to their preferences. The chatbot also supports "Regenerate" to produce a new response to the previous prompt, and "Clear history" to reset the conversation and start fresh without any prior context.
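Settings like these typically map onto standard Hugging Face generation arguments; an illustrative mapping (not the actual frontend code):

    # How "Max output tokens" and "TOP K" typically map to generation arguments.
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,  # "Max output tokens": cap on response length
        top_k=40,            # "TOP K": sample from the 40 most likely tokens
        do_sample=True,      # enable sampling so top_k takes effect
    )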

That’s it! Now you can start creating your own chatbot and enjoy an interactive and personalized chat experience.

Summary and Future Work

We are committed to enabling advanced features such as document retrieval and query caching to improve the chat experience with better responses and lower latency. In addition, we plan to integrate text-to-image models (e.g., Stable Diffusion) to enrich the chat modalities. For more details, please visit Intel Extension for Transformers. Learn more about, and download, the Intel AI software components used in the Neural Chat architecture. We encourage you to create and deploy your own chatbot, and feel free to contact us if you have any questions.
