NeuralChat: A Customizable Chatbot Framework

Creating Your Own Chatbot in Just a Few Minutes

Intel(R) Neural Compressor
Intel Analytics Software
Sep 7, 2023


Liang Lv, Xuhui Ren, Xinyu Ye, Kaokao Lv, Qun Gao, Feng Tian, and Haihao Shen, Intel Corporation

NeuralChat, a customizable chatbot framework in Intel Extension for Transformers, provides an easy-to-use API for quickly building a chatbot on multiple architectures (e.g., Intel Xeon Scalable Processors and Habana Gaudi Accelerator). NeuralChat is built on top of large language models (LLMs) and supports fine-tuning, optimization, and inference. It also offers a rich set of plugins that make chatbots smarter with knowledge retrieval, more interactive through speech, faster through query caching, and more secure with guardrails.

Figure: NeuralChat components

Getting Started

NeuralChat is installed as a component of Intel Extension for Transformers. Run the following command to install it:

pip install intel-extension-for-transformers

NeuralChat provides an easy-to-use Python API for quickly creating a chatbot, e.g.:

# Create a chatbot locally with the default pipeline configuration
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
config = PipelineConfig()
chatbot = build_chatbot(config)
# Use the chatbot to generate a response
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)

You can deploy NeuralChat as a service, e.g.:

neuralchat_server start --config_file ./server/config/neuralchat.yaml

NeuralChat provides a default chatbot configuration in neuralchat.yaml. You can customize the chatbot's behavior by editing the fields in this file that specify which LLM and which plugins to use.

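Once NeuralChat is running as a service, users can send requests to it and receive responses via curl. The server address is not given here; replace <server_url> with the URL of your deployment's endpoint:

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Tell me about Intel Xeon Scalable Processors."}' <server_url>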
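For scripted access, the same request can be issued from Python using only the standard library. This is a minimal sketch: the article does not give the server's URL, so SERVER_URL below is a hypothetical placeholder, and the response is returned as raw text because its format is not described here. The helper names build_chat_request and send_chat_request are illustrative, not part of NeuralChat.

```python
import json
import urllib.request

# Hypothetical address: replace with the host, port, and path
# that your NeuralChat server actually exposes.
SERVER_URL = "http://localhost:8000/your-endpoint"


def build_chat_request(prompt: str, url: str = SERVER_URL) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl example."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    req = build_chat_request("Tell me about Intel Xeon Scalable Processors.")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # print(send_chat_request(req))  # requires a running server
```

Building the request separately from sending it means the payload can be inspected or tested without a running server.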

Plugins to Augment the Chatbot

NeuralChat provides plugins that offer a rich set of LLM utilities and features to augment the chatbot’s capability. These plugins are applied in the chatbot pipeline for inference:

  • Knowledge retrieval combines document indexing for efficient retrieval of relevant information (dense indexing based on LangChain and sparse indexing based on fastRAG) with document rankers that prioritize the most relevant responses.
  • Query caching provides a fast path that returns a response without LLM inference, reducing chat response time.
  • Prompt optimization supports automatic prompt engineering to improve user prompts.
  • Memory controller enables efficient memory utilization.
  • Safety checker checks the chatbot's inputs and outputs for sensitive content.
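To make sparse indexing and ranking concrete, here is a small BM25 ranker in plain Python. BM25 is a standard sparse-retrieval scoring function; this sketch only illustrates the general technique of scoring documents by weighted term overlap and returning the best matches. It is not fastRAG's or NeuralChat's implementation, and the names SparseIndex and top_k are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SparseIndex:
    """Tiny BM25 ranker over an in-memory document list (generic sketch)."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(toks) for toks in self.doc_tokens]
        self.avgdl = sum(len(t) for t in self.doc_tokens) / len(docs)
        # Document frequency: how many documents contain each term.
        self.df = Counter(term for tf in self.doc_tf for term in tf)

    def idf(self, term: str) -> float:
        n, df = len(self.docs), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query: str, i: int) -> float:
        """BM25 score of document i for the query."""
        tf, dl = self.doc_tf[i], len(self.doc_tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 2) -> list[str]:
        """Return the k highest-scoring documents, best first."""
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return [self.docs[i] for i in ranked[:k]]


if __name__ == "__main__":
    docs = [
        "Intel Xeon Scalable Processors power data center servers.",
        "Habana Gaudi is a deep learning training accelerator.",
        "Query caching returns stored answers without running the LLM.",
    ]
    index = SparseIndex(docs)
    print(index.top_k("What are Xeon processors?", k=1))
```

In a retrieval-augmented chatbot, the top-ranked passages would typically be added to the prompt before LLM inference.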

You can enable or disable a plugin, or provide your own custom plugin, as follows:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, plugins
# Turn on the knowledge retrieval plugin
plugins.retrieval.enable = True
conf = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(conf)
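The plugin descriptions above suggest how plugins can wrap inference: some act before the LLM runs (input safety checks, cache lookups) and some after (output safety checks, storing answers in the cache). The toy pipeline below sketches that flow under stated assumptions: ToyPipeline, its enable_* flags, and the keyword-based safety check are hypothetical stand-ins rather than NeuralChat's plugin interface, and a plain Python function plays the role of the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyPipeline:
    """Hypothetical chatbot pipeline in which plugins wrap one LLM call."""

    llm: Callable[[str], str]
    enable_cache: bool = True
    enable_safety: bool = True
    blocked_words: set[str] = field(default_factory=lambda: {"password"})
    cache: dict[str, str] = field(default_factory=dict)
    llm_calls: int = 0

    def _normalize(self, prompt: str) -> str:
        # Collapse case and whitespace so trivially different queries share a cache entry.
        return " ".join(prompt.lower().split())

    def _is_safe(self, text: str) -> bool:
        # Naive keyword check standing in for a real safety checker.
        return not any(w in text.lower() for w in self.blocked_words)

    def predict(self, prompt: str) -> str:
        # Input guardrail: refuse before any other work is done.
        if self.enable_safety and not self._is_safe(prompt):
            return "Sorry, I can't help with that request."
        key = self._normalize(prompt)
        # Fast path: answer from the cache without LLM inference.
        if self.enable_cache and key in self.cache:
            return self.cache[key]
        self.llm_calls += 1
        answer = self.llm(prompt)
        # Output guardrail: withhold unsafe generations.
        if self.enable_safety and not self._is_safe(answer):
            answer = "[response withheld by safety check]"
        if self.enable_cache:
            self.cache[key] = answer
        return answer


if __name__ == "__main__":
    bot = ToyPipeline(llm=lambda p: f"Echo: {p}")
    print(bot.predict("Tell me about Xeon."))
    print(bot.predict("  tell me about XEON. "))  # served from the cache
    print("LLM calls:", bot.llm_calls)
```

Normalizing case and whitespace lets near-identical questions hit the same cache entry; a production cache might instead match on embedding similarity, which this sketch does not attempt.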

Fine-Tuning a Chatbot

NeuralChat supports fine-tuning a pretrained LLM for text generation, summarization, and code generation tasks, and it can even fine-tune TTS models:

from intel_extension_for_transformers.neural_chat import finetune_model, TextGenerationFinetuningConfig
# Text generation config; other fine-tuning configs are supported for other tasks
finetune_cfg = TextGenerationFinetuningConfig()
finetuned_model = finetune_model(finetune_cfg)

This lets users customize models by fine-tuning them on their own proprietary datasets.

Optimizing a Chatbot

NeuralChat provides several model optimization technologies, such as advanced mixed precision (AMP) and weight-only quantization, that let users optimize chatbot inference:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, AMPConfig
# Build a chatbot whose LLM is optimized with AMP
pipeline_cfg = PipelineConfig(optimization_config=AMPConfig())
chatbot = build_chatbot(pipeline_cfg)

With this configuration, the pretrained LLM is optimized on the fly to speed up inference.
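Weight-only quantization stores a model's weights in a low-bit integer format while keeping activations in floating point. The sketch below shows the basic idea with symmetric, per-output-channel, round-to-nearest quantization in plain Python. It is a generic illustration under those assumptions; NeuralChat's weight-only quantization may use different bit widths, group sizes, or algorithms, and the function names here are hypothetical.

```python
def quantize_per_channel(weights: list[list[float]], bits: int = 4):
    """Symmetric per-row (output-channel) quantization of a weight matrix."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer level and clamp to [-qmax, qmax].
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def matvec_weight_only(q_rows, scales, x):
    """Matrix-vector product with integer weights dequantized row by row."""
    return [scale * sum(q * xi for q, xi in zip(row, x))
            for row, scale in zip(q_rows, scales)]


def matvec(weights, x):
    """Full-precision reference matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


if __name__ == "__main__":
    W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.7]]
    x = [1.0, 2.0, 3.0]
    q, s = quantize_per_channel(W, bits=4)
    print("int weights:", q)
    print("exact:      ", matvec(W, x))
    print("weight-only:", matvec_weight_only(q, s, x))
```

Each dequantized weight is off by at most half a scale step, so the error in each output is bounded by scale / 2 times the sum of the absolute input values; fewer bits mean a larger scale, and thus larger error, in exchange for less memory.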

Concluding Remarks

NeuralChat is now available for you to quickly create your own chatbot on multiple architectures. With plugins, you can also enhance your chatbot’s intelligence through knowledge retrieval, make it more interactive through speech, and make it faster through query caching.

We encourage you to try creating your own chatbot. You can submit pull requests, issues, or questions to

