From Training to Model Deployment: Harnessing Intel oneAPI’s Potential

Ramashish Gupta
10 min read · Oct 29, 2023

Hi reader, I am Ramashish Gupta, a 4th-year undergrad from IIT Kharagpur, presenting my Intel oneAPI LLM Challenge solution. Generative Question-Answer (Q&A) models, powered by advanced deep learning techniques, have revolutionized the landscape of natural language understanding and information retrieval. These models generate human-like text responses to questions and find application across diverse domains.

Extractive QA model usage pipeline

The whole workflow can be understood from the above diagram: the Query is the question, the Documents are the context, and the LLM is the model that we are going to train.

Below, we outline various use cases and requirements for these generative Q&A models, highlighting their role in semantic search applications and beyond:

> Customer Support Chatbots: Businesses leverage generative Q&A models to create sophisticated chatbots capable of engaging with customers in a human-like manner. These chatbots understand and respond to a wide array of customer queries.

> Educational Tools: In educational settings, generative Q&A models serve as virtual tutors. Students pose questions related to their coursework, and the models provide detailed explanations and solutions, enhancing the learning experience.

> Semantic Search: Generative Q&A models are instrumental in semantic search applications, going beyond traditional keyword-based search engines. They comprehend the meaning behind user queries and retrieve information based on semantic relevance.

> Domain-Specific Expertise: By fine-tuning generative Q&A models with domain-specific data, they can serve as experts in various fields. For example, a healthcare Q&A model can offer medical advice, and a legal Q&A model can assist with legal inquiries.

Generative Q&A models continue to evolve, offering versatile and valuable solutions across a spectrum of applications where understanding and generating human-like text responses to questions is paramount.

Training generative Q&A models is an immensely complex task, primarily due to the demanding infrastructure requirements, necessitating powerful compute resources and substantial datasets.

Introducing Intel oneAPI

oneAPI is an open, standards-based programming model that frees developers to use a single code base across multiple architectures — CPU, GPU, FPGA, and other accelerators. The result is quicker computation without vendor lock-in.

The Intel oneAPI toolkits are a comprehensive suite of high-performance tools designed for creating Data Parallel C++ applications and oneAPI library-based applications. These toolkits cater to specific domains, with the Intel oneAPI Base Toolkit serving as the foundation for all others.

You have a choice of seven meticulously curated toolkits, each catering to a unique user base and offering top-tier features. In this blog, we’ll delve into the Intel AI Analytics Toolkit, exploring how to train a generative Question Answering model and subsequently use it to create a web application.

Intel® AI Analytics Toolkit

The AI Kit equips data scientists, AI developers, and researchers with familiar Python tools and frameworks to expedite end-to-end data science and analytics on Intel® architecture. Leveraging oneAPI libraries for low-level compute optimizations, it improves performance across data preprocessing and machine learning, and enables efficient model development through enhanced interoperability.

With this toolkit, you can:

> Achieve high-performance deep learning training on Intel® XPUs, seamlessly integrating fast inference into your AI workflow. Utilize Intel®-optimized deep learning frameworks for TensorFlow* and PyTorch*, including pretrained models and low-precision tools.

> Experience effortless acceleration for data preprocessing and machine learning workflows using compute-intensive Python packages like Modin*, scikit-learn*, and XGBoost.
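As a concrete illustration of the scikit-learn acceleration mentioned in the last point, the toolkit ships the Intel Extension for Scikit-learn, which can patch scikit-learn estimators in place. A minimal sketch (the estimator and data here are just placeholders):

# Patch scikit-learn with the Intel Extension for Scikit-learn, then use estimators as usual.
from sklearnex import patch_sklearn
patch_sklearn()  # re-routes supported estimators to Intel-optimized implementations

import numpy as np
from sklearn.cluster import KMeans  # import after patching

X = np.random.rand(10_000, 20)                     # placeholder data
labels = KMeans(n_clusters=8).fit_predict(X)       # runs on the accelerated backend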

The Dataset

The dataset we’re working with is designed for training a question-answering model. Each sample consists of a substantial context and a question related to that context, for which the model must generate an answer.

Here’s an example from the dataset:

CONTEXT: Malawi (, or ; or [maláwi]), officially the Republic of Malawi, is a landlocked country in southeast Africa that was formerly known as Nyasaland. It is bordered by Zambia to the northwest, Tanzania to the northeast, and Mozambique on the east, south, and west. Malawi is over with an estimated population of 16,777,547 (July 2013 est.). Its capital is Lilongwe, which is also Malawi’s largest city; the second largest is Blantyre, the third is Mzuzu, and the fourth largest is its old capital Zomba. The name Malawi comes from the Maravi, an old name …

QUESTION: Is it a large country?

ANSWER: No

This dataset is not your typical extractive question-answering dataset, where the answer can be found verbatim within the context. In fact, upon analysis, it was discovered that 35% of the answers were not present in the context. Hence, traditional encoder models that locate start and end indices of answer text won’t suffice. Instead, a generative question-answering model, employing an encoder-decoder architecture, is required for this unique dataset. Also, training a separate model just for yes/no and true/false questions would overcomplicate the pipeline.
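To give a concrete sense of how such samples are fed to an encoder-decoder model, here is a minimal preprocessing sketch. It assumes a Hugging Face tokenizer, the common T5 convention of packing the question and context into one input string, and sample field names of question/context/answer; the prompt format used in the actual training script may differ.

# Illustrative preprocessing sketch (checkpoint name, field names, and prompt format are assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # placeholder checkpoint

def preprocess(sample):
    # Pack question and context into a single source string for the encoder.
    source = f"question: {sample['question']}  context: {sample['context']}"
    model_inputs = tokenizer(source, max_length=512, truncation=True)
    # The free-form answer becomes the decoder target.
    labels = tokenizer(sample["answer"], max_length=32, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs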

Fine Tuning the model

In our journey to fine-tune the model using transfer learning, we zeroed in on the T5 architecture, a formidable encoder-decoder model that had been fine-tuned on the SQuADv2 dataset. We also delved into several other models, including BART and GPT-2. Through rigorous experimentation, we discovered that the T5 model consistently outperformed the others on our validation dataset.

Fine tuning a pre-trained model using transfer learning

To elevate the training process to its zenith, I harnessed the capabilities of the Intel Developer Cloud. Within this ecosystem, I seamlessly integrated two powerful oneAPI offerings, PyTorch* Optimizations from Intel and the Intel® Extension for PyTorch*, which synergized to unlock the full potential of Intel hardware, ensuring the most efficient training environment possible.

About PyTorch* Optimizations from Intel

Intel stands as a significant contributor to PyTorch*, consistently delivering essential optimizations that enhance the performance of PyTorch on Intel architectures. The AI Kit offers the most current binary version of PyTorch, rigorously tested for compatibility with the entire kit. Additionally, it incorporates the Intel® Extension for PyTorch*, introducing the latest Intel optimizations and user-friendly features to further elevate your PyTorch experience. With a few lines of code, you can use the Intel Extension for PyTorch to take advantage of the most up-to-date Intel software and hardware optimizations, automatically mix precision data types to reduce model size and computational workload for inference, and add your own performance customizations through its APIs.
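Those "few lines of code" typically look like the following. This is a minimal sketch of wrapping an existing model and optimizer with the Intel Extension for PyTorch; the model, optimizer, and batch are placeholders that are assumed to already exist.

# Minimal sketch: hand an existing model/optimizer to Intel Extension for PyTorch.
import torch
import intel_extension_for_pytorch as ipex

model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# Training then proceeds as usual; autocast enables bfloat16 mixed precision on CPU.
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    loss = model(**batch).loss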

About Intel® Extension for PyTorch*

The Intel® Extension for PyTorch* elevates PyTorch* by infusing it with the latest feature enhancements and optimizations, unlocking superior performance on Intel hardware. These optimizations harness the capabilities of AVX-512 Vector Neural Network Instructions and Intel® Advanced Matrix Extensions on Intel CPUs, as well as Intel Xe Matrix Extensions AI engines on Intel discrete GPUs. Additionally, with the PyTorch* xpu device, Intel® Extension for PyTorch* facilitates effortless GPU acceleration for Intel discrete GPUs, seamlessly integrating them with PyTorch* for enhanced performance.
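For Intel discrete GPUs, the same extension registers an xpu device, so moving a model onto the GPU is essentially a one-line change. A tiny sketch, assuming an XPU-enabled build of the extension and an existing model and batch:

# Sketch: use the "xpu" device exposed by Intel Extension for PyTorch (XPU build assumed).
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

device = "xpu" if torch.xpu.is_available() else "cpu"
model = model.to(device)                                # model assumed to exist
batch = {k: v.to(device) for k, v in batch.items()}     # move the input tensors too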

For fine-tuning, we picked a T5 checkpoint, an encoder-decoder model already fine-tuned on the SQuADv2 dataset. To start the training, first SSH into your Intel Developer Cloud instance and run:

sycl-ls                                   # list the available SYCL devices
srun --pty bash                           # get an interactive compute node
source /opt/intel/oneapi/setvars.sh       # set up the oneAPI environment
git clone https://github.com/ramashisx/oneAPI_hackathon_submission
cd oneAPI_hackathon_submission/scripts    # enter the cloned repository's scripts folder
cat train.py                              # inspect the training script
python train.py                           # start fine-tuning

Now you are good to go. As you can see in train.py (via `cat train.py`), the training pipeline uses both PyTorch Optimizations from Intel and the Intel Extension for PyTorch to optimize the model and get the most out of your Intel hardware during training.
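For readers who do not want to open the repository right away, the rough shape of such a training script is sketched below. This is illustrative only, not the actual train.py; the checkpoint name, hyperparameters, and dataset are placeholders.

# Illustrative sketch only -- not the actual train.py from the repository.
import torch
import intel_extension_for_pytorch as ipex
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, AutoTokenizer, DataCollatorForSeq2Seq

model = T5ForConditionalGeneration.from_pretrained("t5-base")   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

collator = DataCollatorForSeq2Seq(tokenizer, model=model)
# train_dataset is assumed to yield tokenized samples (see the preprocessing sketch above).
loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collator)

for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        with torch.cpu.amp.autocast(dtype=torch.bfloat16):
            loss = model(**batch).loss
        loss.backward()
        optimizer.step()

model.save_pretrained("./finetuned-t5-qa")
tokenizer.save_pretrained("./finetuned-t5-qa")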

Model being fine-tuned in Intel Developer Cloud

After the training is complete, the model gets saved in your selected folder.

Drum rolls . . .

Results Time

The model I trained managed to attain a leaderboard score of 0.376, ultimately securing first place in Phase 1. However, it’s crucial to highlight that the model’s actual capabilities surpassed this score: the evaluation script considered exact matches in a case-sensitive manner, which adversely affected the accuracy metric. Upon carefully exploring the data, I also found that the casing of the answers didn’t follow a definite pattern, so the model never learnt to reproduce the exact casing.
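To make the casing issue concrete, the gap between a strict, case-sensitive exact match and a normalized one looks roughly like this (a small sketch, not the official evaluation script):

# Illustrative comparison of strict vs. normalized exact match (not the official scorer).
def exact_match(pred: str, gold: str) -> int:
    return int(pred == gold)                               # case-sensitive, as on the leaderboard

def normalized_exact_match(pred: str, gold: str) -> int:
    return int(pred.strip().lower() == gold.strip().lower())

print(exact_match("lilongwe", "Lilongwe"))                 # 0 -> counted as wrong
print(normalized_exact_match("lilongwe", "Lilongwe"))      # 1 -> counted as correct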

Quantization and Model Pruning

In my journey to optimize machine learning models, I harnessed two vital techniques: quantization and pruning. Quantization, the process of reducing numerical precision, allows the model to run faster and more efficiently, while pruning eliminates less critical parameters, boosting efficiency without compromising performance. These methods became my toolkit for achieving a delicate balance between model accuracy and operational speed. My journey began with pruning, where I trimmed away extraneous model elements; I then applied quantization to reduce the computational overhead. This combined approach refined my model, making it well-suited for applications that prioritize efficiency without substantial accuracy loss, and it’s this dual strategy that empowered the model to shine in efficiency-focused scenarios.

Pruning a model

Once your training process is successfully completed, it’s time to take the next steps in model optimization. Begin by navigating to your trained model directory and execute the following command.

bash run_pruning.sh

When prompted, provide the path to your trained model. This step utilizes the Intel Neural Compressor to prune your model, and you have the flexibility to fine-tune parameters by adjusting values in the run_pruning.sh file.
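Under the hood, run_pruning.sh drives the Intel Neural Compressor. The pruning API differs between Neural Compressor releases, so the following is only a rough sketch in the style of the 2.x training API; the sparsity target, pruning type, and surrounding loop are assumptions and should be checked against the version the script uses.

# Rough sketch of pruning with Intel Neural Compressor (2.x-style training API; verify
# module and class names against the version pinned by run_pruning.sh).
from neural_compressor.config import WeightPruningConfig
from neural_compressor.training import prepare_compression

config = WeightPruningConfig(target_sparsity=0.8, pruning_type="magnitude")  # assumed settings
compression_manager = prepare_compression(model, config)                     # model assumed to exist

compression_manager.callbacks.on_train_begin()
# ... run (or continue) a short fine-tuning loop here, invoking the manager's
# on_step_begin / on_step_end callbacks around each optimizer step ...
compression_manager.callbacks.on_train_end()
# The pruned model is then available via compression_manager.model.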

After pruning, navigate to your pruned model, and now, let’s venture into quantization.

bash run_quantization.sh

After running the command provide the path to your pruned model when prompted. Once again, this process leverages the Intel Neural Compressor to quantize your model, optimizing it for efficiency.
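run_quantization.sh likewise wraps the Intel Neural Compressor. A minimal post-training quantization sketch using its 2.x Python API looks like the following; the dynamic approach and output path are assumptions, and the actual script may configure things differently.

# Minimal post-training quantization sketch (Intel Neural Compressor 2.x API; assumptions noted).
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

conf = PostTrainingQuantConfig(approach="dynamic")    # dynamic quantization: no calibration data needed
q_model = quantization.fit(model=model, conf=conf)    # model = the pruned FP32 model, assumed to exist
q_model.save("./quantized-model")                     # write the INT8 model to disk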

About Intel® Neural Compressor

The Intel® Neural Compressor is a versatile tool designed for model compression, effectively reducing the size of models and boosting the speed of deep learning inference for deployment on CPUs or GPUs. This open-source Python* library streamlines well-known model compression techniques like quantization, pruning, and knowledge distillation, making them accessible across a range of deep learning frameworks.

This multifaceted library equips you with a range of powerful tools:

> Accelerate model convergence through automated, accuracy-driven strategies for quantized models.

> Optimize large models by efficiently pruning less vital parameters.

> Refine smaller models for deployment via knowledge distillation from larger ones.

> Initiate model compression with a user-friendly, one-click approach that streamlines the entire process.

Benchmark of the model at its different optimization stages

Finally, navigate to your final quantized model. While it may not be the most accurate model, it offers the highest throughput, making it ideal for various applications.

Deploying this model as a Web App

Prototype Tech Stack

Web App Structure

This web application utilizes FastAPI for the backend and Streamlit for the frontend. Here’s how it all comes together:

Backend (FastAPI): FastAPI serves as the backbone of the web application. It handles HTTP requests, processes data, and communicates with the machine learning model. It also automatically generates an interactive API documentation, which is invaluable for developers.

Frontend (Streamlit): The user interface of the web app is built using Streamlit. It simplifies the creation of interactive, data-driven applications. Streamlit allows you to incorporate widgets and components, such as sliders, buttons, and charts, to make the user experience intuitive and engaging.

Model Deployment and Optimization: The machine learning model deployed in this web app has been optimized for Intel hardware using the Intel Optimization for PyTorch through the Intel AI Analytics Toolkit. However, the code has been structured in a way that it can be used on various hardware platforms, ensuring that it remains versatile and accessible to a wide range of users.

Modularity: To reuse this web app for a different model, you can modify the model’s configuration and ensure that you update the MD5 checksum in the provided bash script. This modularity makes the web app adaptable for various machine learning models and use cases.
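To make the structure above concrete, here is a minimal sketch of how the two halves could talk to each other. The endpoint name, payload shape, and generation settings are assumptions for illustration; the actual code lives in the webapp folder of the repository.

# backend.py -- minimal FastAPI sketch (endpoint name and payload shape are assumptions)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QARequest(BaseModel):
    context: str
    question: str

@app.post("/answer")
def answer(req: QARequest):
    # tokenizer and model are assumed to be loaded once at startup
    inputs = tokenizer(f"question: {req.question}  context: {req.context}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return {"answer": tokenizer.decode(output_ids[0], skip_special_tokens=True)}

# frontend.py -- minimal Streamlit sketch calling the backend above
import requests
import streamlit as st

context = st.text_area("Context")
question = st.text_input("Question")
if st.button("Submit"):
    resp = requests.post("http://localhost:4444/answer",
                         json={"context": context, "question": question})
    st.write(resp.json().get("answer", ""))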

Now, let’s proceed with turning this well-trained and optimized model into an actual product. Navigate to the “webapp” folder within the repository, and initiate the setup by installing the necessary libraries using the commands:

pip install -r requirements.txt
bash run.sh

This script will launch both the backend and frontend components of the model. Additionally, it will download and verify the model for use. If you intend to adapt this web application for a different model, remember to update the MD5 checksum within the bash script accordingly.

If you are running on a remote instance, make sure you forward two ports by running the following commands in a new terminal:

ssh myidc -L 4444:10.10.10.X:4444
ssh myidc -L 4445:10.10.10.X:4445

Replace `10.10.10.X` with the public IP of your remote machine. Then open `localhost:4444/docs` for the backend API documentation and `localhost:4445` for the UI.

The Web Application in Action

On the web app, enter the context in the Context dialog box and your question in the Question dialog box, then click Submit.

Conclusion

In conclusion, Intel’s oneAPI toolkit is revolutionizing software development by breaking free from hardware constraints. Its comprehensive toolkits cater to diverse domains, and the Intel AI Analytics Toolkit, as showcased here, accelerates deep learning and streamlines Python-based data science.

Intel’s collaboration with PyTorch via the Intel Extension enhances performance, while the Intel Neural Compressor enables model compression. The deployment of a web app showcases its adaptability.

Intel’s oneAPI is a game-changer, offering flexibility, efficiency, and innovation for developers, transcending hardware limitations to propel us into the future of computing.

Adios!
