Efficiency in AI (Part 1)

Andrii Polukhin
7 min read · Apr 29, 2024


Practical tips for optimizing AI systems

This blog post is inspired by my webinar on YouTube (in Ukrainian), so I decided to turn it into written notes. Most of the images are taken from my presentation. The material is divided into two parts; this is the first one, and the second one is also available. 🔥

Neural networks are expensive: they are big and take a long time to run, especially LLMs. We will discuss how to speed them up, save on cost and time, and even run some large neural networks on your own devices, like a MacBook or any laptop. So, let’s talk about efficiency in AI.

Table Of Contents

  1. Why do we need to optimize AI?
  2. The neural network pipeline
  3. How to choose an ML framework?
  4. Practical example: How to run GPT on a consumer laptop?

A retro 8-bit style image of a small robot training in a gym, preparing for a run, drawn in simple pixel graphics. Credits: ChatGPT.

Why do we need to optimize AI?

Figure 1. ChatGPT Pros & Cons.

Let’s start with a bit of clickbait about ChatGPT, a topic that comes up everywhere these days. What are the advantages of ChatGPT? It serves as a personal assistant, generates ideas, responds quickly, and is multilingual: truly a smart tool.

However, ChatGPT can be used in different ways. Some use it as a personal assistant to tackle tasks. Large enterprises employ ChatGPT to handle customer inquiries. For instance, someone created a service akin to a psychologist where a person in need reaches out for help, and under the hood, it’s merely sending a request to ChatGPT.

But there are drawbacks:

  1. Firstly, it collects your data. When using ChatGPT for personal tasks, understand that everything you write stays in the system and becomes training data for ChatGPT. So, if you input your passwords, addresses, or private company data, all of this trains ChatGPT, and someone on the other side of the planet may end up seeing a piece of text you once wrote.
  2. The second drawback is that it only works online. If you want to run ChatGPT at home without an internet connection, or if a company wants to build an internal chatbot that answers questions over its own database without sending anything online, you face limitations.
  3. The third drawback is the closed-source nature. We don’t know the specific architecture or how exactly it generates text. It’s a problem because we can’t control it.
  4. Fourth, there’s a limited usage policy. While it’s great that companies like OpenAI care about safety (preventing questions about how to break into a car or steal something), on the flip side, if you work with the military or any other sensitive subject, you won’t get a response: either the model will refuse to answer, or it wasn’t trained on such data.

Figure 2. Alternatives to ChatGPT.

There are alternatives to ChatGPT: Claude from Anthropic, which Amazon has invested a lot in; Bard from Google, which is free to use; Coral from Cohere, which is enterprise-focused rather than meant for personal use; Grok from xAI, the company affiliated with X (formerly Twitter); and Ernie Bot by Baidu, the Chinese counterpart to ChatGPT.

However, all of these alternatives share the same disadvantages. They have closed source code, limited usage policies, collect your data, and operate only online.

Open-source large language models (LLMs)

These models offer full transparency with open-source code and weights, allowing you full control over how you run the model, how it processes your data, and what it remembers.

Figure 3. Open Source LLM Pros & Cons.

Here’s a short list of open-source Large Language Models that are available for use. You might be familiar with Llama2, which is quite well-known. Then there’s Mistral, Yi-Chat from 01.AI, and even Intel is developing its own models. Many different companies are contributing their Large Language Models to the open-source community.

Figure 4. List of Open Source LLM.

The neural network pipeline

Figure 5. Typical Neural Network Inference Pipeline.

Now, let’s delve into the core topic of this article: neural network optimization. To run these neural networks efficiently and without spending a fortune, we need to optimize and accelerate their operation. Let’s first examine the standard process of launching a neural network.

We start with any neural network or AI, such as YOLO for Object Detection, Llama2 as an alternative to ChatGPT, Stable Diffusion, or Whisper. We use a machine learning framework like PyTorch, TensorFlow, or ONNX — there are many, and they’re built for different tasks. These neural networks process input data through a pipeline, which typically involves pre-processing to prepare the data for the network, inference, and then post-processing of the results. For example, during pre-processing, you might add additional data from a book or corporate database.
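To make the pipeline concrete, here is a minimal sketch in PyTorch. The tiny model and the pre/post-processing steps are placeholder examples of mine, not code from the presentation:

# Code in Python
import torch
import torch.nn as nn

# A toy stand-in model; in practice this would be YOLO, Llama2, Whisper, etc.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

def preprocess(raw):
    # Prepare raw input for the network: normalization, tokenization, resizing...
    return torch.tensor(raw, dtype=torch.float32).unsqueeze(0)

@torch.no_grad()
def infer(batch):
    # The forward pass itself
    return model(batch)

def postprocess(logits):
    # Turn raw outputs into something useful: labels, text, boxes...
    return logits.softmax(dim=-1).argmax(dim=-1).item()

print(postprocess(infer(preprocess([0.1, 0.2, 0.3, 0.4]))))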

What can be optimized?

  1. Naturally, the neural network itself can be reduced in size and made faster (see the quantization sketch below).
  2. You can also switch to a framework that better suits your needs.
  3. Of course, the pipeline itself can be optimized; it often looks the most cumbersome and likely contains the most code.

However, this very much depends on your specific task and business. We will focus specifically on optimizing neural networks and ML frameworks.
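To illustrate point 1, a common way to shrink a network is quantization: storing weights as 8-bit integers instead of 32-bit floats. A minimal sketch using PyTorch's dynamic quantization (assuming PyTorch is your framework; others have their own equivalents):

# Code in Python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
# Replace all Linear layers with int8-weight versions; activations stay float
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
# int8 weights take roughly 4x less memory than float32
print(quantized)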

How to choose an ML framework?

Figure 6. Machine Learning Frameworks.

Starting with the choice of framework, you need to pick one that suits your project. PyTorch and TensorFlow are the general, widely-known machine learning frameworks. You’re likely familiar with these frameworks since they allow you to do almost everything from training and running neural networks to optimization and deployment.

However, these may not be the most efficient choices. There are also mobile frameworks like TensorFlow Lite and TensorFlow Edge TPU from Google, NCNN from Tencent, and Core ML from Apple, which are designed to run neural networks on mobile devices or edge devices like Raspberry Pi.
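As an illustration of the mobile path, converting a trained TensorFlow model to TensorFlow Lite takes only a few lines. A generic sketch, where "saved_model_dir" is a placeholder for your own exported model:

# Code in Python
import tensorflow as tf

# Convert a SavedModel into the compact TFLite format for mobile/edge devices
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)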

Then we have optimized frameworks used particularly in the MLOps stage, when you need to run neural networks in production. ONNX from the Linux Foundation is an open-source framework; TensorRT from NVIDIA significantly speeds up neural network performance on graphics cards; OpenVINO from Intel is tailored for Intel processors, allowing you to run models like Stable Diffusion on a standard PC with an Intel processor; and JAX is Google’s successor to TensorFlow, potentially becoming more popular than PyTorch in the future.

PaddlePaddle is a Chinese framework from Baidu that runs on both GPUs and CPUs. DeepSparse from Neural Magic exploits sparsity to run networks efficiently on server CPUs, with plans to expand to mobile devices. There is also Mojo from Modular, a new Python-like language aimed at AI workloads. Finally, vLLM is a framework for fast inference and serving of Large Language Models.
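To make "switching frameworks" concrete, here is a minimal sketch of exporting a PyTorch model to ONNX and running it with ONNX Runtime; the toy model and input shape are placeholders of mine:

# Code in Python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Export a (toy) PyTorch model to the ONNX format
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
torch.onnx.export(model, torch.randn(1, 4), "model.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported model with the (often faster) ONNX Runtime engine
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(outputs[0])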

Figure 7. Machine Learning Frameworks: Name, Purpose, Platform.

Practical example: How to run GPT on a consumer laptop?

Let’s run GPT models on a consumer PC using a Python library named GPT4All. This library supports various language models such as Mistral, Llama2, Llama, OpenLLaMa, and Orca Mini, among others.

# Code in Python
from gpt4all import GPT4All

# Download (on first use) and load a small quantized Orca Mini model
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
# Generate a short completion for the given prompt
output = model.generate("The capital of France is ", max_tokens=3)
print(output)

The code snippet above shows how straightforward it is to use the library. With just a few lines of code, a user can load a model and generate text from a given prompt. The performance benchmarks below the snippet show how many tokens (sub-word chunks of text; a token is roughly three-quarters of an English word on average) can be generated per second on different hardware setups, including a MacBook Pro with an M2 chip and 16 GB of memory.

Figure 8. Nomic Vulkan Benchmarks: Single batch item inference token throughput benchmarks.
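Beyond one-off completions, GPT4All also supports multi-turn conversations via its chat session API. A small sketch with prompts of my own:

# Code in Python
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
# A chat session keeps conversation history between generate() calls
with model.chat_session():
    print(model.generate("Name three ways to make a neural network faster", max_tokens=100))
    print(model.generate("Which of those works best on a laptop?", max_tokens=100))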

Thank you for reading this article on the various optimization techniques for neural networks and machine learning models. It’s fascinating to see how advancements in this field can bring powerful AI capabilities to everyday devices.

If you found it informative and engaging, please connect with me through my social media channels.

If you have any questions or feedback, please feel free to leave a comment below or contact me directly via any of my communication channels. I look forward to sharing more insights and knowledge with you!


Andrii Polukhin

I am a deep learning enthusiast. Currently, I am an ML Engineer at Data Science UA and Samba TV, writing about neural networks and artificial intelligence.