Empowering Your Business With Local LLMs Becomes Possible

Drogowoz · Published in Akvelon · 9 min read · Aug 9, 2023

Large Language Models (LLMs) show incredible results across different tasks: they summarize text, classify data, recognize patterns, and generate code.

Cloud-based LLMs have already transformed many industries and are becoming the norm for business tasks. According to an Arize AI survey, OpenAI leads the field, with 83.0% of ML teams considering one of its models; LLaMA ranks second, with 24.5% of teams planning to use the open-source model. Building your own high-performing AI bot on top of an LLM is therefore quite possible and manageable.

However, when it comes to deciding how to run an LLM, it becomes clear that cloud services are not always the best option, especially when you work with certain kinds of data in industries like healthcare or finance. Moreover, there are cases where connecting to the cloud is simply impossible.

At Akvelon, our clients’ need for security, privacy, and low latency comes first. That’s why we have explored ways to run LLMs locally, ensuring that protected data cannot fall into the wrong hands and that the GDPR and HIPAA compliance requirements for running LLMs are met.

In this article, we’ll explore which industries use local LLMs as an alternative to cloud LLMs, review the most common approaches to keeping a local LLM secure, and share insights into launching a secure local LLM on your own machine, so you can empower your business with a secure AI-powered solution.

Which Industries Hesitate to Use Cloud LLM Solutions

While cloud-based LLMs are scalable and cost-efficient, not every business is ready to adopt them. The reasons CTOs and CEOs hesitate are quite understandable:

  • Data privacy and security concerns
  • Unreliable cloud providers
  • Risk of losing control over data and infrastructure
  • Difficulty of migrating away from a cloud-based LLM
  • Network latency

These reasons make businesses look for ways to run an LLM locally. The companies most likely to need an alternative to cloud-based LLMs are in industries with strict security and privacy requirements, industries that depend on LLM speed and cannot afford high latency, and those that sometimes cannot connect to a server at all.

Healthcare

In healthcare, certain devices and applications use the power of AI to process sensitive patient information and provide timely medical insights. This imposes security and privacy restrictions that have to be met, as well as speed requirements. For example, a portable medical device should be able to analyze and interpret patient symptoms without connecting to the internet, while keeping patient data private.

Finance

Local-machine LLMs can be used in banking applications to process customer queries, assist with financial analysis, and identify potential fraud patterns without transmitting sensitive data to external servers, so no part of the client data is ever shared.

Autonomous transportation

Autonomous vehicles need onboard language models to interpret voice commands from passengers, understand traffic signs, and interact with pedestrians. A local-machine LLM enables the vehicle to process language-related tasks quickly and efficiently, enhancing the overall safety and responsiveness of the vehicle.

Research and exploration

In remote or off-grid locations, researchers, explorers, and scientists may need language models to process and analyze data without internet access. A local-machine LLM empowers them to work with language-related tasks in the field without the need for a constant online connection.

How to Ensure That Local LLMs Are Secure

It’s a huge misconception to think that locally run LLMs are automatically protected against data breaches, bias, misuse, or other types of attacks simply because they are local. It’s still crucial to follow best practices for secure local-machine LLM deployment and to hold regular security and compliance assessments.

Confidentiality

Failing to maintain the confidentiality of sensitive data, like Personally Identifiable Information (PII), in LLM systems can lead to data breaches and legal repercussions. For example, the 2017 Equifax data breach exposed the sensitive information of millions of consumers, resulting in reputational damage and financial loss.

To avoid confidentiality-related penalties, companies should train the LLM to recognize and handle sensitive data, redacting it or refusing to process it, and should monitor LLM usage and log requests to detect potential breaches or unauthorized activity.
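As a minimal illustration of the redaction idea, a simple pre-filter can scrub obvious PII patterns from text before it reaches the model or its logs. The patterns below are deliberately simplistic placeholders; production systems use dedicated PII-detection tooling.

```python
import re

# Illustrative-only patterns; real PII detection needs much more than regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a label before prompting or logging."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach John at john.doe@example.com or 555-123-4567."))
# -> Reach John at [EMAIL] or [PHONE].
```

The same filter can run over the request log, so monitoring never stores raw PII either.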

Data privacy

Non-compliance with data privacy regulations such as GDPR and HIPAA in LLM implementations can result in severe penalties and loss of consumer trust.

To keep data privacy regulations satisfied, configure the LLM to obtain explicit user consent before collecting or processing personal data, and train the system to anonymize or pseudonymize user data so that the deployment stays compliant with regulations like GDPR and HIPAA.
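As one small sketch of pseudonymization, direct identifiers can be replaced with keyed hashes before text reaches the LLM, so records stay linkable without exposing raw values (the key below is a placeholder; store the real one in a secrets manager):

```python
import hashlib
import hmac

SECRET_KEY = b"placeholder-key-keep-in-a-vault"  # assumption: a managed secret

def pseudonymize(identifier: str) -> str:
    """Map an identifier to a stable token; irreversible without the key."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("patient-42"))  # same input always yields the same token
```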

Authentication and authorization

Weak authentication and authorization mechanisms in LLM-based systems may lead to unauthorized access to sensitive information or systems even on a local machine.

To prevent unauthorized access to your business data, integrate the LLM with secure authentication protocols like OAuth 2.0 or SAML, employ role-based access control (RBAC) to manage user permissions, and apply multi-factor authentication (MFA) for LLM user access.
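As a minimal sketch of the RBAC idea in front of a local LLM endpoint (the roles and actions below are invented for illustration; a real deployment would take them from an OAuth 2.0 or SAML identity provider):

```python
# Map each role to the LLM actions it may invoke.
ROLE_PERMISSIONS = {
    "analyst": {"summarize", "classify"},
    "admin": {"summarize", "classify", "fine_tune"},
}

def authorize(role: str, action: str) -> bool:
    """Return True only if the role's permission set includes the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert authorize("admin", "fine_tune")
assert not authorize("analyst", "fine_tune")  # analysts cannot retrain the model
```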

Regular security audits

No matter how well protected your local machine is, new threats arise over time. It’s important to conduct regular security audits of both the machine and the LLM implementation to identify vulnerabilities and ensure compliance with the latest security recommendations.

How to Launch LLM on a Local Machine

Now that we have reviewed the cases where you might want to use an LLM on a local machine and explored the security measures, let’s dive deeper into the available models, local machine requirements, and setup.

Existing models

Overall, there are two groups of LLMs you can use for your business tasks: proprietary models served through a vendor’s API, and open-source models that you can host yourself.

You can compare the models on public benchmark leaderboards such as the Hugging Face Open LLM Leaderboard.

Since open-source models can be launched locally without internet access, all your data stays on your machine. These models can also be fine-tuned to solve a specific task.

Note: you also need to check each model’s license, as some models are not licensed for commercial use.

Local machine requirements

To run an LLM on a local machine, you have to ensure it meets certain requirements for optimal performance and functionality. The main constraint is RAM: you need enough memory to hold the model, so first determine how much memory your machine has. A sufficiently powerful GPU will also speed up inference, and no internet access is needed to run the model. The specific requirements vary with the size and complexity of the LLM and the tasks it needs to perform.

As a general rule of thumb, the memory you need scales with the parameter count times the bytes per weight, as sketched below.
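A back-of-the-envelope estimate, an approximation rather than a vendor specification:

```python
def approx_ram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough RAM estimate: parameters x bytes per weight, plus ~20% for
    activations, KV cache, and runtime buffers."""
    return params_billions * (bits_per_weight / 8) * overhead

print(approx_ram_gb(15, 16))  # a 15B model in fp16: ~36 GB
print(approx_ram_gb(15, 4))   # the same model quantized to 4-bit: ~9 GB
print(approx_ram_gb(7, 4))    # a 7B model at 4-bit fits in ~4 GB
```

This is why quantization, covered next, is usually the first step toward running a large model on commodity hardware.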

Quantization

Neural network quantization is a technique used in deep learning to reduce the computational and memory requirements of neural networks. It involves converting the weights and activation values of a network from floating-point precision (32-bit) to a lower precision (8-bit or even lower). This significantly reduces the memory footprint required to store the model and accelerates computation by allowing more efficient data movement and processing.

Quantization can be done in various ways, but a common approach is called post-training quantization. In this method, a pre-trained neural network is taken and the weights and activations are recalibrated to the desired lower precision. This often involves selecting a representative dataset to collect the range of values that occur during inference, and then scaling and rounding the values to fit within the target precision range.
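As a minimal sketch of the core idea, here is naive symmetric int8 quantization of a single weight matrix; real toolchains (GPTQ, ggml’s quantization formats, and others) are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 with one shared scale per tensor."""
    scale = np.abs(weights).max() / 127.0   # largest weight maps to int8's edge
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # a stand-in weight matrix
q, scale = quantize_int8(w)
print(w.nbytes / q.nbytes)                     # 4.0: int8 is 4x smaller than float32
print(np.abs(w - dequantize(q, scale)).max())  # small rounding error per weight
```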

By reducing the precision, quantization aims to strike a balance between model size, memory usage, and computational efficiency. It allows for the deployment of neural networks on hardware with limited resources such as mobile devices, embedded systems, or specialized chips like graphics processing units (GPUs) or application-specific integrated circuits (ASICs).

Running an LLM on a local machine

Now that you are familiar with the basic requirements for setting up an LLM locally, let’s see how to run one, using WizardLM’s WizardCoder-15B-1.0 in GGML format as an example.

Clone the ggml repository to your local machine. Then install the required Python dependencies and build the ggml examples following the repository’s README; WizardCoder is a StarCoder-family model, so the starcoder example is the one that runs it.

Now let’s download the model we have chosen.
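For example, a quantized GGML file can be fetched from the Hugging Face Hub. The repository and file name below follow TheBloke’s GGML conversion of this model; check the model page for the exact quantization variants it ships:

```python
from huggingface_hub import hf_hub_download

# Downloads once and caches locally; after that, no internet access is needed.
model_path = hf_hub_download(
    repo_id="TheBloke/WizardCoder-15B-1.0-GGML",
    filename="WizardCoder-15B-1.0.ggmlv3.q4_0.bin",  # one of the 4-bit variants
)
print(model_path)
```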

Let’s run the model.
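The walkthrough above runs the model through the compiled ggml example binary. As an equivalent pure-Python sketch, the ctransformers library can load the same GGML file; the parameter names follow ctransformers’ API at the time of writing:

```python
from ctransformers import AutoModelForCausalLM

# WizardCoder is a StarCoder-family model, hence model_type="starcoder".
# Inference happens fully offline once the file is on disk.
llm = AutoModelForCausalLM.from_pretrained(
    model_path,              # the path returned by hf_hub_download above
    model_type="starcoder",
)
print(llm("Write a Python function that checks whether a number is prime.",
          max_new_tokens=128))
```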

Now you’re all set up! However, even though your chosen model is up and running, you may still need to fine-tune it to fit your task well.

Parameter-Efficient Fine-Tuning (PEFT)

Fine-tuning a pre-trained LLM is a great way to adapt a model to your specific task. Parameter-efficient fine-tuning (PEFT) is a technique that improves the adaptation of pre-trained language models to domain-specific tasks using a smaller number of task-specific examples. It aims to reduce the computing and data requirements of fine-tuning by leveraging the existing pre-training.

The goal is to improve the performance of the model on the specific target task while minimizing the amount of new training.

To achieve parameter efficiency, PEFT employs methods like adapter layers or Transformer knowledge distillation. Adapter layers allow the model to selectively learn task-specific adaptations without significantly affecting the pre-trained parameters. Knowledge distillation allows the model to transfer knowledge from a larger model to a smaller one by mimicking its behavior.
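As a minimal sketch of the adapter-style approach, here is LoRA via Hugging Face’s peft library, one widely used PEFT method; the base checkpoint and hyperparameters are illustrative assumptions, so substitute whichever model you actually run:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("WizardLM/WizardCoder-15B-V1.0")
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)  # freezes base weights, adds small adapters
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Training then proceeds as usual, but only the adapter matrices receive gradient updates, which is what keeps the compute and data requirements low.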

By utilizing PEFT, fine-tuning can be performed efficiently, requiring less computational resources and data compared to traditional fine-tuning approaches. This makes it especially useful when domain-specific training data is scarce or when fine-tuning needs to be done on smaller devices with limited capacity.

Reinforcement Learning With Human Feedback

After fine-tuning a pre-trained LLM, or instead of fine-tuning, you can use Reinforcement Learning (RL) algorithms to align the model so it behaves in a desired manner. For example, you can decrease the model’s toxicity or improve the helpfulness of its responses. To do this, you need to train an additional model, called a reward model, that scores the text the LLM generates; those scores then guide adjustments to the LLM’s weights via RL algorithms.

To train the reward model, you rank different LLM responses to the same prompt, choosing the best among multiple options. This comparison data lets you train the reward model to score an LLM’s output.
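The standard training objective for such a reward model is a pairwise ranking loss that pushes the score of the preferred response above the score of the rejected one:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores: torch.Tensor,
                        rejected_scores: torch.Tensor) -> torch.Tensor:
    """Lower loss when the preferred response outscores the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

chosen = torch.tensor([2.1, 0.3])    # reward-model scores for preferred answers
rejected = torch.tensor([0.5, 0.9])  # scores for the answers ranked lower
print(reward_ranking_loss(chosen, rejected))
```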

Reinforcement learning techniques, such as Proximal Policy Optimization (PPO) or similar algorithms, are then used to fine-tune the LLM against this reward model. The aim is to improve the model’s responses according to the annotators’ preferences while blending in the pre-trained knowledge.
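A heavily simplified PPO loop with the trl library looks roughly like the following. Here "gpt2" is a small stand-in model that keeps the sketch runnable, and a constant reward stands in for a trained reward model’s score; trl’s API has changed across versions, so treat this as an illustration of the flow rather than a recipe:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # stand-in; in practice, your fine-tuned LLM
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1),
                         model, ref_model, tokenizer)

query = tokenizer.encode("How do I reset my password?", return_tensors="pt")[0]
output = ppo_trainer.generate(query, max_new_tokens=24,
                              pad_token_id=tokenizer.eos_token_id)
response = output.squeeze()[query.shape[0]:]  # keep only the generated tokens

reward = torch.tensor(1.0)  # in practice: the reward model's score for this response
stats = ppo_trainer.step([query], [response], [reward])  # one PPO update
```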

This iterative process of supervised fine-tuning, comparison data collection, and reward model fine-tuning helps to align the large language model with human preferences, making it more useful and safe for various applications.

Final Thoughts

It is possible to run an LLM on your local machine, even with limited RAM, while preserving the security of your data.

To adapt the LLM you have chosen to your business needs, you can use PEFT or reinforcement learning with human feedback, and quantization will help it run smoothly on limited hardware without delays in response.

The Akvelon team has extensive experience in tailoring LLMs to different business needs while preserving security requirements.
