Custom LLMs in Action: How to successfully integrate LLMs in your company

Aptitude Global
The Aptitude Data Blog
12 min read · Feb 1, 2024

Large Language Models (LLMs) have made a spectacular impression in recent years on company execs looking to improve the fortunes of their businesses. But where do they start?

Welcome to the first post of our series “Custom LLMs in Action”, which aims to guide businesses through the complex process of integrating LLMs into their operations by leveraging Aptitude’s first-hand experience. Throughout this series we will delve into various perspectives: strategic, business development, and technical, exploring the application of custom Large Language Models (LLMs) to create new solutions, optimise existing processes, and enhance decision-making within companies. To illustrate these concepts, we’ll spotlight a compelling use case within the pharmaceutical sector, showcasing how Aptitude has successfully harnessed the power of custom LLMs.

Two options for success: use existing LLM apps or build custom LLMs?

The emerging field of GenAI evolves fiercely every week and companies are fighting to avoid falling behind in its adoption. However, many questions arise:

  • Where to start?
  • How do I integrate LLMs in my infrastructure and applications?
  • Should I train my own LLM from scratch?
  • Should I customise a pre-trained open-source model?
  • Or maybe use existing models through APIs?

The answer to all of these questions depends on several factors, such as how digitally mature our company is, whether this is the right time to start adopting the technology, and others that we will discover throughout this and subsequent posts.

A smart way to start this type of project is with a keep-it-simple philosophy: research and craft a simple problem statement, then use it to trigger a series of PoCs with existing LLMs or GenAI apps, where we can validate the use case quickly and efficiently with our customers and pivot if necessary to fit new needs and changes.

This first step consists of interacting with existing applications that can perform tasks similar to the ones we are looking for, such as ChatGPT, Bing, Alexa, MidJourney, etc. This initial exploration will also allow us to better define and narrow down our problem. If we want to exploit these applications further, many of them offer an API to experiment with and create more robust PoCs/MVPs.
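
For example, a first PoC against a hosted API can fit in a few lines of Python. The snippet below is a minimal sketch, assuming the openai client library and an illustrative model name; any provider with a comparable chat/completions API would do, and the summarisation prompt is just an example.

```python
# Minimal PoC sketch: call a hosted LLM through its API.
# Assumes `pip install openai` and an API key in the OPENAI_API_KEY env variable.
from openai import OpenAI

client = OpenAI()

def summarise_document(text: str) -> str:
    """Ask a hosted LLM to summarise an internal document for a quick PoC."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name; pick per provider catalogue/pricing
        messages=[
            {"role": "system", "content": "You summarise pharmaceutical documents."},
            {"role": "user", "content": f"Summarise the following document:\n\n{text}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

print(summarise_document("Standard operating procedure for sample handling..."))
```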

At this point, it is crucial to recognise that the ethical and responsible use of these applications lies on our shoulders. That is: be aware of the data we are using and how it is treated, and understand the different pricing and subscription plans before putting all our eggs (our business, apps and data) in one basket (an external service we don’t own or control). Notice how little control we have over these applications and APIs, and pay special attention to whether they conform to our company’s privacy and security policies.

(SPOILER) For this reason and others that we will discover soon, it is important to continue reading this post and thus be able to approach the development of Custom LLMs.

From here, we can iterate our product towards more complex solutions that better fit our business, data, requirements and internal policies, i.e. Custom LLMs: language models trained on specific data to address unique requirements and achieve higher accuracy and performance in specific domains. In short, we are generating a new model, adapted to our data and use case, with other implicit advantages: we own the model and its lifecycle, and we know the associated costs at all times, which ensures total control over it.

The following diagram shows the transition between existing and custom LLMs, and how we actively participate in the successful development of GenAI products, and hence custom LLMs, through more sophisticated methods such as Fine-Tuning, RAG (Retrieval Augmented Generation), RLHF (Reinforcement Learning from Human Feedback), Knowledge Graphs, etc.

Transitioning between Existing and Custom LLMs

The reality: Why might GenAI projects fail?

It is still too early to talk about failed GenAI projects, but if we analyse the potential challenges and pitfalls, several reasons may emerge. In this section, our attention will be directed towards four crucial aspects: costs, governance and ethics, data issues and current trend and hype.

Costs

First of all, let us outline three categories of costs associated with GenAI projects according to their type and LLM usage:

NOTE: This is a very high level table, which aims to simplify the types of GenAI projects and possible associated costs. Each use case should consider the variants and combinations that best suit its needs.

The ultimate goal is to find the best of both worlds, bearing in mind that these kinds of projects, if not managed properly, can incur very high costs.

Governance and ethics

Governance and ethics are likely the most crucial and distinguishing factors, and they were absent from previous data projects. These two concepts are relatively recent, and success or failure hinges on understanding and applying them properly.

When relying on existing LLMs we have to bear in mind the following: we do not own the models, the technologies or the data used. Moreover, it is very likely that they are not public or easily accessible, in order to protect the intellectual property of their creators.

This poses a problem of control and governance in many respects, such as:

  • Costs: will app access or API subscription costs remain constant? → Establish contracts and plans with the providers.
  • Versioning: will newer model versions guarantee input/output reproducibility? → Run consistent checks to maintain our app’s behaviour (see the sketch after this list).
  • Ethics: do these models comply with internal and external regulations? → Understand the data used to train the models, the existing filters and the responses they may produce.
  • Data leakage: is our data safe? → Don’t share internal sensitive information.
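
As an illustration of the versioning point above, a lightweight regression check can be run whenever the provider announces a new model version. This is a minimal sketch: the golden prompts and expected facts are hypothetical, and `ask_llm` is a placeholder for whatever client call the app already uses.

```python
# Minimal regression-check sketch for hosted LLMs: keep a small "golden set"
# of prompts and expected key facts, and re-run it on every new model version.
GOLDEN_SET = [
    {"prompt": "Is document DOC-001 a standard operating procedure? Answer yes or no.",
     "must_contain": ["yes"]},
    {"prompt": "Which department owns quality manual QM-7?",
     "must_contain": ["quality assurance"]},
]

def ask_llm(prompt: str) -> str:
    # Placeholder: plug in the provider client call your app already uses.
    raise NotImplementedError("Replace with your LLM API call.")

def run_regression_checks() -> list[str]:
    """Return the prompts whose answers no longer contain the expected facts."""
    failures = []
    for case in GOLDEN_SET:
        answer = ask_llm(case["prompt"]).lower()
        if not all(fact in answer for fact in case["must_contain"]):
            failures.append(case["prompt"])
    return failures
```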

Most of these problems can be addressed by choosing to create our own custom LLM. This option provides us with control, allowing us to utilise our own data (or from third parties) and apply the filters and treatments we deem necessary. We will oversee the entire model life cycle, including versioning, deployment in various environments, and deprecation. Ultimately, we will retain intellectual property within the perimeter of our company, and the decision regarding the usage and exposure of our models will remain internal.

Data issues

GenAI projects are, after all, data projects. We need to be aware of the data we have and its quality and quantity to ensure good results. Are we able to answer these questions: Do we know our data well enough? Is it democratised and standardised throughout our company? Do we have permission to use it? Is it accessible?

It is not necessary to answer all these questions positively, as some GenAI tasks are geared towards solving these very problems; the point is simply to be aware of where we are and of our maturity as a company around data.

Bret Greenstein says: “Companies should prioritise potential AI use cases first by impact, second by risk, and third by data.”

Current trend and hype

At this point, it may be more important for a company to strategically position itself in the market than to make good use of these new technologies to carry out relevant use cases that add value to the company. This is totally valid, and may be associated with the effect known as FOMO (Fear Of Missing Out).

It is important to stress that some companies do not really have a problem that can be solved with GenAI, but the easy access to this new technology and the fast results can lead to unrealistic expectations.

For a first PoC or MVP it is probably easier to use existing generalist models (also known as foundation models), by subscribing to a specific model that can, for the time being, meet our needs. With this we guarantee a first contact with the final product, feedback from users and customers, and, why not, an early market launch, which is GREAT. However, over time we will realise that these models are limited, our users will want more features, the models do not completely fit our data, and, most importantly, we do not have 100% control over them.

For these reasons, it is important to know how to transition (in a timely manner) from existing LLMs Apps/APIs to customised models, and thus to be able to considerably reduce the chances of a GenAI project failing in the medium-long term.

Integration is the key: how Aptitude reduces the gap by using custom LLMs

Aptitude recommends the integration of Custom LLMs within companies rather than using already existing ones, for many reasons:

  • Making the most of the company’s own data
  • Quality of results
  • AI ethics
  • Governance: ownership and control

The following diagram describes, at a very high level, a common workflow for a GenAI project and how Aptitude deals with this kind of development.

Main components:

  • Sources: Internal company data + 3rd party data (when possible) to enhance the models input, adapt to a specific domain or task and improve the quality of results by applying filters, transformations, cleaning, quality checks, etc.
  • The data format of most GenAI projects is unstructured: texts, images, videos, audio, etc.
  • As part of this, for some models and Use Cases, it is important to have an initial dataset called Ground Truth with labeled data.
  • The AI-Enabler is a set of best practices around the data and data workflows (ETLs) to ensure the models receive the latest available data with the best quality.
  • Data Engineering and ML Engineering tasks are carried out in this component, with the Data Scientist support.
  • The Gen-AI component is where the custom LLM is trained and deployed as an application or service. This is supported by an MLOps approach that guarantees the model’s end-to-end life cycle.
  • To achieve this, Aptitude has deep knowledge of some well-known tools, such as AWS SageMaker, MLflow or Kubeflow (a minimal experiment-tracking sketch follows this list).
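
As a hint of what the MLOps side can look like, here is a minimal experiment-tracking sketch with MLflow; the experiment name, parameters and metric values are purely illustrative.

```python
# Minimal MLflow tracking sketch for a fine-tuning run.
import mlflow

mlflow.set_experiment("custom-llm-fine-tuning")

with mlflow.start_run(run_name="t5-bioqa-poc"):
    # Parameters that define this training run (illustrative values).
    mlflow.log_param("base_model", "t5-base")
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("epochs", 5)

    # ... fine-tuning happens here ...

    # Metrics computed on the held-out evaluation set (illustrative values).
    mlflow.log_metric("exact_match", 0.80)
    mlflow.log_metric("f1", 0.89)
```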

This workflow can be enhanced with more techniques/approaches based on the use case and needs, such as:

  • RAG (Retrieval Augmented Generation): In combination with vector databases (such as VectorDB) or Knowledge Graphs, it leverages pre-existing knowledge or context to enhance the generation of relevant and coherent responses. This technique is particularly useful in tasks like question answering and text generation, where context plays a crucial role (a minimal retrieval sketch follows this list).
  • RLHF (Reinforcement Learning from Human Feedback): An approach that involves training machine learning models through a combination of initial data and human feedback. It helps improve model performance by incorporating insights from human evaluators.
  • Knowledge Graphs: A structured representation of knowledge that captures relationships between entities. By organising information in a graph format, it facilitates efficient data retrieval and enables the extraction of meaningful insights from interconnected data points.
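
To make the RAG idea concrete, the sketch below shows the retrieve-then-generate loop under heavily simplified assumptions: a toy hashing embedding stands in for a real embedding model, an in-memory list stands in for a vector database, the example documents are invented, and the final prompt would be sent to the LLM in a real system.

```python
# Minimal RAG sketch: embed the question, retrieve the closest chunks,
# and build a grounded prompt for the LLM.
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy bag-of-words hashing embedding; replace with a real embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalised, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# In-memory "vector store": (chunk_text, embedding) pairs over invented documents.
DOCUMENTS = [
    "SOP-12 describes how blood samples must be stored below -20C.",
    "Quality manual QM-7 is owned by the quality assurance department.",
]
STORE = [(doc, embed(doc)) for doc in DOCUMENTS]

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(STORE, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    # In a real system this prompt is sent to the (custom) LLM.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How should blood samples be stored?"))
```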

A Case Study: The Pharmaceutical Frontier

The pharmaceutical industry is one of those sectors where the amount of available data is tremendous, and a knowledge system (human, artificial or both) has become critical to understanding the documents and all the related data.

  • Domain expertise: complex terms and definitions.
  • Many different document types, mainly composed of unstructured data, such as:
    - Quality manuals
    - Policies
    - Standard operating procedures
    - Clinical trials
    - Specifications
    - Logbooks, etc.
  • Huge variety of languages

Business problem

Our journey into the realm of custom LLMs begins with a compelling use case in the pharmaceutical sector. Aptitude collaborated with a leading pharmaceutical company facing challenges in streamlining its research and development processes. The industry demands precision, speed, and constant innovation, making it a fertile ground for the application of advanced language models.

Traditionally, researchers spent extensive hours sifting through vast volumes of scientific documents to identify potential drug candidates and understand the latest advancements in their respective fields. This manual approach not only consumed valuable time but also posed the risk of overlooking critical information. Aptitude recognised this inefficiency and embarked on a mission to enhance the company’s research capabilities.

The business problem consists of improving the system for assessing and classifying documents, which is currently done manually and is very time consuming. The results are stored in a structured way in a database so that they can be analysed and consumed by other applications. In summary: automate, to some extent, the assessment step that is currently performed by manual operators. The problem sits within the area of information extraction, as it requires an approach to extract content from unstructured data in documents (e.g. PDFs) into structured fields containing very specific information.

With the proposed solution, experts can spend more time on other technical tasks and review the evaluation performed by the model, correcting errors if necessary, instead of running the evaluation from scratch by themselves. Furthermore, these corrections will serve to improve the training of the model in the future, applying techniques such as RLHF.

Technical solution

The most critical point was finding the LLM task that best suits the problem. Starting from a token classification approach and pivoting to an abstractive question answering task after a few weeks of research and problem understanding was definitely a win for the subsequent steps of the project.

Remember: one key point of success when building custom LLMs is to identify, with precision, the task(s) you want to address.

Abstractive question answering is a natural language processing (NLP) task that involves generating an answer to a given question in natural language by summarising and synthesising information from various sources, rather than just selecting an answer verbatim from the text.

The pre-trained LLM chosen to carry out this project was Google’s T5 model. This family of models comes in five sizes: Small, Base, Large, 3B and 11B, which correspond to the number of parameters in the architecture.

Given that the text associated with the project had a clear biomedical bias, we sought implementations of T5 that had previously been fine-tuned for this specific domain, in order to reduce training time and improve results with a smaller amount of data. To the best of our knowledge, T5-base-for-BioQA is the only publicly available T5 model fine-tuned on biological information (BioASQ) for the question answering downstream task.
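
For orientation, inference with such a model typically looks like the sketch below using the Hugging Face transformers library; the model id and the "question: … context: …" input format are assumptions to be checked against the actual model card of the biomedical T5 checkpoint used.

```python
# Minimal abstractive QA inference sketch with a seq2seq T5 checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "your-org/T5-base-for-BioQA"  # placeholder id; use the real model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def answer_question(question: str, context: str) -> str:
    inputs = tokenizer(
        f"question: {question} context: {context}",  # assumed input format
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```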

The training data was extracted from the ground truth database and the raw data, by matching already tagged pairs of questions and answers with the documents used for this purpose. We started with an initial PoC on 7 types of questions, the most representative ones based on different criteria, such as: type of question (Yes/No answers, free text, categories, etc.), frequency and variety.

Finally, the preprocessing and training pipelines were built within AWS SageMaker and deployed as an API endpoint to be consumed by other applications or processes. The POST request is a JSON body including the text of the document and a question or list of questions, and the response includes a list of answers. Stay tuned for further posts with more info on the technical implementation.
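
As a rough illustration, calling such an endpoint from Python could look like the sketch below; the endpoint name and the exact JSON schema are illustrative rather than the project’s actual contract.

```python
# Sketch of invoking a SageMaker endpoint that answers questions over a document.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "document_text": "…full text extracted from the PDF…",
    "questions": [
        "Is this a standard operating procedure?",
        "Which product does the document refer to?",
    ],
}

response = runtime.invoke_endpoint(
    EndpointName="custom-llm-qa-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
answers = json.loads(response["Body"].read())  # expected shape: {"answers": [...]}
```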

Model Evaluation & Results

The results of the PoC/MVP were excellent, and the subsequent human validation was very positive. The evaluation metrics used for this purpose were Exact Match (EM) and F1 score.
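
For readers unfamiliar with these metrics, the sketch below follows the common SQuAD-style definitions: Exact Match checks whether the normalised prediction equals the reference, while F1 measures token overlap, giving partial credit for partially correct answers.

```python
# Minimal Exact Match and token-level F1 for question answering.
from collections import Counter

def normalise(text: str) -> list[str]:
    return text.lower().strip().split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalise(prediction) == normalise(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred, ref = normalise(prediction), normalise(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(f1_score("stored below -20C", "samples are stored below -20C"))  # partial credit
```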

At the very beginning of the process, a human validation was carried out on a baseline model (a T5-base-for-BioQA model without fine-tuning) so we could extrapolate our experiments from there. The conclusion was that the model needed to be corrected 35% of the time and was right 65% of the time. The baseline model whose results were evaluated had an F1 score of 0.67.

For our new custom LLM, the best results obtained reached an aggregated F1 score of 0.89. The aggregated score was obtained considering all the question types and their individual F1 scores, with higher performance on the binary questions (above 0.9 F1 score) and lower on the free-text questions (around 0.75 F1 score in the worst case).

This means that by fine-tuning an LLM using custom training data we improved the model’s F1 score by around 20 percentage points (from 0.67 to 0.89), a great success.

Looking Ahead: The Future of Custom LLMs

As we dissect this pharmaceutical success story, it becomes evident that the integration of custom LLMs is not just a technological upgrade; it’s a strategic imperative. Aptitude’s efforts in this domain open a gateway for businesses across sectors to re-imagine their operations, fuel innovation, and stay ahead in an ever-evolving landscape.

Authors: F. Santos & M. Serrano.

Stay tuned for Part 2: “Custom LLMs in action: Architecting Custom LLMs for Scale and Impact” and Part 3: “Custom LLMs in action: Building AWS pipelines and deployment models for LLMs”.
