Our journey to fine tune a language model and build our moat.

Guillaume Barrois · Published in Explain AI
Aug 5, 2023 · 11 min read

At Explain, we help companies do business with governments. Public entities don’t talk to companies, they write — a lot! One major pain point for companies interacting with them is that they have to collect, search and read piles of documents to identify new public projects or tenders, learn about relevant decisions, or find information about their clients or their competitors. We think that artificial intelligence can do the boring part of those tasks by automatically identifying the relevant documents and extracting the exact information our clients are looking for.

Imagine you work in public water treatment, waste management or renewable energies. Think about software that would automatically identify a commercial opportunity in a city based on city council reports, gather available data about the project and the client, and help you write an answer to the call for tender, carefully crafted to take all the relevant information into account. Wouldn’t that be amazing? That is what we are building at Explain.

Reading administrative documents is long and boring

Let’s take a specific example: you work for a company that builds wind farms. Part of your job consists in monitoring all wind farm projects, to ensure that if a competitor’s project fails, you understand why and do not make the same mistake in your next project.

Therefore, you have to identify and read documents like the one below, in which a département refuses to authorize a wind farm (a decision known in French as an “arrêté préfectoral”).

This kind of administrative document is a pain to read: most of the text is spent referring to the codes, reports and law articles the decision is based on. The important information, namely the motivation behind the decision, only appears on the second page, hidden in four paragraphs (circled in red). In this case, the Ministry of the Armed Forces refused to give its approval because the planned wind farm was too close to a military radar and could disturb its operation.

Many administrative documents share this common challenge: the important information is buried in a maze of documents, expressed in dense jargon and intricate legal terms. While our existing NLP system effectively addresses some of these issues for our clients — such as document retrieval, classification, and named entity recognition — we were keen to explore whether the latest advancements in generative AI could push the boundaries. Could they help us more accurately pinpoint and distill key sections within these documents? And if they can, is it necessary to rely on major providers like OpenAI? Or is it possible to develop our own model?

Can artificial intelligence do the job?

Over a year ago, we tried to tackle the “rephrase a complex document into easy-to-read paragraphs” problem using what was at that time the state of the art in abstractive summarization: a pre-trained RoBERTa language model fine tuned on a summarization task.

Our investigations at the time are documented in https://medium.com/le-blog-explain/using-state-of-the-art-summarization-to-cut-through-tedious-documents-1980c21ed88a. We were quite happy with the quality of the results, but we had to build a complex system to handle hallucinations, deal with some systematic mistakes, and exclude documents that were too hard to summarize. We had the feeling that human-quality summarization was still years away. How wrong we were!

GPT4 writes a better summary than you do

Let’s ask GPT4 to summarize the document above. This is the output:

Wow. GPT4 clearly understood the document and the context very well, and it extracted the right information from it. It did not make any factual mistakes, did not hallucinate, and overall, its summary is better than what most of us would have written.

This result holds when tested on a larger corpus: GPT4 summaries are consistently very good. They never make up facts and important elements are correctly identified and rephrased.

This could be the end of the story: just apply GPT4 to every document and let it do its magic. But is it?

Why we cannot build our product on GPT

If you are an AI company, you have to choose between using proprietary models behind an API, such as GPT or Claude, in your product, or using models that you can customize and deploy on your own infrastructure.

For us, it is not viable to use an external API, for many reasons:

  • Cost at scale: since the business model of these APIs is based on consumption, costs do not scale well. For use cases where millions of documents need to go through the API, this becomes significant.
  • Competitive advantage: if you are using the same model as everyone else, it is hard to build a moat.
  • Privacy: sharing your or your clients’ data with external parties such as OpenAI can be a deal breaker.
  • Control over model behavior: as you do not control the model, its behavior and its terms of use can change and affect your use cases. The recent claims about the decline in performance of GPT4 illustrate this risk.

The question is therefore: can we obtain performance that comes close to proprietary models, without paying the cost? That is what we tried to find out.

Can you have your cake and eat it too? Yes, if you fine tune

Fine tuning is not a new idea: earlier language models such as BERT, RoBERTa or XLNet had to be fine tuned on a specific task to acquire useful capabilities (classification, NER, summarization…).

The idea is simple: take a model that has already been trained, and feed it a dataset of examples of the task you want it to accomplish, in order to adjust it. For instance, in our case we would give the model a set of (“administrative document” ⇒ “summary”) pairs.
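Concretely, each training example pairs a document excerpt with the summary we want the model to produce. Here is a minimal sketch of what a single record could look like (the prompt template and field names below are illustrative, not the exact ones we used):

```python
# Illustrative only: one possible format for a single fine-tuning example.
training_example = {
    "prompt": (
        "### Document:\n"
        "Arrêté préfectoral portant refus d'autorisation ... (document excerpt) ...\n\n"
        "### Résumé:\n"
    ),
    "completion": "Le ministère des Armées a refusé son autorisation car le parc éolien ...",
}
```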

Fine tuning fell out of fashion for a period due to the rise of super capable, hyper smart large language models, but it is now regaining some tailwind, for several good reasons:

  • Methods for fine tuning have improved a lot in the last few months.
  • Several papers (the best known being the Alpaca paper) demonstrated that the best open source language models can reach performance similar to the best proprietary language models if they are fine tuned.
  • Fine tuning allows for a strategy called model distillation: use a very capable large language model to generate a training set, and transfer part of its capabilities to a smaller, cheaper model through fine tuning.

The promise is appealing: get close to GPT4 performance by training our own, much smaller, fine tuned models, using a process that is relatively fast, cheap and completely automatic. On top of that, our model would be hard to replicate, because it would require our competitors to have access to our data. Does this plan work in practice?

Practical fine tuning in 4 steps

Step 1 — Generate a training set

In order to fine tune a model for the summarization of administrative documents, you need to build a dataset of… well, summaries of administrative documents.

Luckily for us, as we saw earlier, very large language models are insanely good at summarization. It was therefore quite easy to generate our training set: we gave GPT4 excerpts of administrative documents representative of the ones our clients read, asked the model to provide a summary, and reviewed a sample to make sure the quality was high enough. Overall, we created a dataset of 2,000 (excerpt, summary) pairs.
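For readers who want to reproduce this step, here is a minimal sketch of what the generation loop could look like, using the openai Python client. The model name, prompt wording and file names are assumptions chosen for illustration, not the exact ones we used:

```python
# Minimal sketch of generating (excerpt, summary) pairs with GPT-4.
# The prompt, model name and file paths are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def summarize(excerpt: str) -> str:
    """Ask GPT-4 for a short, factual summary of an administrative excerpt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You summarize French administrative documents factually and concisely."},
            {"role": "user", "content": f"Summarize the following excerpt:\n\n{excerpt}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

with open("excerpts.jsonl") as src, open("training_set.jsonl", "w") as dst:
    for line in src:
        excerpt = json.loads(line)["text"]
        pair = {"excerpt": excerpt, "summary": summarize(excerpt)}
        dst.write(json.dumps(pair, ensure_ascii=False) + "\n")
```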

Step 2 — Choose a base LLM

When you go to a page compiling the performance of open LLMs, such as https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, there is a huge number of models to choose from and it can be a bit overwhelming.

We did not spend too much time trying to select the model with the highest performance. Instead, we chose one that was well documented, widely used for fine tuning, and that could safely be put in production: RedPajama, a replication of Llama with a more permissive license. More specifically, we chose the 7B model, because we thought it would be faster and easier to use. If we were starting today, we would probably try Meta’s Llama 2.

We took the base model rather than the instruction-tuned one, because previous experiments (with Flan-T5 and T5) had shown that instruction-tuned models behave worse after further fine tuning. Furthermore, our task is narrow and does not require our model to be good at following instructions.

Step 3 — Implement a fine tuning script

At the time of our experiment, the best available method to implement a fine tuning script was LoRA. Since then, it has been surpassed by QLoRA, a quantized version of LoRA, which is also available in the PEFT library. We used https://www.philschmid.de/ and https://github.com/philschmid to find implementation examples, which became the basis of our own script.

While it was useful to rely on such examples, the road from there was still very bumpy: we had to find the right parameters, the most suitable versions of the different libraries, the best way to formulate our prompt… We spent a lot of time iterating until we were able to run the complete process, from fine tuning to endpoint deployment.

In many cases, you only realize that something is wrong after running the whole process. It is therefore good practice to begin with a very small fine tuning dataset, in order to reduce the duration of each iteration.
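To give an idea of what such a script looks like, here is a stripped-down sketch of LoRA fine tuning with the transformers and peft libraries. The model id, hyperparameters, prompt format and file paths are assumptions for illustration, not the exact values we used:

```python
# Stripped-down LoRA fine-tuning sketch (illustrative hyperparameters and paths).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "togethercomputer/RedPajama-INCITE-7B-Base"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16,
                                             device_map="auto")

# Wrap the base model with low-rank adapters: only a tiny fraction of weights is trained.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["query_key_value"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Each record holds an "excerpt" and a "summary"; concatenate them into one training text.
def tokenize(example):
    text = (f"### Document:\n{example['excerpt']}\n\n"
            f"### Résumé:\n{example['summary']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=2048)

dataset = load_dataset("json", data_files="training_set.jsonl", split="train").map(tokenize)

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(output_dir="redpajama-7b-summaries",
                           per_device_train_batch_size=1, gradient_accumulation_steps=8,
                           num_train_epochs=3, learning_rate=2e-4, logging_steps=10),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.model.save_pretrained("redpajama-7b-summaries")  # saves only the LoRA adapters
```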

Step 4 — Costs and inference

Thanks to LoRA and the PEFT library, fine tuning is cheap. In our case, with a 7B-parameter model and ~2,000 data points in the fine tuning set, fine tuning took 4 hours on a ~$2/h instance.

The main technical challenge was to deploy the model on a machine with good latency. This is easier said than done. On AWS, to host a 7B fine tuned model, we used ml.m5.2xlarge instances, which cost $0.55 each. With those instances, latency was OK but not great: around 1 second per 10 tokens. So, depending on the number of inferences you want to run, and how you manage your instances, you can incur large costs on a yearly basis.

Going forward, inference costs are likely to drop, because models will continue to get more efficient and hardware will improve. In the meantime, though, hosting costs are not negligible and should be computed.
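As a rough, purely illustrative back-of-the-envelope calculation (assuming the $0.55 price above is hourly, one instance running around the clock, and summaries of about 150 tokens):

```python
# Back-of-the-envelope hosting cost, using the figures quoted above.
# Assumes the $0.55 price is per hour and one instance runs 24/7.
hourly_cost = 0.55          # USD per ml.m5.2xlarge instance
hours_per_year = 24 * 365
yearly_cost = hourly_cost * hours_per_year
print(f"One always-on instance: ~${yearly_cost:,.0f} per year")   # ~$4,818

# At ~10 tokens/second, an assumed 150-token summary takes ~15 seconds,
# so a single instance can produce at most ~2 million summaries per year.
summaries_per_year = (hours_per_year * 3600) / 15
print(f"Upper bound: ~{summaries_per_year:,.0f} summaries per year per instance")
```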

Results

4 iterations with increasing training set sizes

We wanted to compare our fine tuned models to GPT4, and we also wanted to get some intuition about the size of the fine tuning set required to obtain good results. We therefore fine tuned our RedPajama 7B model four times, with four training set sizes (200, 800, 1400 and 2000 samples).

Evaluation

To evaluate the performance of our models, we used human evaluation, which remains the gold standard for judging natural language generation models. We compared our models to GPT4: we generated summaries with the five models (GPT4, 200, 800, 1400, 2000) for 60 documents from a test set. An annotator then ranked the summaries blindly (two models could be ranked equal).

Model comparison

Once we had ranked the models for each document, we computed the number of times each model won, tied, or lost against the GPT4 summary. Results are presented in the graph below.
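The aggregation itself is straightforward. Here is a small sketch of how such win/tie/loss counts could be computed from the per-document rankings (the data structure and model names below are assumptions, not our actual annotation file):

```python
# Sketch of turning per-document rankings into win/tie/loss counts against GPT-4.
from collections import Counter

# One entry per document: lower rank = better summary; ties share the same rank.
rankings = [
    {"gpt4": 1, "ft-200": 2, "ft-800": 1, "ft-1400": 1, "ft-2000": 1},
    {"gpt4": 1, "ft-200": 3, "ft-800": 2, "ft-1400": 1, "ft-2000": 1},
    # ... 60 documents in total
]

results = {model: Counter() for model in ["ft-200", "ft-800", "ft-1400", "ft-2000"]}
for doc in rankings:
    for model, counts in results.items():
        if doc[model] < doc["gpt4"]:
            counts["win"] += 1
        elif doc[model] == doc["gpt4"]:
            counts["tie"] += 1
        else:
            counts["loss"] += 1

for model, counts in results.items():
    total = sum(counts.values())
    print(model, {k: f"{100 * v / total:.0f}%" for k, v in counts.items()})
```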

  • As expected, the more examples there are in the dataset, the better the model gets.
  • All things considered, even the 200-samples model is quite strong (more than I would have expected)! More than 50% of its summaries are of similar quality to GPT4’s.
  • When you look qualitatively at the summaries that lost against GPT4, they are still of honorable quality, even for the 200-samples model: hallucinations and factual errors are very rare (and almost non-existent for the 800, 1400, and 2000-samples models).
  • There is a wide gap in performance between the 800-samples model (which generates around 50% of summaries judged worse than GPT4) and the 1400-samples model (where that fraction drops to 30%), suggesting a threshold effect.
  • We did not push further, but it is possible that increasing the number of samples would still improve performance, as there remains a substantial gain between the 1400 and 2000-samples models. At what point performance plateaus remains an open question.

Conclusion

At Explain, we believe that Large Language Models can perform tasks that will radically change the daily life of many people in the business world. However, leveraging LLM capabilities via a proprietary API is not a viable strategy for many companies, so we wanted to evaluate whether fine tuning is a viable alternative.

We studied one particular task: summarization. The results shown above are very encouraging: fine tuned models closely approach GPT4 performance, for a small training cost and with a model that is 100x smaller. This is very impressive: at scale, you get 90% of the performance for 1% of the cost.

Have we found the new paradigm for 99% of LLM companies? Will companies work with many small models, each fine tuned on one of their specific tasks? We don’t know yet: more complex tasks may require very big models. Or maybe providers will offer very performant small models. But one thing is already sure: fine tuning is a very robust solution. Indeed,

  • performance on specific tasks is comparable to proprietary models,
  • fine tuning is cheap and, once you have done it once, easy to repeat on other tasks,
  • it lets you build a moat around your proprietary data, which you can use to generate a proprietary fine tuning dataset,
  • you keep control over your model and the data you feed it.

Moreover, as we have seen recently with Llama 2, this approach has strong supporters in academia and in the private sector. As new off-the-shelf models continue to appear, fine tuning will get even more efficient and inference costs will drop further.
