How to bootstrap your way to language model personalization

Aaron Berdanier
3 min read · Dec 30, 2023

Data are often considered the lifeblood of a machine learning model’s success, especially with language models. More data typically leads to better results.

But what if you’re just getting started on your AI journey? You can still harness the power of language models to support tasks like summarizing articles or revising grant submissions in a personalized way.

It is all about creating your own data flywheel.

The Data Flywheel Concept

Before we dive in, let’s first understand the concept of a data flywheel. A data flywheel is a strategy for feeding a model’s outputs back into the process so that they improve its future outputs.

It is a loop of data collection, model training, and refinement that becomes self-sustaining and improves over time. The technique creates a virtuous cycle: more data in now means better outputs later.

Leveraging GPTs

Pre-trained transformer models give your data flywheel a boost. I like to think about this as an inverted pyramid. We start at the narrow bottom and spiral upward toward a model that is truly personalized:

  • Get Started with Zero-Shot Inference: Start by using the pre-trained model for zero-shot inference. This means you input a prompt or query without any extra data and get results fresh out of the box. They won’t be perfect, but you will refine them to set up better results in the future.
  • Feed in a Few Examples: To improve the model’s performance, you provide a few examples of your task alongside the prompt. This extra context helps the model generate more relevant responses. The examples you feed in can be the revised outputs from the previous step (see the sketch after this list).
  • Build Up to Fine-Tuning: Then, as you collect more data over time (on the order of hundreds of examples), you can start fine-tuning the pre-trained model. Fine-tuning lets you adapt the model to your specific task, making it more accurate and tailored to your needs.
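
Here is a minimal sketch of the first two steps, assuming the OpenAI Python SDK and the paragraph-revision task described later in this post. The model name, prompt wording, and function names are my placeholders, not a prescribed setup.

```python
# A minimal sketch of steps 1 and 2, assuming the OpenAI Python SDK (`pip install openai`)
# and an OPENAI_API_KEY in your environment. The model name and prompts are placeholders;
# swap in whatever model and task you are actually using.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any chat-capable model works here

def revise_zero_shot(paragraph: str) -> str:
    """Step 1: zero-shot -- just the instruction and the input, no examples."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Revise the paragraph to make it easier to understand."},
            {"role": "user", "content": paragraph},
        ],
    )
    return response.choices[0].message.content

def revise_few_shot(paragraph: str, examples: list[tuple[str, str]]) -> str:
    """Step 2: few-shot -- prepend (original, revised) pairs you collected earlier."""
    messages = [{"role": "system", "content": "Revise the paragraph to make it easier to understand."}]
    for original, revised in examples:
        messages.append({"role": "user", "content": original})
        messages.append({"role": "assistant", "content": revised})
    messages.append({"role": "user", "content": paragraph})
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content
```

As the library of revised examples grows, the same function call simply carries more context, which is the flywheel starting to turn.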

Setting Up Your Data Collection System

The key to building your data flywheel is to establish a reliable and easy data collection system. From the beginning of zero-shot inference, you need to have a way to collect the inputs and the outputs that you want.

Let’s use the example of revising a paragraph. You have a paragraph of text, which is your input. Then you submit it to a language model with a prompt that says something like “Revise this paragraph to make it easier to understand.”

You’ll get an output that you need to clean up (revise the model’s revision to match what you want to see) and then save everything: the original input and the revised output. Over time you build a library of examples that you can use for training down the line.
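
One way to make this concrete is to append every example to a single JSONL file. This is a sketch, not the author’s exact setup; the file name and field names are arbitrary choices.

```python
# A sketch of the simplest possible collection system: one JSONL file where each line
# is a training example. The file name and field names are arbitrary, not a standard.
import json
from pathlib import Path

LIBRARY = Path("revision_examples.jsonl")

def save_example(original: str, model_output: str, final_revision: str) -> None:
    """Append one example: the input, what the model produced, and what you kept."""
    record = {
        "input": original,
        "model_output": model_output,
        "output": final_revision,  # the cleaned-up version you actually want
    }
    with LIBRARY.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_examples() -> list[dict]:
    """Read the library back for few-shot prompting or later fine-tuning."""
    if not LIBRARY.exists():
        return []
    with LIBRARY.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The important part is the habit: every time you use the model, one more cleaned-up example lands in the library.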

Implementation Options

Now that you have a strategy in place, it’s time to implement it. There are a few different paths:

  • OpenAI APIs: The proprietary OpenAI API is a quick and convenient way to start. It gives you hosted access to GPT models through a simple web API, and your collected library can be reshaped into its fine-tuning format (see the sketch after this list).
  • Open Source Models: If you want more control or ownership of your models and data, open-source models like Mistral are a great choice. You can fine-tune these models on your own data and customize them to suit your specific requirements.
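
For the OpenAI route, here is a sketch of turning the collected library into the chat-formatted JSONL that OpenAI’s fine-tuning workflow expects. It reuses the load_examples() helper from the collection sketch above; the instruction text and output file name are placeholders.

```python
# A sketch of converting the collected library into OpenAI's chat fine-tuning format:
# one JSON object per line with a "messages" list. Reuses load_examples() from the
# collection sketch above; the instruction text is a placeholder for your own prompt.
import json

INSTRUCTION = "Revise the paragraph to make it easier to understand."

def build_finetune_file(examples: list[dict], out_path: str = "finetune_train.jsonl") -> None:
    """Write one chat-formatted training example per collected (input, output) pair."""
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {
                "messages": [
                    {"role": "system", "content": INSTRUCTION},
                    {"role": "user", "content": ex["input"]},
                    {"role": "assistant", "content": ex["output"]},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Usage (assuming the library from the collection sketch):
# build_finetune_file(load_examples())
# Then upload finetune_train.jsonl through the OpenAI fine-tuning workflow.
```

A similar conversion works for open models; only the target format changes.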

If you want to explore this, I can get you started (email me)!

Conclusion

Building your data flywheel for a language model doesn’t require a massive initial dataset. By setting up a data collection system and intelligently leveraging pre-trained models, you can start building your own dataset while getting useful help along the way.

Whether you choose private APIs or open source models, the key is to start small and iterate, allowing the data flywheel to gain momentum and continuously improve over time.
