Instruction-tune models using your own data with txtinstruct

Generate datasets and train instruction-following models

David Mezzetti
NeuML

--

Generative AI has stormed onto the scene in 2023. ChatGPT and other models have opened many people’s eyes to how far Large Language Models (LLMs) have come. The use cases seem unbounded, with many projects springing up around this new ecosystem.

A few problems have emerged, though. The first is that in order to use a hosted service like ChatGPT, you have to share your data. In many cases, this isn’t a huge issue, but if you’re working with sensitive or proprietary data, it is.

The second issue is that most hosted models are designed to do well across the broadest possible set of use cases, which makes sense. But if you’re working in a domain with very specific language, such as medicine, finance or engineering, being able to customize how a model responds is crucial.

The third issue is generating factual responses that tie back to a trusted data source. Model hallucination is the term for when an LLM essentially makes up a confident-sounding but wrong response.

And the last issue is that many open instruction-tuning datasets such as Alpaca and open models such as LLaMA are licensed for non-commercial use.

Introducing txtinstruct

txtinstruct, a framework for training instruction-tuned models, was created to address these concerns.

The objective of txtinstruct is to support open data, open models and integration with your own data. txtinstruct makes it easy to build your own instruction-following datasets and use those datasets to train instruction-tuned models.

A full end-to-end example can be found in this notebook. How this process works is explained below.

Generating instructions

{
    "context": "In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.",
    "statements": [
        {
            "source": "What is a type of linear classifier?",
            "target": "binary classifier"
        },
        {
            "source": "Tell me about Perceptron",
            "target": "an algorithm for supervised learning of binary classifiers"
        },
        {
            "source": "Tell me about Machine learning",
            "target": "I don't have data on that"
        }
    ]
}

The first component needed to generate an instruction-tuning dataset is a statement generation model, i.e. a model that takes a paragraph and generates a question or descriptive statement about it. This covers both questions (Who/What/When/Where/How) and descriptive statements like “Tell me about”, “Describe the” and “Explain how”.
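To make this concrete, here is a rough sketch of what calling a statement generation model could look like using txtai’s Sequences pipeline. The model name and prompt below are placeholders for illustration, not the specific model or input format txtinstruct uses.

from txtai.pipeline import Sequences

# Placeholder model name, for illustration only
generator = Sequences("statement-generation-model")

context = (
    "In machine learning, the perceptron (or McCulloch-Pitts neuron) is an "
    "algorithm for supervised learning of binary classifiers."
)

# Prompt wording is also a placeholder, a real model defines its own input format
print(generator(f"Generate a question for the following text: {context}"))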

Next, there needs to be a context data source. This is the proprietary, custom or protected dataset you want to create specific instructions for. An example is the txtai-wikipedia dataset, which covers all of English Wikipedia and can be used as a fact data source for generating statements.
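For reference, txtai-wikipedia is distributed as a txtai embeddings database. Below is a minimal sketch of loading it and pulling a few articles to use as contexts; the provider and container arguments are assumptions based on txtai’s Hugging Face Hub storage support.

from txtai.embeddings import Embeddings

# Load the txtai-wikipedia embeddings database from the Hugging Face Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Pull a few articles to use as contexts for statement generation
contexts = [result["text"] for result in embeddings.search("machine learning", 3)]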

The last piece of this step is an LLM that can generate target statements. Selection of this model is important. It doesn’t have to be 100% accurate, but care should be taken to ensure it can be used in the setting you intend (i.e. commercial use). The FLAN-T5 series of models is a good place to start. txtinstruct provides the following default prompt, which can be customized.

Answer the following question using only the context below. Give a detailed answer.
Say 'I don't have data on that' when the question can't be answered
Question: {statement}
Context: {context}
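As a rough sketch of how this prompt could be filled in and run, the snippet below uses txtai’s Sequences pipeline with a FLAN-T5 model; txtinstruct may invoke the LLM differently internally.

from txtai.pipeline import Sequences

# Example statement and context taken from the generated record above
statement = "What is a type of linear classifier?"
context = (
    "In machine learning, the perceptron (or McCulloch-Pitts neuron) is an "
    "algorithm for supervised learning of binary classifiers."
)

# Fill the default prompt template
prompt = (
    "Answer the following question using only the context below. Give a detailed answer. "
    "Say 'I don't have data on that' when the question can't be answered\n"
    f"Question: {statement}\n"
    f"Context: {context}"
)

# Generate the target statement with a FLAN-T5 model
llm = Sequences("google/flan-t5-base")
print(llm(prompt))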

One last item to note here: not answering questions is just as important as answering them. When generating data, txtinstruct will in some cases pair questions with randomized contexts to generate “unanswerable” questions for a given context. This ensures the model is willing to say “I don’t know”.
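The idea can be illustrated with a small sketch (not txtinstruct’s actual implementation): pair each statement with a context drawn from a different record and set the target to the fallback response.

import random

# Illustrative sketch only, not txtinstruct's internal code. Assumes records
# shaped like the JSON example above and at least two records.
def unanswerable(records):
    samples = []
    for record in records:
        # Pick a context from a different record
        other = random.choice([r for r in records if r is not record])
        for statement in record["statements"]:
            samples.append({
                "source": statement["source"],
                "context": other["context"],
                "target": "I don't have data on that"
            })
    return samples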

Model Training

Generating the dataset is the hard part of this process. Once the dataset is generated, the remaining task is selecting the right model to train. A model with fewer parameters than the one used to generate the dataset is usually a good choice. This produces a model with quicker response times that is still able to generate accurate responses.

txtinstruct has a built-in trainer that makes it easy to take a generated dataset and train an instruction-following model.

import json

from txtinstruct.models import Instructor

# Read in generated dataset
with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

# Instruction-tune model
instructor = Instructor()
model, tokenizer = instructor(
    "google/flan-t5-small",
    data,
    "sequence-sequence",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=128 // 8,
    num_train_epochs=3,
    logging_steps=100,
)

This trained model can be used with standard txtai pipelines as shown below.
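The extractor below pairs the txtai-wikipedia embeddings with the newly trained model. Here is a minimal sketch of how it might be constructed; passing the (model, tokenizer) tuple from the training step is an assumption here, a saved model path could be used instead.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Load the embeddings database used as the knowledge source
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Pair the embeddings with the instruction-tuned model from the training step
extractor = Extractor(embeddings, (model, tokenizer))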

extractor([{
    "query": "Tell me about Linux",
}])

[{'answer': 'Linux (or ) is a family of open-source Unix-like operating
systems based on the Linux kernel, an operating system kernel first released
on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a
Linux distribution, which includes the kernel and supporting system software
and libraries, many of which are provided by the GNU Project.'}]

extractor([{
    "query": "What is the weather in Phoenix today?",
}])

[{'answer': "I don't have data on that"}]

As mentioned above, a full working example of this can be found in this notebook.

Wrapping up

This article briefly introduced txtinstruct, a framework for training instruction-tuned models. We’re excited to see how the community uses txtinstruct to train domain-specific instruction-following models!

--

David Mezzetti
NeuML

Founder/CEO at NeuML. Building easy-to-use semantic search and workflow applications with txtai.