Fine-tuning LLMs efficiently: Part 2 — Sorting sequences matters, a potential time and money saver!

Timothy Lim
4 min read · Sep 26, 2023


Part 1 of this series: Simple Fixes to the Dataloader

As I mentioned in Part 1, one highly effective technique when fine-tuning a custom Large Language Model (LLM) is to collate properly and pre-sort your data.

In Part 2, we will explore the concept of sorting further, focusing on sorting your custom dataset based on sequence length. This approach has the potential to significantly reduce computation time, ultimately translating into cost savings, especially when utilizing cloud GPU instances.

Photo by Andre Taissin on Unsplash

Experiments

When fine-tuning a language model, the order in which you present your training data can significantly influence the training process. Arranging the data by sequence length prior to fine-tuning offers distinct advantages, primarily faster training runs and less computation wasted on padding short sequences up to the longest one in each batch.
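
To see why padding matters, here is a small toy sketch (my own illustration, not part of the original experiments) that counts how many pad tokens are needed when every batch is padded to its longest sequence, comparing a shuffled ordering with a length-sorted one. The function name, batch size, and randomly generated lengths are all hypothetical:

import random

def pad_tokens_needed(lengths, batch_size):
    # Each batch is padded to its own longest sequence; count the wasted pads.
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical token counts for 10,000 examples (illustration only).
lengths = [random.randint(32, 1024) for _ in range(10_000)]

shuffled = lengths[:]
random.shuffle(shuffled)
sorted_desc = sorted(lengths, reverse=True)  # descending order (reverse sort)

print("pad tokens, shuffled batches:", pad_tokens_needed(shuffled, batch_size=8))
print("pad tokens, sorted batches:  ", pad_tokens_needed(sorted_desc, batch_size=8))

With lengths grouped together, each batch contains sequences of similar size, so far fewer pad tokens are processed per epoch.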

I conducted a series of experiments using Llama-2-7B, focusing on the MMLU dataset. Interestingly, the results consistently showed that sorting the data in descending order (reverse sort) achieved significantly lower loss values than sorting it in ascending order. These experiments were motivated by my personal interest in fine-tuning the LLM for knowledge enhancement. However, it’s important to note that the MMLU dataset is relatively small and may not be representative enough to persuade the wider community to adopt reverse sorting as a best practice.

To provide more robust and meaningful insights that are applicable to a broader audience (given that most individuals may not be interested in the MMLU dataset), I expanded the experimentation to include Conversational and Instruction Data, which encompassed data from LIMA, Top-1 OpenAssistant Data, and Dolly, totaling approximately 25,000 examples. These experiments involved longer fine-tuning sessions of 10 epochs (2–3 days of GPU time per experiment) on a larger model (Llama-2-13B), with three distinct settings obtained by tweaking the Dataloader:

  1. Random Shuffle
  2. Ascending Order Sort
  3. Descending Order Sort

This broader setup makes the findings more widely applicable and better placed to inform practitioners in the field of language model fine-tuning.
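
For reference, the three settings need only a small change to how the data is ordered before it reaches the Dataloader. The sketch below is my own reconstruction (the exact training code is not shown in this post) and assumes the dataset stores its examples in a list of dicts with a "token_count" key, as in the implementation section further down:

from torch.utils.data import DataLoader

def make_loader(dataset, setting, batch_size=8):
    # Setting 1: random shuffle, handled by the DataLoader itself.
    if setting == "random_shuffle":
        return DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Settings 2 and 3: pre-sort the examples by token count and keep that order.
    reverse = (setting == "descending_sort")
    dataset.ann = sorted(dataset.ann, key=lambda k: k["token_count"], reverse=reverse)
    return DataLoader(dataset, batch_size=batch_size, shuffle=False)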

We can observe that the trend is much the same: reverse sorting beats ascending sorting by a good margin in terms of loss, while closely matching random shuffling.

However, randomly shuffling the sequences actually takes 50% more time per epoch compared to training on pre-sorted sequences! This is where the benefit of sorting your data beforehand comes in.

  • Note 1: There is a spike in time for the sorted runs in one of the epochs; I am not too sure of the reason for this.
  • Note 2: Experiments were run using LoRA (rank 8); a sketch of such a setup follows below.
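
Since only the rank is stated above, here is a minimal sketch of what a rank-8 LoRA setup could look like with the peft library; the target modules, alpha, dropout, and model checkpoint are my assumptions, not details from the original experiments:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,                                  # rank 8, as in the experiments
    lora_alpha=16,                        # assumed
    target_modules=["q_proj", "v_proj"],  # assumed
    lora_dropout=0.05,                    # assumed
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()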

Implementation

The implementation is very simple: you just have to add the sort to your Dataloader. Pseudocode:

class CustomDataset(Dataset):
    def __init__(self, ...):
        ...
        # Calculate the token counts of your data and save them into a
        # dictionary, or however you want to do it. In this case, the key
        # is "token_count".
        ...
        # Sort the examples by token count in descending order (reverse sort).
        self.ann = sorted(
            self.ann, key=lambda k: k["token_count"], reverse=True
        )
        ...

    def __getitem__(self, index):
        ann = self.ann[index]
        ...
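
To make the pseudocode above concrete, here is a self-contained sketch of the same idea. The field names ("text", "token_count"), the tokenizer, and the checkpoint are my assumptions; adapt them to however your dataset is stored:

from torch.utils.data import Dataset
from transformers import AutoTokenizer

class SortedTextDataset(Dataset):
    """Pre-sorts examples by token count in descending order (reverse sort)."""

    def __init__(self, examples, model_name="meta-llama/Llama-2-13b-hf"):
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Attach a "token_count" to each example, e.g. {"text": "..."}.
        self.ann = [
            {**ex, "token_count": len(tokenizer.encode(ex["text"]))}
            for ex in examples
        ]

        # Longest sequences first.
        self.ann = sorted(self.ann, key=lambda k: k["token_count"], reverse=True)

    def __len__(self):
        return len(self.ann)

    def __getitem__(self, index):
        return self.ann[index]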

Thoughts

My initial idea was to do reverse sorting of data because:

  1. It would definitely save compute time.
  2. I thought it would actually perform better than random shuffling (currently disproved by the experiments I have run, as performance in terms of loss is nearly equivalent).

The reasons I thought reverse sorting would be better are:

A) The model is exposed to longer sequences first, which are richer in potential features.

B) Learning from richer features (longer sequences) first should allow it to more easily learn the simpler patterns in shorter sequences later on.

My intuition is that most datasets contain short and long sequences covering similar concepts. Thus, learning a concept from the longer sequences first should be better, as it can generalise to the shorter sequences. For example, a long instruction example could contain Wikipedia context about President Obama with an instruction to summarise this information, while a short instruction example could directly ask the question: Who is Barack Obama?

C) Gradient updates are more stable, since sequences of similar length contribute to the loss function in every batch.

I was thinking that random shuffling would have the problem of some batches containing a poor mix of short and long sequences.

Conclusion

Photo by Kolleen Gladden on Unsplash

Fine-tuning a language model is an iterative journey, where you may need to experiment with various strategies to discover the most effective approach for your specific task. The arrangement of sequences, be it in ascending, descending, or random order, is just one of the many variables that can impact the success of your fine-tuning process.

I encourage you to try sorting your data in descending order before starting the fine-tuning process for your custom dataset, and observe whether this adjustment affects your model’s performance. If you find that it doesn’t, you can confidently fine-tune on your custom dataset pre-sorted in descending order and gain significant computational efficiency. This not only saves you valuable processing time but also translates into substantial cost savings!

I look forward to hearing from others about their experiments with this technique!
