$ Cost of LLM continued pre-training

Gili Nachum
2 min read · Feb 13, 2024


How much will it cost you to do continued pre-training for a small (7B) LLM?

What is continued pre-training?

Continued pre-training refers to further pre-training of an existing base LLM on additional unstructured data that is relevant for your domain. For example, you may have an off-the-shelf 7B-parameter language model that was pre-trained on general web data. You could bring in domain-specific data, such as legal or medical documents, as input for additional pre-training before fine-tuning. This helps the model better understand the new domain and improves downstream performance.
The output of continued pre-training is a base model, which you can use directly with few-shot prompting, or further align for specific tasks such as instruction following or chat, using fine-tuning methods like SFT, RLHF, or DPO.

Key Factors Impacting Cost

  1. Dataset size — More data (measured in tokens) means longer training times and more compute resources needed.
  2. Epochs — More epochs means more full passes through the dataset, increasing training time and compute usage.
  3. LLM size — Bigger models have more parameters to update each iteration, slowing down each step of training.
  4. Hardware compute power (flops) — More flops means faster iteration times and potentially reduced overall training duration.
  5. Hardware GPU memory size — Bigger GPU memory is needed to fit larger model batches, influencing total hardware cost.
  6. Software frameworks — The efficiency of your distributed training stack (e.g., PyTorch with FSDP) affects throughput.
  7. Software optimizations like 8-bit quantization — These speed up training by reducing precision, but can impact accuracy. (According to this reddit post, PEFT methods (like LoRA) can also be used for pre-training).
  8. Training speed — Faster training is more expensive, since it requires more parallel hardware to increase throughput, with diminishing returns and added complexity at some point.
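The factors above can be folded into a rough first-order cost model. Here is a minimal sketch (the function name and parameters are my own, not from any library):

```python
def estimate_cost_usd(dataset_tokens, epochs, tokens_per_sec, price_per_hour):
    """Back-of-envelope continued pre-training cost.

    dataset_tokens: tokens in one pass over the dataset
    epochs: number of full passes over the data
    tokens_per_sec: measured training throughput on your hardware
    price_per_hour: instance price in USD/hour
    """
    total_tokens = dataset_tokens * epochs
    hours = total_tokens / tokens_per_sec / 3600
    return hours * price_per_hour
```

This ignores real-world overheads like checkpointing, restarts, and data-loading stalls, so treat the result as a lower bound.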

Cost Estimate Example

Reading the paper Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models (Table III and Table IV), we can see that training on 8xA100 80GB (a p4de.24xlarge on AWS) achieves a throughput of roughly 1K-10K tokens/s for 7B-13B LLMs (an A800 GPU is roughly equivalent to an A100).

Let’s assume a dataset of 5,000 100-page PDFs (500K pages). If we assume 500 words/page, that’s 250M words, or ~333M tokens (assuming ~0.75 words per token). With 3 epochs, that’s ~1B tokens to process in total.
Assuming an AWS p4de.24xlarge can sustain 5K tokens/sec, processing 1B tokens takes 200,000 seconds (1,000,000,000 / 5,000), or ~56 hours. A p4de.24xlarge’s on-demand public price on AWS is ~$40.9/hour, so the total cost is ~$2,272 (200,000 / 3,600 × 40.9).
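As a sanity check, here is the same arithmetic spelled out in Python (variable names are mine):

```python
pages = 5_000 * 100            # 5,000 PDFs x 100 pages = 500K pages
words = pages * 500            # 500 words/page -> 250M words
tokens = words / 0.75          # ~0.75 words per token -> ~333M tokens
total_tokens = tokens * 3      # 3 epochs -> ~1B tokens to process
seconds = total_tokens / 5_000 # 5K tokens/s on a p4de.24xlarge -> ~200K s
hours = seconds / 3600         # ~56 hours
cost = hours * 40.9            # ~$2,272 at on-demand pricing
```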

Additional hidden costs

You’ll also have to take into account these additional costs:
- Pre-processing the input data: scraping, transforming, and cleaning.
- Aligning the trained base model (supervised fine-tuning, RLHF, DPO, etc.).
- Serving the model.

Prompt “an industrial machine digesting a stream of $ coins and outputing a stream of english words and letters” — Titan Image Generator G1
