Estimating GPT3 API Cost

790 requests/$

Pratik Bhavsar | @nlpguy_
Modern NLP
3 min read · Aug 26, 2020


Come join Maxpool — A Data Science community to discuss real ML problems!

Do you know how much the GPT3 API will cost?

A rough calculation tells me it can serve a maximum of ~790 requests/$.

GPT3 is pretty huge (175B parameters ≈ 700GB in fp32) and you know how costly GPU inference can be. Even if we find a use case for it, we still need to justify the ROI. There are many blogs on its potential applications, but I haven’t found anything on its pricing.

Let’s try to guess it with the fundamentals of cloud pricing.

Note: You can use this methodology to calculate the API cost for any model. People also like to use the AWS TCO (Total Cost of Ownership) calculator, but I enjoy doing it manually.

STEP 0 — Use case

Transformer self-attention is quadratic in sequence length, so it’s extremely crucial to decide on the use case up front: the use case determines the sequence length.

The best use case for GPT3 is text generation given a prompt.

The prompt can be of any length, but 128 tokens is a sensible guess. People also generate recursively, appending the previously generated text to the prompt to generate more.

GPT3 accepts sequences up to 2048 tokens, but due to the quadratic nature of the transformer, longer sequences make inference even costlier.

Let’s fix the seq length to 128 and then use scaling to calculate the cost for 1024 (the longest length covered by the GPT2 benchmarks).
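To put numbers on “quadratic,” here is a quick back-of-the-envelope comparison of the two sequence lengths. Only the self-attention term is truly quadratic; the feed-forward layers scale roughly linearly, so end-to-end latency lands somewhere in between, which is why the GPT2 benchmark below measures ~10x rather than 64x.

```python
# Back-of-the-envelope scaling between the two sequence lengths considered.
# Self-attention FLOPs grow as seq_len**2; feed-forward layers grow roughly
# linearly, so end-to-end latency lands somewhere between the two bounds.
short_seq, long_seq = 128, 1024

print((long_seq / short_seq) ** 2)  # 64.0 -- attention-only worst case
print(long_seq / short_seq)         # 8.0  -- linear lower bound
# The GPT2 benchmark below measures ~10x end to end (0.195s vs 0.02s).
```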


STEP 1 — Getting GPT2 inferences per hour

Assumptions

  • Seq length — 128
  • GPU + XLA inference on TensorFlow
  • V100 GPU instance
  • 12 vCPUs, 40GB of RAM
  • Batch size — 8

From the HuggingFace benchmark sheet, GPT2 gets an inference time of 0.02s for a batch size of 8 on TensorFlow GPU + XLA.

Hence it can serve 8 × 3600/0.02 = 1,440,000 inferences/hour.
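As a sanity check, here is the same arithmetic in a few lines of Python (the 0.02s figure is the benchmark number above, not measured here):

```python
# Step 1: GPT2 throughput from the HuggingFace benchmark numbers above.
batch_size = 8       # sequences per forward pass
latency_s = 0.02     # seconds per batch: GPT2, TF GPU + XLA, seq length 128

gpt2_inferences_per_hour = batch_size * 3600 / latency_s
print(f"{gpt2_inferences_per_hour:,.0f} inferences/hour")  # 1,440,000
```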

STEP 2 — Getting GPT3 inferences per hour

GPT2: 1.5B parameters

GPT3: 175B parameters

Since GPT3 cannot fit on one GPU, it’s split across many. For simplicity, let’s assume we can extrapolate the inference time linearly, although multi-GPU inference can be slower due to the communication overhead of passing activations from one GPU to another.

Equivalent GPT3 inferences/hour/GPU

= 1,440,000 × 1.5/175

≈ 12,343, rounded to ~12,400
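The same extrapolation in code. Linear scaling by parameter count is the simplifying assumption stated above, not a measured number:

```python
# Step 2: extrapolate GPT3 throughput from GPT2 by parameter count.
# This ignores multi-GPU communication overhead, so it's an upper bound.
gpt2_inferences_per_hour = 1_440_000
gpt2_params, gpt3_params = 1.5e9, 175e9

gpt3_inferences_per_hour = gpt2_inferences_per_hour * gpt2_params / gpt3_params
print(f"{gpt3_inferences_per_hour:,.0f}")  # 12,343 -- rounded to ~12,400 above
```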

STEP 3 — Inference optimisation

HuggingFace mentions that AMP (fp16) can increase throughput by 1.5x.

New inferences/hour/GPU

= 12,400 × 1.5

= 18,600
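For reference, here is a minimal sketch of what fp16 inference looks like with HuggingFace Transformers. Note this is illustrative PyTorch using `model.half()`; the throughput figures in this post come from the TensorFlow + XLA benchmarks, and the 1.5x speedup is HuggingFace’s number, not something this snippet verifies:

```python
# A minimal fp16 GPT2 generation sketch (PyTorch + HuggingFace Transformers).
# Illustrative only -- the benchmark numbers above are from TF + XLA.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no pad token by default

model = GPT2LMHeadModel.from_pretrained("gpt2").half().eval().to("cuda")

prompts = ["GPT3 API pricing is"] * 8  # batch size 8, matching the assumptions
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():
    out = model.generate(
        **batch,
        max_length=128,  # seq length fixed at 128, as above
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```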

STEP 4 — Cost per hour at full load

AWS p3.2xlarge costs $3.06/hour. A one-year reserved instance with all-upfront payment gives up to a 36% discount.

Discounted cost = $3.06 × (1 − 0.36) = $1.96/hour

(For comparison, an Azure V100 one-year reserved instance costs $1.72/hour.)

STEP 5 — Cost per inference

Cost per inference

= instance cost per hour/inferences per hour

= 1.96/18600

≈ $0.000105

It will cost you a minimum of ~$0.000105 per GPT3 API call.

At that price, $1 serves about 9,490 API requests.
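In code, with the numbers from Steps 3 and 4:

```python
# Step 5: unit economics at full load.
instance_cost_per_hour = 1.96   # $/hour, p3.2xlarge with 1-yr reserved discount
inferences_per_hour = 18_600    # fp16 GPT3 estimate from Step 3

cost_per_inference = instance_cost_per_hour / inferences_per_hour
requests_per_dollar = 1 / cost_per_inference
print(f"${cost_per_inference:.8f} per request")  # $0.00010538
print(f"{requests_per_dollar:,.0f} requests/$")  # 9,490
```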

Longer sequence API

GPT2 with seq length 1024 and batch size 8 takes 0.195s per batch, which is roughly 10x the time at seq length 128.

Hence you will be able to serve ~949 requests/$.

Conclusion

I hope this gives you a good idea of how to justify the use case for your business.

We haven’t added OpenAI’s profit margin to the API cost. Assuming a 20% margin, that comes to 949/1.2 ≈ 790 requests/$.
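Putting the last two adjustments together reproduces the headline number:

```python
# From 128-token pricing to the headline figure.
requests_per_dollar_128 = 9_490

seq_scaling = 10    # GPT2 at 1024 tokens is ~10x slower than at 128
requests_per_dollar_1024 = requests_per_dollar_128 / seq_scaling  # 949

margin = 0.20       # assumed OpenAI profit margin
requests_per_dollar_final = requests_per_dollar_1024 / (1 + margin)
print(f"{requests_per_dollar_final:.1f}")  # 790.8 -- i.e. ~790 requests/$
```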

Do you think 790/$ is good enough for your business?
