Planning to build a business around GPT-4? Think again!
Last week, a startup shut down its LLM-and-RAG business, even though it had a large B2B contract within reach. Here’s why, and how it could have been avoided:
The founder wrote a blog post explaining why he had to shut down the business, and I’m summarizing his main points here.
The product was quite good. No problem with that part.
The product was just a chat application that answered user queries with GPT-4. But before answering, it searched a database (docs, FAQs, products, etc.) and grounded its answer in that data (yeah, it’s nothing but RAG)!
The startup’s prospective client aimed for hundreds of thousands of monthly user queries, so the startup evaluated which model to pick. GPT-4 produced the best results for them, so the founder chose GPT-4.
The startup didn’t choose open-source alternatives because they weren’t good enough in its testing. I can’t entirely agree with that conclusion, but there is a considerable probability it’s true for their particular use case. They must have been dealing with highly complex data, and a vanilla open-source LLM might not have been suitable.
In this phase, the startup faced its first reality check: the financial one. GPT-4 is expensive as hell!
For hundreds of thousands of monthly user queries, that means huge monthly OpenAI bills. The cherry on the cake: only GPT-4 seemed viable for the startup’s complex use case.
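To see why the bill explodes, here is a back-of-envelope estimate. Every number in it is an assumption for illustration: GPT-4’s list price at the time was roughly $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens, and the token counts per query are guesses for a RAG setup that stuffs retrieved docs into the prompt.

```python
# Back-of-envelope monthly bill for a GPT-4-backed RAG chatbot.
# Prices and token counts are ILLUSTRATIVE ASSUMPTIONS, not measurements.

def monthly_cost(queries_per_month, prompt_tokens, completion_tokens,
                 price_in_per_1k=0.03, price_out_per_1k=0.06):
    """Estimated monthly API spend in dollars."""
    per_query = (prompt_tokens / 1000) * price_in_per_1k \
              + (completion_tokens / 1000) * price_out_per_1k
    return queries_per_month * per_query

# 300K queries/month, ~2K prompt tokens (query + retrieved docs), ~500 out:
print(round(monthly_cost(300_000, 2_000, 500)))  # 27000
```

Roughly $27K a month on inference alone, before retries or longer contexts. You can see why a client balks at that for a chatbot.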
So, the client most likely backed down after evaluating the cost proposal. Simply put, spending a gazillion dollars on a chatbot did not seem feasible to them.
The startup failed to close the contract and stopped building the product.
End of story. Keep reading to get the link to the story!
But if I were in his shoes, building that startup, I would remember these three rules:
First of all, stop stressing over the cost. Sam Altman, in his latest talk at Y Combinator, also expressed hope that LLM inference will be cheap someday.
I am hopeful as well. Compare the A100 with the H100 and you will see massive innovation happening in this field every day; the price-performance ratio keeps improving.
But the financial reality is not going to change too soon. So now, some rules to remember, keeping that financial reality in mind!
First rule: Keep faith in open-source models
Forget the GPT-4 API, especially for a large volume of user queries. It is only good for labelling, testing, and generating datasets for fine-tuning, not as the main model. Technically it is the best, but not financially!
Instead, have faith in open-source models like Llama 2. Anyscale showed that fine-tuned Llama 2 models can outperform GPT-4 on some tasks. Test your problem and see whether your fine-tuned model can outperform GPT-4 as well! Read Anyscale’s fine-tuning guide here.
On the other hand, Anyscale also showed that summarization with GPT-4 still costs about 30 times more than with Llama-2-70b, even though both models reach roughly the same level of factuality.
Why are you losing faith in open-source models?
Besides, Gradient AI proposes a novel solution: the Mixture of Experts (MoE) approach to AI! So what is it? Basically, fine-tuning open-source models like Llama 2 7B for particular tasks.
Say, for example, you divide your company data into four categories: Finance, Engineering, Product, and Operations. Now you fine-tune four models, one per category, and each model excels at its own task.
Now, when a user writes a query, we try to find its type. For instance, the user writes a query like this: Give me the estimated number of H100 GPUs for our recommendation engine product.
The best answer will likely come from engineering, product, or finance. If we normalize the top two routing scores, we might weigh the answer 75% engineering and 25% product, and generate a final response by merging the responses from those two fine-tuned models.
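The routing step can be sketched in a few lines. The category scores below are hypothetical placeholders; in practice the router could be a small embedding classifier, and each expert a fine-tuned Llama-2-7B.

```python
# Sketch of the routing step in the fine-tuned "mixture of experts" setup.
# The scores are MADE-UP router outputs for the GPU-estimation query.

def route(scores, top_k=2):
    """Keep the top_k experts and renormalize their scores into weights."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(s for _, s in top)
    return {name: s / total for name, s in top}

scores = {"finance": 0.1, "engineering": 0.6, "product": 0.2, "operations": 0.05}
weights = route(scores)
print(weights)  # engineering ~0.75, product ~0.25
```

With those weights you would query only the engineering and product experts and merge their answers, leaving the other two models idle for this query.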
See how you can outperform GPT-4 with multiple fine-tuned models? The startup can save money as well. How?
GPT-4 costs 18x as much as Llama-2-70b! And if my memory serves me right, Llama-2-7b is 30x cheaper than GPT-4. Serving four fine-tuned Llama-2-7b models will not make you broke!
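Here is the quick arithmetic, taking that ~30x figure at face value (an assumption, not a measurement). Since only the top two experts answer each query, running four fine-tuned 7B models does not mean paying 4x per query:

```python
# Rough relative serving cost per query. The 30x GPT-4 vs Llama-2-7B
# figure is an ASSUMPTION carried over from the text above.

GPT4_RELATIVE_COST = 30.0   # per query, relative to one Llama-2-7B call
EXPERTS_PER_QUERY = 2       # top-2 routing, as in the 75%/25% example

moe_relative_cost = EXPERTS_PER_QUERY * 1.0
savings_factor = GPT4_RELATIVE_COST / moe_relative_cost
print(f"GPT-4 costs ~{savings_factor:.0f}x more per query")  # ~15x
```

Even with two experts answering every query, you are still an order of magnitude cheaper than GPT-4 under these assumptions.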
Second rule: The Age of Giant AI Models Is Already Over
Sam Altman said this, not me. Altman’s statement suggests that GPT-4 could be the last major advance to emerge from OpenAI’s strategy of making the models bigger and feeding them more data.
Let me tell you a story first. When preparing for my business school admission exam, I worried about the competition. I asked my teacher whether memorizing a list of high-frequency GMAT vocabulary would be enough. He said yes, but I was still concerned: what if a difficult word appeared on the exam that wasn’t on the list? Even a single mark could make a difference in such a competitive exam.
My teacher reassured me that no one else would be able to answer that question either. I should focus on memorizing the given list of high-frequency words well, and that would be enough.
I realized I couldn’t worry about everything, so I focused on what I could control: memorizing that one list of high-frequency GMAT words, not trying to memorize everything.
I fear not the man who has practised 10,000 kicks once, but I fear the man who has practised one kick 10,000 times.
Bruce Lee
So when that founder shut down his startup because GPT-4 is costly and other models are less performant, remember: this ROI problem is true for everyone, and nobody has solved it yet. So don’t worry; nobody else can reach that accuracy at that price either. Focus on what you have.
Like most founders, you don’t have the money or resources to fine-tune a model and serve it to hundreds of thousands of users by yourself.
But I believe one thing religiously (again because of the same financial reality).
I believe the future is not in 1 trillion parameters but in open-source 7B models!
Look at Mistral 7B or Llama 2 7B! But don’t always go for the latest model, because there is a high chance that a new model was trained specifically to shine on the benchmark data while performing poorly in real life. Mistral 7B beats Llama 2 13B on benchmarks, but if you ask me which model is better, I will simply tell you to test on your use case before drawing any conclusions!
I don’t blame the founders building LLM models. Some VCs think that hiring nine mothers can make a baby in one month! To keep the funding flowing and deliver the baby in one month instead of nine, founders will produce benchmark-crushing but bad models. There is even a funny paper showing that selective training data can make a model “outperform” every other! Beware.
Anyway, I have given enough examples showing that fine-tuned Llama 2 models can outperform GPT-4. So test like hell and pick the open-source model with the best ROI!
Finally, a word from the Hugging Face CEO to prove that the age of giant AI models is over!
Third rule: Bet heavily on RAG and prompt tuning, less on fine-tuning
I believe the future is mostly in RAG, less in fine-tuning. Again, because of the same financial reality!
GPT-3.5 fine-tuning has been available for the past month, and GPT-4 will soon offer the same. But I don’t believe they are actually fine-tuning on your data; that is simply not possible at this scale. They are probably using a technique called prompt tuning.
Prompt tuning can achieve the same effect. The paper “P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks” showed us that. And I think OpenAI is just not sharing its research!
Fine-tuning is more or less about new formatting and response style: telling an LLM to behave in a certain way, or to do a different task than just being a generalist chatbot.
RAG, on the other hand, is basically keeping the LLM informed with your new data (e.g., financial stats).
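Here is the shape of that RAG loop as a minimal sketch. The documents, the keyword-overlap retriever, and the prompt template are all toy stand-ins; a real system would use embeddings and a vector (or graph, or keyword) index.

```python
# Minimal RAG loop: retrieve relevant snippets, then stuff them into the
# prompt. Retrieval here is naive word overlap, a TOY stand-in for
# embedding similarity, just to show the shape of the pipeline.

DOCS = [
    "Q3 revenue grew 12% quarter over quarter.",
    "The recommendation engine runs on 8 A100 GPUs.",
    "Refunds are processed within 5 business days.",
]

def retrieve(query, docs, k=2):
    """Rank docs by shared words with the query; keep the top k."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    """Ground the LLM: answer only from the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many GPUs does the recommendation engine use?", DOCS))
```

The prompt, not the model weights, carries your fresh data; that is why RAG sidesteps most of the fine-tuning cost.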
So, again, I am pretty sure that massive fine-tuning is not the future. RAG, few-shot prompts, chain-of-thought prompting, and prompt tuning are the future, because they are less costly and less complex. History has taught us repeatedly that simple ideas win! Look at the world again: all great ideas and businesses are simple.
And we are going to see radical innovations in the RAG field, mark my words! The fact that that startup did RAG and got poor results from open-source models simply suggests they could have done better.
Final words about RAG:
- Experiment with RAG. Test every novel technique, like RAG Fusion, on your use case.
- Build a complex RAG system, not a simplistic one. The LlamaIndex approach is great, but build your own that fits you. Reinvent it every time; it’s an art problem, not a science problem.
- Pour all your semantic search and domain knowledge into it.
Leaving RAG aside, I showed here how critical it is to understand the semantic search problem.
I have even seen a veteran founder rant: who told you that you only need vector databases for RAG? Graph databases and keyword search can sometimes produce better results if your data is structured enough. Question the status quo.
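One way to take both retrievers seriously is Reciprocal Rank Fusion, the same trick RAG Fusion builds on: run keyword and vector search separately, then fuse the two rankings. A minimal sketch, with both rankings hard-coded as stand-ins for real BM25 and embedding results:

```python
# Reciprocal Rank Fusion (RRF): merge several ranked lists into one.
# A doc scores 1/(k + rank) in each list it appears in; appearing near
# the top of BOTH lists beats topping just one. k=60 is the usual default.

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # e.g. BM25 / keyword search
vector_hits  = ["doc1", "doc9", "doc3"]  # e.g. embedding similarity
print(rrf([keyword_hits, vector_hits]))  # ['doc1', 'doc3', 'doc9', 'doc7']
```

Note how doc1 wins: it ranks high in both lists, which is exactly the signal a single retriever would miss.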
Conclusion
Yes, you can build a scalable and feasible business around LLMs, even at scale most of the time.
GPT-4 is the best, but it is not scalable or financially feasible. So, are you still trying to guess the startup that wrote the blog and pivoted? It’s Llamar.ai. The blog is here: Llamar.ai: A deep dive into the (in)feasibility of RAG with LLMs
And my take is that it’s more feasible than infeasible. I hope other startups will prove me right by building profitable and scalable AI products.
And I will be here to share the success stories of some garage startup kids!
Follow me for more here at Medium. Check my LinkedIn profile for more short content.