Large Language Models & Data Warehousing

Ravikiran Durbha
Published in Data And Beyond
3 min read · May 8, 2023

Can large language models replace data models in decision systems?

A thought occurred to me recently, amid all the conversation about large language models: how will they affect the BI industry in the near term? Can they replace the entire concept of data warehousing? After all, the purpose of a data warehouse is to let us ask questions about the business and extract intelligence. Today, these questions are written in Structured Query Language (SQL). Could we instead ask a large language model (LLM) questions in English and have it generate charts and reports? Presumably, we would then no longer have to build and manage large data pipelines for analytics and BI.

Maybe not so fast. Trust is one of the main ingredients of a successful BI program. I suspect the consumers of the reports will want to know whether they are looking at the “truth”. A large part of the effort in building a data warehouse is spent on making it verifiably true. This is why data lineage is an important category of software within data governance. If the inner workings of a model are inscrutable (as is the case for LLMs), how can it be the single source of truth? I think the ability to explain these models is critical before they can disrupt data warehousing, especially when they drive critical decisions.

LLMs are very good at prediction. They are auto-complete on steroids: instead of predicting the next three characters, they can predict the next three paragraphs. To be sure, this has tremendous utility. They can generate programs, prose, poetry, or even a picture merely from a prompt. While they can do all this, reasoning is still beyond their grasp. They do not know the difference between correlation and causation. Of course, we may get there at some point in the future, but not by merely marinating these models in more and more data. Causal knowledge requires understanding the data-generating process and cannot be inferred from the generated data alone.

Having said this, I think LLMs will become ubiquitous within the BI industry. They will certainly change the way we interface with data. Anyone who knows English will be able to query the data, since LLMs can generate SQL from a prompt. Some of these tools already exist in primitive form and continue to evolve. This will democratize data and bring it closer to everyone in the organization.
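To make the idea of querying in English concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `generate_sql` is a hypothetical stand-in that returns canned SQL where a real system would call an LLM, and the `sales` table is invented sample data in an in-memory SQLite database. The point it tries to show is that the generated SQL is returned next to the result, so a reader of the report can still check how a number was produced.

```python
import sqlite3


def generate_sql(question: str) -> str:
    """Hypothetical stand-in for an LLM call: maps a question to canned SQL."""
    canned = {
        "total sales by region": (
            "SELECT region, SUM(amount) AS total_sales "
            "FROM sales GROUP BY region ORDER BY region"
        ),
    }
    return canned[question.strip().lower()]


def is_read_only(sql: str) -> bool:
    """Rough guard (an assumption, not a full SQL parser):
    accept only a single statement that starts with SELECT."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def ask(conn: sqlite3.Connection, question: str):
    """Translate the question, refuse anything that is not a plain SELECT,
    and return the SQL together with the rows so the answer can be audited."""
    sql = generate_sql(question)
    if not is_read_only(sql):
        raise ValueError("generated SQL is not a single SELECT statement")
    return sql, conn.execute(sql).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table that plays the role of the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("East", 100.0), ("West", 250.0), ("East", 50.0)],
    )
    return conn


if __name__ == "__main__":
    sql, rows = ask(demo_connection(), "Total sales by region")
    print(sql)
    print(rows)
```

In this sketch the warehouse stays where it is and the language model only writes the query, which is one way to keep the answer traceable to governed data.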

Going back to trust as one of the main ingredients: data quality is very important. We already have AI that can learn business rules from data, but these rules still need augmentation from data stewards to be complete (since the AI cannot infer causation). Today, this augmentation has to be translated into SQL, but with LLMs the rules can be expressed in English and the translation to SQL will be automatic. Also, LLMs can express complex AI-generated rules in English, which stewards can then easily validate. One will also be able to generate charts and reports merely by describing them in plain English.
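As a rough illustration of a rule that lives in both English and SQL, the Python sketch below pairs the two and runs the SQL only after a steward has approved the English wording. The `QualityRule` class, the `orders` table, and the rule itself are all hypothetical, and the SQL is written by hand here where an LLM would be expected to produce it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class QualityRule:
    english: str            # the wording a data steward reads and approves
    violation_sql: str      # query that returns the rows breaking the rule
    approved: bool = False  # set to True by the steward after review


def find_violations(conn: sqlite3.Connection, rule: QualityRule):
    """Run the rule's SQL, but only once a steward has signed off on it."""
    if not rule.approved:
        raise PermissionError("rule has not been approved by a data steward")
    return conn.execute(rule.violation_sql).fetchall()


def demo_orders() -> sqlite3.Connection:
    """Invented sample data: order 2 ships before it was placed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER, order_date TEXT, ship_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, "2023-05-01", "2023-05-03"),
            (2, "2023-05-04", "2023-05-02"),
            (3, "2023-05-05", None),  # not shipped yet, so not a violation
        ],
    )
    return conn


if __name__ == "__main__":
    rule = QualityRule(
        english="An order cannot ship before it was placed.",
        violation_sql=(
            "SELECT id FROM orders WHERE ship_date < order_date ORDER BY id"
        ),
    )
    rule.approved = True  # the steward validates the English description
    print(find_violations(demo_orders(), rule))
```

Keeping the English text and the SQL in one record means the steward validates the sentence while the warehouse enforces the query, and the two cannot silently drift apart.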

LLMs may not replace the data warehouse yet, but they can certainly sit on top of it as an access layer. The current wave of AI will bring more data democratization and leave data engineers with more time to dig for deeper insights.
