How NL-to-SQL is changing the way marketing teams use data

James Patten
Dataherald
Published in
5 min readDec 18, 2023
AirPipe AI

Why are you working on NL-to-SQL?

NL-to-SQL aims to solve a problem that is very close to my heart.

For several years, I worked as a Data Engineer at various companies, large and small. Each data team had their unique set of challenges, but there was one common issue across them all: we were always drowning in ad-hoc data requests.

No matter how many dashboards we built, it was just impossible to keep up with the pace of ad hoc questions that would come in every single day. Our backlog of requests grew bigger and bigger, and our stakeholders became more frustrated each day they were kept waiting for their answers.

Even when I transitioned from being a Data Engineer into a Software/Growth Engineer inside the marketing function of a start-up, I noticed a very similar problem. The best marketing teams are testing and learning new ideas all the time. Iterating on things such a copy, creatives, bid strategies etc. But with more tests comes more questions.

Getting into the nuance of why something is working or isn’t can be the difference between capitalising on a winning idea vs burning precious hours and ad spend on losing ones. And it’s not just people inside marketing teams who want answers. Their clients also want to be kept in the loop on what is working and what isn’t.

All agency owners I’ve spoken to have stressed the importance of keeping open lines of communication with their clients. It’s a huge value add to clients if their marketing partner can answer their questions and address their concerns with timely and accurate information. This helps marketing specialists and agencies build trust, improve their reputation and ultimately win more work.

This is why I’m building AirPipe. It’s the Marketing Analytics tool I needed when I was both drowning in ad-hoc requests and scratching my head on what growth ideas we needed to prioritise. Marketing teams can pull data from their channels, blend them together into one unified source of truth, then ask questions and perform analysis in natural language.

How did you go about creating your NL-to-SQL engine?

I first started exploring NL-to-SQL when OpenAI released the Codex model which was a derivative of GPT-3. It was a little bit premature at the time, but the potential of using LLMs for NL-to-SQL really became apparent in the advent of GPT-4. At the same time, a framework called LangChain was gaining popularity.

They had a SQL agent, which allowed me to connect to my database, start feeding in some basic schema info into a prompt and then using it to perform NL-to-SQL questions.

It was pretty novel at the time, and I was able to use it to successfully answer some basic questions such as “How many clicks to X campaign get in July?” or “What was the average cost per conversion for campaign Y?”

However, when trying more advanced questions that involved multiple tables in my database such as “What were all the dates where the cost per impression on Facebook was greater than on LinkedIn?” It struggled. The model would often choose the wrong columns, fail to join tables, or even make up tables or columns that didn’t exist.

Even using LangChain’s agent to self-correct errors didn’t help much. Once a model starts on the wrong path, I found that the out-the-box SQL agent was more likely to exacerbate the issue rather than correct it.

It quickly became apparent that NL-to-SQL is fundamentally more challenging than text to text generation (i.e. tuning an LLM on PDFs/text documents). Like all LLM based approaches, it’s prone to hallucinations. However, in NL-to-SQL, a hallucination doesn’t just result in a number being slightly off, it often causes the entire query to fail and no result to be returned at all.

This is important, as I don’t want to waste API spend to model providers on queries that don’t add any value to my users. I also don’t want my users to waste time ‘prompt engineering’ when the purpose of AirPipe is to help them self-serve their data requests quickly and efficiently.

It means AI and Data Engineers need to set-up their agent and databases a lot more carefully, pass in more context about their schema, and give models clear direction on how it should be interpreting the schema based on business reasoning.

And that’s where I found Dataherald to be very helpful.

Why does the Dataherald engine works for you?

Dataherald’s Open Source NL-2-SQL engine

Having struggled to get the base LangChain SQL agent working, I decided to go about building my own engine so that I could pass in additional context and improve the accuracy of SQL for my use case.

I spent a decent amount of time researching the NL-to-SQL space, and eventually came across Dataherald’s open source engine whilst reading an article. It comes equipped with several useful endpoints for adding context, including: database scanning, adding verified SQL examples, and setting column descriptions.

I was able to get their engine running on my server, make some customisations for my use case, and then integrate it into AirPipe in a single afternoon. It probably saved me a good few weeks of development building my own engine.

Even just from using their engine out of the box, I found the accuracy of SQL to be a significant improvement on other implementations I had previously tried. I then got even more accurate queries on even more advanced questions when I started adding context using the methods described above.

I like that Dataherald has a strong focus on just doing NL-to-SQL very well, so that developers like me can then focus on making it work for our use cases. Their engine is modular, so I can quite easily make customisations where I need to without worrying about interfering with other parts of the base engine.

As the NL-to-SQL space evolves, I’m interested to see how the Dataherald team continue to enhance their engine to make it more performant and accurate. Having spoken to the founders Anuj and Amir, they’ve got some exciting ideas in the pipeline such as fine-tuning capabilities to reduce latency and API costs.

Using Dataherald, I’m excited to build AirPipe’s NL-to-SQL features for marketing teams and help them solve their biggest data analysis and reporting challenges.

--

--