Stories by Amir Zohrenejad on Medium

On the accuracy of on-device LMs

Amir Zohrenejad — Wed, 21 May 2025 00:04:02 GMT

On the accuracy of on-device language models

Are we at “good enough” yet?

Small model in training

This post follows up on my previous article about latency and developer experience in local inference. All tests here were run on an M2 Max MacBook Pro (32GB RAM) using LM Studio for inference and DSPy for benchmarking. Models included DeepSeek-R1-Distill-Qwen and the Gemma-3 family. The source code and results for this post are here.

Background

Consumer laptops can now run language models (LMs) fast enough for interactive use. But what can these models actually do well today, and how large do they need to be to do it?

I evaluated four task types:

Simple Prompt Rewriting — Remove filler like “please” and “thank you”.
Complex Prompt Rewriting — Redact PII without altering the intent.
General Knowledge Q&A — Answer factual questions without tools or RAG.
Math Reasoning — Solve grade-school problems (GSM8K benchmark).

Simple prompt rewrite

Cloud providers incur real costs processing verbose prompts padded with “thank you” and “please.” In theory, a local model running in the browser could strip this filler before the prompt is routed to the cloud.

I began testing with gemma-3-1b-it-qat (762MB). Initial manual checks indicated solid performance.

Sample case where the thanks can be removed

Trying to stump the model v1 — it passed

Trying to stump the model second attempt — it passed

To quantify accuracy more rigorously, I benchmarked Gemma-1B, 4B, and 12B using DSPy-tuned prompts (20 training samples, evaluated across 80 questions). The results here showed real weakness with smaller models.

Simple prompt rewrite with 1B, 4B and 12B models

The 1B model frequently failed by producing invalid JSON. Both the 1B and 4B models would often modify text when instructed not to. For example, in the scenario below the polite phrase is integral to the meaning of the prompt, so the system should not have altered the input at all. It rewrote the prompt anyway.

{
  "initial_prompt": "Why is it polite to end calls with \"take care\"?",
  "expected_reasoning": "The phrase \"take care\" is a polite expression. The word \"polite\" in the prompt is not a polite word itself; it is used in an analytical context. There are no actual polite words or phrases in the prompt that can be removed without altering the meaning.",
  "expected_rewritten": "Why do people end calls with \"take care\"?",
  "model_reasoning": "The prompt asks \"Why is it polite to end calls with \"take care\"?\u201d The phrase \"take care\" is a common expression of goodwill and concern when ending a conversation. It's considered polite because it conveys that you wish the other person well, showing respect and consideration for their wellbeing. Removing this phrase would change the question entirely, making it simply \u201cWhy is it polite to end calls?\u201d.",
  "model_rewritten": "Explain why saying \"take care\" at the end of a phone call is considered polite.",
  "is_integral": true,
  "is_correct": false,
  "latency": 1.5175559520721436
}
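A minimal sketch of how one response might be scored, written only to illustrate the two failure modes described above. The `rewritten` key and the exact-match rule are assumptions for illustration, not the benchmark's actual code: a response counts as correct only if it parses as JSON and its rewrite matches the expected one after trimming whitespace.

```python
import json


def score_response(raw_response: str, expected_rewritten: str) -> dict:
    """Score one raw model response against the expected rewrite.

    Assumption: the model is asked to return a JSON object with a
    "rewritten" key. That key name is hypothetical.
    """
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        # Invalid JSON counts as a failure (the 1B model's common error).
        return {"valid_json": False, "is_correct": False}

    if not isinstance(parsed, dict):
        return {"valid_json": True, "is_correct": False}

    rewritten = str(parsed.get("rewritten", "")).strip()
    return {"valid_json": True, "is_correct": rewritten == expected_rewritten.strip()}


def accuracy(results: list[dict]) -> float:
    """Fraction of scored responses marked correct."""
    return sum(r["is_correct"] for r in results) / len(results) if results else 0.0
```

Exact matching is deliberately strict here. A benchmark could just as well use a softer comparison, at the cost of hiding the unwanted rewrites this section is about.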

Bottom line: small models might pass spot checks but don’t hold up under structured testing yet. I plan on fine-tuning these models for this specific task to see if accuracy can pass 90%.

Harder prompt rewriting — PII redaction

PII redaction is a natural local inference task. Imagine a pipeline where user input is redacted locally before hitting the cloud, and identifiers are later reinserted. This architecture was proposed and studied in the Columbia NLP team’s Papillon paper.
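The redact-then-reinsert flow can be sketched in a few lines. This is only an illustration of the pipeline shape, not the Papillon method: two simple regular expressions stand in for the local model's PII detection, and the placeholder format is an assumption.

```python
import re

# Assumption: simple regexes stand in for the local model's PII detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace each PII match with a placeholder and remember the mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        def _swap(match: re.Match, label: str = label) -> str:
            placeholder = f"[{label}_{len(mapping) + 1}]"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(_swap, text)
    return text, mapping


def reinsert(text: str, mapping: dict[str, str]) -> str:
    """Restore the original identifiers in the cloud model's response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Only the redacted text would leave the device. The mapping stays local and is applied to whatever the cloud model returns.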

In initial tests, 1B models failed PII redaction but 4B models seemed to perform well.

Similar to the simple prompt rewrite, I benchmarked Gemma-1B, 4B, and 12B using DSPy-tuned prompts (20 training samples, evaluated across 80 questions). In this test, the 4B model gave good results out of the box.

In the future, I plan on fine-tuning the 1B model to see if I can get performance comparable to the 4B models.

General knowledge Q&A

This is where small local models struggle. Even at 12B, hallucinations are rampant, especially when answering simple follow-ups.

The first question in each of the following chats asks “who is Jim Morrison.” The models are getting progressively larger, but all models under 32B fail.

Gemma3-1b thinks I am way cooler than I am IRL

7B: Despite the resemblance, I am not Jim Morrison’s father

12B: Sadly, I did not play percussion for the Doors

Things do get better at 32B. However, these models run quite slowly on my machine: the following took 12s to generate the first token.

Admittedly, this is a different test than the others in this post. We are asking a follow-up question and trying to stump the model in the prompt. However, such scenarios are inherently more prevalent in a chatbot. So, subjectively, I do not think a general knowledge chatbot can work with a local model unless it is connected to a data source that grounds its answers and prevents hallucinations.

Mathematical reasoning — GSM8K

I tested the models for mathematical reasoning on the GSM8K benchmark. Interestingly, the Gemma models outperformed deepseek-r1-distill even at a smaller model size. I was surprised by how well Gemma-3-4B did here. For reference, the SOTA 175B models of only two years ago were around this level of accuracy. One has to wonder whether GSM8K is in the models’ training data.

GSM8K evaluated on 1B, 4B and 12B. Gemma-4B outperforms deepseek-r1-7B
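For readers unfamiliar with how GSM8K is usually scored, here is a small sketch. GSM8K reference solutions end with a line of the form "#### <final answer>". Taking the last number in the model's output as its answer is an assumption made for this sketch, and not necessarily how the benchmark in this post extracted answers.

```python
import re

# Matches integers and decimals, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(solution: str) -> str:
    """Return the final answer that follows '####' in a reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str):
    """Assumption: the last number in the completion is the model's answer."""
    matches = NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def is_correct(completion: str, solution: str) -> bool:
    """Compare the predicted and reference answers numerically."""
    predicted = predicted_answer(completion)
    if predicted is None:
        return False
    return float(predicted) == float(gold_answer(solution))
```

Scoring only the final number ignores the reasoning steps, so a model can be marked correct for a right answer reached the wrong way.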

I’m unsure how to meaningfully improve these results with fine-tuning. Would love suggestions if anyone has experience here.

Final thoughts and next steps

As described, I’ll try to fine-tune the 1B model for prompt cleanup and PII redaction to see if it can reliably complete specific tasks. My goal is to see if I can create intelligent, task-specific lambda functions with a ~750MB bundle size.

Local LLM inference

Amir Zohrenejad — Mon, 21 Apr 2025 15:52:04 GMT

Tremendous progress, but not ready for production

I stumbled into local inference during a side quest. The main mission was to build a text-to-SQL product. Curiosity hijacked the roadmap: could I cram an entire GenBI stack into the browser?

The prototype never shipped. But I did fall into the local inference rabbit hole. “AI inference” is not a listed feature on my laptop’s spec sheet, yet through the magic of open source software it can now run powerful LLMs in a browser tab for free. It’s impressive. Just not quite production-ready as a developer platform.

Why bother with local compute?

From mainframes to PCs to the cloud, compute has swung between centralization and edge. Now it’s drifting back toward the edge — at least if you squint through the hype. But most users don’t actually care where computation happens. They want it to be fast, and they want it to be cheap.

For example: Figma isn’t popular because it runs on WebAssembly — users love it because it feels instant. DuckDB isn’t getting traction in the data world because it fits on a laptop — it’s being deployed because it can trim Snowflake bills.

Most applications still run in the cloud. However there are four benefits to moving compute to the local device:

Cost
Privacy
Speed
Enabling offline use

Local inference is not new: iPhone facial unlock has run local inference on mobile devices at scale since 2017. Facial unlock wouldn’t work without it: it has to be fast, work offline, be private, and not cost Apple money every time someone tries to unlock their phone.

As software applications increasingly integrate LLMs, pushing AI inference to the edge can have the same upside.

Frameworks

I tested the following local inference frameworks together with quantized versions of DeepSeek-R1-Distill-Qwen-7B.

llama.cpp: C/C++ core, highly optimized. An amazing project by Georgi Gerganov.
Ollama: A product and business built on llama.cpp. Better DevEx and model library curation.
WebLLM: Browser-based inference with WebGPU acceleration developed at Carnegie Mellon. Built on MLC.

I ran the same inference benchmarks against OpenAI’s gpt-4o-mini as a baseline comparison. The benchmark code can be found here.

Performance

Tests were run on my MacBook Pro with the following chip specs and 32GB of RAM.

The metrics I tracked were median time to first token (TTFT) and tokens per second (TPS).
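As a rough illustration of how these two metrics can be computed from any streaming response, here is a minimal sketch. `fake_stream` is a hypothetical stand-in for a real streaming client, and measuring TPS over the generation phase only (after the first token arrives) is an assumption, not necessarily how the benchmark code defines it.

```python
import statistics
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Time a token stream: seconds to first token and tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    generation_time = end - first_token_at
    return {
        "ttft": first_token_at - start,
        # Assumption: TPS excludes the wait for the first token.
        "tps": (count - 1) / generation_time if generation_time > 0 else float("nan"),
        "tokens": count,
    }


def fake_stream(n_tokens: int, first_delay: float, token_delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference client."""
    time.sleep(first_delay)
    for i in range(n_tokens):
        if i:
            time.sleep(token_delay)
        yield f"tok{i}"


# Repeat the measurement and report medians, which are less sensitive
# to a single slow run than the mean.
runs = [measure_stream(fake_stream(20, 0.05, 0.005)) for _ in range(3)]
median_ttft = statistics.median(r["ttft"] for r in runs)
median_tps = statistics.median(r["tps"] for r in runs)
```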

As the chart above shows, llama.cpp and Ollama are both blazing fast in TTFT. OpenAI is slightly slower, likely due to network overhead and authentication. WebLLM was the slowest.

In terms of TPS, llama.cpp and Ollama are comparable, which makes sense as they are the same under the hood. WebLLM topped out at only half the TPS of the other frameworks. I can only assume this is because WebGPU acceleration does not use the local GPU as efficiently as llama.cpp, which accesses the GPU directly.

All the local inference solutions were slower than OpenAI running gpt-4o-mini, a considerably larger model.

While I did not track memory usage or CPU/GPU utilization, I did not notice any side effects while using other apps on my laptop as the benchmarks ran.

Mo’ Models, Mo’ Problems

While the performance of local inference lags cloud solutions, it is already good enough for many tasks. This brings us to the main problem I encountered: finding and deploying the correct model for a given task.

Given the resource constraints, the models that run locally must be much smaller than models running in the cloud. For a developer, there is currently no way to find (or easily tune) a model that can do “text-to-SQL” and run on a MacBook with an M2 chip. Even when I had shelved the prototype idea and was just aiming to benchmark these tools with deepseek-qwen-7B, I had to decide which of the 663 different models matching this name on HuggingFace I should download for llama.cpp.

Furthermore, even a quantized version of a distilled 7B model is over 5GB. Downloading and loading these models is very slow even on fiber internet. For an application developer, this leads to a degraded initial user experience of the application. For example, if your webapp uses WebLLM, the user will need to sit for a few minutes while the model is being downloaded to their machine.
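A back-of-the-envelope sketch of the download cost, assuming an ideal connection with no protocol overhead or server-side throttling. The connection speeds are hypothetical, and real-world downloads will be slower than this best case.

```python
def download_seconds(size_gb: float, bandwidth_mbps: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and server limits."""
    size_bits = size_gb * 1e9 * 8  # decimal gigabytes to bits
    return size_bits / (bandwidth_mbps * 1e6)


for mbps in (100, 300, 1000):  # hypothetical connection speeds
    print(f"{mbps:>5} Mbps: {download_seconds(5, mbps):6.0f} s")
```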

Final thoughts

Local LLM inference is possible. It works today, but the developer tooling will need to mature before real world applications leverage local inference beyond niche use cases.

Any real solution will need to make it dead simple to train and deploy small, task-specific models — and integrate tightly with cloud LLMs. It will have to handle downloads, caching, and local execution behind the scenes, so the user never notices where the model is running or how it got there.

github-assistant

Amir Zohrenejad — Mon, 16 Dec 2024 07:37:59 GMT

github-assistant answers questions from repository data available through the GitHub API. It was built in only 10 days using Relta, assistant-ui, dlt and LangGraph. This is exciting. It shows that LLMs and data devtools have matured. With the right tools and without much effort, developers can let users query structured data in plain English.

This technology will enable new scenarios for users in the near future: search engines will answer questions from publicly available relational datasets such as the GitHub API. SaaS developers will embed conversational analytics into their products. AI agents will query and act on SQL data in their execution flows.

Results

github-assistant currently loads Issues, Pull Requests, Stars, and Commits from GitHub. We have pre-loaded this data for a few popular open-source repos. You can load data for any open-source repo as well. Relta’s semantic layer acts as a guardrail to guide the model to correct results. Even with its current minimal semantic layer setup, the tool outperforms ChatGPT in a number of query types.

Hallucinations

The semantic layer provides tight guardrails in the form of pre-defined metrics (with dimensions and measures) to generate SQL from the relational data. As a result, github-assistant provides accurate results for some questions where ChatGPT would hallucinate.

A hallucinated answer from ChatGPT about average Issue response time

The correct answer to the question above with the underlying calculation from github-assistant.

The correct answer from github-assistant with the SQL calculation

Data availability

ChatGPT only has access to data that it can crawl from GitHub webpages. This leaves a whole lot of data and insights inaccessible, even though they can be accessed from the GitHub API. The example below illustrates this for commit data on a repository.

Architecture

github-assistant is built on the following:

assistant-ui on the front-end
Relta for semantic layer creation, refinement and text-to-sql
dlt for loading data from the GitHub API
LangGraph as the agent framework (with LangSmith used for observability)

We use Vercel, FastAPI, and PostgreSQL on RDS, along with ECS + ECR, for app hosting and state storage in various parts of the solution. The LLM used is OpenAI gpt-4o.

Meet the Agents

The heavy lifting of github-assistant is powered by three agents:

Front-end agent — communicates with the user, directly answering simple questions (“What can you do?”) or calling Relta’s API for questions about data, and chooses the graph type and title for the query results.
Semantic-layer agent — an agent within Relta which creates the first draft semantic layer from the DDL and sample questions. The agent suggests modifications to the semantic layer based on user feedback and automatically raises PRs on the repo.
Text-to-SQL agent — an agent within Relta which uses the semantic layer to generate SQL, execute it, repair it if necessary and return the result or answer.

Loading the data with dlt

github-assistant uses dlt and its verified sources to set up data pipelines to load the data from the GitHub GraphQL API. We made minimal changes to the dlt GitHub connector. Most of our work on the data pipelines was to create and persist logic around pipeline state.

With the rich set of source connectors in dlt, solutions such as shopify-assistant, googleads-assistant, asana-assistant can all be spun up using the same blueprint as github-assistant.

Relta’s Semantic Layer

Semantic layers are not new. However, most software developers are not familiar with them, and building one has been a manual, iterative process. Relta simplifies this. The existing semantic layer was put together in less than an hour using questions we drafted for the data. Relta creates views on a DuckDB instance based on this semantic layer and materializes these views by loading the raw data into DuckDB. This creates isolated databases where the data is modeled around the business metrics.

Based on performance and user feedback, Relta will propose changes to the semantic layer and automatically raise PRs on the repo, which we will deploy to improve performance.

Relta semantic layer builder

We believe that, in the future, producers of data will create and publish semantic layers together with their datasets, which downstream application developers can use to set up natural language interfaces such as github-assistant on top of that data.

Generative Chart UIs powered by assistant-ui

assistant-ui powers dynamic visualizations in github-assistant. After Relta returns SQL results, assistant-ui’s agent selects the appropriate chart type and generates a chart title. The results are then streamed to the client and displayed using shadcn Charts.

Next Steps

We plan on continuing the work on github-assistant in a few ways:

We are just scratching the surface of GitHub data. We want to crowd-source end user questions to add additional parts of the data to the semantic layer.
Some of the LLM calls by the agents are for simple tasks (such as metric selection). We want to optimize LLM usage by moving these to small, locally run LLMs. We believe in local first and want to experiment with pushing everything to the browser.
We want to support saving generated charts to dashboards.

If you are interested in contributing or learning more about Relta or assistant-ui, we would love to chat with you. Please reach out to amir [at] relta.dev or simon [at] assistant-ui.com

github-assistant was originally published in Relta on Medium, where people are continuing the conversation by highlighting and responding to this story.

Reliable and secure natural language interfaces for SaaS

Amir Zohrenejad — Thu, 10 Oct 2024 17:38:32 GMT

In the previous two posts (here and here), I talked about why existing text-to-SQL solutions fall short and how a metrics layer can set up guardrails around the produced output. In this article I will discuss how Gaurav Bhatnagar and I implemented these ideas in the Relta Python library.

The goal we set out to achieve with Relta was simple: allow a SaaS developer to launch a reliable and secure natural language interface to their SQL data in less than two hours. That left us with three challenges to solve:

How do we make the engine reliable?
How do we ensure data security and privacy?
How do we deliver a great developer experience (easy to deploy and maintain)?

Accuracy

Given the results of our experiments on real world datasets (described in this post), we knew we wanted to ground the produced SQL in a pre-defined metrics layer. This is different from RAG, few shot prompting or fine-tuning a model. The agent can only produce SQL from the metrics layer and will simply respond that it cannot answer the question if it is not defined in the metrics layer.

Given the audience (developers), we decided on JSON as the format to describe the semantic layer. Building a good metrics layer can be a difficult and tricky process. We use an LLM agent to propose the initial metrics layer from DDL and an initial set of questions.

{
  "name": "customer_consumption",
  "description": "Tracks the monthly consumption of customers, including total and average amounts, and allows filtering by customer segment and currency.",
  "datasource": "debit_card_specializing",
  "dimensions": [
    {
      "name": "customerid",
      "description": "Unique identifier for each customer."
    },
    {
      "name": "date",
      "description": "Date of the transaction in YYYY-MM-DD format."
    },
    {
      "name": "currency",
      "description": "Currency used for the transaction, such as EUR or CZK."
    },
    {
      "name": "segment",
      "description": "Customer segment, such as SME, LAM, or KAM."
    }
  ],
  "measures": [
    {
      "name": "total_consumption",
      "description": "Total consumption amount for the customer in the specified period.",
      "agg_operation": "SUM",
      "expr": "consumption"
    },
    {
      "name": "average_consumption",
      "description": "Average consumption amount for the customer in the specified period.",
      "agg_operation": "AVG",
      "expr": "consumption"
    }
  ],
  "sample_questions": [
    "What is the highest monthly consumption in the year 2012?",
    "What was the average monthly consumption of customers in SME for the year 2013?",
    "How many percent of LAM customer consumed more than 46.73?"
  ],
  "sql_to_underlying_datasource": "SELECT yearmonth.customerid, yearmonth.date, customers.currency, customers.segment, yearmonth.consumption FROM public.yearmonth JOIN public.customers ON yearmonth.customerid = customers.customerid"
}
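To make the guardrail concrete, here is a minimal sketch of compiling a definition like the one above into SQL. It is a hypothetical illustration, not Relta's actual compiler: the function name, the shortened metric dictionary, and the `consumption_rows` table are assumptions. The key property is that a measure or dimension missing from the definition raises an error instead of producing SQL.

```python
def compile_metric(metric: dict, measure: str, dimensions: list[str]) -> str:
    """Build SQL for one measure grouped by the requested dimensions.

    Raises KeyError if the measure or a dimension is not defined in the
    metric, which is the signal to reply "I cannot answer that".
    """
    measures = {m["name"]: m for m in metric["measures"]}
    known_dims = {d["name"] for d in metric["dimensions"]}
    if measure not in measures:
        raise KeyError(f"unknown measure: {measure}")
    unknown = [d for d in dimensions if d not in known_dims]
    if unknown:
        raise KeyError(f"unknown dimensions: {unknown}")

    m = measures[measure]
    select = dimensions + [f'{m["agg_operation"]}({m["expr"]}) AS {m["name"]}']
    sql = f'SELECT {", ".join(select)} FROM ({metric["sql_to_underlying_datasource"]}) AS src'
    if dimensions:
        sql += f' GROUP BY {", ".join(dimensions)}'
    return sql


# Shortened, hypothetical version of the metric definition shown above.
metric = {
    "name": "customer_consumption",
    "dimensions": [{"name": "segment"}, {"name": "currency"}],
    "measures": [
        {"name": "total_consumption", "agg_operation": "SUM", "expr": "consumption"},
    ],
    "sql_to_underlying_datasource": "SELECT segment, currency, consumption FROM consumption_rows",
}

sql = compile_metric(metric, "total_consumption", ["segment"])
```

In this sketch the LLM's job shrinks to picking a measure and dimensions from a fixed menu, and the SQL itself is assembled deterministically.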

Data security and privacy

Most SaaS products run on multi-tenant DBs: different users’ data sits in the same database tables. As a rule, we are opposed to running LLM-generated SQL on a production database. In a multi-tenant database holding different customers’ data, doing so risks a data leak between customers, which could simply kill a company. Therefore, we knew our architecture should make it impossible to affect the production database.

We use per-user DuckDB instances in Relta to achieve this. For those who are not familiar, DuckDB is an in-process analytical database (i.e., SQLite but for analytical workloads). The Relta library spins up an in-process DuckDB database and pulls in only the specific user’s data from the production database. Moreover, Relta compiles the metrics layer into views in this DuckDB instance, creating a per-user, transient, sandboxed subset of the underlying data that is modeled for natural language querying.

As a result, it is impossible for data to leak between users or for any SQL attack to reach the production database.
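The sandbox idea can be sketched with the standard library. To keep the example dependency-free, Python's built-in sqlite3 is swapped in for DuckDB, and the table layout is a hypothetical, simplified version of the schema above. This shows the pattern only and is not Relta's implementation.

```python
import sqlite3


def build_sandbox(prod: sqlite3.Connection, customer_id: int) -> sqlite3.Connection:
    """Copy one customer's rows into a fresh in-memory database."""
    sandbox = sqlite3.connect(":memory:")
    sandbox.execute("CREATE TABLE yearmonth (customerid INTEGER, date TEXT, consumption REAL)")
    rows = prod.execute(
        "SELECT customerid, date, consumption FROM yearmonth WHERE customerid = ?",
        (customer_id,),
    ).fetchall()
    sandbox.executemany("INSERT INTO yearmonth VALUES (?, ?, ?)", rows)
    # The metric is exposed as a view; generated SQL only ever sees this sandbox.
    sandbox.execute(
        "CREATE VIEW customer_consumption AS "
        "SELECT customerid, date, consumption FROM yearmonth"
    )
    sandbox.commit()
    return sandbox


# Hypothetical multi-tenant production table holding two customers.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE yearmonth (customerid INTEGER, date TEXT, consumption REAL)")
prod.executemany(
    "INSERT INTO yearmonth VALUES (?, ?, ?)",
    [(1, "2012-01", 10.0), (1, "2012-02", 30.0), (2, "2012-01", 99.0)],
)

sandbox = build_sandbox(prod, customer_id=1)
# Even a query with no customer filter can only see customer 1's rows.
total = sandbox.execute("SELECT SUM(consumption) FROM customer_consumption").fetchone()[0]
```

The filter on `customerid` is applied once, in trusted code, when the sandbox is built. Generated SQL never needs to carry it, so a missing or wrong filter cannot expose another customer's rows.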

Relta high level architecture

Developer experience

In our experience, a small number of well-defined metrics answer most real-world user questions. However, even with assistance from an LLM and full control over manually refining the metrics, the initial metrics layer rarely covers all user questions. Therefore, Relta has a separate “Refinements agent” which proposes changes to the metrics layer based on user feedback. These changes are integrated directly into the developer workflow, showing up as PRs against the JSON that defines the metrics in the developer’s repo.

In the following video Gaurav Bhatnagar walks through how Relta works.

https://medium.com/media/40d07fcefb1aa7a54c977faf1b062a29/href

We are currently working with a select number of design partners to build AI assistants for SQL data in their SaaS products. We plan to release an open source version of Relta soon. If you are interested please sign up for our waitlist here.

Reliable and secure natural language interfaces for SaaS was originally published in Relta on Medium, where people are continuing the conversation by highlighting and responding to this story.

Is it possible to accurately query relational data in plain English?

Amir Zohrenejad — Fri, 27 Sep 2024 14:02:16 GMT

… or why AI is happier in a metrics box

In an earlier post, I talked about text-to-SQL, the excitement around it with the launch of ChatGPT and the subsequent disappointment of users, founders and investors. Assuming people actually want to interact with software in natural language (which I think is self-evident), the follow up question is whether these interfaces are even possible to build with today’s tech. But first: what does a good natural language interface to data look like?

Who can design a good natural language interface?

Before talking about the technical feasibility of a technology, we should take a look at the product that developers are trying to build with it. The main natural language interfaces currently in use are AI assistants (aka chatbots that can take actions). One overarching challenge in building an AI assistant is that, unlike a GUI, which has breadcrumbs and visual cues to signal possible interactions, users often don’t know what they can prompt the assistant for. Therefore, a well-designed AI assistant must be able to identify unanswerable cases for its domain and nudge the user towards items it is trained to handle. Furthermore, it will often need to route the question to one of the different data sources it is connected to, depending on the question.

Let’s take an example of an AI assistant for some generic HR software. The AI Assistant needs to classify and retrieve data from different data sources for each of the following types of questions:

The ability to detect “unanswerable” questions is important in building trust between users and the AI assistant. If the assistant hallucinates answers to questions it cannot ground from a data source, users will stop using it. In the real world, companies tend to roll out new data connections for their AI assistants over time, often starting with unstructured sources first (similar to #2 in the above sample). In such a case if the assistant receives any of the other questions it should refrain from answering and suggest the topics it currently can cover. The issue of answerability will come up again as we dive deeper into connecting relational data sources to chatbots.

Connecting relational data

Circling back to the main topic of this post: assuming our AI assistant covers the basics of answerability and routing, can a tool be built for it to handle questions from relational data reliably? We saw in part 1 of this blog series that allowing an LLM agent to write raw SQL will not work, as natural language is ambiguous and the solution space of possible SQL statements is far too large.

An approach that has gained mindshare recently is to use a semantic layer (or metrics layer). This approach introduces an intermediate step where the natural language user input is matched to a set of pre-defined metrics and the metric is then compiled into SQL. The idea of a metrics layer in data stacks is not new. In a popular blog post from 2021 on self-serve analytics, Benn Stancil (ex-CTO of Mode) popularized the idea of a metrics layer by pointing out that:

Self-serve [analytics] is a misunderstood (or, at least, misrepresented) problem. Because the most common question people have is “How often did this thing happen?,” effective self-serve is less about complex analysis and more about metric extraction. People “want to choose from a list of understood KPIs, apply it to a filtered set of records, and aggregate it by a particular dimension. It’s analytical Mad Libs — show me average order size for orders that used gift cards by month.”

The observation above was not about natural language interfaces to SQL data. However, at a product level, anyone interacting with an AI assistant to retrieve data is doing self-serve analytics. We set out to validate this hypothesis by analyzing real-world usage of Dataherald’s text-to-SQL engine, and the usage data supported it.

In one case, a Series B accounting software vendor had rolled out an AI assistant to its internal support team of accountants that answered customer questions from their data. Their connected dataset consisted of 71 tables with a total of 574 columns. From a set of 674 user prompts, 650 (~96.5%) were answerable, 14 (~2%) were unanswerable due to being normative, and 10 (~1.5%) were unrelated to the data.

We then used an LLM to generate a metrics layer based on the DDL and 92 sample natural language prompts. The suggested metrics layer had a total of 5 metrics, which referenced only 28 of the 574 columns in the underlying data (less than 5%). When we ran the test set of 674 prompts, 75% were answered accurately using the suggested metrics layer. Manually adding two uncaptured metrics raised this to 95%, so almost all of the answerable questions were captured by 7 metrics referencing 35 of the 574 columns.
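The proportions quoted above can be recomputed from the raw counts. This quick sketch uses only the numbers given in the text. The variable names are illustrative.

```python
total_prompts = 674
categories = {"answerable": 650, "normative": 14, "unrelated": 10}
assert sum(categories.values()) == total_prompts

# Share of prompts in each category, as a percentage rounded to one decimal.
shares = {name: round(100 * n / total_prompts, 1) for name, n in categories.items()}

total_columns = 574
initial_coverage = round(100 * 28 / total_columns, 1)  # 5 suggested metrics
refined_coverage = round(100 * 35 / total_columns, 1)  # 7 metrics after refinement

print(shares, initial_coverage, refined_coverage)
```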

Rephrasing the problem

So to build an accurate natural language interface we need an engine that can:

Set up an initial metrics layer from the underlying datasource and some sample questions
Match questions to the metrics layer to identify answerable, unanswerable and unrelated questions
Suggest new metrics to the developer that can be deployed

We took these lessons (plus what we learned about data privacy while building Dataherald) into our work on Relta.

In the next post I will discuss the details of how we built this engine at Relta.

Is it possible to accurately query relational data in plain English? was originally published in Relta on Medium, where people are continuing the conversation by highlighting and responding to this story.

The problem of text-to-SQL

Amir Zohrenejad — Mon, 26 Aug 2024 21:19:02 GMT

This post is part of a three part series. Click to read part 2 and part 3.

When software developers realized GPT-3.5 could write syntactically correct SQL, the race was on to build text-to-SQL into everything. Every Y Combinator batch since W22 includes a few “chat with your data” startups, data platforms have added natural language interfaces, and SaaS vendors have rolled out AI assistants that answer questions from relational data. While the accuracy of these models keeps increasing on academic benchmarks (new benchmarks have to be built!), the features built on text-to-SQL are mediocre at best and annoying at worst. It has reached the point where some are questioning whether users even want these features at all, even though there are few if any solutions that actually work.

Before continuing let’s point out that text-to-SQL features fall into two broad product categories:

SQL co-pilots — that deliver a first draft SQL to a data scientist who can modify it
Natural language querying — where there is no human in the loop, the user does not see the intermediate SQL nor the underlying schema

The focus of this blog is on case #2, where the user is not a data scientist and is not familiar with the schema they are querying.

There are two visions when it comes to the latter case of natural language querying:

Self serving from the data warehouse — you are a midsize company or large enterprise. You have invested lots of capital to become “data-driven.” This includes setting up a data warehouse, pipelines from various sources and lots of BI dashboards. Still very few people in the organization are using the data, and the data team is complaining about being inundated with requests from business teams. If only the analytical business teams could self serve from the data warehouse directly, better decisions would be made and profits would surely increase.
Natural language interfaces in SaaS applications — you are the CTO of a SaaS company (CRM, HR + payroll, analytics, …). It is 2024 and you have to become an AI company. Users like ChatGPT so they also must love to interact with your software through an AI assistant. Engineering resources are diverted to building an LLM-based AI assistant that can retrieve data from the production database, so users no longer have to point and click through dashboards and reports.

Both visions seem plausible. So, two years after ChatGPT and with millions of dollars spent, why is neither of these visions even close to becoming a reality? Existing text-to-SQL approaches broadly do the following:

Connect the tool to a subset of the schemas, tables, columns, or views in your existing relational database
Add “context” by providing verified samples for few shot prompting or fine-tuning. Add more context by chunking and adding metadata from unstructured sources like business documentation.
Build an LLM agent that assembles a prompt with the most relevant semantic context, generates SQL, executes it against the DB, and iterates to recover from errors

In short they try to make the LLM understand a dataset and schema that was designed by a software developer to optimize performance on reads and writes by sprinkling some magical pixie dust (context). As a result the solutions are:

Non-deterministic — similar prompts can create different answers, as shown in the example below.
Hard to train — there is no direct correlation between time invested in adding context and the resulting accuracy. It is a trial and error approach.
Inaccurate — as a result of the above, accuracy in real-world scenarios is 60%

The following is a real-world example of a simple question about a single table that generates non-deterministic results even with extensive configuration and added context. The table holds marketing leads recorded for a healthcare business, and the question is: “What are the top 3 most common reasons for losing a patient?”

CREATE TABLE leads ( 
  facility INT64, 
  sales_id INT64, 
  patient_id INT64, 
  source STRING, 
  cost INT64, 
  campaign_effectiveness STRING, 
  zip_code INT64, 
  lost_reasons STRING, 
  lead_id INT64, 
  status STRING, 
  email STRING, 
  initial_contact_date DATE, 
  last_contact_date DATE);

The above table has been scanned and its low-cardinality columns identified: the status column is low-cardinality, as it can only take the values converted, lost or In progress. Furthermore, an admin has provided verified “golden SQL” queries, which are used in few-shot prompts, along with additional descriptions of the columns.

However the best in class text-to-SQL agents still produce the following two answers:

-- Question: What are the top 3 most common reasons for losing a patient?

-- First potential answer, produced ~80% of the time
-- filters for distinct patient_id but not on status 
SELECT lost_reasons, COUNT(distinct patient_id) AS number_of_patients_lost 
FROM `leads` 
WHERE lost_reasons IS NOT NULL 
GROUP BY lost_reasons 
ORDER BY number_of_patients_lost DESC 
LIMIT 3

-- Second potential answer, produced ~20% of the time
-- Assumes a patient_id can be lost multiple times and filters on status = 'Lost'
SELECT lost_reasons, COUNT(patient_id) AS number_of_patients_lost
FROM `leads`
WHERE lost_reasons IS NOT NULL AND status = 'Lost'
GROUP BY lost_reasons
ORDER BY number_of_patients_lost DESC
LIMIT 3

The two SQL queries above produce different answers from the data. This is not surprising: even to a SQL-proficient human, both answers could be correct without additional information. In fact, one could assume both would generate the same answer if the data were clean, that is, if each patient_id could only be a lead once and the status and lost_reasons were recorded correctly for every lost lead. But in the real world, structured data is messy and cases like this are the norm.
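To make the divergence concrete, here is a minimal sketch that runs both queries against an in-memory SQLite database. The rows are invented purely for illustration (they are not the healthcare data discussed above), the table is cut down to the three columns the queries touch, and a secondary sort on lost_reasons has been added so that ties come back in a stable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (patient_id INTEGER, lost_reasons TEXT, status TEXT)")

# Invented, deliberately messy rows (illustration only).
rows = [
    (1, "Price", "Lost"),
    (1, "Price", "Lost"),            # the same patient was lost twice
    (2, "Price", "Lost"),
    (3, "Distance", "Lost"),
    (4, "Distance", "In progress"),  # reason recorded, but the lead is not lost
    (5, "Distance", "In progress"),
    (6, "Distance", "Converted"),
    (7, "Insurance", "Lost"),
]
conn.executemany("INSERT INTO leads VALUES (?, ?, ?)", rows)

# First answer: distinct patients, no filter on status.
QUERY_A = """
SELECT lost_reasons, COUNT(DISTINCT patient_id) AS number_of_patients_lost
FROM leads
WHERE lost_reasons IS NOT NULL
GROUP BY lost_reasons
ORDER BY number_of_patients_lost DESC, lost_reasons
LIMIT 3
"""

# Second answer: every lost lead counts, filtered on status = 'Lost'.
QUERY_B = """
SELECT lost_reasons, COUNT(patient_id) AS number_of_patients_lost
FROM leads
WHERE lost_reasons IS NOT NULL AND status = 'Lost'
GROUP BY lost_reasons
ORDER BY number_of_patients_lost DESC, lost_reasons
LIMIT 3
"""

result_a = conn.execute(QUERY_A).fetchall()
result_b = conn.execute(QUERY_B).fetchall()
print(result_a)  # top reason is Distance
print(result_b)  # top reason is Price
```

On this toy data the two queries disagree even on the top reason, although both are reasonable readings of the question.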

In order for natural language querying to work, the dataset has to be modeled around the questions users will ask of the data. For our example above, lost_lead needs to be a deterministic metric. However, coming up with a deterministic set of metrics that covers all KPIs from raw data schemas is an incredibly hard and manual task. I will write more about how this can be simplified in a future post.

The problem of text-to-SQL was originally published in Relta on Medium, where people are continuing the conversation by highlighting and responding to this story.

Fine-tuning GPT-3.5-Turbo for Natural Language to SQL

Amir Zohrenejad — Thu, 31 Aug 2023 12:24:15 GMT

Photo by Mariia Shalabaieva on Unsplash

Background

Allowing non-technical users to ask questions from a database has been a problem of interest in academia and industry for years. The recent advances in Large Language Model (LLM) technology, such as GPT-4, have improved the accuracy of proposed solutions. However, since the most advanced LLMs have not been open for fine-tuning, recent work in the space has focused on creating Retrieval-Augmented Generation (RAG) algorithms that can enable complex Natural Language to SQL (NL-to-SQL) scenarios without modifying the underlying LLM.

Last week, OpenAI opened up GPT-3.5-turbo for fine-tuning. In this post, we will fine-tune our own NL-to-SQL model and compare its performance against the state of the art RAG approach. We will use the Spider dataset from Yale university as our test benchmark.

Fine-tuning GPT-3.5-Turbo for NL-to-SQL

Like all model training and fine-tuning, the first step of fine-tuning GPT-3.5-Turbo is the creation and upload of a training dataset. Since GPT-3.5-Turbo is a ChatModel, this dataset must use the following format and be uploaded as a JSONL file:

{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}

The Spider dataset has a holdout test set of 2147 question/SQL pairs, a development set of 1034 question/SQL pairs, and a training set of 7000 question/SQL pairs. We will build our fine-tuning dataset in the structure above from the Spider training set.

Creating the training dataset

An NL-to-SQL task is defined as follows: given a question and database, identify a SQL query that when executed against the database returns a result set that can answer the question. Various approaches have been explored on how best to prompt LLMs for this task, and it is generally agreed that the prompt needs to include an instructional component, details of the database schema, information about the database’s content, a set of task-specific demonstrations and of course the actual question at hand.

Given the format of the ChatModel training data, the elements above have to be presented within the following three prompts:

system_prompt — will contain the instruction, database schema and database content
user_prompt — will contain the natural language question
assistant_prompt — where the SQL will be provided together with a reasoning step
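As a minimal sketch of how the three prompts map onto one line of the JSONL file, the helper below serializes a single training sample. The function name make_training_line is hypothetical and the system prompt is abbreviated; only the message structure comes from the format shown earlier.

```python
import json


def make_training_line(system_prompt, user_prompt, assistant_prompt):
    """Serialize one training sample as a single JSONL line (hypothetical helper)."""
    record = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": assistant_prompt},
        ]
    }
    # json.dumps escapes newlines inside the prompts, so each record stays on one line.
    return json.dumps(record, ensure_ascii=False)


line = make_training_line(
    "You are an assistant that is an expert in generating Sqlite SQL queries.\n...",
    "How many heads of the departments are older than 56 ?",
    "The SQL query I'll be generating is:\nSELECT count(*) FROM head WHERE age > 56",
)
print(line)
```

Writing one such line per sample to a file produces the JSONL file to upload.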

Let’s look at how to create each of these for our NL-to-SQL training dataset.

The system prompt

Creating the system_prompt is by far the most complex part of this exercise. At a minimum, the system_prompt needs to include:

The system instruction
The DB schema
Information about the DB content

In addition, for any real-world use case with a large number of tables, the samples in the training set should also train the model to select the correct tables from the DB for the SQL query (i.e., perform schema-linking).

System Instruction

For the instruction we used the following standard prompt:

You are an assistant that is an expert in generating Sqlite SQL queries.
Having the access to database content, generate a correct Sqlite SQL query for the given question.
### Database content ###

Database Schema

In the literature there are many proposed prompt formats for the database schema and content with no clear consensus around which performs best. We found the following to be the optimal representation of the database schema:

CREATE TABLE concert (
 "concert_ID" INTEGER NOT NULL,
 "concert_Name" TEXT NOT NULL, - the name of the concert
 "Theme" TEXT, - theme of the concert
 "Stadium_ID" TEXT NOT NULL,
 "Year" TEXT, PRIMARY KEY ("concert_ID"),
 FOREIGN KEY("Stadium_ID")
 REFERENCES stadium ("Stadium_ID")
 )

CREATE TABLE singer (
"Singer_ID" INTEGER NOT NULL,
"Name" TEXT, - name of the singer
"Country" TEXT NOT NULL, - country where the singer born
"Song_Name" TEXT NOT NULL, - the name of the song produced by the singer
"Song_release_year" TEXT, - The release year of the song
"Age" INTEGER,
"Is_male" BOOLEAN NOT NULL,
PRIMARY KEY ("Singer_ID")
)

Database Content

After much experimentation we found the following template to perform the best at training the model about the database content:

/*
Columns in concert and 3 examples in each column for the high cardinality columns :
concert_ID: 1025 , 1101 , 1247
concert_Name : "Fire", "Dance", "Sky"
Stadium_ID : 9, 10, 11
*/
/*
Columns in concert and all categories for the low cardinality columns :
Theme : " ROCK ", " POP ", " HIP-HOP "
Year : 2022, 2021, 2023, 2020
*/

/*
Columns in singer and 3 examples in each column for the high cardinality columns :
Singer_ID : 10235 , 110231 , 1242447
Name : "Jordan", "Gabriel", "Tiffany"
Country : "Iran", "India", "Canada"
Song_Name : "dance in the fire", "rain", "sky"
Age : 19, 20, 21
*/
/*
Columns in singer and all categories for the low cardinality columns :
Is_male : "MALE", "FEMALE",
Song_release_year : 2022, 2021, 2023, 2020
*/

An important element in the database content is how to identify categorical (low cardinality) columns. The threshold for distinguishing between low and high cardinality columns depends on the context window size of the Large Language Model (LLM) being fine-tuned. Given the 4096 token context window of GPT-3.5-turbo, we determined 20 tokens as the appropriate threshold between low and high cardinality columns.
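The sketch below shows one way such a rule could be implemented. It rests on assumptions: it reads the 20-token threshold as a budget for listing all distinct values of a column, and it approximates tokens by counting whitespace-separated words instead of using the model's tokenizer. The function describe_column and its parameters are hypothetical.

```python
def describe_column(name, values, max_tokens=20, n_examples=3):
    """Return ("low" | "high", content line) for one column (hypothetical helper)."""
    # Distinct non-null values, in first-seen order.
    distinct = list(dict.fromkeys(v for v in values if v is not None))
    listing = ", ".join(str(v) for v in distinct)
    # Crude token estimate (assumption): whitespace-separated words.
    approx_tokens = len(listing.replace(",", " ").split())
    if approx_tokens <= max_tokens:
        # Low cardinality: list every category.
        return "low", f"{name} : {listing}"
    # High cardinality: show only a few example values.
    examples = ", ".join(str(v) for v in distinct[:n_examples])
    return "high", f"{name} : {examples}"


low = describe_column("subscription_type", ["Customer", "Subscriber", "Customer"])
high = describe_column("id", list(range(100)))
print(low)
print(high)
```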

Schema Linking

The final challenge in creating the system_prompts for our training set is to provide samples in a way that trains the model to correctly perform schema-linking on the database. To do this, we employed the following heuristic: for each individual NL <> SQL sample, we included a random selection of other tables from the DB in addition to the correct tables until we reached the context window limit of 4000 tokens. To mitigate the influence of positional information, we further randomized the order of tables. In short, each system_prompt included the schema and content of the relevant tables mixed in with other irrelevant tables, helping train the model to pick the correct tables for the query.
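The heuristic can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the experiment: build_system_prompt and approx_tokens are hypothetical names, each table is passed in as a ready-made text block (schema plus content comment), and tokens are approximated by whitespace-separated words.

```python
import random

INSTRUCTION = (
    "You are an assistant that is an expert in generating Sqlite SQL queries.\n"
    "Having the access to database content, generate a correct Sqlite SQL query "
    "for the given question.\n"
    "### Database content ###"
)


def approx_tokens(text):
    # Stand-in for a real tokenizer (assumption): count whitespace-separated words.
    return len(text.split())


def build_system_prompt(relevant, distractors, budget=4000, seed=None):
    """Mix the relevant tables with random other tables, up to a token budget."""
    rng = random.Random(seed)
    chosen = list(relevant)  # the correct tables are always included
    used = approx_tokens(INSTRUCTION) + sum(approx_tokens(t) for t in chosen)
    pool = list(distractors)
    rng.shuffle(pool)  # random selection of irrelevant tables
    for table in pool:
        cost = approx_tokens(table)
        if used + cost > budget:
            break  # stop once the budget would be exceeded
        chosen.append(table)
        used += cost
    rng.shuffle(chosen)  # randomize order to reduce positional cues
    return INSTRUCTION + "\n" + "\n\n".join(chosen)


head = 'CREATE TABLE head ("head_ID" INTEGER, age REAL)'
others = ["CREATE TABLE trip (id INTEGER)", 'CREATE TABLE department ("Name" TEXT)']
prompt = build_system_prompt([head], others, budget=1000, seed=0)
tight = build_system_prompt([head], others, budget=0, seed=0)
print(prompt)
```

With a generous budget all three tables appear in shuffled order; with no room to spare only the relevant table is kept.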

We will now put all of this together to build our system_prompts.

For the sample below from Spider:

Question : "How many heads of the departments are older than 56 ?"
SQL: "SELECT count(*) FROM head WHERE age > 56"

The system_prompt will be

You are an assistant that is an expert in generating Sqlite SQL queries.
Having the access to database content, generate a correct Sqlite SQL query for the given question.
### Database content ###
CREATE TABLE trip (
 id INTEGER,
 duration INTEGER,
 start_date TEXT,
 start_station_name TEXT,
 start_station_id INTEGER,
 end_date TEXT,
 end_station_name TEXT,
 end_station_id INTEGER,
 bike_id INTEGER,
 subscription_type TEXT,
 zip_code INTEGER,
 PRIMARY KEY (id)
 )
/* Columns in trip and 3 examples in each column for high cardinality columns :
 id : 900645, 900752, 900524
 duration : 1131, 2146, 1155
 start_date : 8/21/2015 17:39, 8/21/2015 17:03, 8/21/2015 17:16
 start_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
 start_station_id : 56, 65, 49
 end_date : 8/21/2015 17:19, 8/21/2015 18:08, 8/21/2015 17:32
 end_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
 end_station_id : 56, 65, 49
 bike_id : 586, 56, 65
 zip_code : 94070, 94530, 94040-1724
 */ 
/* Columns in trip and all categories for low cardinality columns :
 subscription_type : Customer, Subscriber
 */

 CREATE TABLE management (
 "department_ID" INTEGER,
 "head_ID" INTEGER,
 temporary_acting TEXT,
 PRIMARY KEY ("department_ID", "head_ID"),
 FOREIGN KEY("head_ID") REFERENCES head ("head_ID"),
 FOREIGN KEY("department_ID") REFERENCES department ("Department_ID")
 )
 /* Columns in management and all categories for low cardinality columns :
 department_ID : 7, 15, 2, 11
 head_ID : 5, 4, 6, 3, 10
 temporary_acting : Yes, No
 */

 CREATE TABLE department (
 "Department_ID" INTEGER,
 "Name" TEXT,
 "Creation" TEXT,
 "Ranking" INTEGER,
 "Budget_in_Billions" REAL,
 "Num_Employees" REAL,
 PRIMARY KEY ("Department_ID")
 )
 /* Columns in department and 3 examples in each column for high cardinality columns :
 Department_ID : 1, 13, 11
 Name : Energy, Interior, Health and Human Services
 Creation : 1913, 1979, 1989
 Ranking : 1, 13, 11
 Budget_in_Billions : 10.7, 77.6, 59.7
 Num_Employees : 112557.0, 3000000.0, 235000.0
 */


...


CREATE TABLE head (
 "head_ID" INTEGER,
 name TEXT,
 born_state TEXT,
 age REAL,
 PRIMARY KEY ("head_ID")
 )
 /* Columns in head and all categories for low cardinality columns :
 head_ID : 1, 2, 5, 7, 8, 4, 6, 3, 10, 9
 name : Jeff Maggert, Pádraig Harrington, Billy Mayfair, K. J. Choi, Dudley Hart, Sergio García, Stewart Cink, Tiger Woods, Nick Faldo, Franklin Langham
 born_state : Delaware, Connecticut, Alabama, California, Florida
 age : 69.0, 67.0, 68.0, 53.0, 56.0, 52.0, 50.0, 43.0
 */

...

The user prompt

The user prompt is simple: it is the user question for each sample in Spider. For example:

How many heads of the departments are older than 56 ?

The assistant prompt

The assistant prompt is also simple, containing the associated SQL query from Spider and the reasoning step to find the correct column and correct table for the SQL query. To construct the reasoning step we simply extracted the tables and columns that are used in the SQL query. For example:

To construct the query, I'll be working with the following tables: head.
From these tables, I'll be using the following columns: age.
The SQL query I'll be generating is:
SELECT count(*) FROM head WHERE age > 56
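A small sketch of how this assistant prompt could be assembled is shown below. The function build_assistant_prompt is a hypothetical name, and the tables and columns are passed in directly; extracting them from the SQL query is assumed to happen elsewhere.

```python
def build_assistant_prompt(sql, tables, columns):
    """Format the reasoning step followed by the SQL query (hypothetical helper)."""
    return (
        "To construct the query, I'll be working with the following tables: "
        f"{', '.join(tables)}.\n"
        "From these tables, I'll be using the following columns: "
        f"{', '.join(columns)}.\n"
        "The SQL query I'll be generating is:\n"
        f"{sql}"
    )


assistant_prompt = build_assistant_prompt(
    "SELECT count(*) FROM head WHERE age > 56",
    tables=["head"],
    columns=["age"],
)
print(assistant_prompt)
```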

Submitting the training set for fine-tuning

Once we have created the JSONL file (you can find a small sample here), the next step involves uploading the created file to OpenAI using the following command:

import os
import openai  # legacy (pre-1.0) openai SDK interface
openai.api_key = os.getenv("OPENAI_API_KEY")
print(openai.File.create(file=open("spider-finetuning.jsonl", "rb"), purpose="fine-tune"))

After uploading the file you can check the status of the upload using the following command:

print(openai.File.retrieve(id="file-id"))
#OR
print(openai.File.list())

The result should be something like this:

{
  "object": "file",
  "id": "file-id",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 71699079,
  "created_at": 1693343752,
  "status": "uploaded",
  "status_details": null
}

When the status has changed to processed (similar to below) you can use the file for fine-tuning:

{
  "object": "file",
  "id": "file-id",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 71699079,
  "created_at": 1693343752,
  "status": "processed",
  "status_details": null
}
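Instead of checking the status by hand, the wait can be scripted. The helper below is a generic polling sketch: wait_for_status is a hypothetical name, the status is supplied by a caller-provided function so that no particular SDK call is assumed, and the default failure states are assumptions.

```python
import time


def wait_for_status(fetch_status, target="processed", failed=("error", "failed"),
                    poll_seconds=10.0, max_polls=60, sleep=time.sleep):
    """Poll fetch_status() until it returns target (hypothetical helper)."""
    for _ in range(max_polls):
        status = fetch_status()
        if status == target:
            return status
        if status in failed:  # assumed terminal failure states
            raise RuntimeError(f"terminal status: {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"status did not reach {target!r} after {max_polls} polls")


# Simulated sequence of statuses, standing in for real API responses.
statuses = iter(["uploaded", "uploaded", "processed"])
final = wait_for_status(lambda: next(statuses), sleep=lambda seconds: None)
print(final)
```

With the legacy SDK used in this post, fetch_status could be something like lambda: openai.File.retrieve(id="file-id")["status"], assuming the returned object supports dictionary-style access.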

Now, we are ready to start the fine-tuning job. To create a fine-tuning job you can use the following python code:

print(openai.FineTuningJob.create(
    training_file="file-id",
    model="gpt-3.5-turbo",
    suffix="spider",
    hyperparameters={
        "n_epochs": 2,  # number of epochs; 2 were used for this post
    },
))

The duration of the fine-tuning process will vary depending on the size of the fine-tuning dataset. There is a maximum token limit for fine-tuning, which is set at 50,000,000 tokens. Therefore, when working with the Spider dataset, we reduced the number of samples from 7,000 to 5,750 and conducted fine-tuning for a total of 2 epochs.

You can check the status of the fine-tuning job using the following command:

print(openai.FineTuningJob.retrieve(id="ftjob-id"))

The result should be something like this:

{
  "object": "fine_tuning.job",
  "id": "ftjob-id",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1693346245,
  "finished_at": 1693353313,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:dataherald:spider:id",
  "organization_id": "org-id",
  "result_files": [
    "file-id"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-id",
  "hyperparameters": {
    "n_epochs": 2
  },
  "trained_tokens": 44722020
}
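The reported numbers can be cross-checked with some quick arithmetic. The sketch below assumes that trained_tokens counts every epoch, so that the tokens in the training file are trained_tokens divided by n_epochs; all other figures are taken from the text and the job output above.

```python
TOKEN_LIMIT = 50_000_000     # maximum tokens for a fine-tuning job
trained_tokens = 44_722_020  # from the job output
n_epochs = 2
n_samples = 5_750            # samples actually used
full_train_size = 7_000      # Spider training set size

# Assumption: trained_tokens = tokens in the file x number of epochs.
tokens_per_epoch = trained_tokens / n_epochs
tokens_per_sample = tokens_per_epoch / n_samples

# What the full training set would have cost at the same average sample size.
projected_full = tokens_per_sample * full_train_size * n_epochs

# Largest sample count that stays under the limit at 2 epochs.
max_samples = int(TOKEN_LIMIT / (tokens_per_sample * n_epochs))

print(f"average tokens per sample: {tokens_per_sample:.0f}")
print(f"projected tokens for all 7,000 samples: {projected_full:,.0f}")
print(f"maximum samples under the limit: {max_samples}")
```

Under that assumption the average sample is just under 4,000 tokens, which matches the context window budget used when building the system prompts, and training on all 7,000 samples for 2 epochs would have exceeded the 50,000,000 token limit.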

Model Performance

We benchmarked the performance of the fine-tuned model against GPT-3.5-Turbo without fine-tuning and DIN-SQL + GPT-4 (the current state of the art on Spider) for zero-shot performance.

The results are as follows:

Performance of the fine-tuned GPT-3.5-Turbo against previous methods.

Fine-tuning GPT-3.5-Turbo yielded a performance improvement of nearly 11 percent, bringing its accuracy in line with DIN-SQL + GPT-4, the current state-of-the-art approach, which uses GPT-4 and employs various advanced prompting techniques, including few-shot prompting, chain-of-thought prompting and decomposed prompting.

Critically, the fine-tuned model significantly reduces both cost and processing time when compared to the DIN-SQL + GPT-4 approach. The table below provides the approximate cost and speed of the different models per single question from Spider.

Cost and speed of different models per question from Spider benchmark

As demonstrated above, the cost of the fine-tuned GPT-3.5-Turbo model is 30 times less than DIN-SQL with GPT-4 and it is 12 times faster.

Conclusion and Next Steps

The results from the experiment are clear: with an initial investment of time and money to build a training dataset the state of the art can be matched in accuracy, while being 12 times faster and 30 times cheaper.

Fine-tuning is a powerful tool in the NL-to-SQL arsenal. However, it is not a silver bullet, as few organizations have NL-to-SQL training datasets readily available. It is our belief that the best architectures will combine fine-tuned models with RAG agents. With the anticipated launch of GPT-4 fine-tuning, we expect progress in the field to accelerate further and finally unlock question-answering from structured data for all businesses.

In the next post we will show how to plug in the fine-tuned model above into the Dataherald engine and deploy it in a real world scenario.

If you are interested in NL-to-SQL discussions, you can join our Discord server. If you want to allow non-technical users to ask questions from your company’s data warehouse, please join our waitlist.

References

DIN-SQL paper: https://arxiv.org/abs/2304.11015

NL-to-SQL useful papers:

How to Prompt LLMs for Text-to-SQL: https://arxiv.org/abs/2305.11853

Divide and Prompt: https://arxiv.org/abs/2304.11556

Exploring Chain-of-Thought Style Prompting for Text-to-SQL: https://arxiv.org/abs/2305.14215

A comprehensive evaluation of ChatGPT’s zero-shot Text-to-SQL capability: https://arxiv.org/abs/2303.13547

Why Enterprise Natural Language to SQL is hard

Amir Zohrenejad — Tue, 08 Aug 2023 00:02:40 GMT

“The future of BI is conversational.” This has been the prediction of industry analysts for a number of years. Yet despite the amazing progress in conversational LLM-based applications in the past year such as ChatGPT + Bard and new powerful models like GPT-4, conversational BI is still not deployed in most companies. Business users are still looking for insights in BI dashboards and data analysts are still sifting through Slack and Jira tickets, opening up a SQL engine connected to their data warehouse and hand-writing SQL queries to answer ad-hoc business questions. Why is conversational BI still not here?

While structured data only makes up ~20% of the world’s data, the majority of enterprise data is still in structured data stores and accessible mainly through SQL queries. Therefore, at a high level, enabling conversational BI requires a solution that can translate natural language business questions into valid SQL queries that are then executed against the enterprise data warehouse. Engineers have tried to build “Natural Language to SQL” (NL2SQL) engines since the 70s using rules-based techniques, which would very quickly get too complex to be useful. But with the advancement of transformers, which have enabled tools like GitHub Copilot and OpenAI Code Interpreter, it would seem this should be a trivial problem to solve. It is not.

There are (at least) two ways a company can build an LLM-based NL2SQL engine to enable conversational BI:

Fine-tuning your own LLM – This approach would require taking an existing LLM and then training it further using NL<>SQL pairs relating to the company’s structured data. A couple of challenges with this approach are that a) coming up with the training dataset is hard and expensive and b) the most powerful LLM around (GPT-4) cannot be fine-tuned (as of this writing).
Leveraging in-context learning – The latest LLMs (like GPT-4-32K) can write SQL quite well out of the box and have a large enough context window for quite a bit of few-shot prompting and for an agent to try to recover from errors by performing follow-ups using chain-of-thought techniques. The idea here is to build an LLM agent on top of GPT-4 that can implement NL2SQL with few-shot learning.

So what are the challenges of deploying solution #2? Here are seven we have encountered:

Table and Column descriptions – Even the best data teams often do not have clear documentation about tables, columns and metadata. With the rise of ELT, where data is simply dumped in the warehouse from various sources and transformed on query, the situation becomes even worse. Therefore the table and column names might be the only info available to the engine at configuration time.
Missing Context and Metadata – There are often business definitions which live in data analysts’ heads and are not in the underlying data. We encountered a real-world home rental marketplace for which what constitutes an “active listing” is a combination of WHERE clauses that differ based on the value of another column, which specifies the building_type. In rare cases these are stored as Views on the table, but more often than not they are just stored in a query in the BI tool/dashboard.
Incomplete info in question, lack of “common sense” – “What was the average rent in Los Angeles in May 2023?” A reasonable human receiving this question would simply assume the question is about Los Angeles, CA or would confirm with the asker in a follow-up. However, an LLM usually translates this to SELECT price FROM rent_prices WHERE city = 'Los Angeles' AND month = '05' AND year = '2023', which pulls up data for Los Angeles, CA and Los Angeles, TX without even getting columns to differentiate between the two.
Speed – In order for the engine to be “conversational,” response times must be fast (sub 30s). This is often very hard to achieve, especially if the agent tries to recover from errors or evaluate generated responses with subsequent LLM calls.
Complex Queries – While GPT-4 writes simple SQL queries very well, it can often stumble on complex queries that require aggregations and joins. This is exacerbated in cases where the column name contains an action that can be done in SQL (for example Average or SUM) and in join operations on data warehouses where FOREIGN KEYS are not clearly enforced like they are in production DBs.
Privacy and Data Leaking – Many organizations do not want their database data or schema being sent to companies like OpenAI since it can leak into their training corpus.
Validation – There is no known way to identify cases where the system returns a syntactically valid but incorrect SQL query. For example, the user asks for an ‘average’ value, and the system runs an AVG instead of picking a column called ‘average_price’.
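The validation problem from the last item can be reproduced on a toy table. The sketch below uses an in-memory SQLite database with an invented rentals table and invented values; the question is imagined to be “What is the average price of unit 42?”, where the intended answer is the stored average_price column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented table (illustration only): price is the current asking price,
# average_price is a precomputed trailing average for the unit.
conn.execute("CREATE TABLE rentals (unit_id INTEGER, price REAL, average_price REAL)")
conn.executemany(
    "INSERT INTO rentals VALUES (?, ?, ?)",
    [(42, 3100.0, 2950.0), (43, 2800.0, 2700.0)],
)

# What the user meant: read the stored column.
intended_sql = "SELECT average_price FROM rentals WHERE unit_id = 42"
# What an engine might generate: aggregate the price column instead.
generated_sql = "SELECT AVG(price) FROM rentals WHERE unit_id = 42"

intended_value = conn.execute(intended_sql).fetchone()[0]
generated_value = conn.execute(generated_sql).fetchone()[0]
print(intended_value, generated_value)
```

Both queries execute without error and each returns a single plausible number, so nothing in the execution itself signals which one answers the question.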

So is enterprise conversational BI impossible in 2023? Will there be a few more years of academic papers and company AI hackathon projects before a solution can be deployed in production? We don’t think so.

While the challenges are definitely real, we believe with the right tool an enterprise data team can deploy solutions to enable business users to self-serve ad-hoc data questions from the company data warehouse. In the coming weeks we will be releasing a number of open source and hosted tools to address this.

If you are interested in contributing to or deploying NL2SQL for your enterprise, please reach out.

About Dataherald

Sign up for free and use the hosted version of Dataherald
Our open-source engine is available on GitHub.
Join our Discord server to learn more about the project.

Why Enterprise Natural Language to SQL is hard was originally published in Dataherald on Medium, where people are continuing the conversation by highlighting and responding to this story.