Getting started with the Dataherald NL-to-SQL API

Ainesh Pandey
Dataherald
Jan 18, 2024

Dataherald is a hosted NL-to-SQL API that enables any developer to embed natural language querying into their relational database. You can access Dataherald directly through the API or through the Python SDK.

In this guide, we’ll walk through the steps for getting started with the Dataherald Python SDK.

Install and Instantiate Dataherald

First, you have to contact Dataherald and set up your account. Once that is done, you can log into the Dataherald admin console and generate an API key. Save this key somewhere safe, as it is never stored unencrypted and cannot be retrieved later.

Then you can start building the app. First, install the Dataherald SDK.

pip install dataherald

You can then instantiate the Dataherald client. Note that you should never save your API key with your code. The code below assumes the API key is saved as an environment variable.

import os
from dataherald import Dataherald

API_KEY = os.environ['API_KEY']
dh_client = Dataherald(api_key=API_KEY)

Create a Database Connection

The next step is to give the tool access to your data. Dataherald uses a standard ODBC protocol to connect to the relational data warehouse of your choice. It currently supports:

  • PostgreSQL
  • Databricks
  • Snowflake
  • BigQuery
  • AWS Athena
  • MotherDuck

In order to create a database connection, all you need is a connection URI for your database. Once you have that, you can create the database connection from the admin console’s Databases tab or through the SDK.
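
For example, a PostgreSQL connection URI typically follows the standard SQLAlchemy-style format shown below; the user, password, host, port, and database name are placeholders for your own values:

postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>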

For additional instructions on generating the connection URI (for example, if you need an SSH tunnel, or if your warehouse requires extra credentials, as BigQuery does), please refer to the Dataherald documentation.

[Screenshot: connecting a database through the Admin Console]

To create a database connection using the SDK, use the following code:

alias = "<Insert a human-readable name for the connection>"
use_ssh = False  # Dataherald provides support for SSH connections
connection_uri = "<Insert connection URI here for your data warehouse>"

db_connection = dh_client.database_connections.create(
    alias=alias,
    use_ssh=use_ssh,
    connection_uri=connection_uri,
)
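
The create call returns a connection object; its id is what the later steps reference, so it is worth printing and noting down:

print(db_connection.id)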

Scan Tables

The next step is to scan the tables in the database, which allows Dataherald to build context about the schema and dataset. During the scan, Dataherald will:

  • Read the schema to collect table and column metadata, including data types
  • Identify categorical columns and their values
  • Determine relationships between tables
  • and more

You can specify which tables to scan, or have the tool scan all available tables. Please note that this is an asynchronous process and can take some time to complete.

table_names = ["<List of tables to scan>"]

dh_client.table_descriptions.sync_schemas(
    db_connection_id=db_connection.id,
    table_names=table_names,
)
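
Because the scan runs asynchronously, you may want to poll until it finishes. Below is a minimal sketch; it assumes the SDK exposes a table_descriptions.list method and that each table description carries table_name and status fields, so verify the exact names in the SDK reference:

# Poll the table descriptions for this connection (the method and
# field names here are assumptions based on the resource naming above).
for table in dh_client.table_descriptions.list(db_connection_id=db_connection.id):
    print(table.table_name, table.status)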

Ask a Question

The engine is now set up to query the database in natural language. Three main resources are involved:

  • Prompts: the natural language text or question
  • SQL-Generations: the AI generated SQL query to answer the natural language prompt
  • NL-Generations: a natural language answer to the prompt

While these resources are linked and build on each other (i.e., you need to create a Prompt before a SQL-Generation, and a SQL-Generation before an NL-Generation), the API and SDK provide methods that let you create these in a single call.

The following example creates a SQL query for a given question.

from dataherald.types.sql_generation_create_params import Prompt

prompt = Prompt(
    text="<Insert Question here>",
    db_connection_id="<Insert Database Connection ID here>",
)

response = dh_client.sql_generations.create(prompt=prompt)

To access the generated SQL:

print(response.sql)
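
If you want a natural language answer rather than raw SQL, NL-Generations build on SQL-Generations in the same way. The snippet below is only a sketch: the nl_generations resource path and its parameters are assumptions based on the resource names above, so confirm the exact method in the SDK reference:

# Hypothetical single-call chain from prompt to natural language answer.
# The resource path and parameter names are assumptions, not verified API.
nl_response = dh_client.nl_generations.create(sql_generation={"prompt": prompt})
print(nl_response.text)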

Improving Performance

While the engine can now write SQL, its accuracy and latency will be lackluster. This is because the engine still has limited business context and has not been trained on the specific underlying dataset. Dataherald allows you to address this by:

  1. Adding Golden SQLs — providing sample Natural Language to SQL pairs. These can be used in few-shot prompting or to fine-tune your own LLM.
  2. Fine-tuning a model — Train and deploy an LLM fine-tuned to your own dataset. This unlocks the best performance in terms of accuracy and speed.
  3. Adding Instructions — Provide explicit rules for Dataherald to follow.

Golden Records

Dataherald allows you to add context in the form of Golden SQLs: natural language <> SQL query pairs that have been verified by a data admin.

When given a question, the tool identifies the natural language questions among the golden records that are most similar to the question being asked. It then passes the Golden SQL queries from those records to the LLM as guidance on how best to write the query.

The Golden SQLs need to be in the following format:

[
  { "prompt_text": "", "sql": "", "db_connection_id": "" },
  ...
]
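
For instance, a single hypothetical record might look like the following; the question, query, and connection id are illustrative placeholders only:

[
  {
    "prompt_text": "How many orders were placed last month?",
    "sql": "SELECT COUNT(*) FROM orders WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month' AND order_date < DATE_TRUNC('month', CURRENT_DATE)",
    "db_connection_id": "<your database connection id>"
  }
]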

If you have this dataset available, you can upload it to Dataherald using the following code snippet (assuming it is in a file called `training_data.json`):

import json

with open('training_data.json', 'r') as jsonfile:
    samples = json.load(jsonfile)

created_golden_sqls = dh_client.golden_sqls.upload(body=samples)

If you do not have such a training set, you can build one through the Admin Console UI: use the SQL generated by the agent as a starting point, verify and edit it, and add it to the Golden SQLs by marking it as Verified.

Finetuning

In order to get state-of-the-art NL-to-SQL performance, you need to fine-tune your own NL-to-SQL model. Although Dataherald works without a fine-tuned model, accuracy and latency will be considerably worse than when you deploy one.

Dataherald works with various LLMs, but for the best performance you should fine-tune GPT-4. Dataherald is one of a limited number of companies with access to GPT-4 fine-tuning.

Dataherald uses the Golden SQLs you have uploaded for fine-tuning. While you can fine-tune a model with as few as 10 Golden SQLs, for the fine-tuned model to provide value you need a diverse training set with at least 20–30 samples per table, including samples that show how to join across tables.

You can create a fine-tuned model with the following command. Please note that fine-tuning is an asynchronous task that can take a few hours to complete.

base_llm = {
    "model_provider": "openai",
    "model_name": "gpt-4",
}
alias = "<Insert a human-readable name for the fine-tuned model>"

finetuning = dh_client.finetunings.create(
    db_connection_id=db_connection.id,
    base_llm=base_llm,
    alias=alias,
)

Further model parameters for fine-tuning can be found in the docs. Once the model is fine-tuned, its status will be SUCCESS and it can be used in SQL generation.
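
You can poll the fine-tuning job to check on its progress. The retrieve call below is an assumption based on the SDK's resource naming (finetuning.id comes from the create call above), so check the SDK reference for the exact signature. Once the status is SUCCESS, pass the fine-tuned model's id into SQL generation, as shown next:

# Assumed polling call; the retrieve method name is not verified here.
finetuning_job = dh_client.finetunings.retrieve(finetuning.id)
print(finetuning_job.status)  # SUCCESS once fine-tuning has completed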

from dataherald.types.sql_generation_create_params import Prompt

prompt = Prompt(
    text="<Insert Question here>",
    db_connection_id="<Insert Database Connection ID here>",
)

result = dh_client.sql_generations.create(
    prompt=prompt,
    finetuning_id="65a184f5c73d4232ab262e30",
)

Additional controls

Dataherald provides other levers to control SQL generation that we will not cover in this guide. These include:

  • Table descriptions
  • Column descriptions
  • Database instructions

You can read more about these in our docs.

About Dataherald

  • Sign up for free and use the hosted version of Dataherald
  • Our open-source engine is available on GitHub.
  • Join our Discord server to learn more about the project.


Ainesh is the Founding Data Product Manager for Dataherald.