Natural Language to SQL: How to use Dataherald AI to perform NL-2-SQL operations

Dishen Wang
Published in Dataherald
6 min read · Sep 18, 2023
Photo by Google DeepMind on Unsplash

Introduction

Recently, we open-sourced Dataherald AI: a natural language-to-SQL engine that lets you set up an API on top of your structured database and answer questions in plain English. As a lead developer at Dataherald, I have found it super cool to watch developers worldwide use it to create awesome projects and contribute to the code base. In this post, I will provide a step-by-step tutorial on how to set up the engine on a local machine so that you can query your structured database with natural language.

Getting Started

First, let us clone the Dataherald AI repository so we can work on the project locally:

git clone https://github.com/Dataherald/dataherald.git

Since we will be running the project on Docker, make sure to install Docker Desktop as well.

Then, set up the environment variables by copying the example file:

cp .env.example .env

Specifically, the following four fields must be set manually before the engine is started:

#OpenAI credentials and model 
OPENAI_API_KEY =
LLM_MODEL =
ORG_ID =

#Encryption key for storing DB connection data in Mongo
ENCRYPT_KEY =

You can generate the encryption key using the cryptography package in Python 3:

# Install the package cryptography in the terminal
pip3 install cryptography

# Run python in terminal
python3

# Import Fernet
from cryptography.fernet import Fernet

# Generate the key
Fernet.generate_key()
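
If you prefer a single script over the interactive session above, the following minimal sketch prints a key you can paste into the ENCRYPT_KEY field of your .env file:

# generate_key.py: print a Fernet key suitable for the ENCRYPT_KEY variable
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())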

While not strictly required, we also suggest changing the Mongo username and password fields.

Run Dataherald AI on Docker

Now that all the setup is complete, let’s start running Docker on our local machine.

We need to create a Docker network so that Dataherald AI can communicate with the local Mongo database running in Docker.

docker network create backendnetwork

Let’s build the image, create the container, and bring up the app using docker-compose:

docker-compose up --build

Once complete, let us make sure both containers, Dataherald AI and the Mongo database, are running successfully:

docker ps

It should look like this:

CONTAINER ID   IMAGE            COMMAND                  CREATED         STATUS         PORTS                      NAMES
72aa8df0d589   dataherald-app   "uvicorn dataherald.…"   7 seconds ago   Up 6 seconds   0.0.0.0:80->80/tcp         dataherald-app-1
6595d145b0d7   mongo:latest     "docker-entrypoint.s…"   19 hours ago    Up 6 seconds   0.0.0.0:27017->27017/tcp   dataherald-mongodb-1

We can now visit the app's API docs by going to http://localhost/docs.

Connect SQL Databases

Now that the engine is running, we need to connect our data warehouse to it. Currently Dataherald AI supports Postgres, BigQuery, Databricks, and Snowflake connections. You can create connections to these warehouses through the API or at application start-up using the environment variables.

In this tutorial we will add the connection via the API. Call the /api/v1/database endpoint and supply the following JSON body:

curl -X 'POST' \
  '<host>/api/v1/database' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "alias": "my_db_alias_identifier",
    "use_ssh": false,
    "connection_uri": "sqlite:///mydb.db",
    "path_to_credentials_file": "my-folder/my-secret.json"  # Required for BigQuery
  }'

Here is the structure of the connection_uri parameter for each supported warehouse:

Postgres: postgresql+psycopg2://<user>:<password>@<host>:<port>/<db-name>
Databricks: databricks://token:<token>@<host>?http_path=<http-path>&catalog=<catalog>&schema=<schema-name>
Snowflake: snowflake://<user>:<password>@<organization>-<account-name>/<database>/<schema>
BigQuery: bigquery://<project>/<database>

To connect to BigQuery you also need a credentials file; you can follow this tutorial to generate it. Once you have generated and stored the credentials file, set its path in the path_to_credentials_file field.
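
For reference, here is the same database-connection request issued from Python with the requests library. This is just a sketch: the host and the Postgres connection URI are placeholder values to replace with your own.

import requests

# Sketch: register a Postgres connection with the engine running on localhost
payload = {
    "alias": "my_db_alias_identifier",
    "use_ssh": False,
    "connection_uri": "postgresql+psycopg2://user:password@localhost:5432/mydb",
}
response = requests.post("http://localhost/api/v1/database", json=payload)
response.raise_for_status()
db_connection_id = response.json()["_id"]  # keep this id, later calls reference it
print(db_connection_id)

The returned _id is the db_connection_id you will pass to the context and question endpoints below.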

If you require SSH for your database connection, swap the connection_uri field for ssh_settings:

curl -X 'POST' \
  'http://localhost/api/v1/database' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "use_ssh": true,
    "ssh_settings": {
      "db_name": "db_name",
      "host": "string",
      "username": "string",
      "password": "string",
      "remote_host": "string",
      "remote_db_name": "string",
      "remote_db_password": "string",
      "private_key_path": "string",
      "private_key_password": "string",
      "db_driver": "string"
    }
  }'

By default, DB credentials are stored in the database_connection collection in MongoDB. The connection URI information is encrypted using the ENCRYPT_KEY you provided as an environment variable.

You should get a response like this:

{
  "_id": "the db_connection_id", // remember this!!
  "use_ssh": false,
  "uri": "the connection uri",
  "path_to_credentials_file": "path to credentials file",
  "ssh_settings": {
    "db_name": "database name",
    "host": "the host uri",
    "username": "the user name",
    "password": "the ssh password",
    "remote_host": "the remote host uri",
    "remote_db_name": "the remote database name",
    "remote_db_password": "the remote database password",
    "private_key_path": "the private key path",
    "private_key_password": "your private key password",
    "db_driver": "the database driver"
  }
}

Be sure to remember the _id value as it will be used to reference the database connection later.

Adding Context

Once you have connected to the data warehouse, you should add context to the engine to help improve the accuracy of the generated SQL. The context can currently be added in one of three ways:

  1. Scanning the Database tables and columns
  2. Adding verified SQL (golden SQL)
  3. Adding string descriptions of the tables and columns

While only the database scan is required to start generating SQL, adding verified SQL and string descriptions is also important for the tool to generate accurate SQL.

Scanning the Database

The database scan gathers information about the database, including table and column names, and identifies low-cardinality columns and their values so they can be stored in the context store and used in the prompts to the LLM. You can trigger a scan of a database from the POST /api/v1/table-descriptions/scan endpoint. Example below:

curl -X 'POST' \
  '<host>/api/v1/table-descriptions/scan' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "db_connection_id": "the _id value from database_connection",
    "table_names": ["string"]  # optional; if not set, all tables are scanned
  }'

Since the endpoint identifies low-cardinality columns (and their values), it can take time to complete. Therefore, while it is possible to trigger a scan on the entire DB by not specifying table_names, we recommend against doing so for large databases.
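
For example, a scan restricted to a couple of tables could look like this from Python with the requests library (a sketch; the table names are placeholders and db_connection_id is the _id returned when the connection was created):

import requests

db_connection_id = "the _id value from database_connection"  # placeholder

# Sketch: scan only the listed tables instead of the whole database
scan_request = {
    "db_connection_id": db_connection_id,
    "table_names": ["customers", "orders"],  # placeholder table names
}
response = requests.post("http://localhost/api/v1/table-descriptions/scan", json=scan_request)
response.raise_for_status()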

Once a database has been scanned, you can use this endpoint to retrieve the table names and columns:

curl -X 'GET' \
  '<host>/api/v1/table-descriptions?db_connection_id=your_db_connection_id&table_name=optional_table_name' \
  -H 'accept: application/json'

Adding Verified SQL

Sample NL<>SQL pairs (golden SQL) can be stored in the context store and used for few-shot in-context learning. In the default context store and NL-to-SQL engine, these samples are stored in a vector store and the closest samples are retrieved for few-shot learning. You can add golden SQLs to the context store from the POST /api/v1/golden-records endpoint:

curl -X 'POST' \
  '<host>/api/v1/golden-records' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "nl_question": "what was the most expensive zip code to rent in Los Angeles county in May 2022?",
      "sql": "SELECT location_name, metric_value FROM table_name WHERE dh_county_name = '\''Los Angeles'\'' AND dh_state_name = '\''California'\'' AND period_start='\''2022-05-01'\'' AND geo_type='\''zip'\'' ORDER BY metric_value DESC LIMIT 1;",
      "db_connection_id": "The db_connection_id"
    }
  ]'
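
The same request is a bit easier to read from Python, since you avoid the shell escaping around the quotes inside the SQL string. A sketch mirroring the curl example above, using the requests library:

import requests

db_connection_id = "the _id value from database_connection"  # placeholder

# Sketch: add one verified NL<>SQL pair (golden SQL) to the context store
golden_records = [
    {
        "nl_question": "what was the most expensive zip code to rent in Los Angeles county in May 2022?",
        "sql": (
            "SELECT location_name, metric_value FROM table_name "
            "WHERE dh_county_name = 'Los Angeles' AND dh_state_name = 'California' "
            "AND period_start = '2022-05-01' AND geo_type = 'zip' "
            "ORDER BY metric_value DESC LIMIT 1;"
        ),
        "db_connection_id": db_connection_id,
    }
]
response = requests.post("http://localhost/api/v1/golden-records", json=golden_records)
response.raise_for_status()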

Adding Table Descriptions Manually

In addition to scanning the database and adding golden SQLs, you can manually add table and column descriptions using the PATCH /api/v1/table-descriptions/{table_description_id} endpoint:

curl -X 'PATCH' \
  '<host>/api/v1/table-descriptions/table_description_id' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "description": "Table description",
    "columns": [
      {
        "name": "column1",
        "description": "Column1 description"
      },
      {
        "name": "column2",
        "description": "Column2 description"
      }
    ]
  }'
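
From Python, the same update could be sketched as follows (the table_description_id value and the descriptions are placeholders):

import requests

table_description_id = "the id of the table description to edit"  # placeholder

# Sketch: attach human-written descriptions to a table and two of its columns
update = {
    "description": "Table description",
    "columns": [
        {"name": "column1", "description": "Column1 description"},
        {"name": "column2", "description": "Column2 description"},
    ],
}
response = requests.patch(
    f"http://localhost/api/v1/table-descriptions/{table_description_id}", json=update
)
response.raise_for_status()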

Querying the Database in Natural Language

Finally, now that the engine is connected to our data warehouse and we have added some context to the store, we can start querying it in natural language using the POST /api/v1/question endpoint:

curl -X 'POST' \
  '<host>/api/v1/question' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "Your question in natural language",
    "db_connection_id": "The database connection id"
  }'
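
And here is the same call sketched from Python with the requests library; the full JSON response is printed rather than assuming its exact shape:

import requests

db_connection_id = "the _id value from database_connection"  # placeholder

# Sketch: ask a natural-language question against the connected data warehouse
question = {
    "question": "Your question in natural language",
    "db_connection_id": db_connection_id,
}
response = requests.post("http://localhost/api/v1/question", json=question)
response.raise_for_status()
print(response.json())  # print the full response body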

Conclusion

In conclusion, this tutorial has covered the basics of setting up Dataherald AI on your local machine. If you would like to use your own code for the project's core modules, feel free to swap them in using the environment variables. If you are interested in contributing, check out this link for more details.

About Dataherald

  • Sign up for free and use the hosted version of Dataherald
  • Our open-source engine is available on GitHub.
  • Join our Discord server to learn more about the project.
