How Best to Use your $50 Dataherald AI Credit (and fine-tune your own NL-to-SQL LLM)

Ainesh Pandey
Dataherald

--

Dataherald AI provides an API that allows you to embed NL-to-SQL into your product. With a free $50 credit, you can get started right away.

However, how do you get the most bang for your buck with the $50 credit? Read on to find out.

Overview of Costs

Through the self-serve option, Dataherald covers any LLM costs, giving you a straightforward pricing structure:

  • Cost per question (base model): $0.90
  • Cost of fine-tuning: $3 per training record, or free (more on that in Phase 3)
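The arithmetic behind the plan in this post can be sketched in a few lines. Prices are the ones quoted above, and the counts (8 test questions, 30 golden records, 16 comparison runs) come from the phases below; check Dataherald's pricing page for current rates:

```python
# Self-serve pricing as quoted in this post; verify against current rates.
COST_PER_QUESTION = 0.90          # base-model SQL generation
COST_PER_TRAINING_RECORD = 3.00   # fine-tuning, when not credited for free

def question_cost(n_questions: int) -> float:
    """Credits spent generating SQL for n questions with the base model."""
    return round(n_questions * COST_PER_QUESTION, 2)

def finetune_cost(n_golden_records: int, credited: bool = False) -> float:
    """Fine-tuning cost: $3 per training record, or $0 if Dataherald credits it."""
    return 0.0 if credited else round(n_golden_records * COST_PER_TRAINING_RECORD, 2)

# The plan in this post: 8 baseline questions, 30 golden records generated
# via the tool, fine-tuning credited for free, then 8 questions x 2 models.
total = round(
    question_cost(8)                       # Phase 0: baseline
    + question_cost(30)                    # Phase 2: golden records via the tool
    + finetune_cost(30, credited=True)     # Phase 3: credited by Dataherald
    + question_cost(8 * 2),                # Phase 4: base vs. fine-tuned
    2,
)
print(total)  # 48.6
```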

Ideal Use of Your $50 Credit

Phase 0: Setting a Baseline

By following this guide, you can determine how much performance improves by using Dataherald the right way. However, to run any experiment, we’ll need to curate a test set to compare performance across iterations. Let’s start by generating a test set of 8 questions.

To get a performance baseline, you can go to the Playground in the Dataherald Admin Console and ask the questions there, or you can call the Generate SQL API endpoint. Keep track of the latency and accuracy of each SQL generation.

With this baseline, we’re ready to start providing the context needed to improve Dataherald’s performance.

Time Spent: ~1–2 mins per query
Credits Expended: $7.20

Phase 1: Building Business Context

As explained in our Improving Accuracy of NL-to-SQL Enterprise Use Cases through Context blog post, Dataherald uses a variety of contextual tools to improve the performance and accuracy of its SQL generations. Some of the most impactful and, quite importantly, free elements are:

  • Database Instructions
  • Table Descriptions
  • Column Descriptions

Spend some time adding the above by following the general guidelines in the blog post or the guides posted in our docs. Right off the bat, you'll have provided a solid foundation for the tool to generate more accurate queries.

Time Spent: ~30 mins
Credits Expended: $0.00

Phase 2: Ramping Up your Golden Records

One of the most important context tools for Dataherald is golden records: natural language <> SQL query pairings that show the tool how to write SQL for questions similar to ones it has seen before. These golden records are also used as the training data for fine-tuning your own LLM, which offers better performance in terms of latency and accuracy. Let's get started by adding 30 golden records to Dataherald. These should be similar in complexity and content to the test set. There are two ways to go about doing so:

The Free, Time-Consuming Way
The Dataherald API offers an endpoint to manually upload golden records. You can curate a collection of pairings in any format you choose (a CSV, a JSON file, etc.) and then call the Add golden SQL API endpoint. Although a manual endeavor, this process does save you some credits.
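For example, if your pairings live in a CSV with `question` and `sql` columns, a small helper can shape each row into a request body for the Add golden SQL endpoint. The column and field names here are assumptions for illustration; match them to the endpoint's documented schema:

```python
import csv

def load_golden_records(csv_path: str, db_connection_id: str) -> list[dict]:
    """Read question/SQL pairs from a CSV with 'question' and 'sql' columns
    and shape them for the Add golden SQL endpoint (field names illustrative)."""
    records = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            records.append({
                "db_connection_id": db_connection_id,
                "prompt_text": row["question"].strip(),
                "sql": row["sql"].strip(),
            })
    return records
```

You can then POST the resulting list to the endpoint in one call or in batches, whichever the API allows.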

Time Spent: ~2–15 mins per query, depending on complexity
Credits Expended: $0.00

The Easier, Dataherald Way
However, you have an NL-to-SQL tool at your disposal; why not take it for a spin to create this training set for yourself? Just head over to the Playground in the UI or call the Generate SQL API endpoint to supply a question, and our tool will generate the SQL query for you. Check that it’s correct, and then use the Query tab or the Add golden SQL API endpoint to upload it to Dataherald.

Verified queries get added to the Golden SQL collection

Time Spent: ~1–2 mins per query
Credits Expended: $27.00

Phase 3: Fine-tuning your own LLM

You can now begin fine-tuning your own LLM, and this is where you really get your money's worth: as a reward for using your credits wisely, Dataherald will cover the cost of fine-tuning GPT-4 for you! If you get to this point and want to fine-tune a model, reach out to Dataherald any time before March 30th and we'll credit your account with the funds needed to fine-tune a GPT-4 model; with 30 golden records, that's an additional $90 in value! Just join the Dataherald Discord community, send a message requesting fine-tuning credits in the #support channel under "Hosted API", and our team will add the credits accordingly.

Fine-tuning a model can take a while (anywhere from 20 minutes to 2 hours), but it's an async job, so you can start the process and go about your business. Just use the Create a finetuning job API endpoint to start the job, and you can track its progress in the Fine-tuning tab of the Admin Console. Keep in mind that while a fine-tuned model improves latency, its accuracy will only be as good as the quality and comprehensiveness of the golden records you provide.

Time Spent: ~2 mins
Credits Expended: $90.00 (additional credits)

Phase 4: Testing Performance

You can now compare the performance of the base model against the fine-tuned model. For each test question:

  • Call the Generate SQL API endpoint with the base model (by not specifying a finetuning_id) or ask the question in the Playground with “None” selected for the model
  • Call the Generate SQL API endpoint with the finetuning_id set or ask the question in the Playground with the fine-tuned model selected

Keep track of the latency and accuracy of the queries generated to compare with the baseline.
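A small helper keeps this bookkeeping honest. Each run below is a dict you fill in by hand, recording the measured latency and whether you judged the generated SQL correct (this structure is just a suggestion, not part of the Dataherald API):

```python
def compare_runs(base_runs: list[dict], finetuned_runs: list[dict]) -> dict:
    """Summarize latency and accuracy for base vs. fine-tuned runs.

    Each run is a dict like {"latency": seconds, "correct": bool},
    recorded manually after checking each generated query.
    """
    def summarize(runs: list[dict]) -> dict:
        return {
            "avg_latency": round(sum(r["latency"] for r in runs) / len(runs), 2),
            "accuracy": sum(r["correct"] for r in runs) / len(runs),
        }
    return {"base": summarize(base_runs), "fine_tuned": summarize(finetuned_runs)}
```

Running this over your 8 test questions gives you the side-by-side numbers for the conclusion below.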

Time Spent: ~1–2 mins per query
Credits Expended: $14.40

Conclusion

You can now compare the performance across:

  • The raw Dataherald setup, with no context tools or golden records
  • Dataherald with some business context and 30 golden records supporting its SQL generation
  • Dataherald with a fine-tuned LLM supporting its SQL generation

At most, this will use up $48.60 of your $50 credit, and you’ll also get the value of $90 in additional credits.

Keep in mind that the improvement in performance scales with the number of golden records; the gains you see in this small experiment are only a glimpse of the true power of Dataherald AI. So what are you waiting for? Sign up and get started today for free!

About Dataherald

  • Sign up for free and use the hosted version of Dataherald
  • Our open-source engine is available on GitHub.
  • Join our Discord server to learn more about the project.

--

Ainesh Pandey

Ainesh is the Founding Data Product Manager for Dataherald.