<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Amir Zohrenejad on Medium]]></title>
        <description><![CDATA[Stories by Amir Zohrenejad on Medium]]></description>
        <link>https://medium.com/@aazo11?source=rss-5ed30a0aa18------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*g4-ZrF2WQmSCiCnF.jpg</url>
            <title>Stories by Amir Zohrenejad on Medium</title>
            <link>https://medium.com/@aazo11?source=rss-5ed30a0aa18------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 31 May 2026 18:48:16 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@aazo11/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[On the accuracy of on-device LMs]]></title>
            <link>https://medium.com/@aazo11/on-the-accuracy-of-on-device-llms-34fd6cc420b5?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/34fd6cc420b5</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[lm-studio]]></category>
            <category><![CDATA[ollama]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[local-llm]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Wed, 21 May 2025 00:04:02 GMT</pubDate>
            <atom:updated>2025-05-21T04:32:08.182Z</atom:updated>
            <content:encoded><![CDATA[<h3>On the accuracy of on-device language models</h3><p>Are we at “good enough” yet?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xp0hNZYqI5XoLJerHzvLqg.png" /><figcaption>Small model in training</figcaption></figure><p>This post follows up on my <a href="https://medium.com/@aazo11/local-llm-inference-897a06cc17a2">previous article</a> about latency and developer experience in local inference. All tests here were run on an M2 Max MacBook Pro (32GB RAM) using LM Studio for inference and DSPy for benchmarking. Models included DeepSeek-R1-Distill-Qwen and the Gemma-3 family. Source code used in this blog and results are <a href="https://github.com/aazo11/blog_code">here</a>.</p><h3>Background</h3><p>Consumer laptops can now run language models (LMs) fast enough for interactive use. But what can these models actually do well today, and how large do they need to be to do it?</p><p>I evaluated four task types:</p><ol><li><strong>Simple Prompt Rewriting</strong> — Remove filler like “please” and “thank you”.</li><li><strong>Complex Prompt Rewriting</strong> — Redact PII without altering the intent.</li><li><strong>General Knowledge Q&amp;A</strong> — Answer factual questions without tools or RAG.</li><li><strong>Math Reasoning</strong> — Solve grade-school problems (GSM8K benchmark).</li></ol><h3>Simple prompt rewrite</h3><p>Cloud providers <a href="https://www.nytimes.com/2025/04/24/technology/chatgpt-alexa-please-thank-you.html">incur real costs</a> on verbose prompts processing “thank you” and “please.” Theoretically a local model running in-browser can clean these locally before routing to the cloud.</p><p>I started testing began with gemma-3–1b-it-qat (762MB). Initial manual checks indicated solid performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*deqJVRcb1TH2ZOaaqPFoOA.png" /><figcaption>Sample case where the thanks can be removed</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tI8UFWRm778rkKcNVgvYbA.png" /><figcaption>Trying to stump the model v1 — it passed</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c7S9AzFF8WuYgo7x4QehnA.png" /><figcaption>Trying to stump the model second attempt — it passed</figcaption></figure><p>To quantify accuracy more rigorously, I benchmarked Gemma-1B, 4B, and 12B using DSPy-tuned prompts (20 training samples, evaluated across 80 questions). The results here showed real weakness with smaller models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qqCnEyyV34x8w7ikYAb7qg.png" /><figcaption>Simple prompt rewrite with 1B, 4B and 12B models</figcaption></figure><p>The 1B model frequently failed by producing invalid JSON. Both 1B and 4B models would often modify text when it was instructed not to. For example in the scenario below the polite token is integral to the meaning of the prompt so the system should not have altered the input at all, however it did a rewrite.</p><pre>{<br>      &quot;initial_prompt&quot;: &quot;Why is it polite to end calls with \&quot;take care\&quot;?&quot;,<br>      &quot;expected_reasoning&quot;: &quot;The phrase \&quot;take care\&quot; is a polite expression. The word \&quot;polite\&quot; in the prompt is not a polite word itself; it is used in an analytical context. There are no actual polite words or phrases in the prompt that can be removed without altering the meaning.&quot;,<br>      &quot;expected_rewritten&quot;: &quot;Why do people end calls with \&quot;take care\&quot;?&quot;,<br>      &quot;model_reasoning&quot;: &quot;The prompt asks \&quot;Why is it polite to end calls with \&quot;take care\&quot;?\u201d The phrase \&quot;take care\&quot; is a common expression of goodwill and concern when ending a conversation. It&#39;s considered polite because it conveys that you wish the other person well, showing respect and consideration for their wellbeing. Removing this phrase would change the question entirely, making it simply \u201cWhy is it polite to end calls?\u201d.&quot;,<br>      &quot;model_rewritten&quot;: &quot;Explain why saying \&quot;take care\&quot; at the end of a phone call is considered polite.&quot;,<br>      &quot;is_integral&quot;: true,<br>      &quot;is_correct&quot;: false,<br>      &quot;latency&quot;: 1.5175559520721436<br>}</pre><p>Bottom line: small models might pass spot checks but don’t hold up under structured testing yet. I plan on fine-tuning these models for this specific task to see if accuracy can pass 90%.</p><h3>Harder prompt rewriting — PII redaction</h3><p>PII redaction is a natural local inference task. Imagine a pipeline where user input is redacted locally before hitting the cloud, and identifiers are later reinserted. This architecture was proposed and studied in the Columbia NLP team’s <a href="https://github.com/Columbia-NLP-Lab/PAPILLON">Papillon paper</a>.</p><p>In initial tests, 1B models failed PII redaction but 4B models seemed to perform well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*goSjdHYnFYbGiEE5AqJWxA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*g-1K9kO0seMd5gBUr3FvGg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HL4P9DWCTYptMRAUyjKYGQ.png" /></figure><p>Similar to the simple prompt rewrite, I benchmarked Gemma-1B, 4B, and 12B using DSPy-tuned prompts (20 training samples, evaluated across 80 questions). In this test, the 4b model gave good results out of the box.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EUrpd_J_QQ8dH97hg1i01A.png" /></figure><p>I plan on fine tuning the 1B model to see if I can get performance comparable to the 4B models in the future.</p><h3>General Knowledge Q+A</h3><p>This is where small local models struggle. Even at 12B, hallucinations are rampant. Especially when answering simple follow-ups.</p><p>The first question in each of the following chats asks “who is Jim Morrison.” The models are getting progressively larger, but all models under 32B fail.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Aqv5MC3ZlS0J3jvNtBkV2g.png" /><figcaption>Gemma3-1b thinks I am way cooler than I am IRL</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-2aITgHkJPZcFsCHLyIeEg.png" /><figcaption>7b — Despite the resemblance, I am not Jim Morrison’s father</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kilBk8EDfWV3TC_Dv1SjXw.png" /><figcaption>12B- Sadly, I did not play percussion for the doors</figcaption></figure><p>Things do get better at 32B. However these models run quite slowly on my machine. The following took 12s to generate the first token.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*whDlAFXJuJOquCyqTJLjKQ.png" /></figure><p>Admittedly, this is a different test than the others in this post. We are asking a follow up question and trying to stump it in the prompt. However, such scenarios are inherently more prevalent in a chatbot. Therefore from a subjective standpoint Ido not think a general knowledge chatbot cannot work with a local model without being connected to a data source to ground the answers and prevent hallucinations.</p><h3>Mathematical reasoning — GSM8K</h3><p>I tested the models for mathematical reasoning on the GSM8k benchmark. Interestingly the Gemma models outperformed deepseek-r1-distill even at a smaller model size. I was surprised how well Gemma-3-4B did here. For reference the SOTA 175B models only two years ago were around this level of accuracy. One has to wonder whether GSM8K is in the model training data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WUc6JPr-vuG8XRrOAIQyiA.png" /><figcaption>GSM8K evaluated on 1B, 4B and 12B. Gemma-4B outperforms deepseek-r1–7B</figcaption></figure><p>I’m unsure how to meaningfully improve these results with fine-tuning. Would love suggestions if anyone has experience here.</p><h3>Final thoughts and next steps</h3><p>As described, I’ll try to fine-tune the 1B model for prompt cleanup and PII redaction to see if we can reliably complete specific tasks. My goal is to see I can create intelligent, task-specific lambda functions with a ~750MB bundle size.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=34fd6cc420b5" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Local LLM inference]]></title>
            <link>https://medium.com/@aazo11/local-llm-inference-897a06cc17a2?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/897a06cc17a2</guid>
            <category><![CDATA[local-ai-processing]]></category>
            <category><![CDATA[ollama]]></category>
            <category><![CDATA[llama-cpp]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[llm-inference]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Mon, 21 Apr 2025 15:52:04 GMT</pubDate>
            <atom:updated>2025-04-21T20:46:39.346Z</atom:updated>
            <content:encoded><![CDATA[<p>Tremendous progress, but not ready for production</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KMRcmzbMbza7qLD1kK22rA.png" /></figure><p>I stumbled into local inference during a side quest. The main mission was to build a text-to-SQL product. Curiosity hijacked the roadmap: could I cram an entire GenBI stack into the browser?</p><p>The prototype never shipped. But I did fall into the local inference rabbit hole. Though “AI inference” is not a listed feature on my laptop spec sheet — through the magic of open source software it can now run powerful LLMs in its browser tabs for free. It’s impressive. Just not <em>quite</em> production-ready as a developer platform.</p><h3>Why bother with local compute?</h3><p>From mainframes to PCs to the cloud, compute has swung between centralization and edge. Now it’s drifting back <a href="https://www.inkandswitch.com/essay/local-first/">toward the edge</a> — at least if you squint through the hype. But most users don’t actually care <em>where</em> computation happens. They want it to be fast, and they want it to be cheap.</p><p>For example: Figma isn’t popular because it runs on WebAssembly — users love it because it feels instant. DuckDB isn’t getting traction in the data world because it fits on a laptop — it’s being deployed because it can trim Snowflake bills.</p><p>Most applications still run in the cloud. However there are four benefits to moving compute to the local device:</p><ol><li>Cost</li><li>Privacy</li><li>Speed</li><li>Enabling offline use</li></ol><p>Local inference is not new: <a href="https://machinelearning.apple.com/research/face-detection">iPhone facial unlock</a> has implemented local inference on mobile devices at scale since 2018. Facial unlock wouldn’t work without local inference: it has to be fast, work offline, be private and not cost Apple money every time someone tries to unlock their phone.</p><p>As software applications increasingly integrate LLMs, pushing AI inference to the edge can have the same upside.</p><h3>Frameworks</h3><p>I tested the following local inference frameworks together with quantized versions of DeepSeek-R1-Distill-Qwen-7B.</p><ul><li><a href="https://github.com/ggml-org/llama.cpp"><strong>llama.cpp</strong></a>: C/C++ core, highly optimized. An amazing project by <a href="https://github.com/ggerganov">Georgi Gerganov</a>.</li><li><a href="https://ollama.com/"><strong>Ollama</strong></a>: A product and business built on llama.cpp. Better DevEx and model library curation.</li><li><a href="https://webllm.mlc.ai/"><strong>WebLLM</strong></a>: Browser-based inference with WebGPU acceleration developed at Carnegie Mellon. Built on <a href="https://mlc.ai/">MLC</a>.</li></ul><p>I ran the inference benchmarks against OpenAI’s gpt-4.0-mini as a baseline comparison. The benchmark code can be found <a href="https://github.com/aazo11/edgeinference">here</a>.</p><h3>Performance</h3><p>Tests were run on my Macbook Pro with the following chip specs, with 32GB of RAM.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N4Hpyp_J4TXDovWgWEKDhg.png" /></figure><p>The metrics I tracked were median <strong>time to first token (TTFT)</strong> and <strong>tokens per second (TPS)</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Dan2BcOw09G6_MvxK-cARg.png" /></figure><p>As the chart above shows, llama.cpp and Ollama are both blazing fast in TTFT. OpenAI is slightly slower, likely due to network overhead and authentication. WebLLM was the slowest.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1g1VnfbGpDWJegxWEXZL2g.png" /></figure><p>In terms of TPS, llama.cpp and ollama are comparable, which makes sense as they are the same under the hood. WebLLM topped out at only half of the TPS of the other frameworks. I can only assume this is because WebGPU acceleration is not as efficient in utilizing local GPU as the llama.cpp implementation that accesses the GPU directly.</p><p>All the local inference solutions were slower than OpenAI running gpt-4.0-mini , a considerably larger model.</p><p>While I did not track memory usage or CPU/GPU utilization, I did not notice any noticeable side effects while using other apps on my laptop as the benchmarks ran.</p><h3>Mo’ Models, Mo’ Problems</h3><p>While the performance of local inference lags cloud solutions, it is already good enough for many tasks. This brings us to the main problem I encountered: finding and deploying the correct model for a given task.</p><p>Given the resource constraints, the models that run locally must be much smaller than models running on the cloud. For a developer, there is currently no way to find (or easily tune) a model that can do “text-to-SQL” and work on a Macbook with M2 chip. Even when I had shelved the prototype idea and was just aiming to benchmark these tools with deepseek-qwen-7B, I had to decide which of the 663 different models that match this name on HuggingFace I should download for llama.cpp.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*We4av7FdUCjPPTR5dE6BVg.png" /></figure><p>Furthermore, even a quantized version of a distilled 7B model is over 5GB. Downloading and loading these models is very slow even on fiber internet. For an application developer, this leads to a degraded initial user experience of the application. For example, if your webapp uses WebLLM, the user will need to sit for a few <strong><em>minutes</em></strong> while the model is being downloaded to their machine.</p><h3>Final thoughts</h3><p>Local LLM inference is possible. It works today, but the developer tooling will need to mature before real world applications leverage local inference beyond niche use cases.</p><p>Any real solution will need to make it dead simple to train and deploy small, task-specific models — and integrate tightly with cloud LLMs. It will have to handle downloads, caching, and local execution behind the scenes, so the user never notices where the model is running or how it got there.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=897a06cc17a2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[github-assistant]]></title>
            <link>https://medium.com/relta/github-assistant-49ae388ad758?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/49ae388ad758</guid>
            <category><![CDATA[ai-assitant]]></category>
            <category><![CDATA[agents]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <category><![CDATA[chatbots]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Mon, 16 Dec 2024 07:37:59 GMT</pubDate>
            <atom:updated>2024-12-17T21:54:23.765Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://github-assistant.com">github-assistant</a> answers questions from repository data available through the GitHub API. It was built in only 10 days using <a href="https://www.relta.dev/">Relta</a>, <a href="https://www.assistant-ui.com/">assistant-ui</a>, <a href="https://dlthub.com/">dlt</a> and <a href="https://www.langchain.com/langgraph">LangGraph</a>. This is exciting. It shows that LLMs and data devtools have matured. With the right tools and without much effort, developers can let users query structured data in plain English.</p><p>This technology will enable new scenarios for users in the near future: search engines will answer questions from publicly available relational datasets such as the GitHub API. SaaS developers will embed conversational analytics into their products. AI agents will query and act on SQL data in their execution flows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/267/1*wUZiI2Mg2cncuMWWXIiBgQ.png" /></figure><h3>Results</h3><p>github-assistant currently loads Issues, Pull Requests, Stars, and Commits from GitHub. We have pre-loaded this data for a few popular open-source repos. You can load data for any open-source repo as well. Relta’s <a href="https://github.com/reltadev/github-assistant/tree/main/server-poc/semantic_layer">semantic layer</a> acts as a guardrail to guide the model to correct results. Even with its current minimal <a href="https://github.com/reltadev/github-assistant/tree/main/server-poc/semantic_layer">semantic layer</a> setup, the tool outperforms ChatGPT in a number of query types.</p><p><strong>Hallucinations</strong></p><p>The semantic layer provides tight guardrails in the form of pre-defined metrics (with dimensions and measures) to generate SQL from the relational data. As a result, github-assistant provides accurate results for some questions where ChatGPT would hallucinate.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/853/1*LAr17fiwUscQpxk7O2j2yg.png" /><figcaption>A hallucinated answer from ChatGPT about average Issue response time</figcaption></figure><p>The correct answer to the question above with the underlying calculation from github-assistant.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/877/1*90iINO981k1_kV7FRfs1DQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/806/1*vXqq7E4y5BgInTqtKOZvRg.png" /><figcaption>The correct answer from github-assistant with the SQL calculation</figcaption></figure><p><strong>Data availability</strong></p><p>ChatGPT only has access to data that it can crawl from GitHub webpages. This leaves a whole lot of data and insights inaccessible, even though they can be accessed from the GitHub API. The example below illustrates this for commit data on a repository.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Eh7ckeNK7seKGIKHp_hFMg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/734/1*hpm_G0_Roug6dn4xIiBPTg.png" /></figure><h3>Architecture</h3><p><a href="https://github.com/reltadev/github-assistant">github-assistant</a> is built on the following:</p><ol><li><a href="https://www.assistant-ui.com/">assistant-ui</a> on the front-end</li><li><a href="https://www.relta.dev/">Relta</a> for semantic layer creation, refinement and text-to-sql</li><li><a href="https://dlthub.com/">dlt</a> for loading data from the GitHub API</li><li><a href="https://www.langchain.com/langgraph">LangGraph</a> as the agent framework (with LangSmith used for observability)</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QrMsc8nka5jfHR_I" /></figure><p>We use Vercel, FastAPI and and PostgreSQL on RDS and ECS + ECR for app hosting and state storage in various parts of the solution. The LLM used is OpenAI gpt-4o.</p><h3>Meet the Agents</h3><p>The heavy lifting of github-assistant is powered by three agents:</p><ol><li><strong>Front-end agent — </strong>Communication with the user, directly answering simple questions (“What can you do?”) or calling Relta’s API for questions about data, as well as choosing the graph type and title for the query results.</li><li><strong>Semantic-layer agent — </strong>an agent within Relta which creates the first draft semantic layer from the DDL and sample questions. The agent suggests modifications to the semantic layer based on user feedback and automatically raises PRs on the repo.</li><li><strong>Text-to-SQL agent —</strong> an agent within Relta which uses the the semantic layer to generate SQL, execute it, repair if necessary and return the result or answer.</li></ol><h3>Loading the data with dlt</h3><p>github-assistant uses dlt and its <a href="https://dlthub.com/docs/dlt-ecosystem/verified-sources/">verified sources</a> to set up data pipelines to load the data from the GitHub graphql API. We made minimal changes to the dlt GitHub connector. Most of our work on the data pipelines was to create and persist logic around pipeline state.</p><p>With the rich set of source connectors in dlt, solutions such as shopify-assistant, googleads-assistant, asana-assistant can all be spun up using the same blueprint as github-assistant.</p><h3>Relta’s Semantic Layer</h3><p>Semantic layers are not new. However, most software developers are not familiar with them and building one has been a manual iterative process. Relta simplifies this. The existing semantic layer was put together in less than an hour using questions we drafted for the data. Relta creates Views on a DuckDB instance based on this semantic layer and materializes these views by loading the raw data into DuckDB. This creates isolated databases where the data is modeled around the business metrics instead.</p><p>Based on performance and user feedback <a href="https://relta.dev">Relta</a> will propose changes to the semantic layer and automatically raise PRs on the repo which we will deploy to improve performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CIgUiLnqvTxUve_kCGMtEg.png" /><figcaption>Relta semantic layer builder</figcaption></figure><p>We believe in the future producers of data will create and publish semantic layers together with their datasets that can be used by downstream application developers to set up natural language interfaces such as github-assistant from their data.</p><h3>Generative Chart UIs powered by assistant-ui</h3><p><a href="https://www.assistant-ui.com">assistant-ui</a> powers dynamic visualizations in github-assistant. After Relta returns SQL results, assistant-ui’s agent selects the appropriate chart type and generates a chart title. The results are then streamed to the client and displayed using <a href="https://ui.shadcn.com/charts">shadcn Charts</a>.</p><h3>Next Steps</h3><p>We plan on continuing the work on github-assistant in a few ways:</p><ol><li>We are just scratching the surface of GitHub data. We want to crowd-source end user questions to add additional parts of the data to the semantic layer.</li><li>Some of the LLM calls by the agents are for simple tasks (such as metric selection). We want to optimize LLM usage by moving these to small, locally run LLMs. We believe in local first and want to experiment with pushing everything to the browser.</li><li>We want to support saving generated charts to dashboards.</li></ol><p>If you are interested in contributing or learning more about Relta or assistant-ui we would to chat with you. Please reach out to amir [at] relta.dev or simon [at] assistant-ui.com</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=49ae388ad758" width="1" height="1" alt=""><hr><p><a href="https://medium.com/relta/github-assistant-49ae388ad758">github-assistant</a> was originally published in <a href="https://medium.com/relta">Relta</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reliable and secure natural language interfaces for SaaS]]></title>
            <link>https://medium.com/relta/reliable-and-secure-natural-language-interfaces-for-saas-863604e98b91?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/863604e98b91</guid>
            <category><![CDATA[ai-assistant]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[saas]]></category>
            <category><![CDATA[sql]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Thu, 10 Oct 2024 17:38:32 GMT</pubDate>
            <atom:updated>2024-10-10T17:38:32.729Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/366/1*CDRD76EIoOOa6nW7dWjQ3A.png" /></figure><p>In the previous two posts (<a href="https://medium.com/relta/the-problem-of-text-to-sql-9fa9df8d15ab">here</a> and <a href="https://medium.com/relta/is-it-possible-to-accurately-query-relational-data-in-plain-english-5659892ffda5">here</a>), I talked about why existing text-to-SQL solutions fall short and how a metrics layer can set up guardrails around the produced output. In this article I will discuss how <a href="https://www.linkedin.com/in/gaurav-bhatnagar/">Gaurav Bhatnagar</a> and I implemented these ideas in the Relta Python library.</p><p>The goal we set out to solve with Relta was simple: allow a SaaS developer to launch a reliable and secure natural language interface to their SQL data in less than two hours. Therefore there were three challenges to solve:</p><ol><li>How do we make the engine reliable?</li><li>How do we ensure data security and privacy?</li><li>How to deliver a great developer experience (easy to deploy and maintain)</li></ol><h3>Accuracy</h3><p>Given the results of our experiments on real world datasets (described in this <a href="https://medium.com/relta/is-it-possible-to-accurately-query-relational-data-in-plain-english-5659892ffda5">post</a>), we knew we wanted to ground the produced SQL in a pre-defined metrics layer. This is different from RAG, few shot prompting or fine-tuning a model. The agent can only produce SQL from the metrics layer and will simply respond that it cannot answer the question if it is not defined in the metrics layer.</p><p>Given the audience (developers) we decided with JSON as the format to describe the semantic layer. Building a good metrics layer can be a difficult and tricky process. We use an LLM agent to propose the initial metrics layer from DDL and an initial set of questions.</p><pre>{<br>  &quot;name&quot;: &quot;customer_consumption&quot;,<br>  &quot;description&quot;: &quot;Tracks the monthly consumption of customers, including total and average amounts, and allows filtering by customer segment and currency.&quot;,<br>  &quot;datasource&quot;: &quot;debit_card_specializing&quot;,<br>  &quot;dimensions&quot;: [<br>    {<br>      &quot;name&quot;: &quot;customerid&quot;,<br>      &quot;description&quot;: &quot;Unique identifier for each customer.&quot;<br>    },<br>    {<br>      &quot;name&quot;: &quot;date&quot;,<br>      &quot;description&quot;: &quot;Date of the transaction in YYYY-MM-DD format.&quot;<br>    },<br>    {<br>      &quot;name&quot;: &quot;currency&quot;,<br>      &quot;description&quot;: &quot;Currency used for the transaction, such as EUR or CZK.&quot;<br>    },<br>    {<br>      &quot;name&quot;: &quot;segment&quot;,<br>      &quot;description&quot;: &quot;Customer segment, such as SME, LAM, or KAM.&quot;<br>    }<br>  ],<br>  &quot;measures&quot;: [<br>    {<br>      &quot;name&quot;: &quot;total_consumption&quot;,<br>      &quot;description&quot;: &quot;Total consumption amount for the customer in the specified period.&quot;,<br>      &quot;agg_operation&quot;: &quot;SUM&quot;,<br>      &quot;expr&quot;: &quot;consumption&quot;<br>    },<br>    {<br>      &quot;name&quot;: &quot;average_consumption&quot;,<br>      &quot;description&quot;: &quot;Average consumption amount for the customer in the specified period.&quot;,<br>      &quot;agg_operation&quot;: &quot;AVG&quot;,<br>      &quot;expr&quot;: &quot;consumption&quot;<br>    }<br>  ],<br>  &quot;sample_questions&quot;: [<br>    &quot;What is the highest monthly consumption in the year 2012?&quot;,<br>    &quot;What was the average monthly consumption of customers in SME for the year 2013?&quot;,<br>    &quot;How many percent of LAM customer consumed more than 46.73?&quot;<br>  ],<br>  &quot;sql_to_underlying_datasource&quot;: &quot;SELECT yearmonth.customerid, yearmonth.date, customers.currency, customers.segment, yearmonth.consumption FROM public.yearmonth JOIN public.customers ON yearmonth.customerid = customers.customerid&quot;<br>}</pre><h3>Data security and privacy</h3><p>Most SaaS products run on multi-tenant DBs: different users data sit in the same database tables. As a rule, we are opposed to the idea of running LLM generated SQL on a production database. In the case of a multi-tenant database with different customer data, doing so risks a data leak between customers which could simply kill a company. Therefore we knew our architecture should make it impossible to affect the production database.</p><p>We use per-user DuckDB instances in Relta to achieve this. For those who are not familiar, DuckDB is an in-process analytical database (ie sqlite but for analytical workloads). The Relta library spins up an in-process DuckDB database and pulls in only the specific user’s data from the production database. Moreover, Relta compiles the metrics layer into views into this instance of DuckDB, creating a per-user transient sandboxed subset of the underlying data that is modeled for natural language querying.</p><p>As a result it is impossible for data to leak between users or for any SQL attack on the production database.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s3N_snjL_3uTlny45pvwxQ.png" /><figcaption>Relta high level architecture</figcaption></figure><h3>Developer experience</h3><p>From our experience, a small number of well defined metrics answer most real world user questions. However, even with assistance from an LLM and full control in manually refining the metrics, the initial metrics layer rarely covers all user questions. Therefore Relta has a separate “Refinements agent” which proposes changes to the metrics layer based on user feedback. These changes are integrated directly in the developer workflow showing up as PRs in the JSON that defines the metrics in the developer repo.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-QbGPvVbpHPx8B3xDzgB8A.png" /></figure><p>In the following video <a href="https://medium.com/u/e531cd8e15ec">Gaurav Bhatnagar</a> walks through how Relta works.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FfFCkO2YTwJk%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DfFCkO2YTwJk&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FfFCkO2YTwJk%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/40d07fcefb1aa7a54c977faf1b062a29/href">https://medium.com/media/40d07fcefb1aa7a54c977faf1b062a29/href</a></iframe><p>We are currently working with a select number of design partners to build AI assistants to SQL data in their SaaS products. We plan to release an open source version of Relta soon. If you are interested please sign up for our waitlist <a href="https://relta.dev">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=863604e98b91" width="1" height="1" alt=""><hr><p><a href="https://medium.com/relta/reliable-and-secure-natural-language-interfaces-for-saas-863604e98b91">Reliable and secure natural language interfaces for SaaS</a> was originally published in <a href="https://medium.com/relta">Relta</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Is it possible to accurately query relational data in plain English?]]></title>
            <link>https://medium.com/relta/is-it-possible-to-accurately-query-relational-data-in-plain-english-5659892ffda5?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/5659892ffda5</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Fri, 27 Sep 2024 14:02:16 GMT</pubDate>
            <atom:updated>2024-10-29T20:22:55.140Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kKTkikeyWhwuypST87FbRQ.png" /></figure><p>… or why AI is happier in a metrics box</p><p>In an <a href="https://medium.com/relta/the-problem-of-text-to-sql-9fa9df8d15ab">earlier post</a>, I talked about text-to-SQL, the excitement around it with the launch of ChatGPT and the subsequent disappointment of users, founders and investors. Assuming people actually want to interact with software in natural language (which I think is self-evident), the follow up question is whether these interfaces are even possible to build with today’s tech. But first: what does a good natural language interface to data look like?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/690/1*Qwv5-zCFtSBRpA44W8TJCw.png" /><figcaption>Who can design a good natural language interface?</figcaption></figure><p>Before talking about the technical feasibility of some <strong><em>technology</em></strong>, we should take a look at the <strong><em>product</em></strong> that developers are trying to build with it. The main natural language interfaces currently in use are AI assistants (aka chatbots that can do take actions). One overarching challenge with building an AI assistant is that unlike a GUI which has breadcrumbs and visual queues to queue possible interactions, users often don’t know what they can prompt the assistnt for. Therefore, a well designed AI assistant must be able to identify unanswerable cases for its domain, and nudge the user towards items it is trained to handle. Furthermore, it will often need to route the question to different data sources it is connected to depending on the question.</p><p>Let’s take an example of an AI assistant for some generic HR software. The AI Assistant needs to classify and retrieve data from different data sources for each of the following types of questions:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dI05PPOegkpxoz6lTa2PeA.png" /></figure><p>The ability to detect “unanswerable” questions is important in building trust between users and the AI assistant. If the assistant hallucinates answers to questions it cannot ground from a data source, users will stop using it. In the real world, companies tend to roll out new data connections for their AI assistants over time, often starting with unstructured sources first (similar to #2 in the above sample). In such a case if the assistant receives any of the other questions it should refrain from answering and suggest the topics it currently can cover. The issue of answerability will come up again as we dive deeper into connecting relational data sources to chatbots.</p><h3>Connecting relational data</h3><p>Circling back to the main topic of this post: assuming our AI assistant covers the basics on answerability and routing, can a tool be built for it handle questions from relational data reliably? We saw in <a href="https://medium.com/relta/the-problem-of-text-to-sql-9fa9df8d15ab">part 1 of this blog series</a> that allowing an LLM agent to write raw SQL will not work, as natural language is ambiguous and the solutions space of possible SQL statements is far too large.</p><p>An approach that has gained mindshare recently is to use a semantic layer (or metrics layer). This approach introduces an intermediate step where the natural language user input is matched to a set of pre-defined metrics and the metric is then compiled into SQL. The idea of a metrics layer in data stacks is not new. In a <a href="https://benn.substack.com/p/metrics-layer">popular blog post</a> from 2021 on self-serve analytics Benn Stancil (ex-CTO of Mode) popularized the idea of a metrics layer by poting out that:</p><blockquote><em>Self-serve [analytics] is a misunderstood (or, at least, misrepresented) problem. Because the most common question people have is “How often did this thing happen?,” effective self-serve is less about complex analysis and more about metric extraction. People “want to choose from a list of understood KPIs, apply it to a filtered set of records, and aggregate it by a particular dimension. It’s analytical Mad Libs — show me </em>average order size<em> for </em>orders that used gift cards<em> by </em>month<em>.”</em></blockquote><p>The observation above was not about natural language interfaces to SQL data. However at a <strong>product level</strong>, anyone interacting with an AI assitant to retrieve data is doing self-serve analytics. We set out to validate this hypothesis by analyzing real world usage of <a href="https://github.com/Dataherald/dataherald">Dataherald</a>’s text-to-SQL engine and the hypothesis was proven correct.</p><p>In one case, a Series B accounting software vendor had rolled out an AI assistant to its internal support team of accountants that answered customer questions from their data. Their connected dataset consisted of 71 tables with a total of 574 columns. From a set of 674 user prompts, 650 (~96.5%) were answerable, 14 (~2%) were unanswerable due to being normative, and 10 (~1.5%) were unrelated to the data.</p><p>We then used an LLM to generate a metrics layer based on the DDL and 92 sample natural language prompts. The suggested metrics layer <strong>had a total of 5 metrics which only referenced 28 of the underlying 574 columns in the underlying data (less than 5%)</strong>. Running the test set of 674 prompts 75% were answered accurately using the suggested metrics layer. This number was increased to 95% by manually adding two uncaptured metrics, so almost all of the answerable questions were captured with 7 metrics with 35 columns referenced from the 574 total columns.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ncXoCBupyIT-yxrW5rd4LQ.png" /></figure><h3>Rephrasing the problem</h3><p>So to build an accurate natural language interface we need an engine that can:</p><ol><li>Set up an initial metrics layer from the underlying datasource and some sample questions</li><li>Match questions to the metrics layer to identify answerable, unanswerable and unrelated questions</li><li>Suggest new metrics to the developer that can be deployed</li></ol><p>We took the learning above (plus data privacy when building Dataherald) to work on Relta.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/330/1*tB22FEh1M0EbCZtI9pQpvg.png" /></figure><p>In the <a href="https://medium.com/relta/reliable-and-secure-natural-language-interfaces-for-saas-863604e98b91">next post</a> I will discuss the details of how we built this engine at <a href="https://relta.dev">Relta</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5659892ffda5" width="1" height="1" alt=""><hr><p><a href="https://medium.com/relta/is-it-possible-to-accurately-query-relational-data-in-plain-english-5659892ffda5">Is it possible to accurately query relational data in plain English?</a> was originally published in <a href="https://medium.com/relta">Relta</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The problem of text-to-SQL]]></title>
            <link>https://medium.com/relta/the-problem-of-text-to-sql-9fa9df8d15ab?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/9fa9df8d15ab</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[langchain-agents]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Mon, 26 Aug 2024 21:19:02 GMT</pubDate>
            <atom:updated>2024-10-29T20:22:06.342Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1022/1*20Uz3Qh2FCIRIP75qIJq7g.jpeg" /></figure><p><em>This post is part of a three part series. Click to read </em><a href="https://medium.com/relta/is-it-possible-to-accurately-query-relational-data-in-plain-english-5659892ffda5"><em>part 2</em></a><em> and </em><a href="https://medium.com/relta/reliable-and-secure-natural-language-interfaces-for-saas-863604e98b91"><em>part 3</em></a><em>.</em></p><p>When software developers realized GPT3.5 could write syntactically correct SQL, the race was on to build text-to-SQL into everything. Every YCombinator batch since W22 includes a few “<a href="https://www.ycombinator.com/launches/K7T-buster-enable-your-customers-to-chat-with-their-data">chat</a> <a href="https://www.ycombinator.com/companies/vizly">with</a> <a href="https://www.ycombinator.com/companies/sled">your</a> <a href="https://www.ycombinator.com/launches/LgG-zeit-ai-insights-from-your-tabular-data-with-just-a-few-words">data</a>” startups, <a href="https://www.databricks.com/product/ai-bi/genie">data platforms</a> have added <a href="https://hex.tech/product/magic-ai/">natural language</a> interfaces and SaaS vendors have rolled out <a href="https://www.deel.com/deel-ai/">AI assistants</a> that answer questions from relational data. While the accuracy of these models keeps increasing in academic <a href="https://bird-bench.github.io/">benchmarks</a> (<a href="https://yale-lily.github.io/spider">new benchmarks</a> have to be built!) the features built on text-to-SQL are mediocre at best and annoying at worst. It is at a point where some are questioning if users even want <a href="https://techcrunch.com/2024/08/20/is-your-company-ai-washing-rippling-founder-parker-conrad-thinks-it-might-be/">these features at all</a>, even though there are few if any solutions that actually work.</p><p>Before continuing let’s point out that text-to-SQL features fall into two broad product categories:</p><ol><li><strong>SQL co-pilots — </strong>that deliver a first draft SQL to a data scientist who can modify it</li><li><strong>Natural language querying</strong> — where there is no human in the loop, the user does not see the intermediate SQL nor the underlying schema</li></ol><p>The focus of this blog is on case #2, where the user is not a data scientist and are not familiar with the schema they are querying.</p><p>There are two visions when it comes to the latter case of natural language querying:</p><ol><li><strong>Self serving from the data warehouse — </strong>you are a m<strong>i</strong>dsize company or large enterprise. You have invested lots of capital to become “data-driven.” This includes setting up a data warehouse, pipelines from various sources and lots of BI dashboards. Still very few people in the organization are using the data, and the data team is complaining about being inundated with requests from business teams. If only the analytical business teams could self serve from the data warehouse directly, better decisions would be made and profits would surely increase.</li><li><strong>Natural language interfaces in SaaS application — </strong>you are the CTO of a SaaS company (CRM, HR + payroll, analytics, …). It is 2024 and you have to become an AI company. Users like ChatGPT so they also must love to interact with your software through an AI assistant. Engineering resources are diverted to building an LLM based AI assistant that can retrieve data from the production database instead of pointing and clicking to through dashboard and reports.</li></ol><p>Both visions seem plausible. So two years since ChatGPT and millions of dollars spent why are neither of these visions even close to becoming a reality? Existing text-to-SQL approaches broadly do the following:</p><ol><li>Connect the tool to your existing relational database to a subset of your schemas, tables, columns, or views</li><li>Add “context” by providing verified samples for few shot prompting or fine-tuning. Add more context by chunking and adding metadata from unstructured sources like business documentation.</li><li>Build an LLM agent to build a prompt with the most relevant semantic context, execute it against the DB, and iterate to recover from errors</li></ol><p>In short they try to make the LLM understand a dataset and schema that was designed by a software developer to optimize performance on reads and writes by sprinkling some magical pixie dust (context). As a result the solutions are:</p><ul><li>Non-deterministic — similar prompts can create different answers, as shown in the example below.</li><li>Hard to train — there is no direct correlation between time invested in adding context and the resulting accuracy. It is a trial and error approach.</li><li>Inaccurate — as a result of the above, real-world scenarios have accuracy of 60%</li></ul><p>The following is a real world example of a a simple question from a single table that generates non-deterministic results even with extensive configuration and context added. The table is marketing leads recorded for a healthcare business, and the question is What are the top 3 most common reasons for losing a patient?</p><pre>CREATE TABLE leads ( <br>  facility INT64, <br>  sales_id INT64, <br>  patient_id INT64, <br>  source STRING, <br>  cost INT64, <br>  campaign_effectiveness STRING, <br>  zip_code INT64, <br>  lost_reasons STRING, <br>  lead_id INT64, <br>  status STRING, <br>  email STRING, <br>  initial_contact_date DATE, <br>  last_contact_date DATE);</pre><p>The above table has been scanned with the low cardinality columns identified: the status column is low-cardinality as it can only have values of converted , lost or In progress. Furthermore an admin has provided verified “golden SQL” which are used in few shot prompts and additional description on columns.</p><p>However the best in class text-to-SQL agents still produce the following two answers:</p><pre>-- Question: What are the top 3 most common reasons for losing a patient?<br><br>-- First potential answer produced around ~80% of the time<br>-- filters for distinct patient_id but not on status <br>SELECT lost_reasons, COUNT(distinct patient_id) AS number_of_patients_lost <br>FROM `leads` <br>WHERE lost_reasons IS NOT NULL <br>GROUP BY lost_reasons <br>ORDER BY number_of_patients_lost DESC <br>LIMIT 3<br><br>-- Second potential answer produced around ~20% of the time<br>-- Assumes a patient_id can be lost multiple time and filters on status = &#39;Lost&#39; <br>SELECT lost_reasons, COUNT(patient_id) AS number_of_patients_lost<br>FROM `leads`<br>WHERE lost_reasons IS NOT NULL AND status = &#39;Lost&#39;<br>GROUP BY lost_reasons<br>ORDER BY number_of_patients_lost DESC<br>LIMIT 3</pre><p>The two SQL above produce different answers from the data. This is not surprising since even to a SQL proficient human without additional information both answers could be correct. In fact one could even assume both should generate the same answer if the data is clean: if each patient_id can only be a lead once and for any lost lead the status and lost_reasons are recorded correctly. But in the real world, structured data is messy and cases like this are the norm.</p><p>In order for natural language querying to work, the dataset has to be modeled around the questions users will ask from the data. For our example above, lost_lead needs to be a deterministic metric. However, coming up with the deterministic set of metrics to cover all KPIs from raw data schemas is an incredibly hard and manual task. I will write write more about how this can be simplified in a future post.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9fa9df8d15ab" width="1" height="1" alt=""><hr><p><a href="https://medium.com/relta/the-problem-of-text-to-sql-9fa9df8d15ab">The problem of text-to-SQL</a> was originally published in <a href="https://medium.com/relta">Relta</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fine-tuning GPT-3.5-Turbo for Natural Language to SQL]]></title>
            <link>https://medium.com/@aazo11/fine-tuning-gpt-3-5-turbo-for-natural-language-to-sql-f0bb1db00e05?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/f0bb1db00e05</guid>
            <category><![CDATA[fine-tuning]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <category><![CDATA[openai]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Thu, 31 Aug 2023 12:24:15 GMT</pubDate>
            <atom:updated>2023-08-31T12:24:15.728Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BZIExhB4en3YQ5Xo" /><figcaption>Photo by <a href="https://unsplash.com/@maria_shalabaieva?utm_source=medium&amp;utm_medium=referral">Mariia Shalabaieva</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Background</h3><p>Allowing non-technical users to ask questions from a database has been a problem of interest in academia and industry for years. The recent advances in Large Language Model (LLM) technology, such as GPT-4, have improved the accuracy of proposed solutions. However, since the most advanced LLMs have not been open for fine-tuning, recent work in the space has focused on creating Retrieval-Augmented Generation (RAG) algorithms that can enable complex Natural Language to SQL (NL-to-SQL) scenarios without modifying the underlying LLM.</p><p>Last week, OpenAI opened up GPT-3.5-turbo for fine-tuning. In this post, we will fine-tune our own NL-to-SQL model and compare its performance against the state of the art RAG approach. We will use the <a href="https://yale-lily.github.io/spider">Spider dataset</a> from Yale university as our test benchmark.</p><h3>Fine-tuning GPT-3.5-Turbo for NL-to-SQL</h3><p>Like all model training and fine-tuning, the first step of fine-tuning GPT-3.5-Turbo is the creation and upload of a training dataset. Since GPT-3.5-Turbo is a ChatModel, this dataset must use to the following format, and be uploaded as a JSONL file:</p><pre>{&quot;messages&quot;: [{&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;system_prompt&quot;}, {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;user_prompt&quot;}, {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: &quot;assistant_prompt&quot;}]}<br>{&quot;messages&quot;: [{&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;system_prompt&quot;}, {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;user_prompt&quot;}, {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: &quot;assistant_prompt&quot;}]}<br>{&quot;messages&quot;: [{&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;system_prompt&quot;}, {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;user_prompt&quot;}, {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: &quot;assistant_prompt&quot;}]}</pre><p>The Spider dataset has a holdout test set of 2147 question/SQL pairs, a development set of 1034 question/SQL pairs, and a training set of 7000 question/SQL pairs. We will build our fine-tuning dataset in the structure above from the Spider training set.</p><h3>Creating the training dataset</h3><p>An NL-to-SQL task is defined as follows: <strong>given a question and database, identify a SQL query that when executed against the database returns a result set that can answer the question.</strong> Various approaches have been explored on how best to prompt LLMs for this task, and it is generally agreed that the prompt needs to include an instructional component, details of the database schema, information about the database’s content, a set of task-specific demonstrations and of course the actual question at hand.</p><p>Given the format of the ChatModel training data, the elements above have to be presented within the following three prompts:</p><ul><li><strong>system_prompt —</strong> will contain the instruction, database schema and database content</li><li><strong>user_prompt —</strong> will contain the natural language question</li><li><strong>assistant_prompt —</strong> where the SQL will be provided together with a reasoning step</li></ul><p>Let’s look at how to create each of these for our NL-to-SQL training dataset.</p><p><strong>The system prompt</strong></p><p>Creating the system_prompt is by far the most complex part of this exercise. At a minimum, the system_prompt needs to include:</p><ol><li>The system instruction</li><li>The DB schema</li><li>Information about the DB content</li></ol><p>In addition, for any real-world use case with a large number of tables, the samples in the training set should also train the model to select the correct tables from the DB for the SQL query (i.e perform schema-linking).</p><p><strong>System Instruction</strong></p><p>For the instruction we used the following standard prompt</p><pre>You are an assistant that is an expert in generating Sqlite SQL queries.<br>Having the access to database content, generate a correct Sqlite SQL query for the given question.<br>### Database content ###</pre><p><strong>Database Schema</strong></p><p>In the literature there are many proposed prompt formats for the database schema and content with no clear consensus around which performs best. We found the following to be the optimal representation of the database schema:</p><pre>CREATE TABLE concert (<br> &quot;concert_ID&quot; INTEGER NOT NULL,<br> &quot;concert_Name&quot; TEXT NOT NULL, - the name of the concert<br> &quot;Theme&quot; TEXT, - theme of the concert<br> &quot;Stadium_ID&quot; TEXT NOT NULL,<br> &quot;Year&quot; TEXT, PRIMARY KEY (&quot;concert_ID&quot;),<br> FOREIGN KEY(&quot;Stadium_ID&quot;)<br> REFERENCES stadium (&quot;Stadium_ID&quot;)<br> )<br><br>CREATE TABLE singer (<br>&quot;Singer_ID&quot; INTEGER NOT NULL,<br>&quot;Name&quot; TEXT, - name of the singer<br>&quot;Country&quot; TEXT NOT NULL, - country where the singer born<br>&quot;Song_Name&quot; TEXT NOT NULL, - the name of the song produced by the singer<br>&quot;Song_release_year&quot; TEXT, - The release year of the song<br>&quot;Age&quot; INTEGER,<br>&quot;Is_male&quot; BOOLEAN NOT NULL,<br>PRIMARY KEY (&quot;Singer_ID&quot;)<br>)</pre><p><strong>Database Content</strong></p><p>After much experimentation we found the following template to perform the best at training the model about the database content:</p><pre>/*<br>Columns in concert and 3 examples in each column for the high cardinality columns :<br>concert_ID: 1025 , 1101 , 1247<br>concert_Name : &quot;Fire&quot;, &quot;Dance&quot;, &quot;Sky&quot;<br>Stadium_ID : 9, 10, 11<br>*/<br>/*<br>Columns in concert and all categories for the low cardinality columns :<br>Theme : &quot; ROCK &quot;, &quot; POP &quot;, &quot; HIP-HOP &quot;<br>Year : 2022, 2021, 2023, 2020<br>*/<br><br>/*<br>Columns in concert and 3 examples in each column for the high cardinality columns :<br>Singer_ID : 10235 , 110231 , 1242447<br>Name : &quot;Jordan&quot;, &quot;Gabriel&quot;, &quot;Tiffany&quot;<br>Country : &quot;Iran&quot;, &quot;India&quot;, &quot;Canada&quot;<br>Song_Name : &quot;dance in the fire&quot;, &quot;rain&quot;, &quot;sky&quot;<br>Age : 19, 20, 21<br>*/<br>/*<br>Columns in concert and all categories for the low cardinality columns :<br>Is_male : &quot;MALE&quot;, &quot;FEMALE&quot;,<br>Song_release_year : 2022, 2021, 2023, 2020<br>*/</pre><p>An important element in the database content is how to identify categorical (low cardinality) columns. The threshold for distinguishing between low and high cardinality columns depends on the context window size of the Large Language Model (LLM) being fine-tuned. Given the 4096 token context window of GPT-3.5-turbo, we determined 20 tokens as the appropriate threshold between low and high cardinality columns.</p><p><strong>Schema Linking</strong></p><p>The final challenge in creating the system_prompts for our training set is to provide samples in such a way that train the model to correctly perform schema-linking on the database. To do this, we employed the following heuristic: for each individual NL &lt;&gt; SQL sample we included a random selection of other tables from the DB in addition to the correct tables until we reached the context window limit of 4000 tokens. To mitigate the influence of positional information, we further randomized the order of tables. In short, each system_prompt included the schema and content of the relevant tables mixed in with other irrelevant tables, helping train the model in picking the correct tables for the query.</p><p>We will now put all of this together to build our system_prompts.</p><p>For the sample below from Spider:</p><pre>Question : &quot;How many heads of the departments are older than 56 ?&quot;<br>SQL: &quot;SELECT count(*) FROM head WHERE age &gt; 56&quot;</pre><p>The system_prompt will be</p><pre>You are an assistant that is an expert in generating Sqlite SQL queries.<br>Having the access to database content, generate a correct Sqlite SQL query for the given question.<br>### Database content ###<br>CREATE TABLE trip (<br> id INTEGER, duration INTEGER,<br> start_date TEXT,<br> start_station_name TEXT,<br> start_station_id INTEGER,<br> end_date TEXT,<br> end_station_name TEXT,<br> end_station_id INTEGER,<br> bike_id INTEGER,<br> subscription_type TEXT,<br> zip_code INTEGER,<br> PRIMARY KEY (id)<br> )<br>/* Columns in trip and 3 examples in each column for high cardinality columns :<br> id : 900645, 900752, 900524<br> duration : 1131, 2146, 1155<br> start_date : 8/21/2015 17:39, 8/21/2015 17:03, 8/21/2015 17:16<br> start_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th<br> start_station_id : 56, 65, 49 end_date : 8/21/2015 17:19, 8/21/2015 18:08, 8/21/2015 17:32<br> end_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th<br> end_station_id : 56, 65, 49<br> bike_id : 586, 56, 65<br> zip_code : 94070, 94530, 94040-1724<br> */ <br>/* Columns in trip and all categories for low cardinality columns :<br> subscription_type : Customer, Subscriber<br> */<br><br> CREATE TABLE management (<br> &quot;department_ID&quot; INTEGER,<br> &quot;head_ID&quot; INTEGER,<br> temporary_acting TEXT,<br> PRIMARY KEY (&quot;department_ID&quot;, &quot;head_ID&quot;),<br> FOREIGN KEY(&quot;head_ID&quot;) REFERENCES head (&quot;head_ID&quot;),<br> FOREIGN KEY(&quot;department_ID&quot;) REFERENCES department (&quot;Department_ID&quot;)<br> )<br> /* Columns in management and all categories for low cardinality columns :<br> department_ID : 7, 15, 2, 11<br> head_ID : 5, 4, 6, 3, 10<br> temporary_acting : Yes, No<br> */<br><br> CREATE TABLE department (<br> &quot;Department_ID&quot; INTEGER,<br> &quot;Name&quot; TEXT,<br> &quot;Creation&quot; TEXT,<br> &quot;Ranking&quot; INTEGER,<br> &quot;Budget_in_Billions&quot; REAL,<br> &quot;Num_Employees&quot; REAL,<br> PRIMARY KEY (&quot;Department_ID&quot;)<br> )<br> /* Columns in department and 3 examples in each column for high cardinality columns :<br> Department_ID : 1, 13, 11<br> Name : Energy, Interior, Health and Human Services<br> Creation : 1913, 1979, 1989<br> Ranking : 1, 13, 11<br> Budget_in_Billions : 10.7, 77.6, 59.7<br> Num_Employees : 112557.0, 3000000.0, 235000.0<br> */<br><br><br>...<br><br><br>CREATE TABLE head (<br> &quot;head_ID&quot; INTEGER,<br> name TEXT,<br> born_state TEXT,<br> age REAL,<br> PRIMARY KEY (&quot;head_ID&quot;)<br> )<br> /* Columns in head and all categories for low cardinality columns :<br> head_ID : 1, 2, 5, 7, 8, 4, 6, 3, 10, 9<br> name : Jeff Maggert, Pádraig Harrington, Billy Mayfair, K. J. Choi, Dudley Hart, Sergio García, Stewart Cink, Tiger Woods, Nick Faldo, Franklin Langham<br> born_state : Delaware, Connecticut, Alabama, California, Florida<br> age : 69.0, 67.0, 68.0, 53.0, 56.0, 52.0, 50.0, 43.0<br> */<br><br>...</pre><p><strong>The user prompt</strong></p><p>The user prompt is simple, the user question for each sample in Spider. For example:</p><pre>How many heads of the departments are older than 56 ?</pre><p><strong>The assistant prompt</strong></p><p>The assistant prompt is also simple, containing the associated SQL query from Spider and the reasoning step to find the correct column and correct table for the SQL query. To construct the reasoning step we simply extracted the tables and columns that are used in the SQL query. For example:</p><pre>To construct the query, I&#39;ll be working with the following tables: head.<br>From these tables, I&#39;ll be using the following columns: age.<br>The SQL query I&#39;ll be generating is:<br>SELECT count(*) FROM head WHERE age &gt; 56</pre><h3>Submitting the training set for fine-tuning</h3><p>Once we have created the JSONL file (you can find a small sample <a href="https://github.com/Dataherald/dataherald-cookbook/blob/main/spider-fine-tuning/spider-fine-tuning-sample-dataset.jsonl">here</a>), the next step involves uploading the created file to OpenAI using the following command:</p><pre>openai.api_key = os.getenv(&quot;OPENAI_API_KEY&quot;)<br>print(openai.File.create(file=open(&quot;spider-finetuning.jsonl&quot;, &quot;rb&quot;),purpose=&#39;fine-tune&#39;))</pre><p>After uploading the file you can check the status of the upload using the following command:</p><pre>print(openai.File.retrieve(id=&quot;file-id&quot;))<br>#OR<br>print(openai.File.list())</pre><p>The result should be something like this:</p><pre>{<br>  &quot;object&quot;: &quot;file&quot;,<br>  &quot;id&quot;: &quot;file-id&quot;,<br>  &quot;purpose&quot;: &quot;fine-tune&quot;,<br>  &quot;filename&quot;: &quot;file&quot;,<br>  &quot;bytes&quot;: 71699079,<br>  &quot;created_at&quot;: 1693343752,<br>  &quot;status&quot;: &quot;uploaded&quot;,<br>  &quot;status_details&quot;: null<br>}</pre><p>When the status has changed to processed (similar to below) you can use the file for fine-tuning:</p><pre>{<br>  &quot;object&quot;: &quot;file&quot;,<br>  &quot;id&quot;: &quot;file-id&quot;,<br>  &quot;purpose&quot;: &quot;fine-tune&quot;,<br>  &quot;filename&quot;: &quot;file&quot;,<br>  &quot;bytes&quot;: 71699079,<br>  &quot;created_at&quot;: 1693343752,<br>  &quot;status&quot;: &quot;processed&quot;,<br>  &quot;status_details&quot;: null<br>}</pre><p>Now, we are ready to start the fine-tuning job. To create a fine-tuning job you can use the following python code:</p><pre>print(openai.FineTuningJob.create(<br>     training_file=&quot;file-id&quot;,<br>     model=&quot;gpt-3.5-turbo&quot;,<br>     suffix = &quot;spider&quot;,<br>     hyperparameters = {<br>         &quot;n_epochs&quot;: #number_of_epochs,<br>     })<br>     )</pre><p>The duration of the fine-tuning process will vary depending on the size of the fine-tuning dataset. There is a maximum token limit for fine-tuning, which is set at 50,000,000 tokens. Therefore, when working with the Spider dataset, we reduced the number of samples from 7,000 to 5,750 and conducted fine-tuning for a total of 2 epochs.</p><p>You can check the status of the fine-tuning job using the following command:</p><pre>print(openai.FineTuningJob.retrieve(id=&quot;ftjob-id&quot;))</pre><p>The result should be something like this:</p><pre>{<br>  &quot;object&quot;: &quot;fine_tuning.job&quot;,<br>  &quot;id&quot;: &quot;ftjob-id&quot;,<br>  &quot;model&quot;: &quot;gpt-3.5-turbo-0613&quot;,<br>  &quot;created_at&quot;: 1693346245,<br>  &quot;finished_at&quot;: 1693353313,<br>  &quot;fine_tuned_model&quot;: &quot;ft:gpt-3.5-turbo-0613:dataherald:spider:id&quot;,<br>  &quot;organization_id&quot;: &quot;org-id&quot;,<br>  &quot;result_files&quot;: [<br>    &quot;file-id&quot;<br>  ],<br>  &quot;status&quot;: &quot;succeeded&quot;,<br>  &quot;validation_file&quot;: null,<br>  &quot;training_file&quot;: &quot;file-id&quot;,<br>  &quot;hyperparameters&quot;: {<br>    &quot;n_epochs&quot;: 2<br>  },<br>  &quot;trained_tokens&quot;: 44722020<br>}</pre><h3>Model Performance</h3><p>We benchmarked the performance of the fine-tuned model against GPT3.5-Turbo without fine-tuning and DIN-SQL + GPT-4 (the current state of the art on Spider) for zero-shot performance.</p><p>The results are as follows</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ICBvQjjKVgejf4Z0Dcee_A.png" /><figcaption>Performance of the fine-tuned GPT-3.5-Turbo against previous methods.</figcaption></figure><p>Fine-tuning GPT-3.5-Turbo yielded a performance improvement of nearly 11 percent brining its accuracy in line with the DIN-SQL + GPT-4, the current state-of-the-art approach which uses GPT-4 and employs various advanced prompting techniques, including few-shot prompting, chain-of-thought prompting and decomposed prompting.</p><p>Critically, the fine-tuned model <strong>significantly reduces both cost and processing time</strong> when compared to the DIN-SQL + GPT-4 approach. The table below provides an approximate cost and speed of difference between the models per single question from Spider.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VUX_trEBQ482E2LWdg0Msw.png" /><figcaption>Cost and speed of different models per question from Spider benchmark</figcaption></figure><p>As demonstrated above, the cost of the fine-tuned GPT-3.5-Turbo model is <strong>30 times less</strong> than DIN-SQL with GPT-4 and it is <strong>12 times faster.</strong></p><h3>Conclusion and Next Steps</h3><p>The results from the experiment are clear: with an initial investment of time and money to build a training dataset the state of the art can be matched in accuracy, while being 12 times faster and 30 times cheaper.</p><p>Fine-tuning is a powerful tool in the NL-2-SQL arsenal. However it is not a silver bullet as few organizations have NL-to-SQL training datasets readily available. It is our belief that the best architectures will combine fine-tuned models together with RAG agents. With the anticipated launch of GPT-4 fine-tuning, we expect progress in the field to accelerate further and finally unlock question-answering from structured data for all businesses.</p><p>In the next post we will show how to plug in the fine-tuned model above into the <a href="https://github.com/Dataherald/dataherald">Dataherald engine</a> and deploy it in a real world scenario.</p><p>If you are interested in NL-2-SQL discussions you can join our <a href="https://discord.gg/A59Uxyy2k9">Discord server.</a> If you want to allow non-technical users to ask questions from your company’s data warehouse please join our <a href="https://www.dataherald.com/">waitlist</a>.</p><h3>References</h3><p>DIN-SQL paper: <a href="https://arxiv.org/abs/2304.11015">https://arxiv.org/abs/2304.11015</a></p><p>NL-to-SQL useful papers:</p><p>How to Prompt LLMs for Text-to-SQL: <a href="https://arxiv.org/abs/2305.11853">https://arxiv.org/abs/2305.11853</a></p><p>Divide and Prompt: <a href="https://arxiv.org/abs/2304.11556">https://arxiv.org/abs/2304.11556</a></p><p>Exploring Chain-of-Thought Style Prompting for Text-to-SQL: <a href="https://arxiv.org/abs/2305.14215">https://arxiv.org/abs/2305.14215</a></p><p>A comprehensive evaluation of ChatGPT’s zero-shot Text-to-SQL capability: <a href="https://arxiv.org/abs/2303.13547">https://arxiv.org/abs/2303.13547</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f0bb1db00e05" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Enterprise Natural Language to SQL is hard]]></title>
            <link>https://medium.com/dataherald/why-enterprise-natural-language-to-sql-is-hard-8849414f41c?source=rss-5ed30a0aa18------2</link>
            <guid isPermaLink="false">https://medium.com/p/8849414f41c</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <dc:creator><![CDATA[Amir Zohrenejad]]></dc:creator>
            <pubDate>Tue, 08 Aug 2023 00:02:40 GMT</pubDate>
            <atom:updated>2024-03-18T22:51:06.103Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Enterprise SQL" src="https://cdn-images-1.medium.com/max/512/1*xyCG9pEk7ZNBWhIvG2NbrQ.jpeg" /></figure><p>“The future of BI is conversational.” This has been the prediction of industry analysts for a number of years. Yet despite the amazing progress in conversational LLM-based applications in the past year such as ChatGPT + Bard and new powerful models like GPT-4, conversational BI is still not deployed in most companies. Business users are still looking for insights in BI dashboards and data analysts are still sifting through Slack and Jira tickets, opening up a SQL engine connected to their data warehouse and hand-writing SQL queries to answer ad-hoc business questions. Why is conversational BI still not here?</p><p>While structured data only makes up ~20% of the world’s data, the majority of enterprise data is still in structured data stores and accessible mainly through SQL queries. Therefore at a high level in order to enable conversational BI, a solution needs to be devised that can translate natural language business questions to valid SQL queries that are then executed against the enterprise data warehouse. Engineers have tried to build “Natural Language to SQL” (NL2SQL) engines since the 70s (using rules-based techniques) which would very quickly get too complex to be useful. But with the advancement of transformers which have enabled tools like GitHub CoPilot and OpenAI Code Interpreter it would seem this should be a trivial problem to solve. It is not.</p><p>There are (at least) two ways a company can build an LLM-based NL2SQL engine to enable conversational BI:</p><ol><li><strong>Fine-tuning your own LLM —</strong> This approach would require taking an existing LLM and then training it further using NL&lt;&gt;SQL pairs relating to the company’s structured data. A couple of challenges with this approach are that a) coming up with the training dataset is hard and expensive and b) the most powerful LLM model around (GPT-4) cannot be fine-tuned (as of this writing).</li><li><strong>Leveraging In-context learning —</strong> The latest LLM models (like GPT-4–32K) can write SQL quite well out of the box and have enough context window for quite a bit of few shot training and for an <a href="https://www.pinecone.io/learn/series/langchain/langchain-agents/">agent</a> to try to recover from errors by performing follow-ups using chain-of-thought techniques. The idea here is to build an LLM agent on top of GPT-4 that can implement NL2SQL with few shot learning.</li></ol><p>So what are the challenges of deploying solution #2? Here are six we have encountered:</p><ol><li><strong>Table and Column descriptions</strong>— Even the best data teams often do not have clear documentation about tables, columns and metadata. With the rise of ELT where data is simply dumped in the warehouse from various sources and transformed on query the situation becomes even worse. Therefore the table and column names might be the only info available to the engine at configuration time.</li><li><strong>Missing Context and Metadata–- </strong>There are often business definitions which live in data analyst’s heads and are not in the underlying data. We encountered a real-world home rental marketplace, for which what constitutes an “active listing” is a combination of WHERE clauses which are different based on the value of another column which specifies the building_type. In rare cases these are stored as Views on the table, but more often that not they are just stored in a query in the BI tool/dashboard.</li><li><strong>Incomplete info in question, lack of “common sense” — “</strong>what was the average rent in Los Angeles in May 2023?” A reasonable human receiving this question would simply assume the question is about Los Angeles, CA or would confirm with the asker in a follow up. However an LLM usually translates this to select price from rent_prices where city=”Los Angeles” AND month=”05” AND year=”2023”which pulls up data for Los Angeles, CA and <a href="https://en.wikipedia.org/wiki/Los_Angeles,_Willacy_County,_Texas">Los Angeles, TX</a> without even getting columns to differentiate between the two</li><li><strong>Speed — </strong>In order for the engine to be “conversational,” response times must be fast (sub 30s). This is often very hard to achieve, especially if the agent tried to recover from errors or evaluate generated responses with subsequent LLM calls.</li><li><strong>Complex Queries – </strong>While GPT-4 writes simple SQL queries very well, it can often stumble on complex queries that require aggregations and joins. This is exacerbated in cases where the column name contains an action that can be done in SQL (for example Average or SUM)and in join operations on data warehouses where FOREIGN KEYS are not clearly enforced like they are in production DBs.</li><li><strong>Privacy and Data Leaking – </strong>Many organizations do not want their database data or schema being sent to companies like OpenAI since it can leak into their training corpus.</li><li><strong>Validation – </strong>There is no known way to identify cases where the system returns a syntactically valid but incorrect SQL. For example if the user asks for and ‘average’ value, and the system runs an AVG instead of picking a column called ‘average_price’</li></ol><p>So is enterprise conversational BI impossible in 2023? Will there be a few more years of academic papers and company AI hackathon projects before a solution can be deployed in production? We don’t think so.</p><p>While the challenges are definitely real, we believe with the right tool an enterprise data team can deploy solutions to enable business users to self-serve ad-hoc data questions from the company data warehouse. In the coming weeks we will be releasing a number of open source and hosted tools to address this.</p><p>If you are interested in contributing to or deploying NL2SQL for your enterprise, please reach out.</p><h3>About Dataherald</h3><ul><li>Sign up for free and use the <a href="https://www.dataherald.com/news/introducing-dhai?utm_medium=social&amp;utm_source=medium&amp;utm_campaign=post&amp;utm_term=why_enterprise_natural_language_to_sql_is_hard">hosted version</a> of Dataherald</li><li>Our open-source engine is available on <a href="https://github.com/dataherald?utm_medium=social&amp;utm_source=medium&amp;utm_campaign=post&amp;utm_term=why_enterprise_natural_language_to_sql_is_hard">Github</a>.</li><li>Join our <a href="https://discord.com/invite/umZXvgPDBr">Discord </a>server to learn more about the project.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8849414f41c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dataherald/why-enterprise-natural-language-to-sql-is-hard-8849414f41c">Why Enterprise Natural Language to SQL is hard</a> was originally published in <a href="https://medium.com/dataherald">Dataherald</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>