7 Techniques of Semantic Modeling for Databricks Genie

Kyle Hale
12 min readJul 2, 2024

--

A key application for GenAI that has spared a lot of interest is text to SQL (or “text2SQL”) chatbots: an interface where a user can come and ask questions of a dataset in plain language and the chatbot can convert their question into a SQL query and retrieve results from a database.

At Databricks, we’ve recently introduced a new product Genie as part of our data intelligent platform to meet this need. We’ve gone beyond simple text-to-SQL translation to include instructions, feedback loops, and a robust underlying compound AI framework to minimize errors, all sitting on top of a fully governed and secure lakehouse with Unity Catalog.

In many aspects, developing a Genie space for user consumption is very similar to developing a traditional BI semantic model or analytical data product. They have the same objectives: to provide trustworthy, accurate, and useful insights out of such a system, and to serve as a business-friendly conduit between the raw data in your organization and the questions you and your users have.

As we’ll also see, many of the techniques to enhance their usefulness and accuracy are the same between the two of them.

In this blog I’ll discuss 7 common techniques for enhancing the semantic model of your Genie, and share examples of how they improve the accuracy of such a system.

But first ….

Thank you, Retrosheet!

For our data set examples, we’re using the baseball play by play data provided by Retrosheet. Retrosheet is an all-volunteer organization painstakingly transcribing box scores, dailies, newspaper articles, radio broadcasts, and numerous other sources from long ago into a well-curated database, with the ultimate mission of providing and preserving as complete a historical record of baseball on the field as possible.

Please consider donating to the Retrosheet organization for all of their awesome contributions to baseball history!

Technique #1: Entity Relationships

Without relationship definitions, assistants are unwilling to make assumptions about how two or more tables can be joined together. By defining primary and foreign keys between our tables we give the LLMs important metadata it needs to answer queries that cross tables.

Question: Who hit the most home runs in 1995?

Before

SELECT
bat_id,
SUM(h_fl) AS total_home_runs
FROM
powerbi_demos.baseball.event
WHERE
YEAR(game_id) = 1995
GROUP BY
bat_id
ORDER BY
total_home_runs DESC
LIMIT
1

This incorrectly returns an empty result. Genie didn’t know how to join to the event description or game tables to get the proper fields to find the year and home run events in the dataset. Instead it simply used the limited fields it had available out of a single table.

Technique

By adding primary and foreign keys to the various tables in our dataset, Genie will able to use the correct tables to answer the question.

alter table gold_game add constraint game_pk PRIMARY KEY(game_id);
ALTER TABLE gold_bio ADD CONSTRAINT player_pk PRIMARY KEY(player_id);
alter table gold_event add constraint event_pk primary key(event_id);

alter table gold_fact_pas add constraint
pas_batter_fk FOREIGN KEY(bat_id) REFERENCES gold_bio;
alter table gold_fact_pas add constraint
pas_game_fk foreign key(game_id) REFERENCES gold_game;
alter table gold_fact_pas add constraint
pas_bat_event_fk foreign key(event_id) REFERENCES gold_event;

After

select
nickname,
last,
count(1) as home_runs
from
powerbi_demos.baseball.event
inner join powerbi_demos.baseball.code_event on
powerbi_demos.baseball.event.event_cd = powerbi_demos.baseball.code_event.code
inner join powerbi_demos.baseball.game on
powerbi_demos.baseball.event.game_id = powerbi_demos.baseball.game.game_id
inner join powerbi_demos.baseball.bio on
powerbi_demos.baseball.bio.player_id = powerbi_demos.baseball.event.bat_id
where
powerbi_demos.baseball.code_event.description = 'Home run'
and year(powerbi_demos.baseball.game.game_dt) = 1995
group by
bat_id,
nickname,
last
order by
count(1) desc
limit
1

This correctly returns Albert Belle and his 54 home runs. (Baseball Nerd Note: at this point the data also counts postseason batting data.)

Technique 2: Enhancing Table and Column Metadata

LLMs are also able to take table and column descriptions and tags and use those as part of their prompt to generate a correct query. Ideally this metadata should be more universal for the table or column in question, and could also be used in data discovery or user-facing lineage tools to help others make sense of the data.

In the example below, the LLM hallucinates a filter value for a field.

Question: Name the pitchers who pitched in the 1959 World Series.

Before

SELECT
DISTINCT gold_bio.playerName
FROM
powerbi_demos.baseball.event
INNER JOIN powerbi_demos.baseball.game ON event.game_id = game.game_id
INNER JOIN powerbi_demos.baseball.gold_bio ON
gold_bio.player_id = event.pit_id
WHERE
year(game.game_dt) = 1959
AND event.game_id IN (
SELECT
game_id
FROM
powerbi_demos.baseball.game_types
WHERE
game_type = 'WS'
)

Because there is no “WS” game_type, this returns an empty result.

Technique

We want to add a comment to this column in Unity Catalog to help Genie understand the different game types.

In this case, because the game_type column is just a fixed list of enumerated values, adding those values directly to the comment on the column in Unity Catalog makes the most sense.

After

SELECT
DISTINCT gold_bio.playerName
FROM
powerbi_demos.baseball.event
INNER JOIN powerbi_demos.baseball.game ON event.game_id = game.game_id
INNER JOIN powerbi_demos.baseball.game_types ON game_types.game_id = game.game_id
INNER JOIN powerbi_demos.baseball.gold_bio ON gold_bio.player_id = event.pit_id
WHERE
game_types.game_type = 'World Series'
AND YEAR(game.game_dt) = 1959
"playerName"
"Larry Sherry"
"Roger Craig"
"Billy Pierce"
"Don Drysdale"
"Stan Williams"
"Ray Moore"
"Clem Labine"
"Johnny Klippstein"
"Bob Shaw"
"Johnny Podres"
"Chuck Churn"
"Early Wynn"
"Dick Donovan"
"Turk Lown"
"Gerry Staley"
"Sandy Koufax"

Genie now uses the correct filter and returns the 16 pitchers who participated in the Series.

Technique 3: Proper Data Types

Many BI tools provide their own data type mappings and can convert between them automatically, but as Genie operates directly over SQL, it’s critical that for fields that need to be compared or calculated against together that the data types are in sync.

In our original data set, players’ birthdates are provided as strings, and so Genie isn’t able to properly calculate their ages.

Question: How old was Al Kaline when he hit his last home run?

Before

SELECT
bio.first,
bio.last,
bio.birthdate,
MAX(game.game_dt) AS last_hr_date,
TIMESTAMPDIFF(YEAR, bio.birthdate, MAX(game.game_dt)) AS age_at_last_hr
FROM
powerbi_demos.baseball.event
INNER JOIN powerbi_demos.baseball.game ON event.game_id = game.game_id
INNER JOIN powerbi_demos.baseball.bio ON event.bat_id = bio.player_id
INNER JOIN powerbi_demos.baseball.code_event ON event.event_cd = code_event.code
WHERE
code_event.description = 'Home run'
AND bio.last = 'Kaline'
GROUP BY
bio.first,
bio.last,
bio.birthdate
"first","last","birthdate","last_hr_date","age_at_last_hr"
"Albert William","Kaline","12/19/1934","1974-09-18T00:00:00.000","undefined"

The query itself is accurate, but the TIMESTAMPDIFF function in Databricks SQL returns null when using a string in the second argument.

Technique

By correctly converting the players’ birthdate column into a date data type, we get the answer we were looking for.

create or replace table gold_bio as
select player_id, concat_ws(' ', nickname, last) as playerName,
to_date(
-- some players only have a year listed
case when len(case when int(birthdate) = birthdate
then concat('01/01/', birthdate)
else birthdate end) = 10
then case when int(birthdate) = birthdate
then concat('01/01/', birthdate)
else birthdate end else null end
, 'MM/dd/yyyy') as birth_date,
birth_city, birth_state, birth_country,
bats, throws, height, weight
from bronze_bio;

After

SELECT
bio.first,
bio.last,
bio.birthdate,
MAX(game.game_dt) AS last_hr_date,
TIMESTAMPDIFF(YEAR, bio.birthdate, MAX(game.game_dt)) AS age_at_last_hr
FROM
powerbi_demos.baseball.event
INNER JOIN powerbi_demos.baseball.game ON event.game_id = game.game_id
INNER JOIN powerbi_demos.baseball.bio ON event.bat_id = bio.player_id
INNER JOIN powerbi_demos.baseball.code_event ON event.event_cd = code_event.code
WHERE
code_event.description = 'Home run'
AND bio.last = 'Kaline'
GROUP BY
bio.first,
bio.last,
bio.birthdate
"first","last","birthdate","last_hr_date","age_at_last_hr"
"Albert William","Kaline","1934-12-19T00:00:00.000","1974-09-18T00:00:00.000","39"

Instructions

Instructions are probably the newest and least well understood technique today to improve Genie spaces. Effectively they are business logic, synonyms, and other rules written in plain English to help Genie better translate the user question into SQL that’s relevant for both the dataset and the user’s context.

The two main types of instructions are instructions

  1. Instructions about the engine or dataset itself.
  2. Instructions about the domain of the dataset.

We’ll cover both of these in the next two techniques.

Technique 4: Supply Instructions about the engine or dataset

Instructions about the dataset or engine help Genie understand technical or logical aspects of the fields and how to interact with them within the confines of Databricks SQL, irrespective of the domain.

Some examples might include:

  • which fields should be searched with wilcards
  • which fields are case-sensitive
  • which fields contain duplicate values (and need a different join key)
  • what to do if a field is NULL or has a specific value such as “Unknown”

Question: How many games were played in Yankee Stadium in 2014?

Before

SELECT
COUNT(*) AS games_played
FROM
powerbi_demos.baseball.gold_game
WHERE
park_id IN (
SELECT
park_id
FROM
powerbi_demos.baseball.gold_park
WHERE
park_name ILIKE 'Yankee Stadium'
)
AND YEAR(game_dt) = 2014
"games_played"
0

You and Genie would be forgiven for assuming this query is correct!

Here the issue is the park_name field in the dataset has Roman numerals for the various iterations of stadium names e.g. the Yankees today play in “Yankee Stadium II.”

Technique

Let’s add an instruction to ensure the park field is searched with wildcards to account for this. (Note: we could also modify the data model by marking different parks active, or removing the Roman numerals … but the instruction is a less intrusive option.)

After

SELECT
COUNT(*) AS games_played
FROM
powerbi_demos.baseball.gold_game
WHERE
park_id IN (
SELECT
park_id
FROM
powerbi_demos.baseball.gold_park
WHERE
park_name ILIKE '%Yankee Stadium%'
)
AND YEAR(game_dt) = 2014
"games_played"
81

Our wildcards instruction worked and Genie is able to find the correct total.

Technique 5: Supply Instructions about the domain of the Genie space

Instructions about the domain help Genie understand parts of user questions where the users may assume context about the domain that Genie is not aware of.

Genie actually knows a lot about many domains (including our example, baseball) so it’s not usually necessary to add a lot of instructions upfront. This is an area that can grow over time as users interact with the chatbot.

Some examples might include:

  • jargon and term definitions (what is a “perfect game”?)
  • specific metrics (e.g. how to calculate Pythagorean win percentage)
  • disambiguation (the difference between a “franchise” and a “team”)
  • additional guidance when Genie makes incorrect assumptions (not every game lasts 9 innings)

Question: How many hits did Aaron Judge have in 2018?

Before

SELECT
bat_id,
SUM(h_fl) AS total_home_runs
FROM
powerbi_demos.baseball.event
WHERE
YEAR(game_id) = 1995
GROUP BY
bat_id
ORDER BY
total_home_runs DESC
LIMIT
1
"hits"
123

Because we haven’t added any instruction about what games to count, this query combines Aaron Judge’s regular season and post season hits. This of course is technically correct but it’s not what most people are expecting when they ask this sort of question.

Historically, regular season totals have been kept separately from post season totals to support fairer comparisons between players across seasons.

Technique

We can add an instruction explicitly to our Genie Space to account for this (feel free to peek at the other instructions to give you a flavor of :

After

SELECT
COUNT(*) AS hits
FROM
powerbi_demos.baseball.gold_fact_pas
INNER JOIN powerbi_demos.baseball.gold_event ON gold_fact_pas.event_id = gold_event.event_id
INNER JOIN powerbi_demos.baseball.gold_game ON gold_fact_pas.game_id = gold_game.game_id
INNER JOIN powerbi_demos.baseball.gold_bio ON gold_fact_pas.bat_id = gold_bio.player_id
WHERE
gold_bio.playerName ILIKE 'Aaron Judge'
AND gold_game.game_type = 'Regular Season'
AND YEAR(gold_game.game_dt) = 2019
AND (
gold_event.description ILIKE 'Single'
OR gold_event.description ILIKE 'Double'
OR gold_event.description ILIKE 'Triple'
OR gold_event.description ILIKE 'Home run'
)
"hits"
115

The query now uses our instruction and applies the Regular Season game_type filter to our query, giving the correct total.

Technique 6: Example Queries

While many domain concepts, synonyms, and the like can be handled through instructions, sometimes it’s easier to “show” rather than “tell” the LLM how to calculate certain things.

For example, in baseball a player’s batting average is calculated as the number of hits divided by the number of plate appearances minus plate appearances that ended in a walk, being hit by a pitch, a sacrifice hit, or a sacrifice fly. You can certainly just copy this definition into the instruction and Genie will try to apply it as faithfully as it can.

But by supplying a specific query that correctly calculates and returns batting average, it lets Genie bypass any interpretation — if your SQL is correct, it can just reuse that when answering similar questions.

Question: What was Reggie Jackson’s batting average every year of his career?

Before

WITH Hits AS (
SELECT
YEAR(gold_game.game_dt) AS year,
COUNT(*) AS hits
FROM
powerbi_demos.baseball.gold_fact_pas
INNER JOIN powerbi_demos.baseball.gold_game ON gold_fact_pas.game_id = gold_game.game_id
INNER JOIN powerbi_demos.baseball.gold_bio ON gold_fact_pas.bat_id = gold_bio.player_id
INNER JOIN powerbi_demos.baseball.gold_event ON gold_fact_pas.event_id = gold_event.event_id
WHERE
gold_bio.playerName ILIKE 'Reggie Jackson'
AND (
gold_event.description ILIKE 'Single'
OR gold_event.description ILIKE 'Double'
OR gold_event.description ILIKE 'Triple'
OR gold_event.description ILIKE 'Home run'
)
GROUP BY
YEAR(gold_game.game_dt)
),
AtBats AS (
SELECT
YEAR(gold_game.game_dt) AS year,
COUNT(*) AS at_bats
FROM
powerbi_demos.baseball.gold_fact_pas
INNER JOIN powerbi_demos.baseball.gold_game ON gold_fact_pas.game_id = gold_game.game_id
INNER JOIN powerbi_demos.baseball.gold_bio ON gold_fact_pas.bat_id = gold_bio.player_id
WHERE
gold_bio.playerName ILIKE 'Reggie Jackson'
GROUP BY
YEAR(gold_game.game_dt)
)
SELECT
Hits.year,
Hits.hits,
AtBats.at_bats,
ROUND((Hits.hits :: FLOAT / AtBats.at_bats), 3) AS batting_average
FROM
Hits
INNER JOIN AtBats ON Hits.year = AtBats.year
ORDER BY
Hits.year
"year","hits","at_bats","batting_average"
1967,21,135,0.156
1968,138,614,0.225
1969,151,678,0.223
1970,101,514,0.196
1971,161,654,0.246
1972,137,592,0.231
1973,170,681,0.25
1974,152,640,0.238
1975,155,681,0.228
1976,138,558,0.247
1977,161,648,0.248
1978,155,626,0.248
1979,138,537,0.257
1980,157,613,0.256
1981,89,422,0.211
1982,148,641,0.231
1983,77,458,0.168
1984,117,584,0.2
1985,116,541,0.214
1986,106,545,0.194
1987,74,374,0.198

Here Genie simply treats all plate appearances as at bats in the denominator, resulting in incorrect batting average calculations.

Technique

We can write an example query that addresses this specific question and save it to the Genie Space. (I modified the player in my example to noted sideburns sporter Don Mattingly just for fun.)

SELECT
YEAR(gold_game.game_dt) AS Year,
gold_bio.playerName,
ROUND(
TRY_DIVIDE(
SUM(
CASE
WHEN gold_event.description ILIKE 'Single'
OR gold_event.description ILIKE 'Double'
OR gold_event.description ILIKE 'Triple'
OR gold_event.description ILIKE 'Home run' THEN 1
ELSE 0
END
),
COUNT(*) - SUM(
CASE
WHEN gold_fact_pas.isSacrificeHit = TRUE
OR gold_fact_pas.isSacrificeFly = TRUE
OR gold_event.description ILIKE '%walk%'
OR gold_event.description ILIKE 'Hit by pitch' THEN 1
ELSE 0
END
)
),
3
) AS BattingAverage
FROM
powerbi_demos.baseball.gold_fact_pas
INNER JOIN powerbi_demos.baseball.gold_event ON gold_fact_pas.event_id = gold_event.event_id
INNER JOIN powerbi_demos.baseball.gold_game ON gold_fact_pas.game_id = gold_game.game_id
INNER JOIN powerbi_demos.baseball.gold_bio ON gold_fact_pas.bat_id = gold_bio.player_id
WHERE
gold_bio.playerName ILIKE 'Don Mattingly'
AND gold_game.game_type = 'Regular Season'
GROUP BY
YEAR(gold_game.game_dt),
gold_bio.playerName
ORDER BY
Year

After

SELECT
YEAR(gold_game.game_dt) AS Year,
gold_bio.playerName,
ROUND(
TRY_DIVIDE(
SUM(
CASE
WHEN gold_event.description ILIKE 'Single'
OR gold_event.description ILIKE 'Double'
OR gold_event.description ILIKE 'Triple'
OR gold_event.description ILIKE 'Home run' THEN 1
ELSE 0
END
),
COUNT(*) - SUM(
CASE
WHEN gold_fact_pas.isSacrificeHit = TRUE
OR gold_fact_pas.isSacrificeFly = TRUE
OR gold_event.description ILIKE '%walk%'
OR gold_event.description ILIKE 'Hit by pitch' THEN 1
ELSE 0
END
)
),
3
) AS BattingAverage
FROM
powerbi_demos.baseball.gold_fact_pas
INNER JOIN powerbi_demos.baseball.gold_event ON gold_fact_pas.event_id = gold_event.event_id
INNER JOIN powerbi_demos.baseball.gold_game ON gold_fact_pas.game_id = gold_game.game_id
INNER JOIN powerbi_demos.baseball.gold_bio ON gold_fact_pas.bat_id = gold_bio.player_id
WHERE
gold_bio.playerName ILIKE 'Reggie Jackson'
AND gold_game.game_type = 'Regular Season'
GROUP BY
YEAR(gold_game.game_dt),
gold_bio.playerName
ORDER BY
Year

Note that Genie simply uses my example query and plugs in Reggie Jackson in the correct spot. Rather than try to come up with its own calculation of the complex definition of batting average, it can rely on the expertise of an analytics engineer.

Technique 7: Materialized Aggregation

Many BI tools provide some form of performance acceleration through either pre-aggregation or managed materialization of certain query results.

Adding intermediate materializations of complex calculations or even just frequently repeated aggregate queries both improves performance and reduces error by simplifying the queries Genie needs to produce.

One nice thing about Genie vs BI tools is these aggregations don’t need to be explicitly connected or declared. Genie can decide on its own which tables are the best for the query at hand.

Question: What pitcher had the most seasons with at least 200 Ks and 15 wins?

WITH PitcherStats AS (
SELECT
gold_fact_games.win_pit_id AS pitcher_id,
YEAR(gold_game.game_dt) AS year,
COUNT(*) FILTER (
WHERE
gold_fact_pas.event_id IN (
SELECT
event_id
FROM
powerbi_demos.baseball.gold_event
WHERE
description ILIKE 'Strikeout'
)
) AS strikeouts,
COUNT(
DISTINCT CASE
WHEN gold_fact_games.win_pit_id IS NOT NULL THEN gold_game.game_id
END
) AS wins
FROM
powerbi_demos.baseball.gold_fact_pas
INNER JOIN powerbi_demos.baseball.gold_game
ON gold_fact_pas.game_id = gold_game.game_id
INNER JOIN powerbi_demos.baseball.gold_fact_games
ON gold_fact_pas.game_id = gold_fact_games.game_id
WHERE
gold_game.game_type = 'Regular Season'
GROUP BY
gold_fact_games.win_pit_id,
YEAR(gold_game.game_dt)
HAVING
COUNT(*) FILTER (
WHERE
gold_fact_pas.event_id IN (
SELECT
event_id
FROM
powerbi_demos.baseball.gold_event
WHERE
description ILIKE 'Strikeout'
)
) >= 200
AND COUNT(
DISTINCT CASE
WHEN gold_fact_games.win_pit_id IS NOT NULL THEN gold_game.game_id
END
) >= 15
),
PitcherSeasons AS (
SELECT
pitcher_id,
COUNT(*) AS seasons_with_200K_15W
FROM
PitcherStats
GROUP BY
pitcher_id
ORDER BY
COUNT(*) DESC
LIMIT
1
)
SELECT
gold_bio.playerName,
PitcherSeasons.seasons_with_200K_15W
FROM
PitcherSeasons
INNER JOIN powerbi_demos.baseball.gold_bio
ON PitcherSeasons.pitcher_id = gold_bio.player_id
"playerName","seasons_with_200K_15W"
"Roger Clemens",12

The query gets the pitcher right, but the season count is wrong. That’s because it attributes all of the strikeouts in the games the pitcher won to the pitcher, even if another pitcher recorded the strikeout.

Ultimately this is a result of Genie having to do too much “cognitive labor” with just the plate appearances table to work with. While it can occasionally produce a correct query, it’s not reliable.

Technique

One way to reduce some of this complexity and reliability is to add a fact table to our data room’s model that pre-aggregates pitchers’ season level statistics.

create or replace table powerbi_demos.baseball.gold_fact_pitcher_seasons 
cluster by (pitcher_team_id) as
select
people.retro_id as pitcher_id,
pitching_seasons.* except (player_id, team_id, lg_id),
p_team.team_key as pitcher_team_id
from powerbi_demos.baseball.pitching_seasons
inner join gold_team p_team on
p_team.team_id = pitching_seasons.team_id and year_id between p_team.minyear and p_team.maxyear
inner join parquet.`/Volumes/powerbi_demos/baseball/raw_zip/baseballdatabank/people.parquet` people
on people.player_id = pitching_seasons.player_id;

After

SELECT
pitcher_id,
COUNT(*) AS seasons_with_200K_15W
FROM
powerbi_demos.baseball.gold_fact_pitcher_seasons
WHERE
so >= 200
AND w >= 15
GROUP BY
pitcher_id
ORDER BY
seasons_with_200K_15W DESC
LIMIT
1
"pitcher_id","seasons_with_200K_15W"
"clemr001",11

By using our aggregated table, this query is considerably simpler for Genie to generate and avoid complexity pitfalls.

This record was achieved after Clemens was temporarily hypnotized into believing he was a chicken.

Conclusion

Producing useful and accurate text-to-SQL chatbots on top of your data is a priority for Databricks and the Genie Spaces team. While we are constantly working to improve the product, hopefully these techniques give you a strong start on doing your part to bring high quality, LLM-friendly data models to your users for analysis.

Feedback always welcome!

--

--

Kyle Hale

Azure Solution Architect at Databricks. I’ve seen things.