7 Lessons Learned migrating dbt code from Snowflake to Trino

Michiel De Muynck
Published in datamindedbe · 9 min read · Jan 30, 2024

Migrating from one data platform to another is never as simple as it seems. In the process of migrating several dbt projects from a Snowflake data warehouse to Starburst (i.e., enterprise Trino), we encountered a number of differences between their SQL dialects that proved either tricky to find or difficult to deal with. In this post, I share 7 of these differences, and how we dealt with them.

1. Replace IFF by IF and PIVOT by …?

Let’s start with a simple one. Some of the SQL functions in Snowflake have different names in Trino. For example, IFF() in Snowflake is called IF() in Trino, NVL() is called COALESCE(), DATEADD() is called DATE_ADD(), etc. These things are easy to find and fix, using e.g. regex replacements.

With some functions, like DECODE(), there are no simple equivalent Trino functions, but it is still relatively easy to implement the same functionality (in the case of DECODE, using CASE WHEN). Tools like sqlglot can also do these transformations automatically.
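For example, a DECODE() call translates to a CASE expression like this (the status column and its values are made up for illustration):

-- Snowflake
SELECT DECODE(status, 'P', 'pending', 'S', 'shipped', 'unknown') AS status_label
FROM orders

-- Trino
-- Caveat: unlike DECODE, a simple CASE never matches NULL values
SELECT
    CASE status
        WHEN 'P' THEN 'pending'
        WHEN 'S' THEN 'shipped'
        ELSE 'unknown'
    END AS status_label
FROM orders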

However, for some SQL functionality, like PIVOT in Snowflake, there is no equivalent functionality in Trino. In general, there is no way to implement pivot in Trino using SQL alone. If you know the values you’re pivoting over, you can implement it manually using aggregation functions (and certain Trino-specific optimized aggregation functions like count_if can make the result quite fast), but the resulting SQL queries can become huge, because you have to repeat the aggregation function for each value in the column to pivot over. Luckily, we could keep this maintainable by using dbt to generate these queries: dbt_utils to the rescue with the pivot() macro!
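To sketch what that looks like (assuming a hypothetical orders model with a status column), dbt_utils.pivot() generates the repetitive conditional aggregations for you:

SELECT
    order_date,
    {{ dbt_utils.pivot(
        'status',
        dbt_utils.get_column_values(ref('orders'), 'status')
    ) }}
FROM {{ ref('orders') }}
GROUP BY order_date

With the default arguments this counts the rows matching each status value; get_column_values() fetches the distinct values at compile time, so they don’t need to be hard-coded.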

2. Implicit type coercion

When you compare two values in Snowflake and one of them has type VARCHAR while the other has type INT, Snowflake casts the varchar to an int. For example, 123 = '123' returns TRUE in Snowflake. This process is called implicit type coercion.

Trino does not do implicit type coercion. In Trino, 123 = '123' raises an error: “Cannot apply operator: varchar(3) = integer”. If you want to compare a varchar to an integer, you have to cast the varchar explicitly using CAST(x AS INTEGER). Likewise, concatenating a string with an integer using '123' || 123 fails with a comparable error.
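A minimal illustration of the difference:

-- Snowflake: implicit coercion makes this comparison return TRUE
SELECT 123 = '123';

-- Trino: the same query fails with the "Cannot apply operator" error
SELECT 123 = '123';

-- Trino: fine once the cast is explicit
SELECT 123 = CAST('123' AS INTEGER);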

While these errors technically make sense, they can feel needlessly strict when you’re coming from Snowflake. Dealing with this difference took quite a bit of work, adding casts and thinking about data types. However, I do not consider this wasted work. The stronger type checking in Trino also has several advantages: the code is more explicit, and you can be more confident that it’s correct and that implicit casts aren’t happening where you don’t expect them. Especially considering the following points:

3. Best wishes for the year 2.024E3!

As we’ve seen in the point above, you can’t concatenate varchars and numbers without an explicit cast. So if you see the following Snowflake SQL

SELECT
    (
        "year" || '-' ||
        "month" || '-' ||
        "day"
    ) AS "date_string"
FROM orders

You might think you can just add a few casts to convert it to Trino:

SELECT
    (
        CAST("year" AS VARCHAR) || '-' ||
        CAST("month" AS VARCHAR) || '-' ||
        CAST("day" AS VARCHAR)
    ) AS "date_string"
FROM orders

While this works if day, month, and year are integer types, if they are DOUBLEs instead, you get the lovely date 2.024E3-1.1E1-1.9E1. Trino always casts DOUBLE to VARCHAR in scientific notation: 123 becomes "1.23E2" and 1 becomes "1.0E0".
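One workaround, sketched here under the assumption that the values are whole numbers that merely happen to be stored as DOUBLE, is to cast to an exact integer type before casting to VARCHAR:

SELECT
    (
        CAST(CAST("year" AS INTEGER) AS VARCHAR) || '-' ||
        CAST(CAST("month" AS INTEGER) AS VARCHAR) || '-' ||
        CAST(CAST("day" AS INTEGER) AS VARCHAR)
    ) AS "date_string"
FROM orders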

Of course the real solution here is to not store these types as doubles in the first place. Nevertheless, this happened quite a lot in the legacy Snowflake code, perhaps partly because Snowflake allows you to play fast-and-loose with numeric types.

4. Referring to calculated columns

If you have a calculated column like foo in this query

SELECT
    SOME_LONG_CALCULATION(...) AS foo
FROM orders

then in Snowflake you can refer to this calculated column in other parts of your query, such as other select expressions, WHERE clauses, and GROUP BY clauses. For example, in Snowflake you can do this:

SELECT
    SOME_LONG_CALCULATION(...) AS foo,
    foo*123456 AS bar
FROM orders
WHERE foo > 10

In Trino, you cannot do this. The “equivalent” query would be

SELECT
    SOME_LONG_CALCULATION(...) AS foo,
    SOME_LONG_CALCULATION(...)*123456 AS bar
FROM orders
WHERE SOME_LONG_CALCULATION(...) > 10

As you can see, this quickly becomes cumbersome and hard to maintain. One possible solution is to calculate foo in a CTE like this:

WITH calculate_foo AS (
    SELECT
        orders.id AS order_id,
        SOME_LONG_CALCULATION(...) AS foo
    FROM orders
)
SELECT
    foo,
    foo*123456 AS bar
FROM orders
JOIN calculate_foo ON orders.id = calculate_foo.order_id
WHERE foo > 10

This works fine in simple cases, but there are a few cases where this gets tricky:

  • If you have multiple such calculated fields that depend on each other, you need multiple CTEs, and everything gets quite complex.
  • When this happens in subqueries or other CTEs rather than in the final expression, Trino can fail to compile the query with the error “given correlated subquery is not supported”.
  • The most painful case was a subtle one, because the query still runs and looks like it does the same thing: if the calculation (SOME_LONG_CALCULATION() above) involves window functions, the granularity of the window is different! The WHERE clause filters only the final SELECT, not the CTE, so the window function sees rows that the original query would have filtered out. This can result in incorrect query output.

Using dbt, it is possible to implement a better solution: use set to initialize a dbt variable foo with the definition of the long expression, and use that dbt variable everywhere, like this:

{% set foo %} SOME_LONG_CALCULATION(...) {% endset %}

SELECT
    {{ foo }} AS foo,
    {{ foo }}*123456 AS bar
FROM orders
WHERE {{ foo }} > 10

5. Where do NULLs go when sorting?

If you sort a result using ORDER BY foo, and some values of foo are NULL, where do they go? First? Last? Depends? Snowflake and Trino disagree on this one.

Snowflake considers NULLs as the “greatest” possible value: when you sort in ascending order it puts the NULLs last, when you sort in descending order, it puts the NULLs first.

Trino on the other hand always puts NULLs last, whether you sort in ascending or descending order.

Who’s right? The ANSI SQL standard doesn’t specify what should happen here, so in some sense neither is right or wrong, but in my opinion, it “feels” wrong that in Trino ascending order is not the opposite of descending order.

We now make it explicit: just always specify “NULLS FIRST” or “NULLS LAST” in queries where a column that can contain NULLs is used in an ORDER BY. Then there is no possible confusion.
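For example (customers being a hypothetical table), this query sorts identically on both engines, with the NULLs at the end:

SELECT customer_id, last_order_date
FROM customers
ORDER BY last_order_date DESC NULLS LAST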

6. Default window of some window functions

Consider the following query:

SELECT
    LAST_VALUE(delivery_address) IGNORE NULLS OVER (
        PARTITION BY customer_id
        ORDER BY order_time ASC
    ) AS latest_address
FROM orders

Which address should latest_address contain?

  • A) The most recent delivery address of this customer at the time of the current row’s order?
  • B) The most recent delivery address of this customer in the entire orders table?

There is a right and a wrong answer here! The correct answer, according to the ANSI SQL standard, is (…drumrolls…) A.

More generally, for window functions like FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE(), if you don’t explicitly specify a window, the window used should be RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Trino correctly implements this.

Snowflake, on the other hand, does not. It uses the default window ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING if you don’t specify a window. As a result, Snowflake returns B instead of A for the query above. In other words, Snowflake does not follow the ANSI SQL standard here.

To be fair to Snowflake, the fact that this deviates from the standard is documented. It’s not a bug, it’s a feature? I admit that this window is more likely to be what you want, so perhaps Snowflake’s default window is what the ANSI SQL standard should’ve been. But it isn’t. And selectively following a standard makes the standard significantly less useful.

As a result, we now always explicitly specify a window when we use LAST_VALUE() and its companions:

SELECT
    LAST_VALUE(delivery_address) IGNORE NULLS OVER (
        PARTITION BY customer_id
        ORDER BY order_time ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS latest_address
FROM orders

7. Which day of the week is Sunday?

The last one is in some sense the “smallest”, but it took quite a while to find because it looks so innocent.

In Snowflake there is a function DAYOFWEEK() which returns the day of the week. In Trino, there is also a function DAY_OF_WEEK() which also returns the day of the week. So when migrating dbt code from one to the other, you’d think it’s as simple as replacing DAYOFWEEK(…) by DAY_OF_WEEK(…).

As you might have guessed, it isn’t. The difference between the two is that DAYOFWEEK() in Snowflake returns 0 for Sunday (to be precise: the actual value depends on the WEEK_START parameter), whereas DAY_OF_WEEK() in Trino returns 7 for Sunday. For all the other days of the week, the two functions return the same value.

The Snowflake function that actually matches Trino’s DAY_OF_WEEK() is DAYOFWEEKISO(), not DAYOFWEEK(). That’s obvious once you know it, but as part of a larger SQL query it doesn’t stand out. Even more insidiously, if you test with a few sample values (e.g. using dbt_unit_testing) and you’re unlucky enough that none of them fall on a Sunday, the two functions produce exactly the same output.
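A quick sanity check, using a date that falls on a Sunday:

-- Snowflake: ISO numbering (Monday = 1, ..., Sunday = 7),
-- independent of the WEEK_START parameter
SELECT DAYOFWEEKISO('2024-01-28'::DATE);  -- returns 7

-- Trino: also ISO numbering
SELECT DAY_OF_WEEK(DATE '2024-01-28');    -- returns 7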

Bonus: Gapless date spines

This last one is technically not a difference between Snowflake and Trino but it caused enough hair-pulling that I’m including it anyway. Some of our Snowflake code generates a sequence of dates — a “date spine” — like this:

SELECT
    DATEADD(DAY, SEQ4(), TO_DATE('01-01-2019', 'dd-MM-yyyy')) AS "date_day"
FROM TABLE(GENERATOR(ROWCOUNT => 365))

Trino does not have a function like SEQ4() to generate sequences, so we had to look for a different solution. Again, dbt_utils comes to the rescue with the macro date_spine. It can generate dates with dbt code that looks like this:

{{
    dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2019-01-01' as date)",
        end_date="cast('2020-01-01' as date)"
    )
}}

If you run the above queries, they probably give the same output. But if they are part of a bigger query, suddenly the output may — or may not — be different. In our case, the difference depended on the Snowflake warehouse where the query was run. Running the exact same query on a small Snowflake warehouse gave different results than running it on a medium warehouse.

The culprit is the function SEQ4. If you go to its documentation page, a big red warning box explains:

Important

This function uses sequences to produce a unique set of increasing integers but does not necessarily produce a gap-free sequence. When operating on a large quantity of data, gaps can appear in a sequence. If a fully ordered, gap-free sequence is required, consider using the ROW_NUMBER window function.

In other words, the function SEQ4 may return the sequence 1, 2, 3, 4 or it may return 1, 2, 17, 18. The tricky thing is that SEQ4 typically doesn’t produce gaps in small queries or small warehouses, which is typically what you’re using to “debug” a query. So while you’re debugging, SEQ4 neatly returns 1, 2, 3, and 4 as expected. The gaps only start appearing when running as part of a larger query on a larger warehouse, which makes finding the cause of the difference all the more difficult.

So beware: if you need a date spine, use ROW_NUMBER as the Snowflake documentation suggests, or better yet, use the dbt_utils.date_spine macro.
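For reference, a gap-free version of the Snowflake query from above, along the lines the documentation suggests:

SELECT
    DATEADD(
        DAY,
        -- ROW_NUMBER() starts at 1, so subtract 1 to start at the given date
        ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1,
        TO_DATE('01-01-2019', 'DD-MM-YYYY')
    ) AS "date_day"
FROM TABLE(GENERATOR(ROWCOUNT => 365))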

Conclusion

This post highlighted the differences between the Snowflake and Trino SQL dialects that took us the most time to find and/or fix, and how we dealt with them. Ultimately, none of the differences we encountered proved impossible to overcome: all data products were migrated and are now running successfully in production on Trino.

If you have experience with either of these two SQL engines and have struggled with any of the above differences, or with an 8th difference (or 9th, 10th, etc.), feel free to share!

Michiel De Muynck is a Data Engineer at Data Minded with a lifelong passion for software engineering, programming, and computer science.