Netezza’s Evolution From Warehouse to Lakehouse With watsonx.data

Mike DeRoy
5 min read · Jul 24, 2023


A year ago I wrote an article describing how Netezza could access parquet data from S3. This was only the beginning of our vision to bring Netezza into the lakehouse, and I’m excited to provide an update on how that vision is unfolding.

The Lakehouse

This year at Think 2023, IBM announced the new watsonx family of products, and a key piece of this is watsonx.data, a lakehouse platform backed by open source technology. You may ask, “What is a lakehouse platform?” A lakehouse platform is simply a centralized database catalog (referred to as “the metastore”), plus object storage and compute engines. The metastore provides a catalog of all the schemas and tables in the object storage, allowing you to organize the data as you would with a traditional data warehouse. In the lakehouse, different compute engines like Spark and Presto access the same catalog, so updates made to a table by one engine are immediately visible to the others. How can two different engines access the same data? It’s made possible by sharing an open table format that all engines know how to read. In watsonx.data, tables are created using the Apache Iceberg open source table format, allowing any compatible engine to access the data and preventing you from being locked into any specific engine.

Netezza Support For Lakehouses

Connecting Netezza to a Datalake

As I alluded to earlier, support for parquet on S3 was just the beginning of our vision to bring Netezza into the lakehouse. IBM is bringing first-class lakehouse integration into the Netezza engine, allowing you to query Iceberg tables from both the watsonx.data platform and other datalake platforms. To access a datalake from within Netezza, you simply create a new ‘Datalake Database’ by specifying the metastore type and URI, along with any required S3 connection information.

Example SQL for creating a database in Netezza that references an external datalake
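The original screenshot isn’t reproduced here, but such a statement might look roughly like the sketch below. The exact keywords and option names are assumptions, and MYLAKE, the metastore URI, and the S3 values are placeholders — consult the Netezza documentation for the actual syntax.

```sql
-- Hypothetical syntax: register an external datalake as a Netezza database.
-- All option names and values below are illustrative placeholders.
CREATE DATABASE MYLAKE
    TYPE DATALAKE
    METASTORE_TYPE 'hive'
    METASTORE_URI  'thrift://metastore.example.com:9083'
    S3_ENDPOINT    's3.us-east-1.amazonaws.com'
    S3_BUCKET      'my-lakehouse-bucket'
    S3_ACCESS_KEY_ID     '<access-key>'
    S3_SECRET_ACCESS_KEY '<secret-key>';
```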

If you connect to the database, you’ll see that schemas and tables have been automatically added to the datalake database. Thanks to this synchronization of catalog entries from the metastore, you no longer need to execute a ‘CREATE EXTERNAL TABLE’ statement for each table you wish to access from the datalake.

Schemas from the datalake become automatically visible from Netezza

Note that Netezza supports two levels of organization, grouping tables in schemas and schemas in databases, while the Hive metastore has only one level of organization. Although the Hive metastore uses the terms ‘schema’ and ‘database’ synonymously for this single level of organization, Netezza will always refer to them as schemas. Also note that DEFINITION_SCHEMA and INFORMATION_SCHEMA are reserved for internal Netezza use, so any schemas in the datalake with these names will not be synchronized. The same is true of NETEZZA_SCHEMA, which is Netezza’s default schema when connecting to the datalake database.

Tables from the datalake become automatically visible from Netezza
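For instance, you might browse the synchronized entries through Netezza’s catalog views — a sketch only; the view and column names below follow Netezza’s usual `_V_TABLE` convention, but verify them against your system’s catalog documentation:

```sql
-- List the tables synchronized from the metastore into this
-- datalake database, grouped by their schema.
SELECT SCHEMA, TABLENAME
  FROM _V_TABLE
 ORDER BY SCHEMA, TABLENAME;
```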

Querying Data From A Datalake

Queries can be run against tables hosted in the lakehouse just as easily as against local Netezza tables. Below is an example TPC-DS query which has been modified to return 5 rows. The scan of the table is spread evenly across all of our compute processes, and since the parquet files backing the Iceberg table are columnar, we only download and read the data for the columns that the query references.

Querying tables from the datalake is as simple as querying Netezza native tables
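The screenshot isn’t reproduced here, but a query of this shape illustrates the idea — the database and schema names follow the article’s examples, and the specific columns chosen are just for illustration:

```sql
-- Query an Iceberg table in the datalake exactly like a native table.
-- Only the parquet data for the referenced columns is downloaded.
SELECT S_STORE_ID, S_STORE_NAME, S_NUMBER_EMPLOYEES
  FROM MYLAKE.TPCDS.STORE
 ORDER BY S_STORE_ID
 LIMIT 5;
```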

Ingesting Data From A Datalake

For improved performance, you may want to load data from the datalake into Netezza native tables. Native tables cannot be created in a datalake database, but a simple cross-database CTAS (CREATE TABLE AS SELECT) operation allows you to easily ingest data into your desired database and schema.

Loading data into Netezza from the datalake can be performed with a simple CTAS query that references the datalake table by its fully qualified name

You can also ingest specific columns to reduce the amount of data you need to download into the engine.

Netezza will only read from the columns referenced in the query.
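Both forms of the ingest might look roughly like this — the database, schema, and column names are placeholders taken from the article’s examples, not verified syntax:

```sql
-- Ingest an entire datalake table into a native Netezza table via CTAS.
CREATE TABLE NZDB.TPCDS.STORE AS
SELECT * FROM MYLAKE.TPCDS.STORE;

-- Or ingest only the columns you need; only those columns' parquet
-- data is read from object storage.
CREATE TABLE NZDB.TPCDS.STORE_SLIM AS
SELECT S_STORE_ID, S_STORE_NAME
  FROM MYLAKE.TPCDS.STORE;
```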

Performing Analysis With Both Local Data and Data From The Lake

Local data stored in Netezza and remote data in the datalake can be combined in a single cross-database query. Notice in the example below we’re running the same modified TPC-DS query we ran earlier, but it’s referencing the ‘STORE’ table we loaded into a Netezza native table.

This example runs the same TPC-DS query we performed against the datalake, but uses the STORE table that we’ve loaded locally, referencing it by its fully qualified name ‘NZDB.TPCDS.STORE’

A great example that takes advantage of this functionality is weather data. I could have all my sales information for the year loaded into Netezza, and then join against a weather table in the datalake to see what sells best when it rains or snows.

We’re able to query our local SALES_2023 Netezza table and join it against a weather data table in the datalake by providing its fully qualified name ‘MYLAKE.SHARED.WEATHER’
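A sketch of such a join, using the table names from the article — the column names (SALE_DATE, AMOUNT, CONDITION, OBSERVATION_DATE) are invented for illustration:

```sql
-- Join the local 2023 sales table against shared weather data
-- in the datalake to see what sells when it rains or snows.
SELECT w.CONDITION, SUM(s.AMOUNT) AS TOTAL_SALES
  FROM SALES_2023 s
  JOIN MYLAKE.SHARED.WEATHER w
    ON s.SALE_DATE = w.OBSERVATION_DATE
 WHERE w.CONDITION IN ('RAIN', 'SNOW')
 GROUP BY w.CONDITION;
```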

Now that I’ve done some analysis on this year’s data, let’s run the same experiment on last year’s data. With connectivity to the datalake, you can keep the most recent (and most frequently accessed) data on your Netezza system for optimal performance, while offloading infrequently accessed historical data into the lake. Querying this data is as simple as running the same query from Netezza, but referencing the SALES_2022 table stored in the datalake.

We’re able to run the exact same query against our 2022 sales data which has been uploaded to the datalake and no longer lives on the Netezza system.
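Under the same assumptions as the previous sketch (invented column names, the article’s table names), the historical variant only changes the table reference:

```sql
-- Same analysis, but against the 2022 sales data that now lives
-- in the datalake instead of on the Netezza system.
SELECT w.CONDITION, SUM(s.AMOUNT) AS TOTAL_SALES
  FROM MYLAKE.SHARED.SALES_2022 s
  JOIN MYLAKE.SHARED.WEATHER w
    ON s.SALE_DATE = w.OBSERVATION_DATE
 GROUP BY w.CONDITION;
```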

We can even query our historical data together with our current data stored on Netezza by performing a ‘UNION ALL’ across our local sales table and our historical sales table stored in the lake.

The sales data from 2022 (in the datalake) and 2023 (in Netezza) can be queried together by performing a ‘UNION ALL’ in the query. In this example I’ve also highlighted our ability to create temporary tables referencing both datalake and Netezza tables to facilitate such queries.
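A sketch of the temporary-table approach, with the same invented column names as above:

```sql
-- Combine current (local) and historical (datalake) sales in a
-- temporary table, then analyze both years together.
CREATE TEMP TABLE ALL_SALES AS
SELECT SALE_DATE, AMOUNT FROM SALES_2023
UNION ALL
SELECT SALE_DATE, AMOUNT FROM MYLAKE.SHARED.SALES_2022;

SELECT EXTRACT(YEAR FROM SALE_DATE) AS SALE_YEAR,
       SUM(AMOUNT) AS TOTAL_SALES
  FROM ALL_SALES
 GROUP BY 1;
```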

Interoperability Between Engines In The Datalake

Now let’s highlight how different compute engines can interact within a lakehouse. watsonx.data comes with the IBM Presto engine, so let’s use Presto to create a table and load some data.

Example SQL for creating a table in the IBM Presto database and inserting some tuples.
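The Presto side might look like the sketch below. The `iceberg_data` catalog name, schema, and columns are placeholders chosen to match the article’s later examples, not the exact statements from the screenshot:

```sql
-- In Presto: create an Iceberg table in the shared schema,
-- then append rows with three separate INSERT statements.
CREATE TABLE iceberg_data.shared.weather (
    observation_date DATE,
    condition        VARCHAR
);

INSERT INTO iceberg_data.shared.weather VALUES (DATE '2023-07-01', 'RAIN');
INSERT INTO iceberg_data.shared.weather VALUES (DATE '2023-07-02', 'SNOW');
INSERT INTO iceberg_data.shared.weather VALUES (DATE '2023-07-03', 'SUN');
```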

Notice that I’ve run three insert statements. With the Apache Iceberg format, each insert results in an additional parquet file on S3. All of these files together make up the Iceberg table (this is a simple example; things get more complicated when alters and deletes are involved).

When appending to an Iceberg table, existing parquet data files are not modified; instead, new files containing the newly inserted tuples are added to the table.

Now that the data is loaded with Presto, the table automatically becomes visible to Netezza and can be queried using the Netezza engine.

Netezza is able to see and read from tables written to the datalake by other database engines like Presto
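From the Netezza side, no extra setup is needed — the same table created by Presto is simply there under the datalake database (table names as in the earlier examples):

```sql
-- In Netezza: query the table Presto just created and loaded.
SELECT * FROM MYLAKE.SHARED.WEATHER;
```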


Netezza’s new functionality makes accessing tables in the datalake as simple as running a single ‘CREATE DATABASE’ SQL statement and opens a whole new world of possibilities for our users. This functionality is now available in Netezza on Cloud, which was just made available on AWS. For more information on lakehouses check out this link, and make sure to check out watsonx.data for your lakehouse needs.

More Netezza Blogs

If you haven’t yet read it, check out last year’s article on Netezza’s support for parquet. Also check out these Netezza articles written by some of my colleagues.




Mike DeRoy

Software Engineer @IBM (Netezza Performance Server), Former Unity VM Ninja; All Opinions Are My Own.