Netezza’s Evolution From Warehouse to Lakehouse With watsonx.data

Mike DeRoy
5 min read · Jul 24, 2023


A year ago I wrote an article describing how Netezza could access parquet data from S3. This was only the beginning of our vision to bring Netezza into the lakehouse, and I’m excited to provide an update on how that vision is unfolding.

The Lakehouse

This year at Think 2023, IBM announced the new watsonx family of products, and a key piece of this is watsonx.data, a lakehouse platform backed by open source technology. You may ask, “What is a lakehouse platform?” A lakehouse platform is simply a centralized database catalog (referred to as “the metastore”), plus object storage and compute engines. The metastore provides a catalog of all the schemas and tables in the object storage, allowing you to organize the data as you would with a traditional data warehouse. In the lakehouse, different compute engines like Spark and Presto access the same catalog, so updates made to a table by one engine are immediately visible to the others. How can two different engines access the same data? It’s made possible by sharing an open table format that all engines know how to read. In watsonx.data, tables are created using the Apache Iceberg open source table format, allowing any compatible engine to access the data and preventing you from being locked into any specific engine.

Netezza Support For Lakehouses

Connecting Netezza to a Datalake

As I alluded to earlier, support for parquet on S3 was just the beginning of our vision to bring Netezza into the lakehouse. IBM is bringing first-class lakehouse integration into the Netezza engine, allowing you to query Iceberg tables from both the watsonx.data platform and other datalake platforms. To access a datalake from within Netezza, you simply create a new ‘Datalake Database’ by specifying the metastore type and URI, along with any required S3 connection information.

Example SQL for creating a database in Netezza that references an external datalake
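The original screenshot isn’t reproduced here, but such a statement might look roughly like the sketch below. The exact keywords and option names are assumptions, and MYLAKE, the metastore URI, and the S3 values are placeholders — consult the Netezza documentation for the actual syntax.

```sql
-- Hypothetical syntax: register an external datalake as a Netezza database.
-- All option names and values below are illustrative placeholders.
CREATE DATABASE MYLAKE
    TYPE DATALAKE
    METASTORE_TYPE 'hive'
    METASTORE_URI  'thrift://metastore.example.com:9083'
    S3_ENDPOINT    's3.us-east-1.amazonaws.com'
    S3_BUCKET      'my-lakehouse-bucket'
    S3_ACCESS_KEY_ID     '<access-key>'
    S3_SECRET_ACCESS_KEY '<secret-key>';
```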

If you connect to the database, you’ll see that schemas and tables have been automatically added to the datalake database. Thanks to this synchronization of catalog entries from the metastore, you no longer need to execute a ‘CREATE EXTERNAL TABLE’ statement for each table you wish to access from the datalake.

Schemas from the datalake become automatically visible from Netezza

Note that Netezza supports two levels of organization, grouping tables in schemas and schemas in databases, while the Hive metastore has only one level of organization. Although the Hive metastore uses the terms ‘schema’ and ‘database’ synonymously for this single level of organization, Netezza will always refer to them as schemas. Also note that DEFINITION_SCHEMA and INFORMATION_SCHEMA are reserved for internal Netezza use, so any schemas in the datalake with these names will not be synchronized. The same is true of NETEZZA_SCHEMA, which is Netezza’s default schema when connecting to the datalake database.

Tables from the datalake become automatically visible from Netezza
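For instance, you might browse the synchronized entries through Netezza’s catalog views — a sketch only; the view and column names below follow Netezza’s usual `_V_TABLE` convention, but verify them against your system’s catalog documentation:

```sql
-- List the tables synchronized from the metastore into this
-- datalake database, grouped by their schema.
SELECT SCHEMA, TABLENAME
  FROM _V_TABLE
 ORDER BY SCHEMA, TABLENAME;
```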

Querying Data From A Datalake

Queries can be run against tables hosted in the lakehouse just as easily as against local Netezza tables. Below is an example TPC-DS query which has been modified to return 5 rows. The scan of the table is spread evenly across all of our compute processes, and since the parquet files backing the Iceberg table are columnar, we only download and read the data for the columns that the query references.

Querying tables from the datalake is as simple as querying Netezza native tables
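The screenshot isn’t reproduced here, but a query of this shape illustrates the idea — the database and schema names follow the article’s examples, and the specific columns chosen are just for illustration:

```sql
-- Query an Iceberg table in the datalake exactly like a native table.
-- Only the parquet data for the referenced columns is downloaded.
SELECT S_STORE_ID, S_STORE_NAME, S_NUMBER_EMPLOYEES
  FROM MYLAKE.TPCDS.STORE
 ORDER BY S_STORE_ID
 LIMIT 5;
```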

Ingesting Data From A Datalake

For improved performance, you may want to load data from the datalake into Netezza native tables. Native tables cannot be created in a datalake database, but a simple cross-database CTAS (CREATE TABLE AS SELECT) operation allows you to easily ingest data into your desired database and schema.

Loading data into Netezza from the datalake can be performed with a simple CTAS query that references the datalake table by its fully qualified name

You can also ingest specific columns to reduce the amount of data you need to download into the engine.

Netezza will only read from the columns referenced in the query.
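Both forms of the ingest might look roughly like this — the database, schema, and column names are placeholders taken from the article’s examples, not verified syntax:

```sql
-- Ingest an entire datalake table into a native Netezza table via CTAS.
CREATE TABLE NZDB.TPCDS.STORE AS
SELECT * FROM MYLAKE.TPCDS.STORE;

-- Or ingest only the columns you need; only those columns' parquet
-- data is read from object storage.
CREATE TABLE NZDB.TPCDS.STORE_SLIM AS
SELECT S_STORE_ID, S_STORE_NAME
  FROM MYLAKE.TPCDS.STORE;
```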

Performing Analysis With Both Local Data and Data From The Lake

Local data stored in Netezza and remote data in the datalake can be combined in a single cross-database query. Notice in the example below we’re running the same modified TPC-DS query we ran earlier, but it’s referencing the ‘STORE’ table we loaded into a Netezza native table.

This example runs the same TPC-DS query we performed against the datalake, but uses the STORE table that we’ve loaded locally, referencing it by its fully qualified name ‘NZDB.TPCDS.STORE’

A great example that takes advantage of this functionality is weather data. I could have all my sales information for the year loaded into Netezza, and then join against a weather table in the datalake to see what sells best when it rains or snows.

We’re able to query our local SALES_2023 Netezza table and join it against a weather data table in the datalake by providing its fully qualified name ‘MYLAKE.SHARED.WEATHER’
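A sketch of such a join, using the table names from the article — the column names (SALE_DATE, AMOUNT, CONDITION, OBSERVATION_DATE) are invented for illustration:

```sql
-- Join the local 2023 sales table against shared weather data
-- in the datalake to see what sells when it rains or snows.
SELECT w.CONDITION, SUM(s.AMOUNT) AS TOTAL_SALES
  FROM SALES_2023 s
  JOIN MYLAKE.SHARED.WEATHER w
    ON s.SALE_DATE = w.OBSERVATION_DATE
 WHERE w.CONDITION IN ('RAIN', 'SNOW')
 GROUP BY w.CONDITION;
```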

Now that I’ve done some analysis on this year’s data, let’s run the same experiment on last year’s data. With connectivity to the datalake, you can keep the most recent (and most frequently accessed) data on your Netezza system for optimal performance, while offloading infrequently accessed historical data into the lake. Querying this data is as simple as running the same query from Netezza, but referencing the SALES_2022 table stored in the datalake.

We’re able to run the exact same query against our 2022 sales data which has been uploaded to the datalake and no longer lives on the Netezza system.
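Under the same assumptions as the previous sketch (invented column names, the article’s table names), the historical variant only changes the table reference:

```sql
-- Same analysis, but against the 2022 sales data that now lives
-- in the datalake instead of on the Netezza system.
SELECT w.CONDITION, SUM(s.AMOUNT) AS TOTAL_SALES
  FROM MYLAKE.SHARED.SALES_2022 s
  JOIN MYLAKE.SHARED.WEATHER w
    ON s.SALE_DATE = w.OBSERVATION_DATE
 GROUP BY w.CONDITION;
```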

We can even query our historical data together with our current data stored on Netezza by performing a ‘UNION ALL’ across our local sales table and our historical sales table stored in the lake.

The sales data from 2022 (in the datalake) and 2023 (in Netezza) can be queried together by performing a ‘UNION ALL’ in the query. In this example I’ve also highlighted our ability to create temporary tables referencing both datalake and Netezza tables to facilitate such queries.
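A sketch of the temporary-table approach, with the same invented column names as above:

```sql
-- Combine current (local) and historical (datalake) sales in a
-- temporary table, then analyze both years together.
CREATE TEMP TABLE ALL_SALES AS
SELECT SALE_DATE, AMOUNT FROM SALES_2023
UNION ALL
SELECT SALE_DATE, AMOUNT FROM MYLAKE.SHARED.SALES_2022;

SELECT EXTRACT(YEAR FROM SALE_DATE) AS SALE_YEAR,
       SUM(AMOUNT) AS TOTAL_SALES
  FROM ALL_SALES
 GROUP BY 1;
```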

Interoperability Between Engines In The Datalake

Now let’s highlight how different compute engines can interact within a lakehouse. watsonx.data comes with the IBM Presto engine, so let’s use Presto to create a table and load some data.

Example SQL for creating a table in the IBM Presto database and inserting some tuples.
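The Presto side might look like the sketch below. The `iceberg_data` catalog name, schema, and columns are placeholders chosen to match the article’s later examples, not the exact statements from the screenshot:

```sql
-- In Presto: create an Iceberg table in the shared schema,
-- then append rows with three separate INSERT statements.
CREATE TABLE iceberg_data.shared.weather (
    observation_date DATE,
    condition        VARCHAR
);

INSERT INTO iceberg_data.shared.weather VALUES (DATE '2023-07-01', 'RAIN');
INSERT INTO iceberg_data.shared.weather VALUES (DATE '2023-07-02', 'SNOW');
INSERT INTO iceberg_data.shared.weather VALUES (DATE '2023-07-03', 'SUN');
```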

Notice that I’ve run three insert statements. With the Apache Iceberg format, each insert results in an additional parquet file on S3. All of these files together make up the Iceberg table (this is a simple example; things get more complicated when alters and deletes are involved).

When appending to an Iceberg table, existing parquet data files are not modified; instead, new files containing the newly inserted tuples are added to the table.

Now that the data is loaded with Presto, the table automatically becomes visible to Netezza and can be queried using the Netezza engine.

Netezza is able to see and read from tables written to the datalake by other database engines like Presto
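From the Netezza side, no extra setup is needed — the same table created by Presto is simply there under the datalake database (table names as in the earlier examples):

```sql
-- In Netezza: query the table Presto just created and loaded.
SELECT * FROM MYLAKE.SHARED.WEATHER;
```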


Netezza’s new functionality makes accessing tables in the datalake as simple as running a single ‘CREATE DATABASE’ SQL statement and opens a whole new world of possibilities for our users. This functionality is now available in Netezza on Cloud, which was just made available on AWS. For more information on lakehouses check out this link, and make sure to check out watsonx.data for your lakehouse needs.

More Netezza Blogs

If you haven’t yet read it, check out last year’s article on Netezza’s support for parquet. Also check out these Netezza articles written by some of my colleagues.




Mike DeRoy

Software Engineer @IBM (Netezza Performance Server), Former Unity VM Ninja; All Opinions Are My Own.