If your Organizing Keys suck, maybe you’re not doing it right

As a developer — especially a Data Warehouse developer — there are few things more satisfying than making one tiny change and not only fixing a problem, but also making everything run better than anyone imagined possible. For best results, have your manager watch over your shoulder while you’re doing this.

We’ve all had those moments.
“I changed one line and it went from running 17 hours down to 35 seconds.”

I hear this regularly, and while internally I’m wondering how you managed to screw it up so royally in the first place, outwardly, I’m effervescing like a teenager at a boy-band concert because I know the joy of sharing an improbable victory (which is a polite way of saying we all like to brag).

The rush you get from this kind of monumental tuning success is addictive. It can drive you to all kinds of crazy behaviours to find your next hit, like for instance, reading the manual. Or Googling. Maybe that’s how you got here. If you’re a Netezza developer struggling with Organizing Keys, maybe I have some good news for you.

Let’s talk about Organizing Keys

If you’ve invested any time researching Netezza, you’ll be at least passingly familiar with the myriad features IBM supplies to improve query performance through table design. In case you haven’t done the research — or if you’re confused by all the options — I’ll list them here:

  1. Define Distribution Keys
  2. Define Organizing Keys

This blog is not about Distribution Keys. Sorry if you came for that — another time, perhaps.

Organizing Keys. What are they? If, like me, you came from an Oracle world of Indexes and Partitions, and you’re hoping they’re kinda the same because that would make life easier for you — sorry, they’re not. If, on the other hand, you’re from the SQL Server world of ubiquitous Clustered Indexes, then I have good news: same kinda thing. Sort of.

When you define an Organizing Key on a table, Netezza will physically re-arrange the data in your table so that rows with the same (or similar) key are co-located in the same disk extent. How does this help? Well, in the most common case, you “organize” the table on (say) the TRANSACTION_DATE to get all of the newest rows stored together (you can think of the table as being sorted). Then when you go to query the table for a single day’s data, or a short date range, Netezza can ignore 99% of the disk extents because the Zone maps say they will contain no matching rows.

(Whoa! What’re Zone maps? I thought you said there were only two things?)

Don’t panic. A Zone map is the physical structure Netezza builds under the covers to support your Organizing Key (a bit like Oracle builds an index to support a Primary Key). You also get Zone maps for free (without having to define Organizing Keys) on all Integer and Date/Time columns. Check out my earlier blog for more on Zone maps.

Getting Organized

Here’s some good news: you can create or change an Organizing Key in Netezza without having to drop the table. I’m not going to do that, though, because I’d rather show the progression.

Consider the following table with 100M rows:

CREATE TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_NOKEY
(
AN_INTEGER BIGINT NOT NULL,
A_STRING NATIONAL CHARACTER VARYING(40),
A_LONG_STRING NATIONAL CHARACTER VARYING(400),
EFF_FROM_TIMESTAMP TIMESTAMP NOT NULL,
EFF_TO_TIMESTAMP TIMESTAMP NOT NULL,
A_DECIMAL NUMERIC(22,0)
)
DISTRIBUTE ON RANDOM;

Now because Netezza is such a beast, it’s hard to show a really slow single-table query, but something like the following — without Partition Keys or Organizing Keys of any kind — takes a little over a second to return 16 matching rows:

SELECT *
FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_NOKEY
WHERE an_integer = 1811028242013806246;
/* Duration 1.8564357 sec. */
/* Records Affected: 16. */

Using the SHOW PLANFILE technique from a previous blog, I can also see that every S-Blade contains 580 disk extents, and Netezza scanned every single one of them.

Let’s try that again with an organizing key.

CREATE TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_INTKEY AS
SELECT * FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_NOKEY
DISTRIBUTE ON RANDOM
ORGANIZE ON (AN_INTEGER);
SELECT *
FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_INTKEY
WHERE an_integer = 1811028242013806246;
/* Duration 1.7628339 sec. */
/* Records Affected: 16. */

Okay, so it was, like 5% faster. Big deal.

When I check the SHOW PLANFILE, I see it’s scanned 562 extents per S-Blade this time. So I’ve learned two things:

  • It’s a little bit faster, although not enough to be interesting. It could just be natural variation on a shared appliance.
  • The Organizing Key didn’t seem to change the behaviour of the query.

According to the literature, the Organizing Key was supposed to reduce IO, but it did no such thing. Why?

You’ve got to groom it, groom it

It’s a counter-intuitive behaviour, but when you create an organized table (aka a clustered table) and load it in a single statement, Netezza does not sort the rows. You need to do it yourself with a GROOM statement. Let’s try that.

GROOM TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_INTKEY;
SELECT *
FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_INTKEY
WHERE an_integer = 1811028242013806246;
/* Duration 0.0468009 sec. */
/* Records Affected: 16. */

Alright! Now we’re getting somewhere. That was a 97% performance improvement. We’re in bragging territory now. Checking the SHOW PLANFILE, we see the groomed table got bigger (641 extents per S-Blade), but we read … wait for it … ONE. That’s right, Netezza jumped straight to the extent that contained our matching rows and ignored all the rest.

Sweet! So are we done? Let’s try some different data types and see how they go.

ORGANIZE BY DECIMAL

CREATE TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_DECKEY AS
SELECT * FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_NOKEY
DISTRIBUTE ON RANDOM
ORGANIZE ON (A_DECIMAL);
/* ERROR [HY000] ERROR: Column A_DECIMAL cannot be used in ORGANIZE ON list (not zonemappable) */

Oops. Does that mean we can’t ORGANIZE ON a decimal? Not quite — we can’t organize on a decimal longer than 8 bytes. I greedily defined mine as DECIMAL(22,0). If I start again with DECIMAL(18,0) — an 8-byte integer stores up to 18 digits — and load up the same rows, then I should be fine.

CREATE TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_DECKEY AS
SELECT * FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_NOKEY
DISTRIBUTE ON RANDOM
ORGANIZE ON (A_DECIMAL);
GROOM TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_DECKEY;
SELECT *
FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_DECKEY
WHERE a_decimal = 24920534536637;
/* Duration 0.0468009 sec. */
/* Records Affected: 7. */

You’ll note I had to change the key I was selecting because the old one was 19 digits and I had to exclude it from the table. You’ll also note it’s nice and quick, and once again SHOW PLANFILE tells me it read a single extent from each S-Blade.

Fantastic! Decimals work in Organizing keys, but only up to DECIMAL(18,*). Note that you can use digits after the decimal point, but you can’t exceed 18 digits total.

ORGANIZE BY STRING

IBM tell us we can also ORGANIZE ON character strings. Let’s try that.

CREATE TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_CHARKEY AS
SELECT * FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_NOKEY
DISTRIBUTE ON RANDOM
ORGANIZE ON (A_STRING);
GROOM TABLE ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_CHARKEY;
SELECT *
FROM ROSSLEISHMAN.NETEZZA_ORGKEY_TEST_CHARKEY
WHERE A_STRING = 'ROW000009068980';
/* Duration 0.4992096 sec. */
/* Records Affected: 16. */

Okay, so that’s worrying. It’s 10x slower than the Integer Organizing Key. When we look at the SHOW PLANFILE, we find it scanned 34 extents per S-Blade instead of 1. What is it about a string Organizing Key that, even when the table is sorted, it still had to search through 34 disk extents to find a single value?

Let’s check IBM’s documentation again:

The organizing keys must be columns that can be referenced in zone maps. By default, Netezza creates zone maps for columns of the following data types:
- Integer (1-byte, 2-byte, 4-byte, and 8-byte)
- Date
- Timestamp
In addition, Netezza also creates zone maps for the following data types if columns of this type are used as the ORDER BY restriction for a materialized view or as the organizing key of a CBT:
- Char, all sizes, but only the first 8 bytes are used in the zone map
- Varchar, all sizes, but only the first 8 bytes are used in the zone map
- Nchar, all sizes, but only the first 8 bytes are used in the zone map
- Nvarchar, all sizes, but only the first 8 bytes are used in the zone map
- …

First 8 bytes? Well that sucks. Our search string (ROW000009068980) is 15 characters long. In a simple VARCHAR, that would be 15 bytes, but our data type is NATIONAL CHARACTER VARYING, so depending in the encoding it could be longer still. Say we’re using UTF-8 without any special characters (1 byte per character), the first 8 bytes of our string are “ROW00000”, for which there are nearly 8 million matching rows in my table.

No wonder it had to look in more than one extent!

At this point, you’re probably hoping for one of those “a-ha” moments where I pull out a new command and tune another 97% off it. This time we’re both out of luck. The lesson her is to take care with your Varchar Organizing Keys.

Summary

  • Organizing keys effectively sort your table, allowing Netezza to find individual values or short ranges of integers or dates without scanning every disk extent.
  • When you load a table that has an Organizing Key, it will NOT self-organize — data will simply be loaded on the end. You must Groom the table for the organizing key to take effect.
  • As you add new rows to an existing organized table, you will need to re-groom.
  • It’s possible to ORGANIZE ON a decimal column, but not if its maximum value exceeds 18 digits.
  • It’s possible to ORGANIZE ON a string, but only the first 8 bytes of any string will be taken into account. Strings with poor selectivity in the first 8 bytes make for poor organizing keys.

So, apply your Organizing Key, groom the table, schedule grooms as you add more data, and if you tune that query down from 17 hours to 35 seconds, leave me a nice comment below and I’ll celebrate the win with you.

* * * *

If you thought this was amusing, leave me a “clap”. If you thought it was informative, leave me a few. Thanks for reading, and watch out for more Netezza tuning fun in the DWS+Symplicit blog.