Netezza Zone Maps — They’re The Same As Indexes, Right?
The very first thing you’re likely to hear about Netezza — even before you know how to pronounce it (Neh-tease-uh) — is that it doesn’t have indexes.
Yeah, right! *snort*
That’s like a national highway grid without maps. Surely, good ol’ IBM must be selling us a bit of clickbait, and then when we look closer… Oh yeah, actually there are indexes, except the database manages them internally … except for the ones you create yourself; they’re a bit like the indexes in other databases but instead they’re … well, exactly the same as every other database. Sorry.
Sure enough, you go into SQL Aginity and what do you see when you query the data dictionary? Yep — indexes.
But wait! Look at those table names. What kind of perverted naming standard starts object names with an underscore? (I’m looking at you, C programmers.) Indeed, those are not my Netezza tables; they’re Netezza’s Data Dictionary tables. Don’t quote me on this, because it’s only an educated guess, but I don’t believe Netezza actually uses its own database technology to store its Data Dictionary.
So surprise, surprise, surprise, as Gomer Pyle was once wont to say, the clickbait was accurate. Live and learn. Netezza really doesn’t let you index your data tables. Not even little indexes. Not even ones it sneakily creates behind the scenes with self-tuning wizardry. Just … there aren’t any.
Here be Dragons
Going back to our highway-grid-without-a-map metaphor, what does this mean for Netezza? Are our tables just one vast, unexplored bog, riddled with man-eating beasts? When we ask a simple question like —
SELECT * FROM CUSTOMER WHERE CUST_ID = 1234
—does Netezza march boldly into the unknowable fathoms of the CUSTOMER table, searching for CUST_ID 1234 under every rotted stump and behind every tangle of vines?
Petty much, yeah. Turn back to my earlier blog on Netezza’s Massively Parallel Processing Architecture for a longer discussion on it, but blazingly fast parallel IO means we can drain the swamp pretty darned quick, weakening the argument for indexes. Sure, you can easily construct a scenario where an index would be faster (finding 1 customer in 1 billion rows using a primary key), but even though Netezza’s MPP full-scan approach is slower, it’s not disastrously slow, and so as long as you don’t do that kind of thing inside a tight loop, you’ll hardly notice the lack of indexes.
I know what you’re thinking:
So why doesn’t IBM just add indexes? If at least some queries can be made faster, then isn’t it worth it?
Not in IBM’s opinion. And they have a point. Indexes do bring their own issues, namely:
- The cost of maintaining them upon insert/update/delete;
- The heavyweight architecture that is necessary to keep tables and indexes internally consistent;
- And last but certainly not least, the cost of queries that use indexes when they shouldn’t. It’s beyond the scope of this blog, but high-volume analytic-style queries could run tragically slower if the database wrongly used an index.
I still know what you’re thinking:
Well that’s just stupid, then. Surely there’s a better way.
IBM is way ahead of you there — there is indeed a better way, and it’s called Zone Maps.
What are Zone Maps?
Let’s be really clear about this: Zone Maps are not indexes.
Comparisons without context are not helpful. Zone Maps and Indexes have exactly two things in common:
1. They are both constructs that can make queries run faster.
2. The performance benefit you get will depend greatly on how well you understand them.
Clean-up in Aisle Five
It’s time to shift metaphors. I’m going to move from the uncharted highway map to the supermarket.
Have you ever seen a supermarket with an index? My local has them. At the end of each aisle hangs a little sheaf of pages in plastic sleeves listing every item in the supermarket and where to find it. Nobody uses them, except as a last resort. Can you imagine shopping for breakfast? Eggs: aisle 5; bread: aisle 2; bacon: deli counter; OJ: aisle 4; milk: aisle 12; coffee: back to aisle 4… It’d be lunchtime before you found everything, but at least you’d get fit.
When you’re after a bunch of stuff, most people just find it easier to make a single pass of every aisle, picking up the things they need along the way and ignoring the stuff they don’t need. Sound familiar? That’s right; we just did a full table scan.
But let’s think about this — I don’t go to the deli counter unless I need something there; I don’t go down the toiletries aisle if I’m shopping for breakfast; I don’t go down the health food aisle unless … well, I just don’t go down the health food aisle.
My full table scan of the supermarket isn’t a true full table scan. I’m able to skip bits, and the supermarket helps me out by hanging a sign above each aisle that lists the kinds of things you will find when you walk down there. This is important — it’s not like the plastic-sleeve index at the end because it doesn’t list every item. It gives me just enough information to work out whether maybe something I want will be in there, but then maybe it won’t, meaning I’ll end up walking that aisle in vain. C’est la vie. It’s not a perfect system.
This is the exact principle that makes Zone Maps work. Rather than providing perfect information like an index, Zone Maps provide summary information, which sometimes will save you a scan, and sometimes it won’t.
Highs and Lows
Unlike the supermarket, which lists broad categories of items in its aisle “zone maps”, Netezza captures maximum and minimum values, and instead of by “aisle”, it’s by extent.
Database tables are stored on disk in extents. Each extent is a contiguous 3MB slice of disk, broken up into twenty-four blocks of 128KB each. For each zone-mapped column, Netezza records the maximum and minimum value for that column per extent, so if you’re looking for invoices dated 2016–03–23, then Netezza can ignore any extent with a lower maximum in the Zone Map or a higher minimum.
Q. But 3MB is a big extent. What if they all contain a wide range of values? Won’t you end up searching every extent anyway?
A. Yes, that’s exactly right. Zone maps work best when like-data is clustered together, which is not always possible.
If you think that sounds like a pretty random way to tune your database, you’re right. It doesn’t always work, and if you’re hoping for performance efficiencies from Zone Maps then you’d better start understanding how your data is distributed on disk.
The Ageing Effect
Here’s the best thing about Zone Maps, they tend to work straight out of the box for dates. In fact, IBM is so darned confident, they have Netezza automatically create Zone Maps for every date and timestamp column. The Ageing Effect exploits a common behaviour of dates, whether we’re talking about Order Date, Invoice Date, Shipped Date, Received Date, or Best Before Date of the product, it is almost always the case that the oldest dates are in the oldest extents and the newest dates are on the newest extents. When one extent fills up, a new one is allocated, and it will be filled with rows that contain dates around about today.
Obviously this is not true of all dates; Customer Date of Birth for example, will contain a range of old values even for new rows, but the vast majority of date columns are roughly clustered in the table, oldest to newest. In this way, any query that filters on a short range of dates (or a single date!) will end up scanning only a fraction of a table if it contains years of history.
The Ageing Effect also works for a lot of integer key values if they have been allocated from an increasing sequence. Old extents will contain low-numbered Invoice IDs and newer extents will contain high-numbered Invoice IDs. Again, Netezza creates Zone Maps automatically to exploit this effect, but ONLY if the column is an integer data type (BYTEINT, SMALLINT, INTEGER or BIGINT, but not NUMERIC(n,0)).
Customising your Zone Maps
Obviously not everything will benefit from the Ageing Effect, because not every query is constrained to a small date range. Some potentially large tables, like Customer or Product, don’t even have interesting dates that we might use in filters.
It is possible to choose other types of columns — for example VARCHARs — and Zone Map them to tune common query filters. Strangely enough, there is no CREATE ZONEMAP statement — this is a further clue that they’re not like indexes.
As mentioned above, all date, timestamp, and integer columns will already be zone mapped, but if you want a Zone Map on a VARCHAR or NUMERIC field, then you’re going to need an ORGANIZING KEY, which is a topic for a future blog. Suffice to say that it is not safe to simply go creating Organizing Keys to zone map your favourite columns; they will rearrange your date-oriented table and you may ruin the benefits you’re already getting from the ageing effect.
So What Have We Learned?
Well, it’s a blog, so hopefully you set your expectations pretty low, but if you came out of this understanding that everything you thought you knew about indexes is no help at all in Netezza, then I’ve done my job.
A quick recap:
- Zone Maps are not like indexes;
- Netezza zone maps every date, timestamp and integer column by default;
- A Zone Map stores the column low and high value for each 3MB extent;
- When you filter on a column, Netezza uses Zone Maps to bypass extents with no matching values;
- Zone Maps are at their best when used on Date and ID columns that increase over time;
- Zone mapping other columns (like NUMERIC and VARCHAR) is possible with Organizing Keys, but it’s also easy to make things worse.