Select partitioning and clustering keys for BigQuery tables when migrating from Netezza

Anna Epishova
Google Cloud - Community
9 min read · Jun 5, 2020

When migrating a data warehouse to BigQuery, one of the key tasks is selecting a partitioning and clustering strategy for your tables. It’s important to settle on a strategy at the very beginning, while your tables are still small, because changing the partitioning or clustering schema effectively requires rebuilding the tables.

To add partitioning to a table, you will need to recreate it. Although adding clustering doesn’t require rebuilding the table, you may still want to consider recreating it: in BigQuery, automatic re-clustering works only on data inserted after clustering has been applied to the table, so data that existed in the table before it was clustered is left untouched by the automatic re-clustering process.
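
As a minimal sketch of such a rebuild in BigQuery DDL (the dataset, table, and column names here are hypothetical placeholders), you can recreate an existing table with a partitioning and clustering specification in one statement:

-- Recreate an unpartitioned table with partitioning and clustering applied,
-- so that all existing rows are partitioned and clustered from the start.
CREATE TABLE mydataset.sales_partitioned
PARTITION BY DATE(sale_ts)
CLUSTER BY region
AS
SELECT * FROM mydataset.sales;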

Since Netezza doesn’t have such concepts as partitioning, clustering, or indexing, it’s not possible to automatically determine a corresponding partitioning and clustering schema in BigQuery.

Note that Netezza distribution columns are not good candidates for partitioning or clustering keys in BQ because they serve a different purpose. The main goal of Netezza distribution is to spread data uniformly across nodes to maximize parallel query execution; it does nothing for data pruning. In contrast, BQ partitioning and clustering help to reduce query cost by reducing the number of scanned bytes.

So, good key candidates should have the following characteristics:

  • They should be used often in WHERE/HAVING clauses to filter rows.
  • They should not be frequently updated.
  • A partitioning key should be a DATE/TIMESTAMP or INTEGER column. A clustering key can also have one of BOOL, NUMERIC, CHAR, VARCHAR, NCHAR, or NVARCHAR data types.
  • They should have at least two unique values, otherwise there is nothing to prune. Note that it’s possible to combine multiple unique values into a single partition by using an integer partitioning key that defines a range of values, or by truncating a date key to month/year using DATE_TRUNC() (see the DDL sketch after this list). In BigQuery, the number of partitions per partitioned table cannot exceed 4,000.
  • BigQuery stores data in files with a fixed minimum size, and this size may change with a new release; 512 MB is a suitable estimate. So, each BQ partition/cluster should hold at least 512 MB.
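
To illustrate the last two points, here is a minimal BigQuery DDL sketch of merging many key values into coarser partitions; all dataset, table, and column names are hypothetical placeholders:

-- Monthly partitions on a DATE column: all dates in a month share one partition.
CREATE TABLE mydataset.orders_by_month
(
  order_id INT64,
  order_date DATE,
  customer_id INT64
)
PARTITION BY DATE_TRUNC(order_date, MONTH)
CLUSTER BY customer_id;

-- Integer range partitions: each bucket of 100,000 ids forms one partition,
-- keeping the partition count (100 here) well under the 4,000 limit.
CREATE TABLE mydataset.orders_by_customer
(
  order_id INT64,
  order_date DATE,
  customer_id INT64
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 10000000, 100000));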

You can get partitioning and clustering candidates by analysing Netezza statistics and query history tables.

List large tables

You can start with a simple query that lists the large tables in your database; later, you will extend this query to identify the best key candidates. In Netezza, you can list the large tables in the ADMIN schema of YOUR_DATABASE using the following query:

SELECT 
_V_TABLE.database as "Database Name",
_V_TABLE.schema as "Schema Name",
_V_TABLE.tablename as "Table Name",
stat.used_bytes/pow(1024, 3) as "Used GB"
FROM _v_table_storage_stat stat
INNER JOIN _V_TABLE on (stat.objid = _V_TABLE.objid)
WHERE
"Used GB" > 1
and _V_TABLE.objtype='TABLE'
and _V_TABLE.database = 'YOUR_DATABASE'
and _V_TABLE.schema = 'ADMIN'
ORDER BY "Used GB" desc, "Table Name";

The results will look like:

Query 1 result

Use Netezza statistics to get information about table columns

Next, you can enhance the above query and use Netezza statistics to get information about columns in your large tables. Your goal is to identify good partitioning and clustering keys, so select columns that:

  • Have DATE/TIMESTAMP, INTEGER, BOOL, NUMERIC, CHAR, VARCHAR, NCHAR, or NVARCHAR data types. This is done by adding a filter on _v_relation_column.atttypid.
  • Have at least two unique values. This is implemented by a filter on _v_relation_column.attdispersion, where dispersion is roughly the inverse of the number of distinct values in a column, so a filter of attdispersion <= 0.5 keeps columns with at least two unique values.

Before running the next query, make sure the table statistics are up to date (for example, by running GENERATE STATISTICS on the tables of interest). Adding the column statistics to the first query, you’ll get:

SELECT          
-- Tables:
_V_TABLE.DATABASE as "Database Name",
_V_TABLE.schema as "Schema Name",
_V_TABLE.TABLENAME as "Table Name",
_V_TABLE.reltuples as "# of Rows",
round(stat.used_bytes/pow(1024, 3), 3) as "Used GB",
-- Columns:
_v_relation_column.attname as "Column Name",
case _v_relation_column.atttypid
WHEN 20 THEN 'BIGINT'
WHEN 21 THEN 'SMALLINT'
WHEN 23 THEN 'INTEGER'
WHEN 1082 THEN 'DATE'
WHEN 1184 THEN 'TIMESTAMP'
WHEN 2500 THEN 'BYTEINT'
WHEN 16 THEN 'BOOLEAN'
WHEN 1700 THEN 'NUMERIC'
WHEN 18 THEN 'CHAR'
WHEN 1042 THEN 'CHARACTER(' || _v_relation_column.atttypmod -12 || ')'
WHEN 1043 THEN 'VARCHAR(' || _v_relation_column.atttypmod -12 || ')'
WHEN 2522 THEN 'NCHAR(' || _v_relation_column.atttypmod -12 || ')'
WHEN 2530 THEN 'NVARCHAR(' || _v_relation_column.atttypmod -12 || ')'
end as "Column Type",
-- Statistics Details
CASE WHEN _v_statistic.recent IS NULL THEN 'not maintained '
ELSE
CASE
WHEN _v_statistic.recent = 85 THEN 'Full '
WHEN _v_statistic.recent = 341 THEN 'Express '
WHEN _v_statistic.recent = 149 THEN 'Basic '
WHEN _v_statistic.recent = 1 THEN 'Full Min/Max OK '
WHEN _v_statistic.recent = 257 THEN 'Express Min/Max OK'
WHEN _v_statistic.recent = 129 THEN 'Basic Min/Max OK '
WHEN _v_statistic.recent = 0 THEN 'Full Outdated '
WHEN _v_statistic.recent = 256 THEN 'Express Outdated '
WHEN _v_statistic.recent = 128 THEN 'Basic Outdated '
WHEN _v_statistic.recent = 169 THEN 'Min/Max Only '
WHEN _v_statistic.recent = 170 THEN 'Unavailable '
WHEN _v_statistic.recent = 16554 THEN 'Unavailable '
ELSE SUBSTRING ('flags=' || _v_statistic.recent || ' ', 1, 20)
END END as "Statistics",
-- Column stats:
SUBSTRING(NVL(_v_statistic.loval,'') || ' ', 1, 15) || ' ' as "Minimum Value",
SUBSTRING(NVL(_v_statistic.hival,'') || ' ', 1, 15) as "Maximum Value",
CASE WHEN _v_relation_column.attdispersion = 0
THEN ' '
WHEN _v_relation_column.attdispersion = -1
THEN ' 100% Unique '
ELSE TO_CHAR( (CAST((1.0/_v_relation_column.attdispersion) AS BIGINT)), ' 999,999,999,999,999 ' )
END as "# of Unique Values",
CASE WHEN _v_statistic.nullfrac = 0
THEN ' '
ELSE TO_CHAR( (CAST((RELTUPLES * _v_statistic.nullfrac) AS BIGINT)), '999,999,999,999,999' )
END as "# of NULLs",
CASE WHEN _v_relation_column.attdispersion = -1
THEN 1
ELSE round(_V_TABLE.reltuples * (1 - _v_statistic.nullfrac) * _v_relation_column.attdispersion, 3)
END as "# of Rows in Cluster",
round((stat.used_bytes / pow(1024, 3)) / CASE _v_relation_column.attdispersion WHEN -1 THEN _V_TABLE.reltuples ELSE 1.0 / _v_relation_column.attdispersion END, 3) as "Cluster Size GB",
round(stat.used_bytes * _v_statistic.nullfrac / pow(1024, 3), 3) as "NULL-Cluster Size GB"

FROM _v_relation_column
inner join _V_TABLE on
_v_relation_column.objid = _V_TABLE.OBJID
inner join _v_table_storage_stat stat on
stat.objid = _V_TABLE.OBJID
left join _v_statistic on
_v_relation_column.objid = _v_statistic.objid AND
_v_relation_column.attnum = _v_statistic.attnum
WHERE
"Used GB" > 1
and _V_TABLE.objtype='TABLE'
and _V_TABLE.DATABASE = 'YOUR_DATABASE'
and _V_TABLE.schema = 'ADMIN'
and _v_relation_column.atttypid IN (20, 21, 23, 1082, 1184, 2500, 16, 1700, 18, 1042, 1043, 2522, 2530)
and _v_relation_column.attdispersion <= 0.5
ORDER BY
"Used GB" desc,
"Table Name",
_v_relation_column.attnum;

Here are the results:

Query 2 result

To identify good partitioning/clustering key candidates, the above query still needs some additional information. In particular, you need to identify the columns that are referenced most often in your production workload.

Gather historical information about column references in SQL statements

You can enable Netezza history collection to gather information about column references in SQL statements. You can get this information from $v_hist_column_access_stats (the NUM_WHERE and NUM_HAVING columns). Additionally, a key column should not be frequently updated, that is, its NUM_UPDATED should be small. You can compute the number of filter and update references as a fraction of all column references.
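
If history collection is not enabled yet, the following is a hedged sketch of a Netezza history configuration; the configuration name, credentials, and load interval are placeholders, and the available options may vary by Netezza version:

-- Hypothetical configuration: COLLECT COLUMN enables the column-access
-- statistics queried below; histdb/histown match the history database
-- and owner used later in this article.
CREATE HISTORY CONFIGURATION hist_cfg
HISTTYPE QUERY
DATABASE histdb
USER histown
PASSWORD 'password'
COLLECT QUERY, COLUMN
LOADINTERVAL 5;

SET HISTORY CONFIGURATION hist_cfg;
-- The new configuration takes effect after the system restarts.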

In BigQuery, you can partition a table by a DATE/TIMESTAMP or INTEGER column and additionally specify up to four clustering columns. Only a partitioned table can be clustered. So, add column reference statistics to the previous query and select the top five columns ordered by the number of references: first, partitioning key candidates, then clustering key candidates. If your history database and schema are called histdb.histown, you’ll get:

SELECT          
-- Tables:
_V_TABLE.DATABASE as "Database Name",
_V_TABLE.schema as "Schema Name",
_V_TABLE.TABLENAME as "Table Name",
_V_TABLE.reltuples as "# of Rows",
round(stat.used_bytes/pow(1024, 3), 3) as "Used GB",
-- Columns:
_v_relation_column.attname as "Column Name",
case _v_relation_column.atttypid
WHEN 20 THEN 'BIGINT'
WHEN 21 THEN 'SMALLINT'
WHEN 23 THEN 'INTEGER'
WHEN 1082 THEN 'DATE'
WHEN 1184 THEN 'TIMESTAMP'
WHEN 2500 THEN 'BYTEINT'
WHEN 16 THEN 'BOOLEAN'
WHEN 1700 THEN 'NUMERIC'
WHEN 18 THEN 'CHAR'
WHEN 1042 THEN 'CHARACTER(' || _v_relation_column.atttypmod -12 || ')'
WHEN 1043 THEN 'VARCHAR(' || _v_relation_column.atttypmod -12 || ')'
WHEN 2522 THEN 'NCHAR(' || _v_relation_column.atttypmod -12 || ')'
WHEN 2530 THEN 'NVARCHAR(' || _v_relation_column.atttypmod -12 || ')'
end as "Column Type",
-- Statistics Details
CASE WHEN _v_statistic.recent IS NULL THEN 'not maintained '
ELSE
CASE
WHEN _v_statistic.recent = 85 THEN 'Full '
WHEN _v_statistic.recent = 341 THEN 'Express '
WHEN _v_statistic.recent = 149 THEN 'Basic '
WHEN _v_statistic.recent = 1 THEN 'Full Min/Max OK '
WHEN _v_statistic.recent = 257 THEN 'Express Min/Max OK'
WHEN _v_statistic.recent = 129 THEN 'Basic Min/Max OK '
WHEN _v_statistic.recent = 0 THEN 'Full Outdated '
WHEN _v_statistic.recent = 256 THEN 'Express Outdated '
WHEN _v_statistic.recent = 128 THEN 'Basic Outdated '
WHEN _v_statistic.recent = 169 THEN 'Min/Max Only '
WHEN _v_statistic.recent = 170 THEN 'Unavailable '
WHEN _v_statistic.recent = 16554 THEN 'Unavailable '
ELSE SUBSTRING ('flags=' || _v_statistic.recent || ' ', 1, 20)
END END as "Statistics",
-- Column stats:
SUBSTRING(NVL(_v_statistic.loval,'') || ' ', 1, 15) || ' ' as "Minimum Value",
SUBSTRING(NVL(_v_statistic.hival,'') || ' ', 1, 15) as "Maximum Value",
CASE WHEN _v_relation_column.attdispersion = 0
THEN ' '
WHEN _v_relation_column.attdispersion = -1
THEN ' 100% Unique '
ELSE TO_CHAR( (CAST((1.0/_v_relation_column.attdispersion) AS BIGINT)), ' 999,999,999,999,999 ' )
END as "# of Unique Values",
CASE WHEN _v_statistic.nullfrac = 0
THEN ' '
ELSE TO_CHAR( (CAST((RELTUPLES * _v_statistic.nullfrac) AS BIGINT)), '999,999,999,999,999' )
END as "# of NULLs",
CASE WHEN _v_relation_column.attdispersion = -1
THEN 1
ELSE round(_V_TABLE.reltuples * (1 - _v_statistic.nullfrac) * _v_relation_column.attdispersion, 3)
END as "# of Rows in Cluster",
round((stat.used_bytes / pow(1024, 3)) / CASE _v_relation_column.attdispersion WHEN -1 THEN _V_TABLE.reltuples ELSE 1.0 / _v_relation_column.attdispersion END, 3) as "Cluster Size GB",
round(stat.used_bytes * _v_statistic.nullfrac / pow(1024, 3), 3) as "NULL-Cluster Size GB",
-- Reference counts:
round((hist.NUM_WHERE + hist.NUM_HAVING) / (hist.refs - NUM_GENSTATS), 3) as "Fraction of WHERE/HAVING Refs",
round(hist.NUM_UPDATED / (hist.refs - NUM_GENSTATS), 3) as "Fraction of UPDATE Refs"

FROM _v_relation_column
inner join _V_TABLE on
_v_relation_column.objid = _V_TABLE.OBJID
inner join _v_table_storage_stat stat on
stat.objid = _V_TABLE.OBJID
left join _v_statistic on
_v_relation_column.objid = _v_statistic.objid AND
_v_relation_column.attnum = _v_statistic.attnum
left join histdb.histown."$v_hist_column_access_stats" hist on
_V_TABLE.DATABASE = hist.DBNAME AND
_V_TABLE.schema = hist.schemaname AND
_V_TABLE.TABLENAME = hist.TABLENAME AND
_v_relation_column.attname = hist.COLUMNNAME
WHERE
"Used GB" > 1
and _V_TABLE.objtype = 'TABLE'
and _V_TABLE.DATABASE = 'YOUR_DATABASE'
and _V_TABLE.schema = 'ADMIN'
and _v_relation_column.atttypid IN (20, 21, 23, 1082, 1184, 2500, 16, 1700, 18, 1042, 1043, 2522, 2530)
and _v_relation_column.attdispersion <= 0.5
ORDER BY
"Used GB" desc,
"Table Name",
"Fraction of WHERE/HAVING Refs" desc,
"Fraction of UPDATE Refs",
_v_relation_column.attnum;

Here are the results:

Query 3 result

First, select a partitioning key for each table: that is the top-ranked DATE/TIMESTAMP/INTEGER column. Then consider the rest of the columns and select the clustering keys.

Below are the definitions of the last three columns in the result set. Assuming that the data is distributed uniformly, if you use “Column Name” as the top-level partitioning/clustering key, then:

  • # of Rows in Cluster — the approximate number of rows in each cluster,
  • Cluster Size GB — the approximate size of each cluster,
  • NULL-Cluster Size GB — the approximate size of the cluster for which Column Name is NULL.
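
For example, with purely hypothetical numbers: for a 50 GB table with 100,000,000 rows and a candidate column with 1,000 unique values (attdispersion = 0.001) and nullfrac = 0.1, the query would report # of Rows in Cluster = 100,000,000 * (1 - 0.1) * 0.001 = 90,000, Cluster Size GB = 50 / 1,000 = 0.05, and NULL-Cluster Size GB = 50 * 0.1 = 5.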

Note that NULL values do not contribute to the column dispersion statistics computed by Netezza, so the # of Rows in Cluster result is adjusted to account for that. Additionally, the NULL-Cluster Size GB column computes how much disk space the rows with a NULL partitioning key consume.

Note that the query doesn’t limit the number of listed columns. If you have wide tables and the query returns an unmanageable set of columns, you can always limit the results by using RANK() as shown below and selecting the top six columns for each table.

RANK() OVER (
PARTITION BY _V_TABLE.OBJID
ORDER BY
round((hist.NUM_WHERE + hist.NUM_HAVING) / (hist.refs - NUM_GENSTATS), 3) desc,
round(hist.NUM_UPDATED / (hist.refs - NUM_GENSTATS), 3)
)
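
For instance, here is a minimal sketch of how this ranking plugs into the full query, with the long select list and the FROM/WHERE clauses elided as comments:

SELECT *
FROM (
  SELECT
    -- the full select list of the previous query goes here, followed by:
    RANK() OVER (
      PARTITION BY _V_TABLE.OBJID
      ORDER BY
        round((hist.NUM_WHERE + hist.NUM_HAVING) / (hist.refs - NUM_GENSTATS), 3) desc,
        round(hist.NUM_UPDATED / (hist.refs - NUM_GENSTATS), 3)
    ) as "Column Rank"
  -- the same FROM and WHERE clauses as the previous query
) ranked
WHERE "Column Rank" <= 6;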

Summarizing the steps

Review the results of the last query and select the best partitioning key candidate, and then the clustering key candidates, from the suggested columns.

Also, you can partition and cluster on the same column, which can be particularly useful for time series data whose granularity is finer than a day. Note that if you don’t want to use partitions but still want clustering, you can create a table with only one partition, as sketched below.
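
A hedged BigQuery DDL sketch of both variants follows; every dataset, table, and column name is a hypothetical placeholder:

-- Time series finer than a day: partition by day and cluster by the same
-- timestamp column, so queries prune both across and within days.
CREATE TABLE mydataset.events
(
  event_ts TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY event_ts;

-- Clustering only: a constant-valued INT64 column (always 0) places all rows
-- in a single integer-range partition, leaving pruning to the clustering key.
CREATE TABLE mydataset.customers
(
  part_id INT64,
  customer_id INT64,
  name STRING
)
PARTITION BY RANGE_BUCKET(part_id, GENERATE_ARRAY(0, 1, 1))
CLUSTER BY customer_id;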

If needed, merge several key values into a single partition when the default split produces partitions smaller than 512 MB or when the number of partitions exceeds 4,000. Note that if a partition spans several clustering key values, the data for each clustering key value is stored in a separate file. Thus, when selecting a partition size, aim for each cluster to be larger than 512 MB.
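
For example, with hypothetical numbers: if Cluster Size GB reports 0.05 GB per key value, you would need to merge at least 0.5 / 0.05 = 10 adjacent key values into each partition to reach the 512 MB guideline, e.g. by widening the RANGE_BUCKET step or by truncating a date key to a coarser unit with DATE_TRUNC().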
