Collect statistics for Greenplum Database table objects faster, easier, and risk-free using the analyzedb utility

Constantinos Antzakas
Greenplum Data Clinics
4 min read · Sep 25, 2019

analyzedb is a Greenplum Database management utility that performs incremental, concurrent ANALYZE operations on tables. For append-optimized tables, it updates statistics only if the statistics are not current.

The analyzedb utility updates append-optimized table statistics if the table has been modified by DML or DDL commands, including INSERT, DELETE, UPDATE, CREATE TABLE, ALTER TABLE, and TRUNCATE. The utility determines whether a table has been modified by comparing its current catalog metadata with the snapshot of metadata taken during the previous analyzedb operation. These snapshots are stored as state files in the directory db_analyze/<db_name>/<timestamp> under the Greenplum Database master data directory.
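To see these snapshots on disk, you can list the state directory on the master host. A minimal sketch, assuming the standard MASTER_DATA_DIRECTORY environment variable is set by the Greenplum environment (the fallback path below is only a placeholder):

```shell
# Sketch: inspect analyzedb state files on the Greenplum master host.
# MASTER_DATA_DIRECTORY is assumed to be set by the Greenplum environment;
# the fallback path below is only a placeholder.
STATE_DIR="${MASTER_DATA_DIRECTORY:-/data/master/gpseg-1}/db_analyze"

# Each database gets its own subdirectory, containing one timestamped
# snapshot of table metadata per analyzedb run.
ls -lt "$STATE_DIR"/*/ 2>/dev/null || echo "no analyzedb state files under $STATE_DIR"
```

Deleting these state files is safe; analyzedb simply treats the next run as a full (non-incremental) pass.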

Common uses of analyzedb utility

  • In its simplest form, the analyzedb utility is run with a target database plus a schema or table:
analyzedb -d dbname { -s schema }

or

analyzedb -d dbname { -t schema.table }
  • If a table, set of tables, or schema is not specified, the analyzedb utility collects statistics as needed on all system catalog tables and user-defined tables in the database. This allows those responsible for Greenplum Database operations to refresh all stale statistics across a database with a single command:
analyzedb -d dbname
  • By default, analyzedb creates a maximum of 5 concurrent sessions to analyze tables in parallel. For each session, analyzedb issues an ANALYZE command to the database with a different table name. The -p option controls the maximum number of concurrent sessions; the parallel-level parameter can be any integer between 1 and 10 (inclusive).
analyzedb -d dbname -p parallel-level

Also by default, analyzedb takes advantage of the ANALYZE command's faster PostgreSQL-based implementation for gathering table statistics, introduced in Greenplum Database version 5.x. Previously, a separate query was issued for each column; the 5.x update improves ANALYZE performance for both heap and append-optimized tables by collecting a sample of rows in a single query and calculating the statistics for each column in memory. These improvements produce better overall statistics for queries that involve large tables and greatly improve the efficiency of running ANALYZE on smaller tables in the database.

  • If needed, instead of calculating statistics for every column of a table, one can define a column or set of columns for which to gather (or skip) statistics, using the -i or -x option, respectively. Either option must be specified together with the -t schema.table option, and -i and -x are mutually exclusive:
analyzedb -d dbname { -t schema.table 
[ -i col1[, col2, ...] |
-x col1[, col2, ...] ] }
  • More often than not, tables get created in a Greenplum database but statistics are never gathered for them. With a little help from the Greenplum Database catalog view gp_toolkit.gp_stats_missing, we can identify these tables and generate the appropriate analyzedb commands:
SELECT 'analyzedb -d ' || table_catalog || ' -t ' || table_schema || '.' || table_name || ';'
FROM
gp_toolkit.gp_stats_missing STATSM
JOIN information_schema.tables INFO
ON INFO.table_schema = STATSM.smischema
AND INFO.table_name = STATSM.smitable
ORDER BY
table_catalog, table_schema, table_name;

This query produces a series of analyzedb commands, similar to those below:

analyzedb -d gpadmin -t demo.reviews_ao_co_zlib3;
analyzedb -d gpadmin -t demo.reviews_ao_ro_zlib3;
analyzedb -d gpadmin -t demo.reviews_heap;
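The generate-and-run step can also be automated end to end with a small shell loop. A sketch, assuming psql connectivity to the target database and using the -a option of analyzedb to skip the confirmation prompt (connection flags will differ per cluster):

```shell
# Sketch: find tables with missing statistics and analyze each one.
# Assumes psql can reach the target database; adjust flags for your cluster.
run_missing_stats() {
  db="$1"
  psql -d "$db" -At -c "
    SELECT 'analyzedb -d ' || table_catalog || ' -t ' || table_schema || '.' || table_name
    FROM gp_toolkit.gp_stats_missing s
    JOIN information_schema.tables t
      ON t.table_schema = s.smischema
     AND t.table_name  = s.smitable" |
  while IFS= read -r cmd; do
    echo "Running: $cmd"
    $cmd -a   # -a runs analyzedb without prompting for confirmation
  done
}

# Usage: run_missing_stats gpadmin
```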
  • Finally, we can dive deeper into the Greenplum Database catalog and identify the last time statistics were collected for the tables in a database. This lets us run a query that finds tables whose statistics are older than a given point in time (e.g. exactly one hour ago), and prepare a list of analyzedb statements for exactly those tables:
SELECT
'analyzedb -d ' || table_catalog || ' -t ' || table_schema || '.' || table_name || '; ## Stats Last Updated on: ' || TO_CHAR(statime, 'YYYY-MM-DD HH24:MI:SS') || ', Control Timestamp: ' || TO_CHAR((NOW() - INTERVAL '1 HOURS')::TIMESTAMP, 'YYYY-MM-DD HH24:MI:SS') AS control_ts
FROM
pg_catalog.pg_stat_operations STAT_OPS
JOIN information_schema.tables INFO
ON INFO.table_schema = STAT_OPS.schemaname
AND INFO.table_name = STAT_OPS.objname
WHERE
classname = 'pg_class'
AND table_schema NOT IN ('pg_catalog', 'information_schema', 'gp_toolkit', 'madlib')
AND actionname = 'ANALYZE'
AND statime <= (NOW() - INTERVAL '1 HOURS')
ORDER BY
table_catalog, table_schema, table_name;

This query, in turn, produces a series of analyzedb commands, similar to those below:

analyzedb -d gpadmin -t demo.reviews_heap; ## Stats Last Updated on: 2019-09-20 13:26:45, Control Timestamp: 2019-09-20 15:25:31
analyzedb -d gpadmin -t public.spatial_ref_sys; ## Stats Last Updated on: 2019-09-20 09:22:18, Control Timestamp: 2019-09-20 15:25:31
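This pattern lends itself to a scheduled refresh job. A sketch, again assuming psql connectivity, with the staleness threshold parameterized (the one-hour default is illustrative; the query mirrors the one above):

```shell
# Sketch: re-analyze tables whose statistics are older than a threshold.
# Assumes psql connectivity; the default one-hour threshold is illustrative.
refresh_stale_stats() {
  db="$1"
  stale_hours="${2:-1}"
  psql -d "$db" -At -c "
    SELECT 'analyzedb -d ' || table_catalog || ' -t ' || table_schema || '.' || table_name
    FROM pg_catalog.pg_stat_operations o
    JOIN information_schema.tables t
      ON t.table_schema = o.schemaname
     AND t.table_name  = o.objname
    WHERE o.classname  = 'pg_class'
      AND o.actionname = 'ANALYZE'
      AND t.table_schema NOT IN ('pg_catalog', 'information_schema', 'gp_toolkit', 'madlib')
      AND o.statime <= NOW() - INTERVAL '$stale_hours HOURS'" |
  while IFS= read -r cmd; do
    $cmd -a   # run without prompting for confirmation
  done
}

# Example: refresh stats for tables last analyzed over an hour ago
# refresh_stale_stats gpadmin 1
```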

analyzedb vs. SQL ANALYZE

The analyzedb utility and the ANALYZE SQL command each have their own syntax, but ultimately both serve the same purpose: updating statistics for Greenplum Database tables. They can therefore be used almost interchangeably.

Personally, I prefer analyzedb because it updates statistics for the specified tables in a Greenplum database incrementally and concurrently. While performing ANALYZE operations, analyzedb creates a snapshot of the table metadata and stores it on disk on the master host, so an ANALYZE operation is performed only if the table has been modified. If a table or partition has not been modified since it was last analyzed, analyzedb automatically skips it because it already has up-to-date statistics.

Yet the final decision on analyzedb vs. ANALYZE rests with the DataOps engineer or database administrator responsible for statistics collection on the Greenplum Database tables; users of any experience level can easily convert the examples above to use ANALYZE instead of the analyzedb utility.
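As a starting point for such a conversion, the plain-SQL equivalents of the earlier examples look like this; a sketch wrapped in a shell function, where the table and column names are illustrative and psql connectivity is assumed:

```shell
# Sketch: ANALYZE equivalents of the analyzedb examples above.
# Table and column names are illustrative; assumes psql connectivity.
analyze_examples() {
  psql -d gpadmin <<'SQL'
ANALYZE demo.reviews_heap;                          -- whole table
ANALYZE demo.reviews_heap (review_id, review_date); -- selected columns only
SQL
}

# Usage: analyze_examples
```

Note that plain ANALYZE offers no incremental skipping: it recollects statistics every time it runs, whether or not the table has changed.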

Let me know if you found this post helpful, spotted any problems, want to ask a question, or have a suggestion for my next post; your feedback is very much appreciated. Feel free to respond below, or connect with me on LinkedIn.
