The 4+1 Questions Every Data Engineer Should Always Ask When Managing Table Objects In Greenplum Database

Learn the basics of managing table objects in the Greenplum Database, become a smarter Data Engineer and avoid common database design pitfalls

Constantinos Antzakas
Greenplum Data Clinics
5 min read · Jun 12, 2020


Photo by Kevin Ku on Unsplash

Let’s start!

1. CREATE TABLE (DDL)

Q. How can I find the exact CREATE TABLE definition (DDL) which was used to create the table?

A. Use the pg_dump utility from the command line to generate the CREATE TABLE statement for a table, as shown below:
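A minimal sketch of the command follows; the database name reviews_db is an assumption, and demo.amzn_reviews is the example table used throughout this article:

```shell
# Dump only the DDL (schema) for a single table, without any data.
# "reviews_db" is a placeholder database name.
pg_dump reviews_db --schema-only --table=demo.amzn_reviews
```

The output includes the CREATE TABLE statement along with any constraints and indexes defined on the table.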

2.1 DESCRIBE TABLE (using psql meta-commands)

Q. Greenplum does not provide a DESCRIBE TABLE statement. How can I find information on the columns for a given table?

A. Connect to the database from the command line using the psql utility, and execute the \d (or \d+) meta-command, followed by the name of the table, as shown below:

For each relation matching the pattern, \d[S]+ [ pattern ], show all columns, their types, the tablespace (if not the default) and any special attributes, such as NOT NULL or defaults. Associated indexes, constraints, rules, and triggers are also shown. By default, only user-created objects are shown; supply a pattern or the S modifier to include system objects. For example,
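A minimal example, run either interactively at the psql prompt or via the -c flag (the database name reviews_db is an assumption):

```shell
# Describe the columns of demo.amzn_reviews from the command line.
psql -d reviews_db -c '\d demo.amzn_reviews'
```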

returns:

Table “demo.amzn_reviews”

The command form \d[S]+ is identical, except that more information is displayed: any comments associated with the columns of the table are shown, as is the presence of OIDs in the table, and the view definition if the relation is a view.

For example,
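The extended form adds a + to the meta-command; again, the database name is a placeholder:

```shell
# Describe the table with extra detail: column comments, storage, OIDs, etc.
psql -d reviews_db -c '\d+ demo.amzn_reviews'
```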

returns:

Table “demo.amzn_reviews”

2.2 DESCRIBE TABLE (using information_schema)

Q. I don’t have access to the psql utility or the command line. How can I find information on the columns for a given table?

A. Query the information_schema.columns view and filter on the TABLE_NAME for the table you are interested in. For example:
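A minimal query along these lines, assuming the demo.amzn_reviews table from the earlier examples:

```sql
-- List the columns of demo.amzn_reviews in definition order.
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'demo'
  AND table_name   = 'amzn_reviews'
ORDER BY ordinal_position;
```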

returns:

Table “demo.amzn_reviews”

3. TABLE SIZE and DISK SPACE

Q. How can I see how much total disk space is taken by a given table, including information for indexes, toasted data, and compression?

A. The gp_size_of_table_disk, gp_size_of_table_uncompressed, and gp_size_of_all_table_indexes views in the gp_toolkit schema have all the information you need. Here is an example:
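A sketch of such a query against the disk-size view, again assuming the demo.amzn_reviews table; the column names follow the gp_toolkit documentation, and the cast to bigint is there so that pg_size_pretty() accepts the value:

```sql
-- Disk footprint of demo.amzn_reviews, formatted for humans.
SELECT sotdschemaname,
       sotdtablename,
       pg_size_pretty(sotdsize::bigint) AS size_on_disk
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'demo'
  AND sotdtablename  = 'amzn_reviews';
```

The uncompressed-size and index-size views can be queried the same way with their respective column prefixes.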

returns output similar to this:

Use the pg_size_pretty() function to convert a size expressed in bytes into a human-readable format with size units, such as kB, MB, or GB.

4. Data Distribution & Data Skew

Q. I’ve been told Greenplum Database is a “shared-nothing” database in which data is spread across many segments located on multiple segment hosts. How can I be sure I am distributing data correctly?

A. Uneven data distribution, caused by a poor choice of distribution key or by single-tuple INSERT or COPY operations, may cause data skew. Present at the table level, data skew is often the root cause of poor query performance and out-of-memory conditions. Skewed data affects scan (read) performance, but it also affects all other query execution operations, such as JOIN and GROUP BY.

Here are a few things you can check to validate whether data distribution is done properly:

1. Check data distribution across segments

The most common and straightforward way to check for uneven distribution, i.e. data skew, is to count the rows per segment for a given table. For example, this simple query gives a good indication of the data skew for the table store_sales:
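A sketch of that per-segment count, using the hidden gp_segment_id system column:

```sql
-- Row count per segment; roughly equal counts indicate even distribution.
SELECT gp_segment_id, count(*) AS num_rows
FROM store_sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```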

2. Check minimum vs. maximum rows variation per segment

Here is a variation of the previous script, which shows the minimum and maximum numbers of rows per segment as well as the variance between the two:
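One possible form of that variation, with the per-segment counts in a subquery:

```sql
-- Min/max rows per segment and the relative variance between them.
SELECT max(num_rows) AS max_rows,
       min(num_rows) AS min_rows,
       (max(num_rows) - min(num_rows))::float
         / NULLIF(max(num_rows), 0) AS row_variance
FROM (
    SELECT gp_segment_id, count(*) AS num_rows
    FROM store_sales
    GROUP BY gp_segment_id
) AS per_segment;
```

A row_variance close to 0 means the segments hold nearly equal row counts; values approaching 1 indicate severe skew.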

3. Use gp_toolkit schema to check for data skew

The gp_toolkit schema has two views that you can use to check for skew.

  • The gp_toolkit.gp_skew_coefficients view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The skccoeff column shows the CV, calculated as the standard deviation divided by the average; it accounts for both the average and the variability around it. Lower values are better; higher values indicate greater data skew.
  • The gp_toolkit.gp_skew_idle_fractions view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, an indicator of uneven data distribution or query processing skew. The siffraction column reports this fraction: for example, a value of 0.1 indicates 10% skew, 0.5 indicates 50% skew, and so on. Tables with more than 10% skew should have their distribution policies evaluated.
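A sketch of querying both views for a single table, assuming the demo.amzn_reviews table from the earlier examples; the column names follow the gp_toolkit documentation:

```sql
-- Coefficient of variation for demo.amzn_reviews (lower is better).
SELECT skcrelname, skccoeff
FROM gp_toolkit.gp_skew_coefficients
WHERE skcnamespace = 'demo' AND skcrelname = 'amzn_reviews';

-- Idle fraction during a table scan (above 0.1 warrants a review).
SELECT sifrelname, siffraction
FROM gp_toolkit.gp_skew_idle_fractions
WHERE sifnamespace = 'demo' AND sifrelname = 'amzn_reviews';
```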

Conclusion

What are your most common questions about Greenplum Database? Let me know in the comments below. Have these answers helped you understand Greenplum Database and manage your table objects better?

If you liked this article, you may also want to read Managing table objects in Greenplum Database — Part 2: Row- vs. Column-Oriented Storage, follow the Greenplum Data Clinics publication or connect with me on LinkedIn.
