The 4+1 Questions Every Data Engineer Should Always Ask When Managing Table Objects In Greenplum Database

Learn the basics of managing table objects in the Greenplum Database, become a smarter Data Engineer and avoid common database design pitfalls

Constantinos Antzakas
Greenplum Data Clinics
5 min read · Jun 12, 2020


Photo by Kevin Ku on Unsplash

Let’s start!

1. CREATE TABLE (DDL)

Q. How can I find the exact CREATE TABLE definition (DDL) which was used to create the table?

A. Use the pg_dump utility from the command line to generate the CREATE TABLE statement for a table, as shown below:
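A minimal sketch of the command follows; the database name reviews_db is an assumption, and demo.amzn_reviews is the example table used throughout this article:

```shell
# Dump only the DDL (schema) for a single table, without any data.
# "reviews_db" is a placeholder database name.
pg_dump reviews_db --schema-only --table=demo.amzn_reviews
```

The output includes the CREATE TABLE statement along with any constraints and indexes defined on the table.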

2.1 DESCRIBE TABLE (using psql meta-commands)

Q. Greenplum does not provide a DESCRIBE TABLE statement. How can I find information on the columns for a given table?

A. Connect to the database from the command line using the psql utility, and execute the \d (or \d+) meta-command, followed by the name of the table, as shown below:

For each relation matching the pattern, \d[S]+ [ pattern ], show all columns, their types, the tablespace (if not the default) and any special attributes, such as NOT NULL or defaults. Associated indexes, constraints, rules, and triggers are also shown. By default, only user-created objects are shown; supply a pattern or the S modifier to include system objects. For example,
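A minimal example, run either interactively at the psql prompt or via the -c flag (the database name reviews_db is an assumption):

```shell
# Describe the columns of demo.amzn_reviews from the command line.
psql -d reviews_db -c '\d demo.amzn_reviews'
```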

returns:

Table “demo.amzn_reviews”

The command form \d[S]+ is identical, except that more information is displayed: any comments associated with the columns of the table are shown, as is the presence of OIDs in the table, and the view definition if the relation is a view.

For example,
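The extended form adds a + to the meta-command; again, the database name is a placeholder:

```shell
# Describe the table with extra detail: column comments, storage, OIDs, etc.
psql -d reviews_db -c '\d+ demo.amzn_reviews'
```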

returns:

Table “demo.amzn_reviews”

2.2 DESCRIBE TABLE (using information_schema)

Q. I don’t have access to the psql utility or the command line. How can I find information on the columns for a given table?

A. Query the information_schema.columns view and filter on the TABLE_NAME for the table you are interested in. For example:
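A minimal query along these lines, assuming the demo.amzn_reviews table from the earlier examples:

```sql
-- List the columns of demo.amzn_reviews in definition order.
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'demo'
  AND table_name   = 'amzn_reviews'
ORDER BY ordinal_position;
```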

returns:

Table “demo.amzn_reviews”

3. TABLE SIZE and DISK SPACE

Q. How can I see how much total disk space is taken by a given table, including information for indexes, toasted data, and compression?

A. The gp_size_of_table_disk, gp_size_of_table_uncompressed, and gp_size_of_all_table_indexes views in the gp_toolkit schema have all the information you need. Here is an example:
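A sketch of such a query against the disk-size view, again assuming the demo.amzn_reviews table; the column names follow the gp_toolkit documentation, and the cast to bigint is there so that pg_size_pretty() accepts the value:

```sql
-- Disk footprint of demo.amzn_reviews, formatted for humans.
SELECT sotdschemaname,
       sotdtablename,
       pg_size_pretty(sotdsize::bigint) AS size_on_disk
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'demo'
  AND sotdtablename  = 'amzn_reviews';
```

The uncompressed-size and index-size views can be queried the same way with their respective column prefixes.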

returns output similar to this:

Use the pg_size_pretty() function to convert a size expressed in bytes into a human-readable format with size units, such as kB, MB, or GB.

4. Data Distribution & Data Skew

Q. I’ve been told Greenplum Database is a “shared-nothing” database in which data is spread across many segments located on multiple segment hosts. How can I be sure I am distributing data correctly?

A. Uneven data distribution, caused by a poor choice of distribution key or by single-tuple INSERT or COPY operations, may cause data skew. Present at the table level, data skew is often the root cause of poor query performance and out-of-memory conditions. Skewed data affects scan (read) performance, but it also affects all other query execution operations, such as JOIN and GROUP BY.

Here are a few things you can check to validate whether data distribution is done properly:

1. Check data distribution across segments

The most common and straightforward way to check for uneven distribution, i.e. data skew, is to count the rows per segment for a given table. For example, this simple query gives a good indication of the data skew for the table store_sales:
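A sketch of that per-segment count, using the hidden gp_segment_id system column:

```sql
-- Row count per segment; roughly equal counts indicate even distribution.
SELECT gp_segment_id, count(*) AS num_rows
FROM store_sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```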

2. Check minimum vs. maximum rows variation per segment

Here is a variation of the previous script, which shows the minimum and maximum numbers of rows per segment as well as the variance between the two:
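One possible form of that variation, with the per-segment counts in a subquery:

```sql
-- Min/max rows per segment and the relative variance between them.
SELECT max(num_rows) AS max_rows,
       min(num_rows) AS min_rows,
       (max(num_rows) - min(num_rows))::float
         / NULLIF(max(num_rows), 0) AS row_variance
FROM (
    SELECT gp_segment_id, count(*) AS num_rows
    FROM store_sales
    GROUP BY gp_segment_id
) AS per_segment;
```

A row_variance close to 0 means the segments hold nearly equal row counts; values approaching 1 indicate severe skew.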

3. Use gp_toolkit schema to check for data skew

The gp_toolkit schema has two views that you can use to check for skew.

  • The gp_toolkit.gp_skew_coefficients view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The skccoeff column shows the CV, calculated as the standard deviation divided by the average; it accounts for both the average and the variability around it. Lower values are better; higher values indicate greater data skew.
  • The gp_toolkit.gp_skew_idle_fractions view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, an indicator of uneven data distribution or query processing skew. The siffraction column reports this fraction: for example, a value of 0.1 indicates 10% skew, 0.5 indicates 50% skew, and so on. Tables with more than 10% skew should have their distribution policies evaluated.
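A sketch of querying both views for a single table, assuming the demo.amzn_reviews table from the earlier examples; the column names follow the gp_toolkit documentation:

```sql
-- Coefficient of variation for demo.amzn_reviews (lower is better).
SELECT skcrelname, skccoeff
FROM gp_toolkit.gp_skew_coefficients
WHERE skcnamespace = 'demo' AND skcrelname = 'amzn_reviews';

-- Idle fraction during a table scan (above 0.1 warrants a review).
SELECT sifrelname, siffraction
FROM gp_toolkit.gp_skew_idle_fractions
WHERE sifnamespace = 'demo' AND sifrelname = 'amzn_reviews';
```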

Conclusion

What are your most common questions about Greenplum Database? Let me know in the comments below. Have these answers helped you understand Greenplum Database and manage your table objects better?

If you liked this article, you may also want to read Managing table objects in Greenplum Database — Part 2: Row- vs. Column-Oriented Storage, follow the Greenplum Data Clinics publication or connect with me on LinkedIn.
