Managing table objects in Greenplum Database — Part 2: Row- vs. Column-Oriented Storage

Columnar databases are increasingly popular in the Online Analytical Query Processing (OLAP) systems space against the traditional row-oriented storage databases.

Constantinos Antzakas
Greenplum Data Clinics
4 min readJun 22, 2020

--

Photo by Anne Nygård on Unsplash

Many databases for data warehousing and analytics are following this storage model and are columnar-only stores. Rather than following the trends of the time, Greenplum Database provides a choice of storage orientation models: row, column, or a combination of both.

This article explains the options for table storage and how to choose the best storage model for your workload.

A columnar database is a database which stores data by columns rather than by rows.

How to choose the storage orientation model in Greenplum Database?

For most general-purpose or mixed workloads, row-oriented storage offers a great combination of flexibility and performance. However, a column-oriented storage model provides more efficient I/O and storage.

Consider the following requirements when deciding on the storage orientation model for a table in Greenplum Database:

Row-oriented storage

  • Updates of table data: Choose row-oriented storage, if you update the table data frequently.
  • LOAD/Frequent INSERTs: Consider a row-oriented storage model, for efficient LOAD operations and if rows are frequently inserted into the table. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on the disk.

Column-oriented storage

  • Compression. Column data has the same data type, so storage size optimizations are available in column-oriented data that are not available in row-oriented data. Many compression schemes, for example, use the similarity of adjacent data to compress and achieve significant disk space savings.
  • The number of columns in the query(ties): Column-oriented tables are best suited to queries that aggregate many values of a single column where the WHERE or HAVING predicate is also on the aggregate column. For example:
SELECT SUM(salary)…

or

SELECT AVG(salary)… WHERE salary > 10000

or where the WHERE predicate is on a single column and returns a relatively small number of rows. For example:

SELECT salary, dept … WHERE state=’CA’
  • The number of columns in the table: Column-oriented tables can offer better query performance on tables with many columns where you access a small subset of columns in your queries. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small.

A real-life example:

On a 4-node (4x r5.4xlarge) Greenplum Database (v6.6.0) system on AWS, we put to test the two table storage models:

  • Row oriented table (no compression) (amzn-reviews-ro),
  • Column-oriented table (ZLib-Level 3 compression) (amzn-reviews-co).

Each storage model was compared with the other against 6 different use cases:

Bulk (~151M rows) loading

  • Loadingamzn-reviews-ro finished in 544.98 seconds (or appr. 9 minutes and 5 seconds, and
  • Loadingamzn-reviews-co finished in 552.29 seconds (or appr. 9 minutes and 12 seconds).

Loading (~151M rows) from another table

  • Loadingamzn-reviews-ro finished in ~216 seconds (or 3 minutes and 36 seconds), and
  • Loadingamzn-reviews-co finished in ~98 seconds (or 1 minute and 38 seconds).

Table Size and Disk Space Usage

  • amzn-reviews-ro was ~80 GB (no compression),
  • amzn-reviews-co was ~74 GB/32 GB (before/after compression), or ~56% compression overall.

Super Narrow SELECT — 1x “Short” Data Field

SELECT product_id 
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
  • Executed in 5845.564 ms for amzn-reviews-ro, and
  • Executed in 3350.621 foramzn-reviews-co.

Super Narrow SELECT — 1x “Long” Data Field

SELECT review_body 
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
  • Executed in 9244.800 ms for amzn-reviews-ro, and
  • Executed in 23396.014 ms foramzn-reviews-co.

Narrow SELECT — Few columns

SELECT product_id, 
marketplace,
product_category,
star_rating
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
  • Executed in 5717.630 ms for amzn-reviews-ro, and
  • Executed in 4548.015 ms foramzn-reviews-co.

Wide SELECT — Most/Many columns

SELECT marketplace
, customer_id
, product_id
, product_title
, product_parent
, product_category
, star_rating
, helpful_votes
, total_votes
, vine
, verified_purchase
, review_headline, review_body
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
  • Executed in 10359.897 ms for amzn-reviews-ro, and
  • Executed in 29237.993 ms foramzn-reviews-co.

Aggregate/Window Function over a limited number of columns

SELECT COUNT(*)
, product_category
, star_rating
FROM demo.amzn_reviews_heap
GROUP BY 2, 3, 4;
  • Executed in 3989.040 ms for amzn-reviews-ro, and
  • Executed in 1835.179 ms foramzn-reviews-co.

Aggregate/Window Function over more/many columns

SELECT COUNT(*) 
, ROUND(AVG(helpful_votes)) AS helpful_votes_avg
, ROUND(AVG(total_votes)) AS total_votes_avg
, marketplace
, product_category
, star_rating
, verified_purchase
, CASE WHEN LENGTH(review_headline)<= 20 THEN 'Short Headline' ELSE 'Long Headline' END AS review_headline_length_type
, CASE WHEN LENGTH(review_body)<= 400 THEN 'Short Body' ELSE 'Long Body' END AS review_body_length_type
FROM demo.amzn_reviews_ro
GROUP BY 4, 5, 6, 7, 8, 9;
  • Executed in 2780.567 ms for amzn-reviews-ro, and
  • Executed in 1890.663 ms foramzn-reviews-co.

Conclusion

For users who want more than just a one-size-fits-all option, Greenplum Database offers multiple choices of storage orientation models to fit your workload and use case.

Column-oriented storage model provides the most efficient I/O and storage, and really shines for narrow or super narrow queries but fails to deliver for wider SELECT’s. Usually labelled as old-fashioned or obsolete from the new and cool vendors, row-oriented storage is a well-rounded model, still relevant for general-purpose or mixed workloads and offering a great combination of flexibility and performance.

If you liked this article then you may also want to check Managing table objects in Greenplum Database — Part 1: The basics, follow the Greenplum Data Clinics publication or connect with me on LinkedIn.

Looking forward to hearing your comments and if you think there are any additional tests I should be doing on the topic.

--

--

Constantinos Antzakas
Greenplum Data Clinics

Enabling teams to innovate and succeed. Former national 🇬🇷 swimming champion, aspiring Olympic Weightlifter (96kg) and LGBTQ+ in tech advocate.