Managing table objects in Greenplum Database — Part 2: Row- vs. Column-Oriented Storage
Columnar databases are increasingly popular in the Online Analytical Query Processing (OLAP) systems space against the traditional row-oriented storage databases.
Many databases for data warehousing and analytics are following this storage model and are columnar-only stores. Rather than following the trends of the time, Greenplum Database provides a choice of storage orientation models: row, column, or a combination of both.
This article explains the options for table storage and how to choose the best storage model for your workload.
A columnar database is a database which stores data by columns rather than by rows.
How to choose the storage orientation model in Greenplum Database?
For most general-purpose or mixed workloads, row-oriented storage offers a great combination of flexibility and performance. However, a column-oriented storage model provides more efficient I/O and storage.
Consider the following requirements when deciding on the storage orientation model for a table in Greenplum Database:
Row-oriented storage
- Updates of table data: Choose row-oriented storage, if you update the table data frequently.
- LOAD/Frequent INSERTs: Consider a row-oriented storage model, for efficient LOAD operations and if rows are frequently inserted into the table. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on the disk.
Column-oriented storage
- Compression. Column data has the same data type, so storage size optimizations are available in column-oriented data that are not available in row-oriented data. Many compression schemes, for example, use the similarity of adjacent data to compress and achieve significant disk space savings.
- The number of columns in the query(ties): Column-oriented tables are best suited to queries that aggregate many values of a single column where the WHERE or HAVING predicate is also on the aggregate column. For example:
SELECT SUM(salary)…
or
SELECT AVG(salary)… WHERE salary > 10000
or where the WHERE predicate is on a single column and returns a relatively small number of rows. For example:
SELECT salary, dept … WHERE state=’CA’
- The number of columns in the table: Column-oriented tables can offer better query performance on tables with many columns where you access a small subset of columns in your queries. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small.
A real-life example:
On a 4-node (4x r5.4xlarge) Greenplum Database (v6.6.0) system on AWS, we put to test the two table storage models:
- Row oriented table (no compression) (
amzn-reviews-ro
), - Column-oriented table (ZLib-Level 3 compression) (
amzn-reviews-co
).
Each storage model was compared with the other against 6 different use cases:
Bulk (~151M rows) loading
- Loading
amzn-reviews-ro
finished in 544.98 seconds (or appr. 9 minutes and 5 seconds, and - Loading
amzn-reviews-co
finished in 552.29 seconds (or appr. 9 minutes and 12 seconds).
Loading (~151M rows) from another table
- Loading
amzn-reviews-ro
finished in ~216 seconds (or 3 minutes and 36 seconds), and - Loading
amzn-reviews-co
finished in ~98 seconds (or 1 minute and 38 seconds).
Table Size and Disk Space Usage
amzn-reviews-ro
was ~80 GB (no compression),amzn-reviews-co
was ~74 GB/32 GB (before/after compression), or ~56% compression overall.
Super Narrow SELECT — 1x
“Short” Data Field
SELECT product_id
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
- Executed in 5845.564 ms for
amzn-reviews-ro
, and - Executed in 3350.621 for
amzn-reviews-co
.
Super Narrow SELECT — 1x
“Long” Data Field
SELECT review_body
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
- Executed in 9244.800 ms for
amzn-reviews-ro
, and - Executed in 23396.014 ms for
amzn-reviews-co
.
Narrow SELECT — Few columns
SELECT product_id,
marketplace,
product_category,
star_rating
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
- Executed in 5717.630 ms for
amzn-reviews-ro
, and - Executed in 4548.015 ms for
amzn-reviews-co
.
Wide SELECT —
Most/Many columns
SELECT marketplace
, customer_id
, product_id
, product_title
, product_parent
, product_category
, star_rating
, helpful_votes
, total_votes
, vine
, verified_purchase
, review_headline, review_body
FROM demo.amzn_reviews_***
WHERE DATE_PART('year', review_date) BETWEEN 2000 AND 2005;
- Executed in 10359.897 ms for
amzn-reviews-ro
, and - Executed in 29237.993 ms for
amzn-reviews-co
.
Aggregate/Window Function over a limited number of columns
SELECT COUNT(*)
, product_category
, star_rating
FROM demo.amzn_reviews_heap
GROUP BY 2, 3, 4;
- Executed in 3989.040 ms for
amzn-reviews-ro
, and - Executed in 1835.179 ms for
amzn-reviews-co
.
Aggregate/Window Function over more/many columns
SELECT COUNT(*)
, ROUND(AVG(helpful_votes)) AS helpful_votes_avg
, ROUND(AVG(total_votes)) AS total_votes_avg
, marketplace
, product_category
, star_rating
, verified_purchase
, CASE WHEN LENGTH(review_headline)<= 20 THEN 'Short Headline' ELSE 'Long Headline' END AS review_headline_length_type
, CASE WHEN LENGTH(review_body)<= 400 THEN 'Short Body' ELSE 'Long Body' END AS review_body_length_type
FROM demo.amzn_reviews_ro
GROUP BY 4, 5, 6, 7, 8, 9;
- Executed in 2780.567 ms for
amzn-reviews-ro
, and - Executed in 1890.663 ms for
amzn-reviews-co
.
Conclusion
For users who want more than just a one-size-fits-all option, Greenplum Database offers multiple choices of storage orientation models to fit your workload and use case.
Column-oriented storage model provides the most efficient I/O and storage, and really shines for narrow or super narrow queries but fails to deliver for wider SELECT’s. Usually labelled as old-fashioned or obsolete from the new and cool vendors, row-oriented storage is a well-rounded model, still relevant for general-purpose or mixed workloads and offering a great combination of flexibility and performance.
If you liked this article then you may also want to check Managing table objects in Greenplum Database — Part 1: The basics, follow the Greenplum Data Clinics publication or connect with me on LinkedIn.
Looking forward to hearing your comments and if you think there are any additional tests I should be doing on the topic.