Data Engineering Series: Transactional vs Analytical Workloads

Gu Xie
6 min read · Apr 9, 2020


Expanding the reaches of data platforms (Source: datawarehouseinfo.com)

People often confuse transactional and analytical workloads.

In reality, it comes down to the primary consumption pattern — are you building a system to facilitate “transactions” (think point-of-sale), or are you trying to “analyze” history to see trends and aggregated views?

This is a core concept in data management, and the underlying data store technology is designed to optimize for the target workload.

It is absolutely crucial to use the right data store for the right use case.

Why does this matter?

If we did not care to optimize for the data workload, most apps could be built on top of a plain file system. Instead of storing data in a database, imagine for a second that your application stores customers in a file called CUSTOMER.csv.

This would work for the most basic app, but what if the application needs to create or update customer records? Now you will likely have to read the full file into memory, scan through it to find customer 665 and rewrite that record for the update, or append the brand-new customer 666.
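To make the pain concrete, here is a minimal sketch of that flow in Python (the file layout and customer fields are made up for illustration):

```python
import csv

CUSTOMER_FILE = "CUSTOMER.csv"  # hypothetical layout: id, name, city

# seed the file so the example is self-contained
with open(CUSTOMER_FILE, "w", newline="") as f:
    csv.writer(f).writerows([["664", "Ada", "Boston"], ["665", "Bo", "Austin"]])

def update_customer(customer_id: str, new_city: str) -> None:
    """Naive update: pull the whole file into memory, scan it, rewrite everything."""
    with open(CUSTOMER_FILE, newline="") as f:
        rows = list(csv.reader(f))              # entire file read into memory
    for row in rows:
        if row[0] == customer_id:               # full scan just to find one record
            row[2] = new_city
    with open(CUSTOMER_FILE, "w", newline="") as f:
        csv.writer(f).writerows(rows)           # rewrite every record for one change

def append_customer(customer_id: str, name: str, city: str) -> None:
    """Appending is cheap, but nothing guards against duplicate ids or concurrent writers."""
    with open(CUSTOMER_FILE, "a", newline="") as f:
        csv.writer(f).writerow([customer_id, name, city])

update_customer("665", "Brooklyn")                 # update existing customer 665
append_customer("666", "New Customer", "Queens")   # append the brand-new customer 666
```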

This works for simple use cases, but you can imagine that it starts breaking down quickly:

  • What if we have millions of customers?
  • Are we going to scan 1 GB file each time? What if the file is 100 GB?
  • What if we are scaling our app out to write against the file?
  • How do we do parallel writes against the file across multiple instances?
  • How do we handle duplicate records?

This is a simple thought experiment illustrating the challenges faced in the early days of computing. It is exactly these challenges that led to the invention of RDBMS (Relational Database Management Systems) — to handle transactional workloads.

RDBMS engines such as MySQL, Postgres, Oracle, and SQL Server were designed to handle transactional workloads. They have been highly optimized over many years to scale, to ensure transactional consistency, to return quick, responsive queries, and, to some degree, to handle concurrent writes really well — all while exposing a simple language for consuming data: SQL. Some engines, such as SQL Server and Oracle, even ship specific features to enable OLAP workloads via columnstore indexes and the in-memory column store, respectively.
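As a tiny illustration of what the database takes off your hands, here is a sketch using Python's built-in sqlite3 as a stand-in for any RDBMS (the table and column names are invented): the engine locates the row via the primary key and applies the changes atomically, rolling the whole transaction back if anything fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (665, 'Bo', 'Austin')")

try:
    with conn:  # one atomic transaction: both statements commit, or neither does
        conn.execute("UPDATE customer SET city = ? WHERE id = ?", ("Brooklyn", 665))
        conn.execute("INSERT INTO customer VALUES (666, 'New Customer', 'Queens')")
except sqlite3.IntegrityError:
    pass  # e.g. a duplicate primary key rolls the whole transaction back

print(conn.execute("SELECT * FROM customer ORDER BY id").fetchall())
```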

Where do analytical workloads come in?

You can now build an app on top of a database that handles the writes with good response times as you paint the screen with data, and over time you are storing data about your customers, their purchases, the products, the deliveries, and even application health metrics.

Imagine what the next logical set of tasks will be — something along the lines of:

  • What is the most popular product?
  • What are our annual or monthly sales?
  • Are we getting organic sales growth?
  • Who is our top customer?

I’m sure many of you have experienced this, and you can envision the types of queries we are now writing. The challenge with traditional RDBMS, which are optimized for OLTP workloads, is that data is fundamentally stored in the file system row by row.

In the case of extracting the annual sales, you are forcing the DB engine to scan the entire transaction table, filtering on the year in question and then computing the aggregate sales amount in a single pass.
DB engines have ways to optimize for these scenarios — indexing, table partitions, and a cost-based optimizer that minimizes the amount of table scanned — but nevertheless, you will have to scan row by row.
On top of that, in most analytical queries we are likely querying a small set of columns in a large table (e.g. trade date, transaction type, amount), so the engine really doesn’t need to read the entire row.
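To make that concrete, here is a rough sketch of the annual-sales question against a hypothetical transactions table (using sqlite3 purely for illustration) — the query touches only two columns, yet a row store still has to read whole rows to answer it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (trade_date TEXT, transaction_type TEXT, "
    "amount REAL, customer_id INTEGER, product_id INTEGER, notes TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)",
    [("2019-03-01", "SALE", 120.0, 665, 1, ""),
     ("2020-01-15", "SALE", 80.0, 666, 2, ""),
     ("2020-02-20", "REFUND", -80.0, 666, 2, "")],
)

# Annual sales for 2020: only trade_date and amount are needed,
# but a row-oriented engine still reads every column of every row it scans.
annual_sales = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE trade_date LIKE '2020%'"
).fetchone()[0]
print(annual_sales)
```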

This is where columnar (aka column-oriented or column-store) DB engines such as Teradata, Sybase IQ, Vertica, Redshift, BigQuery, Synapse, and Snowflake come into play.

How do columnar databases work?

Row-oriented databases store data sequentially row-by-row.

  • Fast data retrieval for applications because you are getting a single record at a time
  • Fast for record-by-record processing for updates and inserts; very slow for bulk updates
  • Aggregation queries essentially force table scans and are thus slower for analysis

Column-oriented databases store data sequentially by column instead (a small sketch contrasting the two layouts follows the list below).

  • Fast data retrieval by column, but slow for single-record retrieval
  • High levels of compression because values in a column share the same data type and are stored together
  • Fast for bulk update and insert scenarios; very slow for single-record processing
  • Fast for data analysis requiring aggregations
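A minimal sketch of the two layouts, with simple in-memory Python structures standing in for disk blocks, shows why an aggregation touches far less data in the columnar layout while a single-record lookup favors the row layout:

```python
# Row-oriented: each "block" holds complete records.
row_store = [
    {"id": 1, "name": "Ada", "city": "Boston", "amount": 120.0},
    {"id": 2, "name": "Bo",  "city": "Austin", "amount": 80.0},
    {"id": 3, "name": "Cy",  "city": "Boston", "amount": 45.0},
]

# Column-oriented: each "block" holds one column's values, all of the same type.
column_store = {
    "id":     [1, 2, 3],
    "name":   ["Ada", "Bo", "Cy"],
    "city":   ["Boston", "Austin", "Boston"],
    "amount": [120.0, 80.0, 45.0],
}

# Single-record lookup: natural in the row store, scattered across columns otherwise.
record_2 = next(r for r in row_store if r["id"] == 2)

# Aggregation: the row store walks every field of every record,
# while the column store reads just the one column it needs.
total_row_store = sum(r["amount"] for r in row_store)   # touches whole rows
total_column_store = sum(column_store["amount"])         # touches one column
print(record_2, total_row_store, total_column_store)
```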

Row-Oriented Example

Source: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html

The diagram above shows how row-store databases store data on disk in file blocks by row, so the records are stored sequentially.
Think about how your app paints the screen with a single customer record at a time — data retrieval is optimized by rows.

The trade-off is that if you need to aggregate across the entire customer table, you are reading every single block.
E.g. a 100 GB Customer table (above) performing a count by city = 100 GB scanned

Column-Oriented Example

Source: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html

The diagram above shows how the same table is stored in a columnar database — each column is now stored sequentially on disk.
Now the data retrieval pattern is optimized by columns.

Imagine now that you are scanning a single column that is 1/6th of the table, and that the entire column is also compressed (average 4–6x compression ratios).
E.g. a 100 GB Customer table (above) performing a count by city = ~4.17 GB scanned (100 GB / 6 columns / 4x compression ratio)
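The back-of-the-envelope arithmetic behind those numbers — the six equally sized columns and the 4x compression ratio are illustrative assumptions, not measurements:

```python
table_size_gb = 100      # hypothetical Customer table
num_columns = 6          # columns of roughly equal width (assumption)
compression_ratio = 4    # typical columnar compression (assumption)

row_store_scan = table_size_gb                                    # every block read
column_store_scan = table_size_gb / num_columns / compression_ratio
print(row_store_scan, round(column_store_scan, 2))                # 100, 4.17
```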

What are the gotchas for columnar databases?

Technology is fundamentally a balance of trade-offs.
Columnar database architecture is highly optimized for aggregations and data analysis, like historical year-over-year trends, but it falls short in:

Data Ingestion

  1. Slow record-by-record operations for insert, update, delete
  2. Limited optimizations for row-by-row access due to the lack of complex index structures

Latency

  1. Slow record-by-record retrieval — you will not get millisecond responses; expect sub-second for a medium-sized table or a couple of seconds for a very large table
  2. Concurrency tends to be a challenge with most columnar databases since they rely on MPP architecture to parallelize compute

This is because the engine has to rewrite the entire file for each column being referenced, so traditional row-by-row processing is much slower.

Some columnar engines minimize these limitations by:

  • Enabling micro-partitions to minimize the number of file blocks being rewritten
  • Advanced caching mechanisms with caching tiers to enable faster responses on similar queries
  • Advanced column statistics to enable table/partition pruning and filtering (a sketch of this follows below)
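As a rough sketch of that last point, an engine that keeps min/max statistics per partition (or micro-partition) can skip whole partitions that cannot possibly match a filter — the structure below is illustrative, not any particular vendor's implementation:

```python
# Each partition records min/max statistics for the trade_date column.
partitions = [
    {"file": "part-000", "min_date": "2018-01-01", "max_date": "2018-12-31"},
    {"file": "part-001", "min_date": "2019-01-01", "max_date": "2019-12-31"},
    {"file": "part-002", "min_date": "2020-01-01", "max_date": "2020-06-30"},
]

def partitions_to_scan(filter_start: str, filter_end: str) -> list[str]:
    """Prune partitions whose min/max range cannot overlap the filter."""
    return [
        p["file"]
        for p in partitions
        if not (p["max_date"] < filter_start or p["min_date"] > filter_end)
    ]

# A query filtered on 2020 only has to open one of the three partition files.
print(partitions_to_scan("2020-01-01", "2020-12-31"))   # ['part-002']
```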

What’s next?

That concludes the first part of the series on analytical data stores.

I hope this brings to light the core concepts of data management and helps you better understand transactional vs analytical workloads and how to optimize for them.
The holy grail of data processing is the ability to handle mixed workloads, both transactional and analytical, but the industry has yet to achieve this — there are some very interesting players in this space with advanced streaming-based architectures, but it is far too early to tell.

In short, columnar databases are the most optimized way of handling analytical workloads, but if you have real-time ingestion and retrieval use cases, rely on a transactional database.

The next part of the series will explain why we chose Snowflake, diving deeper into the evaluations we performed and our vision for the future.

OLTP vs OLAP

Source: https://datawarehouseinfo.com/how-does-oltp-differ-from-olap-database/

Row vs Columnar

Source: https://blog.jscrambler.com/the-data-processing-holy-grail-row-vs-columnar-databases/
