MPP: The Transformation of Big Data Analytics

Maggy Hu
Published in Slalom Technology
4 min read · Jan 22, 2019

Gone are the days when office small talk was about the missed PAT or the trendiest neighborhood restaurant; now the hype is around “MPP.” If you work in the data and analytics space, you have probably heard this buzzword often. MPP, or massively parallel processing, has been around for some time, but big-name products such as Redshift and Snowflake have put the technology front and center.

What is MPP, you may ask? At the 101 level, MPP means splitting a very large task into multiple sub-tasks and running those sub-tasks at the same time. If you think about it, MPP leverages an elementary strategy as old as time: divide and conquer. An MPP database or data warehouse partitions both data and computing power among several nodes (servers). In most technologies, a designated leader node delegates the work, and worker nodes carry out the tasks.
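The leader and worker split can be sketched in a few lines of Python. This is only a toy illustration, not how any particular warehouse is implemented: the function names (`partition`, `worker_sum`, `leader_total`) are hypothetical, and threads on one machine stand in for separate worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_workers):
    """Split rows into num_workers slices, dealing them out round-robin."""
    return [rows[i::num_workers] for i in range(num_workers)]


def worker_sum(rows):
    """Sub-task run by one (simulated) worker node: sum its own slice only."""
    return sum(rows)


def leader_total(rows, num_workers=4):
    """Leader: hand one slice to each worker, then combine the partial sums."""
    slices = partition(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_sum, slices))
    return sum(partials)


# Hypothetical data: 100 sale amounts, 1 through 100.
sales = list(range(1, 101))
total = leader_total(sales)
```

Each worker only ever sees its own slice, and the leader only combines small partial results, which is the essence of divide and conquer.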

Horizontal vs. Vertical Scaling

The divide-and-conquer principle lends itself to the concept of scaling horizontally vs. vertically. Scaling vertically traditionally addresses the performance of a single piece of data processing. If a query takes a long time to return results, you resize to a larger compute node to account for the data size or the complexity of the query. But slow-running queries are only half of the problem. When multiple concurrent users and their queries fight for the same resources, the answer is scaling horizontally. By adding more compute nodes and spreading the workload across more resources, data processing times can be significantly reduced.

The Impact on Analytics

MPP databases and data warehouses are typically columnar stores, a layout that is generally flexible and economical for analytics. Instead of processing data by rows, which is imperative in transactional systems where all details of a transaction are required, MPP columnar databases process data by, you guessed it, columns. For analytic-driven insights such as sums, averages, and max/mins, there is no need to access the entire row; you only need to read the attributes you are calculating on.

For example, in a transactional database at a grocery store, I want to see everything related to the purchase of one item. That means that for each item scanned, I want to see what the item is, its price, the quantity purchased, the total price, the day it was purchased, and so on. As an analytics person, I just want to see the total sales for that item per day, without all the “frivolous” data such as price and quantity.
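To make the grocery store example concrete, here is a minimal sketch of the same data held row by row and column by column. The records, field names, and helper functions are all made up for illustration; a real columnar engine stores and compresses columns on disk rather than in Python lists.

```python
from collections import defaultdict

# Row-oriented layout: each record keeps every attribute of one scanned item.
rows = [
    {"item": "apple", "price": 0.50, "quantity": 4, "total": 2.00, "day": "2019-01-21"},
    {"item": "bread", "price": 3.00, "quantity": 1, "total": 3.00, "day": "2019-01-21"},
    {"item": "apple", "price": 0.50, "quantity": 2, "total": 1.00, "day": "2019-01-22"},
    {"item": "apple", "price": 0.50, "quantity": 6, "total": 3.00, "day": "2019-01-22"},
]


def to_columns(records):
    """Pivot row records into one list per attribute (columnar layout)."""
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


def sales_per_day(columns, item):
    """Total sales of one item per day, reading only three of the five columns."""
    totals = defaultdict(float)
    for name, day, total in zip(columns["item"], columns["day"], columns["total"]):
        if name == item:
            totals[day] += total
    return dict(totals)


columns = to_columns(rows)
apple_sales = sales_per_day(columns, "apple")
```

The analytic query never touches the `price` or `quantity` columns, whereas a row-oriented scan would have to read every field of every record to answer the same question.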

Data Modeling & Engineering Implications

Storing data in MPP columnar data warehouses not only affects how analytics teams consume data, but also influences how the data is transformed and landed in the warehouse through data modeling and engineering techniques. Unlike in traditional on-premises relational databases, denormalizing and flattening your data model as much as you can leads to the most efficient data processing and query retrieval times. Minimizing joins and avoiding snowflake schemas (where dimension tables are normalized into further tables that reference each other, not to be confused with Snowflake the data warehousing company) increases performance in MPP columnar stores. In the old days, when disk storage was expensive and finite, snowflake schemas were a good way to remove data redundancy and increase performance (think many narrow tables). With modern data architecture, cloud storage is highly scalable and relatively cheap, so data redundancy is less of a concern for big data analytics (think fewer, wider tables). For analytical queries, denormalized data models that limit the number of joins on very large tables can increase performance significantly, while still giving you the flexibility to run complex and unique queries.
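As a rough sketch of what flattening means in practice, the snippet below starts from a small snowflake-style model and resolves the lookups once, at load time, so that the analytic query runs against a single wide table. The tables, keys, and function names are hypothetical and chosen only to illustrate the idea.

```python
# Normalized, snowflake-style model: sales reference products,
# and products in turn reference categories.
categories = {1: "produce", 2: "bakery"}
products = {
    10: {"name": "apple", "category_id": 1},
    11: {"name": "bread", "category_id": 2},
}
sales = [
    {"product_id": 10, "total": 2.0},
    {"product_id": 11, "total": 3.0},
    {"product_id": 10, "total": 4.0},
]


def flatten(sales, products, categories):
    """Resolve both lookups once, producing one wide, denormalized table."""
    flat = []
    for sale in sales:
        product = products[sale["product_id"]]
        flat.append({
            "product": product["name"],
            "category": categories[product["category_id"]],
            "total": sale["total"],
        })
    return flat


def total_by_category(flat_sales):
    """Analytic query against the flattened table: no joins required."""
    totals = {}
    for row in flat_sales:
        totals[row["category"]] = totals.get(row["category"], 0.0) + row["total"]
    return totals


flat_sales = flatten(sales, products, categories)
category_totals = total_by_category(flat_sales)
```

The flattened table repeats the product and category names on every row, which is exactly the redundancy that cheap storage makes acceptable in exchange for join-free queries.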

What Does This Mean for Your Business?

MPP columnar technologies can be incredibly powerful for your organization’s analytic needs. Beyond the question of storing and processing data efficiently for business analytics, organizations overwhelmingly struggle with the same fundamental issues, regardless of industry. Whether it’s talented teams limited by outdated processes and architecture, multiple sources of truth living in different data stores, or siloed data ownership with no data governance, organizations are often stifled by spending far too much time obtaining and consolidating massive amounts of data rather than analyzing it and driving business insights. Although technology can resolve only some of these issues, organizations can benefit from cloud-native storage built on MPP principles when establishing data curation and operational processes.

While the business impact of moving to an MPP architecture is straightforward to see, choosing the right MPP data warehouse solution can be trickier. Many cloud data warehouses leverage MPP, including Redshift, Snowflake, BigQuery, and SQL Data Warehouse, to name a few. Although most share a similar foundation, slight differences between them can noticeably affect the way your business operates and how your teams curate their data. Whichever cloud vendor you choose, MPP technology is the future of data processing and storage for big data analytics.
