PushDown vs. PullUp Processing: Why Things Are Changing

Kirk Haslbeck
Published in CollibraDQ
May 22, 2023

Several years ago, a DBA would not want an application developer to send strenuous processing to the database because it might disrupt database performance, especially if it was an OLTP database responsible for transaction processing in the front office, where a disruption in service would equal a business outage. This set the stage for the split between OLTP- and OLAP-style databases, driven by the difference between analytical and transactional workloads: analytical workloads typically require aggregate functions, large scans, and lots of compute, while transactional workloads require fast writes.

However, modern data processing platforms like Databricks SQL Warehouse are designed differently. The more queries or processing you throw at them, the more they scale out the workloads in the background. The cloud has enabled this with compute pools, containers, and serverless options. In theory, this means a database no longer risks running out of compute and can serve a combination of workloads. You may hear terms like lakehouse, which combines a warehouse and a lake, the idea being that it can function across a variety of use cases and workloads.

When I started using machine learning to automate data quality workloads, the processing was too strenuous to push into a traditional database. Spark was an obvious choice, perhaps the best cluster-compute option available. Having a Spark cluster in my compute plane kept the database separate from the DQ jobs.

Modern databases like Databricks SQL Warehouse use ephemeral compute clusters (Kubernetes and Spark) behind the scenes to scale the workloads. So the real question becomes: do you want your compute pool in front of your database or behind it? There is a good argument that removing the compute plane in front of the database, and letting the database do the same thing for you, simplifies your technical ownership and lowers your cost of ownership. The looming question is always: which one costs less? My professional opinion is that the compute cost difference is becoming negligible and hard to measure, but the human cost difference is substantial.

For years CollibraDQ has used Spark to compute massive workloads quickly. Recently we started sending those same workloads to Databricks SQL Warehouse to compare performance against our Spark engine. The big difference is pulling the data UP (out of the source system into our Spark cluster) versus pushing the operation DOWN into the source system: in this case, sending the same operations to Databricks without moving or egressing any data.
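To make the distinction concrete, here is a minimal sketch of the two paths, assuming a hypothetical `sales.orders` table and placeholder connection details (this is the shape of each approach, not CollibraDQ's actual implementation): PullUp reads rows out over JDBC and profiles them on a Spark cluster, while PushDown ships the equivalent SQL to the Databricks SQL Warehouse so only the small result travels back.

```python
# PullUp: egress rows from the warehouse into a Spark cluster, then compute
# the DQ checks there. Table name, JDBC URL, and credentials are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-pullup").getOrCreate()

pulled = (
    spark.read.format("jdbc")
    .option("url", "jdbc:<warehouse-endpoint>")   # placeholder JDBC URL
    .option("dbtable", "sales.orders")            # hypothetical table
    .load()
)

# Null-count profile computed on the Spark cluster, not in the warehouse.
null_profile = pulled.agg(
    *[F.sum(F.col(c).isNull().cast("int")).alias(c) for c in pulled.columns]
)
null_profile.show()

# PushDown: ship the equivalent operation to the Databricks SQL Warehouse;
# no rows leave the source system, only the aggregated result comes back.
from databricks import sql  # databricks-sql-connector

with sql.connect(
    server_hostname="<workspace-host>",             # placeholders
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT count_if(order_id IS NULL) AS order_id_nulls,
                   count_if(amount   IS NULL) AS amount_nulls
            FROM sales.orders
        """)
        print(cur.fetchall())
```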

To Databricks' credit, they have a few tricks and advantages that we don't. First, DBX can reuse its storage format and partitioning to avoid data skew when moving data into a new cluster. Second, in some cases DBX can position compute closer to the data, avoiding the physics of moving data long distances through network pipes. Third, they have a caching option to keep results "warm", meaning a rerun of the same data would cost next to zero runtime and next to zero dollars.
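A rough way to see that caching effect for yourself is to time the same query twice against the warehouse and compare the cold and warm runs, as in the sketch below (connection details and the table are placeholders, and actual result-cache behavior depends on the warehouse configuration).

```python
# Illustrative cold-vs-warm timing of the same query against a SQL warehouse.
# Treat this as a sketch rather than a benchmark; caching behavior depends
# on warehouse settings, and all connection details are placeholders.
import time
from databricks import sql  # databricks-sql-connector

QUERY = "SELECT COUNT(*) AS row_count, AVG(amount) AS avg_amount FROM sales.orders"

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<token>",
) as conn:
    with conn.cursor() as cur:
        for label in ("cold", "warm"):
            start = time.time()
            cur.execute(QUERY)
            cur.fetchall()
            print(f"{label} run: {time.time() - start:.1f}s")
```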

My Thoughts: The Tradeoff is Simplicity vs Virtualization

The good news is that CDQ offers both PullUp and PushDown modes of operation at the click of a button. Databricks users now have two compute options, the new one being native pushdown into DBX. By roughly every measure, CDQ Pushdown offers a faster and lower-cost option for DBX customers. In addition, it offers a security benefit of less data movement for some organizations. A counter-argument to pushdown is that a compute layer in front of the warehouses can offer a virtualization layer for reconciliation tasks and logical joins across hybrid clouds. That is why it may be important to consider a DQ tool that can span both types of workloads, depending on your business use cases. We find many large organizations have both requirements.

Hardware, Sizing & Caching

The tests were run on a 4X-Large warehouse. Cold runs = complete cluster restarts. Warm runs = a second run with caching on (a rerun on cache). Cost estimates can be derived from the uptime of a 4X-Large, meaning most workloads tested here cost less than $1 to run. The datasets' column counts ranged from 29 to 34. Each column adds data quality processing; a 200-column table is significantly more work than a 20-column table (roughly 10x more, as you might expect). CDQ allows you to opt columns in and out so you can focus your DQ workloads on the columns that are worth the time and dollar cost.
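As a back-of-the-envelope check on that sub-$1 figure, the arithmetic looks roughly like the sketch below. The DBU rate, price per DBU, and runtime are illustrative assumptions, not published Databricks pricing or measured results from our tests.

```python
# Illustrative cost estimate for a single DQ run on a 4X-Large warehouse.
# All three inputs are assumptions, used only to show the arithmetic.
dbu_per_hour = 128      # assumed DBU rate for a 4X-Large SQL warehouse
usd_per_dbu = 0.22      # assumed list price per DBU
runtime_seconds = 75    # example cold-run runtime

cost_usd = (runtime_seconds / 3600) * dbu_per_hour * usd_per_dbu
print(f"~${cost_usd:.2f} for this run")  # ~$0.59, consistent with the under-$1 claim
```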

I believe that in most real-world use cases a DevOps professional will not have the benefit of a rerun and should look at the "cold run" runtimes. Most of the time, data coming in from the pipelines is new daily data and will be processed for the first time at each hop. There are, of course, cases where reruns will greatly benefit.

What is really going on in the DQ workloads?

Let's discuss what is going on in those runtimes. Every column and every row is processed, a data profile is created, and six or more rules per column are generated to fulfill dimensions like completeness, accuracy, and validity. In addition, outlier detection, histograms, and 35 additional manual rules requiring full table scans were also run. In total, roughly 300 operations were run on the tables in the test, some of them regular expressions and ML used to derive the erroneous data points. In no way can the above be compared to a workload that simply loads data; it is far more compute than that. After all, our goal is to stress-test the entire dataset and surface any issue, even relationship issues across columns. The 300+ SQL operations are all logged in both CDQ and DBX. The days of writing 300 rules per dataset by hand are over; today's organizations have far too many datasets for traditional approaches.
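As a sanity check, here is a rough reconstruction of where a figure in the neighborhood of 300 could come from, using the column and rule counts described above; the extra per-column passes for profiling, histograms, and outlier detection are my assumption.

```python
# Rough reconstruction of the ~300 operations per table; the extra_passes
# term (one profiling/histogram/outlier pass per column) is an assumption.
columns = 34             # upper end of the 29-34 column range tested
rules_per_column = 6     # auto-generated checks per column
manual_rules = 35        # hand-written rules requiring full table scans
extra_passes = columns   # assumed profiling/histogram/outlier work per column

total_ops = columns * rules_per_column + manual_rules + extra_passes
print(total_ops)  # 273 -- in the same ballpark as the roughly 300 operations cited
```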

Disclaimer: The Collibra team and I are not the authoritative source for performance benchmarking; we have no bias toward vendors and are focused only on data quality workloads. Our goal is to achieve the lowest cost and fastest runtimes in the interest of scaling a DQ program. We happen to be a team of performance engineers, but we all know there are many variables in performance testing and each one can change the outcomes. We do use the TPC-DS tables and follow best practices.
