
End-to-End Data Analytics Performance

The End-to-End Data Analytics Workflow Requires Generality

Henry Gabb
4 min read · Sep 25, 2020


I came across an article from NVIDIA describing their TPCx-BB benchmark results on the A100 GPU. As a data scientist, I was immediately intrigued because I’m a big fan of the Transaction Processing Performance Council (TPC) benchmarks, which provide reasonable and objective performance metrics. The TPC also has clear rules about how its benchmarks are run and how results are reported, so that results from different vendors can be compared directly. I’ll say more about this later, but first let’s talk about the end-to-end data analytics workflow.

I’ve drawn a rough sketch of the end-to-end data analytics workflow based on my experience as a data scientist (Figure 1). Not all of my data science projects pass through every stage of this workflow, but taken together they cover all of it. Consequently, my computing environment must be able to handle every stage, especially the early ones: OLTP (online transactional processing) and OLAP (online analytical processing). As every data scientist knows, by the time you get to modeling, the hard work is already done. OLTP deals with managing data stores through many small reads and writes, while OLAP deals mainly with information retrieval: large, read-heavy analytical queries. TPCx-BB is mainly an OLAP benchmark.

Figure 1. Rough breakdown of stages in the end-to-end data analytics workflow
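To make the OLTP/OLAP distinction concrete, here is a minimal sketch in Python using the standard-library sqlite3 module (the orders table and its columns are hypothetical, not from any benchmark): the inserts are the kind of small, frequent writes OLTP is about, while the final aggregate query is the kind of retrieval an OLAP benchmark exercises.

```python
# Minimal OLTP vs. OLAP sketch on SQLite (hypothetical schema, purely illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")

# OLTP: small, frequent writes that manage the data store.
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 12.50), (2, 102, 40.00), (3, 101, 7.25)],
)
conn.commit()

# OLAP: a read-heavy analytical query that aggregates across the store.
for row in cur.execute(
    "SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total_spend "
    "FROM orders GROUP BY customer_id"
):
    print(row)
```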

It’s always best to assess a computing environment using your specific workflows, but data science is highly variable. Analytics workflows change from one project to the next. A system architecture that performs well in one stage of the end-to-end workflow may perform poorly in another. Therefore, data analytics requires generality. This is why standard, off-the-shelf benchmarks like TPCx-BB are valuable.

The benchmarks shown in Table 1 were created by experts to objectively assess different stages of the end-to-end data analytics workflow. They’re easy to evaluate (i.e., most have built-in correctness evaluators), their performance metrics are clearly defined, and most offer auditing. To quote TPC, this helps “…protect users from misleading or false performance claims…” With that in mind, let’s return to NVIDIA’s TPCx-BB results.

Table 1. Standard benchmarks for the end-to-end data analytics workflow

TPCx-BB is a big data benchmark that contains elements of OLAP and data modeling. It is designed to measure the performance of Apache Hadoop-based systems using a set of 30 queries that mix SQL, user-defined functions, and machine learning. NVIDIA posted their code on GitHub, so I took a look at their query implementations to see if they actually ran TPCx-BB. They didn’t.
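For context, here is a hedged sketch of the flavor of workload the benchmark prescribes: a SQL aggregation mixed with a user-defined function and a simple machine-learning step, expressed with PySpark. It is illustrative only; the store_sales table, its columns, and the spend_bucket UDF are hypothetical and not taken from the actual TPCx-BB query set.

```python
# Illustrative TPCx-BB-style mix of SQL, a UDF, and ML on Spark (hypothetical data).
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("tpcxbb-style-sketch").getOrCreate()

# Hypothetical sales data standing in for the benchmark's generated retail schema.
sales = spark.createDataFrame(
    [(1, "books", 12.5), (2, "books", 40.0), (3, "music", 7.25)],
    ["customer_id", "category", "amount"],
)
sales.createOrReplaceTempView("store_sales")

# User-defined function, since the benchmark mixes UDFs with plain SQL.
spark.udf.register("spend_bucket", lambda amt: "high" if amt > 20 else "low", StringType())

# SQL portion: aggregate spend per customer and bucket it with the UDF.
per_customer = spark.sql(
    "SELECT customer_id, SUM(amount) AS total_spend, "
    "spend_bucket(SUM(amount)) AS bucket "
    "FROM store_sales GROUP BY customer_id"
)

# Machine-learning portion: cluster customers by total spend.
features = VectorAssembler(inputCols=["total_spend"], outputCol="features").transform(per_customer)
model = KMeans(k=2, featuresCol="features").fit(features)
model.transform(features).show()
```

The real queries are far larger and run against a generated retail dataset at a chosen scale factor, but this mix of SQL, UDF, and ML stages is what matters for the discussion below.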

First, they replaced Spark with Dask, which defeats the purpose of a Hadoop-based benchmark. Dask is a nice technology, but Spark is far more common in data analytics workflows. Second, some of their query implementations ignored the user-defined and/or machine learning functions. Finally, they did not report the required TPCx-BB performance metrics: BBQpm (queries per minute throughput) and Price/BBQpm. The former is critical for a true assessment of overall performance because TPCx-BB models a system under load rather than the performance of isolated queries. The NVIDIA measurements ignore load and throughput, which isn’t realistic.
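To illustrate what the substitution looks like (this is not NVIDIA’s code), here is the same hypothetical aggregation as above expressed with Dask dataframes instead of Spark SQL:

```python
# The same hypothetical per-customer aggregation, in Dask instead of Spark.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "category": ["books", "books", "music"],
    "amount": [12.5, 40.0, 7.25],
})
sales = dd.from_pandas(pdf, npartitions=2)

# Equivalent of the Spark SQL GROUP BY, written against the Dask dataframe API.
per_customer = sales.groupby("customer_id")["amount"].sum().reset_index()
print(per_customer.compute())
```

The point isn’t that Dask is bad; it’s that a run built this way isn’t the benchmark the TPC specifies, so its numbers can’t be compared to audited results.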

The current, audited TPCx-BB results (as of September 25, 2020) from several major hardware vendors are shown in Figure 2. All of their benchmarking systems used Intel Xeon processors at various scale factors and price points. There is no current or historical data for NVIDIA processors.

Figure 2. Audited TPCx-BB benchmark results as of September 25, 2020. (Source: http://www.tpc.org/tpcx-bb/results/tpcxbb_perf_results5.asp, used with permission from TPC)

While I applaud NVIDIA’s attempt to use a standard, off-the-shelf benchmark like TPCx-BB, I would ask them to run the actual benchmark suite and report the primary metrics, if they can. As I said above, the TPC has strict rules about how its benchmarks are used:

“…it should be noted that the TPC benchmark specifications and policies require the submittal of complete documentation on these tests, which are then reviewed by the TPC Council. If a vendor’s TPC benchmark test is determined to be executed improperly or unfairly, a vendor will have to withdraw the result and can no longer use that result publicly. These rules protect users from misleading or false performance claims and preserves the credibility of TPC benchmark results.” (Source: Running a TPC Benchmark)

I’ve taken NVIDIA to task once before for using contrived tests to represent an entire stage of the end-to-end workflow.

Don’t be fooled. Generality is critical in data science. Xeon-based systems scale better and provide the best performance and total cost of ownership (TCO) for the end-to-end data analytics workflow.
