An interactive Collective Knowledge dashboard for the MLPerf Inference v0.5 results.

Benchmarking as a product: a case for Machine Learning

Dr Anton Lokhmotov
Published in The Startup
Apr 7, 2020 · 3 min read

Every professional programmer is familiar with Fred Brooks’s timeless classic “The Mythical Man-Month”. In Fig 1.1, Brooks distinguishes between two types of programming objects:

  • a program: “complete in itself, ready to be run by the author on the system on which it was developed”; “the thing commonly produced in garages”; “the object the individual programmer uses in estimating productivity”.
  • a programming product: “can be run, tested, repaired, and extended by anybody”; “usable in many operating environments, for many sets of data”; “written in a generalized fashion”; “must be thoroughly tested”; “requires thorough documentation”.

Furthermore, Brooks estimates that “a programming product costs at least three times as much as a debugged program with the same function”.

It turns out that we can make a similar distinction for the objects used in performance evaluation, colloquially known as benchmarking. Ed Plowman, former Director of Performance Analysis Strategy at Arm, gave the following definitions:

  • a benchmark: an abusive term for questionably constructed software, e.g. “This piece of software is a benchmark”;
  • to benchmark: to create a meaningless set of measurements, e.g. “We benchmarked the latest device”.

It is no surprise, then, to encounter the following attitude towards using questionably constructed software coupled with questionable methodology:

“In industry, we always ignore the evaluation in academic papers. It is always irrelevant and often wrong.” — Head of a major industrial lab, 2011

So if a benchmark is similar to a program, “the thing commonly produced in garages” (or research labs), what can be said about a benchmarking product?

Consider several purposes of benchmarking Z, where Z is an algorithm, model, technique, piece of software, hardware platform, and so on:

  • Internal R&D and competitive analysis: Is my Z better than competing ones? Under which conditions? At what cost?
  • External marketing and sales: I’ve got a proof that my Z is better than competing ones.
  • Purchasing decisions: Which Z is the best for my needs given my requirements on performance/quality/cost/etc.?

Benchmarking for internal purposes often resembles Brooks’s definition of a program: it is done by a closed circle of people (e.g. the authors of a paper), in a few environments (e.g. on a couple of platforms), under particular conditions (e.g. using peak-performance settings), and so on.

Sometimes this approach suffices, in particular in an academic setting where the primary goal is still to publish rather than to produce useful artifacts. (Otherwise, it would not be the case that only one in seven Machine Learning papers comes with any means to reproduce its results.)

Using the same approach for external purposes, however, usually fails to convince the skeptics and to inform the curious. For example, with scores of startups and established vendors developing hardware for Machine Learning from the data centre to the endpoint, how can a discerning buyer get a solid grasp of this landscape and make the best purchasing decision for their needs?

Following Brooks and Plowman, we offer a new definition:

  • a benchmarking product: a set of conclusive results on meaningful workloads, with easy means to reproduce them and/or verification by a credible third party.

If this sounds like an unattainable ideal, consider that the MLPerf community is making steady progress towards a framework for creating and evaluating such benchmarking products for Machine Learning Inference and Training. This is not to say that every MLPerf submission fully meets this definition today. But MLPerf does boost credibility, especially thanks to its strict submission rules and peer-review requirements.
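
One concrete piece of this framework is LoadGen, the load generator that every MLPerf Inference submission wraps around its system under test. Below is a minimal sketch of such a harness using the mlperf_loadgen Python bindings; the model call, sample counts and scenario are placeholders, and exact LoadGen signatures have varied slightly between versions, so treat it as an illustration rather than a reference implementation.

    import array
    import numpy as np
    import mlperf_loadgen as lg  # LoadGen Python bindings from the MLPerf Inference repo

    TOTAL_SAMPLES = 1024   # dataset size exposed to LoadGen (placeholder)
    PERF_SAMPLES = 256     # how many samples fit in memory at once (placeholder)

    _keep_alive = []       # keep response buffers alive until LoadGen copies them

    def issue_queries(query_samples):
        # LoadGen hands over a batch of queries; run the model and report back.
        responses = []
        for qs in query_samples:
            result = np.zeros(1, dtype=np.float32)  # placeholder for a real model call on sample qs.index
            buf = array.array("B", result.tobytes())
            _keep_alive.append(buf)
            ptr, length = buf.buffer_info()
            responses.append(lg.QuerySampleResponse(qs.id, ptr, length))
        lg.QuerySamplesComplete(responses)

    def flush_queries():
        pass  # nothing is buffered in this dummy system under test

    def load_samples(indices):
        pass  # a real harness would preload these dataset samples into RAM

    def unload_samples(indices):
        pass

    settings = lg.TestSettings()
    settings.scenario = lg.TestScenario.SingleStream
    settings.mode = lg.TestMode.PerformanceOnly

    sut = lg.ConstructSUT(issue_queries, flush_queries)
    qsl = lg.ConstructQSL(TOTAL_SAMPLES, PERF_SAMPLES, load_samples, unload_samples)
    lg.StartTest(sut, qsl, settings)  # writes mlperf_log_summary.txt and related logs
    lg.DestroyQSL(qsl)
    lg.DestroySUT(sut)

The point is not the harness itself, but everything a submission has to package around it: the model, the dataset, the run settings and the logs, all arranged so that a third party can rerun and verify the result.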

Just as a programming product requires more effort and thus costs more than a program, a benchmarking product costs more than a benchmark. One should, however, consider the potential return on investment (RoI). For example, the three hardware startups that submitted to MLPerf Inference v0.5 have all received an excellent RoI.

Is benchmarking-as-a-product only relevant for hardware companies? Not at all! There is no limit to what fair and credible benchmarking can help showcase: professional services (e.g. helping a vendor create a benchmarking product), intellectual property (e.g. state-of-the-art techniques), skills (e.g. in model design, optimization, retraining), tools (e.g. compilers), and so on.

Over time, any product will need to be accompanied by a benchmarking product to be taken seriously by buyers and investors. Perhaps that time has already come?
