Why Infrastructure Hardware Performance is hard… or is it?

Indraneil Gokhale
Published in Box Tech Blog
Mar 27, 2019

If you have on-premise infrastructure, or are planning to have on-premise infrastructure anytime in the future, this is for you!

Qualifying hardware & making the right infrastructure choices can be a challenging & sometimes daunting task. Selecting hardware is what we make of it. It’s the Time, Quality, and Cost triangle: most of the time we can have 2, and very rarely all 3. And that’s what makes Infrastructure Hardware hard, or interesting at the very least.

Typically, Hardware Qualification takes anywhere between 3 and 9 months depending on the scale. However, hardware that’s critical to the core of a business can take longer. Ultimately, we optimize for Total Cost of Ownership (TCO), and TCO per Performance would potentially dictate the kind of hardware we qualify that eventually sees production traffic. If the hardware is in the critical path of the business or touches production data, it is imperative to have reliable & efficient hardware that matches the performance requirements.
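To make that concrete, here’s a deliberately simplified sketch (all numbers made up) of how TCO per Performance might be compared across two candidate SKUs:

```python
# Rough, illustrative TCO-per-performance comparison for two hypothetical SKUs.
# All numbers are made up; plug in your own capex, opex, and benchmark results.

def tco_per_perf(capex, monthly_opex, lifetime_months, perf_score):
    """Total cost of ownership over the hardware lifetime, per unit of performance."""
    tco = capex + monthly_opex * lifetime_months
    return tco / perf_score

# SKU A: cheaper box, lower benchmark score (e.g. requests/sec from a proxy workload test)
# SKU B: pricier box, higher benchmark score
sku_a = tco_per_perf(capex=8_000, monthly_opex=120, lifetime_months=48, perf_score=10_000)
sku_b = tco_per_perf(capex=11_000, monthly_opex=150, lifetime_months=48, perf_score=16_000)

print(f"SKU A: ${sku_a:.2f} per unit of performance")
print(f"SKU B: ${sku_b:.2f} per unit of performance")
```

In this made-up case the pricier box still wins on TCO per Performance, which is exactly the kind of conclusion the qualification data should let you reach.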

Qualifying hardware that is RAS (Reliable, Available, Serviceable) has many components to it, mainly (but not limited to) -

  1. Hardware Performance Testing
  2. Vendor Management
  3. Data Center Operational Excellence

Hardware Performance Testing

Choosing the perfect hardware for each workload type is “Nirvana”

Companies like Google, Facebook, Microsoft/LinkedIn, and Amazon are probably closer to this nirvana than others. However, it must have been a long & arduous path for them, with probably much more still to do.

A comprehensive analysis of the hardware (e.g. a compute server) from concept design all the way to production readiness is a must. Telemetry & tooling go a long way in providing reliable information to make reliable data-based decisions. The key ingredient in qualifying good server hardware is ‘Metrics’. Collecting meaningful metrics & extracting information from them can be a tedious task. The challenge is not just to identify the metrics we want to collect but also not to lose sight of the larger picture, i.e. the problem we are solving.
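As a minimal sketch of what collecting metrics on a single box could look like (assuming the third-party psutil library; the specific counters and interval are arbitrary choices):

```python
# Minimal telemetry snapshot for a single server, using the psutil library.
# What you collect (and how often) should be driven by the question you're
# answering, not by what's easy to grab.
import json
import time

import psutil

def snapshot():
    """Collect a point-in-time view of CPU, memory, disk, and network counters."""
    return {
        "ts": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "mem": psutil.virtual_memory()._asdict(),
        "disk_io": psutil.disk_io_counters()._asdict(),
        "net_io": psutil.net_io_counters()._asdict(),
    }

if __name__ == "__main__":
    # Emit one JSON line per interval; ship these to whatever telemetry pipeline you use.
    for _ in range(5):
        print(json.dumps(snapshot()))
```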

Depending on the use case of the hardware, the extent of performance testing changes. For example, any hardware that would eventually see production traffic needs extensive EVT, DVT, SVT & PVT testing (Engineering, Design, System & Production Validation Testing) compared to other ‘support’ hardware in the data center.

Everyone (hopefully!) agrees Hardware Performance Testing needs to be done to qualify quality hardware. However, every hyperscale entity has its own way of doing it depending on scope, timelines, budget, etc.

If we were to pick a compute server as an example, the easiest way to test the hardware is to put the box(es) in production directly, collect production metrics, and then fine-tune the box to our requirements. However, this approach is best suited to non-critical hardware. Another approach is to test the compute server design in a protected environment (i.e. a lab) and run functional & performance tests (e.g. burn-in tests, stress tests & workload tests) to collect metrics for analysis. A more detailed, more complex, and therefore more rewarding path is to create a ‘test’ production set-up in a protected environment like a lab, run production traffic replays, and collect metrics that are very close to production.

Hardware Performance Testing of a server can be broken down into multiple levels depending on the scope of the project. When qualifying quality hardware, performance & functional testing can be done at a sub-component level: the testing can target the system as a whole and, if needed, individual components of the system.

Typically, we could break down Hardware Performance Testing into 4 broad categories -

  • Audit Tests
  • Burn-In Tests
  • Stress Tests (Saturation Tests)
  • Proxy Workload Tests

Audit Tests — These are physical audits when the gear lands in your “protected test environment”, which could be a lab or a dev-zone: checking rail racks and CMAs (cable management arms), checking that the PDUs are accessible, Redfish/IPMI checks, drive health, etc.
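A hedged sketch of one such check over Redfish is below; BMC addresses, credentials, and the exact fields exposed vary by vendor, so treat it as a starting point rather than a complete audit:

```python
# Sketch of a Redfish-based audit check: confirm the BMC answers and the system
# reports healthy status. BMC address, credentials, and the fields available
# vary by vendor, so this is a starting point, not a complete audit.
import requests

def audit_system_health(bmc_host, user, password):
    base = f"https://{bmc_host}/redfish/v1"
    session = requests.Session()
    session.auth = (user, password)
    session.verify = False  # many BMCs ship with self-signed certs

    systems = session.get(f"{base}/Systems", timeout=10).json()
    for member in systems.get("Members", []):
        system = session.get(f"https://{bmc_host}{member['@odata.id']}", timeout=10).json()
        status = system.get("Status", {})
        print(f"{system.get('Id')}: power={system.get('PowerState')}, "
              f"health={status.get('Health')}, state={status.get('State')}")

# Example (hypothetical BMC address and credentials):
# audit_system_health("10.0.0.42", "admin", "password")
```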

Burn-In Tests — These are very basic functional tests to make sure there are no Dead On Arrival (DOA) parts & components. This includes basic benchmarks like memtest and Linpack, and resiliency tests for storage media (like SSDs and HDDs).
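A minimal burn-in harness could simply shell out to the usual tools and fail loudly on any non-zero exit code; the commands and durations below are placeholders, not a prescribed recipe:

```python
# Minimal burn-in harness: run a short pass of each tool and treat any non-zero
# exit code as a DOA signal. Commands and durations are placeholders; swap in
# whatever your qualification plan actually calls for.
import subprocess
import sys

BURN_IN_TESTS = [
    # CPU + memory pressure for 10 minutes (requires stress-ng to be installed)
    ["stress-ng", "--cpu", "0", "--vm", "2", "--vm-bytes", "75%", "--timeout", "600s"],
    # Write-and-verify pass on a scratch file to shake out bad storage media
    ["fio", "--name=burnin", "--filename=/tmp/burnin.dat", "--size=1G",
     "--rw=write", "--verify=crc32c", "--direct=1"],
]

def run_burn_in():
    for cmd in BURN_IN_TESTS:
        print("running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"FAILED ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if run_burn_in() else 1)
```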

Stress Tests — These tests are designed to find the saturation point of each part or component in the System Under Test. For example, STREAM to check usable memory bandwidth, HPL or Linpack for compute & power tests, FIO for I/O, qperf/iperf for network, etc. The idea is to check what the saturation levels or performance limits are for each component in the DUT (Device Under Test).
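For example, an I/O saturation check might drive a scratch file with FIO, parse its JSON output, and compare the achieved bandwidth against an expected floor (the target file, block size, and threshold here are assumptions):

```python
# Sketch of an I/O saturation check: drive a file with fio, parse the JSON
# output, and compare achieved bandwidth against an expected floor. The target
# file, block size, and threshold are assumptions to adapt per platform.
import json
import subprocess

def io_saturation_test(target="/tmp/stress.dat", expected_read_mbps=400):
    cmd = [
        "fio", "--name=satn", f"--filename={target}", "--size=2G",
        "--rw=randread", "--bs=128k", "--iodepth=32", "--ioengine=libaio",
        "--direct=1", "--runtime=60", "--time_based", "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    job = json.loads(out.stdout)["jobs"][0]
    read_mbps = job["read"]["bw"] / 1024  # fio reports bandwidth in KiB/s
    print(f"measured random-read bandwidth: {read_mbps:.0f} MiB/s")
    return read_mbps >= expected_read_mbps
```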

Proxy Workload Tests — These tests are the closest to the real-world service or application. The purpose of these tests is to simulate the production workload/traffic in a lab environment and collect performance metrics to evaluate the hardware.
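One lightweight way to approximate this, assuming an HTTP-fronted service and a captured list of request paths, is a simple replay loop that records latency percentiles (the target host and paths below are hypothetical):

```python
# Sketch of a proxy workload test: replay captured request paths against a
# system under test and record latency percentiles. The log format, target
# host, and single-threaded pacing here are hypothetical; real replays usually
# need request bodies, auth, and realistic concurrency as well.
import statistics
import time

import requests

def replay(paths, target="http://sut.lab.example:8080"):
    latencies = []
    for path in paths:
        start = time.perf_counter()
        requests.get(f"{target}{path}", timeout=5)
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    latencies.sort()
    p50 = statistics.median(latencies)
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"requests={len(latencies)} p50={p50:.1f}ms p99={p99:.1f}ms")

# Example: paths pulled from a (hypothetical) access-log capture
# replay(["/api/v1/files/123", "/api/v1/folders/0", "/healthcheck"])
```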

Automation is the thread that binds all these concepts into one single entity or system.

Be it provisioning, monitoring or alerting, running functional & performance tests or telemetry — Automation is King!
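Tying it together, a qualification pipeline can be expressed as an ordered list of stages that all have to pass before a SKU moves forward; this is just a sketch of the shape, not any particular system:

```python
# Sketch of an automated qualification pipeline: each stage is a callable that
# returns True/False, and a SKU only advances if every stage passes. The stage
# functions here are placeholders for the audit, burn-in, stress, and proxy
# workload steps described above.
from typing import Callable, List, Tuple

def run_pipeline(sku: str, stages: List[Tuple[str, Callable[[], bool]]]) -> bool:
    for name, stage in stages:
        print(f"[{sku}] running stage: {name}")
        if not stage():
            print(f"[{sku}] FAILED at {name}; halting qualification")
            return False
    print(f"[{sku}] all stages passed")
    return True

# Placeholder stage implementations; wire in the real tests from the sections above.
stages = [
    ("audit", lambda: True),
    ("burn-in", lambda: True),
    ("stress", lambda: True),
    ("proxy-workload", lambda: True),
]

if __name__ == "__main__":
    run_pipeline("candidate-sku-01", stages)
```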

Thanks to Matt Singer (Sr. Staff Engineer at Twitter), Mike Peterson (Sr. Hardware Engineer at Netflix), and Janet Petelo (Staff Capacity Lead at Box) for providing feedback.
