Supercomputing and the Datacenter Part 1 — A primer

Mix Amos
Jun 26, 2019

This is part 1 of a 3-part series on the HPC landscape and the ongoing supercomputing race. It serves as a light introduction to the corporate players and the trends of the industry; feel free to skip it if you’re already familiar with both.

Summit, the world’s most powerful supercomputer

Supercomputing and datacenters are among those topics that only nerds really care about, or at least that’s how it used to be. Anyone paying attention to the corporate slang that comes out of marketing and investor relations departments will have noticed a few terms popping up lately: “Data-driven”, “AI integration”, “Machine Learning assisted solutions”, or some other permutation of contemporary technological buzzwords.

Most of the time these claims are patently false, but in the rare cases where a company’s key infrastructure really does involve a powerful computing cluster crunching away at a large dataset, you can bet that it has a definite competitive edge. In the OEM industry, there’s a raging war over who can amass the most compute resources to verify their designs as fast as possible. In the pharmaceutical industry, computer simulation of drug candidates promises to shorten development time and reduce the number of human trials needed. Genetic research, something finally taking off after being stifled by a needlessly panicky public, requires exploring such a massive number of genetic combinations that simulation is practically a requirement if you want to be competitive. Companies are rushing to integrate HPC into their design flows to gain competitive advantages, leading to an exciting boom in the HPC market.

Modeling, Simulation and AI — the next driver of HPC adoption

In a world where most OEMs and engineering companies use some sort of simulation software to test and refine their designs before making physical models, getting your design simulated and verified as fast as possible means a time-to-market advantage that’s absolutely crucial. In January, Western Digital, a designer of hard-disk drives and other storage products, ran a parametric simulation on Amazon’s AWS cloud computing service. It was an engineering simulation of the read/write head that uses lasers to manipulate bits of data on a disk platter. Normally, doing this on their previous 80K cluster setup would require close to 4 days, but on AWS they were able to finish the entire simulation in a mere 7 hours, all from the cloud.

If you were an engineer on Western Digital’s team, the difference is massive: you either let the simulation run for the greater part of a week while you twiddle your thumbs waiting for the results, or you let it run overnight and come back the next morning ready to work on a new batch of design data. And this was only one compute cycle; it takes many more of these to move from the product development and design stage to tapeout and making money off of it. The time saved would have been on the order of months, and that’s for a company that’s already extremely well versed in HPC. Western Digital’s competitors (to the best of my knowledge) are still reliant on smaller clusters with barely a hundredth of that compute power. What an astonishing lead!
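To put rough numbers on that claim, here’s a minimal back-of-the-envelope sketch. The 4 days and 7 hours come from the figures above; the number of design cycles is purely my own assumption for illustration, not something Western Digital has published.

```python
# Wall-clock figures quoted above: roughly 4 days on the old cluster, 7 hours on AWS.
OLD_CLUSTER_HOURS = 4 * 24
CLOUD_HOURS = 7

# ASSUMPTION (not from Western Digital): how many simulate-and-refine
# cycles one product goes through before it ships.
ASSUMED_DESIGN_CYCLES = 20


def speedup(old_hours: float, new_hours: float) -> float:
    """How many times faster the new setup finishes a single run."""
    return old_hours / new_hours


def days_saved(old_hours: float, new_hours: float, cycles: int) -> float:
    """Total wall-clock days saved if the runs happen back to back."""
    return (old_hours - new_hours) * cycles / 24


if __name__ == "__main__":
    print(f"Speedup per run: {speedup(OLD_CLUSTER_HOURS, CLOUD_HOURS):.1f}x")
    print(f"Days saved over {ASSUMED_DESIGN_CYCLES} cycles: "
          f"{days_saved(OLD_CLUSTER_HOURS, CLOUD_HOURS, ASSUMED_DESIGN_CYCLES):.0f}")
```

With these inputs each run finishes roughly 14 times faster, and 20 back-to-back cycles save about 74 days, which is how you get to “months”. Change the assumed cycle count and the total scales linearly.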

More and more companies are moving their design flows to HPC, either by buying their own supercomputers or renting them from a cloud provider. This coincides with the machine learning / AI craze, which requires significant compute resources to train functional models. As more and more governments, militaries and corporations acquire ever larger and more powerful compute clusters in a war of who can get the better AI model or product, there can only be one winner — the arms dealers of the war.

IBM, Intel, AMD, Nvidia and ARM — the big 5

First off, let’s introduce the big 5. These are the corporations that design the actual computer chips that run on the servers and the supercomputers. Each one is uniquely positioned with their own IP blocks and market advantages. If you’re already familiar with them, skip ahead. If not, you’ll want to give them a read.

Intel: The leading industry giant in the server and desktop markets. Also currently being dethroned by AMD, but more on that later. Intel has been around for longer than I have been and it shows. They invented the x86 instruction set, which goes into nearly every desktop and laptop computer and also has a 95%+ market share in the datacenter space. They absolutely dominate this market segment with their Lake series of products and have a diverse set of IP blocks as well, thanks to their many acquisitions. They make FPGAs for specialized compute scenarios (through their acquisition of Altera), microcontrollers for tiny devices and wearables, and modems and other connectivity electronics used in smartphones, notably by Apple.

Unique market advantage: They are the only chip designer among the big 5 to have their own fabrication plants and the ability to make their own chips. Historically, their manufacturing edge has been one of the primary drivers of their market leadership, though that has definitely changed in recent times.

AMD: My favorite company on this list, AMD has been the eternal underdog for their 50 years of existence. They are the smallest corporation on this list, with a market cap of “only” $32 billion. They have repeatedly demonstrated extreme agility in their development cycles, most famously with their Athlon 64 and Opteron lines of processors in the 2000s era and now with their market-shattering Zen microarchitecture. Currently on a massive upswing thanks to extremely robust engineering teams, AMD is locked in an epic battle not only with Intel but also with Nvidia, two giants whose combined market capitalization is nearly 10 times AMD’s own. They produce desktop and server processors and are aggressively putting the heat on Intel, gaining market share in all segments that the two compete in. They also produce GPUs, though that business is anemic compared to the brutally savage beating their CPU segment is inflicting on Intel.

AMD is also unique in the industry for having a semi-custom design business, producing the chips for Sony’s PS4 and Microsoft’s Xbox, as well as their upcoming successors. They were at one point contracted by Amazon to design ARM-based server processors but did not meet performance targets and were replaced by Annapurna Labs, which Amazon acquired; this was most likely due to AMD not having enough engineering staff at the time.

Unique market advantage: AMD is the only company on this list aside from ARM that has both CPU and GPU IP blocks, giving them a market advantage in offering compute solutions that require both in a single package. They have also co-developed memory solutions such as HBM, now a JEDEC standard, and design their own interconnects. This broad collection of IP blocks lets AMD create a product that fits a client’s use case depending on precisely what they want. Currently there are no HPC clusters that take advantage of this custom functionality, but many planned and upcoming supercomputers (military ones too) are lining up for AMD to design chips for them.

IBM: An ancient giant fallen on hard times (from the consumer market perspective), their titanic duel with Intel may have been lost but they are by no means dead. Their modern HPC product stack consists of their POWER9 family of chips based on the Power ISA, which is specialized for heavily parallelized work. Though it’s very difficult to buy an IBM chip these days, their designs are very good at what they do, powering Summit, the world’s fastest supercomputer and the one pictured in this article’s header. IBM specializes in making military hardware for the USA, including radiation-hardened chip packages for use in space exploration and weapons systems. The Curiosity rover uses a radiation-hardened processor based on an IBM PowerPC design, and most likely so do American nuclear submarines and ballistic missiles. IBM focuses on these niches and, despite their lack of presence in the consumer market, is still very profitable.

Their most interesting department is their semiconductor research division. There was once a time when IBM fabricated its own chips, but no longer; now they have partnered with Samsung, which handles the actual fabrication while IBM deals with the theory and creates transistor designs. Thanks to this, Samsung has the industry lead in EUV deployment and GAAFET designs, which makes them the top contender for 3nm node manufacturing and beyond, though this will only come into play in the future and will be the focus of an entire article.

Unique market advantage: IBM has a stranglehold on military applications and is found in many high-end supercomputers, thanks to engineering departments that design entire supercomputer packages for clients. From the computing elements to the motherboard design, to the networking backend and the software stack, even the cooling systems and building infrastructure, IBM is a one-stop shop for your supercomputing needs. They even have their own cloud computing service, IBM Cloud, with bare metal server offerings. Though not as large as Amazon’s or Microsoft’s cloud services, it does offer very interesting setups designed for researchers and a comprehensive suite of features that neither of the other two specifically caters to.

Nvidia: Uncontested masters of the GPU space (for now), Nvidia has made great strides in solidifying their position in supercomputing with their Tesla line of accelerator cards, most recently on the Volta architecture. They compete against AMD in consumer markets, especially in gaming, and have scored many wins against their rival in both technical superiority and marketing clout. Even where they have been beaten by their competitor, their marketing department has been very effective at convincing consumers to buy their products anyway. This led to a scenario in which AMD, starved for cash from weak product sales, stalled in R&D and was surpassed by the upstart Nvidia. Nvidia currently leads in every single market it competes in, from automotive to machine learning to gaming and scientific research. This may not be the case for much longer, however, as certain very promising engineering patents by AMD indicate, but I digress.

Unique market advantage: Aside from their commanding market lead, Nvidia has a mature and robust software stack for GPU-accelerated computing, built around CUDA, while their rivals have practically nothing comparable. This makes them the only viable option when purchasing GPUs for acceleration. AMD is challenging this, but it has very few design wins so far and that is unlikely to change over the next few quarters.

ARM: Though not really a competitor to the 4 giants mentioned previously, they have recently stepped up their efforts to enter the datacenter and supercomputing space with their own core designs, based on the ARM architecture they have used to gain near-total dominance of the smartphone space. They don’t actually make chips themselves; they design the ARM instruction set and the compute blocks that implement it, which third parties license and integrate into SoCs alongside their own custom logic, leading to some very large variance between each party’s implementation. Thanks to the RISC design of their instruction set, which lends itself to lower power usage, none of the previous 4 is likely to challenge them in mobile, and anyone trying to design a RISC microarchitecture from the ground up will take the greater part of a decade to become competitive. IBM could theoretically try entering the smartphone market, since they could make their chips more power efficient while still having the IPC to match, but that would involve building an entire software stack and interconnects; it’d just be too much effort.

Unique market advantage: ARM has a massive number of corporations in its gigantic ecosystem. Qualcomm, Apple, MediaTek, Huawei and AMD (though since discontinued) are all licensees of the ARMv8 ISA and architecture, just to name a few. Their complete domination of the smartphone market ensures that they have a software stack to build on and should have few issues migrating into the supercomputing space. The small die size of individual cores also means they can provide greater density than any of their CISC-based counterparts at greater power efficiency, though at a significant per-core performance penalty. Their Neoverse product stack is still in a fledgling state and it remains to be seen whether their datacenter foray will be successful, but all signs point towards it only being a matter of time before they catch up to x86 in IPC.

Financial implications and trends

So let’s say you do want an HPC cluster to get things done faster; how do you get one? You could build your own cluster, which is going to be incredibly expensive but will pay for itself over time, provided you can justify the exorbitant upfront cost. Or you could use a smaller on-premise cluster for small tests and rent a large number of cloud instances when you need a burst of computational power. This mixed model of cloud and on-premise compute is very attractive for smaller design companies that can’t justify the cost of a gigantic 2 megawatt compute cluster like the one Intel has.
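Whether a cluster of your own “pays for itself” comes down to a simple break-even calculation. Here’s a minimal sketch of it; the function name and every dollar figure in it are hypothetical placeholders I picked for illustration, not real quotes from any vendor or cloud provider.

```python
from typing import Optional


def break_even_hours(cluster_cost: float,
                     own_cost_per_hour: float,
                     cloud_cost_per_hour: float) -> Optional[float]:
    """Compute hours after which owning the cluster beats renting.

    Returns None when renting is never more expensive per hour,
    i.e. the cluster never pays for itself.
    """
    saving_per_hour = cloud_cost_per_hour - own_cost_per_hour
    if saving_per_hour <= 0:
        return None
    return cluster_cost / saving_per_hour


if __name__ == "__main__":
    # HYPOTHETICAL figures, for illustration only.
    hours = break_even_hours(cluster_cost=2_000_000,
                             own_cost_per_hour=50,
                             cloud_cost_per_hour=300)
    print(f"Break-even after {hours:.0f} hours of use")
```

If you only need big bursts a few times a year, you may never reach that break-even point, which is exactly why the mixed model appeals to smaller shops.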

The cloud providers are all too happy to let companies like Western Digital rent their spot instances; it ups their utilization rate and gets them a faster ROI. I wouldn’t be surprised if AWS opens up a compute-oriented set of instances for use in acceleration and training; Microsoft already has one, and there’s a handful of services offering dedicated training/inference cloud hardware already. As HPC giants continue to expand their datacenters and pour billions of dollars’ worth of revenue into the big 5, the financially savvy amongst us may consider investing in them. I myself have been calling AMD out since 2017, back when their stock was a mere $11 compared to today’s $32. Part 2 of this series will be dedicated to analyzing the tech of the big 5 and which ones are likely to gain steam in the following quarters.

As computing becomes an increasingly central component of a modern economy, national governments and power blocs are mobilizing to amass their own computing resources, whether for scientific research, military uses or just plain political clout. China in particular has been very aggressive, designing and manufacturing its own chips for its supercomputers. Europe is catching up, launching its own processor initiatives and developing specially designed ARM processors for use in its own computing grids. The US isn’t investing as much as its two competitors, but still holds the crown with its latest IBM and Nvidia powered Summit supercomputer, and recent actions seem to indicate a renewed interest in winning the supercomputer race. I will be dedicating part 3 of this series to the supercomputing race and its implications.
