A lighthearted but often serious look at the challenges involved in running High Performance Computing (HPC) workloads both on premises and on the Cloud with a slant towards financial services use cases.
Also available ad-, tracking-, and paywall-free, and in dark mode, here: https://cloudhpc.news/adventures-in-high-performance-computing/
Many epochs ago, when Dread Pirate Roberts still ruled the dark digital seas and Bitcoin could still be mined at home somewhat cost-effectively at a price below $15 a coin, I walked into a Director’s office at my investment banking client with a twinkle in my eye and a crazy idea in my head.
As was often the case, I had been engaged to work on a large high-performance computing (HPC) system at a financial institution. The purpose of the engagement was to implement additional risk measures for the regulator (this being a post-2008 financial meltdown world). However, I had a good working relationship with the client, and an often-discussed topic was the efficient utilisation of the rather large (tens of thousands of cores) compute capacity. The regulators were demanding an ever-increasing amount of risk reporting, requiring increased compute capacity. This led to additional hardware purchases on the one hand, and a massive amount of compute capacity sitting unused for large parts of the day on the other.
For the uninitiated, large financial institutions calculate their risk exposures (for the trading desks, the risk managers, and the financial regulators) at the end of every trading day in each region globally. Depending on the financial product, these can be incredibly complex calculations that take hours to perform using hundreds or even thousands of servers. The number of servers used is generally dictated by the amount of time available to calculate the risk. The calculation of risk is generally embarrassingly parallel in nature: using twice as many servers means calculating the risk in half the time. In theory at least; in reality, I have yet to see a risk system that can actually scale linearly. The window of time is whatever lies between the market closing and the earliest of the regulator or risk managers demanding the risk reports and the traders wishing to see their exposure.
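To put rough numbers on that, here is a back-of-the-envelope sizing sketch. Every figure in it (total core-hours, window length, cores per server, parallel efficiency) is an illustrative assumption of mine rather than a number from any real risk system.

```python
# Back-of-the-envelope sizing for an overnight, embarrassingly parallel risk run.
# All numbers are illustrative assumptions, not figures from any real system.
import math

def servers_needed(total_core_hours: float, window_hours: float,
                   cores_per_server: int = 32,
                   parallel_efficiency: float = 0.8) -> int:
    """Servers required to finish the nightly risk run within the window.

    parallel_efficiency < 1.0 reflects the fact that, in practice,
    no risk system scales perfectly linearly.
    """
    effective_core_hours = total_core_hours / parallel_efficiency
    cores_required = effective_core_hours / window_hours
    return math.ceil(cores_required / cores_per_server)

# e.g. 500,000 core-hours of pricing work and an 8-hour overnight window
print(servers_needed(500_000, 8))        # ~2,442 servers
# Halve the window and you need roughly twice as many servers
print(servers_needed(500_000, 4))        # ~4,883 servers
```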
Reduce that window of time or increase the amount of risk calculations you need to perform, and your choices are pretty limited. You can either make your risk system work faster (what I was there to do) or buy more hardware; usually, you also need to fix your software so it scales to the new hardware. Using the Cloud wasn’t really an option at this point in time, not because it didn’t exist or couldn’t provide what was required, but because of the internal policies still in place in most financial institutions. Investment banks back then still had a complete lack of acceptance of letting any “proprietary” code, let alone highly sensitive quant libraries for calculating risk, outside of the confines of the bank’s firewall. I say “proprietary” because much (but admittedly, not all) of this code is virtually identical across every investment bank… That’s perhaps a topic for another day!
While the landscape is changing a little now, with an increased demand for real-time risk (using the capacity available during the trading day), certainly at the time, increased compute capacity was a slightly double-edged sword in that it also meant increased wasted capacity. All day, while the markets were open, the servers sat idle, before running at 100% capacity all night. If you’ve ever seen enterprise CPU utilisation rates and wondered how on earth they are often so low as to be in single figures, this is a small insight into why.
HPC = High Performance Crypto?
It’s with this backdrop that I walked into the office of the Director for interest rate risk systems. My proposal was really quite simple. We had tens of thousands of cores sitting idle throughout the day. Let’s put them to good use.
Mining Bitcoin.
Done in conjunction with the FX desk managing the portfolio and trading out of (or into) Bitcoin as appropriate, and with Bitcoin at over $40,000 a coin today, they’d have made Apple look poor if we had done it.
Given Bitcoin’s reputation at the time though, I might as well have suggested smuggling cocaine across the Colombian border in the corporate jet. It was never going to happen. In fairness, I was only half serious in suggesting it. Much like the start of this article, it was mostly an opener to a larger conversation about the better and more efficient utilisation of compute.
Cloudy with Risk
While no use was made of any Cloud providers at the time, it wouldn’t be long before that would change, and the idea was being actively discussed (even if internal policies ruled it out). After all, it seemed to be a great natural fit. Why pay for compute capacity permanently when you can just rent it only when needed?
Sadly, the economics of it didn’t quite add up. The pricing models of the main vendors were such that even if the on-premises hardware was only being utilised 50% of the time (at best), it was still cheaper than renting the necessary capacity in the Cloud. This actually remains true today for most institutions. It was against this backdrop that I proposed my second crazy idea. Instead of renting compute from Cloud providers, what if the bank rented out its spare capacity when it wasn’t being used?
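A minimal sketch of that economic comparison, using entirely made-up prices, looks something like this: the effective cost of an on-premises core-hour is the fixed estate cost divided by how much of the estate you actually use.

```python
# On-premises vs Cloud cost comparison sketch.
# Both prices are made-up assumptions purely for illustration.

ON_PREM_COST_PER_CORE_HOUR = 0.02   # amortised hardware, power, datacentre, staff
CLOUD_COST_PER_CORE_HOUR   = 0.05   # assumed on-demand list price

def on_prem_effective_cost(utilisation: float) -> float:
    """Cost per *useful* core-hour when the estate is only partly utilised."""
    return ON_PREM_COST_PER_CORE_HOUR / utilisation

for utilisation in (0.25, 0.50, 0.75, 1.00):
    effective = on_prem_effective_cost(utilisation)
    cheaper = "on-premises" if effective < CLOUD_COST_PER_CORE_HOUR else "Cloud"
    print(f"{utilisation:4.0%} utilised: {effective:.3f}/core-hour -> {cheaper} wins")
```

Under these assumed numbers, on-premises still wins comfortably at 50% utilisation and only loses once utilisation drops well below that, which was roughly the shape of the argument at the time. As for the second crazy idea of renting out the spare capacity instead…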
Naturally, the odds of that ever happening were about as high as those of mining crypto or drug running with the corporate jet. The aversion to risk was sufficiently high that it wasn’t acceptable to run workloads on the Cloud. Given that, what were the chances that we’d allow strangers inside the firewall to run their botnets, spambots, and DDoS attacks from within the corporate network?
Compute Futures
The conversation, which took place over a number of weeks getting on into months, then naturally moved on. While it may not have been possible to run the risk calculation workload anywhere other than on premises in the short term, the possibility was too enticing to ignore. As such, we began to discuss the compute pricing implications of such a move. It was a banking client after all!
Most Cloud providers already offer multiple pricing structures. Essentially, you can buy the desired capacity at the list price, you can reserve it up front for a defined period of time in order to secure a discount, or you can take your chances and use what’s available at an even bigger discount, on the understanding that you may be asked to hand it back at any point.
Imagine if compute capacity was hotel rooms. If you were to walk in and ask for a room that you expect to keep until the morning, you’d pay the list price. Reserve that same room a year in advance for two weeks and the price per night would be lower. The additional option that compute provides is the ability to walk in and take the room at an even lower cost but on the understanding that if someone else decides they need a room and is willing to pay the list price, you end up sleeping on the pavement.
The last of the options is often referred to as the spot price, much as it would be in the financial markets. And much as in the financial markets, there is also anecdotal evidence that a large demand can shift the spot price.
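A tiny sketch of the three options, with entirely hypothetical prices, makes the trade-off clearer: spot is the cheapest per core-hour, but the work you lose to evictions eats into the discount.

```python
# The three pricing options from the hotel-room analogy, with hypothetical
# per-core-hour prices; none of these numbers come from any real provider.

ON_DEMAND = 0.050            # walk in and pay the list price
RESERVED  = 0.030            # commit up front for a defined period
SPOT      = 0.015            # cheapest, but you may be evicted at any time

def spot_cost_per_useful_hour(preemption_rate: float,
                              rerun_fraction: float = 1.0) -> float:
    """Rough expected spot cost per core-hour of *completed* work.

    preemption_rate: fraction of spot work lost to evictions
    rerun_fraction:  how much of an evicted job must be recomputed (1.0 = all)
    """
    wasted = preemption_rate * rerun_fraction
    return SPOT / (1.0 - wasted)

# Even if 20% of spot work is lost and has to be fully rerun,
# spot remains cheaper than reserved or on-demand in this toy model:
print(spot_cost_per_useful_hour(0.20))   # ~0.019 vs 0.030 reserved, 0.050 on-demand
```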
It wasn’t therefore much of a stretch to imagine a future where compute capacity becomes fungible (the lack of fungibility being one of the problems with comparing prices today) and is traded as a commodity, much like gas or electricity is today. Not that I imagine this would be a future that the Cloud providers would necessarily want¹.
There were numerous technical impediments to that at the time (these obstacles are slowly falling away; more on that later), but the ability to trade compute futures (say, on the CME) opens up all sorts of exciting opportunities. Like deliberately buying out all the available capacity in a volatile market to prevent your competitors from being able to run their risk calculations. Oh, what fun that would bring. Of course, this turns the conversation back to keeping the minimum required capacity on premises.
High Performance Kubernetes
Some years later, now consulting for a different client, I ran into the multi-Cloud HPC problem again. We are now firmly in the era of Cloud computing. Most financial institutions have at the very least started engaging with Cloud vendors in earnest, if they are not already running some workloads with them. It is also the era of peak Kubernetes. While we were certainly not the only ones thinking about running HPC under K8s (there’s a whole Special Interest Group devoted to that), I certainly hadn’t — and still haven’t — come across anyone else in the financial services sector that was or is looking into doing so. Please get in touch if you are though!
We spent a little time developing the idea with the client (some limited parts of it can still be found in our GitHub repo), but ultimately it didn’t gain enough traction and was abandoned. It was a challenging problem, not only in the realms of HPC but also in many other domains, and it remains so in some regards. While Kubernetes certainly helps in certain aspects (an almost standardised API across Cloud vendors to access compute, for example), there are other aspects in which it remains solidly inferior to even a poor HPC scheduler. This was much to my dismay after reading some of the initial claims from K8s fans.
Most importantly though, while it has helped solve one problem (that of a common API across Cloud providers to access compute), if anything, it makes the problem of attaining a common unit of compute even more difficult. After all, a cost-based scheduler to distribute your workload in the most cost-efficient manner across multiple Cloud vendors only makes sense once you have a fair, performance-adjusted measure of cost. What really is a vCore? More on this in our next article.
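To give a flavour of what such a measure might look like, here is a minimal sketch. The vendor names, prices, and benchmark scores are all assumptions of mine; in reality, the benchmark figure would be your own risk workload run on each provider’s vCore.

```python
# Sketch of a performance-adjusted cost measure for a cost-based scheduler.
# Vendors, prices, and benchmark scores are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Offer:
    vendor: str
    price_per_vcore_hour: float   # the headline price the vendor charges
    work_per_vcore_hour: float    # throughput of *your* workload on that vCore

    @property
    def cost_per_unit_of_work(self) -> float:
        # This, not the headline vCore price, is what the scheduler should minimise.
        return self.price_per_vcore_hour / self.work_per_vcore_hour

offers = [
    Offer("vendor-a", 0.048, 1.00),   # baseline vCore
    Offer("vendor-b", 0.041, 0.80),   # cheaper, but a slower vCore
    Offer("vendor-c", 0.055, 1.25),   # pricier, but a faster vCore
]

for offer in sorted(offers, key=lambda o: o.cost_per_unit_of_work):
    print(f"{offer.vendor}: {offer.cost_per_unit_of_work:.4f} per unit of work")
```

In this toy example, the vendor with the highest headline vCore price ends up cheapest per unit of work, which is exactly why a raw vCore count is not a useful common unit of compute.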
Cloud Vendors as Utility Companies
This is where I still find myself today. Working with another client, more than 10 years on from some of my original musings around cost-based HPC workload scheduling across vendors, and still no closer to a standardised solution. I have seen limited proprietary solutions in use though.
Much like how utility companies (think of ISPs as a good recent example) resisted attempts to devolve into “dumb pipes”, I suspect that our current Cloud computing vendors are doing much the same thing. Why provide fungible compute, competing purely on price and uptime, when you can compete on the basis of other value-added, higher-margin services? Sadly for the rest of society, the longer they hold out, the longer we wait to evolve to the next phase of the information age and its associated benefits.
¹ If we extend the analogy of utility providers (and looking at mobile telephony as a recent example is probably helpful) to the Cloud providers, it slowly moves the utility/Cloud provider towards a progressively less profitable business as margins are cut and, once the commodity is truly fungible, competition is on the basis of price alone. Much like many utility companies that went before them, it’s therefore no surprise that we see Cloud providers trying to provide and encourage the use of features with more added value for the client. Features that won’t be fungible. Features that lock you in to that provider.
Fundamentally, I still believe that, eventually, compute will be no more than a fungible resource, and as such, we’d be better off architecting and building our systems with that in mind. We’re not there yet, but we will get there.