Can We Disrupt Computation in the 2020s?

Guy Harpak
7 min read · Mar 6, 2020


Disclaimer: for simplicity, some of the terms used in this article are not 100% technically accurate. My perspectives are my own and in no way reflect those of the people or companies I work with or for.

Development in computation has taken an interesting turn: Moore's law's original proposition was that integrated circuits (ICs) would keep getting denser, so that with more transistors per unit area, the computational power of a single IC would keep increasing. When shrinking transistors hit a physical limit, computational power kept increasing by integrating diverse functions into the same package. Today, as we approach another plateau in computation, I believe the next leap will only come if we change the way we utilise the different types of processors. This article is about compute orchestration: a concept for optimising binary code to best utilise the available computation hardware in a data center (and more specifically, the available processor types in a data center).

Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018

Compute Orchestration

There are many processor types. To name a few: microprocessors in small devices like electronic calculators, general-purpose CPUs in personal computers, and GPUs for graphical applications. Just recently, new types have entered the field, like TPUs and neuromorphic processors for running machine learning algorithms efficiently. To this mix we can also add my personal favourite: the flexible FPGA.

Variety creates potential: if the existing compute technologies can be repurposed and orchestrated to act as a coordinated swarm, each unit handling the workloads it is best suited for, our data centres could deliver more computation bandwidth with less power, lower latency and lower TCO (Total Cost of Ownership).

Repurposing and better utilisation of existing processors is already happening: GPUs, for example, have already evolved from something gaming freaks brag about into a core enterprise component, used for machine learning workloads and for mining cryptocurrency.

The technology of coordinating different processor types to work as an optimised swarm is called Compute Orchestration.

The Compute Orchestration thesis can be described in 4 simple bullets:

  1. There are various types of processing units (processors) at our disposal.
  2. Optimal utilisation of these processor types requires smart allocation of the different workloads to the different processors.
  3. Programmers can't be expected to optimally adapt their own code, or the third-party code they use, to the available processing units.
  4. Better compilers are not the solution: what is required is automatic optimisation of code during runtime, automatic allocation to available computational resources, and agile hardware configuration.

Compute orchestration is the automatic optimisation & allocation of binary code to the most relevant computation units available.
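
To make the thesis concrete, here is a toy sketch of the allocation decision in bullet 2. Everything in it (the affinity table, the allocate function, the workload labels) is hypothetical and purely illustrative; a real orchestrator would operate on binary code and live telemetry, not a static lookup table.

```python
# Hypothetical illustration only: no such orchestration library exists.
# Preference order per workload type; a real system would learn this.
WORKLOAD_AFFINITY = {
    "dense_linear_algebra": ["TPU", "GPU", "CPU"],
    "branchy_control_flow": ["CPU"],
    "streaming_pipeline":   ["FPGA", "GPU", "CPU"],
}

def allocate(workload_kind: str, available: set) -> str:
    """Pick the most suitable processor type that is actually available."""
    for processor in WORKLOAD_AFFINITY[workload_kind]:
        if processor in available:
            return processor
    raise RuntimeError(f"no suitable processor for {workload_kind}")

# On a host with no matching accelerator, the call degrades gracefully.
print(allocate("dense_linear_algebra", {"CPU", "FPGA"}))  # -> CPU
```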

To illustrate the trajectory of this technology, this article splits the evolution of compute orchestration into four generations, summarised below:

Gen 0: static allocation, dedicated co-processors
Gen 1: static allocation, heterogeneous computation
Gen 2: dynamic allocation, heterogeneous computation
Gen 3: dynamic allocation, cognitive computing

Gen 0: static allocation, dedicated co-processors

Intel i386 with an i387 math co-processor

This is the most prevalent type of compute orchestration. Most of today's devices include co-processors designed to off-load specific tasks from the CPU. Usually, the compiler, the OS or the CPU architecture takes care of the allocation. This method is fairly seamless to the developer, but it is also limited in functionality.

The best-known example is the use of cryptographic co-processors for executing cryptographic functions. If we want to be even more liberal in our definition of a co-processor, the use of an MMU (Memory Management Unit) to manage virtual memory address translation can also be considered an example.
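
A minimal Python sketch of how invisible this allocation is to the developer: hashlib delegates to OpenSSL, and whether the digest below is computed in plain software or with dedicated crypto silicon depends entirely on the stack underneath, not on this code.

```python
import hashlib

# The developer just calls a library function...
digest = hashlib.sha256(b"some payload").hexdigest()

# ...and the layers below (OpenSSL, the OS, the CPU) decide whether it runs
# as plain software or on dedicated crypto hardware. The allocation is
# static and completely invisible from up here.
print(digest)
```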

Gen 1: static allocation, heterogeneous computation

This is where we are now. In Gen 1, software relies on libraries, JIT (Just-in-Time) compilers or VMs to best utilise the available hardware. Let's call the collection of components that help utilise the hardware a "framework". Current frameworks help with things like correct usage of GPUs in the cloud. Usually, smart allocation to bare-metal hosts must still be done by the developer. This phenomenon sparked my interest in compute orchestration in the first place, as it proves there is more "slack" in our current hardware.

An NVIDIA GPU

OpenCL is an example of a framework that allows executing compute kernels on different processors. TensorFlow computation graphs can be executed on heterogeneous elements, assigning different computation nodes to different processors.
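
As a minimal sketch of what Gen 1 looks like in practice, TensorFlow (2.x) lets the developer pin operations to devices by hand. Note that the placement is still the programmer's decision, which is exactly the limitation discussed next:

```python
import tensorflow as tf

# Gen 1 in a nutshell: the developer statically decides what runs where.
with tf.device("/CPU:0"):
    a = tf.random.uniform((1024, 1024))   # created on the CPU

with tf.device("/GPU:0"):
    b = tf.matmul(a, a)                   # runs on the first GPU, if present

print(b.device)
```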

As cool as this is, I believe it is not where the real edge is. Existing frameworks still require too much "intelligent design" to be optimal: they are developer-dependent. And while a well-designed TensorFlow job in my data centre could run optimally, no legacy code from 2016 is ever going to utilise my rack of GPUs. My view is that the leap will be enabled by frameworks that are more dynamic and automatic.

Gen 2: dynamic allocation, heterogeneous computation

Computation can take a lesson from the world of storage: products that increase the utilisation, reliability and efficiency of storage have been innovating for years. The storage industry is rife with abstraction layers and specialised filesystems that improve efficiency and reliability during the operational phase. Computation, however, still amounts to a simple allocation of hardware resources. Smart allocation of workloads to specific hardware could yield better performance and efficiency for massive data centres (i.e. hyperscalers like cloud providers). The infrastructure for this leap is already there, supported by trends like resource disaggregation in the data center, the introduction of diverse accelerators, and increased work on automatic acceleration (for example: Workload-aware Automatic Parallelization for Multi-GPU DNN Training).

For specific applications and high-level resource management we already have automatic allocation. For example, the Mesos project (paper) allows fine-grained resource sharing, and Slurm is a cool open-source project for cluster management.

The major leap here will be composed of two steps: mapping the processors (i.e. the compute environment) and adapting the workload. Imagine a situation where the developer doesn't have to optimise her code for the hardware; instead, the optimisation happens in real time, according to the specific resources available at runtime. Since our cloud environments are heterogeneous and dynamic, the execution of our code should be too, with no reliance on the developer.
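
Here is a toy sketch of the difference (again in TensorFlow, and again illustrative only): instead of hard-coding the device, the code inspects what is available at runtime and places the work accordingly. A real Gen 2 system would do this for arbitrary binaries, transparently, and with far smarter heuristics.

```python
import tensorflow as tf

def pick_device() -> str:
    # Map the compute environment at runtime; a real orchestrator would
    # also weigh current load, latency targets and power budgets.
    if tf.config.list_physical_devices("GPU"):
        return "/GPU:0"
    return "/CPU:0"

def run(workload):
    # Adapt the workload to whatever the environment offers right now.
    with tf.device(pick_device()):
        return workload()

# The same code runs unchanged on a GPU box or a CPU-only host.
result = run(lambda: tf.matmul(tf.random.uniform((512, 512)),
                               tf.random.uniform((512, 512))))
print(result.device)
```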

Gen 3: dynamic allocation, cognitive computing

“A thought, even a possibility, can shatter and transform us.”
Friedrich Wilhelm Nietzsche

With enough vision, we can imagine a technology that automatically re-designs our data centres according to the current needs of the running applications. This revolution in the way data centers compute has already started, for example with the small step of integrating FPGAs into new domains (FPGAs in servers, FPGA machines in AWS, FPGAs in NICs). To illustrate this type of compute orchestration, let's look at a leading example: Microsoft's Project Catapult is "an initiative to transform cloud computing by augmenting CPUs with an interconnected and configurable compute layer composed of programmable silicon" (yes, that's copy-pasted from Microsoft's site…). Just look at the timeline on the project's website: it's fascinating. The project started in 2010 with the goal of improving Bing search queries using FPGAs. Pretty quickly, it sparked a new data center architecture that uses FPGAs as "bumps in the wire" to accelerate compute. The project also proposed an architecture that allows FPGAs to be reused as a resource pool distributed across the data center. Following this success, the project spun off Project Brainwave, which uses FPGAs to accelerate AI applications.

But Microsoft is just one example. The academic work in this area is growing. To promote Gen 3, we need to marry a few close relatives:

  1. Low-effort HDL generation and developer abstraction (e.g. the Merlin compiler, BORPH)
  2. Heterogeneous execution support (e.g. OpenCL and TensorFlow)
  3. Flexible infrastructure (i.e. interconnect and computation)
  4. Automatic deployment and scheduling (conceptual example, another one)

Having automatic allocation with agile hardware will provide optimal execution on existing resources: faster, greener, cheaper.

Summary

There is no telling where the trends and concepts briefly mentioned in this article will lead. It is my personal belief that we are in an intermediary phase: development is focused on innovative compute technologies, like new processors, and innovative concepts, like resource disaggregation and edge computing. As we deploy more of these technologies in the field, we will reach a point where orchestrating and managing the variety is where the real value lies.

In this article I only touched on the concepts that I believe hold the most potential for improvement through orchestration. There are many other areas in which to analyse these topics and deepen the research.

Please feel free to contact me on any thoughts on the topics at harpakguy@gmail.com

Further reading

For readers looking for further information, I recommend following the links in the article and also researching the following topics:

  1. SmartNICs, composable/disaggregated infrastructure, and Microsoft's Project Catapult.
  2. A lot of insight can be found in studying the latest trends in processor development (for example Habana Labs, Hailo, BrainChip) and in programs for exascale computing like the European Processor Initiative.
  3. Kubernetes, especially as it moves to the bare metal and virtual machines space (e.g. KubeVirt)

Some interesting articles on similar topics:

Return of the Runtimes: Rethinking the Language Runtime System for the Cloud 3.0 Era

The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design (by Jeffrey Dean from Google research)

Beyond SmartNICs: Towards a Fully Programmable Cloud

Hyperscale cloud- reimagining data centers from hardware to applications

