CalcGraph: a new computational graph model

Tailored for meta-learning

Thomas HARTMANN
DataThings
12 min read · Jul 28, 2020


Recently, researchers and engineers have started to investigate whether machine learning itself can be used to learn how to learn, i.e. to automatically find which algorithms and which parameters are the most appropriate for a given problem. This is referred to as meta-learning or automatic machine learning. Meta-learning is a growing research area and, although it is still at an early stage, impressive results have already been achieved. For example, it has been shown that meta-learning can outperform the best hand-crafted neural networks in many domains, e.g. image classification and object detection. However, one of the biggest concerns with meta-learning today is the amount of high-performance computational resources it requires. It is often already very costly to train a single model, let alone a full ensemble of different algorithms and parameter configurations.

A computational graph model for meta-learning

Instead of relying on existing machine learning frameworks and their underlying computational models, we suggest an alternative computational graph model that has been specifically designed with meta-learning in mind. We call it CalcGraph.

Similar to TensorFlow’s computation graph, we suggest using a model descriptor (the CalcGraph model) to define the computation flow and then statically compiling the code for the model execution. More specifically:

  • Our model compiler generates optimized branchless byte code that can be executed as a fixed sequence of operations. Most importantly, this allows us to compute the required main and GPU memory size for a model and a given batch size, or an optimal batch size for a given memory size, at compile time.
  • We minimize expensive memory allocations by preallocating all of the required memory (computed by the model compiler) and then reusing it by switching models from storage into the preallocated memory and vice versa. This avoids expensive malloc and free operations at runtime and allows us to run meta-learning with constant memory. We refer to this as model switching.
  • Knowing the memory requirements at compile time enables us to further optimize resource usage by scheduling as many models as possible for parallel execution and sharding the executions among the available resources.

In the following, we describe the main ideas behind our CalcGraph model and how we think that it can help to reduce the high costs of meta-learning. The figure below shows a conceptual view of it.

Each CalcGraph defines an execution mode, which specifies where the graph should run: on CPU(s), on GPU(s), or via BLAS libraries. This indicates the resources and libraries that are later needed for the execution. Additionally, each CalcGraph contains one or more execution paths.

A Path defines a computational sequence of operators. Paths specify a calculation mode. Possible values are forward only (no backpropagation, i.e. no learning involved) and backpropagation (forward pass and backpropagation, i.e. learning involved). For instance, we can create a “pre-processing path” to specify which operators need to be executed on an image before we execute another path for the neural network. The pre-processing path would be in forward only mode, since it is a sequence of operators with no weights to learn. Another example of using multiple paths is a Generative Adversarial Network (GAN). In other words, a CalcGraph is a composition or sequence of paths, which together define a control flow.
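
To make this structure more tangible, here is a minimal C sketch of how a CalcGraph, its execution mode, and its paths could be represented. All type and field names are illustrative assumptions, not the actual CalcGraph API.

    #include <stddef.h>

    /* Illustrative sketch only: names and layout are assumptions. */
    typedef enum { EXEC_CPU, EXEC_GPU, EXEC_BLAS } ExecutionMode;
    typedef enum { CALC_FORWARD_ONLY, CALC_BACKPROPAGATION } CalculationMode;

    typedef struct Operator Operator;   /* an atomic operation, e.g. Add or MatMul */

    typedef struct {
        CalculationMode mode;           /* forward only vs. forward + backpropagation */
        Operator **operators;           /* ordered sequence of operators */
        size_t operator_count;
    } Path;

    typedef struct {
        ExecutionMode exec_mode;        /* where the graph runs: CPU, GPU, or BLAS */
        Path *paths;                    /* e.g. a pre-processing path + a network path */
        size_t path_count;              /* executed in order: a control flow */
    } CalcGraph;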

Operators represent atomic operations that can be executed on variables. Examples are Add, Sub, and MatMul. Most operators require one or several inputs and have one output. Others need additional parameters. Operators can be grouped into classes of similar characteristics. One class is the group of unary operators that take one tensor as input and produce one tensor of the same size and dimension as output. An example of a unary operator is Y = sin(X). Another class of operators takes two inputs and produces one output, all of the same size and dimension, like C = add(A,B). Then, there is a more advanced class of operators, which need additional parameters and have more complex dimension relationships between their inputs and outputs. Examples of this class are MatMul, convolution, and de-convolution.

Most operators are linear in terms of memory consumption. Operators that are not must provide their own checkers for the size, type, and memory needed for their output. The list of all operators can be seen as an instruction set, which defines the possible operations that can be performed on variables within a CalcGraph. This is somewhat similar to the x86 instruction set, which defines which operations can be executed on a compatible microprocessor. A de facto standardized set of operators has been defined by ONNX, and we aim to implement a superset of these operators to be compatible with ONNX models.
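
One way to picture an entry of such an instruction set is an operator descriptor that bundles its arity with its checkers. The sketch below is an assumption about how this could look in C, not the actual implementation; the fixed-rank Shape type and the function names are made up for illustration.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { size_t dims[8]; size_t rank; } Shape;

    /* One entry of the "instruction set": an operator and its checkers. */
    typedef struct {
        const char *name;                 /* e.g. "Add", "MatMul", "Conv" */
        size_t input_count;               /* 1 for unary, 2 for binary, ... */
        /* infer the output shape from the input shapes */
        bool (*infer_output)(const Shape *inputs, size_t n, Shape *out);
        /* extra memory the operator needs on a given target device */
        size_t (*workspace_bytes)(const Shape *inputs, size_t n, int device);
    } OperatorDesc;

    /* Unary class: the output has the same shape as the input, e.g. Y = sin(X). */
    static bool infer_same_shape(const Shape *in, size_t n, Shape *out) {
        (void)n;
        *out = in[0];
        return true;
    }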

Operators can have an optional Workspace. This is necessary because some operators need additional memory. The required memory differs per target device (CPU/BLAS/GPU). This is one of the main reasons why changing the target device requires recompiling the model. For CPU-only execution no workspace is needed, but for BLAS and GPU it is required. All operators in the whole CalcGraph model share the same workspace. The size of the workspace is determined by the operator with the biggest memory requirement. The workspace is a special zone in the heap.
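
Because all operators share a single workspace sized by the most demanding one, the compiler can derive its size in one pass. A minimal sketch, assuming the per-operator workspace sizes for the chosen device have already been computed:

    #include <stddef.h>

    /* The shared workspace is sized by the largest per-operator requirement. */
    static size_t workspace_size(const size_t *per_op_bytes, size_t op_count) {
        size_t max = 0;
        for (size_t i = 0; i < op_count; i++) {
            if (per_op_bytes[i] > max)
                max = per_op_bytes[i];
        }
        return max;   /* 0 for CPU-only graphs, where no workspace is needed */
    }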

Variables are the central elements of a CalcGraph. Each variable represents a tensor. There are three types of variables: 1) placeholders, which hold data coming from outside of the CalcGraph, 2) variables to be optimized, meaning they have to be updated after a learning operation, and 3) generic variables, which are usually just the results of operators. For example, the equation O = W*I shows these three types of variables. I represents a tensor input, or placeholder, provided from a dataset outside the CalcGraph. W can be marked as to be optimized: it represents the weights to learn, so it has to be initialized in some way and then updated regularly after each learning round. O is just calculated straightforwardly from the execution of the multiply operator on W and I. Placeholders have getters and setters, while variables and variables to optimize have only getters (for the value and for the gradient). Variables that are marked as to be optimized require two additional parameters: one to define how to initialize them (InitializationMethod, e.g. everything to zero, random, etc.) and one to define which regularization method (RegularizationMethod, e.g. L1, L2, etc.) has to be applied to them during the optimization phase.

Every variable has two TensorDescriptors. One specifies the generic shape of the tensor and can contain a variable-size dimension. The other one represents the current shape of the variable as configured to run with the current batch size. All variables allocate a forward weight in a so-called CalcHeapZone. If the path is configured for backpropagation, all variables in that path additionally have a gradient weight. The variables marked as to be optimized have an extra place for the optimizer state.
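
Putting the last two paragraphs together, a variable could be sketched in C as follows. The field names, the fixed-rank descriptors, and the convention for marking the batch dimension are all assumptions for illustration.

    #include <stddef.h>

    typedef enum { VAR_PLACEHOLDER, VAR_TO_OPTIMIZE, VAR_GENERIC } VariableKind;

    #define DIM_VARIABLE ((size_t)0)   /* assumed marker for the variable-size dimension */

    typedef struct {
        size_t dims[8];
        size_t rank;
    } TensorDescriptor;

    typedef struct {
        VariableKind kind;
        TensorDescriptor generic;      /* may contain a variable-size (batch) dimension */
        TensorDescriptor current;      /* concrete shape for the current batch size */
        size_t forward_offset;         /* position inside the forward zone */
        size_t gradient_offset;        /* only used on backpropagation paths */
        size_t optimizer_offset;       /* only used for variables to optimize */
    } Variable;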

The CalcHeap is an abstraction responsible for allocating memory for the execution of the CalcGraph. It is subdivided into four memory zones: the forward zone, the gradient zone, the optimizer zone, and the workspace zone. Each variable in the CalcGraph points to its space, the CalcHeapZoneSpace, within one or several of the zones. Variables participating in any path marked for backpropagation have memory allocated in both the forward and the gradient zone. Variables marked to be optimized additionally have space in the optimizer zone. The remaining variables, which participate only in forward only paths, have memory allocated in the forward zone only. Each of the zones is a contiguous memory block. This division into zones is a key characteristic of our approach.
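
The sketch below shows how such a heap could be laid out as a single allocation with four contiguous zones. The zone accessors and the init function are assumptions; the actual CalcHeap implementation may differ.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        uint8_t *base;    /* one preallocated block for the whole model */
        size_t forward_size, gradient_size, optimizer_size, workspace_size;
    } CalcHeap;

    static uint8_t *forward_zone(CalcHeap *h)   { return h->base; }
    static uint8_t *gradient_zone(CalcHeap *h)  { return forward_zone(h) + h->forward_size; }
    static uint8_t *optimizer_zone(CalcHeap *h) { return gradient_zone(h) + h->gradient_size; }
    static uint8_t *workspace_zone(CalcHeap *h) { return optimizer_zone(h) + h->optimizer_size; }

    static int calcheap_init(CalcHeap *h) {
        size_t total = h->forward_size + h->gradient_size
                     + h->optimizer_size + h->workspace_size;
        h->base = malloc(total);       /* the only allocation for the model */
        if (h->base == NULL) return -1;
        memset(h->base, 0, total);     /* "initializing" is a single memset */
        return 0;
    }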

The CalcGraph compiler

The model compiler takes a CalcGraph model as input. In the model, either the maximum allowed memory size (main and GPU) or the maximum allowed batch size must be specified, depending on what the compiler should optimize for. In case the (optimal) maximum batch size is known, e.g. due to the dataset or the learning problem at hand, the compiler can automatically derive the required memory size for this batch size and the target device. Conversely, the maximum allowed memory size can be specified and the compiler can automatically derive the maximum possible batch size given the memory restrictions. As output, the compiler generates the byte code of the model and derives the heap memory partitioning that maximizes the usage of the allocated heap space. This information is stored directly in the byte code of the model. The following figure shows the compilation process.

While computing the required memory size from a given maximum batch size is straightforward (with the help of the tensor descriptors, we know the exact size of each variable), computing the maximum batch size based on a given maximum memory size is more difficult. This is because some operators are not linear in their memory allocation. Another reason is that the same operator can have different implementations depending on the target device or on the input tensor size (due to optimizations). We therefore decided to use a dichotomic (binary) search algorithm.
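
A minimal sketch of this search is shown below. Here estimate_bytes stands for the compiler's per-batch-size memory computation and is assumed as a function pointer; the real compiler derives it from the tensor descriptors and the operator checkers.

    #include <stddef.h>

    typedef size_t (*EstimateFn)(size_t batch_size);

    /* Largest batch size whose estimated memory fits into the given budget. */
    static size_t max_batch_size(EstimateFn estimate_bytes,
                                 size_t budget_bytes, size_t upper_bound) {
        size_t lo = 1, hi = upper_bound, best = 0;
        while (lo <= hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (estimate_bytes(mid) <= budget_bytes) {
                best = mid;          /* fits: try a larger batch size */
                lo = mid + 1;
            } else {
                hi = mid - 1;        /* too big: try a smaller batch size */
            }
        }
        return best;                 /* 0 if even batch size 1 does not fit */
    }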

Optimized code

The use of CalcGraph models as abstractions allows us to optimize the generated byte code. First, the generated byte code consists of a simple sequence of operations without branches, i.e. the byte code is branchless. Although modern processors are equipped with sophisticated branch prediction algorithms, every mispredicted branch hurts performance quite substantially: branching to an unexpected location makes it necessary to flush the pipelines, pre-fetch new instructions, and so on. Generating branchless byte code avoids such stalls altogether. This optimization is made possible by analyzing the model abstraction, or more precisely, the paths and operators of a CalcGraph model, before generating the byte code.
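
To illustrate what such straight-line byte code could look like, here is a small assumed encoding in C: a flat array of instructions with no jumps, where operands are offsets into the heap zones. The opcodes and the encoding are purely illustrative.

    #include <stdint.h>

    typedef enum { OP_MATMUL, OP_ADD, OP_RELU, OP_HALT } OpCode;

    typedef struct {
        OpCode op;
        uint32_t in0, in1, out;   /* operands encoded as offsets into heap zones */
    } Instruction;

    /* e.g. O = relu(W * I + B), emitted as a fixed sequence without branches: */
    static const Instruction program[] = {
        { OP_MATMUL, /*W*/   0,   /*I*/ 256, /*tmp*/ 512 },
        { OP_ADD,    /*tmp*/ 512, /*B*/ 640, /*tmp*/ 512 },
        { OP_RELU,   /*tmp*/ 512, 0,         /*O*/   768 },
        { OP_HALT,   0, 0, 0 },
    };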

Another optimization that we perform is to merge several operators into one, or to split an operator into several others, in case it is beneficial for a target device. For instance, CUDA libraries provide optimized implementations for linear, recurrent, and convolutional layers. In these cases, there is no need to first run a Multiply and then an AddBias operator, since we can fuse these two operators into one LinearLayer operator. In addition, the model compiler can optimize the data transfer between CPU and GPU by carefully splitting the model into chunks of independent operations, enabling data transfer in parallel with model execution on the GPU. This is possible thanks to the asynchronous transfer and execution APIs of many GPU providers.
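
A fusion such as Multiply followed by AddBias becoming one LinearLayer can be expressed as a simple peephole pass over the operator sequence. The sketch below is an assumption about how such a pass could look; the opcode names and the NOP-based rewrite are illustrative.

    #include <stddef.h>

    typedef enum { OP_MULTIPLY, OP_ADD_BIAS, OP_LINEAR_LAYER, OP_NOP } Op;

    /* Fuse Multiply followed by AddBias into one LinearLayer when the target
     * device (e.g. via CUDA libraries) offers an optimized fused kernel. */
    static void fuse_linear_layers(Op *ops, size_t count, int has_fused_linear) {
        if (!has_fused_linear) return;
        for (size_t i = 0; i + 1 < count; i++) {
            if (ops[i] == OP_MULTIPLY && ops[i + 1] == OP_ADD_BIAS) {
                ops[i] = OP_LINEAR_LAYER;   /* one fused kernel call */
                ops[i + 1] = OP_NOP;        /* dropped in a later compaction pass */
            }
        }
    }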

Besides the sequence of operations, the model compiler also generates the partitioning of the heap. As mentioned previously, the heap gets partitioned into four distinct zones: the forward zone, the gradient zone, the optimizer zone, and the workspace zone. For paths in forward only mode, variables only need space in the forward zone, while for paths in backpropagation mode, variables need space in both the forward zone and the gradient zone. Variables to optimize need additional space in the optimizer zone. The workspace zone is a common workspace for the whole model and is used in case some operators need to allocate additional memory for certain target devices, i.e. BLAS and GPU. It is important to note that all of these zones are contiguous.

Each variable of a CalcGraph points to a distinct position inside the zones. This is implemented using offsets, i.e. for each variable and for each zone the variable needs space in, the byte code of the CalcGraph contains an offset specifying the position inside that zone. This offset mechanism allows us to omit bounds checks at runtime and therefore further increases performance.

When the model compiler calculates the offsets of the variables in each zone, it always considers the biggest batch size possible within the maximum memory bounds. Later, at execution time, it is always possible to use a batch size smaller than the maximum allowed one. This only results in sparser memory usage, which is still much cheaper than freeing, re-computing, recompiling, and re-allocating the memory. The following example shows when changing the batch size is necessary: let us consider a dataset of 1065 entries for which, within the memory constraint, we can only fit a maximum batch size of 100. We can then split the dataset into 11 mini-batches: 10 of 100 entries each and a last one of 65. There is no need to reallocate the memory for the last mini-batch; sparse usage is absolutely fine for such cases.
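
The following tiny, runnable C example walks through exactly this case and shows that the preallocated memory, sized for a batch of 100, is simply reused for every mini-batch, including the final partial one.

    #include <stdio.h>

    int main(void) {
        const int dataset = 1065, max_batch = 100;
        for (int start = 0; start < dataset; start += max_batch) {
            int batch = dataset - start < max_batch ? dataset - start : max_batch;
            /* memory stays sized for max_batch; a smaller batch just uses less of it */
            printf("batch of %d entries (memory sized for %d)\n", batch, max_batch);
        }
        return 0;   /* prints 10 batches of 100 and one final batch of 65 */
    }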

The CalcGraph interpreter

The model interpreter is a virtual machine. It reads the byte code generated by the model compiler and executes it. As explained previously, the byte code is a sequence of operations. During execution, the model interpreter takes advantage of the memory segmentation that has been precomputed by the model compiler. This allows the model interpreter to statically allocate the whole memory required for the execution of the CalcGraph model in advance. This avoids expensive memory allocation operations (malloc) during the execution of CalcGraph models and significantly improves performance. Moreover, it allows CalcGraphs to be executed with constant memory usage: no additional memory needs to be allocated or freed. Therefore, unlike in many other approaches, it can never happen that training a model ends in an out-of-memory error after hours or days of computation. The working process of the interpreter is depicted in the following figure.

The execution granularity is per operator. At the beginning of the execution, the pre-allocated memory gets initialized. Initializing just means the memory is set to 0, i.e. it is a single memset operation. During execution, the place of each variable inside this memory block is defined by the offset that has been precomputed by the model compiler and stored inside the byte code. The fact that each variable has a precomputed, static place inside the memory allows us to completely omit bounds checks at runtime, which again increases performance. In addition to the four zones (forward, gradient, optimizer, and workspace), the main memory also holds a program counter. Apart from the fact that paths can be executed in forward and in backward mode, so that the program counter must be able to decrease as well as increase, it is a standard program counter. The way we divide the main memory and generate the byte code from the model makes it unnecessary to use an additional stack during execution.
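
A much simplified dispatch loop could look as follows in C. It only sketches forward execution over the forward zone; the opcodes, instruction layout, and offset convention are the same illustrative assumptions used above, and real operators would of course work on whole tensors rather than single elements.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef enum { OP_MATMUL, OP_ADD, OP_RELU, OP_HALT } OpCode;
    typedef struct { OpCode op; uint32_t in0, in1, out; } Instruction;

    static void run(const Instruction *code, uint8_t *forward_zone, size_t zone_bytes) {
        memset(forward_zone, 0, zone_bytes);      /* initialization: one memset */
        for (size_t pc = 0; code[pc].op != OP_HALT; pc++) {
            /* operands resolve as zone base + precomputed offset: no bounds checks */
            float *a   = (float *)(forward_zone + code[pc].in0);
            float *b   = (float *)(forward_zone + code[pc].in1);
            float *out = (float *)(forward_zone + code[pc].out);
            switch (code[pc].op) {
                case OP_ADD:  out[0] = a[0] + b[0];           break;
                case OP_RELU: out[0] = a[0] > 0 ? a[0] : 0;   break;
                default:      /* MatMul etc. dispatch to CPU/BLAS/GPU kernels */ break;
            }
        }
    }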

Model switching

The common way to train models is to allocate and free memory dynamically during learning, and to do so per model. This is not only extremely costly, but the main memory required for training is also not known in advance, possibly leading to out-of-memory errors after hours or days of computation. It also makes it more difficult to optimally utilize the available resources. One of the core concepts behind our CalcGraph model is that multiple models can share the same preallocated memory and just repartition it, instead of expensively freeing and reallocating memory. As a reminder, repartitioning the CalcHeap just means initializing it to 0. In reference to the concept of context switching, we refer to our approach of switching models into the preallocated memory as model switching.

Just like a context switch is a procedure to change from one task (or process) to another while ensuring that the tasks do not conflict, a model switch is a procedure to change from one model to another while ensuring that the models do not conflict. And just as effective context switching is performance critical to provide user-friendly multitasking, effective model switching is performance critical when it comes to executing many CalcGraph models, as it is the case for meta-learning.

A model switch consists of the following basic operations:
  • stop the execution of the current model
  • swap the heap of the current model to storage
  • re-initialize the heap, i.e. set the whole memory back to 0 (memset)
  • copy the swapped heap of the new model from storage to memory
  • continue the execution of the new model

Since each zone is a contiguous block, swapping the heap from storage to main memory and vice versa is very cheap, because we only need a single memory copy (memcpy) operation per zone. Furthermore, it is not always necessary to swap all zones. As soon as the variables to optimize have been optimized, there is no need to keep the gradient zone, i.e. the gradient zone is no longer swapped. Similarly, if the target device does not require a workspace, there is no need to swap the workspace zone.
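
The sketch below shows how such a model switch over the preallocated zones could look in C. The storage buffers and the decision of which zones to swap are simplified assumptions; the point is only that each step reduces to one memcpy or memset per contiguous zone.

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint8_t *forward, *gradient, *workspace;
        size_t forward_size, gradient_size, workspace_size;
    } HeapZones;

    /* Switch from the currently loaded model to a new one in the same memory. */
    static void model_switch(HeapZones *heap,
                             uint8_t *old_model_storage,   /* where the old model is saved */
                             uint8_t *new_model_storage,   /* where the new model was saved */
                             int swap_gradients) {
        /* 1. swap the current model's zones out to storage */
        memcpy(old_model_storage, heap->forward, heap->forward_size);
        if (swap_gradients)
            memcpy(old_model_storage + heap->forward_size,
                   heap->gradient, heap->gradient_size);

        /* 2. re-initialize zones that the new model will not overwrite */
        if (!swap_gradients)
            memset(heap->gradient, 0, heap->gradient_size);
        memset(heap->workspace, 0, heap->workspace_size);   /* never swapped */

        /* 3. copy the new model's zones from storage into the preallocated memory */
        memcpy(heap->forward, new_model_storage, heap->forward_size);
        if (swap_gradients)
            memcpy(heap->gradient, new_model_storage + heap->forward_size,
                   heap->gradient_size);

        /* 4. execution of the new model can now continue */
    }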

In fact, this model switching mechanism is a key enabler of our approach to taming the high costs of meta-learning. However, even beyond meta-learning, inexpensive model switching is very beneficial for many practical learning problems that use multiple models. This is especially the case for different instances of the same model, or different parameters for the same model. For example, recommender systems need to train at least one model per user, or per category of users. Without an effective model switching mechanism, one would need to expensively allocate and free memory dynamically during learning, and to do so per model. Another example along these lines is profiling systems, where at least one model per user needs to be trained.

Acknowledgement

This work is supported by the Luxembourg National Research Fund, FNR (https://www.fnr.lu/), under the Industrial Fellowship program (grant 12490856).

If you enjoyed reading, follow us on: Facebook, Twitter, LinkedIn
