ONNX.jl: The Past, Present and Future

Ayush Shridhar
Jul 21, 2018 · 6 min read

Machine learning is more than a field; it’s a revolution. In recent years, there has been a quantum leap in the number of companies and people involved in research and development of software centered around machine learning. The big tech giants have tried to make this field more accessible to students, researchers and professionals alike by developing and open-sourcing their machine learning frameworks, such as TensorFlow, PyTorch, Caffe2 and CoreML, just to name a few.

But the presence of so many frameworks raises a simple yet powerful question: can transfer learning be made cross-platform? Is it possible to avoid reinventing the wheel by developing networks in a uniform format? As they say, necessity is the mother of invention. Naturally, people started working in this direction. Their goal was simple: develop a format for interchangeable machine learning models, and remove the framework-specific constraint from the world of machine learning. And hence, ONNX was born.

ONNX, or Open Neural Network eXchange, is a format for interchangeable machine learning models. It provides an easy interface for running models built in one framework inside any other framework. It uses Google’s Protocol Buffers to serialize models into a common format, along with a set of specifications for reading and loading them elsewhere. Over the summer, I worked on implementing an ONNX backend for Julia’s Flux framework, as part of my GSoC project. In this article, I will try to list the objectives I was able to meet, the obstacles in my path, and the work that remains to be done in the future.

Cross-platform machine learning poses a good number of problems, mostly because every framework has a different architecture. ONNX tries to connect the common dots across frameworks, and provides the functionality needed to implement them for any other framework, Flux.jl in this case. This task of writing an ONNX backend for Flux consists of a number of steps:

  • Processing the raw ProtoBuf data. The raw data is difficult to understand and cumbersome to work with, so we essentially create new structures, similar to the primitive ProtoBuf-generated structures, that store only the useful information in a simpler, more straightforward and Julian format.
  • Extracting weights from these models. The weights are generally stored in the TensorProto structure. One thing to keep in mind here is that a few models may opt to store weights in a Constant tensor, rather than in the regular structure.
  • The nodes present in the graph need to be converted to Flux operators. DataFlow.jl is used to get the dataflow graph from the GraphProto structure. Every operator needs to be mapped to the corresponding Flux operator. The ops.jl file contains this mapping from ONNX nodes to Flux ops.
  • Emitting code is the next major step in the process. DataFlow.jl reads the computation graph from the ONNX structures, generates the Julian dataflow graph and finally emits the code. This produces two files: model.jl, which contains the Flux representation of the entire model, and weights.bson, which stores the weights of the model tensors. Both need to be loaded externally to run the model.

So to summarize this section: we’ve essentially converted the ONNX model into Julia code, along with a separate weights file. This is, in short, the entire process adopted by ONNX.jl. A sketch of the resulting workflow is shown below.
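For concreteness, here is roughly what this looks like from the user’s side. This is a minimal sketch; treat the function names (ONNX.load_model, ONNX.load_weights) as illustrative of the current interface rather than a stable API:

```julia
using Flux, ONNX

# Convert the serialized ONNX model into Julia source plus a weights file;
# this writes model.jl and weights.bson next to the input model.
ONNX.load_model("squeezenet.onnx")

weights = ONNX.load_weights("weights.bson")   # tensor name => value
model   = include("model.jl")                 # the emitted Flux model
```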

Current Challenges

ONNX.jl can be used to run quite a few models now. I was able to load the pre-trained MNIST, VGG19, SqueezeNet and Emotion FERPlus models, and run them to obtain accurate results. But I also came across quite a few problems, mostly because of differences between how operators are implemented in other frameworks and in Flux, and because various operators are unavailable in Flux altogether. This restricted loading more models into Flux. A few of these challenges are listed below:

  • AveragePool

I came across the different behavior of AveragePool while testing the pooling operators. The AveragePool tests failed even with a straightforward implementation in Flux. To verify the results, I compared the output of the same layer on the same input across two frameworks: Flux and Keras-TensorFlow. That is when I realized that the two frameworks implement AveragePool differently. Flux pads the input tensor and then pools it, counting the padded elements in each average; Keras also pads and then pools, but excludes the padded elements when computing each average. As a result, we obtain wrong results at the borders of the output tensors.
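A toy one-dimensional example makes the difference concrete (plain Julia, no framework involved):

```julia
# Input [1, 2, 3], window size 2, one element of zero padding on each side.
x = [1.0, 2.0, 3.0]
padded = [0.0; x; 0.0]                     # [0, 1, 2, 3, 0]

# Convention A (count padded elements in the divisor), as in Flux:
incl = [(padded[i] + padded[i+1]) / 2 for i in 1:4]          # [0.5, 1.5, 2.5, 1.5]

# Convention B (exclude padded elements from the divisor), as in Keras:
counts = [1, 2, 2, 1]                      # non-pad elements in each window
excl = [(padded[i] + padded[i+1]) / counts[i] for i in 1:4]  # [1.0, 1.5, 2.5, 3.0]
```

The two conventions agree everywhere except at the borders, which is exactly where the tests failed.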

  • Asymmetrical pads

Asymmetrical padding isn’t very common. Nevertheless, many ONNX models, generally ones exported from frameworks like Caffe2, tend to use it. In Flux, we specify padding as (a, b), which expands to (a, b, a, b) in the four directions, so the padding in opposite directions is always the same. ONNX padding of the type (2, 2, 0, 0) therefore can’t be expressed directly in Flux. One way around this is to convert it into symmetrical padding: (2, 2, 0, 0) pads each axis by a total of 2, which can be redistributed as (1, 1, 1, 1), and that is supported in Flux (a sketch of this conversion follows below). However, this approach has two shortcomings:

First, the model’s performance degrades, since the padded elements no longer sit where the model expects them, which is fairly obvious.

Second, not all paddings can be converted to symmetrical ones without disturbing the expected output shape. For example, (3, 3, 0, 0) pads each axis by a total of 3, which cannot be split evenly between the two sides; changing the total would change the output shape from the expected one, which will most probably lead to some sort of DimensionError in later stages.
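Here is the conversion sketched out; symmetric_pads is a hypothetical helper name, not part of ONNX.jl or Flux:

```julia
# Convert ONNX pads (x1_begin, x2_begin, x1_end, x2_end) into Flux-style
# symmetric padding (p1, p2), when the total padding per axis is even.
function symmetric_pads(pads::NTuple{4,Int})
    b1, b2, e1, e2 = pads
    t1, t2 = b1 + e1, b2 + e2
    (iseven(t1) && iseven(t2)) || error("pads $pads cannot be made symmetric")
    (t1 ÷ 2, t2 ÷ 2)
end

symmetric_pads((2, 2, 0, 0))  # (1, 1)
symmetric_pads((3, 3, 0, 0))  # throws: a total of 3 per axis cannot be split evenly
```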

  • Grouped Convolutions

Flux doesn’t support grouped convolutions, which are fairly common in Caffe2-exported models. A grouped convolution splits the input into groups along the channel axis, convolves each group individually with its own filters, and concatenates the results. Flux does have an implementation of depthwise convolutions, but those are just the special case of grouped convolutions where the number of groups equals the number of input channels. It would be better to support groups in the long run, as a single interface is easier to use than a separate implementation for depthwise convolutions.
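As a rough illustration of the idea, a grouped convolution can be emulated on top of Flux’s existing Conv layer by slicing the channel axis manually; grouped_conv is a hypothetical helper, not a Flux or ONNX.jl API:

```julia
using Flux

# Emulate a grouped convolution: slice the input along the channel axis
# (Flux uses the WHCN layout), apply one Conv per group, concatenate.
function grouped_conv(x, convs::Vector)
    g  = length(convs)
    cs = size(x, 3) ÷ g                                 # channels per group
    ys = [convs[i](x[:, :, (i-1)*cs+1:i*cs, :]) for i in 1:g]
    cat(ys...; dims = 3)
end

# Example: 8 input channels in 2 groups of 4, each producing 6 outputs.
convs = [Conv((3, 3), 4 => 6) for _ in 1:2]
x = rand(32, 32, 8, 1)
y = grouped_conv(x, convs)                              # 30×30×12×1
```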

  • Local Response Normalization

Flux doesn’t support Local Response Normalization yet, either. I’ve opened a Pull Request for this, though, so it shouldn’t be a big issue later on.

  • BatchNormalization Interface

Flux’s BatchNorm uses a slightly uncommon interface: the constructor takes a standard deviation instead of a variance, and the epsilon parameter is ignored during the forward pass. This poses a problem when loading pre-trained models, as epsilon can make quite a difference. As a result, the current implementation of BatchNorm in ONNX.jl involves more operators than necessary, and could be made simpler if a uniform interface were implemented.
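For inference, everything BatchNormalization needs can be folded into a single affine map, which is roughly what ONNX.jl has to spell out with several broadcasts today. A minimal sketch, assuming per-channel parameter vectors and Flux’s WHCN layout (batchnorm_affine is a hypothetical helper):

```julia
# Fold ONNX BatchNormalization parameters (scale γ, bias β, mean μ,
# variance σ², epsilon ε) into one affine map: y = x .* scale .+ shift.
function batchnorm_affine(γ, β, μ, σ², ε)
    scale = γ ./ sqrt.(σ² .+ ε)
    shift = β .- μ .* scale
    # Reshape per-channel vectors so they broadcast over a WHCN tensor.
    x -> x .* reshape(scale, 1, 1, :, 1) .+ reshape(shift, 1, 1, :, 1)
end
```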

  • Difference in type of inputs and outputs

Flux tends to change the element type of the data in many cases. As an example, consider a small Dense layer, Dense(2, 4), which takes a 2-element vector as its input and returns a 4-element vector. Naturally, if we feed it a vector of type Array{Float64,1}, we’d expect the output to also have type Array{Float64,1}. But if the input has any other element type, Float32 for instance, the returned output still has Float64 elements. So if all our operations use Float32 values and the type of one output suddenly changes to Float64 during the forward pass, we’re going to get errors.
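This is easy to see in the REPL. The snippet below reflects Flux’s behavior at the time of writing, when layer parameters default to Float64:

```julia
using Flux

d = Dense(2, 4)        # weights and biases default to Float64
x = rand(Float32, 2)

eltype(d(x))           # Float64, not the Float32 we put in
```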

  • “SAME_UPPER”/“SAME_LOWER” padding

This padding means “pad the input so that the output has the same spatial shape as the input” (for unit strides; more generally, so that the output size equals the input size divided by the stride, rounded up). SAME_UPPER places any extra padding at the end of an axis, SAME_LOWER at the beginning. The attribute is deprecated, but is still used in quite a few ONNX models.
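The required padding follows directly from the definition; a small sketch for one axis (same_pads is a hypothetical helper):

```julia
# Total padding along one axis so that out = ceil(in / stride).
function same_pads(in_len, kernel, stride)
    out   = cld(in_len, stride)                          # ceiling division
    total = max((out - 1) * stride + kernel - in_len, 0)
    lower = total ÷ 2
    (lower, total - lower)   # SAME_UPPER: the odd extra element goes at the end
end

same_pads(5, 3, 1)  # (1, 1): a 3-wide kernel keeps a length-5 axis at 5
same_pads(5, 4, 1)  # (1, 2): the extra element lands at the end
```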

What’s next?

ONNX is itself under rapid development, so it’s going to be a rough road ahead. For starters, the ONNX operator tests are updated regularly, so we need to keep track of these changes, keep testing the operators, and add tests for newly introduced operators. And not just operators: the ONNX models themselves also need to be tested continuously, since they are regularly replaced by newer, corrected versions.

Moreover, over the long term, we’ll have to add support for various other features to Flux if we want to import and run the latest state-of-the-art computer vision models, ranging from image classification to object detection. Alongside these additions, we’ll also need to remove deprecated attributes as ONNX drops support for them.

Note: Special thanks to philtor for the review.