What I Have Done Wrong Leading Baidu’s Open-Source Deep Learning System PaddlePaddle

Yi Wang
Published in The Startup
Nov 5, 2020

In late 2016, I was appointed the tech lead of PaddlePaddle, Baidu's open-source deep learning system. The team upgraded the technology from the Caffe generation toward something approaching a deep learning language and named it PaddlePaddle Fluid. By 2019, the Fluid version was used overwhelmingly in Baidu products. Many good things happened during the journey, but this article is all about regrets.

At a time when TensorFlow had built a large community and before the release of PyTorch, upgrading PaddlePaddle from graph-based autodiff to a newer generation of technology came down to two choices: (1) imperative programming, also known as the dynamic network, and (2) autodiff by the compiler. I aggressively chose the latter and named it PaddlePaddle Fluid, which, however, took the team two years to deliver as a stable and usable system.

It was DyNet (https://github.com/clab/dynet) that inspired the former idea. A young engineer, Yang Yu, tried to persuade me and the rest of the team to follow it. However, I was afraid that we could not move faster than the DyNet team if we followed their idea. After all, we would need time to wrap our minds around the new approach, and during that time the DyNet team would keep moving forward. Even if DyNet failed to keep its advantage, any tech giant in the Internet industry could build a team to realize it. This was later proven true when Facebook released PyTorch.

My fault lay at the other end of the ROI equation: I significantly underestimated the cost of implementing autodiff at compile time, especially compared with that of the dynamic network. My wrong estimate became crystal clear two years later, right before I left Baidu, when a new-grad engineer, Yang Yang, wrote a tape-based implementation, PaddlePaddle Tape, in a week, alone!
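
To give a sense of why a tape-based implementation can be that small, here is a minimal sketch in plain Python, assuming scalar values and only two operators. This is not the real PaddlePaddle Tape; it only illustrates the core idea: the forward pass records each operator call on a tape, and the backward pass replays the tape in reverse to accumulate gradients.

```python
# A toy tape-based reverse-mode autodiff. Hypothetical illustration only;
# a real system would dispatch to existing C++ operators instead.

class Var:
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape  # the tape this variable is recorded on

def make_tape():
    return []  # a tape is just a list of backward closures

def add(a, b):
    out = Var(a.value + b.value, a.tape)
    def backward():
        a.grad += out.grad
        b.grad += out.grad
    a.tape.append(backward)
    return out

def mul(a, b):
    out = Var(a.value * b.value, a.tape)
    def backward():
        a.grad += b.value * out.grad
        b.grad += a.value * out.grad
    a.tape.append(backward)
    return out

def backprop(loss):
    loss.grad = 1.0
    for backward in reversed(loss.tape):  # replay the tape in reverse order
        backward()

# Usage: y = x * x + x, so dy/dx = 2x + 1 = 7 at x = 3.
tape = make_tape()
x = Var(3.0, tape)
y = add(mul(x, x), x)
backprop(y)
print(y.value, x.grad)  # 12.0 7.0
```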

It is embarrassing and even painful for me to revisit the story, but 2.5 years meant a lot of pressure on the team. Indeed, I did not get the exploration to a stable version before I left Baidu in late 2018; it was my colleague Tian Wu and the team who did. Even now, the project does not have a frontend language; users build the IR by calling Python APIs, just as they build the TensorFlow graph. Without a frontend language, the only notable difference between the PaddlePaddle IR and the TensorFlow graph is that the former has additional control flows, like function definitions and function calls. To support autodiff of these control flows, we extended the approach used in TensorFlow graph mode, invented by Yuan Yu and the team at Google.
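
For readers unfamiliar with that graph-building style, here is a minimal sketch in TensorFlow 1.x graph mode (not the PaddlePaddle API). The Python calls compute nothing; they only append nodes, including control-flow nodes such as tf.while_loop, to a graph, and tf.gradients differentiates through that control flow using the machinery from Yuan Yu's team.

```python
# TensorFlow 1.x-style graph building, run here via the compat.v1 API.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.placeholder(tf.float32, shape=[])  # a graph node, not a value

def cond(i, acc):
    return i < 3

def body(i, acc):
    return [i + 1, acc * x]  # three iterations compute acc = x**3

# tf.while_loop adds control-flow nodes to the graph; its gradient is
# handled by the graph-mode control-flow machinery mentioned above.
_, y = tf.while_loop(cond, body, [tf.constant(0), tf.constant(1.0)])
dy_dx = tf.gradients(y, x)[0]

with tf.Session() as sess:
    print(sess.run([y, dy_dx], feed_dict={x: 2.0}))  # [8.0, 12.0]
```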

It is not too embarrassing to tell the story, because our attempt came years before Swift for TensorFlow, MLIR, and the extensions to Julia for deep learning. Given enough time, the team should have been able to achieve the ultimate goal of a new deep learning programming language. Unfortunately, there is probably never "enough time" for any project.

Over the past two years, I have not been able to stop thinking about how I could have done my work better. With the trajectories of PyTorch, TensorFlow Eager Execution, and JAX as references, we can now see that the right choice in 2016 would have been to implement a tape-based dynamic network system in, say, a few weeks, reusing the existing PaddlePaddle operators in C++, and then to deliver a stable implementation in the following few months. Of course, this implementation would have needed a lot of performance tuning, just as PyTorch is still tuning its performance. But an easy-to-use system can start attracting users long before its performance matures.
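
As a point of contrast with the graph-building sketch above, here is a minimal sketch of the tape-based, imperative user experience in today's PyTorch. Every line runs eagerly, so intermediate values can be printed or stepped through in a debugger, which is exactly the usability that attracts users early.

```python
import torch

# One hand-rolled training step for a tiny linear model.
w = torch.randn(3, 1, requires_grad=True)
x = torch.randn(4, 3)
target = torch.randn(4, 1)

pred = x @ w                      # forward ops are recorded on the autograd tape
loss = ((pred - target) ** 2).mean()
print(loss.item())                # inspect any intermediate value immediately

loss.backward()                   # replay the tape in reverse to get gradients
with torch.no_grad():
    w -= 0.1 * w.grad             # a plain SGD update
    w.grad.zero_()
```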

It is also easy to extend the tape-based approach to languages other than Python. A few weeks ago, I started a hobby project, GoTorch (https://github.com/wangkuiyi/gotorch), which reimplements the high-level API of PyTorch in Go while calling the C++ core of PyTorch. It took only four developers working part-time a few weeks to make it build and run on x86, ARM, and CUDA, to support Raspberry Pi and NVIDIA Drive PX2, and to achieve better performance than PyTorch on CUDA servers when training the ResNet-50 model on the ImageNet dataset.

If I had made a decision in 2016 that produced smooth results like these, we could have accumulated the fame and resources to move on to compile-time autodiff. However, in 2016, I could not foresee any of this.

In October of that year, at the urging of Andrew Ng, Baidu finally decided to open-source PaddlePaddle, which Wei Xu had developed in 2013. Wei recommended me as the new leader, with the charge of upgrading the graph-based system to catch up with the newer generations of technology.

In 2013, the primary competitor of PaddlePaddle was Caffe, created by Yangqing Jia, which boosted computer vision research with its support for CNNs. PaddlePaddle supported RNNs in addition to CNNs and thus had an advantage in NLP. However, by the end of 2016, TensorFlow had become dominant and had pushed the technology of graph-based autodiff to its summit. It would have been hopeless to attract users' interest with yet another graph-based system.

At that time, TensorFlow users had been complaining about the primary drawback of graph-based autodiff: debugging is very painful. However, it was months before Facebook released PyTorch and years before MLIR, Swift for TensorFlow, or the extension of Julia to deep learning. What I had as references for my thinking were DyNet and my view of the TensorFlow graph-building API as a compiler frontend that generates graphs as the intermediate representation (IR).

I think that if I had prototyped both ideas, the tape-based dynamic network approach and the extension of Yuan Yu's approach to support function definitions and invocations, I would have had more information to help me make an accurate assessment of the ROI of each direction. This is what I did with GoTorch: I tried to reuse the C++ cores of both PyTorch and TensorFlow. After studying the source code, I also asked for advice from Alex Passos, Yufeng Zhou, and other Google TensorFlow team members in August 2020, which helped me. I noticed that over 200 gradient operators in TensorFlow have only Python implementations. So, if I were going to build a non-Python deep learning system that could run efficiently on self-driving cars, I would have to manually rewrite these operators in C++, which is prohibitive for me and my friends working part-time on GoTorch.

Building a quick proof of concept before making an important decision is the lesson I learned.
