XDL Framework: Delivering Powerful Performance for Large-scale Deep Learning Applications
The Alibaba tech team open sourced its self-developed deep learning framework that goes where others have failed
Deep learning AI technologies have brought remarkable breakthroughs to fields including speech recognition, computer vision, and natural language processing, with many of these developments benefiting from the prevalence of open source deep learning frameworks like TensorFlow, PyTorch, and MxNet. Nevertheless, efforts to bring deep learning to large-scale, industry-level scenarios like advertising, online recommendation, and search have largely failed due to the inadequacy of available frameworks.
Whereas most open source frameworks are designed for low-dimensional, continuous data such as images and speech, the majority of Internet applications deal with heterogeneous data that is high-dimensional, sparse, and discrete, at parameter scales of tens or even hundreds of billions. Furthermore, many production applications demand real-time training and updates for large-scale deep learning models, vastly outpacing the distributed performance, computational efficiency, horizontal scalability, and real-time adaptability of existing open source frameworks.
Now, Alibaba has introduced an industrial-scale deep learning framework specifically designed and optimized for such scenarios. Known as X-DeepLearning, or XDL for short, it emerged from Alibaba’s online production ecosystem and distinguishes itself through its training scale, performance, and horizontal scalability.
This article looks in detail at the technical fundamentals of Alibaba’s novel framework, from its built-in algorithm solutions for advertising to its support for recommendation and search functions.
(Explore X-DeepLearning on Github: https://github.com/alibaba/x-deeplearning)
The following sections together offer an overview of the key components that drive XDL’s viability as a framework for industrial-scale deep learning — specifically, its kernel and readily equipped algorithm solutions.
The XDL framework
XDL Framework is designed for high-dimensional sparse data scenarios. It supports ultra-large-scale deep model training with up to 100 billion parameters, and multiple training modes such as batch learning and online learning.
With industry-level distributed training capabilities, it supports CPU/GPU hybrid scheduling and has a complete distributed failover strategy. Beyond this, it also has excellent horizontal scalability, allowing XDL to scale to thousands of concurrent training workers with minimal effort.
XDL’s kernel also offers efficient structured compression training, a method proposed to exploit the characteristics of Internet sample data. Compared with traditional flat-sample training, it greatly reduces sample storage space, sample IO cost, and the absolute volume of training computation in typical scenarios, increasing overall training efficiency by more than 10 times in recommender scenarios, for example.
Finally, the kernel offers mature multi-backend support. High-density network computation inside a single machine reuses mature open source frameworks, and only a small amount of distributed driver code needs to be changed to run single-machine code written for TensorFlow or MxNet on XDL and obtain XDL’s distributed training and high-performance sparse computation capabilities.
Built-in algorithm solutions
For click-through rate estimation, XDL’s latest algorithms include Deep Interest Network (DIN), Deep Interest Evolution Network (DIEN), and Cross Media Network (CMN).
The Entire Space Multi-Task Model (ESMM) algorithm jointly models click-through rate and conversion rate data.
In terms of matching, XDL uses the latest algorithm in the recall field: Tree-based Deep Matching (TDM).
For a light general model compression algorithm, XDL uses Rocket Training.
These algorithms are discussed further in the section of this article titled “Exploring XDL’s Algorithm Solutions”.
System Design and Optimization
The following sections introduce the design principles and optimization measures Alibaba has undertaken in building the XDL framework.
XDL-Flow: Data flow and distributed runtime
The XDL-Flow component drives the generation and execution of the entire deep learning computing graph, including the data reading and processing pipeline, sparse representation learning, and dense network learning. Simultaneously, it is responsible for the global consistency coordination of distributed model storage and parameter exchange control logic, and for distributed disaster recovery.
Sample sizes in search, recommender, advertising, and other such scenarios are huge, usually reaching anywhere from dozens to several hundreds of terabytes. If the sample reading and processing pipeline is not well optimized, the sample IO system can easily become a bottleneck for the entire system, leading to low utilization of computing hardware. In large-scale sparse scenarios, sample reading is IO-intensive, sparse representation computation is dominated by intensive parameter-exchange network communication, and dense deep computation is computation-intensive.
XDL-Flow adapts to the performance demands of these three types of tasks by running the three main stages asynchronously in parallel; in the best case, the latency of the first two stages is completely hidden. Alibaba is currently working on automated tuning of the various parameters of this asynchronous execution pipeline, including the parallelism of each step and buffer sizes, so that users do not need to worry about the details of parallel, asynchronous execution.
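The three-stage asynchronous pipeline can be sketched in plain Python with threads and bounded buffers. This is an illustration of the idea, not XDL-Flow's implementation; the stage functions and the `buffer_size` knob stand in for the pipeline parameters the text says Alibaba is auto-tuning.

```python
import queue
import threading

def run_pipeline(samples, io_fn, sparse_fn, dense_fn, buffer_size=4):
    """Run three pipeline stages (sample IO, sparse lookup, dense compute)
    concurrently, connected by bounded buffers so fast stages never race
    ahead of slow ones by more than `buffer_size` batches."""
    q1, q2 = queue.Queue(buffer_size), queue.Queue(buffer_size)
    results = []

    def sparse_stage():
        while True:
            item = q1.get()
            if item is None:          # sentinel: propagate shutdown
                q2.put(None)
                return
            q2.put(sparse_fn(item))

    def dense_stage():
        while True:
            item = q2.get()
            if item is None:
                return
            results.append(dense_fn(item))

    t1 = threading.Thread(target=sparse_stage)
    t2 = threading.Thread(target=dense_stage)
    t1.start(); t2.start()

    for s in samples:                 # the IO stage feeds the pipeline
        q1.put(io_fn(s))
    q1.put(None)
    t1.join(); t2.join()
    return results
```

Because each queue is FIFO with a single producer and single consumer, batch order is preserved even though all three stages overlap in time, which is why the latency of the IO and sparse stages can be hidden behind the dense computation.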
AMS: An efficient distributed model server
XDL’s AMS component is a distributed model server designed and optimized for sparse scenarios. For it, Alibaba has applied numerous software and hardware optimizations spanning small-packet network communication, parameter storage structures, parameter distribution strategies, and more. This makes AMS much better than traditional parameter servers in terms of both throughput and horizontal scalability. AMS also supports built-in deep network computing, enabling second-order calculation of the representation sub-network.
Much of AMS’s optimization has focused on the network communication layer, using software and hardware integration and technologies including Seastar, DPDK, CPUBind, and ZeroCopy to fully squeeze out hardware performance. Under large-scale concurrent training, the small-packet throughput generated by parameter exchange was found to be more than five times that of a traditional RPC framework used in testing.
By using a built-in dynamic parameter balancing strategy, AMS discovers a near-optimal sparse parameter distribution at runtime. This effectively solves the hot-spot problem caused by the uneven parameter distribution of traditional parameter servers and greatly improves the system’s ability to scale horizontally under high concurrency.
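To see why frequency-aware placement beats static hashing, consider a toy sketch: a naive `hash(key) % num_servers` placement ignores access frequency, while a greedy heuristic places the hottest keys first, each on the currently least-loaded server. This is an illustrative heuristic only, not AMS's actual balancing algorithm.

```python
import zlib
from collections import Counter

def hash_placement(keys, num_servers):
    """Static placement: key -> crc32(key) % num_servers, frequency-blind."""
    load = Counter()
    for key, freq in keys:
        load[zlib.crc32(key.encode()) % num_servers] += freq
    return load

def balanced_placement(keys, num_servers):
    """Greedy frequency-aware placement: assign the hottest keys first,
    each to the server with the smallest accumulated load."""
    load = Counter({s: 0 for s in range(num_servers)})
    placement = {}
    for key, freq in sorted(keys, key=lambda kv: -kv[1]):
        server = min(load, key=load.get)   # least-loaded server so far
        placement[key] = server
        load[server] += freq
    return placement, load
```

With a skewed frequency distribution, the greedy variant spreads the long tail around the single hot key instead of letting one unlucky server absorb several hot keys at once.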
AMS supports GPU-accelerated sparse embedding computation for large batch sizes and performs well in ultra-large batch-size scenarios. It also supports sub-networks defined to run inside the server: for example, in the cross-media modeling provided in XDL’s algorithm solutions, the image representation sub-network is defined to run in AMS, which greatly reduces repeated computation and network traffic.
Backend engine: Reusing a mature framework’s standalone capability
In order to take full advantage of existing open source deep learning frameworks for the dense deep network, XDL uses bridging technology to employ an open source framework (currently TensorFlow or MxNet) as its standalone dense network computing engine backend. Users can directly obtain XDL’s distributed training capability for large-scale sparse computation by modifying just a small amount of driver code, while retaining their TensorFlow or MxNet network development habits.
In other words, using XDL does not require learning a new framework language. This brings the additional benefit that XDL can seamlessly interface with the existing mature open source community, meaning users can easily extend some open source models from the TensorFlow community to industry-level scenarios using XDL.
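The bridging idea can be sketched in plain Python: a thin wrapper performs the sparse embedding stage (which XDL would shard across AMS servers) and then hands an ordinary dense tensor to an unmodified single-machine network. All names below are illustrative, not XDL's actual API, and the in-process table stands in for a remotely sharded one.

```python
import numpy as np

class SparseDenseBridge:
    """Wrap an unmodified single-machine dense network (any callable that
    maps a dense feature matrix to predictions) with a sparse embedding
    lookup that a distributed runtime would serve remotely."""

    def __init__(self, dense_net, embedding_dim, table_size, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a remotely sharded embedding table.
        self.table = rng.normal(size=(table_size, embedding_dim))
        self.dense_net = dense_net

    def forward(self, sparse_ids):
        # Sparse stage: embed and sum-pool each sample's feature IDs.
        dense_input = np.stack([self.table[ids].sum(axis=0)
                                for ids in sparse_ids])
        # Dense stage: reuse the existing network untouched.
        return self.dense_net(dense_input)
```

The key point is the interface: the dense network never sees sparse IDs, so any TensorFlow or MxNet model body that accepts a dense tensor could, in principle, be dropped in as `dense_net` without rewriting it.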
Compact Learning: Increasing training efficiency
Sample representation in sparse industry-level scenarios often presents strong structure, such as user features, commodity features, and scenario features. This structure means that certain features appear repeatedly across many samples; for example, a batch of samples may belong to the same user, in which case a large portion of the user features are identical. Structured sample compression exploits this massive local feature repetition, compressing features in both the storage and computation dimensions and thus saving storage, computation, and communication bandwidth.
In the sample preprocessing stage, the features to be aggregated are sorted (for example, by user ID to aggregate user features); in the batching stage, these features are compressed at the tensor level; in the computing stage, the compressed features are expanded only at the last network layer, which greatly reduces the computational overhead of the deep network. Verification in a recommender scenario showed that on typical production data, the AUC obtained from aggregated samples matches that from fully shuffled samples, while overall performance improves by a factor of more than 10.
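The storage side of the idea can be shown in a few lines. A minimal sketch, assuming a flat sample is a dict of user features, item features, and a label: grouping by user ID stores (and later embeds) each user's shared features once per group rather than once per sample.

```python
from collections import defaultdict

def compress_by_user(samples):
    """Group flat samples by user ID so the shared user features are
    stored once per group instead of once per sample."""
    groups = defaultdict(lambda: {"user_features": None, "items": []})
    for s in samples:
        g = groups[s["user_id"]]
        g["user_features"] = s["user_features"]   # identical within a user
        g["items"].append((s["item_features"], s["label"]))
    return dict(groups)

def storage_units(samples, compressed):
    """Count stored feature vectors before vs. after compression."""
    flat = 2 * len(samples)                       # user + item per sample
    packed = sum(1 + len(g["items"]) for g in compressed.values())
    return flat, packed
```

With 2 users and 4 items each, the flat layout stores 16 feature vectors while the grouped layout stores 10; the same ratio carries over to embedding lookups, since each group's user representation is computed once and broadcast to its items only at the final layer.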
Large-scale online learning
Online learning has been widely applied to industry-level scenarios in recent years as an in-depth combination of engineering and algorithms that gives the model the ability to capture online traffic changes in real time. It is especially valuable in scenarios where timeliness is crucial, such as e-commerce promotion scenarios, in that online learning can capture changes in user behavior in a more real-time manner and significantly improve the real-time effectiveness of the model.
XDL provides a complete online learning solution that supports real-time continuous learning. XDL has built-in support for Kafka as a message source and allows users to control the model-writing cycle according to their settings. Additionally, to prevent the unrestricted inflow of new features from causing real-time model explosion, XDL has built-in functions such as automated feature selection and expired-feature elimination, making XDL convenient to use for online learning.
Traditional machine learning frameworks generally require sparse features to be re-encoded as compact IDs (numbered consecutively from 0) to ensure training efficiency. XDL allows training directly on the original hashed features, which greatly reduces feature-engineering complexity and significantly speeds up the end-to-end modeling pipeline. This matters even more in real-time online learning scenarios.
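The contrast between the two feature-ID schemes can be made concrete. A minimal sketch: the hashing approach maps a raw feature string straight to an embedding slot, needing no vocabulary pass, while the traditional approach must first scan the data to build a 0-based dictionary (the function names are illustrative).

```python
import hashlib

def hashed_feature_id(feature, table_size):
    """Map a raw feature string directly to an embedding slot via hashing,
    skipping the offline pass that builds a compact 0-based ID dictionary."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % table_size

def compact_ids(features):
    """Traditional approach: a full scan over the data to assign dense,
    consecutive IDs -- impractical when new features arrive in real time."""
    return {f: i for i, f in enumerate(sorted(set(features)))}
```

In online learning, a never-before-seen feature such as a new user or item gets a valid slot from `hashed_feature_id` immediately, whereas `compact_ids` would require rebuilding the dictionary and remapping the model.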
With real-time feature-frequency control, users can set a threshold for feature filtering; for example, only features that appear more than N times are included in model training. The system applies a probabilistic discarding algorithm for feature selection, which greatly reduces the space occupied in the model by invalid, ultra-low-frequency features.
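A minimal sketch of threshold-based admission, with illustrative names rather than XDL's API: a feature's parameters are only created once it has been observed at least N times. (XDL's probabilistic variant would discard low-frequency features stochastically instead of keeping exact counts; only the simpler counting form is shown here.)

```python
from collections import Counter

class FeatureAdmission:
    """Admit a feature into the model only after it has been seen at
    least `threshold` times; until then its updates are discarded."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.admitted = set()

    def observe(self, feature):
        """Record one occurrence; return True once the feature is admitted."""
        self.counts[feature] += 1
        if self.counts[feature] >= self.threshold:
            self.admitted.add(feature)
        return feature in self.admitted
```

Ultra-low-frequency features thus never allocate embedding rows, which is where most of the space savings in a hundred-billion-parameter model come from.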
When running online learning over long periods, users can also activate XDL’s expired-feature elimination function. The system then automatically removes feature parameters that have weak influence and have not been touched for a long time.
Exploring XDL’s Algorithm Solutions
As previously mentioned, XDL’s algorithm solutions enable a range of applications spanning click-through rate prediction, conversion rate estimation, match and recall, and model compression. The following sections explore specific algorithms for each of these areas in detail.
Click-through rate prediction
XDL’s first click-through rate prediction algorithm is Deep Interest Network, or DIN for short.
Traditional embedding-and-MLP models do little to model user representation: historical user behaviors are projected into a fixed-length vector space via embedding, and a fixed-length user vector is then obtained by a sum/average pooling operation. However, users’ interests are diverse, and it is very difficult to express a user’s varied interests with a single fixed vector. When facing different products, users show different interests, and only the interests relevant to a specific product affect their decisions.
Therefore, when estimating a user’s click-through rate for a given product, only the user’s interests relevant to that product need to be expressed. With DIN, XDL proposes an interest activation mechanism that uses the candidate product to activate the relevant parts of the user’s historical behavior, and estimates the user’s interest in that product on this basis.
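The activation mechanism can be sketched as attention-weighted pooling over behavior embeddings. This is a simplified illustration: real DIN learns the relevance scores with a small MLP and does not necessarily normalize the weights, whereas this sketch uses a dot product and a softmax.

```python
import numpy as np

def din_pooling(behaviors, candidate):
    """Weight each historical behavior embedding by its relevance to the
    candidate item, then pool: an activation-based user representation.

    behaviors: (num_behaviors, dim) matrix of behavior embeddings
    candidate: (dim,) embedding of the item being scored
    """
    scores = behaviors @ candidate                # relevance logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over history
    return weights @ behaviors                    # weighted sum pooling
```

Unlike plain sum/average pooling, the resulting user vector changes with the candidate item: behaviors related to the product dominate, so the same history yields different representations for different ads.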
XDL’s second such algorithm, Deep Interest Evolution Network (DIEN), addresses the problems of interest extraction and interest evolution. In interest extraction, traditional algorithms simply treat users’ historical behavior as a direct indicator of their interest. Moreover, the supervision signal for the whole model is concentrated in advertisement click samples, which only reflects a user’s interest at the moment of deciding whether to click on an ad; it is difficult to model the user’s interest at the moment each historical behavior occurred.
With DIEN, XDL introduces an auxiliary loss in the interest extraction module: the model is constrained to infer the subsequent behavior from the hidden-layer representation at each historical behavioral moment. Ideally, such a hidden representation better reflects the user’s interest at each moment of action.
Following the interest extraction module, XDL proposes an interest evolution module. Whereas traditional RNN-style methods can only model a single sequential order, in e-commerce scenarios the evolution of different users’ interests differs widely. XDL therefore proposes AUGRU (Activation Unit GRU), which makes the GRU’s update gate relevant to the predicted item. When modeling the evolution of a user’s interest, AUGRU constructs different interest-evolution paths according to the target product being predicted, and infers the user’s interests relevant to that product.
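One AUGRU step can be written out directly: it is a standard GRU cell whose update gate is scaled by the attention score of the current behavior with respect to the target item. A minimal sketch with explicit weight matrices (bias terms omitted for brevity):

```python
import numpy as np

def augru_step(h_prev, x, att, params):
    """One AUGRU step: a GRU cell whose update gate is multiplied by
    the attention score `att` in [0, 1] of the current behavior
    relative to the target item.

    params: (Wu, Uu, Wr, Ur, Wh, Uh) weight matrices.
    """
    Wu, Uu, Wr, Ur, Wh, Uh = params
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    u = sigmoid(Wu @ x + Uu @ h_prev)          # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)          # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))
    u = att * u                                # attentional update gate
    return (1.0 - u) * h_prev + u * h_tilde
```

The scaling is what builds target-dependent evolution paths: when `att` is near 0, the hidden state passes through unchanged, so behaviors irrelevant to the predicted item barely perturb the evolving interest state.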
Finally, XDL’s Cross Media Network (CMN) algorithm introduces additional modalities, such as image information, into the click-through rate prediction model. On top of the original ID features, image visual features are added; together, these feed the CTR prediction model, which achieved significant improvement in application to large-scale Alimama data.
CMN includes a number of technical features. First, its image content feature extraction model is jointly trained and optimized with the main model. Second, image information is used to express both the advertisement and the user, where images corresponding to the user’s historical behavior serve as the user expression. Third, to handle the massive volume of image data involved in training, the Advanced Model Server calculation paradigm is proposed, effectively reducing calculation, communication, and storage loads during training. Beyond image-based signals, CMN can also be used to extract features from text, video, and other sources under the same paradigm.
Conversion rate estimation
XDL’s Entire Space Multi-Task Model (ESMM) is a new multi-task joint training algorithm paradigm developed by Alimama.
The ESMM model introduces the idea of learning CVR indirectly through the auxiliary tasks of CTR and CTCVR. By exploiting the sequential structure of user behavior, it performs modeling over the entire sample space, avoiding the sample-selection bias and training-data sparsity problems that plague traditional CVR models. It has since achieved significant results.
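The joint objective is compact enough to write down. A minimal sketch, assuming squared-error is replaced by cross-entropy as in CTR practice and treating the two tower outputs as given probabilities: the model never fits pCVR on clicked samples alone, which is exactly how the sample-selection bias is avoided.

```python
import numpy as np

def esmm_loss(p_ctr, p_cvr, click, purchase, eps=1e-12):
    """ESMM objective over the full impression space:
    L = CE(click, pCTR) + CE(purchase, pCTR * pCVR).
    pCVR only ever appears through the product pCTCVR = pCTR * pCVR,
    so it is supervised on all impressions, not just clicked ones."""
    p_ctcvr = p_ctr * p_cvr
    ce = lambda y, p: -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return np.mean(ce(click, p_ctr) + ce(purchase, p_ctcvr))
```

At serving time pCVR can still be read off directly from its tower, but its training signal always flows through the full-space CTCVR task.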
ESMM can easily be extended to sequence-dependent user behavior prediction (browsing, clicking, purchasing, and so on) to build a full-link multi-target prediction model. The BASE sub-network in the ESMM model can be replaced with any learning model, so the ESMM framework can readily integrate other learning models and absorb their advantages to further enhance the learning effect.
Match and recall
XDL’s Tree-based Deep Match (TDM) is a complete, tree-based deep learning framework for recommender matching, developed independently by Alibaba. It achieves efficient whole-corpus retrieval by building a hierarchical tree of user interests, and on that basis enables deep models incorporating Attention and other advanced computing structures. This computational structure achieves significant improvements in accuracy, recall, and novelty over traditional recommender methods.
Further, TDM implements a complete joint training and iteration framework of initial tree, model training, tree reconstruction, and model retraining, which further improves effectiveness. Joint training gives the TDM algorithm framework strong versatility and provides a solid theoretical basis, as well as engineering feasibility, for migrating TDM to new scenarios and domains.
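The retrieval mechanism behind "efficient whole-corpus retrieval" is layer-wise beam search over the interest tree. A minimal sketch, assuming a toy tree as an adjacency dict and an arbitrary scoring callable in place of the learned interest model:

```python
import heapq

def tdm_retrieve(score_fn, root, children, beam_size):
    """Layer-wise beam search over an interest tree: at each level keep
    the `beam_size` highest-scoring nodes and expand their children;
    leaves that survive the beam are the retrieved candidates.
    Cost is O(beam_size * tree_depth) model evaluations, not O(corpus)."""
    frontier, leaves = [root], []
    while frontier:
        expanded = []
        for node in frontier:
            kids = children.get(node, [])
            if not kids:                      # reached a leaf (an item)
                leaves.append(node)
            expanded.extend(kids)
        frontier = heapq.nlargest(beam_size, expanded, key=score_fn)
    return heapq.nlargest(beam_size, leaves, key=score_fn)
```

Because the model is re-evaluated at every level rather than only at the leaves, arbitrarily complex scorers (including attention networks) can be used while retrieval cost stays logarithmic in the corpus size.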
Model compression
For model compression, XDL uses an algorithm called Rocket Training.
Real-time online inference in industrial applications imposes very stringent response-time requirements, which limits model complexity to some extent. This limited complexity can in turn reduce a model’s learning ability and effectiveness.
There are currently two ways to address this problem. The first reduces inference time for a fixed model structure and set of parameters through numerical compression, or by designing more streamlined model structures and computation schemes, as in MobileNet, ShuffleNet, and so on.
The second uses a complex model to assist the training of a streamlined model; at test time, only the learned small model performs inference. The two schemes do not conflict: in most cases, the second can be combined with the first to further reduce inference time. Meanwhile, given the strict online response-time budget, ample offline training time remains available for training the complex model. Rocket Training belongs to the second category and is relatively light and elegant: it is highly versatile, can tailor model complexity to system capabilities, and provides a means of stepless speed regulation. In Alimama’s production environment, Rocket Training has proved able to conserve online computing resources and significantly improve the system’s ability to cope with peak traffic during the annual 11.11 Global Shopping Festival.
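The joint objective can be sketched in a few lines. This is an illustration under simplifying assumptions (regression losses, shared input representation, gradient blocking noted only in a comment), not Rocket Training's exact formulation:

```python
import numpy as np

def rocket_losses(shared, light_net, booster_net, y, lam=1.0):
    """Rocket Training sketch: a heavy booster net and a light net share
    the same input representation and train jointly; a hint loss pulls
    the light net's output toward the booster's. Only the light net is
    kept for serving."""
    z_light = light_net(shared)
    z_boost = booster_net(shared)
    mse = lambda a, b: np.mean((a - b) ** 2)
    task_loss = mse(z_light, y) + mse(z_boost, y)
    # Hint loss: in the real method, gradients from this term are blocked
    # from flowing into the booster (a stop-gradient on z_boost), so the
    # booster is never dragged down toward the light net.
    hint_loss = mse(z_light, z_boost)
    return task_loss + lam * hint_loss
```

Because both networks train on the same batches at the same time, the light model receives the booster's guidance throughout training rather than only distilling from a frozen teacher afterwards.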
Findings from Benchmark Data
The following sections detail benchmark data on XDL. Its training performance and horizontal scalability in large-batch, small-batch, and other training scenarios are especially notable, along with the speedup delivered by structured compression training.
CPU training-based deep CTR model
For the model structure, Alibaba chose a Sparse Embedding DNN: N-way sparse features are embedded, and NFM-style features are then obtained via BiInteraction. Two feature-scale scenarios were selected: a total sparse-feature size of 1 billion (corresponding to 10 billion parameters) and of 10 billion (corresponding to 100 billion parameters). The dense dimension is several hundred, and the number of sparse feature IDs per sample ranges from 100+ to 300+. Training used asynchronous SGD with BatchSize=100.
As benchmark results indicate, XDL has strong advantages in high-dimensional sparse scenarios, and maintains strong linear scalability in the case of significant concurrency.
GPU training-based deep CTR model
For this model, a structure of 51-way embedding plus a 6-layer fully-connected network was used, with batch_size=5000. Each batch contains over a million sparse IDs, around 200,000 after deduplication. Trained with large batches on GPU, XDL again shows good linear scalability.
Structured compression model
In this case the test-sample compression ratio is 4:1 and the batch size is 5000 samples. As the table below shows, structured compression improves overall training performance by a factor of 2.6.
As the above benchmark results indicate, XDL offers a high-performance deep learning solution uniquely suited to large-scale online business, an area previously inaccessible to open source frameworks. With a strong design foundation and built-in algorithms that tackle the unique demands of data in highly complex tasks, the framework is well positioned to continue its open source development and further support the challenging scenarios that businesses like Alibaba’s introduce.
Interested developers can explore X-DeepLearning further on Github: https://github.com/alibaba/x-deeplearning
(Original article by Zhu Xiaoqiang 朱小强)