Large-Scale Deep Learning Model Engineering at SmartNews

SmartNews, Inc · Sep 28, 2022

Introduction

SmartNews is the world’s leading news aggregation app, with a large user base in the Japanese and US markets. In core businesses such as advertising, we use deep learning models extensively in scenarios such as search and recommendation to optimize the user experience and improve revenue. However, as our online business continues to grow, the number of training samples and the size of our models are growing rapidly, making it difficult for native TensorFlow to support our training needs. At the same time, rapid business growth also puts high demands on the speed and real-time performance of model iteration. How to support larger-scale and more real-time model training is an urgent problem for us to solve.

In this blog, we describe how we optimized the engineering architecture of our deep learning models to support online model optimization, performance improvement, and cost control, and we share some engineering details from the evolution process.

1. Larger Scale Model Architecture

Why do we need to support larger models at SmartNews?

  • More data leads to better model performance. Internal experiments have shown that increasing the daily training sample size yields a significant improvement in model performance. However, due to training-time constraints, we usually downsample the training data, especially the negative samples, before adding it to the training set. This discards a large number of training samples and hurts the model’s performance, so it would help if we could train on more samples within the same training time.
  • User embedding training leads to more personalized recommendations and improved metrics. User embeddings help the model learn user characteristics, especially long-term interests, which personalizes the user’s recommendation experience. However, because of the large number of users, introducing user embedding features significantly increases the size of the model and poses a challenge for model loading, training, and inference.

1.1 Distributed Training

1.1.1 How to build a distributed training service quickly?

Since most of our DNN models run well on CPUs, we chose a data-parallelism strategy based on the PS (parameter server) architecture, slicing the training data across worker machines for parallel training (in our practice, sharding the training files is faster than sharding the tf.data dataset).

On top of a k8s cluster, we use TFJob for distributed compute resource management and integrate AWS EFS for log and checkpoint storage. On the training side, we use ParameterServerStrategy as the distributed training strategy. Since Estimator has more complete support for the PS architecture than Keras, we mainly chose Estimator as the base for model development. The basic architecture is shown in the figure below; as a user, you only need to submit the model’s configuration file and can later get the full results of the model on the Dashboard.

Fig1. Distributed Training Architecture
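As a concrete illustration of the setup above, here is a minimal, hedged sketch of how a single worker in such a TFJob-managed Estimator job might be configured. The cluster addresses, file paths, and toy feature schema are assumptions for the example, not our production values; in practice TFJob injects TF_CONFIG into each pod.

```python
import json
import os

import tensorflow as tf

# TFJob normally injects TF_CONFIG into every pod; a worker's view is shown
# inline here for illustration only. Hosts and paths are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief":  ["chief-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps":     ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

feature_columns = [tf.feature_column.numeric_column("f0")]  # toy schema

def parse_example(record):
    spec = {
        "f0": tf.io.FixedLenFeature([], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, spec)
    return {"f0": parsed["f0"]}, parsed["label"]

def input_fn():
    # File-level sharding: each worker trains on its own subset of the
    # training files, which we found faster than sharding one tf.data
    # pipeline element-wise.
    files = tf.data.Dataset.list_files("/data/train/part-*", shuffle=False)
    files = files.shard(num_shards=2, index=0)  # (num_workers, worker_index)
    return (files.interleave(tf.data.TFRecordDataset, cycle_length=4)
                 .map(parse_example)
                 .batch(1024)
                 .repeat())

estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[512, 256, 128],
    model_dir="/efs/checkpoints/ctr",  # checkpoints on shared EFS
)

# train_and_evaluate reads TF_CONFIG and runs this process in its PS-cluster
# role (chief / worker / ps).
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn),
    tf.estimator.EvalSpec(input_fn=input_fn),
)
```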

1.1.2 How to optimize the performance of a distributed training model?

In real-world advertising training, the increased compute capacity of distributed training allows us to add a large number of negative samples that we could not include before, in the hope of improving the model’s generalization ability. However, directly adding this data did not give the expected improvement; instead, we observed a slight performance degradation.

Through our experiments, we found two effective ways to improve the performance of the distributed training model. First, when we increase the proportion of negative samples while keeping the previous, smaller batch size, the number of positive samples in a batch shrinks, especially in scenarios like advertising where positive samples are already sparse. It is therefore necessary to increase the batch size to guarantee enough positive samples per batch. Second, when we increase the proportion of negative samples, the cross-entropy loss we use is easily dominated by the negative samples. Here the weight of the negative samples in the loss function needs to be reduced, so that the model can benefit from the extra generalization brought by more data without being overwhelmed by negatives.
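As an illustration of the second point, here is a minimal sketch of down-weighting negatives in a binary cross-entropy loss. The value of NEG_WEIGHT is a hypothetical tuning knob, not a number from our experiments.

```python
import tensorflow as tf

NEG_WEIGHT = 0.2  # assumed weight applied to negative samples in the loss

def weighted_bce(labels, logits):
    # Per-example binary cross-entropy.
    per_example = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.cast(labels, tf.float32), logits=logits)
    # Positives keep weight 1.0; negatives are scaled down so the larger
    # negative pool does not dominate the gradient.
    weights = tf.where(tf.equal(labels, 1), 1.0, NEG_WEIGHT)
    return tf.reduce_mean(weights * per_example)
```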

Of course, we have also identified some problems during distributed training, of which the communication bottleneck is particularly significant. When the embedding size is too large, the communication time between workers and the PS slows down model training. We are therefore exploring how to combine the advantages of pipeline parallelism and model parallelism, e.g., splitting embedding lookup and neural-network computation into two pipeline stages that run in parallel.

1.2 Dynamic Embedding

Our early DNN models did not include user embedding as a trainable feature. However, as the business grows, it becomes more important to understand the user deeply and capture long-term interests. We therefore want to add user embeddings to our recommendation model as a personalized representation of the user. Given the large number of users, we need a model architecture that supports large-scale sparse embedding training.

1.2.1 Implementation of Dynamic Embedding

TensorFlow’s native static embedding cannot support dynamically adding and deleting embedding features, so we borrowed the dynamic-embedding implementation from TensorFlow Recommenders-Addons (TFRA) to support all types of embedding. However, since our business models are based on Keras + Feature Column, and TFRA’s support for that stack was incomplete at the time, we developed the various dynamic-embedding-related TensorFlow APIs ourselves, so that users can reuse the original APIs without changing any habits.
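For illustration, here is a small sketch of a TFRA dynamic-embedding lookup, in contrast to a static fixed-size table; the table name, dimension, and ids are illustrative rather than taken from our production models.

```python
import tensorflow as tf
import tensorflow_recommenders_addons as tfra

# Dynamic table keyed by raw user id (int64): it grows and shrinks as ids
# appear, so no vocabulary size has to be fixed up front.
user_embeddings = tfra.dynamic_embedding.get_variable(
    name="user_dynamic_embedding",
    dim=32,
    initializer=tf.keras.initializers.RandomNormal(stddev=0.01),
)

user_ids = tf.constant([10001, 99999999], dtype=tf.int64)  # arbitrary ids
# return_trainable=True also returns the wrapper that lets optimizers update
# only the looked-up rows.
emb, trainable = tfra.dynamic_embedding.embedding_lookup(
    params=user_embeddings, ids=user_ids, return_trainable=True)
```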

Because the embedding features have a huge number of keys, the model weights are difficult to hold in memory. We therefore use AWS ElastiCache as external storage for the embedding weights. In the external storage, we periodically clean up inactive users and run a series of service monitors. We then use a custom-developed ResourceVariable to dynamically update the embedding weights from ElastiCache, while at serving time the model accesses the same ElastiCache to obtain the latest embedding weights for online inference.

Fig2. Dynamic Embedding for Training and Serving
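To make the serving path concrete, here is an illustrative sketch (not our production code) of fetching embedding weights from ElastiCache for Redis at inference time. The key scheme "ue:{user_id}", the dimension, and the zero-vector fallback are assumptions for the example.

```python
import numpy as np
import redis

DIM = 32  # assumed embedding dimension
client = redis.Redis(host="my-cache.example.amazonaws.com", port=6379)

def fetch_user_embedding(user_id: int) -> np.ndarray:
    # Embedding vectors are assumed stored as packed float32 bytes.
    raw = client.get(f"ue:{user_id}")
    if raw is None:
        # Unseen or evicted (inactive) user: fall back to a zero vector.
        return np.zeros(DIM, dtype=np.float32)
    return np.frombuffer(raw, dtype=np.float32)
```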

1.2.2 Dynamic Embedding in practice for advertising scenarios

In the advertising CTR model, we mainly applied dynamic embedding to the user embedding, enabling iterative training and fast updates of user embedding features. However, during the initial experiments with user embedding, we found that the model could easily overfit. For training the user embedding we therefore use a restricted gradient-update schedule: in the first few epochs we only run forward propagation over the embedding weights, and only in the last epoch do we run backpropagation and gradient updates, which avoids the overfitting caused by this feature. We also found that the initialization of the user embedding has a significant impact on model convergence. We ultimately used the user embedding from an external model to initialize new user embeddings, and iterating on this basis achieved better results than random initialization. We eventually achieved an offline AUC gain of 0.001 on the advertising CTR model and successfully went live, achieving better online business gains.
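A minimal sketch of the forward-only-then-update idea: gate the gradient with tf.stop_gradient until the final epoch. The update_embeddings flag, and where the training loop flips it, are assumptions of this example, not our exact mechanism.

```python
import tensorflow as tf

def maybe_frozen(user_emb: tf.Tensor, update_embeddings: bool) -> tf.Tensor:
    # Early epochs: embeddings contribute to the forward pass but receive
    # no gradient. Final epoch: gradients flow and the table is updated.
    return user_emb if update_embeddings else tf.stop_gradient(user_emb)
```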

2. Near Real-time Training Streaming

2.1 Why real-time training streams are needed

In the past, training samples in our business went from generation to training mainly via offline batch tasks, as shown in the figure below, with a 2+ hour delay between sample generation and entering the model during peak traffic. However, many business scenarios have a very high demand for real-time modeling. If we can feed the user’s behavior into model learning faster, we can quickly capture the user’s interests and thus deliver better-quality recommendations.

Fig3. Batch Feature Pipeline

2.2 How to implement real-time training streams

We replaced all the offline batch tasks in the figure above with real-time Flink jobs, and we use KV storage to cache the feature data. By generating the training data stream in real time, we reduced the model update latency from 2+ hours to less than 10 minutes, resulting in a significant improvement in model performance.

Fig4. Stream Feature Pipeline

The feature data stream is dumped directly to KV storage, while the real-time action stream is first de-duplicated and aggregated to produce a label: an impression generates a negative sample by default, which is upgraded to a positive sample if a click follows. The aggregation ensures that only one label is generated for the same request + item within the time window, and de-duplicates actions that arrive after the window. The aggregated label data is then joined with the features in the KV store to produce a sample. To ensure a high success rate for the action-feature join and to prevent the feature stream from lagging behind the action stream, we coordinate the action job’s watermark with the event time of the feature events, so that the action stream always lags the feature stream.
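To make the labeling rule concrete, here is a toy, single-process sketch of the aggregation logic (our actual implementation runs in Flink); the window length and event fields are assumptions.

```python
from dataclasses import dataclass

WINDOW_SEC = 600  # assumed aggregation window length

@dataclass
class LabelState:
    first_ts: float
    label: int  # 0 = negative (impression only), 1 = positive (clicked)

states: dict[tuple[str, str], LabelState] = {}

def on_event(request_id: str, item_id: str, action: str, ts: float):
    # One label per (request_id, item_id) within the window.
    key = (request_id, item_id)
    state = states.get(key)
    if state is None:
        # First event in the window: impressions default to negative.
        states[key] = LabelState(first_ts=ts,
                                 label=1 if action == "click" else 0)
    elif ts - state.first_ts <= WINDOW_SEC:
        if action == "click":
            state.label = 1  # a later click upgrades the label to positive
    # Events beyond the window are dropped (de-duplicated).
```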

How do we verify the accuracy of the real-time stream? We compare the number of samples, the ratio of positive to negative samples, and sampled records between the real-time stream and the batch stream. Currently we do not validate the model before it is updated; instead, we monitor it in real time through online logs (e.g., computing AUC). Based on our experience and the model’s characteristics, samples may be missing but must not be wrong: if some samples are lost, the model recovers automatically, but if samples are wrong, the model can only be recovered by restoring a historical checkpoint and skipping the erroneous data.

3. More Efficient Training Methods

Deep learning models often require a lot of experimentation with different features and model structures to achieve better results. In this process, training models more “economically” matters greatly for saving cost, algorithm engineers’ time, and maintenance effort. Here we share two such practices: feature selection and Warm-Start training.

3.1 Feature selection

As a model is optimized, it tends to suffer from “feature bloat”, with some models accumulating hundreds to thousands of features. While having many features does not necessarily hurt model performance, it poses a greater challenge in terms of online serving latency, feature storage costs, and maintenance costs. We therefore provide a set of feature-pruning solutions.

3.1.1 Feature importance

In order to retain important features and remove unimportant ones, we need a feature-importance criterion. Since our models are DNNs, we use the variation of a feature’s gradient as its importance score. More specifically, we take the concat layer just before the features enter the fully connected layers, and compute the gradient of the final logit with respect to that layer’s input. We then multiply the gradient by the variance of the feature to approximate how much a change in the feature would affect the final output. We also normalize across feature dimensions, so that high-dimensional features do not dominate the final score simply because of their dimensionality.
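Here is a hedged sketch of this gradient-times-variance score. The callable standing in for the sub-network from the concat layer to the logit, the column-slice bookkeeping, and the normalization choice (averaging over each feature’s dimensions) are assumptions of this example.

```python
import tensorflow as tf

def feature_importance(concat_inputs, model_logit_from_concat, slices):
    """concat_inputs: [batch, total_dim] activations of the concat layer.
    model_logit_from_concat: callable mapping those activations to the logit.
    slices: {feature_name: (start, end)} column ranges in the concat layer."""
    with tf.GradientTape() as tape:
        tape.watch(concat_inputs)
        logit = model_logit_from_concat(concat_inputs)
    grads = tape.gradient(logit, concat_inputs)           # d(logit)/d(input)
    variance = tf.math.reduce_variance(concat_inputs, 0)  # per-dim variance
    score_per_dim = tf.abs(grads) * variance              # gradient x variance
    scores = {}
    for name, (start, end) in slices.items():
        # Average over the batch and over the feature's own dimensions, so
        # wide features are not favored merely for having more columns.
        scores[name] = float(tf.reduce_mean(score_per_dim[:, start:end]))
    return scores
```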

As the model is continuously updated, we generate a daily feature-importance report to track changes in feature importance and to alert on excessive changes.

3.1.2 Feature selection

Based on the above feature importance, we can select features to remove. Besides the importance score, the cost of each feature also needs to be considered. The so-called cost of a feature is mainly the extra storage and computation it incurs, driven by factors such as its dimensionality or sequence length. We therefore combine feature cost with feature importance (since importance is generated daily, we use a one-month average as the final score) to obtain a cost-performance ranking of features, which serves as the basis for feature-selection experiments.
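A toy illustration of the resulting cost-performance ranking; the feature names, importance values, and cost units are invented for the example.

```python
features = {
    # name: (30-day mean importance, relative storage/compute cost)
    "user_age":      (0.042, 1.0),
    "click_seq_90d": (0.051, 8.0),  # long sequence -> expensive
    "item_category": (0.010, 0.5),
}

# Rank by importance per unit cost; features at the bottom of the ranking
# are the first removal candidates.
ranking = sorted(features, key=lambda f: features[f][0] / features[f][1],
                 reverse=True)
print(ranking)
```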

We verified across numerous ad recommendation models that removing 50% of the features did not significantly degrade offline metrics or online performance, while training and inference time both dropped by 30–40%. This is a great help for cost savings and for reducing feature maintenance overhead.

3.2 Warm-Start Training

The process of optimizing a model is often accompanied by extensive feature-engineering experiments. However, if training starts from scratch after every feature addition, deletion, or change, the model must be trained on a long span of data (often more than six months) before it can be compared with the online model. This is a significant overhead in both machine cost and time. If we can instead initialize the existing feature weights from the weights already trained online, we can quickly verify the effect of the new features.

Take a typical DNN network as an example (Fig5 below): when we add new features to the original model, the weights of the gray layers are read from the online checkpoint, while the weights in the red box are added or changed. For brand-new weights, TensorFlow automatically initializes them according to their initializers, so no extra work is needed. When a new feature is added, however, the weights of the affected layer change in order and shape, so we need to dynamically adjust the positions of that layer’s weights according to the graph structure and initialize the newly added weights.

Fig5. Warm-Start Example
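Here is a minimal sketch of warm-starting with tf.estimator.WarmStartSettings, under the assumptions that the model is a DNNClassifier and that the scope names and paths below match the checkpoint. Note that this API only copies same-shaped variables: the shape-changed layer in the red box still needs the custom weight remapping described above.

```python
import tensorflow as tf

# Existing features plus a newly added one; names are illustrative.
feature_columns = [
    tf.feature_column.numeric_column("f0"),
    tf.feature_column.numeric_column("f_new"),  # the newly added feature
]

warm_start = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/efs/checkpoints/ctr_online",
    # Warm-start only the scopes untouched by the new feature (the gray
    # layers in Fig5). The first hidden layer's input shape changes, so it
    # is left out and must be remapped with the custom logic above; any
    # brand-new variables simply keep their initializer values.
    vars_to_warm_start=["dnn/hiddenlayer_1", "dnn/hiddenlayer_2",
                        "dnn/logits"],
)

estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[512, 256, 128],
    model_dir="/efs/checkpoints/ctr_experiment",
    warm_start_from=warm_start,
)
```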

By using Warm-Start, we can reduce most of our feature engineering experiments from a week to a day.

4. Future Plans

In the future, we will further explore and support faster, more real-time, and more efficient model training architectures. For example, as mentioned above, we will investigate how to better combine data parallelism and pipeline parallelism to relieve the communication bottleneck of distributed training. We will also explore more training optimizations on GPUs, including GPU operator optimization and single- and multi-card training strategies. On the real-time side, there is still much room to optimize the overall training pipeline, and the current 10-minute model update can be pushed further toward near-real-time feedback through Online Learning. Beyond supporting larger-scale and more real-time model training, we as the machine learning platform group will also provide a complete MLOps tool chain within the company, connecting the full chain from data generation to model deployment to make model training iterations more efficient.
