Interview with Tencent Big Data Technology Team: Tencent Launched Open-source Computing Platform Named Angel (Part Ⅰ)

Synced · SyncedReview
Apr 23, 2017 · 12 min read

Intro:

As deep learning technology has advanced in recent years, many machine learning platforms have been open-sourced rather than kept for proprietary use. Today, a technology firm without a leading machine learning platform of its own risks embarrassment. Google has TensorFlow, Microsoft has CNTK, Facebook is a strong supporter of Torch, IBM backs Spark behind the scenes, Baidu recently open-sourced PaddlePaddle, and Amazon has announced MXNet as its platform of choice.

Tencent has now joined this wave. At the Tencent Big Data Technology and KDD China Technology Summit, held in Shenzhen on December 18, 2016, the company launched Angel, its third-generation high-performance machine learning computing platform, and stated that it would open-source the platform in the first quarter of 2017.

During the summit, Tencent’s vice president Yao Xing said, “The advancement of artificial intelligence over the past 60 years has had its ups and downs, and we are experiencing a peak now. Cloud computing and big data are major contributors to this evolutionary change. Learning how to handle big data properly, and how to further mine and analyze the limited data resources we have, will be a crucial step. This step will drastically affect the future development and upgrading of the entire industry. I believe big data will be the backbone of this digital revolution, while algorithms will be its soul.”

At the summit, Jie Jiang, general manager of Tencent’s data platform and its chief data expert, shared how Tencent develops its big data capabilities and the ecosystem built around the Angel system. As Tencent’s third-generation computing platform, Angel uses Java and Scala as its primary programming languages and is a high-performance distributed computing framework designed for machine learning. The platform was jointly developed by Tencent’s data platform department, the Hong Kong University of Science and Technology, and Peking University. It uses a parameter server architecture to solve the scalability issues of the previous-generation frameworks. Angel supports both data parallelism and model parallelism, and can train models with billions of dimensions.
Furthermore, Angel combines a variety of the industry’s latest technologies with Tencent’s own research outcomes, delivering higher performance and an easier-to-use system. Angel is now employed in Tencent Video, Tencent social advertising, user-profile mining, and other recommendation businesses, and has become the next-generation core computing platform of Tencent’s big data.

Recently, Synced conducted an exclusive interview with Jie Jiang, general manager of Tencent’s data platform and chief data expert, in which he discussed the development of Angel and the story behind it in detail. (Jiang’s speech at the Tencent Big Data Technology and KDD China Technology Summit will also be appended.)

1) The Features and Advantages of Angel

Synced: Why did Tencent choose to open-source Angel at this point? What do you think of the current open source machine learning platforms? What are Angel’s advantages over them?

Jiang Jie: We didn’t deliberately choose a time or date; it happened naturally. Angel has been used internally at Tencent for some time, and its stability and performance have been proven by Tencent’s business workloads. The system has reached a certain maturity, so it is about time to open it to all users, hoping to inspire more innovative ideas. Eventually, the platform will grow into a valuable ecosystem.

Some of the current machine learning platforms are:

1) Spark (MLlib): uses the MapReduce computational model to perform distributed machine learning. Highly versatile, but not suitable for large-scale models.

2) Petuum: its biggest contribution is verifying the feasibility of SSP (Stale Synchronous Parallel). It has comprehensive functionality. However, for now, it is more of a laboratory product than a mature industrial application.

3) TensorFlow: replaced DistBelief as Google’s open source machine learning platform. It provides a tensor dataflow programming model, with specialized common operators and GPU parallelism for deep learning. TensorFlow’s open source version is well suited to single-machine, multi-GPU environments, but hits bottlenecks in multi-machine, multi-GPU settings.

Synced: What is Angel’s most appealing feature to developers?

Jiang Jie: Higher performance and greater usability. It has also passed testing on billion-dimension-scale applications at Tencent, which makes it suitable for industrial use.

Synced: According to this data (figure below), Angel’s iteration time is significantly better than Spark’s, and the gap widens as models grow larger. Can you explain, in an easy-to-understand way, how you achieved this performance?

Jiang Jie: Angel’s model is distributed across multiple high-performance parameter servers, and the pull and push of the model are specially optimized. In contrast to Spark’s single-point model broadcast mode, the larger the model, the greater Angel’s advantage.
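To make the contrast concrete, here is a minimal Scala sketch of a parameter-server training step. The PSClient trait and its pull/push methods are hypothetical stand-ins for a generic parameter-server interface, not Angel’s actual API.

```scala
// Hypothetical parameter-server client; names are illustrative, not Angel's API.
trait PSClient {
  def pull(indices: Array[Long]): Array[Double]              // fetch only the needed weights
  def push(indices: Array[Long], grads: Array[Double]): Unit // send sparse updates back
}

// One SGD pass over a worker's data partition for logistic regression.
// Each sample is (featureIndices, featureValues, label).
def trainPartition(ps: PSClient,
                   data: Iterator[(Array[Long], Array[Double], Double)],
                   lr: Double): Unit = {
  for ((indices, values, label) <- data) {
    val w = ps.pull(indices)                         // pull a slice of the model, not all of it
    val margin = w.zip(values).map { case (wi, xi) => wi * xi }.sum
    val err = 1.0 / (1.0 + math.exp(-margin)) - label
    ps.push(indices, values.map(v => -lr * err * v)) // push only the touched entries
  }
}
```

Under a broadcast scheme, by contrast, every executor receives the full weight vector each iteration, so communication grows with total model size; the pull here grows only with the features a partition actually touches.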

Synced: How does Angel compare with Spark, Petuum, GraphLab (the core technology of Turi, which was acquired by Apple), and other platforms?

Jiang Jie: Spark is very versatile, but its structure is not suitable for large-scale parameter exchange, which is why we developed Angel. Petuum’s greatest contribution is validating the feasibility of SSP, and it has relatively comprehensive functionality; nonetheless, to a certain extent it is more of a laboratory product than a mature industrial application. GraphLab is very good at handling graphs, but many machine learning algorithms are not naturally expressed as graph abstractions, so the platform lacks versatility, and its fault tolerance is merely average. Angel combines the strengths of Spark and Petuum, avoids some of their shortcomings, and further improves performance, usability, and reliability.

Synced: Why did you use Java and Scala to develop the system, instead of C/C++?

Jiang Jie: For the sake of continuity. The Tencent big data platform originated with Hadoop and Spark, which are JVM-based. We took user habits into consideration and used the same languages, so users can adopt the new platform at lower cost. In addition, Scala offers richer, more expressive interfaces, which makes it more user-friendly.

The other factor is simplicity of deployment and upgrading. Our previous distributed platforms are mainly built on Java. If we run Angel on the same machines and request the corresponding resources, the entire process is transparent, with very low migration cost.

Synced: We have learned that Angel supports Latent Dirichlet Allocation (LDA), Matrix Factorization (MF), Logistic Regression (LR), and Support Vector Machine (SVM). These models are inseparable from matrix calculations. Can you talk a bit about Angel’s optimizations for matrix calculation?

Jiang Jie: Angel currently provides a Vector and Matrix library. It supports not only sparse and dense representations and common storage formats (CSR, COO, etc.), but also common data types and linear algebra operations.
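As a rough illustration of the two storage formats mentioned (my sketch, not Angel’s actual classes), here is how the same sparse matrix can be held in COO and CSR form:

```scala
// The same 3x4 sparse matrix in two formats; illustrative only, not Angel's classes.
//   [ 5 0 0 2 ]
//   [ 0 0 3 0 ]
//   [ 0 1 0 0 ]

// COO (coordinate list): one (row, col, value) triple per non-zero entry.
case class COO(rows: Array[Int], cols: Array[Int], values: Array[Double])
val coo = COO(Array(0, 0, 1, 2), Array(0, 3, 2, 1), Array(5.0, 2.0, 3.0, 1.0))

// CSR (compressed sparse row): entries of row i live at positions
// rowPtr(i) until rowPtr(i + 1), so the per-entry row index is compressed away
// and slicing out a row is O(1).
case class CSR(rowPtr: Array[Int], cols: Array[Int], values: Array[Double])
val csr = CSR(Array(0, 2, 3, 4), Array(0, 3, 2, 1), Array(5.0, 2.0, 3.0, 1.0))
```

COO is convenient for incremental construction; CSR trades that for compact storage and fast row access, which suits row-oriented model slices.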

Synced: How did you optimize the parameter server? How is it different from DistBelief?

Jiang Jie: Angel is built on a parameter server architecture, and we have made many optimizations compared to other platforms. First, we support BSP, SSP, and ASP computation and parameter-update modes. Second, we support model parallelism, so the parameter model can be partitioned more flexibly. Third, we have a compensation mechanism: the servers serve slower nodes first. In our tests, this reduces waiting time for large models by 5% to 15%. Finally, we have made many optimizations for parameter updates, such as filtering out the sparse matrix’s zero-valued parameters and already-converged parameters, and compressing different algorithms’ traffic based on their parameter values to minimize network load. We also reorder parameter fetching and computation, pulling parameters while other variables are being computed, which can save 20–40% of calculation time.
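To make the three modes concrete: BSP places a full barrier at every iteration, ASP removes all waiting, and SSP bounds how far the fastest worker may run ahead of the slowest. A minimal sketch of the SSP condition (illustrative, not Angel’s scheduler):

```scala
// Illustrative SSP (Stale Synchronous Parallel) gate; not Angel's code.
// staleness = 0 reduces to BSP (a barrier every iteration);
// staleness = Int.MaxValue effectively gives ASP (no waiting at all).
def mayProceed(myClock: Int, slowestClock: Int, staleness: Int): Boolean =
  myClock - slowestClock <= staleness
```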

We read some papers on DistBelief, and the computational principles are somewhat similar, but because it is not open source there is no way to make a detailed comparison. Of course, Google has since replaced it with TensorFlow.

Synced: In order to support hundreds of millions, even billions, of feature dimensions, you need to improve both the system infrastructure and the algorithms, and every single algorithm has to be optimized. What are the main optimizations Angel has made in terms of infrastructure and algorithms?

Jiang Jie: As mentioned earlier, Angel is based on a distributed parameter server architecture, which removes Spark’s bottleneck in parameter updating and computation. At the same time, we improved parameter and network scheduling, reduced network load, and made many architectural optimizations. Angel now supports both data parallelism and model parallelism, and therefore larger models.

There is a wide range of algorithms to choose from, and each has its own specific optimizations, but there are also common optimization methods (a sketch of the first follows the list):

  • Low-precision compression of the transmitted model: floating-point values are encoded in fewer bytes to reduce network traffic and speed the system up.
  • Each computing node builds an index and pulls from the PS only the subset of the model it needs.
  • Updates with little impact on the model are filtered out, reducing the amount of data sent over the network.
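As a rough sketch of the low-precision idea (my illustration, not Angel’s codec): a 32-bit float whose range is known can be quantized to a 16-bit integer, halving the bytes on the wire at the cost of some precision.

```scala
// Illustrative 16-bit linear quantization of floats in [min, max]; not Angel's codec.
def quantize(xs: Array[Float], min: Float, max: Float): Array[Short] = {
  val scale = 65535.0f / (max - min)
  xs.map(x => (((x - min) * scale) - 32768).round.toShort)
}

def dequantize(qs: Array[Short], min: Float, max: Float): Array[Float] = {
  val scale = (max - min) / 65535.0f
  qs.map(q => (q + 32768) * scale + min)
}
```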

In addition to these common methods, Angel has made many targeted optimizations, for example in GBDT and LDA.

● GBDT

○ Angel provides a custom pull function on the PS side, so the split at a tree node can be completed on the server, avoiding shipping the entire gradient histogram back to the computing node and greatly reducing network traffic. When computing nodes push their local gradient histograms to the PS side, the system applies low-precision compression.
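A hedged sketch of the server-side split idea (hypothetical names and a standard XGBoost-style gain, not Angel’s interface): instead of pulling the merged histogram, the worker asks the server to scan it and return only the winning split.

```scala
// Hypothetical server-side split finding for GBDT; names are illustrative.
case class Split(bin: Int, gain: Double)

// Runs ON the parameter server: scan the aggregated gradient/hessian histogram
// and return only the best split point (a few bytes) instead of the whole histogram.
// Gain is the usual second-order formula, with constant factors omitted.
def bestSplit(gradHist: Array[Double], hessHist: Array[Double], lambda: Double): Split = {
  val totalG = gradHist.sum
  val totalH = hessHist.sum
  var leftG = 0.0
  var leftH = 0.0
  var best = Split(-1, 0.0)
  for (bin <- gradHist.indices.dropRight(1)) {
    leftG += gradHist(bin); leftH += hessHist(bin)
    val rightG = totalG - leftG
    val rightH = totalH - leftH
    val gain = leftG * leftG / (leftH + lambda) +
               rightG * rightG / (rightH + lambda) -
               totalG * totalG / (totalH + lambda)
    if (gain > best.gain) best = Split(bin, gain)
  }
  best
}
```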

● LDA

○ Angel implements a variety of LDA samplers and selects the appropriate one for each application scenario. It takes full advantage of the sparsity and non-uniform distribution of the data, and provides efficient compression to reduce the amount of data transmitted. The system determines its partitioning strategy from the distribution of the matrix so as to balance load across the parameter servers.

○ It also performs fine-grained scheduling over different words, choosing whether to run each computation on the worker or on the server based on the sizes of the word-topic matrix and the document-topic matrix, which reduces network overhead.
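The routing decision just described can be caricatured as a size comparison. This is speculative pseudologic on my part, not Angel’s implementation:

```scala
// Illustrative routing rule: do the sampling wherever fewer bytes have to move.
// wordTopicRowBytes: size of this word's row of the word-topic matrix;
// docTopicBytes: size of the document-topic state that would otherwise travel.
def sampleOnServer(wordTopicRowBytes: Long, docTopicBytes: Long): Boolean =
  docTopicBytes < wordTopicRowBytes
```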

Synced: Is Angel’s in-memory computation the same as Spark’s? We know that the difficulty of in-memory computing lies in resource allocation and memory management. As a service platform, how does Angel deal with the different sizes, frequencies, algorithms, and time requirements of workloads across the Tencent corporation?

Jiang Jie: Yes, Angel also computes in memory, but its memory footprint is much smaller than Spark’s, because Angel is used mainly for machine learning, especially optimization. In addition, Angel is not a resident service: each computing task is independent, and its life cycle matches that of the task, with no long-term resource occupation. Users can set the resources Angel occupies via parameters, and the system can also estimate default resource occupancy from the amount of training data and the model size.

2) Angel and Deep Learning

Synced: How does Angel support deep learning and reinforcement learning? Does it support GPU?

Jiang Jie: Yes, Angel supports GPU-based deep learning, and it supports DL4J. In addition, Angel can work with frameworks such as Caffe, Torch, and TensorFlow to accelerate their computation. Two years ago, when we tried deep learning in digital advertising, we achieved very good advertising results by combining deep learning with online learning platforms. Now we are running applied experiments in reinforcement learning, exploring how to combine deep learning with reinforcement learning techniques.

Synced: How difficult is it to migrate architectures built on frameworks like TensorFlow and MXNet to Angel?

Jiang Jie: At the overall architecture level, we designed Angel with compatibility with different computation frameworks in mind. Moreover, we built many complementary tools to reduce migration costs, thereby reducing the overall difficulty of migration.

3) Security and Privacy

Synced: As users demand better information security and data privacy, what information security and data privacy challenges do Tencent’s big data analysis services face? How did these requirements affect the design and implementation of a system like Angel?

Jiang Jie: Data security has always been Tencent’s top priority, and we have many requirements for it; every level of the platform requires full technical security. For Angel specifically: first, it has a complete user authentication and access control system, so illegal users cannot log in and legitimate users can only see their own data. Second, Angel’s data is stored on distributed storage systems with high fault tolerance and availability, so data is never lost; at the same time, it is sharded and encrypted in a specific format. In addition, data belonging to different services is isolated. Finally, Angel has a comprehensive monitoring system and audit logs, so illegal access is spotted and dealt with promptly.

4) Background and Outlook

Synced: In the Sort Benchmark contest, the Tencent team won first place in both GraySort and MinuteSort. What technologies did you use to speed up the application, and how did you achieve such an improvement?

Jiang Jie: As for competitions, we won on the strength of prior accumulation. We have developed our platforms for seven years, through three generations of evolution and three stages: offline computing, real-time computing, and finally machine learning. Every day our platform processes Tencent’s enormous business volumes, some from very complex businesses, which forces us to handle resource scheduling and high-performance computing well. The platform must be highly flexible and highly capable.

Synced: Are there any products built on Angel? Are there any issues in promoting the application?

Jiang Jie: Angel is built for large-scale machine learning. This year we are applying it to Tencent Video, digital advertising, user-profile mining, and other recommendation businesses, and it is very effective. At present, all of Tencent’s business groups have services using it, and the user base keeps growing. There have been plenty of problems along the way, mainly because users need a step-by-step process to accept something new; there is a certain amount of learning and business migration involved. Therefore, we have done a lot of work on ease of use and business migration to lower the barrier to adoption.

Synced: How many resources have been invested in developing this framework? How many people are on the development teams?

Jiang Jie: The Angel project began in 2014 and officially launched in early 2015. It started with only four people, and the team gradually grew. Through our collaboration with Peking University and the Hong Kong University of Science and Technology, a total of six doctoral students joined the development team. At present there are more than 30 people across the development team (covering systems, algorithms, and technical support), the testing and maintenance team, and the product design and operations team.

Synced: Angel already supports the SGD and ADMM optimization algorithms. What algorithms might Angel support later on?

Jiang Jie: That mainly depends on the needs of users and applications. Where there is a need, we will support it.

Synced: Can you talk about the reasons for and significance of making Angel open source? What are the short-term and long-term follow-up plans for Angel?

Jiang Jie: Tencent’s big data platform grew out of the open source community and has benefited from it, so we would like to give back. Furthermore, open sourcing not only benefits the inventors and developers within our technology community, but also creates a win-win ecosystem for software development: developers can improve their efficiency by spending less time learning how certain projects work and more time trying out new ideas, and with the support of the open source community they can complete their projects much faster. We have been giving back to the community already: we have open-sourced a lot of code and cultivated a number of project committers, and we will keep giving back in the future.

Open sourcing is just a start. In the near future, we will work on community development by investing more resources in responding to the community’s needs, and we will develop many more complementary tools for Angel to further support the platform.

Synced: So far, many technology giants, in mainland China and overseas, have launched their own open source platforms. What do you think of Tencent’s open-sourcing of Angel, and what are its competitive advantages?

Jiang Jie: Competition always exists, and it fosters progress, pushing the entire industry to grow. That is a good thing for professionals in the industry and for the public alike. As for these open source platforms, we think each has its own advantages and disadvantages, and open sourcing helps each of them shore up its weaknesses.

Synced: Why did Tencent name its big data platform “Angel”? Is there an anecdote from the development process behind it?

Jiang Jie: From the very beginning, our intention was to build a platform that could compute very large models at super-fast speed, as if it could make programs fly. At the same time, we wanted the platform to be user-friendly, with a gentle learning curve and high usability, leaving users with a friendly and pleasant impression. In addition, this project matters to all of our developers; we love it more than anything else. That is why it is our “Angel.”

Original article from Synced China: http://www.jiqizhixin.com/article/2016 | Authors: Pan Wu, Joyce Zhou | Localized by Synced Global Team: Jiaxin Su, Rita Chen
