MesaTEE GBDT-RS Open Source Released

Baidu Security X-Lab
Jun 7, 2019

GBDT (Gradient Boosting Decision Tree) is a machine learning algorithm widely used in industry, and XGBoost is a popular open source GBDT toolkit initiated by the renowned Chinese scholar Tianqi Chen. GBDT/XGBoost has helped win numerous machine learning competitions and is one of the most commonly used methods and tools in machine learning.

As data security and privacy protection receive more and more attention across many fields, protecting data from leaks and abuse in public cloud and offshore data scenarios has become an urgent, shared concern. Consequently, the industry is eager for a GBDT solution with a strong data security mechanism. With the development of hardware Trusted Execution Environment (TEE) technology, represented by Intel SGX, the integrity and confidentiality of data and code can be backed by chip-level security. At the software level, Baidu Security X-Lab's original Hybrid Memory Safety (HMS) technology guarantees the memory safety of the system through its software architecture. The Baidu Security MesaTEE project combines hardware TEE and HMS technologies, protecting machine learning data and code at both the hardware and software levels so that sensitive data and confidential models cannot be leaked. It thereby provides next-generation secure big data and machine learning solutions.

To combine GBDT with HMS/TEE, Baidu Security X-Lab partnered with the Information Security Lab of the Institute of Computer Science & Technology of Peking University and implemented the GBDT algorithm in Rust as the open source project MesaTEE GBDT-RS. MesaTEE GBDT-RS is compatible with XGBoost models, meets MesaTEE's memory safety requirements, and can run directly in the SGX environment. In addition, MesaTEE GBDT-RS provides very fast prediction: in the SGX environment, single-threaded GBDT-RS prediction can reach more than twice the speed of 8-thread XGBoost in a regular environment.

Using a toolchain that includes MesaTEE GBDT-RS and the Baidu MesaTEE Rust SGX SDK, developers can create memory-safe machine learning programs that run in SGX, protecting both models and data. In a cloud computing scenario, even if the operating system, the virtual machine manager (VMM), or adjacent virtual machines are compromised, the integrity and confidentiality of the model and data can still be protected with a high level of assurance. Users can also remotely attest the execution environment, verify that the code is as expected, and pass models and data over a trusted communication channel.

MesaTEE GBDT-RS open source address: https://github.com/mesalock-linux/gbdt-rs

Major features of MesaTEE GBDT-RS (hereinafter referred to as GBDT-RS) in comparison with XGBoost:

  • Security: GBDT-RS is written entirely in Rust and contains no unsafe code. The Rust compiler therefore guarantees the memory safety of GBDT-RS, and developers do not have to worry about model and data leaks caused by memory corruption vulnerabilities when using it.
  • Easy to audit: GBDT-RS contains only about 2000 lines of code, which is compact, well organized, and has few dependencies. Any skeptic can quickly and easily audit the code and establish trust. XGBoost, by contrast, contains tens of thousands of lines of code and relies on other C/C++ open source projects, which makes auditing very difficult.
  • High performance: GBDT-RS has been tested on training datasets of 200 dimensions x 5 million samples and 35 dimensions x 11.86 million samples. Single-threaded training of a single decision tree reaches about 70% of the speed of single-threaded XGBoost, while prediction is 4–10 times faster than XGBoost. Single-threaded GBDT-RS prediction in SGX is even faster than multi-threaded XGBoost prediction in non-SGX environments.
  • Easy to use: GBDT-RS supports both regression and classification tasks, as well as multi-threaded concurrent prediction. GBDT-RS is also compatible with XGBoost models, which can be used for regression, classification, multi-class classification, and other prediction tasks.
  • Support for SGX: GBDT-RS supports training and prediction in SGX, including the use of XGBoost models in SGX environments. Developers can use Baidu's rust-sgx-sdk to compile programs that run GBDT-RS in SGX, putting data and models under a high level of security protection.
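To make the prediction side of these features concrete, here is a minimal sketch of how a gradient-boosted tree ensemble might be evaluated in safe Rust, including a simple multi-threaded batch predictor. It is not the GBDT-RS API: the `Node`, `Tree`, and `Ensemble` types, the flat node layout, and the "go left when the feature is below the threshold" rule are all assumptions chosen for illustration.

```rust
use std::thread;

/// One node of a regression tree, stored in a flat vector (hypothetical layout).
#[derive(Clone, Debug)]
enum Node {
    /// Internal node: go left when `x[feature] < threshold`, otherwise right.
    Split { feature: usize, threshold: f32, left: usize, right: usize },
    /// Leaf node holding this tree's contribution to the prediction.
    Leaf(f32),
}

/// A single decision tree; node 0 is the root.
struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Walk from the root to a leaf using bounds-checked indexing only.
    fn predict(&self, x: &[f32]) -> f32 {
        let mut idx = 0;
        loop {
            match &self.nodes[idx] {
                Node::Leaf(value) => return *value,
                Node::Split { feature, threshold, left, right } => {
                    idx = if x[*feature] < *threshold { *left } else { *right };
                }
            }
        }
    }
}

/// Additive ensemble: a base score plus the sum of every tree's leaf value.
struct Ensemble {
    base_score: f32,
    trees: Vec<Tree>,
}

impl Ensemble {
    fn predict_one(&self, x: &[f32]) -> f32 {
        self.base_score + self.trees.iter().map(|t| t.predict(x)).sum::<f32>()
    }

    /// Split the rows across up to `n_threads` scoped threads.
    /// Results are returned in the same order as the input rows.
    fn predict_parallel(&self, rows: &[Vec<f32>], n_threads: usize) -> Vec<f32> {
        let n = n_threads.max(1);
        let chunk = ((rows.len() + n - 1) / n).max(1);
        thread::scope(|s| {
            // Spawn all workers first so they run concurrently, then join in order.
            let handles: Vec<_> = rows
                .chunks(chunk)
                .map(|part| {
                    s.spawn(move || {
                        part.iter().map(|r| self.predict_one(r)).collect::<Vec<f32>>()
                    })
                })
                .collect();
            handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
        })
    }
}
```

The sketch uses no `unsafe` blocks: a malformed model or a short feature vector causes a panic on an out-of-bounds index rather than a silent out-of-bounds memory read, which is the kind of guarantee the Security bullet refers to.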

Performance Test & Results

  • Model Training
    Dataset: 35D x 11.86 million
    Environment: Linux, i7-8086K, 64 GB memory, non-SGX
  • Inference Comparison (Ratio) in Non-SGX
    Model: 32D, 10000 decision trees with depth of 6, trained with XGBoost
    Dataset: two datasets with sizes 10k and 100k
    Environment: i7-8086K/Linux, i7-8850H/macOS, Intel J5005/Linux
GBDT-RS is 4–10 times faster than XGBoost on the 10k dataset, and 3.6–7.7 times faster on the 100k dataset.
  • Inference Comparison (Ratio) in SGX
    Model: 32D, 10000 decision trees with depth of 6, trained with XGBoost
    Dataset: two datasets with sizes 10k and 100k
    Environment: Intel J5005 (4-core 4-thread)/Linux, XGBoost in regular environment, GBDT-RS in regular environment & SGX environment

On the 10k dataset, SGX brings a ~13% performance loss, but GBDT-RS in SGX is still 8.7 times faster than single-threaded XGBoost and 2.1 times faster than 8-thread XGBoost.
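The ratios quoted for the 10k dataset can be cross-checked with a little arithmetic. The sketch below derives two figures not stated directly: how much faster 8-thread XGBoost is than single-threaded XGBoost, and the speedup GBDT-RS would have outside SGX. The second figure assumes that "13% performance loss" means throughput drops by 13%, which is one possible reading of the text; the function names are hypothetical.

```rust
/// Speedup of C over B, given how much faster A is than B and than C.
fn implied_speedup(a_over_b: f64, a_over_c: f64) -> f64 {
    a_over_b / a_over_c
}

/// Speedup outside SGX, assuming SGX reduces throughput by the fraction `loss`
/// (for example 0.13 for a 13% loss). This interpretation is an assumption.
fn speedup_without_sgx(sgx_speedup: f64, loss: f64) -> f64 {
    sgx_speedup / (1.0 - loss)
}
```

With the numbers above, `implied_speedup(8.7, 2.1)` is about 4.1, meaning 8-thread XGBoost runs roughly four times faster than single-threaded XGBoost on the 4-core J5005. `speedup_without_sgx(8.7, 0.13)` is about 10, which matches the upper end of the 4–10 times range reported for the non-SGX comparison.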

On the 100k dataset, GBDT-RS in SGX is 2 times faster than single-threaded XGBoost. If the batch size is set to 10k, the cost of memory transfer is reduced, and GBDT-RS in SGX performs 9.6 times faster than single-threaded XGBoost and 2.3 times faster than 8-thread XGBoost.
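One way to read the batch-size remark is that the 100k rows are fed to the predictor in slices of 10k rather than all at once. The helper below is a minimal sketch of that pattern in plain Rust. `predict_in_batches` and its closure parameter are hypothetical names, not part of GBDT-RS or the Rust SGX SDK, and the idea that each call corresponds to one transfer across the enclave boundary is an assumption.

```rust
/// Run `predict_batch` over `rows` in slices of at most `batch_size` rows,
/// concatenating the results in input order.
fn predict_in_batches<F>(rows: &[Vec<f32>], batch_size: usize, mut predict_batch: F) -> Vec<f32>
where
    F: FnMut(&[Vec<f32>]) -> Vec<f32>,
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut out = Vec::with_capacity(rows.len());
    for batch in rows.chunks(batch_size) {
        // Assumption: in an SGX setting, each call here would move one batch
        // into the enclave, so a smaller batch keeps each transfer small.
        out.extend(predict_batch(batch));
    }
    out
}
```

Called with 100,000 rows and a `batch_size` of 10,000, the closure runs ten times, once per slice. The last slice is simply shorter when the row count is not a multiple of the batch size.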
