Impressions of Intel® SGX performance

Danny Harnik
Dec 27, 2017 · 9 min read


This blog post is co-authored with Eliad Tsfadia as part of our work at IBM Research - Haifa.

Intel® SGX enclaves provide hardware-enforced confidentiality and integrity guarantees for running computations. This is achieved mainly by encrypting all information as it leaves the CPU, effectively shielding data in memory from external observers.

But what is the overhead of running computations inside an enclave? One would expect some overhead due to the added encryption and decryption work. In addition, extra security measures such as integrity checks and memory usage limitations can also affect performance. In this blog post, we try to shed some light on this question by presenting benchmark results of various operations running inside and outside enclaves.

The performance overheads stem from two main sources: the first is the actual overhead of executing CPU instructions and accessing encrypted memory inside an enclave. The second is the overhead associated with entering and exiting an enclave. Enclaves are invoked only via special interface functions called ECALLs (defined in an “edl” file). ECALLs are known to have a performance impact due to the CPU context switches involved, and this is seen clearly in our tests below.
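To make the ECALL mechanism concrete, below is a minimal sketch of the untrusted side of such a call. The function and file names (ecall_find_max, Enclave_u.h) are hypothetical and not taken from this post; the proxy signature follows the convention of the SGX SDK's edger8r-generated code.

/* The edl file would declare something like:
 *     enclave {
 *         trusted {
 *             public int ecall_find_max([in, size=len] uint8_t* buf, size_t len);
 *         };
 *     };
 * The SDK's edger8r tool then generates an untrusted proxy that takes the
 * enclave id and returns the ECALL's return value through an out-parameter. */
#include <stdint.h>
#include "sgx_urts.h"   /* sgx_enclave_id_t, sgx_status_t */
#include "Enclave_u.h"  /* generated untrusted proxy declarations (hypothetical name) */

int call_find_max(sgx_enclave_id_t eid, uint8_t* buf, size_t len)
{
    int max_val = 0;
    /* Each such call crosses into the enclave and back; this context switch
     * is the per-call overhead measured throughout this post. */
    sgx_status_t status = ecall_find_max(eid, &max_val, buf, len);
    return (status == SGX_SUCCESS) ? max_val : -1;
}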

Our benchmarks were run in the following setup: a Lenovo laptop with a 4-core (8-thread) Intel® Core(TM) i7-6820HQ CPU at 2.70GHz and 16GB of 2133MHz DDR4 RAM, running Ubuntu 16.04. The libraries that we used were Intel® SGX Linux 2.0 and SGXSSL (taken from the intel-sgx-ssl git repository on Nov 29th, 2017).

Note: All the benchmarks discussed in this post are single-threaded runs.

We tested the overhead of using enclaves for a number of computational tasks, with a focus on cryptographic functions, which are likely to be carried out in enclaves. Specifically, we tested hashing and encryption operations. However, since these are complex computations, we start with a simpler test that runs a very basic computation (the only requirement being that it touches the entire input buffer).

Testing a simple function

We tested the overhead of finding the maximum 4-byte integer in a given byte array (i.e., we treat an array of N bytes as an array of N/4 integers).

We implemented four versions of the function find_max:

1. A regular function running in the untrusted area as one would run it without enclaves.

2. Copy and compute version — the array is copied into the enclave’s encrypted memory and then the maximum value is found over this copy. This is implemented using an ECALL in which the input array is declared with the so-called “in” option in the edl file.

3. Compute on encrypted memory variant — in this option we find the maximum over an array that resides in the enclave’s encrypted memory. The array is prepared before the measured ECALL (by a previous ECALL). This is similar to option 2, but the initial buffer copy is not included in the measurements.

4. Compute on cleartext memory — in this option the ECALL finds the maximum of a given external input array without moving it into the enclave’s memory (namely, accessing only clear-text memory). This is achieved by declaring the array with the “user_check” option in the edl file.

Of the three enclave variants, option 4 should be the fastest, since it does not require the array to be decrypted by SGX’s Memory Encryption Engine (MEE). Options 2 and 3 do require MEE decryption in order to perform the computation (and option 2 involves MEE encryption as well as decryption, because of the initial copy).

Observe that by evaluating the performance of these three options, we gain a pretty good understanding of the expected overhead of ECALL context switches, of the MEE operations, and of copying data using the “in” (or “out”) declarations in the edl. These observations should hold for other computations, not just the find_max function.
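To make the three variants concrete, here is a hedged sketch of how the trusted side might look (the function names, the edl attributes shown in the comment, and the fixed local-array size are our illustrative assumptions, not code from this post):

/* Assumed edl declarations for the three enclave variants:
 *   public int32_t ecall_max_copy([in, size=len] uint8_t* buf, size_t len);        // option 2: "in"
 *   public int32_t ecall_max_local(void);                                          // option 3: enclave-resident array
 *   public int32_t ecall_max_usercheck([user_check] uint8_t* buf, size_t len);     // option 4: "user_check"
 */
#include <stdint.h>
#include <stddef.h>

static int32_t find_max(uint8_t* buf, size_t len)
{
    const int32_t* ints = (const int32_t*)buf;
    size_t n = len / sizeof(int32_t);
    int32_t max_val = INT32_MIN;
    for (size_t i = 0; i < n; i++)
        if (ints[i] > max_val)
            max_val = ints[i];
    return max_val;
}

/* Option 2 ("in"): the SDK edge routine has already copied buf into encrypted
 * enclave memory, so the copy (MEE encryption) and the scan (MEE decryption)
 * are both paid inside this ECALL. */
int32_t ecall_max_copy(uint8_t* buf, size_t len) { return find_max(buf, len); }

/* Option 3: scan an array that a previous ECALL placed in enclave memory;
 * only MEE decryption of enclave pages is involved. */
static uint8_t g_local[8 * 1024 * 1024]; /* illustrative size only */
int32_t ecall_max_local(void) { return find_max(g_local, sizeof(g_local)); }

/* Option 4 ("user_check"): buf still points into untrusted, clear-text memory,
 * so no MEE work is needed for the data itself. */
int32_t ecall_max_usercheck(uint8_t* buf, size_t len) { return find_max(buf, len); }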

For evaluating the performance of each option, we compared the throughput of the calls using various array sizes. The results in Figure 1 show the throughput (MBs processed per second) as a function of the array size. The first observation is that for small arrays there is a huge overhead when running a function in an enclave, likely caused by the context-switch overhead of entering and exiting the enclave. This overhead becomes negligible for larger buffers, and the gap between the untrusted version and the two faster trusted versions is mostly closed for arrays larger than 256KB.

There are three other interesting phenomena. For arrays larger than 8MB, we see a throughput degradation in the ECALL that runs on enclave encrypted memory. We assume that it might be related to additional L3 cache misses when reading an enclave’s local array (in our setup, the L3 cache is 8MB). For arrays larger than 64MB the performance drops off dramatically due to the limitations of the enclave page cache (EPC). Finally, the ECALL that uses the “in” option shows a significant slowdown for arrays larger than 64KB. Our limited investigation indicates that this may be caused by a slowdown of the memcpy function inside an enclave for buffers larger than 64KB.

Testing SHA256

We now turn to investigate the overhead of running heavier and more interesting tasks inside an enclave, focusing first on the fastest ECALL implementation, which uses the “user_check” declaration in the edl file for both input and output buffers. We start by examining the overhead of computing SHA256. In the untrusted area, we tested the openssl implementation of SHA256. In the trusted area (i.e., inside the enclave), we tested the intel-sgxssl implementation of SHA256, which in turn calls openssl. We also tested the function sgx_sha256_msg provided by the intel-sgxsdk API.
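As an illustration, here is a minimal sketch of the two trusted-side calls being compared (the wrapper names and buffer names are ours; the SHA256 and sgx_sha256_msg signatures are from the public OpenSSL and SGX SDK headers):

#include <stdint.h>
#include <openssl/sha.h>   /* sgxssl path: the OpenSSL API ported into the enclave */
#include "sgx_tcrypto.h"   /* sgxsdk path: sgx_sha256_msg */

/* One-shot SHA256 via the (sgx)ssl API. */
void hash_with_sgxssl(const uint8_t* msg, size_t len, uint8_t out[SHA256_DIGEST_LENGTH])
{
    SHA256(msg, len, out);
}

/* One-shot SHA256 via the SGX SDK's own helper. */
sgx_status_t hash_with_sgxsdk(const uint8_t* msg, uint32_t len, sgx_sha256_hash_t* out)
{
    return sgx_sha256_msg(msg, len, out);
}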

The results in Figure 2 indicate that, as expected, there is still a huge gap in the throughput of computing sha256(msg) for small messages. However, as the message size increases, the gap does not completely close, even for very large messages. The throughputs for large messages are: openssl — 435MB/sec, sgxsdk — 350MB/sec (80% of the openssl throughput) and sgxssl — 295MB/sec (67% of the openssl throughput). We suspect that the difference is due to a different implementation of the function being used when running in the enclave (rather than an inherent slowdown of the actual computation).

In addition, we also ran a similar SHA256 test in which the input message was stored in the enclave’s local memory rather than in clear-text input buffers in untrusted memory. Unlike the find_max test, in which we saw a throughput degradation for input sizes larger than 8MB, we did not see one here, and the results were similar to the clear-text input case. This can be attributed to the fact that the overall throughput is much lower in this test.

Testing AES Encryption

Finally, we tested the overhead of encrypting and decrypting messages. In particular, we focused on testing AES128-GCM encryption. In the trusted area, we tested three libraries: sgxsdk, sgxssl and an encryption library used in the open source project Opaque, which can be found here. In the untrusted area we tested two libraries: openssl and Opaque (we also tried to test the sgxsdk library in the untrusted area, but were unsuccessful in running it). As in the previous tests, for each tested library in the trusted area we created an ECALL of the form

aes128_gcm_encrypt(uint8_t* in_buf, uint32_t in_buf_len, uint8_t* out_buf, uint32_t out_buf_len)

We first focused on the fastest ECALL implementation in which both “in_buf” and “out_buf” are declared with the “user_check” option.
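For the sgxsdk variant, the body of such an ECALL might look roughly as follows. This is a hedged sketch: the key and IV handling are simplified placeholders of our own, and appending the GCM tag to out_buf is our assumption; only the sgx_rijndael128GCM_encrypt signature is from the SDK.

#include <string.h>
#include "sgx_tcrypto.h"

static const sgx_aes_gcm_128bit_key_t g_key = {0};   /* placeholder key, for benchmarking only */

sgx_status_t aes128_gcm_encrypt(uint8_t* in_buf, uint32_t in_buf_len,
                                uint8_t* out_buf, uint32_t out_buf_len)
{
    uint8_t iv[12] = {0};                             /* fixed IV, for benchmarking only */
    sgx_aes_gcm_128bit_tag_t tag;

    if (out_buf_len < in_buf_len + sizeof(tag))
        return SGX_ERROR_INVALID_PARAMETER;

    sgx_status_t ret = sgx_rijndael128GCM_encrypt(
        &g_key,
        in_buf, in_buf_len,   /* plaintext read directly from "user_check" memory */
        out_buf,              /* ciphertext written directly to "user_check" memory */
        iv, sizeof(iv),
        NULL, 0,              /* no additional authenticated data */
        &tag);

    if (ret == SGX_SUCCESS)
        memcpy(out_buf + in_buf_len, &tag, sizeof(tag));
    return ret;
}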

The results, shown in Figure 3, are somewhat surprising and more complex than one would hope for:

- The sgxsdk version achieves a maximum throughput of 2150MB/sec (about 43% of the untrusted throughput). This is because, by default, it runs a non-optimized version of Intel’s IPP Crypto library that does not use Intel’s AES-NI hardware optimizations. We were told that by manually compiling and linking the SDK with an optimized binary of IPP Crypto for SGX, one might achieve the desired acceleration, but we have not tried this.

- The sgxssl implementation presumably runs the same code as the untrusted openssl version. However, its performance collapses to a throughput of only 110MB/sec (about 2% of the untrusted throughput). UPDATE: This issue was investigated and fixed by Intel. See the update at the bottom of this post.

- The only code that ran reasonably well both inside and outside an enclave was the Opaque encryption library. Here we see the trends observed in the simpler cases: for short messages there is a huge gap between the trusted and untrusted throughputs, caused by the context-switch overhead. However, for large messages, the trusted version of Opaque nearly closes the gap with the untrusted libraries, reaching a throughput of 4900MB/sec.

In addition, we also ran the “local input” variant of the encryption test, in which the input messages resided in the enclave’s encrypted memory rather than in clear-text input buffers in untrusted memory. The results in Figure 4 show that the trusted Opaque library, which achieves the best throughput in the trusted area, suffers from a throughput degradation for messages larger than 8MB, similar to the degradation we saw in the find_max test.

Note that the results for decryption (rather than encryption) were very similar.

Conclusions

On the positive side, we see that when running on large inputs, code running inside enclaves can achieve very high throughput, on par with code running outside enclaves. For short inputs, there is a significant overhead of invoking enclave calls. However, such effects can be mitigated, for the most part, using techniques that avoid ECALLs and OCALLs as much as possible. See, for example, “Eleos: Exit-Less OS Services for SGX Enclaves” by M. Orenbach, M. Minkin, P. Lifshits, and M. Silberstein (EuroSys 2017), or “Regaining Lost Cycles with HotCalls: A Fast Interface for SGX Secure Enclaves” by O. Weisse, V. Bertacco, and T. Austin (ISCA 2017).

On the negative side, we see that nothing is simple with SGX enclaves. First, using the default crypto libraries provided by Intel®, or any library ported for SGX, does not guarantee optimal performance. Second, even code that achieves optimal throughput when accessing clear-text data might not do so when accessing an enclave’s local memory. In fact, when running on enclave memory, there seems to be a limited sweet spot in which the input should be neither too small nor too large.

Our strongest impression is that, without actual performance testing, it is hard to know what performance degradation to expect when running a complex computation inside an enclave.

UPDATE (March 2018)

The sgxssl issue was investigated and resolved by Intel: “The root cause was related to the auto-initiation flow of OpenSSL and its integration into the SGX SW stack. OpenSSL didn’t receive the CPU capabilities to determine the best AES-GCM implementation for the given platform. Therefore, due to the lack of platform information, OpenSSL fell back to the basic C implementation (which is not optimized at all).
Solution: To get maximum performance, enclave developers should explicitly initialize the OpenSSL crypto library. The auto-initiation flow will be fixed in a future release.”

The issue and how to handle it are now described in the sgxssl documentation here.
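As an illustration only, the explicit initialization might look like the following minimal sketch, which uses the generic OpenSSL 1.1 entry point OPENSSL_init_crypto; the exact call recommended for sgxssl enclaves should be taken from the sgxssl documentation referenced above.

#include <openssl/crypto.h>

/* Call once inside the enclave before the first crypto operation, so that
 * OpenSSL can select its optimized AES-GCM implementation.
 * Returns 1 on success, 0 on error. */
int enclave_crypto_init(void)
{
    return OPENSSL_init_crypto(0, NULL);
}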

Rerunning the tests with this fix brings the sgxssl behavior very close to that of the Opaque library and exhibits the same overall trends. Figures 5 and 6 depict these new tests (just for the OpenSSL trusted and untrusted code bases).

- This work was partially conducted under the European H2020 project RestAssured and is part of research carried out in the cloud storage group at IBM Research - Haifa regarding the use of Intel SGX in the cloud. You can read about our exploration of SPARK SQL and SGX here.

