Intel Optimization at Netflix

Amer Ather
10 min read · May 14, 2023


Netflix Performance Engineering works closely with Intel to optimize workloads hosted on the AWS public cloud. We have successfully integrated Intel AVX-512 and VNNI acceleration in a wide range of use cases to boost performance. Intel profiling tools such as VTune, PerfSpect, Process Watch, and PCM offer powerful insights into system performance, and we have used their versatile capabilities to uncover regressions and boost performance in test and production environments.

As a strategic technology partner, Netflix was invited to the Intel Vision Conference, held May 8–10 in Orlando, FL. The Vision conference is about fueling innovation to drive business transformation and accelerate growth in areas such as cloud and edge computing, software solutions, security, sustainability, and resilient supply chains.

I represented Netflix at the conference, spoke at the “Supercharge Your Application Performance” breakout session, and participated in an industry analyst Q&A. Instead of just posting slides with key takeaways, I felt diving deeper into the technical aspects of the talk would benefit the larger community.

Supercharge Your Application Performance — Breakout Session

The Netflix Performance Engineering team's charter is to bring a higher level of efficiency to the Netflix streaming environment, with the goal of reducing the cloud infrastructure cost of doing streaming business. We achieve this goal through active benchmarking, prototyping performance enhancements, and building performance tools. We also triage production issues and offer consulting to Netflix service teams on the performance aspects of their workloads.

Netflix offers a premium streaming service to over 230 million paid subscribers worldwide.

The Netflix app runs on a variety of devices, each with its own unique capabilities, access profile, and requirements. These devices operate under varying network conditions, so performance optimization and end-to-end reliability are critical for delivering quality content to our members across 190 countries.

Our subscribers expect the Netflix app to be robust and responsive. We need to deliver multiple video formats to every device in a way that meets user expectations with performance and cost efficiency, and stays ahead of the competition. Intel has given us a competitive advantage in this regard: Intel AVX-512 and VNNI acceleration techniques, along with the advanced debugging and profiling capabilities of VTune, have helped Netflix optimize and boost performance in a variety of use cases, such as video encoding, microservice latency and throughput improvements, and accelerating machine learning inference tasks.

Netflix strives to offer high-quality streaming to any device, anywhere, at any time. Each title is optimally encoded to achieve the best overall streaming quality, even on slow, congested networks.

Techniques like “Adaptive Bit Rate” (for both video and audio) and “Per-Shot Encoding” allow devices to choose the best bitrate to stream by continuously probing network conditions and estimating device buffer capacity.

The streaming pipeline is basically a three-step process: downsample the source asset, encode it, and then ship the encoded video to the end device, which decodes, upsamples, and plays it. All encoding is done in the cloud, and decoding happens on a member's device.

Netflix uses traditional interpolation filters like Lanczos as well as neural networks to improve the quality of downscaling. The DL model, called the Downsampler, is trained to generate the best downsampled representation of the source such that, after upscaling, the mean squared error (MSE) is minimized.
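A simplified way to write that training objective (the notation here is mine, not taken from the Netflix work): with D_theta the learned downsampler, U a fixed upsampler (for example, bicubic), and x a source frame, training minimizes the reconstruction error

\min_{\theta} \; \mathbb{E}_{x}\left[\, \big\| U\big(D_{\theta}(x)\big) - x \big\|_2^2 \,\right]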

Neural-based downsampling showed improvements for various upsamplers: bilinear, bicubic, and Lanczos. It is also encoder agnostic, so the benefits are seen across all popular encoders: H.264, HEVC, VP9, and AV1.

To validate encoded video quality, the encoded video is scaled up to the source video resolution, and the pre-processed pixels of the encoded frames are fed into a video quality filter, part of libvmaf, to measure quality.

Considering the size of the video library and the number of new titles Netflix releases every year, encoding jobs run at a massive scale. A performance improvement, in the form of reduced CPU hours or a frames-per-second (fps) speedup to encode a title, means huge savings in cloud infrastructure cost. The Intel oneDNN library has demonstrated 15% to 2x improvements in various FFmpeg workloads.

We have successfully used the Intel profiling tools VTune, PerfSpect, Process Watch, and PCM to debug and fine-tune production workloads. VTune makes it relatively easy for developers to gain a better understanding of how their code is executing and to identify areas that can be optimized to improve CPU and memory usage.

VTune offers advanced features like call-graph and hotspot analysis that enable developers to gain deeper insights into application behavior.

Popular IDEs such as Visual Studio and Eclipse offer integration with Intel VTune, making it easy to incorporate performance analysis into existing workflows and helping avoid costly performance issues in production.

One success story was highlighted in the Netflix blog post “Seeing through hardware counters — a journey to the threefold performance increase”. VTune's ability to program the PMU/PEBS on Intel processors to trace hardware events (instructions per cycle, cache misses, branch mispredictions, etc.) helped us isolate the cause of a regression seen while migrating/consolidating a critical microservice onto larger cloud instances.

It turned out that false/true sharing of hardware cache lines was resulting in frequent cache invalidations. True sharing refers to multiple cores frequently writing to the same shared variable, while false sharing refers to multiple cores writing to different variables that happen to sit on the same cache line.

We addressed the issue by inserting padding into the data layout to prevent cache-line sharing. Patching the issue (a JDK patch), found via VTune, helped us achieve a 3.5x performance improvement over the throughput we initially reached when we upgraded the service to a bigger cloud instance. We were also able to reduce both the average and tail latency of the service.
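The actual fix was a JDK patch, but the padding idea is easy to illustrate in C. A minimal sketch (the struct and field names are hypothetical, and 64 bytes is the typical x86 cache line size): two counters written by different threads share one cache line unless padding separates them.

#include <stdint.h>

/* Without padding: 'a' and 'b' typically land on the same 64-byte cache
 * line, so two threads writing them ping-pong the line between cores
 * (false sharing). */
struct counters_unpadded {
    volatile uint64_t a;   /* written by thread 1 */
    volatile uint64_t b;   /* written by thread 2 */
};

/* With padding: each counter gets its own cache line, eliminating
 * the false sharing. */
struct counters_padded {
    volatile uint64_t a;
    char pad[64 - sizeof(uint64_t)];   /* fill the rest of a's cache line */
    volatile uint64_t b;
};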

VTune can profile workloads to find bottlenecks hidden deeper in the stack; a few such use cases are listed on the VTune site.

The Intel PerfSpect tool is a wrapper around Linux perf that makes it much easier to collect relevant hardware events (CPI, AVX/AVX-512 usage, CPU frequency, etc.) and post-process the data.

Intel Process Watch displays the per-process instruction mix in real time. We used this tool on a few of our production (FFmpeg) workloads and found that AVX-512 instructions were missing:

PID    NAME    AVX   AVX2   AVX512  %TOTAL
25398  ffmpeg  3.65  26.68  0.00    100.00

After rebuilding FFmpeg correctly, we saw a performance gain of 18%, with AVX-512 instruction usage increasing from 0% to 25%.

PID    NAME    AVX   AVX2   AVX512  %TOTAL
78887  ffmpeg  0.32  2.48   25.35   100.00
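Beyond build flags, applications can also guard their wide-vector paths at run time. A minimal sketch in C (the function names are illustrative, not from FFmpeg; uses GCC/Clang builtins): compile one variant for AVX-512F and select it only when the CPU reports support, so the same binary still runs on older CPUs.

#include <stddef.h>

/* Variant compiled with AVX-512F enabled; the compiler may vectorize the
 * loop with 512-bit instructions. */
__attribute__((target("avx512f")))
static void scale_avx512(float *v, size_t n, float s) {
    for (size_t i = 0; i < n; ++i)
        v[i] *= s;
}

/* Portable fallback for CPUs without AVX-512. */
static void scale_generic(float *v, size_t n, float s) {
    for (size_t i = 0; i < n; ++i)
        v[i] *= s;
}

void scale(float *v, size_t n, float s) {
    if (__builtin_cpu_supports("avx512f"))
        scale_avx512(v, n, s);
    else
        scale_generic(v, n, s);
}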

The Intel PCM library and tools (pcm.x, pcm-numa.x, pcm-memory.x) provide off-core (uncore) statistics such as shared L3 cache statistics, memory channel utilization and throughput, and NUMA/memory latency and bandwidth.

The PMU on Intel processors can be programmed to track specific types of hardware events. These counters are incrementally updated during the execution of CPU instructions. The PMU can operate in two modes:

  1. Counting: counts hardware events like cache misses. This mode is used by the Intel PCM tools (a minimal counting sketch follows this list).
  2. Sampling: takes a sample when an event has occurred a certain number of times, for example collecting a sample after every 1M cache misses. This mode is used by Intel VTune and Linux perf.
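As an illustration of counting mode (a sketch, not how PCM itself is implemented; assumes Linux on x86), the perf_event_open system call can count LLC misses around a region of code:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0 (this process), cpu = -1 (any CPU), no group, no flags */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under measurement goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
        printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}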

PEBS (Processor Event-Based Sampling) is an extension of sampling: the PMU is instructed to collect additional information when a sample is taken, for example the precise instruction pointer, registers, or flags. When PEBS is enabled, the PMU is programmed to periodically interrupt the CPU and collect performance data for the currently executing code. The data is then aggregated and analyzed to provide a detailed view of the application's performance.

PEBS is thus sampling-based profiling that uses the PMU to collect performance metrics such as instruction fetch, integer execution, and memory subsystem metrics.
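Building on the counting sketch above, the same perf_event_attr can be configured for sampling with PEBS-backed precision (again a sketch; the period of 1M misses is just an example value):

#include <linux/perf_event.h>
#include <string.h>

/* Configure an event for "sampling" mode: one sample per 1,000,000 LLC
 * misses, recording the instruction pointer, with precise (PEBS-backed)
 * low-skid samples requested via precise_ip. */
static struct perf_event_attr make_sampling_attr(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.sample_period = 1000000;        /* take a sample every 1M events */
    attr.sample_type = PERF_SAMPLE_IP;   /* record where the miss happened */
    attr.precise_ip = 2;                 /* request PEBS-style precision */
    return attr;
}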

Netflix's success is credited in part to the pioneering ways the company has introduced AI and ML into its products, services, and infrastructure.

Machine learning (ML) is applied to solve a wide range of problems at Netflix. We use ML algorithms extensively to recommend relevant titles; everything on the member home page is an evidence-driven, A/B-tested experience backed by ML models.

Services like Adaptive Row Ordering help personalize a subscriber's home page and make it easy to discover relevant content. Models pick a winner from a set of assets given viewing history, country, and language. An evidence service is used for exploring asset selection and ranking: it decides which shows and seasons to display artwork for, and it picks the relevant artwork for the user's home page, such as horizontal/vertical artwork, billboard artwork, browse video, billboard video, story art, short-panel artwork, and evidence cards.

At Netflix, machine learning (ML/DL) inference workloads are primarily hosted on Intel CPUs, as that is more practical and cost effective than using GPUs. We use our trough capacity (unused reserved cloud capacity) to perform batch/offline inference tasks and reduce cost.

Most Netflix ML workloads use performant Java-based inference, with little JNI penalty when calling into TensorFlow. We have a pure Java implementation of XGBoost for inference. Feature encoding and generation makes up a good proportion of our end-to-end (E2E) pipeline, and that is written in Java. Thus, offloading inference tasks to GPUs would just increase cost for only minor latency wins.

We are actively looking for opportunities to quantize production models to bfloat16 (mixed precision) and int8 to achieve the full benefits of Intel VNNI acceleration. VNNI extends AVX-512 by introducing new instructions for accelerating the inner convolutional neural network loops.

Intel Deep Learning Boost (Intel DL Boost) technology is a combination of VNNI and the Intel Neural Compressor. Intel used int8 and bfloat16 (mixed precision) in various MLPerf inference benchmarks, and the Intel OpenVINO inference engine supports low-precision inference on Intel processors.

AVX-512 extends AVX to 512 bits: it adds 512-bit vector registers alongside the existing 256-bit vectors, allowing for even greater parallelism and more efficient parallel data processing.

AVX-512 offers much faster execution times for data-intensive applications, as well as improved energy efficiency. By processing more data in parallel with each instruction, these instruction sets reduce the amount of time processors need to spend executing code, leading to faster application performance. Because each instruction handles more data, they also reduce the overall number of instructions that need to be executed, which can help conserve power and reduce energy consumption.

AVX-512 Instructions:

VADDPS: Add packed single-precision floating-point values
VSUBPS: Subtract packed single-precision floating-point values
VMULPS: Multiply packed single-precision floating-point values
VDIVPS: Divide packed single-precision floating-point values
VMAXPS: Find maximum packed single-precision floating-point values
VMINPS: Find minimum packed single-precision floating-point values
VFMADD132PS/VFMADD213PS/VFMADD231PS: Fused multiply-add of packed single-precision floating-point values
VPOPCNTD/VPOPCNTQ: Count the number of bits set in each packed 32-bit or 64-bit integer element (AVX512_VPOPCNTDQ)
VREDUCEPS: Perform a reduction transformation on packed single-precision floating-point values, extracting the reduced (fractional) part of each element
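As a small illustration of how these instructions map to code (a sketch using compiler intrinsics; compile with -mavx512f; the function name is mine), a saxpy-style loop processes 16 floats per iteration with a fused multiply-add:

#include <immintrin.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i], 16 single-precision floats per iteration. */
void saxpy512(float a, const float *x, float *y, size_t n) {
    __m512 va = _mm512_set1_ps(a);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);   /* fused multiply-add */
        _mm512_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                      /* scalar tail */
        y[i] = a * x[i] + y[i];
}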

The VNNI instruction set is designed specifically to accelerate deep learning inference tasks. Together, these two technologies (AVX-512 and VNNI) harness specialized Intel processor instructions to exploit the CPU's parallel processing capabilities.

The VNNI instruction set provides dedicated instructions for performing convolutional neural network operations such as matrix multiplication, a key operation in many neural network algorithms.

By offloading this computation to VNNI instructions, applications can achieve significant speedups compared to running without them. This enables faster training times and more responsive inferencing.

In addition, the increased performance provided by AVX-512 and VNNI can enable new use cases in AI and ML, such as training larger models or performing more complex inference tasks.

VNNI Instructions

VPDPBUSD: Multiply unsigned 8-bit integers by signed 8-bit integers, sum each group of four adjacent products, and accumulate into packed 32-bit integers
VPDPBUSDS: Same as VPDPBUSD, with signed saturation of the 32-bit accumulator
VPDPWSSD: Multiply signed 16-bit integers, sum each pair of adjacent products, and accumulate into packed 32-bit integers
VPDPWSSDS: Same as VPDPWSSD, with signed saturation of the 32-bit accumulator
VP4DPWSSD: Four-iteration variant of VPDPWSSD (AVX512_4VNNIW)
VP4DPWSSDS: Four-iteration variant of VPDPWSSDS with signed saturation (AVX512_4VNNIW)
VDPBF16PS: Dot product of bfloat16 pairs, accumulated into packed single-precision values (AVX512_BF16)
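As a sketch of how VPDPBUSD is used in practice (via the _mm512_dpbusd_epi32 intrinsic; compile with -mavx512f -mavx512vnni; the function name and the assumption that n is a multiple of 64 are mine), an int8 dot product between unsigned activations and signed weights looks like this:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of n unsigned 8-bit activations with n signed 8-bit weights.
 * Each call to _mm512_dpbusd_epi32 accumulates 4 byte products per 32-bit
 * lane, 64 bytes per iteration. Assumes n is a multiple of 64. */
int32_t dot_u8s8(const uint8_t *a, const int8_t *w, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vw = _mm512_loadu_si512((const void *)(w + i));
        acc = _mm512_dpbusd_epi32(acc, va, vw);  /* VPDPBUSD */
    }
    return _mm512_reduce_add_epi32(acc);         /* horizontal sum */
}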

Additional use cases where Intel AVX-512 improves performance:

Hyperscan: A library that can match large numbers of patterns simultaneously with high performance and good scalability. It has been used to improve Envoy (service mesh) routing performance and efficient packet processing with the DPDK framework.

Simdjson: A library that parses gigabytes of JSON per second; it is 4-25x faster than alternatives.

4th Gen Intel Sapphire Rapids Processors (Future)

AWS will soon be releasing the next generation of instances based on Intel Sapphire Rapids. There are a number of accelerators available with Sapphire Rapids; which accelerators other than AMX will be enabled on AWS instances is yet to be announced.

  • AMX: Advanced Matrix Extensions (AMX) instructions use a new set of two-dimensional registers called tiles, primarily to boost the performance of AI training and inference workloads.
  • DLB: The Dynamic Load Balancer (DLB) accelerator offers features like packet prioritization and dynamically balances network traffic across CPU cores as the system load fluctuates.
  • QAT: QuickAssist Technology (QAT) offloads data encryption, decryption, and compression.
  • IAA: The In-Memory Analytics Accelerator (IAA) accelerates analytics performance by offloading work from the CPU cores, improving database query throughput and other functions.
  • DSA: The Data Streaming Accelerator (DSA) improves data movement by offloading data-copy and data-transformation operations from the CPU.
