
Accelerate Your Big Data Cluster

Intel Optane Persistent Memory for HDFS Caching

Jian Zhang
9 min read · Mar 6, 2020


By Jian Zhang, Feilong He, and Tao He, Intel Corporation

The Hadoop Distributed File System (HDFS) [1] is a fault-tolerant, scalable, distributed file system designed to run on commodity hardware. It is the primary distributed storage used by Hadoop applications. Though similar to other distributed file systems, HDFS relaxes some POSIX requirements to allow streaming access to data. It is ideally suited to applications that require high throughput across large data sets.

Centralized HDFS caching allows users to specify paths to be cached in DRAM, improving read performance for frequently used data. It takes advantage of DataNode memory: the DataNode loads the specified data blocks into memory and pins them there so that clients requesting those blocks can be served directly from memory. This provides significant performance benefits for small files and for jobs where certain data is frequently accessed, and it allows HDFS to offer different service-level agreements to different workloads.
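
As a concrete illustration, the short Java sketch below adds a cache directive through the HDFS client API, which is equivalent to running the hdfs cacheadmin command-line tool; the pool name and path are hypothetical, and error handling is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
    import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

    public class CacheDirectiveExample {
      public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the target HDFS cluster.
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Create a cache pool, equivalent to `hdfs cacheadmin -addPool hot-data`.
        dfs.addCachePool(new CachePoolInfo("hot-data"));

        // Pin a frequently read directory into the DataNode cache, equivalent to
        // `hdfs cacheadmin -addDirective -path /warehouse/hot_table -pool hot-data`.
        long directiveId = dfs.addCacheDirective(
            new CacheDirectiveInfo.Builder()
                .setPath(new Path("/warehouse/hot_table"))  // hypothetical path
                .setPool("hot-data")
                .setReplication((short) 1)  // cache a single replica of each block
                .build());

        System.out.println("Added cache directive " + directiveId);
      }
    }

Once such a directive is in place, the NameNode instructs DataNodes to load and pin the corresponding blocks, and subsequent reads of that path are served from the cache.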

Big data, IoT, and AI are driving the rapid evolution of storage technology. Persistent memory represents a new storage class that offers high performance, high capacity, and data persistence at an affordable price. Developers can use persistent memory to build a new cache solution based on HDFS caching to achieve higher performance at lower cost.

Motivations

Challenges of HDFS Cache

Centralized cache management in HDFS is a mechanism that enables users to specify paths to directories or files that will be cached in DataNode memory, so that read requests can be served directly from memory to improve performance. In addition, HDFS Cache has the following advantages:

  • Avoid eviction: Explicitly pin data blocks to memory to prevent frequently used data from being evicted from memory.
  • Better locality: Co-locate a task with a cached block replica to improve read performance.
  • Higher performance: Clients can use a new, more efficient, zero-copy read API for cached blocks (see the sketch after this list).
  • Better memory utilization: Explicitly pin only m of the n replicas (where m < n) to save memory.
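
The zero-copy read path is exposed through FSDataInputStream. The sketch below, using a hypothetical file path, shows how a client might consume a cached file with it; the SKIP_CHECKSUMS option permits serving the read by memory-mapping the replica, since the DataNode already verified checksums when the block was cached.

    import java.nio.ByteBuffer;
    import java.util.EnumSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.ReadOption;
    import org.apache.hadoop.io.ElasticByteBufferPool;

    public class ZeroCopyReadExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ElasticByteBufferPool pool = new ElasticByteBufferPool();

        try (FSDataInputStream in = fs.open(new Path("/warehouse/hot_table/part-00000"))) {
          long total = 0;
          ByteBuffer buf;
          // Each call hands back a (possibly memory-mapped) buffer of up to 4 MB;
          // null signals end of file.
          while ((buf = in.read(pool, 4 * 1024 * 1024,
              EnumSet.of(ReadOption.SKIP_CHECKSUMS))) != null) {
            total += buf.remaining();
            in.releaseBuffer(buf);  // always return buffers to the stream when done
          }
          System.out.println("Read " + total + " bytes via the zero-copy API");
        }
      }
    }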

However, because HDFS Cache is based on DRAM, it competes for memory resources with other applications, which can degrade the performance of memory-intensive workloads. In addition, a DRAM cache often incurs a long warm-up time, for example after a DataNode restart due to node failure or maintenance.

HDFS Cache on Intel Optane Persistent Memory

Intel Optane Persistent Memory [2] (henceforth referred to simply as Optane) is an innovative memory technology that sits between memory and storage. It delivers a unique combination of affordable large capacity and support for data persistence. Optane modules are available in 128 GB, 256 GB, and 512 GB capacities, much larger than typical DRAM DIMMs. This allows users to add a large, low-cost, flexible (it can be volatile or non-volatile), high-performance memory tier to their storage hierarchy. Various workloads can take advantage of this new tier, e.g., virtual machines in cloud environments, in-memory databases, and file systems.

Optane has two operating modes: memory mode and app-direct mode [3]. In memory mode, Optane extends the amount of volatile memory visible to the operating system, and DRAM acts as a cache for the Optane memory. In app-direct mode, software talks directly to the Optane tier and decides whether to read or write data in volatile DRAM or in Optane persistent memory.
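
In app-direct mode, a PMem namespace is typically exposed as a DAX-capable file system (mounted, for example, at a path such as /mnt/pmem0, which is an assumption here). The minimal Java sketch below illustrates the idea of load/store access by memory-mapping a file on such a mount; it is not HDFS code, and production persistent-memory software would normally use PMDK for efficient, fine-grained flushing.

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class AppDirectSketch {
      public static void main(String[] args) throws IOException {
        // Hypothetical DAX-mounted PMem file system.
        Path cacheFile = Paths.get("/mnt/pmem0/example.cache");

        try (FileChannel ch = FileChannel.open(cacheFile,
            StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
          // On a DAX file system, the mapping provides load/store access to the
          // media, bypassing the page cache.
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
          buf.put("hello pmem".getBytes(StandardCharsets.UTF_8));
          // force() flushes the mapped range; PMDK offers user-space flushing
          // primitives that avoid this system call.
          buf.force();
        }
      }
    }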

Optane persistent memory (PMem) fits HDFS Cache usage quite well:

  1. Users can build a larger cache pool than with a DRAM cache pool, so it can cache more data, deliver higher performance, and benefit more workloads.
  2. PMem-based HDFS Cache has a much smaller memory footprint, saving memory for more compute-intensive workloads.
  3. By leveraging data persistence, PMem-based HDFS Cache does not need cache warm-up. This significantly reduces overhead in the event of node failure or planned cluster maintenance, which are common in large Hadoop clusters.

HDFS PMem Cache

HDFS Cache

Understanding how HDFS centralized cache works helps in understanding the HDFS PMem Cache design. The HDFS Cache architecture is shown in Figure 1. The NameNode coordinates all DataNode caches in the HDFS cluster; it periodically receives a cache report from each DataNode that describes all the blocks cached on that DataNode. The NameNode manages DataNode caches by piggybacking cache and uncache commands on the DataNode heartbeat. HDFS Cache works as follows:

  1. The admin sends a command to the NameNode to cache a path.
  2. The NameNode translates the path into a set of blocks, adds them to the pending cache queue, and piggybacks cache commands on the heartbeat responses.
  3. The DataNode heartbeat contains a cache block report.
  4. The DataNode caches data from HDD into DRAM.
  5. The client reads cached data directly from DRAM.
Figure 1. HDFS Cache architecture

HDFS PMem Cache Design

Figure 2 illustrates the HDFS PMem Cache architecture. Optane replaces DRAM as the storage medium for HDFS Cache. After the user-specified path is cached on the PMem device, all subsequent reads can be satisfied directly from the PMem cache. Because HDFS PMem Cache leverages PMem's data persistence to build a persistent cache, app-direct mode was used in the design.

The HDFS PMem Cache design is consistent with how the DRAM cache works: when a DataNode receives a DNA_CACHE command, it tries to pull the corresponding blocks from HDD to PMem; when a DNA_UNCACHE command is received, the DataNode removes the cached blocks from PMem. If multiple PMem volumes are configured, a simple round-robin policy is used to select an available volume to cache a block; the selection policy can be extended in the future to consider the space usage of the PMem devices. If no PMem volume is able to cache more blocks, an exception is thrown and the block cache action fails, which is consistent with the current DRAM cache implementation.
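
The round-robin volume selection described above can be pictured with the hypothetical sketch below; the class and method names are illustrative and are not the actual Hadoop classes.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch of round-robin PMem volume selection (not Hadoop code).
    public class RoundRobinVolumePicker {

      // Minimal volume abstraction for the sketch.
      public interface PmemVolume {
        long availableBytes();
      }

      private final List<PmemVolume> volumes;
      private final AtomicInteger next = new AtomicInteger();

      public RoundRobinVolumePicker(List<PmemVolume> volumes) {
        this.volumes = volumes;
      }

      // Try each configured volume in turn, starting from the round-robin cursor;
      // if none has room for the block, fail the cache action, mirroring the
      // behavior described above.
      public PmemVolume pickVolumeFor(long blockSize) {
        for (int i = 0; i < volumes.size(); i++) {
          PmemVolume v = volumes.get(Math.floorMod(next.getAndIncrement(), volumes.size()));
          if (v.availableBytes() >= blockSize) {
            return v;
          }
        }
        throw new IllegalStateException("No PMem volume has enough space to cache the block");
      }
    }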

When a DataNode receives a read request from a client and the corresponding block is cached on PMem, the DataNode instantiates an InputStream with the block's location path on PMem (Java implementation) or with the PMem cache address (PMDK-based implementation [4]). Once the InputStream is created, the DataNode sends the cached data to the client.
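
For the Java-based implementation, a cached block is simply a file on the DAX-mounted PMem volume, so serving it boils down to a file-to-socket transfer. The sketch below is a simplified, hypothetical illustration of that read path; the path layout and method names are not Hadoop's actual ones.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Hypothetical, simplified sketch of serving a PMem-cached block (not Hadoop code).
    public class PmemCachedBlockSender {

      public static void sendCachedBlock(String pmemBlockPath, OutputStream clientOut)
          throws IOException {
        WritableByteChannel out = Channels.newChannel(clientOut);
        // e.g. a block file living under a DAX mount such as /mnt/pmem0/... (illustrative).
        try (FileChannel ch = FileChannel.open(Paths.get(pmemBlockPath), StandardOpenOption.READ)) {
          long pos = 0;
          long size = ch.size();
          // transferTo lets the kernel move bytes from the PMem-backed file to the
          // client connection without an extra copy through user space.
          while (pos < size) {
            pos += ch.transferTo(pos, size - pos, out);
          }
        }
      }
    }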

Figure 2. HDFS PMem Cache architecture

HDFS PMem Cache Implementation

There are two implementations of the PMem cache: a pure Java implementation and a native PMDK-based implementation. The default is the pure Java implementation; the native implementation leverages the PMDK library to improve cache read/write performance. Users choose which implementation to use via an option in the build phase. Both implementations have been merged into the Hadoop trunk [5].
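
As a rough illustration of how a DataNode is pointed at PMem cache volumes, the snippet below sets the relevant configuration programmatically; in practice these keys go into hdfs-site.xml on each DataNode. The property name dfs.datanode.pmem.cache.dirs is taken from the upstream feature [5], but treat it and the values here as assumptions to verify against your Hadoop version's documentation.

    import org.apache.hadoop.conf.Configuration;

    public class PmemCacheConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // PMem cache: comma-separated DAX-mounted PMem directories used as the
        // cache medium (paths are hypothetical).
        conf.set("dfs.datanode.pmem.cache.dirs", "/mnt/pmem0,/mnt/pmem1");

        // DRAM cache alternative: maximum amount of memory (in bytes) a DataNode
        // may lock for caching blocks.
        conf.setLong("dfs.datanode.max.locked.memory", 64L * 1024 * 1024 * 1024);

        System.out.println("PMem cache dirs: " + conf.get("dfs.datanode.pmem.cache.dirs"));
      }
    }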

Figure 3. HDFS PMem cache implementation

Performance Evaluation

System Configurations

The test cluster consists of two nodes, each equipped with two Intel Xeon Gold 6240 processors and with DRAM and PMem capacities chosen so that the two configurations have roughly the same cost (an ISO-cost comparison). The DRAM cache configuration used 24 x 32 GB DRAM DIMMs, while the PMem cache configuration used 8 x 128 GB PMem modules. For DataNode storage, each node was configured with 6 x 1 TB HDDs. Apache Hadoop 3.1.2 with a patch adding HDFS PMem Cache support was used. Configuration details are shown in Table 1.

Table 1. System configuration

Test Methodology

We compared three cases with both DFSIO [6] and decision support workloads, HDD (no cache), DRAM cache, and PMem cache, to understand the performance benefits of HDFS PMem Cache. For the warm-run methodology, the HDFS cacheadmin utility was used to load the test files into HDFS Cache before the formal performance tests. Multiple runs were executed, and the median results are reported. Table 2 shows the test configuration details.

Table 2. Test configuration

Performance

DFSIO Performance

For DFSIO, we tested both random and sequential reads with a 1 TB data size. The raw capacity of the HDFS PMem Cache is 918 GB, while the raw capacity of the HDFS DRAM cache is 560 GB because some DRAM is reserved for Hadoop jobs. So, for a 1 TB data size, the PMem cache can cache almost the entire data set, while DRAM can cache only part of it. As shown in Figure 4, in the ISO-cost configuration HDFS PMem Cache delivered up to 3.16x speedup for random read and 6.09x speedup for sequential read compared with the DRAM cache in the DFSIO 1 TB tests.

Figure 4. DFSIO performance

Figure 5 shows disk bandwidth characterization data for DFSIO random read. It demonstrates that, with its larger capacity, HDFS PMem Cache can cache more data than the DRAM cache and thus reduce the amount of data loaded from HDD.

Figure 5. Disk bandwidth comparison

Decision Support Workload Performance

The decision support workload is a typical workload that models multiple aspects of a decision support system, including queries and data maintenance. We selected 54 queries that represent the typical SQL query behavior in Hadoop clusters.

For a 1 TB data set, the raw data size differs by format: 1 TB Parquet is 422 GB, 1 TB ORC is 359 GB, and 1 TB text is 916 GB. The HDFS PMem Cache size is 918 GB, while the HDFS DRAM cache size is 560 GB. Thus, for the 1 TB text format, DRAM cannot cache all the data, but for the Parquet and ORC formats both DRAM and PMem can cache all the data. As shown in Figure 6, the PMem cache provided 2.84x and 7.45x speedups over the DRAM cache and HDD, respectively, for the text format, because PMem can cache all the tables while DRAM cannot.

Figure 6. Decision support workloads performance for 1 TB data set

We redesigned the test case with a larger data set to simulate a partial cache case. With a 2 TB Parquet data set, the raw data size is 816 GB. HDFS PMem Cache capacity is 918 GB, so it can fully cache the data set. But for DRAM cache, the usable capacity is 560 GB, so it can only cache part of the data set. As shown in Figure 7, PMem cache provided 2.44x speedup over no cache for Parquet format and 1.23x speedup over DRAM cache.

Figure 7. Decision support workloads performance for 2 TB data set

Summary

HDFS Cache is a centralized, memory-based cache management mechanism in HDFS. It provides performance and scalability benefits in many production environments and can be used to accelerate queries and other jobs in which certain tables or partitions are frequently accessed. Intel Optane Persistent Memory is a perfect fit for HDFS Cache because it can create a large cache pool at lower cost than DRAM and thus deliver significant performance benefits. Our tests showed promising results for HDFS PMem Cache under an ISO-cost configuration compared with DRAM: it delivered 3.16x (random) and 6.09x (sequential) speedups over the DRAM cache for DFSIO, and a 2.84x speedup over the DRAM cache (partial cache, text format) for the decision support workload.

System Configurations

PMem Configuration: Tested by Intel as of 09/05/2019. 2-socket Intel Xeon Gold 6240 processor, 18 cores, HT on, Turbo on. Total memory 192 GB (12 slots / 16 GB / 2666 MHz), PMem 1 TB (8 slots / 128 GB / 2666 MHz), PMem firmware version 01.02.00.5360, BIOS SE5C620.86B.02.01.0008.031920191559 (ucode 0x500001c), Fedora 29 (Server Edition), kernel 4.20.6-200.fc29.x86_64. Storage for application: 6 x 1 TB HDD (ST91000640NS) for HDFS DataNode. Software: Apache Hadoop 3.1.2, Apache Spark 2.3.0, Apache Hive 2.3.2.

DRAM Configuration: Tested by Intel as of 09/05/2019. 2-socket Intel Xeon Gold 6240 processor, 18 cores, HT on, Turbo on. Total memory 768 GB (24 slots / 32 GB / 2666 MHz), BIOS SE5C620.86B.02.01.0008.031920191559 (ucode 0x500001c), Fedora 29 (Server Edition), kernel 4.20.6-200.fc29.x86_64. Storage for application: 6 x 1 TB HDD (ST91000640NS) for HDFS DataNode. Software: Apache Hadoop 3.1.2, Apache Spark 2.3.0, Apache Hive 2.3.2.

References

  1. HDFS Architecture
  2. Intel Optane DC Persistent Memory
  3. Intel Optane DC Persistent Memory Modes Explained
  4. pmem.io: Persistent Memory Programming
  5. Non-Volatile Storage Class Memory in HDFS Cache Directives
  6. Running DFSIO MapReduce Benchmark Test
