Accelerating AI/ML and Data-centric Applications with Temporal Caching

Author: Allison Goodman, Senior Principal Engineer and Director of Optane Solutions Architecture at Intel

Intel Tech · Apr 20, 2022 · 5 min read


The tech industry is in the midst of “the fifth epoch of distributed computing.”[i] It is a crucial evolutionary moment, driven by the need to support artificial intelligence (AI), machine learning, and data-centric workloads that require real-time data insights. The world is not short of data to power these insights. However, computing resources are often bottlenecked by the inability to retrieve the necessary data quickly enough. This is especially true in massive distributed computing systems, where multiple processors share the same data and memory caching becomes considerably more complicated.

During the 2022 International Solid-State Circuits Conference (ISSCC), Dr. Frank Hady made the case for system-level near-memory compute as a way to overcome the data bottlenecks in future AI data processing systems. The session was titled, “We’ve rethought our commute; Can we rethink our data’s commute?” In his session, Dr. Hady made the following points:

  • AI compute needs are increasing exponentially and demand that data centers maximize performance and minimize system energy consumption.
  • There is a direct correlation between total cost of ownership (TCO) and system energy, because data movement through the memory hierarchy is costly.

The talk introduced system-level relevance criteria for understanding the likely success of near memory compute solutions. This blog continues that discussion by introducing a novel approach to data management that unifies temporal caching data access methods to maximize performance in these increasingly important applications. The foundation for this new approach is Intel® Optane™ persistent memory (PMem), serving as a secondary memory tier.

Background and Terminology

Data is retrieved from main memory and stored in cache memory according to the principle of locality. This principle acknowledges that programs tend to access a relatively small portion of the memory address space at any given time. The general rule of thumb is that the average application spends 90 percent of its time accessing about 10 percent of the data.[ii] Now, there are two different types of locality:

  • Temporal locality (location in time) is a program’s tendency to reuse recently accessed instructions or data. If a program uses an instruction or data variable frequently, that item should be kept close to the CPU, because it is likely to be referenced again soon.
  • Spatial locality (location in space) is the tendency to access items whose addresses are close to one another in memory; once one item is referenced, its neighbors are likely to be referenced soon. (Both effects are illustrated in the short sketch below.)
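To make the two effects concrete, here is a minimal C sketch (a hypothetical routine, not taken from Dr. Hady’s talk): the accumulator is reused on every iteration, which is temporal locality, while the array is walked in address order, so each cache line fetched from memory is fully used before it is evicted, which is spatial locality.

```c
#include <stddef.h>

/* Illustrative only: a hypothetical routine showing both kinds of locality. */
double sum_array(const double *data, size_t n)
{
    double sum = 0.0;       /* 'sum' is reused every iteration: temporal locality   */
    for (size_t i = 0; i < n; i++) {
        sum += data[i];     /* consecutive addresses are read in order: spatial locality */
    }
    return sum;
}
```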

Note: The concept of “semantic locality” as part of knowledge management is beyond the scope of this blog.

If a distributed computing system is using “true sharing,” the involved processors must synchronize their caches to ensure program correctness, which can slow down cache access. In this blog, we focus on temporal caches, which can be accessed far faster than most other caches because programs with high temporal locality tend to have fewer true-sharing cache misses. A true-sharing cache miss can occur when two processors access the same data word, invalidating the cache block in one processor’s cache.

Traditional I/O-centric data management practices that rely on Load-Store instructions to access in-memory data and POSIX file I/O to access persistent data do not provide the consistency, durability, or integrity that AI and other data-intensive applications require. As an alternative, we advocate using secondary memory options like High Bandwidth Memory (HBM) and Intel Optane PMem to unify temporal cache data access and minimize latency. In essence, Intel Optane PMem creates a tiered memory system, where the DRAM serves as an L4 cache for the large-capacity and low-latency Intel Optane PMem.
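As a rough illustration of what it means to access persistent data with load and store instructions rather than POSIX file I/O, the minimal sketch below memory-maps a file on a DAX-mounted persistent-memory file system and then reads and writes it through ordinary pointers. This is only one of several ways PMem can be exposed; the path /mnt/pmem/cache.dat is a placeholder, and production code would more likely use a library such as PMDK on top of this mechanism.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (64UL * 1024 * 1024)   /* 64 MiB example region */

int main(void)
{
    /* Hypothetical path on a DAX-mounted PMem file system. */
    int fd = open("/mnt/pmem/cache.dat", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) != 0) {
        perror("open/ftruncate");
        return EXIT_FAILURE;
    }

    /* Map the region: after this, the data is accessed with load/store
     * instructions instead of POSIX read()/write() calls. */
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    strcpy(region, "cached intermediate result");   /* plain store */
    printf("%s\n", region);                         /* plain load  */

    munmap(region, REGION_SIZE);
    close(fd);
    return EXIT_SUCCESS;
}
```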

Advantages and Challenges Associated with Temporal Caching

Temporal caching is used in a variety of use cases:

  • Time-series data. Many scheduling, banking, medical, and scientific applications manage temporal data. A temporal database stores data together with the time instances to which it relates. More specifically, it associates data with a start time and an end time value. Data can be time-stamped with two concepts of time: a valid-time interval (when the data event occurs in modeled reality) and a transaction-time interval (the period over which the event information is stored in the database). A simplified record layout follows this list.
  • Network operations. Temporal content caching can help improve network operation and end user experience by reducing the distance that packets must travel within a network.[iii]
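As a simplified illustration of the bitemporal model described in the first bullet, the record layout below carries both a valid-time interval and a transaction-time interval. The struct and field names are illustrative, not a standard schema.

```c
#include <time.h>

/* Simplified bitemporal record: names and layout are illustrative. */
struct time_interval {
    time_t start;   /* inclusive start of the interval */
    time_t end;     /* exclusive end of the interval   */
};

struct temporal_record {
    long                 key;          /* application-level identifier              */
    double               value;        /* the recorded fact                         */
    struct time_interval valid_time;   /* when the event occurs in modeled reality  */
    struct time_interval tx_time;      /* when the event is stored in the database  */
};
```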

While the benefits of temporal caches are clear, developers face several challenges when designing them. These challenges include dimensioning the temporal cache (especially relevant to content delivery networks), as well as improving the energy efficiency of memory management. The latter challenge can be partially addressed by using runtime-assisted dead region management (RADAR) to predict and evict dead blocks in last-level caches.[iv]

The Current Temporal Caching Model

An example of a general temporal cache model is a parallel job designed to execute a query execution plan against immutable data, which is presented as a table described by a schema. (A query execution plan is the sequence of steps used to access data in a SQL relational database management system.) The table is generated from log constructs that consist of the current mutable (unsealed) log segment plus many immutable (sealed) log segments. Amazon Redshift, Databricks Delta Lake, F1 Lightning, and Procella all share a common architecture that uses this model, as depicted in Figure 1.

Figure 1. Common high-level data warehouse architecture.

As shown in the diagram, the current design uses storage (locally attached NAND SSDs) to buffer intermediate results of a query following the first scan/merge operation. This data must be paged back into memory before further processing. Unfortunately, the access latency of NAND SSDs is orders of magnitude higher than that of memory.

Aside from high latency, complexity is another problem with the current temporal caching model. Developers typically use one interface (Load-Store) for memory-resident data, and a second I/O-based interface (POSIX) for data that is in storage. The rest of this blog explores a fascinating question: What if I/O operations could be eliminated from all operations in the query execution plan after the initial scan/merge?
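The hypothetical sketch below contrasts the two access paths for one intermediate query result: the first spills it to a scratch file and reads it back through POSIX I/O, while the second places it in a memory-mapped region of the secondary memory tier and hands later operators a plain pointer, so no further I/O is issued. The file names, sizes, and function names are placeholders, not part of any of the systems named above.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESULT_SIZE 4096   /* placeholder size for one intermediate result */

/* Path 1: today's model -- spill the intermediate result to storage,
 * then page it back in with POSIX I/O before the next operator runs. */
static ssize_t spill_and_reload(const char *buf, char *out)
{
    int fd = open("/tmp/spill.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (write(fd, buf, RESULT_SIZE) != RESULT_SIZE) { close(fd); return -1; }
    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, out, RESULT_SIZE);          /* I/O latency paid here */
    close(fd);
    return n;
}

/* Path 2: keep the result in a mapped secondary-memory region; the next
 * operator consumes it through loads and stores, with no further I/O. */
static char *place_in_memory_tier(const char *buf, int pmem_fd)
{
    char *region = mmap(NULL, RESULT_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, pmem_fd, 0);
    if (region == MAP_FAILED) return NULL;
    memcpy(region, buf, RESULT_SIZE);                /* plain stores */
    return region;           /* later operators reuse this pointer directly */
}
```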

Learn more about Implementing a Memory-Centric Temporal Cache here.

Disclaimers and Notices:

Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
