Introducing new memory types to CacheLib #102
-
Hi, thank you for putting together a detailed design. This seems very interesting! We have a few questions regarding some details, and we would also like to know what your plans are.

Question on details:

Question on next steps: Have you run any benchmarks with this design? How does it perform compared to standard CacheLib with local + far memory? (How does it compare to using NvmCache directly on far memory?)
-
I am trying to build a three-layer DRAM + PMem + NVMe architecture with your latest version.
It says the current implementation doesn't support multiple memory tiers. Can you point out whether there is any misunderstanding in my config?
-
What are you hoping to accomplish with the benchmarking with respect to the code changes? Do you envision these PRs eventually being merged into the main cachelib branch? This will play a role in how we review the design: design requirements for merging into the main branch would need more consideration than a design review to evaluate a prototype.

Some design-specific feedback:

File-backed memory support in the shared memory manager and the configuration changes sound good from a design point of view. I have a related question about the Item layout, though: wouldn't it be better if Items in far memory that are byte-addressable had their headers separated from the item data layout?

With 64 bits, it is no longer a "compressed" pointer, and I think the only benefit we get is pointer fix-up for cache persistence mode. It might be worth seeing whether the memory mode can be tracked at the slab level instead.

Using the wait-context sounds like a good idea. Does this ensure that an item cannot be mutated while it is in the process of a move between tiers? Cachebench has a consistency-testing mode which can also help flush out concurrency bugs; have you tried that? (@therealgymmy pointed to some instructions on how to do this in an earlier comment.)
-
Update from the Intel team:
-
We are a software team at Intel working to implement support for different memory types in CacheLib, to enable performance experiments and optimizations for use cases that could benefit from memory types beyond the currently supported DRAM and flash/NVMe.
The vast majority of contemporary servers present a homogeneous memory abstraction. As far as user-space software is concerned, all main system memory is the same. That memory is usually DRAM, with latencies in the neighborhood of ~50-100ns. But the server landscape is changing rapidly towards more heterogeneous systems in pursuit of ever-greater efficiency.
Heterogeneous computing is already commonplace, with CPUs often working in tandem with various accelerators (GPUs, TPUs, ...). Memory technology is following a similar trend: persistent memory is already broadly available on the market, and innovative cache-coherent interconnects (e.g., CXL) are not far behind. Those types of auxiliary memory, sometimes referred to as far memory, typically have higher latency than DRAM but come with other benefits, be it higher bandwidth (HBM); persistence, capacity, and affordability (PMem); or better manageability and other characteristics (CXL.mem).
Heterogeneous memory, just like heterogeneous compute, requires software to understand how to use it efficiently. Primarily, software needs to make data placement decisions on which memory to use for what purpose. Very frequently accessed locations, such as metadata, are typically best placed on the lowest latency memory. In contrast, less often accessed data should be placed in more cost-efficient tiers. Data placement decisions can usually be made effectively in the kernel, provided that user-space applications can guarantee data locality at a page level. This is conceptually similar to how caching on a block device works, but software retains the ability to immediately and directly access individual memory locations (cache lines).
With this publication, we would like to get early feedback from the CacheLib community for our proposed design. Our goal is to create a solution that is easy to use and experiment with, applicable to common usage scenarios, and backward compatible.
At a high level, we suggest allowing cached items to migrate from the fastest memory (e.g., DRAM) to a slower memory as they get warm, and then to even slower memory. The current implementation of CacheLib can already be configured to use NVMe as an additional memory layer, which keeps warm items on NVMe instead of evicting them from the cache.
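For reference, a hybrid DRAM + NVMe cache is set up roughly as follows; this is a minimal sketch based on CacheLib's documented NvmCache/Navy configuration, with placeholder paths and sizes:

```cpp
#include "cachelib/allocator/CacheAllocator.h"

using Cache = facebook::cachelib::LruAllocator;

Cache::Config config;
config.setCacheSize(1024 * 1024 * 1024); // 1 GB DRAM layer

// Back the DRAM cache with flash via the Navy engine, so warm items
// spill to NVMe instead of being evicted outright. The file path and
// 10 GB size below are placeholders.
Cache::NvmCacheConfig nvmConfig;
nvmConfig.navyConfig.setSimpleFile("/mnt/nvme/cachefile",
                                   10ULL * 1024 * 1024 * 1024);
config.enableNvmCache(nvmConfig);

Cache cache(config);
```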
We are working on generalizing this approach to other modern types of memory, for example persistent memory (PMem), so that CacheLib users can take advantage of faster and larger memory tiers.
We made the following modifications to the code:
The idea behind these PRs is to introduce support for different memory types by extending the ShmManager implementation. The original design only supports POSIX and SysV shared memory segments (which use shm_open and shmget, respectively, for memory allocation). In the PRs mentioned above, we introduce a third type of segment: FileShmSegment. It provides access to file-backed memory, which can be used to expose PMem or CXL memory to CacheLib.
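The mechanism boils down to mapping a file (e.g., on a DAX-mounted PMem filesystem) into the process address space and handing the mapping to the allocator, just as with a POSIX/SysV segment. A minimal sketch of the underlying idea, not the actual FileShmSegment code:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <stdexcept>

// Map a file-backed segment into memory. On a DAX filesystem this gives
// the process direct, byte-addressable access to PMem (or to CXL memory
// exposed the same way).
void* mapFileBackedSegment(const char* path, std::size_t size) {
  int fd = ::open(path, O_RDWR | O_CREAT, 0644);
  if (fd < 0) {
    throw std::runtime_error("failed to open segment file");
  }
  // Size the file to the requested segment size.
  if (::ftruncate(fd, static_cast<off_t>(size)) != 0) {
    ::close(fd);
    throw std::runtime_error("failed to size segment file");
  }
  void* addr = ::mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
  ::close(fd); // the mapping keeps the file accessible
  if (addr == MAP_FAILED) {
    throw std::runtime_error("mmap failed");
  }
  return addr;
}
```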
The configuration API was extended so that a user can set up a heterogeneous-memory cache. Specifically, we added a new configureMemoryTiers method to CacheAllocatorConfig. This method accepts a vector of MemoryTierCacheConfig structures that describe the type of memory of each tier (via a path to a memory-mapped file) and its size or size ratio. The example below shows a configuration that uses two tiers: the top tier is created using shared memory (POSIX or SysV), and the bottom tier is built on top of a file given by its path; the bottom tier is twice as large as the top one.
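A minimal sketch of such a configuration; the fromShm()/fromFile() factories and the setRatio() setter are our working names and may change (only configureMemoryTiers and MemoryTierCacheConfig are fixed above):

```cpp
LruAllocator::Config config;
config
    .setCacheSize(48UL * 1024 * 1024) // total size across all tiers
    .configureMemoryTiers({
        // Top tier: regular shared memory (POSIX or SysV), relative size 1.
        MemoryTierCacheConfig::fromShm().setRatio(1),
        // Bottom tier: file-backed memory (e.g., PMem exposed via a DAX
        // file), twice as large as the top tier.
        MemoryTierCacheConfig::fromFile("/mnt/pmem0/cachelib-tier1")
            .setRatio(2),
    });
```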
We converted MemoryAllocator and the MMContainers to arrays to support multiple memory tiers: each memory tier has its own MemoryAllocator with its own memory pools, and its own MMContainers. The AccessContainer was left unmodified: the same data structure still indexes all items. There were also no changes (except for item promotion) on the find path: the user still gets direct access to the item's memory, and no copies are made.
Memory pool IDs match across tiers: if an element from the memory pool with ID=1 is evicted from tier 0 to tier 1, it ends up in the memory pool with ID=1 in tier 1. The relative sizes of the memory pools are the same in each tier.
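Conceptually, the per-tier state looks like the sketch below (stand-in types, not the actual class layout): allocators and eviction containers become arrays indexed by tier, while a single access container spans all tiers.

```cpp
#include <memory>
#include <vector>

// Empty stand-ins for the real CacheLib types, for illustration only.
class MemoryAllocator {};
class MMContainer {};
class AccessContainer {};

struct TieredCacheState {
  // One MemoryAllocator per tier. Pools are created with matching
  // PoolIds and the same relative sizes on every tier, so an item
  // evicted from pool 1 on tier 0 lands in pool 1 on tier 1.
  std::vector<std::unique_ptr<MemoryAllocator>> allocators_;
  // Eviction (MM) containers, indexed by tier, then pool.
  std::vector<std::vector<std::unique_ptr<MMContainer>>> mmContainers_;
  // A single AccessContainer still indexes items across all tiers.
  std::unique_ptr<AccessContainer> accessContainer_;
};
```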
Since each memory tier is an independent piece of memory, we also had to modify CompressedPtr so that it records which tier a pointer belongs to. This increased the size of CompressedPtr from 32 to 64 bits (we use the topmost 32 bits to store the tier ID). We believe this approach can be improved, and it is open for discussion.
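The packing itself is straightforward; a sketch of the idea (the class name and field layout here are illustrative, and the real CompressedPtr also encodes slab and allocation indices):

```cpp
#include <cstdint>

// 64-bit compressed pointer: tier ID in the topmost 32 bits, the
// original 32-bit compressed value in the bottom.
class TieredCompressedPtr { // hypothetical name
 public:
  TieredCompressedPtr(uint32_t tierId, uint32_t compressed)
      : raw_((static_cast<uint64_t>(tierId) << 32) | compressed) {}

  uint32_t tierId() const { return static_cast<uint32_t>(raw_ >> 32); }
  uint32_t compressed() const { return static_cast<uint32_t>(raw_); }

 private:
  uint64_t raw_;
};
```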
There are several incomplete work items: 1) we have not yet implemented proper serialization support for multiple tiers in the current version; 2) some operations and methods for gathering statistics still need to be extended to work with multiple tiers (e.g., the currentTier() method).
Eviction:
If an item is a candidate for eviction (according to the eviction policy), we first try to evict it to the next memory tier, if one is configured. If no next memory tier is configured, or eviction to it fails, we try to evict the item to NVMe storage.
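In pseudocode terms, the flow looks like this; all names are hypothetical stand-ins rather than the actual CacheLib code:

```cpp
#include <cstddef>

struct Item {};
using TierId = std::size_t;

// Stand-ins for the real move/flush operations.
bool evictToNextTier(Item&, TierId) { return true; } // inter-tier move
bool evictToNvm(Item&) { return true; }              // NVMe write

bool tryEvict(Item& candidate, TierId tier, std::size_t numTiers,
              bool nvmEnabled) {
  // First try to demote the item to the next (slower) memory tier.
  if (tier + 1 < numTiers && evictToNextTier(candidate, tier + 1)) {
    return true;
  }
  // No next tier, or the move failed: fall back to NVMe if configured.
  if (nvmEnabled && evictToNvm(candidate)) {
    return true;
  }
  // Otherwise the item is simply evicted from the cache.
  return false;
}
```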
Our design was done with concurrency in mind: an Item that is being moved between memory tiers can still be accessed by concurrent threads. Consider an eviction operation racing with a find operation on the same Item: the high-level idea is to use a Wait Context and make lookup threads wait on the Item handle while the move operation is in progress.
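Conceptually this behaves like a promise/future pair; the sketch below uses std::promise/std::future as a stand-in for CacheLib's wait context, not the actual implementation:

```cpp
#include <future>

struct Item {};

// The thread performing the inter-tier move owns the promise; racing
// find() callers share the future and block only if they dereference
// the handle before the move finishes.
struct MoveContext {
  std::promise<Item*> done; // fulfilled once the item lands on the new tier
};

// Mover thread: copy the item, update the access container, then wake
// every waiting reader with the item's new address.
void completeMove(MoveContext& ctx, Item* newLocation) {
  ctx.done.set_value(newLocation);
}

// Reader thread: a find() racing with the move gets a future instead of
// a raw pointer and waits for the new location.
Item* waitForItem(std::shared_future<Item*> pending) {
  return pending.get(); // blocks until completeMove() runs
}
```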
Promotion:
The main difference between memory tiering and the hybrid cache (DRAM + NVMe storage) is that all memory tiers allow direct access, without any requirement to promote an Item to the top-most tier on access. For promotion, we plan to introduce a promotion-policy abstraction: on each find request, the policy will decide whether to promote the Item to the top tier or to return a handle to it in its current tier.
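A sketch of what such a promotion-policy hook could look like; the interface name and signature are guesses, not a committed API:

```cpp
#include <cstddef>

struct Item {};
using TierId = std::size_t;

class PromotionPolicy { // hypothetical interface
 public:
  virtual ~PromotionPolicy() = default;
  // Called on a find() hit below the top tier. Returning true promotes
  // the item to the top tier; returning false hands back a handle to
  // the item in its current tier (direct access, no copy).
  virtual bool shouldPromote(const Item& item, TierId currentTier) = 0;
};

// Example policy: promote only items found below a configured tier.
class PromoteBelowTier : public PromotionPolicy {
 public:
  explicit PromoteBelowTier(TierId threshold) : threshold_(threshold) {}
  bool shouldPromote(const Item&, TierId currentTier) override {
    return currentTier > threshold_;
  }

 private:
  TierId threshold_;
};
```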
Let us know what you think.