bees causes memory fragmentation and low cached memory #260

lilydjwg · 2023-07-08T12:16:32Z

When bees is running, I get a lot of free memory but low cached usage, and more swap usage, i.e.

You can see the change here:

Also, kcompactd is constantly running.

This makes pagecache hit rate lower. There are many more disk reads, making processes accessing disks slow or unresponsive. I have three disks, nvme0n1 is my system disk, sda is a data disk receiving some constant writes, sdb is the one bees running with (it started at 13:00 and before that it was deleting unneeded snaphosts). Both sda and sdb are spinning disks.

There are a lot of snapshots on sdb and I'm using --scan-mode=0. I've also observed the same issue on another machine running bees and without spinning disks.

The text was updated successfully, but these errors were encountered:

kakra · 2023-07-08T20:53:08Z

This is probably the same as #257. I can observe this behavior since switching from LTS 5.19 to 6.1. It's not that bees occupies free memory, it's rather btrfs seems to create high memory fragmentation as can be observed in /proc/buddyinfo.

lilydjwg · 2023-07-09T02:12:51Z

I see those vmalloc errors on 6.3 too, but this one is running 6.1. I'll check /proc/buddyinfo (I didn't know about it).

Zygo · 2023-07-09T03:14:26Z

The vmalloc bug from #257 was backported to 6.1. The fix hasn't been backported yet, so 6.1 is currently broken (and so is 5.15).

One thing to try is reducing the size of the buffers for LOGICAL_INO by changing this line in include/crucible/fs.h:

            BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 16 * 1024 * 1024);

Try lower values for buf_size, e.g. 64 * 1024. This will drastically reduce the number of reflinks bees can handle to a single extent, but that size is beyond more than enough for most filesystems. Making the buffer smaller may also reduce the size of vmallocs which might be aggravating the kernel's memory manager.

If this does work, then maybe a workaround is possible: use the small buffer size at first, but if the buffer fills up, then switch to the larger buffer.

Off the top of my head I don't know why the kernel would allocate a buffer here, since userspace is already providing a buffer, but this is off the top of my head--I haven't looked at the kernel code for this issue.

lilydjwg · 2023-07-09T04:10:56Z

This is the first time memory fragmentation caused issues for me. My memory is already very fragmented. I can see counts of lower orders of blocks change significantly.

I've applied the settings in #257 (comment) and it seems better now (much less cache miss; I no longer see my postgres processes being in D state frequently and causing timeout everywhere).

Not sure what a lower buf_size has achieved. It seems that cached memory size no longer changes a lot.

10:40: bees starts
10:55: applied those kernel settings mentioned above
11:40: bees with 64 * 1024 buf_size starts

How many reflinks is "a single extent"? I should have a few dozens of reflinks for most files on this filesystem.

I'll skip newer 6.1 kernels and update to latest 6.3 to avoid those vmalloc messages. They seem to not cause visible harm but waste journald's disk space.

kakra · 2023-07-09T11:10:46Z

If this does work, then maybe a workaround is possible: use the small buffer size at first, but if the buffer fills up, then switch to the larger buffer.

How do we know if the buffer is too small?

Zygo · 2023-07-09T14:20:39Z

If the number of references returned is equal to the maximum number for that buffer, then the buffer is probably too small (it might be exactly the right size, but that's unlikely enough not to matter). Then we can use a larger buffer and repeat the entire LOGICAL_INO operation. With ideal kernel behavior, we would just use the larger buffer all the time, as there's no way to restart or extend a LOGICAL_INO result.

A smaller buffer limits the total number of references that bees can create to a common block of data. Once it hits 10,000 or so, other parts of btrfs start getting slower, so it's arguably not worth creating that many references in any case. A 64K buffer holds 2730 references, a 16 MB buffer holds a little less than 700,000.

lilydjwg · 2023-07-09T14:51:28Z

A 64K buffer holds 2730 references

Thanks for the explanation. That's quite a lot! It's still possible that my filesystem could reach this limit but it's good enough for me.

kakra · 2023-07-09T16:04:05Z

So bees could dynamically increase that number by 64k up to 3 times until we pass around 10k references. So if it would get exactly 2730 refs, it would use a 128k buffer for future requests? Or maybe even increase in steps of 4k?

OTOH, the kernel should handle the allocations better.

Does the kernel write to that buffer and reads data back in? Then I could imagine why it does not write directly to the user-space buffer: It could be modified while the kernel is still working with it, and thus open attack vectors.

For now, I'll be shipping the Gentoo version of bees with 64k buffer.

Zygo · 2023-07-09T17:17:52Z

It would have to skip directly from the minimum size to the maximum size (which would initially be 64K and 16MB) when the minimum size fills up. It can keep track of the distribution of ref counts and adjust the minimum size until it is large enough for 99% of all requests. We don't want to do more than two requests for any given block, since all of the prior work is wasted when we repeat the request, and the large requests aren't that expensive if they are rare. We can also do things like limit the number of threads that can use a large buffer, while allowing all threads to use a small one at the same time.

The kernel's risks are the same whether it writes the buffer as it goes, or does one giant memcpy at the end; however, there might be some issues with memory fault errors interrupting the kernel thread. Either way, there is a lot of room for improvement of LOGICAL_INO, starting from making it work incrementally, or providing a way for userspace to follow metadata tree page backrefs.

kakra · 2023-07-11T08:02:12Z

We don't want to do more than two requests for any given block, since all of the prior work is wasted when we repeat the request, and the large requests aren't that expensive if they are rare.

Would it be possible to first dedupe the refs found by the first lookup, and then resubmit that extent later to see if there are more refs? As far as I understand, that could provide some level of incremental work but care would need to be taken to prevent loops if no more progress is made and the same refs show up over and over again.

Zygo · 2023-07-13T12:57:19Z

Would it be possible to first dedupe the refs found by the first lookup, and then resubmit that extent later to see if there are more refs?

It's possible, but it would only be productive on large extents. For those 4K blocks that appear a million times on a filesystem, it's better to simply ignore them when they have too many references, unless the dedupe is combined with a defrag operation to make the extents larger.

kakra · 2023-07-13T13:12:14Z

Okay but in theory, bigger extents tend to have fewer refs - statistically. So in the end, such a "resubmit" because of a small buffer would not happen anyways, and we tend to ignore the huge ref counts for small common extents anyways due to bad timing behavior. So yes, there's probably no point in resubmitting such extents, but there's also no point in having a huge buffer of 16M and 64k or 128k should work just fine. Unless the defrag thing happens...

So I conclude going with my 64k patch is just fine for the moment.

This leaves the patch problematic for filesystems with thousands of snapshots - and under that condition, btrfs doesn't behave very well anyways.

Zygo · 2023-07-13T13:43:13Z

I've been collecting some data on that. 64K is a little too small (it's very common to blow past 2730 refs with a year's worth of snapshots, which is why I made the patch to increase it in the kernel), but the ideal size seems to be smaller than 256K (once an extent has 10920 refs it takes a millisecond of CPU to process each one). Maybe go with a 128K buffer for subvol scans?

My extent tree scanner prototype does much better with snapshots, but it has its own LOGICAL_INO code (because IGNORE_OFFSETS is mandatory), so it can easily construct the object with a different buffer size. I suspect defrag will be able to do the same thing.

kakra · 2023-07-13T21:28:23Z

with a year's worth of snapshots

You mean: daily snapshots? hourly?

Zygo · 2023-07-13T21:45:13Z

Daily (365 snapshots) would leave only 7.49 references available for each extent in a 64K buffer, so there can be only 7 duplicates within the subvol before we hit the limit and can't dedupe further. Containers, build workspaces, even rootfs upgrade snapshots can all easily generate more than 7 duplicate refs after dedupe.

On build servers it can be the other way around: they average 350 or so references to each file, so only 7 or 8 snapshots can be deduplicated.

There is a log message for this:

 addr 0x46ce881fc000 refs 3112 beats previous record 2730

which will start appearing once the 64K limit has been exceeded.

kakra · 2023-07-13T21:48:54Z

Yeah okay I think I get how the numbers calculate. Thanks for the explanation.

Zygo · 2023-07-17T21:22:38Z

Note to self: LOGICAL_INO does report the total number of references when the buffer overflows, in btrfs_data_container::elem_missed (that plus elem_cnt = total number of refs). Unfortunately the kernel still does all the work to find each reference just to fill in that count, so it's not exactly efficient to e.g. call with a zero buffer size, then call again with a buffer exactly the right size for the number of refs found.

Zygo · 2023-07-17T21:37:20Z

Getting back to the actual issue for a moment: for the benefit of those who haven't clicked on it, the graph at #260 (comment) does show a dramatic improvement in kernel behavior right after setting a smaller buf_size.

bees has used 16 MiB buf_size since 2018, with a more recent change at 3430f16 which kept the buffers between uses instead of allocating and deallocating 16 MiB for each ioctl call.

Maybe the second change (in v0.9.3) accidentally triggered some unfortunate kernel behavior, or maybe bees and the kernel have always behaved this way and nobody reported it before. v0.9.3 fixed a number of bees memory management problems which could easily have concealed an issue like this by making the cache page evictions look like "normal" bees process memory usage.

kakra · 2023-07-17T23:36:28Z

Okay, so I thought I didn't see that behavior with last years LTS kernel and I observed bad memory pressure after going to 6.1. But coincidentally, your change to bees in February falls into the same time period when I started to use kernel 6.1. So maybe I should try again with that change reverted?

OTOH, even when I stop bees, the kernel behaves bad when memory cgroups are enabled. So I believe there has been some impactful change in the kernel between last years LTS and current LTS - and it is in memory management. I tried with the new multigen LRU and without. Only turning off both memory cgroups AND transparent hugepages gets this under control for me. We are still facing issues with memory pressure when bees is running on our servers but it's mostly okay now with hugepages in madvise-only mode.

Zygo · 2023-07-18T00:08:32Z

So maybe I should try again with that change reverted?

Sounds like a good idea. If we get an effect that can be used for bisection, then we might be able to find where things went wrong.

I believe there has been some impactful change in the kernel between last years LTS and current LTS - and it is in memory management.

There's no rule that says kernel bugs have to appear one at a time. ;)

Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent references by definition, it follows that we should not allocate buffer space for them when we perform the LOGICAL_INO ioctl. There is some evidence (particularly #260 (comment)) that the kernel is subjecting the page cache to a lot of disruption when trying allocate large buffers for LOGICAL_INO. Signed-off-by: Zygo Blaxell <bees@furryterror.org>

kakra · 2024-04-21T12:35:24Z

Following up from #268 (comment):

... find out why btrfs is causing that high memory fragmentation. I feel like this has become worse and worse with each LTS kernel version. It's probably less an issue within bees but rather a side effect of btrfs doing some very unfavorable memory management in the kernel.

After my repeated issue in #268, it becomes very apparent that btrfs has some really bad memory fragmentation patterns.

As posted before, with cgroups enabled, the kernel almost has no chance of using physical memory anymore, swap usage increases a lot. Most of the memory stays completely free. cgroups steal cache memory a lot from each other and fight over a very tight memory resource. This happens within the first few hours of system operation.

With transparent huge pages, a very similar effect kicks in. This is similar to the previous issue but cache memory fluctuates a lot more. This happens with the first few hours of system operation.

Without cgroups and transparent huge pages, the effect takes longer to emerge, usually within a few days of system operation. But this is largely impacted by bees activity: If bees activity is high, the effect kicks within the first 1 or 2 hours of operation. If activity is low, the system can work for days before the effect kicks in and causes high desktop latency. This is different from the previous two observations which were largely based on low bees activity.

Zygo · 2024-04-22T00:29:23Z

Lately I've been running with vm.swappiness=0 and BEES_MAX_EXTENT_REF_COUNT set to 9999. With those settings there's no swapping at all.

kakra · 2024-04-22T10:28:08Z

Is BEES_MAX_EXTENT_REF_COUNT and env var, or is it a defined constant in the source code?

kakra · 2024-05-02T21:30:20Z

Lately I've been running with vm.swappiness=0 and BEES_MAX_EXTENT_REF_COUNT set to 9999. With those settings there's no swapping at all.

I've created a patch (9999 max ref count) and reverted my previous patch (64kB max memory, as suggested previously in Jul 2023), and running that for around 48 hours now shows really great results. I'll let it run for some more time to see how it behaves after different workloads applied to the system.

KeinNiemand · 2024-10-16T18:00:19Z

is there a way to know if BEES_MAX_EXTENT_REF_COUNT is to low?
Going from the default of 699049 to just 9999 seems like a huge decrease.

kakra · 2024-10-16T20:57:45Z

I wonder if this is related to zswap using zbud or zfold3 allocators... Due to the latest LTS kernel update, I've switched to zsmalloc, and memory stats look much better since. But I couldn't collect a lot of data up to now.

Zygo · 2024-10-17T01:44:24Z

is there a way to know if BEES_MAX_EXTENT_REF_COUNT is to low?

See #260 (comment) - if the log message says 9999 references have been reached, then there may be some benefit to a higher max ref count. If the log message never reaches 9999, or if it does so on only a few extents in the entire filesystem, then there will be no significant improvement when the limit is raised.

Note that benefit will typically be very small even when it exists, since the maximum space savings after deduping 9999 copies of the data is 0.01% of the logical data size, and each additional reference adds more work for any btrfs operation that modifies the extent later on (both within bees and other applications). Each additional reference adds a fraction of a millisecond CPU time, but after 10,000 of those, they add up to whole seconds.

The exceptional case is where you have exactly 10000 to 19999 copies of everything in the filesystem: these would be deduped to two extents with a maximum of 9999 references. Adding one to BEES_MAX_EXTENT_REF_COUNT would result in a 50% total saving in that case, as the last two copies of the data are merged into one.

Going from the default of 699049 to just 9999 seems like a huge decrease.

699049 to 9999 is a huge decrease, but the increase from 2730 to 699049 was even larger. 2730 to 9999 is a reasonably-sized increase.

lilydjwg · 2024-10-17T02:41:32Z

I wonder if this is related to zswap using zbud or zfold3 allocators... Due to the latest LTS kernel update, I've switched to zsmalloc, and memory stats look much better since. But I couldn't collect a lot of data up to now.

I have a server running zswap with zsmalloc and bees still causes a lot of unused memory until I applied the buf_size patch. I can't say if it is better with zsmalloc or not, however. Probably yes because I've seen some improvement after a kernel update (among other packages) in the past.

kakra · 2024-10-17T10:43:10Z

@lilydjwg Well, it is clearly visible when I deployed the change to zsmalloc:

But this doesn't necessarily mean that bees is the only process causing it. It could be inside the kernel from btrfs page cache handling, or other services like databases, PHP, redis etc...

I'm currently testing my home PC and office workstation with transparent hugepages reenabled (mode "always"). It looks mostly good so far. But some more time is needed to evaluate behavior under various different workloads and memory pressure situations.

On all systems, the buf_size patch is deployed. Also, I'm guessing that we might get much better behavior with next LTS kernel 6.12 because it seems to have gained multi-size THP through folios since LTS 6.6.

KeinNiemand · 2024-10-17T11:08:38Z

The weird thing for me is that I had low cache usage and high free ram even tough fragmentation didn't seem that bad
this was my /proc/buddyinfo while I was having the problems
this is /proc/buddyinfo
Node 0, zone DMA 0 0 0 0 0 0 0 0 1 2 2
Node 0, zone DMA32 472 424 733 608 351 182 86 45 38 1 0
Node 0, zone Normal 3658 8084 10386 7519 5560 4188 3314 2902 2726 5 0

Order 8 is still pretty high and order 9 is above 0. This doesn't seem that bad fragmentation wise yet the system was using less ony about 500MB of ram for cache and had massive amounts of free ram.

Lowering BEES_MAX_EXTENT_REF_COUNT to 9999 (in bees code) seemingly fixed things the system is now using almost all it's free ram as cache like it should.

What do you mean with buf_size patch? Is that just lowering BEES_MAX_EXTENT_REF_COUNT inside bees code like I did or does it do something else too?

lilydjwg · 2024-10-17T11:11:07Z

What do you mean with buf_size patch? Is that just lowering BEES_MAX_EXTENT_REF_COUNT inside bees code like I did or does it do something else too?

I mean this change #260 (comment). I didn't change BEES_MAX_EXTENT_REF_COUNT.

kakra · 2024-10-17T11:26:25Z

My Gentoo package currently uses these patches (the two recent ones):
https://github.com/gentoo/gentoo/tree/master/sys-fs/bees/files

It works fine with those. Maybe logging pressure without my logging patch is still causing issues for some people?

lilydjwg changed the title ~~bees "occupies" free memory~~ bees causes memory fragmentation and low cached memory Jul 9, 2023

Zygo mentioned this issue Jul 17, 2023

regression on linux 6.3.1: 'vmalloc error' during crawl #257

Open

kakra mentioned this issue Apr 21, 2024

bees seemingly cannot catch up with snapper snapshot creation #268

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bees causes memory fragmentation and low cached memory #260

bees causes memory fragmentation and low cached memory #260

lilydjwg commented Jul 8, 2023

kakra commented Jul 8, 2023

lilydjwg commented Jul 9, 2023

Zygo commented Jul 9, 2023

lilydjwg commented Jul 9, 2023 •

edited

Loading

kakra commented Jul 9, 2023

Zygo commented Jul 9, 2023

lilydjwg commented Jul 9, 2023

kakra commented Jul 9, 2023 •

edited

Loading

Zygo commented Jul 9, 2023

kakra commented Jul 11, 2023

Zygo commented Jul 13, 2023

kakra commented Jul 13, 2023

Zygo commented Jul 13, 2023

kakra commented Jul 13, 2023

Zygo commented Jul 13, 2023 •

edited

Loading

kakra commented Jul 13, 2023

Zygo commented Jul 17, 2023

Zygo commented Jul 17, 2023

kakra commented Jul 17, 2023

Zygo commented Jul 18, 2023

kakra commented Apr 21, 2024

Zygo commented Apr 22, 2024

kakra commented Apr 22, 2024

kakra commented May 2, 2024

KeinNiemand commented Oct 16, 2024 •

edited

Loading

kakra commented Oct 16, 2024

Zygo commented Oct 17, 2024

lilydjwg commented Oct 17, 2024

kakra commented Oct 17, 2024

KeinNiemand commented Oct 17, 2024

lilydjwg commented Oct 17, 2024

kakra commented Oct 17, 2024 •

edited

Loading

bees causes memory fragmentation and low cached memory #260

bees causes memory fragmentation and low cached memory #260

Comments

lilydjwg commented Jul 8, 2023

kakra commented Jul 8, 2023

lilydjwg commented Jul 9, 2023

Zygo commented Jul 9, 2023

lilydjwg commented Jul 9, 2023 • edited Loading

kakra commented Jul 9, 2023

Zygo commented Jul 9, 2023

lilydjwg commented Jul 9, 2023

kakra commented Jul 9, 2023 • edited Loading

Zygo commented Jul 9, 2023

kakra commented Jul 11, 2023

Zygo commented Jul 13, 2023

kakra commented Jul 13, 2023

Zygo commented Jul 13, 2023

kakra commented Jul 13, 2023

Zygo commented Jul 13, 2023 • edited Loading

kakra commented Jul 13, 2023

Zygo commented Jul 17, 2023

Zygo commented Jul 17, 2023

kakra commented Jul 17, 2023

Zygo commented Jul 18, 2023

kakra commented Apr 21, 2024

Zygo commented Apr 22, 2024

kakra commented Apr 22, 2024

kakra commented May 2, 2024

KeinNiemand commented Oct 16, 2024 • edited Loading

kakra commented Oct 16, 2024

Zygo commented Oct 17, 2024

lilydjwg commented Oct 17, 2024

kakra commented Oct 17, 2024

KeinNiemand commented Oct 17, 2024

lilydjwg commented Oct 17, 2024

kakra commented Oct 17, 2024 • edited Loading

lilydjwg commented Jul 9, 2023 •

edited

Loading

kakra commented Jul 9, 2023 •

edited

Loading

Zygo commented Jul 13, 2023 •

edited

Loading

KeinNiemand commented Oct 16, 2024 •

edited

Loading

kakra commented Oct 17, 2024 •

edited

Loading