bees breaks existing reflinks? #270
Bees will break existing reflinks if it sees other duplicate chains of blocks matching the ones just found, but it will eventually clean up the unreachable extents; just leave it running long enough. This behavior is probably already described in the documentation somewhere and is expected, because bees works very differently from other deduplicators. Also, in your situation, bees will probably add more metadata and thus also increase allocation somewhat.
Thank you for your reply. There is some documentation regarding snapshot gotchas, which I guess might also apply to reflinks? So, based on what you are saying:
If your dataset in snapshots is already highly deduplicated, you can try stopping bees. […] Also, if the hash table fills up, bees should automatically start ignoring very small duplicate blocks, so it should avoid creating 4k or 8k reflinks if you don't size the hash table too big. The docs have a table of typical extent size vs. hash table size per unique dataset size. Trying to dedup every single small extent is bad for performance, so you should not over-size your hash table.
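As a back-of-the-envelope illustration of that sizing table, here is a small calculator. It assumes one fixed-size hash table entry per extent that bees can track; the 16-byte entry size is an assumption inferred from the ratios in the published table (e.g. 1 TiB of unique data at a 16 KiB average extent size works out to roughly a 1 GiB hash table), not a value confirmed here.

```python
# Rough bees hash table sizing sketch. Assumption: the hash table holds
# one ~16-byte entry per tracked extent; this is inferred from the ratios
# in the bees docs, not taken from the bees source.

def hash_table_bytes(unique_data_bytes: int, avg_extent_bytes: int,
                     entry_bytes: int = 16) -> int:
    """Estimate the hash table size needed to track every extent once."""
    extents = unique_data_bytes // avg_extent_bytes
    return extents * entry_bytes

TiB = 1024 ** 4
KiB = 1024

# 1 TiB of unique data at a 16 KiB average extent size
print(hash_table_bytes(1 * TiB, 16 * KiB) // (1024 ** 2), "MiB")  # -> 1024 MiB
```

Halving the hash table size doubles the smallest extent size bees can reliably match, which is why an under-sized (rather than over-sized) table naturally skips the tiny 4k/8k duplicates.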
bees processes one reflink at a time, except in the case where it has to split an extent into duplicate and non-duplicate parts. In that case, the non-duplicate portion is moved to a new extent, and all reflinks that refer to the non-duplicate blocks are replaced at once, but the reflinks to the duplicate blocks are handled one reflink at a time using hash table matching. This means that the non-duplicate portion of the data occupies additional space until the last duplicate blocks are removed from all reflinks. If the hash table evicts all hashes of the duplicate portion of the data before the last reflink is removed, then both the original extent and a temporary copy of its unique portion will persist in the filesystem. That consumes additional data space.

bees also tends to collect unreachable blocks in extents. Extents with unreachable blocks tend to be older than extents with all reachable blocks, and bees always keeps the first extent it encountered when it finds a duplicate. Technically this doesn't allocate any new data space, but it can make it harder to release blocks from deleted files that contain "popular" data.
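The splitting step described above can be sketched as follows. This is not bees' actual code; it is a toy model assuming per-block hashes and a simple membership test against the hash table, grouping contiguous blocks into duplicate and non-duplicate runs so the unique run could be copied out as a new extent.

```python
# Toy model of partitioning one extent into duplicate / non-duplicate runs.
# Assumption: each block has a precomputed hash, and "duplicate" simply
# means the hash is already present in the hash table.

def partition_extent(block_hashes, hash_table):
    """Return a list of (is_duplicate, [hashes...]) runs for one extent."""
    runs = []
    for h in block_hashes:
        dup = h in hash_table
        if runs and runs[-1][0] == dup:
            runs[-1][1].append(h)      # extend the current run
        else:
            runs.append((dup, [h]))    # start a new run
    return runs

table = {"a", "b", "e"}                # hashes bees already knows about
extent = ["a", "b", "c", "d", "e"]
print(partition_extent(extent, table))
# -> [(True, ['a', 'b']), (False, ['c', 'd']), (True, ['e'])]
```

In this model the `(False, ['c', 'd'])` run is what gets copied to a new extent, while the `True` runs are rewritten one reflink at a time; if `'a'`, `'b'`, and `'e'` were evicted from the table before that finishes, both the old extent and the new copy would linger, matching the space growth described above.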
While running bees (version 0.9.3), I noticed space usage slightly increasing on the disk. From the log, I think I saw it trying to "deduplicate" files that were already whole-file reflinked (via e.g. `cp --reflink`). Does bees preserve existing reflinks?