bees breaks existing reflinks? #270

Open
jaens opened this issue Nov 13, 2023 · 4 comments

@jaens

jaens commented Nov 13, 2023

While running bees (version 0.9.3), I noticed space usage slightly increasing on the disk. From the log, I think I saw it trying to "deduplicate" files that were already "whole-file" reflinked (via e.g. cp --reflink).

Does bees preserve existing reflinks?

@kakra
Contributor

kakra commented Nov 13, 2023

Bees will break existing reflinks if it sees other duplicate chains of blocks matching the ones just found. But it will eventually clean up the unreachable extents after some time; just leave it running long enough. This behavior is probably already described in the documentation somewhere and is expected, because bees works very differently from other deduplicators.

Also, in your situation, bees will probably add more metadata and thus also increase allocation somewhat.
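
For what it's worth, a rough way to check whether two files still share physical extents is to compare the extent lists that filefrag -v (from e2fsprogs) prints for them. The sketch below only counts extents whose physical start and length match exactly, so partially shared extents are missed; treat it as a quick heuristic, not an authoritative answer.

```python
#!/usr/bin/env python3
"""Rough reflink-sharing check: compare the physical extent lists that
`filefrag -v` (e2fsprogs) prints for two files. Exact-match comparison
only, so partially shared extents will not be counted."""
import re
import subprocess
import sys

def physical_extents(path):
    """Return a set of (physical_start_block, length_blocks) tuples."""
    out = subprocess.run(["filefrag", "-v", path],
                         capture_output=True, text=True, check=True).stdout
    extents = set()
    for line in out.splitlines():
        # data rows look like: "   0:   0..  255:  34816..  35071:  256: ... flags"
        m = re.match(r"\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*\d+:\s*(\d+):", line)
        if m:
            extents.add((int(m.group(1)), int(m.group(2))))
    return extents

if __name__ == "__main__":
    a, b = sys.argv[1:3]
    shared = physical_extents(a) & physical_extents(b)
    print(f"{len(shared)} identical physical extent(s) between {a} and {b}")
```

Two freshly cp --reflink'ed copies should report all of their extents as shared; after bees has rewritten one of them, the count can drop.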

@jaens
Author

jaens commented Nov 13, 2023

Thank you for your reply. There is some documentation regarding snapshot gotchas, which I guess might also apply to reflinks?

So, based on what you are saying:

  1. If the reflink content is not duplicated elsewhere, bees will leave it alone? (although this, unfortunately, seems unlikely for my dataset of medium-to-large executable files etc. – it will generally find a page or two in common with some other random file...)
  2. bees, I think, is not perfect at finding duplicated extents (due to e.g. hash table failures or "too many duplicates"?), so technically, if it does try to deduplicate such a reflinked file, the final result might be worse, since it does not dedupe 100% of the file?

@kakra
Contributor

kakra commented Nov 13, 2023

If your dataset in snapshots is already highly deduplicated, you can try stopping bees, opening beescrawl.dat, and setting min_transid to the value of max_transid on each line. Then it should leave the already existing extents alone and not scan them. But then you lose the opportunity to deduplicate new content against existing data in the snapshots. See also here, where I saw a similar effect due to loss of beescrawl.dat: #268
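
In case it helps, below is a minimal sketch of that edit. It assumes each line of beescrawl.dat is a flat sequence of space-separated key/value pairs that includes min_transid and max_transid fields; verify that against your own file before running it, keep a backup, and only touch the file while bees is stopped.

```python
#!/usr/bin/env python3
"""Sketch: set min_transid = max_transid on every line of beescrawl.dat.
Assumes each line is a flat sequence of space-separated `key value`
pairs -- verify against your own beescrawl.dat first.
Run only while bees is stopped; a .bak copy is kept just in case."""
import shutil
import sys

def bump_min_transid(path):
    shutil.copy2(path, path + ".bak")                 # keep a backup
    rewritten = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            pairs = dict(zip(fields[0::2], fields[1::2]))
            if "min_transid" in pairs and "max_transid" in pairs:
                pairs["min_transid"] = pairs["max_transid"]
                line = " ".join(f"{k} {v}" for k, v in pairs.items()) + "\n"
            rewritten.append(line)
    with open(path, "w") as f:
        f.writelines(rewritten)

if __name__ == "__main__":
    bump_min_transid(sys.argv[1])   # path to beescrawl.dat in your $BEESHOME
```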

Also, if the hash table fills up, bees should automatically start ignoring very small duplicate blocks, so it should avoid creating 4k or 8k reflinks as long as you don't make the hash table too big. The docs have a table of typical extent size vs. hash table size for a given amount of unique data. Trying to dedupe every single small extent is bad for performance, so you should not over-size your hash table.
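
To make the trade-off concrete, a back-of-the-envelope calculation: the scaling in the docs' table works out to roughly 16 bytes of hash table per tracked block, so the table size you need grows linearly as your target extent size shrinks. The per-entry figure below is an assumption derived from that scaling, so double-check it against the docs.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope hash table sizing. Assumes ~16 bytes of hash
table per tracked block, which matches the scaling of the table in the
bees docs (e.g. 1 TiB unique data / 4 KiB extents -> 4 GiB table)."""

ENTRY_BYTES = 16            # assumed size of one hash table entry

def hash_table_bytes(unique_data_bytes, avg_extent_bytes):
    """Hash table size needed for one entry per extent of the given
    average size across the unique data."""
    return unique_data_bytes // avg_extent_bytes * ENTRY_BYTES

KiB, MiB, GiB, TiB = 1 << 10, 1 << 20, 1 << 30, 1 << 40

# 1 TiB of unique data, aiming for ~64 KiB average dedupe extents:
print(hash_table_bytes(1 * TiB, 64 * KiB) // MiB, "MiB")   # 256 MiB
# Chasing 4 KiB extents needs 16x more table (and much more work):
print(hash_table_bytes(1 * TiB, 4 * KiB) // GiB, "GiB")    # 4 GiB
```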

@Zygo
Owner

Zygo commented Jun 28, 2024

bees processes one reflink at a time, except in the case where it has to split an extent into duplicate and non-duplicate parts. In that case, the non-duplicate portion is moved to a new extent, and all reflinks that refer to the non-duplicate blocks are replaced at once, but the reflinks to the duplicate blocks are handled one reflink at a time using hash table matching. This means that the non-duplicate portion of the data occupies additional space until the last duplicate blocks are removed from all reflinks.
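
A toy model of that sequence (purely illustrative, not the actual bees code) shows why usage temporarily goes up: the original extent is reclaimed only when the last reflink into it is rewritten.

```python
#!/usr/bin/env python3
"""Toy model of the sequence described above -- illustrative only, not
the bees implementation. An extent is freed only when its last reflink
goes away, so splitting out the unique half temporarily costs space."""

class ToyFs:
    def __init__(self):
        self.extents = {}                      # extent id -> [size, refcount]
        self.next_id = 0

    def alloc(self, size, refs):
        self.extents[self.next_id] = [size, refs]
        self.next_id += 1
        return self.next_id - 1

    def drop_ref(self, eid):
        self.extents[eid][1] -= 1
        if self.extents[eid][1] == 0:
            del self.extents[eid]              # space reclaimed only now

    def used(self):
        return sum(size for size, _ in self.extents.values())

fs = ToyFs()
original = fs.alloc(128 * 1024, refs=3)        # 3 files reflink a 128K extent
existing = fs.alloc(64 * 1024, refs=1)         # older copy of the duplicate half

# Split: the unique half is copied out and all 3 reflinks to it are
# switched over at once -- usage goes *up* at this point.
unique_copy = fs.alloc(64 * 1024, refs=3)
print("after split:", fs.used() // 1024, "KiB")        # 256 KiB (was 192 KiB)

# Dedupe: reflinks to the duplicate half are rewritten one at a time
# against `existing`; the original 128K extent is freed only on the last.
for _ in range(3):
    fs.drop_ref(original)
    print("used:", fs.used() // 1024, "KiB")           # 256, 256, 128 KiB
```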

If the hash table evicts all hashes of the duplicate portion of the data before the last reflink is removed, then both the original extent and a temporary copy of its unique portion will persist in the filesystem. That consumes additional data space.

bees also tends to collect unreachable blocks in extents. Extents with unreachable blocks tend to be older than extents with all reachable blocks, and bees always keeps the first extent it encountered when it finds a duplicate. Technically this doesn't allocate any new data space, but it can make it harder to release blocks from deleted files that contain "popular" data.
