bees breaks existing reflinks? #270
Bees will break existing reflinks if it sees other duplicate chains of blocks matching the ones just found, but it will eventually clean up the unreachable extents; just leave it running long enough. This behavior is probably already described in the documentation somewhere and is expected, because bees works very differently from other deduplicators. Also, in your situation, bees will probably add more metadata and thus also increase allocation somewhat.
Thank you for your reply. There is some documentation regarding snapshot gotchas, which I guess might also apply to reflinks? So, based on what you are saying:
If your dataset in snapshots is already highly deduplicated, you can try stopping bees. […] Also, if the hash table fills up, bees should automatically start ignoring very small duplicate blocks, so it should avoid creating 4k or 8k reflinks if you don't size the hash table too big. The docs have a table of typical extent size vs. hash table size per unique dataset size. Trying to dedup every single small extent is bad for performance, so you should not over-size your hash table.
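As a back-of-the-envelope illustration of that sizing table, here is a small calculator. It assumes one fixed-size hash table entry per extent that bees can track; the 16-byte entry size is an assumption inferred from the ratios in the published table (e.g. 1 TiB of unique data at a 16 KiB average extent size works out to roughly a 1 GiB hash table), not a value confirmed here.

```python
# Rough bees hash table sizing sketch. Assumption: the hash table holds
# one ~16-byte entry per tracked extent; this is inferred from the ratios
# in the bees docs, not taken from the bees source.

def hash_table_bytes(unique_data_bytes: int, avg_extent_bytes: int,
                     entry_bytes: int = 16) -> int:
    """Estimate the hash table size needed to track every extent once."""
    extents = unique_data_bytes // avg_extent_bytes
    return extents * entry_bytes

TiB = 1024 ** 4
KiB = 1024

# 1 TiB of unique data at a 16 KiB average extent size
print(hash_table_bytes(1 * TiB, 16 * KiB) // (1024 ** 2), "MiB")  # -> 1024 MiB
```

Halving the hash table size doubles the smallest extent size bees can reliably match, which is why an under-sized (rather than over-sized) table naturally skips the tiny 4k/8k duplicates.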
bees processes one reflink at a time, except in the case where it has to split an extent into duplicate and non-duplicate parts. In that case, the non-duplicate portion is moved to a new extent, and all reflinks that refer to the non-duplicate blocks are replaced at once, but the reflinks to the duplicate blocks are handled one reflink at a time using hash table matching. This means that the non-duplicate portion of the data occupies additional space until the last duplicate blocks are removed from all reflinks. If the hash table evicts all hashes of the duplicate portion of the data before the last reflink is removed, then both the original extent and a temporary copy of its unique portion will persist in the filesystem. That consumes additional data space.

bees also tends to collect unreachable blocks in extents. Extents with unreachable blocks tend to be older than extents with all reachable blocks, and bees always keeps the first extent it encountered when it finds a duplicate. Technically this doesn't allocate any new data space, but it can make it harder to release blocks from deleted files that contain "popular" data.
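The splitting step described above can be sketched as follows. This is not bees' actual code; it is a toy model assuming per-block hashes and a simple membership test against the hash table, grouping contiguous blocks into duplicate and non-duplicate runs so the unique run could be copied out as a new extent.

```python
# Toy model of partitioning one extent into duplicate / non-duplicate runs.
# Assumption: each block has a precomputed hash, and "duplicate" simply
# means the hash is already present in the hash table.

def partition_extent(block_hashes, hash_table):
    """Return a list of (is_duplicate, [hashes...]) runs for one extent."""
    runs = []
    for h in block_hashes:
        dup = h in hash_table
        if runs and runs[-1][0] == dup:
            runs[-1][1].append(h)      # extend the current run
        else:
            runs.append((dup, [h]))    # start a new run
    return runs

table = {"a", "b", "e"}                # hashes bees already knows about
extent = ["a", "b", "c", "d", "e"]
print(partition_extent(extent, table))
# -> [(True, ['a', 'b']), (False, ['c', 'd']), (True, ['e'])]
```

In this model the `(False, ['c', 'd'])` run is what gets copied to a new extent, while the `True` runs are rewritten one reflink at a time; if `'a'`, `'b'`, and `'e'` were evicted from the table before that finishes, both the old extent and the new copy would linger, matching the space growth described above.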
While running bees (version 0.9.3), I noticed space usage slightly increasing on the disk. From the log, I think I saw it trying to "deduplicate" files that were already whole-file reflinked (via e.g. `cp --reflink`). Does bees preserve existing reflinks?