Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File deduplication #6332

Open
wants to merge 34 commits into
base: main
Choose a base branch
from
Open

File deduplication #6332

wants to merge 34 commits into from

Conversation

Hocuri
Copy link
Collaborator

@Hocuri Hocuri commented Dec 11, 2024

When receiving messages, blobs will be deduplicated with the new function create_and_deduplicate_from_bytes(). For sending files, this adds a new function set_file_and_deduplicate() instead of deduplicating by default.

This is for #6265; read the issue description there for more details.

TODO:

  • Set files as read-only
  • Don't do a write when the file is already identical
  • The first 32 chars or so of the 64-character hash are enough. I calculated that if 10b people (i.e. all of humanity) use DC, and each of them has 200k distinct blob files (I have 4k in my day-to-day account), and we used 20 chars, then the expected value for the number of name collisions would be ~0.0002 (and the probability that there is a least one name collision is lower than that) 1. I added 12 more characters to be on the super safe side, but this wouldn't be necessary and I could also make it 20 instead of 32.
    • Not 100% sure whether that's necessary at all - it would mainly be necessary if we might hit a length limit on some file systems (the blobdir is usually sth like accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs (53 chars), plus 64 chars for the filename would be 117).
  • "touch" the files to prevent them from being deleted
  • TODOs in the code

For later PRs:

  • Replace BlobObject::create(…) with BlobObject::create_and_deduplicate(…) in order to deduplicate everytime core creates a file
  • Modify JsonRPC to deduplicate blob files
  • Possibly rename BlobObject.name to BlobObject.file in order to prevent confusion (because name usually means "user-visible-name", not "name of the file on disk").

Footnotes

  1. Calculated with both https://printfn.github.io/fend/ and https://www.geogebra.org/calculator, both of which came to the same result (1,
    2)

@Hocuri Hocuri force-pushed the hoc/file-deduplication branch from f492186 to ab4a882 Compare January 15, 2025 13:52
src/blob.rs Show resolved Hide resolved
deltachat-ffi/src/lib.rs Outdated Show resolved Hide resolved
deltachat-ffi/src/lib.rs Outdated Show resolved Hide resolved
src/blob.rs Outdated
@@ -74,7 +75,8 @@ impl<'a> BlobObject<'a> {
Ok(blob)
}

// Creates a new file, returning a tuple of the name and the handle.
/// Creates a new file, returning a tuple of the name and the handle.
/// This avoids race conditions when creating multiple files with the same name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "This" here? The function? "Returning ... the handle"? Something else?

Probably "this function" because of the comment added inside.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but actually, probably the comment added inside is enough, I'll just remove this one.

// from an async context thanks to `block_in_place()`.
// Tokio's "async" I/O functions are also just thin wrappers around the blocking I/O syscalls,
// so we are doing essentially the same here.
task::block_in_place(|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this happen on the caller side? It's strange for a blocking function to depend on tokio. E.g. if we call this from another blocking function that is already running on a dedicated thread, this block_in_place is not needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still fine not to care about this in tests, they are running in separate processes by nextest anyway.

Copy link
Collaborator Author

@Hocuri Hocuri Jan 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, it's easy to forget it at the caller site since it's not visible from the function signature that it needs to be wrapped in block_in_place(). And if we call this from another blocking function, block_in_place is a no-op:

calling the function outside a runtime is allowed. In this case, block_in_place just calls the provided closure normally.

OTOH, I do agree that it doesn't "look" nice that with the newest changes, block_in_place() is called twice when you call create_and_deduplicate_from_bytes().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not what other functions do, so you need to think about it when calling sync functions from async functions anyway.

context: &'a Context,
data: &[u8],
) -> Result<BlobObject<'a>> {
task::block_in_place(|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as create_and_deduplicate, I think this should be on the caller side.

src/blob.rs Outdated Show resolved Hide resolved
src/blob.rs Outdated Show resolved Hide resolved
Hocuri and others added 5 commits January 18, 2025 23:18
Co-authored-by: l <link2xt@testrun.org>
Co-authored-by: l <link2xt@testrun.org>
Co-authored-by: l <link2xt@testrun.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants