File deduplication #6332

Hocuri · 2024-12-11T15:04:37Z

When receiving messages, blobs will be deduplicated with the new function create_and_deduplicate_from_bytes(). For sending files, this adds a new function set_file_and_deduplicate() instead of deduplicating by default.
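The general idea of deduplicating by content can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper `store_deduplicated` is hypothetical, and `DefaultHasher` from the standard library is only a stand-in so that the example runs without extra crates. It is not collision-resistant, so a real implementation needs a cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not the PR's code): stores `data` in `blobdir`
/// under a name derived from its content, so identical data shares one file.
fn store_deduplicated(blobdir: &Path, data: &[u8]) -> io::Result<PathBuf> {
    fs::create_dir_all(blobdir)?;

    // Stand-in hash for illustration only: `DefaultHasher` is not
    // collision-resistant; a real implementation needs a cryptographic hash.
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    let path = blobdir.join(format!("{:016x}", hasher.finish()));

    // Skip the write when a file with identical content already exists.
    let already_stored = matches!(fs::read(&path), Ok(existing) if existing == data);
    if !already_stored {
        fs::write(&path, data)?;
    }
    Ok(path)
}

fn main() -> io::Result<()> {
    let blobdir = std::env::temp_dir().join(format!("dedup-sketch-{}", std::process::id()));
    let first = store_deduplicated(&blobdir, b"same bytes")?;
    let second = store_deduplicated(&blobdir, b"same bytes")?;
    let other = store_deduplicated(&blobdir, b"other bytes")?;
    assert_eq!(first, second); // identical content maps to one file
    assert_ne!(first, other); // different content gets its own file
    fs::remove_dir_all(&blobdir)?;
    Ok(())
}
```

Because the file name depends only on the content, storing the same attachment twice yields the same path, and the second call does not rewrite the file.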

This is for #6265; read the issue description there for more details.

TODO:

- Set files as read-only
- Don't do a write when the file is already identical
- The first 32 characters or so of the 64-character hash are enough. I calculated that if 10 billion people (i.e. all of humanity) use DC, each of them has 200k distinct blob files (I have 4k in my day-to-day account), and we used 20 characters, then the expected number of name collisions would be ~0.0002 (and the probability that there is at least one name collision is lower than that) ¹. I added 12 more characters to be on the super safe side, but this wouldn't be necessary, and I could also make it 20 instead of 32.
  - Not 100% sure whether that's necessary at all; it would mainly be necessary if we might hit a length limit on some file systems (the blobdir is usually something like accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs (53 chars), plus 64 chars for the filename would be 117).
- "touch" the files to prevent them from being deleted
- TODOs in the code
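The hash-length estimate above can be reproduced with a short calculation. This sketch rests on two assumptions that the description does not state explicitly: the hash is hex-encoded (4 bits per character), and a collision only matters between two files in the same account's blobdir. The function name `expected_collisions` is hypothetical.

```rust
/// Expected number of file-name collisions, summed over all users.
/// Assumes hex-encoded names and that only files within the same
/// account can collide (both are assumptions of this sketch).
fn expected_collisions(users: f64, files_per_user: f64, hex_chars: i32) -> f64 {
    // Number of distinct file pairs inside one account.
    let pairs_per_user = files_per_user * (files_per_user - 1.0) / 2.0;
    // Number of possible names: 16 values per hex character.
    let name_space = 16f64.powi(hex_chars);
    users * pairs_per_user / name_space
}

fn main() {
    let with_20 = expected_collisions(1e10, 2e5, 20);
    let with_32 = expected_collisions(1e10, 2e5, 32);
    println!("20 hex chars: {with_20:e}");
    println!("32 hex chars: {with_32:e}");
}
```

Under these assumptions, 20 characters give about 1.65e-4 expected collisions, consistent with the ~0.0002 figure, and 32 characters push the value many orders of magnitude lower.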

For later PRs:

- Replace BlobObject::create(…) with BlobObject::create_and_deduplicate(…) in order to deduplicate every time core creates a file
- Modify JsonRPC to deduplicate blob files
- Possibly rename BlobObject.name to BlobObject.file in order to prevent confusion (because "name" usually means the user-visible name, not the name of the file on disk).

¹ Calculated with both https://printfn.github.io/fend/ and https://www.geogebra.org/calculator, both of which came to the same result (1, 2).

link2xt · 2025-01-18T19:47:23Z

src/blob.rs

@@ -74,7 +75,8 @@ impl<'a> BlobObject<'a> {
        Ok(blob)
    }

-    // Creates a new file, returning a tuple of the name and the handle.
+    /// Creates a new file, returning a tuple of the name and the handle.
+    /// This avoids race conditions when creating multiple files with the same name.


What is "This" here? The function? "Returning ... the handle"? Something else?

Probably "this function" because of the comment added inside.

Hocuri (Jan 18, 2025): Yes, but actually, the comment added inside is probably enough, so I'll just remove this one.

link2xt · 2025-01-18T19:52:51Z

src/blob.rs

+        // from an async context thanks to `block_in_place()`.
+        // Tokio's "async" I/O functions are also just thin wrappers around the blocking I/O syscalls,
+        // so we are doing essentially the same here.
+        task::block_in_place(|| {


Shouldn't this happen on the caller side? It's strange for a blocking function to depend on tokio. E.g. if we call this from another blocking function that is already running on a dedicated thread, this block_in_place is not needed.

link2xt (Jan 18, 2025): It's still fine not to care about this in tests; nextest runs them in separate processes anyway.

Hocuri (Jan 18, 2025): Not sure. It's easy to forget it at the call site, since it's not visible from the function signature that the call needs to be wrapped in block_in_place(). And if we call this from another blocking function, block_in_place is a no-op:

"calling the function outside a runtime is allowed. In this case, block_in_place just calls the provided closure normally."

OTOH, I do agree that it doesn't "look" nice that, with the newest changes, block_in_place() is called twice when you call create_and_deduplicate_from_bytes().

link2xt (Jan 19, 2025): This is not what other functions do, so you need to think about it when calling sync functions from async functions anyway.

link2xt · 2025-01-18T19:53:21Z

src/blob.rs

+        context: &'a Context,
+        data: &[u8],
+    ) -> Result<BlobObject<'a>> {
+        task::block_in_place(|| {


Same as create_and_deduplicate, I think this should be on the caller side.

Hocuri force-pushed the hoc/file-deduplication branch from abbaa2e to 3cb9a66 on December 12, 2024 21:57

Hocuri mentioned this pull request Dec 19, 2024

[Tracking Issue] Deduplicate blob files #6265 (Open, 6 tasks)

Hocuri force-pushed the hoc/file-deduplication branch from 0469519 to aa9dc2b on January 6, 2025 18:18

Hocuri mentioned this pull request Jan 6, 2025

fix: Use getFilename() instead of the actual filename on disk deltachat/deltachat-android#3521 (Merged)

Hocuri added 23 commits January 15, 2025 13:59

File deduplication

8518d5a

--wip-- [skip ci]

8a59b60

Fix the tests (some of the fixes may need a new test)

9752d26

Fix some more tests, I'll need to remove some println statements

0f4affd

Adapt more tests and fix most of them

58d32fe

test: Assume that the avatar name also changes

bc7436c

Adapt summary.rs tests

2d3a44e

Adapt some more tests, they all pass

4977bf0

Adapt src/receive_imf/tests.rs, fails because there is no deduplication on reception

4174943

Deduplicate on message reception, fix all tests

ad1e6e2

Small tweaks, clippy

93a8cd0

Set deduplicated files as read-only on the file system

759bcae

Set the file modification time so that it's not deleted during housekeeping

d8ad0e9

Deduplicate the code writing a file

2807ddb

Use only the first 32 characters of the hash

dc53384

Keep the code repairing Param::Filename extensions for now

3009d6a

I'm not sure whether we still need it and the tests pass without it, but also I don't want to introduce a new bug by changing stuff, and it's just 8 lines, anyway.

Some renames, leave set_file_from_bytes() being pub for now

3ff4645

Create blob dir if it doesn't exist

a8feb15

Document and expose via the C ffi

060b8d8

Use the actual file's name if name is None

c312109

Clippy

23afafa

clippy: Make functions that are not async not be async

1a5ed9a

Fix mistake I made when rebasing

ab4a882

Hocuri force-pushed the hoc/file-deduplication branch from f492186 to ab4a882 on January 15, 2025 13:52

Hocuri added 2 commits January 16, 2025 15:27

Documentation

03a9cc0

create_and_deduplicate_from_bytes: check if the file content is still correct when deduplicating

1d387ac

Documentation

f594b6a

This was referenced Jan 16, 2025

Adapt to file deduplication deltachat/deltachat-desktop#4498 (Open)

[WIP] File deduplication, Android part deltachat/deltachat-android#3513 (Open)

Adapt to file deduplication deltachat/deltachat-ios#2524 (Open)

Hocuri changed the title from "[WIP] File deduplication" to "File deduplication" on Jan 17, 2025

Hocuri requested a review from link2xt January 17, 2025 14:01

Hocuri added 3 commits January 17, 2025 15:16

clippy

294a946

Fix unit tests on Windows

ce59b74

Fix python tests

4cb0cff

Hocuri commented Jan 18, 2025

link2xt reviewed Jan 18, 2025

Hocuri and others added 5 commits January 18, 2025 23:18

Update deltachat-ffi/src/lib.rs

67e5c92

Co-authored-by: l <link2xt@testrun.org>

Update deltachat-ffi/src/lib.rs

f05e0f2

Co-authored-by: l <link2xt@testrun.org>

Update src/blob.rs

db31302

Co-authored-by: l <link2xt@testrun.org>

Make create_and_deduplicate_from_bytes() a thin wrapper around create_and_deduplicate()

09a67ce

Remove redundant & hard to understand comment

1ba4d93
