Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mathemtical aspects #1

Open
aead opened this issue Jan 8, 2018 · 4 comments
Open

mathemtical aspects #1

aead opened this issue Jan 8, 2018 · 4 comments

Comments

@aead
Copy link

aead commented Jan 8, 2018

Requirements

Since we use a hash-based bitrot protection approach we need a (hash) function f mapping value of arbitrary length to a value of a fixed length. The function f must satisfy the following properties:

  • The mapping f(x) = y must be deterministic. That means that f(x) always evaluates to y and not to z at some point in time
  • The evaluation of f with k different randomly chosen inputs x1, ..., xk should differ with probability P = e-((k * (k-1)) / (2 * 2n)) where n is length of the output values of f.

The 1. requirement is kind of obvious - if f is not deterministic it is not possible to reliable verify the integrity of data. The 2. requirement is derived from the strong collision requirement. However in contrast to the general birthday problem we are not interested in "constructed" collisions. Therefore we define a requirement violation as a collisions of randomly chosen inputs. A constructed collision would imply and intelligent adversary exploiting a mathematical weakness of a given mapping f to find collisions with a probability P' > P. However bitrot protection is not designed to withstand an attacker because the adversary can simple modify the data and replace the stored hash value with the hash value of the modified data.

Random Oracle and PRF

A cryptographic secure hash function - like BLAKE2b-512 - always satisfies our requirements since it provides strong collision resistance which does not restrict the input values to be chosen randomly.
A random oracle and following a PRF (like SipHash / HighwayHash) also satisfies our requirements since a PRF can be constructed from our function f` using the following scheme:

  1. Generate a set of uniformly at random chosen keys. |ki| = |xi|
  2. yi = f(ki ⊕ xi)

This implies that SipHash, HighwayHash and BLAKE2b can be used for bitrot protection because of their mathematical properties. (Since we discarded Poly1305 anyway I don't discuss it here further)

However SipHash and HigwayHash are PRFs and require a secret key. The secret key can be securely omitted if the PRF provides strong collision resistance for a fixed (but unkown) secret key which is the case for SipHash and HighwayHash.

Choice of hash size

The following list shows the number of data modifications that can happened before the probability for an undetected bitrot exceeds 50%:

  • SipHash-64:
    For k = 232 + 229 + 228 the probability for a collision Pc = 1-e-((k * (k-1)) / (2 * 2n)) = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 5100 million computations - roughly speaking 5 billion objects with bitrot in our case.
  • SipHash-128 / HighwayHash-128:
    For k = 264 + 261 + 260 the probability for a collision Pc = 1-e-((k * (k-1)) / (2 * 2n)) = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 21905508 trillion computations - roughly speaking 21905508 trillion objects with bitrot in our case.
  • HighwayHash-256:
    For k = 2128 + 2125 + 2124 the probability for a collision Pc = 1-e-((k * (k-1)) / (2 * 2n)) = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 4.04×1038 computations - This is far beyond practical computation and can be treated as impossible.

Notice that this numbers refer to all minio instances. So for example if we talk about 5 billion objects with bitrot than we mean 5 billion distributed over all minio instances all over the world.
Basically 50% chance of undetected bitrot on one minio server somewhere on the planet earth.
(Except aliens are using minio 😉)

@fwessels
Copy link
Owner

fwessels commented Jan 8, 2018

Couple of remarks

  • since we only compare checksums within a single part (and not across parts, objects or let alone minio servers), if a bit rot would result in an equivalent checksum to the checksum of another part, this would not matter.
  • if two different bit rots for the same part would generated the same checksum, but as long as it is different from the original checksum, we would have a 'collision' but it would not matter in minio's case.
  • if encryption is used on top, then an undetected bit rot (if it were to happen) would fail the decryption.

@aead
Copy link
Author

aead commented Jan 9, 2018

since we only compare checksums within a single part (and not across parts, objects or let alone minio servers), if a bit rot would result in an equivalent checksum to the checksum of another part, this would not matter.

The checksum (in case of minio) is computed per 64 MB object part in case of PUT and per object part in case of multipart PUT. However for the computed probabilities this does not matter. This is not part of the mathematical considerations - See comment to 2. point.

if two different bit rots for the same part would generated the same checksum, but as long as it is different from the original checksum, we would have a 'collision' but it would not matter in minio's case.

That's why the Choice of hash size paragraph says:

number of data modifications that can happened before the probability for an undetected bitrot exceeds 50%

I don't not compute the probability for a collision of object hashes - I compute the probability for a collision of modified (bitrot occurred) object hashes. All numbers refer to objects which are already modified by bitrot. For example:
"That would mean that the probability for one undetected bitrot is about 50% after 5100 million computations - roughly speaking 5 billion objects with bitrot in our case."

Whether bitrot affects every second object or only 1/100000000 does not affect the probability. It just affects the total number of objects that can be stored before a bitrot is not detect.

if encryption is used on top, then an undetected bit rot (if it were to happen) would fail the decryption.

That is correct. Of course there is also a probability that this could happen (that would be a message forgery) but as long as the chosen cryptographic primitives satisfies the claimed security properties this probability is negligible.

To show an example:
If we would use SipHash-64 and the probability for bitrot would be 0.5 (so every second object would be affected) than the probability for an undetected bitrot is still:
P = e-((k * (k-1)) / (2 * 264))

So the probability for an undetected bitrot would exceed 50% after 232+229+228 corrupted objects.
However since only every 2. object is affected we would be able to store 2 * (232+229+228) before he probability for an undetected bitrot would exceed 50%. If instead of every 2nd object only every 1000nd object would be affected by bitrot we would be able to store 1000 * (232+229+228) objects before the probability would exceed 50%.

@fwessels
Copy link
Owner

fwessels commented Jan 9, 2018

I see what you are saying, you are correct.

I went back to my Excel sheet and did some extrapolations as to when a certain bit length would get a collision with the following results in terms of number of permutations:

  • 64 bit: 9.6E09 (9600 million)
  • 128 bit: 3.3E20
  • 256 bit: 2.5E40

So this seems in line with the results mentioned above, let's discuss it with the team and make a decision.

BTW You don't think they use minio in space? :-)

@aead
Copy link
Author

aead commented Jan 9, 2018

I went back to my Excel sheet and did some extrapolations as to when a certain bit length would get a collision with the following results in terms of number of permutations:

64 bit: 9.6E09 (9600 million)
128 bit: 3.3E20
256 bit: 2.5E40

Yes, this numbers sounds reasonable. I don't know the probability for bitrot (I guess it should be relatively small). So I think 64 bit is too short, 128 bit would be acceptable and 256 would be the conservative choice. So 128bit would probably satisfy most users expectations but I would recommend 256 bit to be sure.

BTW You don't think they use minio in space? :-)

Oh - your right I completely forgot the ISS and all the satellites, my bad 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants