mathemtical aspects #1

aead · 2018-01-08T20:28:59Z

Requirements

Since we use a hash-based bitrot protection approach we need a (hash) function f mapping value of arbitrary length to a value of a fixed length. The function f must satisfy the following properties:

The mapping f(x) = y must be deterministic. That means that f(x) always evaluates to y and not to z at some point in time
The evaluation of f with k different randomly chosen inputs x₁, ..., x_k should differ with probability P = e^{-((k * (k-1)) / (2 * 2ⁿ))} where n is length of the output values of f.

The 1. requirement is kind of obvious - if f is not deterministic it is not possible to reliable verify the integrity of data. The 2. requirement is derived from the strong collision requirement. However in contrast to the general birthday problem we are not interested in "constructed" collisions. Therefore we define a requirement violation as a collisions of randomly chosen inputs. A constructed collision would imply and intelligent adversary exploiting a mathematical weakness of a given mapping f to find collisions with a probability P' > P. However bitrot protection is not designed to withstand an attacker because the adversary can simple modify the data and replace the stored hash value with the hash value of the modified data.

Random Oracle and PRF

A cryptographic secure hash function - like BLAKE2b-512 - always satisfies our requirements since it provides strong collision resistance which does not restrict the input values to be chosen randomly.
A random oracle and following a PRF (like SipHash / HighwayHash) also satisfies our requirements since a PRF can be constructed from our function f` using the following scheme:

Generate a set of uniformly at random chosen keys. |k_i| = |x_i|
y_i = f(k_i ⊕ x_i)

This implies that SipHash, HighwayHash and BLAKE2b can be used for bitrot protection because of their mathematical properties. (Since we discarded Poly1305 anyway I don't discuss it here further)

However SipHash and HigwayHash are PRFs and require a secret key. The secret key can be securely omitted if the PRF provides strong collision resistance for a fixed (but unkown) secret key which is the case for SipHash and HighwayHash.

Choice of hash size

The following list shows the number of data modifications that can happened before the probability for an undetected bitrot exceeds 50%:

SipHash-64:
For k = 2³² + 2²⁹ + 2²⁸ the probability for a collision P_c = 1-e^{-((k * (k-1)) / (2 * 2ⁿ))} = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 5100 million computations - roughly speaking 5 billion objects with bitrot in our case.
SipHash-128 / HighwayHash-128:
For k = 2⁶⁴ + 2⁶¹ + 2⁶⁰ the probability for a collision P_c = 1-e^{-((k * (k-1)) / (2 * 2ⁿ))} = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 21905508 trillion computations - roughly speaking 21905508 trillion objects with bitrot in our case.
HighwayHash-256:
For k = 2¹²⁸ + 2¹²⁵ + 2¹²⁴ the probability for a collision P_c = 1-e^{-((k * (k-1)) / (2 * 2ⁿ))} = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 4.04×10³⁸ computations - This is far beyond practical computation and can be treated as impossible.

Notice that this numbers refer to all minio instances. So for example if we talk about 5 billion objects with bitrot than we mean 5 billion distributed over all minio instances all over the world.
Basically 50% chance of undetected bitrot on one minio server somewhere on the planet earth.
(Except aliens are using minio 😉)

The text was updated successfully, but these errors were encountered:

fwessels · 2018-01-08T21:38:43Z

Couple of remarks

since we only compare checksums within a single part (and not across parts, objects or let alone minio servers), if a bit rot would result in an equivalent checksum to the checksum of another part, this would not matter.
if two different bit rots for the same part would generated the same checksum, but as long as it is different from the original checksum, we would have a 'collision' but it would not matter in minio's case.
if encryption is used on top, then an undetected bit rot (if it were to happen) would fail the decryption.

aead · 2018-01-09T00:28:56Z

since we only compare checksums within a single part (and not across parts, objects or let alone minio servers), if a bit rot would result in an equivalent checksum to the checksum of another part, this would not matter.

The checksum (in case of minio) is computed per 64 MB object part in case of PUT and per object part in case of multipart PUT. However for the computed probabilities this does not matter. This is not part of the mathematical considerations - See comment to 2. point.

if two different bit rots for the same part would generated the same checksum, but as long as it is different from the original checksum, we would have a 'collision' but it would not matter in minio's case.

That's why the Choice of hash size paragraph says:

number of data modifications that can happened before the probability for an undetected bitrot exceeds 50%

I don't not compute the probability for a collision of object hashes - I compute the probability for a collision of modified (bitrot occurred) object hashes. All numbers refer to objects which are already modified by bitrot. For example:
"That would mean that the probability for one undetected bitrot is about 50% after 5100 million computations - roughly speaking 5 billion objects with bitrot in our case."

Whether bitrot affects every second object or only 1/100000000 does not affect the probability. It just affects the total number of objects that can be stored before a bitrot is not detect.

if encryption is used on top, then an undetected bit rot (if it were to happen) would fail the decryption.

That is correct. Of course there is also a probability that this could happen (that would be a message forgery) but as long as the chosen cryptographic primitives satisfies the claimed security properties this probability is negligible.

To show an example:
If we would use SipHash-64 and the probability for bitrot would be 0.5 (so every second object would be affected) than the probability for an undetected bitrot is still:
P = e^{-((k * (k-1)) / (2 * 2⁶⁴))}

So the probability for an undetected bitrot would exceed 50% after 2³²+2²⁹+2²⁸ corrupted objects.
However since only every 2. object is affected we would be able to store 2 * (2³²+2²⁹+2²⁸) before he probability for an undetected bitrot would exceed 50%. If instead of every 2nd object only every 1000nd object would be affected by bitrot we would be able to store 1000 * (2³²+2²⁹+2²⁸) objects before the probability would exceed 50%.

fwessels · 2018-01-09T18:58:25Z

I see what you are saying, you are correct.

I went back to my Excel sheet and did some extrapolations as to when a certain bit length would get a collision with the following results in terms of number of permutations:

64 bit: 9.6E09 (9600 million)
128 bit: 3.3E20
256 bit: 2.5E40

So this seems in line with the results mentioned above, let's discuss it with the team and make a decision.

BTW You don't think they use minio in space? :-)

aead · 2018-01-09T19:16:55Z

I went back to my Excel sheet and did some extrapolations as to when a certain bit length would get a collision with the following results in terms of number of permutations:

64 bit: 9.6E09 (9600 million)
128 bit: 3.3E20
256 bit: 2.5E40

Yes, this numbers sounds reasonable. I don't know the probability for bitrot (I guess it should be relatively small). So I think 64 bit is too short, 128 bit would be acceptable and 256 would be the conservative choice. So 128bit would probably satisfy most users expectations but I would recommend 256 bit to be sure.

BTW You don't think they use minio in space? :-)

Oh - your right I completely forgot the ISS and all the satellites, my bad 😉

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mathemtical aspects #1

mathemtical aspects #1

aead commented Jan 8, 2018 •

edited

Loading

fwessels commented Jan 8, 2018

aead commented Jan 9, 2018 •

edited

Loading

fwessels commented Jan 9, 2018

aead commented Jan 9, 2018 •

edited

Loading

mathemtical aspects #1

mathemtical aspects #1

Comments

aead commented Jan 8, 2018 • edited Loading

Requirements

Random Oracle and PRF

Choice of hash size

fwessels commented Jan 8, 2018

aead commented Jan 9, 2018 • edited Loading

fwessels commented Jan 9, 2018

aead commented Jan 9, 2018 • edited Loading

aead commented Jan 8, 2018 •

edited

Loading

aead commented Jan 9, 2018 •

edited

Loading

aead commented Jan 9, 2018 •

edited

Loading