I was looking into issue #509 and noticed a slightly weird feature of the data shuffling function.
Essentially, the shuffled snapshots are sized so that the total number of grid points (summed across all original snapshots), divided by the number of shuffled snapshots, comes out as an integer (lines 524-546 of data_shuffler.py). The dimensions of the shuffled snapshots are based on this.
However, when the shuffling is actually performed, the shuffled snapshots are populated by taking the number of grid points per original snapshot divided by the number of shuffled snapshots (lines 157-179). This is not necessarily an integer: for example, consider creating 3 shuffled snapshots from snapshots on a 200x200x200 grid, where 8,000,000 grid points divided by 3 is not a whole number.
The result of this is that, if the number of shuffled snapshots is not a divisor of each of the original snapshot grid sizes, then some of the vectors in the final shuffled data (a very small number) will be completely zero. Here is one such example:
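The numbers for the 200x200x200 case work out roughly as follows (a back-of-the-envelope sketch rather than actual data_shuffler.py output; it assumes 3 original and 3 shuffled snapshots, and the variable names are mine):

```python
# Hypothetical setup: 3 original snapshots on a 200x200x200 grid,
# shuffled into 3 output snapshots.
gridpoints_per_snapshot = 200 * 200 * 200     # 8,000,000
n_snapshots = 3
n_shuffled = 3

# Size of each shuffled snapshot: total grid points divided by the number of
# shuffled snapshots (exact here: 24,000,000 / 3 = 8,000,000).
shuffled_size = (gridpoints_per_snapshot * n_snapshots) // n_shuffled

# But each shuffled snapshot is filled with 1/n-th of every original snapshot,
# and 8,000,000 / 3 is not an integer, so the remainder is silently dropped.
chunk_per_original = gridpoints_per_snapshot // n_shuffled    # 2,666,666

filled = chunk_per_original * n_snapshots                     # 7,999,998
print(shuffled_size - filled)   # -> 2 rows per shuffled snapshot stay all-zero
```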
Given the typical grid sizes we work with, I very much doubt that a handful of zero vectors will affect the neural network training. But I have a couple of questions:
1. Is this a known feature?
2. Is it necessary to take exactly 1/n-th from each original snapshot into the final shuffled snapshots? Given the grid sizes we work with, there would be statistically no difference if we first concatenated the input vectors into a single big array and then simply shuffled that (see the sketch below).
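The alternative I have in mind is roughly the following (just a sketch with made-up names and tiny dummy data, not a drop-in replacement for the memmap-based implementation):

```python
import numpy as np

# Dummy stand-ins for the per-snapshot input/output arrays
# (shape: grid points x feature dimension); names are hypothetical.
input_arrays = [np.random.rand(10, 4) for _ in range(3)]
output_arrays = [np.random.rand(10, 1) for _ in range(3)]
n_shuffled_snapshots = 3

rng = np.random.default_rng(42)

# Concatenate everything into one big array per quantity.
all_inputs = np.concatenate(input_arrays, axis=0)
all_outputs = np.concatenate(output_arrays, axis=0)

# One permutation applied to both arrays, so input/output pairs stay aligned.
permutation = rng.permutation(all_inputs.shape[0])
all_inputs = all_inputs[permutation]
all_outputs = all_outputs[permutation]

# Split into the requested number of shuffled snapshots; sizes may differ by
# one row if the total is not divisible, but no row is ever left zero.
shuffled_inputs = np.array_split(all_inputs, n_shuffled_snapshots)
shuffled_outputs = np.array_split(all_outputs, n_shuffled_snapshots)
```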
In my opinion, it would be better to do it as described above: it simplifies the code, it ensures there are no spurious zero vectors in the final training set, and it would make the solution to #509 easier. What do you think @RandomDefaultUser? Am I missing something here?
Hi @timcallow, I believe the original reason for taking exactly 1/n-th is that the shuffling is realized via numpy memmaps. We don't load all the snapshots into memory, because that would be quite an overhead; instead we load only 1/n-th of the n snapshots, i.e. one snapshot at a time. As you have explained, that quite clearly leads to a problem.
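To illustrate the idea (this is only a rough sketch of the memmap scheme with made-up names and tiny dummy data, not the actual data_shuffler.py code, and it leaves out the random selection the real code does):

```python
import numpy as np

feature_dim = 4
n_shuffled = 3
gridsize = 10  # tiny stand-in for a 200x200x200 grid (8,000,000 points)

# Create some dummy "original snapshot" files (hypothetical names).
original_snapshot_paths = []
for i in range(3):
    path = f"snapshot_{i}.npy"
    np.save(path, np.random.rand(gridsize, feature_dim))
    original_snapshot_paths.append(path)

# Size of one shuffled snapshot: total grid points // number of shuffled files.
shuffled_size = (gridsize * len(original_snapshot_paths)) // n_shuffled

# Each shuffled snapshot is a memmap on disk, so the full data set never has
# to sit in memory at once; unwritten regions of a fresh memmap read as zero.
shuffled = np.lib.format.open_memmap(
    "shuffled_snapshot_0.npy",
    mode="w+",
    dtype=np.float64,
    shape=(shuffled_size, feature_dim),
)

# Originals are loaded one at a time, and exactly 1/n-th of each is copied in.
for i, path in enumerate(original_snapshot_paths):
    data = np.load(path, mmap_mode="r")
    chunk = data.shape[0] // n_shuffled  # integer division drops the remainder
    shuffled[i * chunk:(i + 1) * chunk] = data[:chunk]

# If n_shuffled does not divide the per-snapshot gridsize, the last few rows
# of "shuffled" are never written and stay all-zero.
print(np.count_nonzero(~shuffled.any(axis=1)))  # -> 1 zero row in this toy case
```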