I was looking into issue #509 and noticed a slightly weird feature of the data shuffling function.
Essentially, the shuffled snapshots are sized so that the total number of grid points (summed across all original snapshots), divided by the number of shuffled snapshots, comes out as an integer (lines 524-546 of data_shuffler.py). The dimensions of the shuffled snapshots are based on this.
However, when the shuffling is actually performed, the shuffled snapshots are populated by taking the number of grid points per original snapshot divided by the number of shuffled snapshots (lines 157-179). This is not necessarily an integer: for example, consider creating 3 shuffled snapshots from snapshots on a 200x200x200 grid, where 8,000,000 grid points divided by 3 is not a whole number.
The result of this is that, if the number of shuffled snapshots is not a divisor of each of the original snapshot grid sizes, then some of the vectors in the final shuffled data (a very small number) will be completely zero. Here is one such example:
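The numbers for the 200x200x200 case work out roughly as follows (a back-of-the-envelope sketch rather than actual data_shuffler.py output; it assumes 3 original and 3 shuffled snapshots, and the variable names are mine):

```python
# Hypothetical setup: 3 original snapshots on a 200x200x200 grid,
# shuffled into 3 output snapshots.
gridpoints_per_snapshot = 200 * 200 * 200     # 8,000,000
n_snapshots = 3
n_shuffled = 3

# Size of each shuffled snapshot: total grid points divided by the number of
# shuffled snapshots (exact here: 24,000,000 / 3 = 8,000,000).
shuffled_size = (gridpoints_per_snapshot * n_snapshots) // n_shuffled

# But each shuffled snapshot is filled with 1/n-th of every original snapshot,
# and 8,000,000 / 3 is not an integer, so the remainder is silently dropped.
chunk_per_original = gridpoints_per_snapshot // n_shuffled    # 2,666,666

filled = chunk_per_original * n_snapshots                     # 7,999,998
print(shuffled_size - filled)   # -> 2 rows per shuffled snapshot stay all-zero
```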
Given the typical grid sizes we work with, I very much doubt that a handful of zero vectors will affect the neural network training. But I have a couple of questions:
1. Is this a known feature?
2. Is it necessary to take exactly 1/n-th from each original snapshot into the final shuffled snapshots? Given the grid sizes we work with, there would be statistically no difference if we first concatenated the input vectors into a single big array and then simply shuffled that (see the sketch below).
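The alternative I have in mind is roughly the following (just a sketch with made-up names and tiny dummy data, not a drop-in replacement for the memmap-based implementation):

```python
import numpy as np

# Dummy stand-ins for the per-snapshot input/output arrays
# (shape: grid points x feature dimension); names are hypothetical.
input_arrays = [np.random.rand(10, 4) for _ in range(3)]
output_arrays = [np.random.rand(10, 1) for _ in range(3)]
n_shuffled_snapshots = 3

rng = np.random.default_rng(42)

# Concatenate everything into one big array per quantity.
all_inputs = np.concatenate(input_arrays, axis=0)
all_outputs = np.concatenate(output_arrays, axis=0)

# One permutation applied to both arrays, so input/output pairs stay aligned.
permutation = rng.permutation(all_inputs.shape[0])
all_inputs = all_inputs[permutation]
all_outputs = all_outputs[permutation]

# Split into the requested number of shuffled snapshots; sizes may differ by
# one row if the total is not divisible, but no row is ever left zero.
shuffled_inputs = np.array_split(all_inputs, n_shuffled_snapshots)
shuffled_outputs = np.array_split(all_outputs, n_shuffled_snapshots)
```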
In my opinion, it would be better to do it as described above: it simplifies the code, it ensures there are no spurious zero vectors in the final training set, and it would make the solution to #509 easier. What do you think @RandomDefaultUser? Am I missing something here?
Hi @timcallow, I believe the original reason for taking exactly 1/n-th is that the shuffling is realized via numpy memmaps. We don't load all the snapshots into memory, because that would be quite an overhead; instead we load only 1/n-th of the n snapshots, i.e. one snapshot at a time. As you have explained, that quite clearly leads to a problem.
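To illustrate the idea (this is only a rough sketch of the memmap scheme with made-up names and tiny dummy data, not the actual data_shuffler.py code, and it leaves out the random selection the real code does):

```python
import numpy as np

feature_dim = 4
n_shuffled = 3
gridsize = 10  # tiny stand-in for a 200x200x200 grid (8,000,000 points)

# Create some dummy "original snapshot" files (hypothetical names).
original_snapshot_paths = []
for i in range(3):
    path = f"snapshot_{i}.npy"
    np.save(path, np.random.rand(gridsize, feature_dim))
    original_snapshot_paths.append(path)

# Size of one shuffled snapshot: total grid points // number of shuffled files.
shuffled_size = (gridsize * len(original_snapshot_paths)) // n_shuffled

# Each shuffled snapshot is a memmap on disk, so the full data set never has
# to sit in memory at once; unwritten regions of a fresh memmap read as zero.
shuffled = np.lib.format.open_memmap(
    "shuffled_snapshot_0.npy",
    mode="w+",
    dtype=np.float64,
    shape=(shuffled_size, feature_dim),
)

# Originals are loaded one at a time, and exactly 1/n-th of each is copied in.
for i, path in enumerate(original_snapshot_paths):
    data = np.load(path, mmap_mode="r")
    chunk = data.shape[0] // n_shuffled  # integer division drops the remainder
    shuffled[i * chunk:(i + 1) * chunk] = data[:chunk]

# If n_shuffled does not divide the per-snapshot gridsize, the last few rows
# of "shuffled" are never written and stay all-zero.
print(np.count_nonzero(~shuffled.any(axis=1)))  # -> 1 zero row in this toy case
```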