Fixing tiny information loss in shuffling #607

Open
wants to merge 3 commits into
base: develop
from

Conversation

RandomDefaultUser
Member

As detailed by @timcallow in #564, our current shuffling can lead to information loss. This occurs when the requested number of shuffled snapshots divides the total number of data points but not the number of grid points of each individual snapshot.

This PR fixes this by making the behavior implemented in #570 the default. Snapshot shuffling now works as follows:

1. (If the user does not provide a number of snapshot files to shuffle into, it is implicitly assumed that number_after_shuffling = number_before_shuffling.)
2. It is checked whether the number of grid points of each individual snapshot is divisible by this new number of snapshots.
3. If not, the overall data size is reduced and the algorithm implemented in #570 kicks in, discarding a tiny amount of information.
4. Data is shuffled into snapshot-like files with dimensions (x,1,1,feature), i.e., effectively a 1D array compatible with the MALA workflow.
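
For illustration, the four steps above can be sketched in plain numpy. This is a minimal sketch under stated assumptions, not the code in this PR: `shuffle_snapshots` is a hypothetical helper and not part of the MALA API, snapshots are assumed to be in-memory arrays of shape (nx, ny, nz, feature), and the trimming rule (drop the remainder of each snapshot after a random permutation) is an assumption that may differ from the algorithm in #570.

```python
import numpy as np


def shuffle_snapshots(snapshots, number_after_shuffling=None, seed=0):
    """Hypothetical helper (not MALA's API) illustrating the four steps."""
    rng = np.random.default_rng(seed)

    # Step 1: default to as many output snapshots as input snapshots.
    if number_after_shuffling is None:
        number_after_shuffling = len(snapshots)

    pieces = [[] for _ in range(number_after_shuffling)]
    discarded = 0
    for snapshot in snapshots:
        n_features = snapshot.shape[-1]
        # Flatten the (possibly non-cubic) grid to (n_points, feature).
        points = snapshot.reshape(-1, n_features)
        points = points[rng.permutation(points.shape[0])]

        # Step 2: is this snapshot's grid divisible by the new number?
        remainder = points.shape[0] % number_after_shuffling

        # Step 3 (assumed rule): if not, drop the leftover points.
        if remainder:
            points = points[: points.shape[0] - remainder]
            discarded += remainder

        # Every output snapshot receives an equal share of this input.
        chunks = np.split(points, number_after_shuffling)
        for target, chunk in zip(pieces, chunks):
            target.append(chunk)

    # Step 4: mix the shares and store them as (x, 1, 1, feature).
    shuffled = []
    for chunks in pieces:
        data = np.concatenate(chunks, axis=0)
        data = data[rng.permutation(data.shape[0])]
        shuffled.append(data.reshape(-1, 1, 1, data.shape[-1]))
    return shuffled, discarded


# Two snapshots with 2*3*5 = 30 grid points each: the total (60) is
# divisible by 4, but the individual snapshots (30) are not.
snapshots = [np.random.default_rng(i).random((2, 3, 5, 4)) for i in range(2)]
shuffled, discarded = shuffle_snapshots(snapshots, number_after_shuffling=4)
print(len(shuffled), shuffled[0].shape, discarded)  # 4 (14, 1, 1, 4) 4
```

In this toy case 4 of the 60 data points are dropped, which is the "tiny amount of information" referred to in step 3. When the grid size is divisible by the requested number, nothing is discarded.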

This PR also fixes an issue with non-cubic cells and adds a test.

RandomDefaultUser linked an issue on Nov 15, 2024 that may be closed by this pull request
@RandomDefaultUser
Member Author

This PR only tackles the numpy side of things; for OpenPMD it omits some of the logic. I will address this separately and have disabled one of the tests for now.

Development

Successfully merging this pull request may close these issues.

Shuffling can cause (tiny) information loss
1 participant