
ReplicaExchangeSampler failing silently on single A100 GPU when too many replicas are used #732

Open
k2o0r opened this issue Jun 7, 2024 · 1 comment


@k2o0r

k2o0r commented Jun 7, 2024

Hi, I've been trying to use openmmtools for some HREX simulations recently and ran into some unusual behaviour.

My first attempt was to use MPI to distribute the replicas over multiple GPUs, but I had trouble getting that to work.

That seems to be related to the OMPI build on my cluster rather than to anything in openmmtools, so I next tried running all 16 replicas on a single A100-SXM-80GB card (with 32 CPU cores and 250 GB RAM also allocated). Those jobs would run for a long time (~16 hours) without the reporter file ever growing in size, and even very short simulations (e.g. 3 iterations) never finished.

The exact same code (16 replicas on 1 card) ran, albeit quite slowly, on my workstation, and I can also run 12 replicas on the cluster with the setup above, so I suspect it's down to the memory required to store all the contexts simultaneously; what's strange is that I don't get any kind of error. The job just runs until it hits its time limit without ever writing data to the reporter files.
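For reference, the setup is roughly along these lines (a simplified sketch rather than my exact production script; the test system, temperature ladder, and file name below are placeholders):

```python
from openmm import unit
from openmmtools import mcmc, states, testsystems
from openmmtools.multistate import MultiStateReporter, ReplicaExchangeSampler

# Placeholder test system -- the real runs use a larger, solvated system.
testsystem = testsystems.AlanineDipeptideImplicit()

# 16 thermodynamic states on a simple temperature ladder (placeholder values).
n_replicas = 16
thermodynamic_states = [
    states.ThermodynamicState(system=testsystem.system,
                              temperature=(300.0 + 10.0 * i) * unit.kelvin)
    for i in range(n_replicas)
]

# One short Langevin segment per iteration; swaps are attempted between iterations.
move = mcmc.LangevinDynamicsMove(timestep=2.0 * unit.femtoseconds,
                                 collision_rate=1.0 / unit.picoseconds,
                                 n_steps=500)

sampler = ReplicaExchangeSampler(mcmc_moves=move, number_of_iterations=3)
reporter = MultiStateReporter('hrex_test.nc', checkpoint_interval=1)
sampler.create(
    thermodynamic_states=thermodynamic_states,
    sampler_states=[states.SamplerState(testsystem.positions) for _ in range(n_replicas)],
    storage=reporter,
)
sampler.run()
```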

Has this kind of issue been reported before? Do you think it would be possible to add some kind of check that ensures files are actually being written / that the sampler is actually progressing through iterations?

@mikemhenry
Contributor

I think it is worth keeping this issue open to see if others have similar reports, but fundamentally

Do you think it would be possible to add some kind of check that ensures files are actually being written / that the sampler is actually progressing through iterations?

really reduces to the halting problem: we can't programmatically tell whether something is just taking a long time (like a big simulation) or is stuck in an infinite loop.
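That said, a crude external check from a separate process can at least tell you whether iterations are landing in the storage file. Something along these lines might work (an untested sketch; `hrex_test.nc` stands in for whatever storage path you gave the reporter, and it assumes the NetCDF file can be opened read-only while the job still has it open):

```python
import time

from openmmtools.multistate import MultiStateReporter

STORAGE = 'hrex_test.nc'  # placeholder: the reporter file of the running job
CHECK_EVERY = 600         # seconds between checks

last_seen = None
while True:
    # Open the storage read-only and ask which iteration was last written.
    reporter = MultiStateReporter(STORAGE, open_mode='r')
    try:
        current = reporter.read_last_iteration(last_checkpoint=False)
    finally:
        reporter.close()

    if last_seen is not None and current == last_seen:
        print(f'No new iteration since last check (still at {current}); run may be stuck.')
    else:
        print(f'Last stored iteration: {current}')

    last_seen = current
    time.sleep(CHECK_EVERY)
```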

I think your intuition here:

The exact same code (16 replicas on 1 card) ran, albeit quite slowly, on my workstation, and I can also run 12 replicas on the cluster with the setup above, so I suspect it's down to the memory required to store all the contexts simultaneously; what's strange is that I don't get any kind of error.

is likely correct. The lack of an error may be due to the card using some form of virtual memory to allocate more pages than it can physically hold, which then cycle out of the cache very slowly; I can't remember exactly what tricks GPUs do these days. In general I find that when trying to parallelize things you have to figure them out empirically: throw 2x, 4x, 8x, 16x the resources at the problem and see how it scales. At some point communication overhead and memory swapping eat into the gains and, as you observed, can result in a decrease in performance.
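One concrete way to do that check here would be to time a single iteration at a few replica counts while watching the card with nvidia-smi; if the wall time jumps disproportionately (or memory sits pinned at the 80 GB limit) somewhere between 12 and 16 replicas, that points at the contexts not fitting. A rough sketch, where `build_sampler` is a placeholder for your own setup code:

```python
import time

from my_hrex_setup import build_sampler  # placeholder: your own sampler-construction function

for n_replicas in (4, 8, 12, 16):
    sampler = build_sampler(n_replicas=n_replicas, storage=f'scaling_{n_replicas}.nc')
    start = time.time()
    sampler.run(n_iterations=1)  # a single HREX iteration
    elapsed = time.time() - start
    print(f'{n_replicas} replicas: {elapsed:.1f} s per iteration')
```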
