Same data all-reduce on H20, but results are different #1497
Comments
How reproducible is this issue? I see that your screenshot is from iteration 59... What NCCL version is this? It looks like a single-node, 8-GPU run, correct? Could you include the generated debug output? Are the GPUs connected by NVLink? Can you rerun with that enabled?
Do you see differences on all iterations? Can you tell me what the purpose of this line in test_all_reduce is?
I found some related info here:
It's just to initialize cur_results to 8 elements; it's not a good implementation.
NVLS is hardware-accelerated and it's known to be less deterministic than the algorithms implemented by NCCL in software. Even among "classic" NCCL algorithms like RING and TREE, I believe TREE is considered more predictable/numerically stable due to the smaller number of individual reduction operations needed (log N vs N). So it's not a binary (yes/no) issue... Back to NVLS: this behavior is not something that NCCL has any control over (other than not using NVLS in the first place), and it's a known trade-off (performance vs stability/determinism) of current CUDA driver releases...
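For reference, "not using NVLS in the first place" maps to NCCL's NCCL_ALGO environment variable, which restricts the set of algorithms NCCL may select. A minimal sketch, assuming the variable takes effect before any communicator is created in the process (setting it in the environment that launches the job is the more common route):

```cpp
#include <cstdlib>  // setenv (POSIX)

// Restrict NCCL to its software algorithms so the hardware-accelerated NVLS
// path is never selected. NCCL_ALGO accepts a comma-separated list of
// allowed algorithms; it must be in effect before NCCL initialization.
void restrict_nccl_to_software_algorithms() {
    setenv("NCCL_ALGO", "Ring,Tree", /*overwrite=*/1);
}
```

As noted above, this trades some performance for run-to-run stability.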
@kiskra-nvidia Could you please help confirm that NVLS is not a deterministic algorithm? We also use this algorithm. If it isn't deterministic, we may switch to ring to make training converge.
Yes, it is a known issue with NVLS. We believe it can be fixed, though, so we're working to make NVLS operation deterministic in a future driver.
Does this mean not deterministic between different collective calls, or not deterministic between different switches in e.g. a single allreduce collective involving multiple nodes? |
Not deterministic between different calls. All ranks should always get the same results. |
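In other words, the guarantee is cross-rank agreement within a single call; bit-wise repeatability across calls on the same input is what NVLS currently does not provide. A small sketch of the two checks (buffer names are illustrative and assumed to be host copies of the all-reduce outputs):

```cpp
#include <cstring>

// Cross-rank check: within one ncclAllReduce() call, every rank should hold
// byte-identical results, regardless of the algorithm used.
bool ranks_agree(const float* rank_a, const float* rank_b, size_t count) {
    return std::memcmp(rank_a, rank_b, count * sizeof(float)) == 0;
}

// Cross-call check: two calls on byte-identical input should also match, but
// this is the property that NVLS currently may violate.
bool calls_repeat(const float* call_0, const float* call_1, size_t count) {
    return std::memcmp(call_0, call_1, count * sizeof(float)) == 0;
}
```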
As the title says, this may be a bug in NCCL on the H20 GPU.
Here are some discussions: pytorch/pytorch#138811
I have implemented a pure C++ version and can reproduce the problem on the H20.
image:
nvcr.io/nvidia/pytorch:24.09-py3
test_ar.cpp
CMakeList.txt
compile:
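Since the attachments are not reproduced above, below is a hypothetical single-process stand-in for test_ar.cpp (not the original code): it runs the same all-reduce twice on identical input across all visible GPUs and reports whether the two results match bit-for-bit. The compile command in the comment is likewise only illustrative; include and library paths depend on the container.

```cpp
// Illustrative compile command (paths may differ):
//   g++ -O2 test_ar.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 \
//       -lcudart -lnccl -o test_ar
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) {   \
    printf("CUDA error %s at line %d\n", cudaGetErrorString(e), __LINE__);    \
    return 1; } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) {  \
    printf("NCCL error %s at line %d\n", ncclGetErrorString(r), __LINE__);    \
    return 1; } } while (0)

int main() {
    int ndev = 0;
    CHECK_CUDA(cudaGetDeviceCount(&ndev));   // e.g. 8 GPUs on a single H20 node
    const size_t count = 1 << 20;            // elements per rank

    std::vector<ncclComm_t> comms(ndev);
    std::vector<cudaStream_t> streams(ndev);
    std::vector<float*> in(ndev), out(ndev);

    // One communicator per local GPU, all managed by this single process.
    CHECK_NCCL(ncclCommInitAll(comms.data(), ndev, nullptr));
    for (int i = 0; i < ndev; ++i) {
        CHECK_CUDA(cudaSetDevice(i));
        CHECK_CUDA(cudaStreamCreate(&streams[i]));
        CHECK_CUDA(cudaMalloc(&in[i], count * sizeof(float)));
        CHECK_CUDA(cudaMalloc(&out[i], count * sizeof(float)));
        // Fill the input with a fixed, rank-dependent pattern so both
        // iterations reduce exactly the same data.
        std::vector<float> host(count);
        for (size_t j = 0; j < count; ++j) host[j] = 1.0f / float(i + j % 97 + 1);
        CHECK_CUDA(cudaMemcpy(in[i], host.data(), count * sizeof(float),
                              cudaMemcpyHostToDevice));
    }

    std::vector<float> result0(count), result1(count);
    for (int iter = 0; iter < 2; ++iter) {
        // Group the per-GPU submissions since one thread drives all comms.
        CHECK_NCCL(ncclGroupStart());
        for (int i = 0; i < ndev; ++i) {
            CHECK_NCCL(ncclAllReduce(in[i], out[i], count, ncclFloat, ncclSum,
                                     comms[i], streams[i]));
        }
        CHECK_NCCL(ncclGroupEnd());
        for (int i = 0; i < ndev; ++i) {
            CHECK_CUDA(cudaSetDevice(i));
            CHECK_CUDA(cudaStreamSynchronize(streams[i]));
        }
        // Copy rank 0's result back; the input was untouched, so the two
        // iterations should match bit-for-bit if the reduction is deterministic.
        CHECK_CUDA(cudaSetDevice(0));
        CHECK_CUDA(cudaMemcpy(iter == 0 ? result0.data() : result1.data(), out[0],
                              count * sizeof(float), cudaMemcpyDeviceToHost));
    }

    bool same = std::memcmp(result0.data(), result1.data(),
                            count * sizeof(float)) == 0;
    printf("two identical all-reduces produced %s results\n",
           same ? "bit-identical" : "DIFFERENT");

    for (int i = 0; i < ndev; ++i) {
        CHECK_CUDA(cudaSetDevice(i));
        cudaFree(in[i]); cudaFree(out[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return same ? 0 : 2;
}
```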