same data all reduce on H20, but results are different #1497

Open
Rainlin007 opened this issue Oct 28, 2024 · 10 comments

Rainlin007 commented Oct 28, 2024

As the title says, it may be a bug in NCCL on the H20 GPU.
Here are some discussions: pytorch/pytorch#138811

I have implemented a pure C++ version that reproduces the problem on the H20.

image: nvcr.io/nvidia/pytorch:24.09-py3

test_ar.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <nccl.h>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <torch/torch.h>

const int NUM_GPUS = 8;
const int TENSOR_ROWS = 512;
const int TENSOR_COLS = 8192;
const int TENSOR_SIZE = TENSOR_ROWS * TENSOR_COLS;

#define NCCL_CALL(cmd) checkNcclError(cmd, __FILE__, __LINE__)

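// Global state shared across iterations: per-GPU input tensors, the reference
// results saved from iteration 0, and the results of the current iteration.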
std::vector<torch::Tensor> g_tensors;
std::vector<torch::Tensor> base_result;
std::vector<torch::Tensor> cur_results;

void checkNcclError(ncclResult_t result, const char *file, int line)
{
    if (result != ncclSuccess)
    {
        std::cerr << "NCCL error in " << file << ":" << line << " - " << ncclGetErrorString(result) << std::endl;
        std::exit(-1);
    }
}

void genRandomTensors(std::vector<torch::Tensor> &tensors)
{
    for (int i = 0; i < NUM_GPUS; ++i)
    {
        cudaSetDevice(i);
        torch::Tensor tensor = torch::rand({TENSOR_ROWS, TENSOR_COLS}, torch::device(torch::kCUDA).dtype(torch::kHalf));
        tensors.push_back(tensor);
    }
}

void all_reduce_process(int dev, ncclComm_t comm, cudaStream_t stream)
{
    // Set device
    cudaSetDevice(dev);

    auto tensor = g_tensors[dev].clone();

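    // In-place all-reduce: the same buffer serves as both send and receive buffer.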
    NCCL_CALL(ncclAllReduce(
        tensor.data_ptr(),
        tensor.data_ptr(),
        TENSOR_SIZE,
        ncclHalf,
        ncclSum,
        comm,
        stream));

    // Synchronize stream
    cudaStreamSynchronize(stream);

    cur_results[dev] = tensor;
}

torch::Tensor compareTensors(const torch::Tensor &tensor1, const torch::Tensor &tensor2)
{
    if (tensor1.sizes() != tensor2.sizes())
    {
        throw std::invalid_argument("The two tensors have different sizes");
    }
    auto diffMask = tensor1 != tensor2;
    auto diffIndices = diffMask.nonzero();
    return diffIndices;
}

int test_all_reduce(int iter)
{

    ncclComm_t comms[NUM_GPUS];
    int devs[NUM_GPUS] = {0, 1, 2, 3, 4, 5, 6, 7};
    std::vector<cudaStream_t> streams(NUM_GPUS);
    for (int i = 0; i < NUM_GPUS; ++i)
    {
        cudaSetDevice(devs[i]);
        cudaStreamCreate(&streams[i]);
    }
    NCCL_CALL(ncclCommInitAll(comms, NUM_GPUS, devs));


    genRandomTensors(cur_results);


    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_GPUS; ++i)
    {
        threads.emplace_back(all_reduce_process, devs[i], comms[i], streams[i]);
    }

    // Join threads
    for (auto &th : threads)
    {
        th.join();
    }

    // Copy results to base_result
    if (iter == 0)
    {
        for (int i = 0; i < NUM_GPUS; ++i)
        {
            base_result.emplace_back(cur_results[i].clone());
        }
    }
    else
    {
        auto diffs = compareTensors(base_result[0], cur_results[0]);
        std::cout << "diffs:" << diffs.numel() << std::endl;
        if (diffs.numel() >= 10)
        {
            torch::Tensor firstTen = diffs.slice(0, 0, 10);
            std::cout << "First 10 idx:" << std::endl;
            std::cout << firstTen << std::endl;
            for (int i = 0; i < 10; ++i)
            {
                std::cout << "base: " << base_result[0][diffs[i][0]][diffs[i][1]] << std::endl;
                std::cout << "cur: " << cur_results[0][diffs[i][0]][diffs[i][1]] << std::endl;
            }
        }
    }


    for (int i = 0; i < NUM_GPUS; ++i)
    {
        NCCL_CALL(ncclCommDestroy(comms[i]));
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}

int main()
{
    std::cout << "Start test_ar" << std::endl;
    genRandomTensors(g_tensors);
    for (int i = 0; i < 100; i++)
    {
        std::cout << "===============" << i << " start =================" << std::endl;
        test_all_reduce(i);
        std::cout << "===============" << i << " end   =================" << std::endl;
    }
    return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(test_ar)


find_package(Torch REQUIRED)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_executable(${PROJECT_NAME} test_ar.cpp)
target_link_libraries(${PROJECT_NAME} "${TORCH_LIBRARIES}")
target_link_libraries(${PROJECT_NAME} "/usr/lib/x86_64-linux-gnu/libnccl.so")

set_property(TARGET ${PROJECT_NAME} PROPERTY CXX_STANDARD 17)

compile:

cmake -DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` .
make -j8

./test_ar
@Rainlin007 (Author)

[Screenshot: test output from iteration 59 showing a nonzero diff count and mismatched base/cur values]

@kiskra-nvidia (Member)

How reproducible is this issue? I see that your screenshot is from iteration 59...

What NCCL version is this? It looks like a single-node, 8-GPU run, correct? Could you include the debug output generated with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING (a single iteration should be sufficient; we don't need to see all 100, unless you observe a difference in NCCL's behavior for iterations that produce a different result)?

Are the GPUs connected by NVLinks? Can you rerun with NCCL_NVLS_ENABLE=0 and see if that eliminates the randomness?
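
For reference, assuming the ./test_ar binary built above, those two runs would be:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING ./test_ar
NCCL_NVLS_ENABLE=0 ./test_ar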

@marksantesson

Do you see differences on all iterations?

Can you tell me what the purpose of this line in test_all_reduce is?
genRandomTensors(cur_results);


Rainlin007 commented Oct 29, 2024

I find that with NCCL_NVLS_ENABLE=0 the results are consistent. Is this NVLS behavior expected, or is it a bug? @kiskra-nvidia

Here is some more info:

  1. The 59th iteration is just one example; roughly 1 in 10 iterations shows a diff (I didn't screenshot the others).
  2. Single node, 8 GPUs.
  3. The GPUs are connected by NVLink.
  4. nvidia-nccl-cu12 2.20.5
  5. Image: nvcr.io/nvidia/pytorch:24.09-py3


Rainlin007 commented Oct 29, 2024

Do you see differences on all iterations?

Can you tell me what the purpose of this line in test_all_reduce is? genRandomTensors(cur_results);

It's just there to initialize cur_results to 8 elements; it's not a good implementation.
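
For what it's worth, a simpler way to reserve the NUM_GPUS slots (without allocating extra random tensors on every call) would be something like:

cur_results.resize(NUM_GPUS); // default-constructed placeholder tensors; all_reduce_process overwrites each slot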

@kiskra-nvidia (Member)

I find that with NCCL_NVLS_ENABLE=0 the results are consistent. Is this NVLS behavior expected, or is it a bug?

NVLS is hardware-accelerated and is known to be less deterministic than the algorithms NCCL implements in software. Even among "classic" NCCL algorithms like RING and TREE, I believe TREE is considered more predictable/numerically stable because it needs fewer individual reduction operations (log N vs. N). So it's not a binary (yes/no) issue... Back to NVLS: this behavior is not something NCCL has any control over (other than not using NVLS in the first place), and it's a known trade-off (performance vs. stability/determinism) of current CUDA driver releases...
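
For intuition on why the number and order of reduction operations matter at all: floating-point addition is not associative, so combining the same values in a different order can produce a different rounded result. A minimal, NCCL-free sketch (the values are arbitrary, chosen only to make the rounding visible):

#include <cstdio>

int main()
{
    // Floating-point addition is not associative: the grouping of the
    // partial sums changes the rounded result.
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left = (a + b) + c;  // (0) + 1 == 1
    float right = a + (b + c); // b + c rounds back to -1e8, so the total is 0
    std::printf("left=%g right=%g\n", left, right);
    return 0;
}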

@dearsxx0918

@kiskra-nvidia Could you please help confirm that NVLS is not a deterministic algorithm? We also use this algorithm; if it is indeed non-deterministic, we may switch to ring for the sake of training convergence.
Hope to hear from you soon!


sjeaugey commented Nov 6, 2024

Yes it is a known issue with NVLS. We believe it can be fixed though, so we're working to make NVLS operation deterministic in a future driver.


152334H commented Nov 6, 2024

Yes it is a known issue with NVLS. We believe it can be fixed though, so we're working to make NVLS operation deterministic in a future driver.

Does this mean not deterministic between different collective calls, or not deterministic between different switches in e.g. a single allreduce collective involving multiple nodes?


sjeaugey commented Nov 6, 2024

Not deterministic between different calls. All ranks should always get the same results.
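
To verify that with the reproducer above, one could also compare ranks against each other within a single iteration (a sketch, assuming the globals from test_ar.cpp; .cpu() is used because the result tensors live on different devices):

for (int i = 1; i < NUM_GPUS; ++i)
{
    auto rank_diffs = compareTensors(cur_results[0].cpu(), cur_results[i].cpu());
    std::cout << "rank 0 vs rank " << i << " diffs: " << rank_diffs.numel() << std::endl;
}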
