same data all reduce on H20, but results are different #1497

Open
Rainlin007 opened this issue Oct 28, 2024 · 10 comments

Rainlin007 commented Oct 28, 2024

As the title says, it may be a bug in NCCL on the H20 GPU.
Here are some discussions: pytorch/pytorch#138811

I have implemented a pure C++ version that reproduces the problem on the H20.

image: nvcr.io/nvidia/pytorch:24.09-py3

test_ar.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <nccl.h>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <torch/torch.h>

const int NUM_GPUS = 8;
const int TENSOR_ROWS = 512;
const int TENSOR_COLS = 8192;
const int TENSOR_SIZE = TENSOR_ROWS * TENSOR_COLS;

#define NCCL_CALL(cmd) checkNcclError(cmd, __FILE__, __LINE__)

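// Global state shared across iterations: per-GPU input tensors, the reference
// results saved from iteration 0, and the results of the current iteration.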
std::vector<torch::Tensor> g_tensors;
std::vector<torch::Tensor> base_result;
std::vector<torch::Tensor> cur_results;

void checkNcclError(ncclResult_t result, const char *file, int line)
{
    if (result != ncclSuccess)
    {
        std::cerr << "NCCL error in " << file << ":" << line << " - " << ncclGetErrorString(result) << std::endl;
        std::exit(-1);
    }
}

void genRandomTensors(std::vector<torch::Tensor> &tensors)
{
    for (int i = 0; i < NUM_GPUS; ++i)
    {
        cudaSetDevice(i);
        torch::Tensor tensor = torch::rand({TENSOR_ROWS, TENSOR_COLS}, torch::device(torch::kCUDA).dtype(torch::kHalf));
        tensors.push_back(tensor);
    }
}

void all_reduce_process(int dev, ncclComm_t comm, cudaStream_t stream)
{
    // Set device
    cudaSetDevice(dev);

    auto tensor = g_tensors[dev].clone();

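    // In-place all-reduce: the same buffer serves as both send and receive buffer.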
    NCCL_CALL(ncclAllReduce(
        tensor.data_ptr(),
        tensor.data_ptr(),
        TENSOR_SIZE,
        ncclHalf,
        ncclSum,
        comm,
        stream));

    // Synchronize stream
    cudaStreamSynchronize(stream);

    cur_results[dev] = tensor;
}

torch::Tensor compareTensors(const torch::Tensor &tensor1, const torch::Tensor &tensor2)
{
    if (tensor1.sizes() != tensor2.sizes())
    {
        throw std::invalid_argument("The two tensors have different sizes");
    }
    auto diffMask = tensor1 != tensor2;
    auto diffIndices = diffMask.nonzero();
    return diffIndices;
}

int test_all_reduce(int iter)
{

    ncclComm_t comms[NUM_GPUS];
    int devs[NUM_GPUS] = {0, 1, 2, 3, 4, 5, 6, 7};
    std::vector<cudaStream_t> streams(NUM_GPUS);
    for (int i = 0; i < NUM_GPUS; ++i)
    {
        cudaSetDevice(devs[i]);
        cudaStreamCreate(&streams[i]);
    }
    NCCL_CALL(ncclCommInitAll(comms, NUM_GPUS, devs));


    genRandomTensors(cur_results);


    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_GPUS; ++i)
    {
        threads.emplace_back(all_reduce_process, devs[i], comms[i], streams[i]);
    }

    // Join threads
    for (auto &th : threads)
    {
        th.join();
    }

    // Copy results to base_result
    if (iter == 0)
    {
        for (int i = 0; i < NUM_GPUS; ++i)
        {
            base_result.emplace_back(cur_results[i].clone());
        }
    }
    else
    {
        auto diffs = compareTensors(base_result[0], cur_results[0]);
        std::cout << "diffs:" << diffs.numel() << std::endl;
        if (diffs.numel() >= 10)
        {
            torch::Tensor firstTen = diffs.slice(0, 0, 10);
            std::cout << "First 10 idx:" << std::endl;
            std::cout << firstTen << std::endl;
            for (int i = 0; i < 10; ++i)
            {
                std::cout << "base: " << base_result[0][diffs[i][0]][diffs[i][1]] << std::endl;
                std::cout << "cur: " << cur_results[0][diffs[i][0]][diffs[i][1]] << std::endl;
            }
        }
    }


    for (int i = 0; i < NUM_GPUS; ++i)
    {
        NCCL_CALL(ncclCommDestroy(comms[i]));
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}

int main()
{
    std::cout << "Start test_ar" << std::endl;
    genRandomTensors(g_tensors);
    for (int i = 0; i < 100; i++)
    {
        std::cout << "===============" << i << " start =================" << std::endl;
        test_all_reduce(i);
        std::cout << "===============" << i << " end   =================" << std::endl;
    }
    return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(test_ar)


find_package(Torch REQUIRED)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_executable(${PROJECT_NAME} test_ar.cpp)
target_link_libraries(${PROJECT_NAME} "${TORCH_LIBRARIES}")
target_link_libraries(${PROJECT_NAME} "/usr/lib/x86_64-linux-gnu/libnccl.so")

set_property(TARGET ${PROJECT_NAME} PROPERTY CXX_STANDARD 17)

compile:

cmake -DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` .
make -j8

./test_ar
@Rainlin007 (Author)

[Screenshot: test output from iteration 59 showing a nonzero diff count and mismatched base/cur values]

@kiskra-nvidia (Member)

How reproducible is this issue? I see that your screenshot is from iteration 59...

What NCCL version is this? It looks like a single-node, 8-GPU run, correct? Could you include the debug output generated with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING (a single iteration should be sufficient; we don't need to see all 100, unless you observe a difference in NCCL's behavior for iterations that produce a different result)?

Are the GPUs connected by NVLinks? Can you rerun with NCCL_NVLS_ENABLE=0 and see if that eliminates the randomness?
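
For reference, assuming the ./test_ar binary built above, those two runs would be:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING ./test_ar
NCCL_NVLS_ENABLE=0 ./test_ar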

@marksantesson

Do you see differences on all iterations?

Can you tell me what the purpose of this line in test_all_reduce is?
genRandomTensors(cur_results);


Rainlin007 commented Oct 29, 2024

I find that with NCCL_NVLS_ENABLE=0 the results are consistent. Is this NVLS behavior expected, or is it a bug? @kiskra-nvidia

Here is some more info:

  1. The 59th iteration is just one example; roughly 1 in 10 iterations shows a diff (I didn't screenshot the others).
  2. Single node, 8 GPUs.
  3. The GPUs are connected by NVLink.
  4. nvidia-nccl-cu12 2.20.5
  5. Image: nvcr.io/nvidia/pytorch:24.09-py3


Rainlin007 commented Oct 29, 2024

Do you see differences on all iterations?

Can you tell me what the purpose of this line in test_all_reduce is? genRandomTensors(cur_results);

It's just there to initialize cur_results to 8 elements; it's not a good implementation.
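
For what it's worth, a simpler way to reserve the NUM_GPUS slots (without allocating extra random tensors on every call) would be something like:

cur_results.resize(NUM_GPUS); // default-constructed placeholder tensors; all_reduce_process overwrites each slot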

@kiskra-nvidia (Member)

I find that with NCCL_NVLS_ENABLE=0 the results are consistent. Is this NVLS behavior expected, or is it a bug?

NVLS is hardware-accelerated and is known to be less deterministic than the algorithms NCCL implements in software. Even among "classic" NCCL algorithms like RING and TREE, I believe TREE is considered more predictable/numerically stable because it needs fewer individual reduction operations (log N vs. N). So it's not a binary (yes/no) issue... Back to NVLS: this behavior is not something NCCL has any control over (other than not using NVLS in the first place), and it's a known trade-off (performance vs. stability/determinism) of current CUDA driver releases...
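
For intuition on why the number and order of reduction operations matter at all: floating-point addition is not associative, so combining the same values in a different order can produce a different rounded result. A minimal, NCCL-free sketch (the values are arbitrary, chosen only to make the rounding visible):

#include <cstdio>

int main()
{
    // Floating-point addition is not associative: the grouping of the
    // partial sums changes the rounded result.
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left = (a + b) + c;  // (0) + 1 == 1
    float right = a + (b + c); // b + c rounds back to -1e8, so the total is 0
    std::printf("left=%g right=%g\n", left, right);
    return 0;
}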

@dearsxx0918

@kiskra-nvidia Could you please help confirm that NVLS is not a deterministic algorithm? We also use this algorithm; if it is indeed non-deterministic, we may switch to ring for the sake of training convergence.
Hope to hear from you soon!


sjeaugey commented Nov 6, 2024

Yes it is a known issue with NVLS. We believe it can be fixed though, so we're working to make NVLS operation deterministic in a future driver.


152334H commented Nov 6, 2024

Yes it is a known issue with NVLS. We believe it can be fixed though, so we're working to make NVLS operation deterministic in a future driver.

Does this mean not deterministic between different collective calls, or not deterministic between different switches in e.g. a single allreduce collective involving multiple nodes?


sjeaugey commented Nov 6, 2024

Not deterministic between different calls. All ranks should always get the same results.
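
To verify that with the reproducer above, one could also compare ranks against each other within a single iteration (a sketch, assuming the globals from test_ar.cpp; .cpu() is used because the result tensors live on different devices):

for (int i = 1; i < NUM_GPUS; ++i)
{
    auto rank_diffs = compareTensors(cur_results[0].cpu(), cur_results[i].cpu());
    std::cout << "rank 0 vs rank " << i << " diffs: " << rank_diffs.numel() << std::endl;
}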
