
MPI benchmark driver #138

Open · thomasgibson wants to merge 4 commits into develop

Conversation

thomasgibson
Contributor

This PR modifies the current driver main.cpp and adds MPI support for launching the benchmark across multiple devices. The main takeaways:

  • Each MPI rank is assigned a specific GPU and launches the benchmark.
  • There is no direct GPU-to-GPU communication.
    • For the dot kernel, the resulting partial sums are reduced across all MPI ranks (on the host) and broadcast back to each rank (via MPI_Allreduce; see the sketch after this list).
    • Benchmark error checking is performed on all ranks.
  • Measured bandwidths are aggregated across all ranks.
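
A minimal sketch of the pattern described above (not the PR's actual code; the CUDA calls and the OMPI_COMM_WORLD_LOCAL_RANK variable are illustrative assumptions, since device selection is model- and launcher-specific):

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdlib>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Pick one GPU per rank, e.g. by node-local rank
  // (OMPI_COMM_WORLD_LOCAL_RANK is an Open MPI convention).
  int num_devices;
  cudaGetDeviceCount(&num_devices);
  const char *local = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK");
  const int local_rank = local ? std::atoi(local) : rank;
  cudaSetDevice(local_rank % num_devices);

  // ... each rank runs the benchmark kernels on its own device ...

  // Dot kernel: reduce the per-rank partial sums on the host and hand the
  // global result back to every rank (MPI_Allreduce = reduce + broadcast),
  // so that error checking can run on all ranks.
  double local_sum = 0.0; // produced by this rank's dot kernel
  double global_sum;
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}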

The only major question I have is how MPI should be handled in the CMake build. I am open to suggestions and happy to comply with whatever you all prefer.

Contributor

@tomdeakin tomdeakin left a comment


Please see above for comments on this. I'm also nervous about merging this without some evidence that it works with all supported models.

Can you also think about how we can update the CI to test that this is valid?

src/main.cpp
@@ -137,8 +182,15 @@ std::vector<std::vector<double>> run_all(Stream<T> *stream, T& sum)
timings[3].push_back(std::chrono::duration_cast<std::chrono::duration<double> >(t2 - t1).count());

// Execute Dot
#if USE_MPI
// Synchronize ranks before computing dot-product
MPI_Barrier(MPI_COMM_WORLD);
Contributor


This overhead is not timed, and waits for 4 kernels to finish before starting the timer for dot. Can you explain the motivation for not including synchronisation time for each kernel?
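
For reference, the timing pattern in question looks roughly like this (reassembled from the diff context above; stream->dot() and the timings index are assumptions, not quotes from the PR):

#if USE_MPI
// Untimed: the barrier completes before t1 is taken, so rank skew and the
// wait for the preceding kernels never appear in the dot timing.
MPI_Barrier(MPI_COMM_WORLD);
#endif
t1 = std::chrono::high_resolution_clock::now();
sum = stream->dot(); // only the dot kernel itself falls inside the timer
t2 = std::chrono::high_resolution_clock::now();
timings[4].push_back(std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count());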

{
// MiB = 2^20
std::cout << std::setprecision(1) << std::fixed
#if USE_MPI
Contributor


There must be a better way to do this, as this is hard to read. Maybe create a string variable with the label, initialised depending on the MPI case, e.g.:

#ifdef USE_MPI
const char * array_size_str = "Array size (per rank): ";
#else
const char * array_size_str = "Array size: ";
#endif
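
The label could then be streamed unconditionally, e.g. (ARRAY_SIZE stands in for whatever size variable is in scope):

std::cout << array_size_str << ARRAY_SIZE << std::endl;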

#if USE_MPI
MPI_Datatype MPI_DTYPE = use_float ? MPI_FLOAT : MPI_DOUBLE;

// Collect global min/max timings
Contributor


It's not clear that the output will be for a single device, even though this is an MPI code. Is that what we expect, or do we want to (additionally?) report aggregate bandwidth?
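
For concreteness, one way an aggregate figure could be computed (a sketch under assumptions, not the PR's code): since the slowest rank bounds the whole job, the per-rank best times combine with MPI_MAX and the byte count scales with the number of ranks.

#include <mpi.h>
#include <algorithm>
#include <vector>

// Sketch: aggregate bandwidth for one kernel, identical on all ranks.
// 'timings' holds this rank's per-iteration times; the first (warm-up)
// iteration is skipped, as in the existing per-device reporting.
double aggregate_bandwidth(const std::vector<double>& timings, double bytes_per_rank)
{
  double local_min = *std::min_element(timings.begin() + 1, timings.end());

  double slowest; // the slowest rank's best time bounds the aggregate
  MPI_Allreduce(&local_min, &slowest, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

  int num_ranks;
  MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
  return (num_ranks * bytes_per_rank) / slowest;
}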

<< std::left << std::setw(12) << std::setprecision(5) << average
<< std::endl;
<< "--------------------------------"
<< std::endl << std::fixed
Contributor


Why do you need to change this? Especially adding the line?

@tomdeakin tomdeakin added this to the v6.0 milestone Oct 6, 2023
@tomdeakin tomdeakin removed this from the v6.0 milestone May 13, 2024
@tomdeakin
Contributor

We've got a large general refactor of the main driver coming in #186.
We should also think some more about what bandwidth we expect an MPI+X version to measure, given that there is no communication apart from the dot product. I think we discussed it, but it would be good to document the reasons for wanting MPI+X versions of BabelStream versus running this benchmark on multiple nodes concurrently with pdsh, srun, etc., and post-processing the results.
