Timers for the DEM example are confusing #277

Open
cadop opened this issue Jul 26, 2024 · 7 comments
Labels: question (The issue author requires information)

Comments

@cadop
Contributor

cadop commented Jul 26, 2024

Bug Description

The DEM example becomes extremely slow with a higher particle count.

self.points = self.particle_grid(32, 64, 32, (0.0, 0.5, 0.0), self.point_radius, 0.1)

When using
self.points = self.particle_grid(1024, 1024, 256, (0.0, 0.5, 0.0), self.point_radius, 0.1)
There is no issue, and each step takes under 0.5 ms.

When using
self.points = self.particle_grid(1024, 1024, 512, (0.0, 0.5, 0.0), self.point_radius, 0.1)
The first ~20 or so frames are fine, and then each step starts taking minutes to compute.
[screenshot of timer output omitted]

(The timings for rendering look that way only because I commented out the point renderer.)

This is the difference between roughly 270 million and 540 million particles, so I don't think it's an order-of-magnitude difference that should cause this problem.

If this isn't an issue, please feel free to transfer it to Discussions as a Q&A!

System Information

Win11
Warp 1.2.2 initialized:
CUDA Toolkit 11.8, Driver 12.5
Devices:
"cpu" : "AMD64 Family 25 Model 24 Stepping 1, AuthenticAMD"
"cuda:1" : "NVIDIA RTX A6000" (48 GiB, sm_86, mempool enabled)

@cadop cadop added the bug Something isn't working label Jul 26, 2024
@christophercrouzet
Member

Hi @cadop, I personally can't run self.points = self.particle_grid(1024, 1024, 256, (0.0, 0.5, 0.0), self.point_radius, 0.1) because my GPU runs out of memory, so I'm assuming the same might be happening in your case when you crank up the resolution a bit further. Can you check the memory usage of your GPU?

@cadop
Contributor Author

cadop commented Jul 29, 2024

Hey,

It doesn't seem to be the case; for 1024, 1024, 512, this is a screenshot of the memory usage.

[screenshot of GPU memory usage omitted]

If it were a memory issue, why would it run for the first few frames and not raise a memory error later?

If I go to 1024, 1024, 1024, I do get an out-of-memory error.

@shi-eric
Contributor

@cadop: Can you change with wp.ScopedTimer("step"): to with wp.ScopedTimer("step", synchronize=True):?

I noticed similar behavior with the timing information printed to stdout at lower resolutions. Since you turned off rendering, there are no CPU synchronizations, so the wp.ScopedTimer timings become quite confusing unless you use synchronize=True.

You can just run with python example_dem.py --stage_path None to skip the render step.
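For illustration, here is a minimal, self-contained sketch of what that flag does (the scale kernel and array size below are made up, not taken from the DEM example):

import warp as wp

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=float), s: float):
    i = wp.tid()
    x[i] = x[i] * s

x = wp.zeros(1_000_000, dtype=float)

# With synchronize=True the timer waits for all outstanding GPU work to
# finish before reporting, so it measures execution time rather than just
# the time needed to enqueue the launch.
with wp.ScopedTimer("step", synchronize=True):
    wp.launch(scale, dim=x.shape[0], inputs=[x, 2.0])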

@cadop
Contributor Author

cadop commented Jul 29, 2024

Be careful what you wish for... yes synchronize made everything equally slow :).

@shi-eric I can do some more exploration on my own, but given your experience with this, is the printing to stdout actually causing an issue? What causes the change in performance with synchronize?

For example, if I need to run the simulation with large numbers of particles but only send back to the CPU some average particle position, I wouldn't need to copy the entire array of positions again. So if I am not rendering the points, would the original test reflect the time spent on the GPU?
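Something like this hypothetical sketch is what I have in mind (the accumulate_positions kernel and sizes are made up, not taken from the example): sum the positions on the GPU and copy back only a single vec3 instead of the full array.

import warp as wp

wp.init()

@wp.kernel
def accumulate_positions(points: wp.array(dtype=wp.vec3), total: wp.array(dtype=wp.vec3)):
    i = wp.tid()
    # Sum every position into a single element that stays on the GPU.
    wp.atomic_add(total, 0, points[i])

points = wp.zeros(1_000_000, dtype=wp.vec3)
total = wp.zeros(1, dtype=wp.vec3)

wp.launch(accumulate_positions, dim=points.shape[0], inputs=[points, total])

# .numpy() synchronizes, but only a single vec3 is copied back to the host.
mean_position = total.numpy()[0] / points.shape[0]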

For posterity: my original post was in a situation where I was rendering, just not the points.
[screenshot of timer output omitted]

@shi-eric
Contributor

What causes the change in performance with synchronize?

Due to the asynchronous nature of the CUDA API, the default output of wp.ScopedTimer is usually not the time it takes for a block of code to execute on the GPU; instead, it typically measures the time it takes to submit the work to the GPU. @nvlukasz recently expanded our docs about this: https://nvidia.github.io/warp/profiling.html#scopedtimer

My guess as to why the wp.ScopedTimer in your original question started reporting times matching the time to execute rather than the time to submit is that, after a certain number of calls, the GPU refuses to accept more items into the queue until it finishes processing the current item. @nvlukasz, what do you think?
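A toy comparison illustrates the distinction (the saxpy kernel and sizes here are made up, not from the DEM example):

import warp as wp

wp.init()

@wp.kernel
def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
    i = wp.tid()
    y[i] = a * x[i] + y[i]

x = wp.zeros(10_000_000, dtype=float)
y = wp.zeros(10_000_000, dtype=float)

# Only measures how long it takes to enqueue the launches on the CUDA stream.
with wp.ScopedTimer("submit"):
    for _ in range(100):
        wp.launch(saxpy, dim=x.shape[0], inputs=[x, y, 2.0])

# Also waits for the GPU to finish, so the reported time includes execution.
with wp.ScopedTimer("submit + execute", synchronize=True):
    for _ in range(100):
        wp.launch(saxpy, dim=x.shape[0], inputs=[x, y, 2.0])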

@cadop
Contributor Author

cadop commented Jul 30, 2024

That doc definitely helps; I'll dig into this more.

The profiler is pretty confusing within the DEM example. I turned rendering back on, but I am getting render took 3932 ms ... step took 0.25 ms. I am guessing this is because self.x.numpy() is in the render scope, so the sync doesn't happen in the step. Maybe synchronize should be enabled by default (in the demo)? Or I could make a PR to move the numpy call out of the renderer and put it in the timer scope.

@nvlukasz
Contributor

Hi @cadop, I agree that the output of the timers is a little confusing/misleading. The "step" timer only measures the time to schedule the CUDA work, not the time taken on the GPU to do the work. When rendering is enabled, there's generally some readback, as you noticed with the .numpy() calls. This readback is synchronous, so the "render" timer includes the time spent waiting for the previous GPU step to complete, which is misleading.

We can improve the existing timers to make them more accurate. Setting ScopedTimer(..., synchronize=True) would go a long way to clear up some confusion. That will show accurate measurements of how long each step and render takes, including the CPU and GPU work.

We could separate the readback under its own timer, but it would complicate the examples a bit (and we should do it in all the examples for consistency). The examples are meant to be short and simple; they're not really meant to be precise benchmarks. I think doing synchronized timings should clarify the timings sufficiently.

Doing proper CPU/GPU timings can be a bit of a dark art sometimes :)
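As a rough sketch of what separating the readback could look like (positions here is a stand-in array, not the example's actual state):

import warp as wp

wp.init()

positions = wp.zeros(1_000_000, dtype=wp.vec3)

with wp.ScopedTimer("render", synchronize=True):
    # The nested timer shows how much of the render time is spent waiting
    # for earlier GPU work and copying data back inside .numpy().
    with wp.ScopedTimer("readback", synchronize=True):
        points_host = positions.numpy()

    # ... hand points_host to the renderer here ...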

@shi-eric shi-eric added question The issue author requires information and removed bug Something isn't working labels Jul 30, 2024
@shi-eric shi-eric changed the title [BUG] Large particle count for DEM example extremely slow Timers for the DEM example are confusing Jul 30, 2024