Timers for the DEM example are confusing #277

Open
cadop opened this issue Jul 26, 2024 · 7 comments
Labels: question (The issue author requires information)

Comments

@cadop
Contributor

cadop commented Jul 26, 2024

Bug Description

The DEM example becomes extremely slow with a higher particle count.

self.points = self.particle_grid(32, 64, 32, (0.0, 0.5, 0.0), self.point_radius, 0.1)

When using
self.points = self.particle_grid(1024, 1024, 256, (0.0, 0.5, 0.0), self.point_radius, 0.1)
There is no issue, and each step takes under 0.5 ms.

When using
self.points = self.particle_grid(1024, 1024, 512, (0.0, 0.5, 0.0), self.point_radius, 0.1)
The first ~20 or so frames are fine, and then each step starts taking minutes to compute.
[screenshot of timer output omitted]

(The timings for rendering look that way only because I commented out the point renderer.)

This is the difference between roughly 270 million and 540 million particles, so I don't think it's an order-of-magnitude difference that should cause this problem.

If this isn't an issue, please feel free to transfer it to Discussions as a Q&A!

System Information

Win11
Warp 1.2.2 initialized:
CUDA Toolkit 11.8, Driver 12.5
Devices:
"cpu" : "AMD64 Family 25 Model 24 Stepping 1, AuthenticAMD"
"cuda:1" : "NVIDIA RTX A6000" (48 GiB, sm_86, mempool enabled)

@cadop cadop added the bug Something isn't working label Jul 26, 2024
@christophercrouzet
Member

Hi @cadop, I personally can't run self.points = self.particle_grid(1024, 1024, 256, (0.0, 0.5, 0.0), self.point_radius, 0.1) because my GPU runs out of memory, so I'm assuming the same might be happening in your case when you crank up the resolution a bit further. Can you check the memory usage of your GPU?

@cadop
Contributor Author

cadop commented Jul 29, 2024

Hey,

It doesn't seem to be the case; for 1024, 1024, 512, this is a screenshot of the memory usage.

[screenshot of GPU memory usage omitted]

If it were a memory issue, why would it run for the first few frames and not raise a memory error later?

If I go to 1024, 1024, 1024, I do get an out-of-memory error.

@shi-eric
Contributor

@cadop: Can you change with wp.ScopedTimer("step"): to with wp.ScopedTimer("step", synchronize=True):?

I noticed similar behavior with the timing information printed to stdout at lower resolutions. Since you turned off rendering, there are no CPU synchronizations, so the wp.ScopedTimer timings become quite confusing unless you use synchronize=True.

You can just run with python example_dem.py --stage_path None to skip the render step.
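For illustration, here is a minimal, self-contained sketch of what that flag does (the scale kernel and array size below are made up, not taken from the DEM example):

import warp as wp

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=float), s: float):
    i = wp.tid()
    x[i] = x[i] * s

x = wp.zeros(1_000_000, dtype=float)

# With synchronize=True the timer waits for all outstanding GPU work to
# finish before reporting, so it measures execution time rather than just
# the time needed to enqueue the launch.
with wp.ScopedTimer("step", synchronize=True):
    wp.launch(scale, dim=x.shape[0], inputs=[x, 2.0])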

@cadop
Contributor Author

cadop commented Jul 29, 2024

Be careful what you wish for... yes synchronize made everything equally slow :).

@shi-eric I can do some more exploration on my own, but given your experience with this, is the printing to stdout actually causing an issue? What causes the change in performance with synchronize?

For example, if I need to run the simulation with large numbers of particles but only send back to the CPU some average particle position, I wouldn't need to copy the entire array of positions again. So if I am not rendering the points, would the original test reflect the time spent on the GPU?
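Something like this hypothetical sketch is what I have in mind (the accumulate_positions kernel and sizes are made up, not taken from the example): sum the positions on the GPU and copy back only a single vec3 instead of the full array.

import warp as wp

wp.init()

@wp.kernel
def accumulate_positions(points: wp.array(dtype=wp.vec3), total: wp.array(dtype=wp.vec3)):
    i = wp.tid()
    # Sum every position into a single element that stays on the GPU.
    wp.atomic_add(total, 0, points[i])

points = wp.zeros(1_000_000, dtype=wp.vec3)
total = wp.zeros(1, dtype=wp.vec3)

wp.launch(accumulate_positions, dim=points.shape[0], inputs=[points, total])

# .numpy() synchronizes, but only a single vec3 is copied back to the host.
mean_position = total.numpy()[0] / points.shape[0]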

For posterity: my original post was in a situation where I was rendering, just not the points.
[screenshot of timer output omitted]

@shi-eric
Contributor

What causes the change in performance with synchronize?

Due to the asynchronous nature of the CUDA API, the default output of wp.ScopedTimer is usually not the time it takes for a block of code to execute on the GPU; instead, it typically measures the time it takes to submit the work to the GPU. @nvlukasz recently expanded our docs about this: https://nvidia.github.io/warp/profiling.html#scopedtimer

My guess as to why the wp.ScopedTimer in your original question started reporting times matching the time to execute rather than the time to submit is that, after a certain number of calls, the GPU refuses to accept more items into the queue until it finishes processing the current item. @nvlukasz, what do you think?
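A toy comparison illustrates the distinction (the saxpy kernel and sizes here are made up, not from the DEM example):

import warp as wp

wp.init()

@wp.kernel
def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
    i = wp.tid()
    y[i] = a * x[i] + y[i]

x = wp.zeros(10_000_000, dtype=float)
y = wp.zeros(10_000_000, dtype=float)

# Only measures how long it takes to enqueue the launches on the CUDA stream.
with wp.ScopedTimer("submit"):
    for _ in range(100):
        wp.launch(saxpy, dim=x.shape[0], inputs=[x, y, 2.0])

# Also waits for the GPU to finish, so the reported time includes execution.
with wp.ScopedTimer("submit + execute", synchronize=True):
    for _ in range(100):
        wp.launch(saxpy, dim=x.shape[0], inputs=[x, y, 2.0])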

@cadop
Contributor Author

cadop commented Jul 30, 2024

That doc definitely helps; I'll dig into this more.

The profiler is pretty confusing within the DEM example. I turned rendering back on, but I am getting render took 3932 ms ... step took 0.25 ms. I am guessing this is because self.x.numpy() is in the render scope, so the sync doesn't happen in the step. Maybe synchronize should be enabled by default (in the demo)? Or I could make a PR to move the numpy call out of the renderer and put it in the timer scope.

@nvlukasz
Contributor

Hi @cadop, I agree that the output of the timers is a little confusing/misleading. The "step" timer only measures the time to schedule the CUDA work, not the time taken on the GPU to do the work. When rendering is enabled, there's generally some readback, as you noticed with the .numpy() calls. This readback is synchronous, so the "render" timer includes the time spent waiting for the previous GPU step to complete, which is misleading.

We can improve the existing timers to make them more accurate. Setting ScopedTimer(..., synchronize=True) would go a long way to clear up some confusion. That will show accurate measurements of how long each step and render takes, including the CPU and GPU work.

We could separate the readback under its own timer, but it would complicate the examples a bit (and we should do it in all the examples for consistency). The examples are meant to be short and simple; they're not really meant to be precise benchmarks. I think doing synchronized timings should clarify the timings sufficiently.

Doing proper CPU/GPU timings can be a bit of a dark art sometimes :)
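As a rough sketch of what separating the readback could look like (positions here is a stand-in array, not the example's actual state):

import warp as wp

wp.init()

positions = wp.zeros(1_000_000, dtype=wp.vec3)

with wp.ScopedTimer("render", synchronize=True):
    # The nested timer shows how much of the render time is spent waiting
    # for earlier GPU work and copying data back inside .numpy().
    with wp.ScopedTimer("readback", synchronize=True):
        points_host = positions.numpy()

    # ... hand points_host to the renderer here ...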

@shi-eric shi-eric added question The issue author requires information and removed bug Something isn't working labels Jul 30, 2024
@shi-eric shi-eric changed the title [BUG] Large particle count for DEM example extremely slow Timers for the DEM example are confusing Jul 30, 2024