Timers for the DEM example are confusing #277
Comments
Hi @cadop, I personally can't run this configuration myself, but can you modify warp/warp/examples/core/example_dem.py (line 178 in acfeabc) to use with wp.ScopedTimer("step", synchronize=True): ?
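For reference, here is a minimal, self-contained sketch of that suggested change; the kernel and substep loop below are placeholders, not the actual code around line 178:

import warp as wp

wp.init()

@wp.kernel
def integrate(x: wp.array(dtype=wp.vec3), v: wp.array(dtype=wp.vec3), dt: float):
    tid = wp.tid()
    x[tid] = x[tid] + v[tid] * dt

x = wp.zeros(1_000_000, dtype=wp.vec3)
v = wp.zeros(1_000_000, dtype=wp.vec3)

# synchronize=True makes the timer wait for the queued GPU work to finish,
# so the printed time reflects device execution rather than just launch overhead
with wp.ScopedTimer("step", synchronize=True):
    for _ in range(64):  # stand-in for the example's substep loop
        wp.launch(integrate, dim=x.size, inputs=[x, v, 1.0 / 60.0 / 64.0])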
I noticed similar behavior with the timing information printed to stdout at lower resolutions. Since you turned off rendering there are no CPU synchronizations, so the timers only measure the time to schedule the CUDA work rather than the time the GPU spends executing it. You can just run with synchronize=True on the timers to see the real cost.
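To illustrate why the unsynchronized numbers look so small on a CUDA device: kernel launches return to the CPU almost immediately, and the cost only shows up once you synchronize. A rough, self-contained sketch (the kernel here is a made-up workload, not anything from the example):

import time
import warp as wp

wp.init()

@wp.kernel
def busy(a: wp.array(dtype=float)):
    tid = wp.tid()
    x = float(tid)
    for i in range(1000):
        x = wp.sin(x)
    a[tid] = x

a = wp.zeros(1_000_000, dtype=float)

t0 = time.perf_counter()
wp.launch(busy, dim=a.size, inputs=[a])
t1 = time.perf_counter()   # the launch call returns almost immediately
wp.synchronize_device()
t2 = time.perf_counter()   # now the GPU work has actually finished

print(f"launch only: {(t1 - t0) * 1e3:.2f} ms, launch + sync: {(t2 - t0) * 1e3:.2f} ms")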
Be careful what you wish for... yes, synchronize made everything equally slow :). @shi-eric I can do some more exploration on my own, but given you have experience with this, is the stdout printing actually causing an issue? What causes the change in performance with synchronize?

For example, if I need to run the simulation with a large number of particles but only send back to the CPU some average particle position, I wouldn't need to copy the entire array of positions again. So if I am not rendering the points, would the original test be the time on the GPU?

For posterity, my original post was in a situation where I was rendering, just not the points.
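On the "only send back an average" point, a minimal sketch of that pattern: reduce on the GPU and copy back a single vec3 so the full positions array never leaves the device. Kernel and variable names here are illustrative, not taken from the example.

import warp as wp

wp.init()

@wp.kernel
def accumulate_positions(points: wp.array(dtype=wp.vec3), total: wp.array(dtype=wp.vec3)):
    tid = wp.tid()
    wp.atomic_add(total, 0, points[tid])

n = 1_000_000
points = wp.zeros(n, dtype=wp.vec3)  # stand-in for the simulation's particle positions
total = wp.zeros(1, dtype=wp.vec3)

wp.launch(accumulate_positions, dim=n, inputs=[points, total])

# only 12 bytes are copied back to the CPU here, not the full positions array
avg_position = total.numpy()[0] / n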
The issue here is that the default output of wp.ScopedTimer only measures the CPU time spent scheduling the work, not the time the GPU takes to execute it; the profiling documentation goes into more detail. My guess about what was making the larger run so slow is …
That doc definitely helps; I'll dig into this more. The profiler output is pretty confusing within the DEM example. I turned rendering back on, but I am getting …
Hi @cadop, I agree that the output of the timers is a little confusing/misleading. The "step" timer only measures the time to schedule the CUDA work, not the time taken on the GPU to do the work. When rendering is enabled, there's generally some readback as well, as you noticed.

We can improve the existing timers to make them more accurate. Setting synchronize=True forces the CPU to wait for the GPU to finish before the timer stops, so the reported time includes the actual GPU execution. We could separate the readback under its own timer, but it would complicate the examples a bit (and we should do it in all the examples for consistency).

The examples are meant to be short and simple; they're not really meant to be precise benchmarks. I think doing synchronized timings should clarify the timings sufficiently. Doing proper CPU/GPU timings can be a bit of a dark art sometimes :)
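For illustration only, separating the readback under its own timer could look roughly like the sketch below; the class and attribute names are hypothetical stand-ins, not the example's actual code:

import warp as wp

wp.init()

class ExampleSketch:
    # stripped-down stand-in for the example class
    def __init__(self):
        self.positions = wp.zeros(1_000_000, dtype=wp.vec3)

    def render(self):
        # measure the device-to-host copy on its own instead of folding it into "render"
        with wp.ScopedTimer("readback", synchronize=True):
            points_host = self.positions.numpy()
        with wp.ScopedTimer("render"):
            pass  # hand points_host to the renderer here

ExampleSketch().render()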
Bug Description
DEM example becomes extremely slow with higher particle count.
warp/warp/examples/core/example_dem.py (line 136 in acfeabc)
When using
self.points = self.particle_grid(1024, 1024, 256, (0.0, 0.5, 0.0), self.point_radius, 0.1)
There is no issue, and each step takes under 0.5 ms.
When using
self.points = self.particle_grid(1024, 1024, 512, (0.0, 0.5, 0.0), self.point_radius, 0.1)
The first ~20 or so frames are fine, and then each step starts taking minutes to compute.
(*The rendering timings are only like that because I commented out the point renderer.)
This is the difference between roughly 270 million and 540 million particles, so I don't think it's an order-of-magnitude difference that should cause this problem.
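For reference, a quick back-of-the-envelope check of the two grid sizes (assuming 12 bytes per float32 vec3 position; the other per-particle buffers add more on top):

# rough particle counts and position-array sizes for the two grids
for nz in (256, 512):
    n = 1024 * 1024 * nz
    print(f"nz={nz}: {n / 1e6:.0f} M particles, positions alone ~{n * 12 / 1e9:.1f} GB")
# nz=256: 268 M particles, positions alone ~3.2 GB
# nz=512: 537 M particles, positions alone ~6.4 GB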
If this isn't an issue, please feel free to transfer it to Discussions as a Q&A!
System Information
Win11
Warp 1.2.2 initialized:
CUDA Toolkit 11.8, Driver 12.5
Devices:
"cpu" : "AMD64 Family 25 Model 24 Stepping 1, AuthenticAMD"
"cuda:1" : "NVIDIA RTX A6000" (48 GiB, sm_86, mempool enabled)