The number of thread blocks needs to be a divisor of N, which is a template parameter of measure. Otherwise, many threads will do too much work.
From line 144 onward, use only multiples of 1024 as the template parameter. On some GPUs that do not have as large an L1 cache, the amount of work per thread would be very small, and the performance would actually be worse.
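To illustrate the divisibility point, here is a minimal host-side sketch. It assumes the work is split as a strided loop, where worker t handles items t, t + stride, t + 2 * stride, and so on. The names elementsForThread and isBalanced are hypothetical and not taken from the benchmark, and the actual kernel may distribute work differently.

```cpp
#include <cassert>

// Hypothetical helper, not part of the benchmark: number of items that
// worker `t` processes when `stride` workers loop over `n` items in a
// strided fashion (t, t + stride, t + 2 * stride, ...).
constexpr int elementsForThread(int n, int stride, int t) {
    if (t >= n) return 0;                 // this worker gets no item at all
    return (n - t + stride - 1) / stride; // ceil((n - t) / stride)
}

// Assumption: work is evenly spread exactly when stride divides n.
constexpr bool isBalanced(int n, int stride) {
    return n % stride == 0;
}
```

With n = 2048 and stride = 1024, every worker gets two items. With n = 1536 and stride = 1024, workers 0 to 511 get two items while workers 512 to 1023 get only one, so half of them do twice the work of the others.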
Hey, I noticed that the block size in the gpu-cache test is 256. Why is it not 1024? When I changed the block size from 256 to 1024, the measured L1 cache bandwidth improved somewhat and fluctuated more.
blocksize = 256 results as follows:
blocksize = 1024 results as follows:
My device is an A800 80GB PCIe.