Speeding up sum reductions in ADADELTA by using Tensor Cores #252
base: develop
Conversation
…ased reduction method to work correctly.
@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.
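A minimal sketch of what such compile-time encapsulation could look like (the TENSOR macro and helper names here are illustrative placeholders, not the PR's actual symbols):

```cuda
#include <cuda_fp16.h>

// Existing reduction path: pairwise sums via warp shuffles (always compiled).
__device__ __forceinline__ float warp_shuffle_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// The WMMA-based reduction is only compiled when the build requests it
// (here assumed to define TENSOR) and when the target architecture supports
// half-precision WMMA (compute capability 7.0 or newer), so older CUDA
// toolkits and cards keep building and running the original path unchanged.
#if defined(TENSOR) && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
__device__ float tensor_core_sum(const half* data, const half* ones,
                                 float* scratch);  // declaration only in this sketch
#endif
```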
…eductions in ADADELTA.
OK, let me know if the …
@L30nardoSV I tested on one of our Nvidia Quadro RTX A5000 cards and I do see a nice speedup for the 3ce3 example input. Docking time of the PR with …

To evaluate a bit further I used Diogo's test set of 42 ligands, here are the results:

Reference: …

This PR: …

For multiple differently sized ligands with the typical settings … it turns out that for larger systems the speedup can turn into a slowdown. It looks like the average number of evals w/ AutoStop changed in the PR, which could potentially point to a minute difference in calculation (I did test multiple times to make sure this wasn't just an unlucky run). @diogomart Please run your E50 tests for the Cuda version.
@L30nardoSV Thank you for the encapsulation :-)
…nsor Cores and error correction (TCEC).
Can you please check commit b2ab3fe that incorporates the WMMA Extension for single-precision matmul on Tensor Cores + error correction (TCEC)?
Ref: https://github.com/wmmae/wmma_extension/blob/main/docs/mma_f32.md
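For context, the core idea behind such an error-corrected scheme is to split each fp32 value into a half-precision main part plus a half-precision residual before the tensor-core multiply, and to add the partial products back in fp32. The snippet below is only a rough sketch of that splitting idea, not the wmma_extension library's actual implementation or API:

```cuda
#include <cuda_fp16.h>

// Rough sketch of the error-correction idea (hypothetical helper, not the
// library's code): the fp32 value x is represented as hi + lo, where both
// parts fit in fp16.  A reduction J*D can then be evaluated as
// J*D_hi + J*D_lo (one extra multiply per fragment), recovering most of the
// precision that a plain fp32->fp16 conversion would lose.
__device__ void split_fp32(float x, half& hi, half& lo)
{
    hi = __float2half(x);                     // main part
    lo = __float2half(x - __half2float(hi));  // residual missed by the main part
}
```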
@L30nardoSV I ran the newest version and here are the results (with OpenCL from before as comparison; note: I compiled w/o OVERLAP so the last column takes a bit longer, but compute times are unaffected):
While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup (if you normalize by the total number of evaluations).
Thanks, I look forward to seeing whether the search efficiency is fine at least.
I'll get to this soon.
@L30nardoSV sorry for the long delay. I don't see an improvement, unfortunately. Very similar results to the previous commit.
@diogomart thanks!
With TENSOR=ON yes, but not with TCEC=ON.
Can you please try again with TCEC=ON?
Thank you @diogomart! Glad the search performance is back to normal. There is no measurable performance benefit though, and it adds a wrinkle between the Cuda and OpenCL code paths. So, I am not sure what to do with this PR ...
What is the overall runtime divided by the overall number of evals? (and what is the std. error?)
(and by runtime, I really mean docking time - although with overlap, runtime is probably good enough)
Also, did you run on the same type of GPU?
No, this was a mix of RTX 5000 and RTX 6000 cards.
Here is the data for the reduced set of 42:
In other words, I've seen 184,850,000 microseconds / 80,037,847 evals = 2.31 microseconds per eval vs 2.20 microseconds per eval for this PR, which would be about 5% faster ... except that that's right around what I would estimate AutoStop to fluctuate by ... I'll run with a fixed number of evals later to get one more data point, and try to get on one of our newest cards as well to see if newer tensor cores might be beneficial.
Thanks @diogomart and @atillack! Glad to see that by enabling the error correction code (via TCEC) the docking quality is back to normal.

I look forward to seeing those numbers :)
@L30nardoSV There is a speedup with increasingly tensor-core-heavy cards and with smaller ligands \o/ Please go ahead and remove the non-TCEC code. Here is the data using varying combinations of AutoStop and Heuristics (NUMWI=128, TARGETS=86) for an RTX A5000:

From this data it seems that small ligands benefit more than larger ones (the fraction of evals spent on larger ligands increases with Heuristics and AutoStop, compared to every ligand getting 2.5M evals).

Data for an RTX A6000 Ada showing an overall similar relative speedup (compiled same as above):

Data for an H100 (compiled same as above, except TARGETS=90):

So with a card with very strong tensor cores there's a bit more of a benefit ... Interestingly, for docking, the newer Ada-generation A6000 cards are still much better overall.
@atillack Many thanks for the detailed evaluation! I will get the non-TCEC code removed in the next few days.
@atillack @diogomart Please test commit 162a850. Only …
@L30nardoSV I am in favor of merging this PR as an option to "future-proof" the Cuda branch a bit. There are still a couple of modifications needed - those should be minor, but I have since unified the host code, for example, and made some other small performance tweaks. There is no free lunch though: theoretical FP32 flops w/ vs. w/o tensor cores and memory bandwidth stay the same; it's just a question of which implementation is more efficient (unsurprisingly, Nvidia's WMMA does a good job). So even with this PR and the tensor cores in use, the Cuda path is currently still at best as fast as the OpenCL path (which typically is about 5% faster). There is a chance, though, that some of the changes I made will also speed up this PR ;-)

OpenCL runs of the most recent code (261c91f) run on the same nodes as above:
@L30nardoSV @diogomart Code is now updated to the current develop branch; here is a quick benchmark on the A6000 Ada showing a speedup (!) of TENSOR=ON over OpenCL of between 1 and 4%.
@diogomart Please rerun verification :-)
…d error message if device is below.
@L30nardoSV @diogomart I optimized and cleaned up the WMMA code a bit, added checks to make sure the device we're running on is able to run the tensor core sum reductions, and now also automatically set the minimum compute capability to 8.0 so …

From my end that's all the code changes, and I'll approve/merge when Diogo's regression check is successful.
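A minimal sketch of what such a host-side capability check might look like (illustrative helper name and message; the PR's actual check may differ), using the standard CUDA runtime device query:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative check (not the PR's actual code): the tensor-core sum
// reductions are assumed to require compute capability 8.0 or newer, so the
// device is queried and a clear error message is printed if it is below that.
bool device_supports_tensor_reduction(int device_id)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess)
        return false;
    if (prop.major < 8) {
        fprintf(stderr,
                "Device %d (%s, CC %d.%d) cannot run the tensor-core sum "
                "reductions; rebuild without TENSOR=ON or choose another GPU.\n",
                device_id, prop.name, prop.major, prop.minor);
        return false;
    }
    return true;
}
```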
Thank you @atillack. |
Thank you! I've been looking everywhere for numbers like this.
Hi,
This PR aims to increase the performance of the CUDA version by leveraging the Tensor Core Units (TCUs) present in recent NVIDIA GPUs.
The idea is to re-implement the sum reductions as matrix operations (i.e., by using NVIDIA Warp Matrix Functions), which can be offloaded to TCUs.
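As a rough illustration of the approach (a minimal sketch under assumed names and layout, not the PR's actual kernel code): a warp can sum 256 values laid out as a 16x16 half-precision matrix D by computing J*D with a single WMMA multiply, where J is the all-ones matrix; each row of the result then holds the 16 column sums of D, which a handful of warp shuffles collapse into the final total.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Hypothetical sketch: sum 256 half values (row-major 16x16 matrix in shared
// memory) with one tensor-core multiply.  smem_ones holds 256 halves all set
// to 1.  Compiling the WMMA intrinsics requires compute capability 7.0+.
__device__ float warp_sum_256(const half* smem_data, const half* smem_ones,
                              float* smem_scratch /* 256 floats */)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, smem_ones, 16);   // A = J (all ones)
    wmma::load_matrix_sync(b_frag, smem_data, 16);   // B = D (the data)
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C = J*D: column sums of D
    wmma::store_matrix_sync(smem_scratch, c_frag, 16, wmma::mem_row_major);
    __syncwarp();

    // Every row of C is identical, so lanes 0..15 each grab one column sum
    // and a shuffle reduction yields the grand total, broadcast to all lanes.
    const int lane = threadIdx.x & 31;
    float v = (lane < 16) ? smem_scratch[lane] : 0.0f;
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return __shfl_sync(0xffffffff, v, 0);
}
```

The precision cost of converting the fp32 partial values to half for the fragments is what the TCEC error correction discussed earlier in the thread compensates for.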
Experiments on A100 GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 test):

Experiments on RTX3050Ti GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test):

The baseline implementation for this PR has been taken from this paper: Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores. The contribution of both authors, Gabin Schieffer (@gabin-s) and Ivy Peng (@bopkth), is acknowledged in this PR as well.