
Speeding up sum reductions in ADADELTA by using Tensor Cores #252

Open · wants to merge 15 commits into develop

Conversation

L30nardoSV
Member

Hi,

This PR aims to increase the performance of the CUDA version by leveraging the Tensor Core Units (TCUs) present in recent NVIDIA GPUs.

The idea is to re-implement the sum reductions as matrix operations (i.e., by using NVIDIA Warp Matrix Functions), which can be offloaded to TCUs.
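For illustration, here is a minimal sketch of the idea (this is not the PR's actual kernel code; the helper name, the 256-element block size, and the shared-memory scratch buffer are assumptions for the example): a warp reduces 256 half-precision values by multiplying a matrix of ones with the data matrix in one 16x16x16 WMMA operation, then adds the 16 resulting column sums.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative tensor-core sum reduction: C = ones(16x16) * A(16x16) leaves
// the column sums of A in every row of C, so a single WMMA operation reduces
// 256 values down to 16 partial sums.
__device__ float tensor_core_sum_256(const half* vals,  // 256 values, row-major 16x16
                                     float* scratch)    // >= 256 floats of shared memory
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> data;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f));
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(data, vals, 16);

    // acc[i][j] = sum_k vals[k][j] for every row i (column sums of the data)
    wmma::mma_sync(acc, ones, data, acc);
    wmma::store_matrix_sync(scratch, acc, 16, wmma::mem_row_major);
    __syncwarp();

    // Add the 16 column sums (row 0 of the stored accumulator) to get the total.
    float total = 0.0f;
    for (int j = 0; j < 16; ++j)
        total += scratch[j];
    return total;
}
```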

Experiments on an A100 GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 test):

| Docking time | Original | Tensor |
|---|---|---|
| In seconds | 0.8 | 0.6 |

Experiments on an RTX 3050 Ti GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test):

| Docking time | Original | Tensor |
|---|---|---|
| In seconds | 2.4 | 1.7 |

The baseline implementation for this PR has been taken from this paper: Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores. The contribution of both authors, Gabin Schieffer (@gabin-s) and Ivy Peng (@bopkth), is acknowledged in this PR as well:

Schieffer, Gabin, and Ivy Peng. "Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores." In European Conference on Parallel Processing, pp. 608-622. Cham: Springer Nature Switzerland, 2023.

@atillack
Member

@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

@L30nardoSV
Member Author

> @L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

OK, let me know if the TENSOR directive in commit 10b07fa suffices
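As a rough sketch of what such an encapsulation could look like (placeholder function names, not the code in the commit; only the TENSOR flag itself comes from this thread):

```cuda
// Illustrative encapsulation: the tensor-core path is only compiled in when
// building with TENSOR=ON, so older CUDA toolkits and cards keep building and
// running the original reduction. The architecture bound is illustrative;
// the thread later settles on compute capability 8.0.
__device__ float reduce_energy(const half* vals, float* scratch)
{
#if defined(TENSOR) && (__CUDA_ARCH__ >= 800)
    // Tensor-core sum reduction (see the sketch above).
    return tensor_core_sum_256(vals, scratch);
#else
    // Fallback standing in for the existing warp/shared-memory reduction.
    float total = 0.0f;
    for (int i = 0; i < 256; ++i)
        total += __half2float(vals[i]);
    return total;
#endif
}
```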

@atillack
Member

@L30nardoSV I tested on one of our Nvidia Quadro RTX A5000 cards and I do see a nice speedup for the 3ce3 example input:

Docking time of the PR with make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test is 0.70 seconds vs 0.90 seconds (this does use the heuristics and autostop by default).

To evaluate a bit further, I used Diogo's test set of 42 ligands; here are the results:

Reference:

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
|---|---|---|---|---|---|---|---|
| OpenCL | 128 | no, 2.5M evals | 105382018 | 36 / 42 good | 36 / 42 good | 91.69 s | 0.17 s |
| Cuda | 128 | no, 2.5M evals | 105331182 | 36 / 42 good | 36 / 42 good | 90.75 s | 0.32 s |
| Cuda | 64 | no, 2.5M evals | 105404961 | 36 / 42 good | 35 / 42 good | 106.24 s | 7.93 s |
| OpenCL | 128 | yes | 84026192 | 37 / 42 good | 37 / 42 good | 187.82 s | 0.21 s |
| Cuda | 128 | yes | 80037847 | 38 / 42 good | 38 / 42 good | 184.85 s | 0.20 s |
| Cuda | 64 | yes | 84684628 | 36 / 42 good | 38 / 42 good | 233.39 s | 8.13 s |

This PR:

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
|---|---|---|---|---|---|---|---|
| OpenCL | 128 | no, 2.5M evals | 105362595 | 38 / 42 good | 36 / 42 good | 92.33 s | 0.21 s |
| Cuda | 128 | no, 2.5M evals | 105177642 | 35 / 42 good | 36 / 42 good | 100.71 s | 0.20 s |
| Cuda | 64 | no, 2.5M evals | 105197433 | 35 / 42 good | 38 / 42 good | 112.48 s | 0.19 s |
| OpenCL | 128 | yes | 86495325 | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
| Cuda | 128 | yes | 71419809 | 33 / 42 good | 37 / 42 good | 182.30 s | 0.21 s |
| Cuda | 64 | yes | 65754981 | 34 / 42 good | 37 / 42 good | 214.60 s | 0.22 s |

For multiple differently sized ligands with the typical settings, it turns out that for larger systems the speedup can turn into a slowdown.

It looks like the average number of evals w/ AutoStop changed in the PR, which could potentially point to a minute difference in the calculation (I did test multiple times to make sure this wasn't just an unlucky run).

@diogomart Please run your E50 tests for the Cuda version.

@atillack
Member

@L30nardoSV Thank you for the encapsulation :-)

@diogomart
Member

Unfortunately, algorithmic performance is worse.

[Plot: 79f13c7-ocl-128wi vs. PR252-10b07fa-cuda-tensor-128wi-overlap]

@L30nardoSV
Member Author

@atillack

Can you please check commit b2ab3fe, which incorporates the WMMA Extension for single-precision matmul on Tensor Cores with error correction (TCEC)?

make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 TENSOR=ON TCEC=ON test

Ref: https://github.com/wmmae/wmma_extension/blob/main/docs/mma_f32.md
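For context, the error-correction scheme (TCEC) splits each FP32 operand into a half-precision leading part plus a half-precision residual and accumulates the cross terms separately, recovering most of the precision lost in the FP16 conversion. A rough illustration of the splitting principle (this is not the wmma_extension API, just the underlying idea):

```cuda
#include <cuda_fp16.h>

// Illustration of the error-correction principle behind TCEC (not the
// wmma_extension API): split an FP32 value into a half-precision leading
// part and a half-precision residual.
__device__ void split_fp32(float x, half& hi, half& lo)
{
    hi = __float2half(x);                      // leading bits of the mantissa
    lo = __float2half(x - __half2float(hi));   // rounding error, kept separately
}

// A dot product a.b is then approximated as
//   sum(hi_a * hi_b) + sum(hi_a * lo_b) + sum(lo_a * hi_b),
// with each partial sum evaluated as a half-precision tensor-core MMA
// accumulating in FP32.
```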

@atillack
Member

@L30nardoSV I ran the newest version and here are the results (with OpenCL from before as comparison; note: I compiled w/o OVERLAP, so the last column takes a bit longer, but compute times are unaffected):

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
|---|---|---|---|---|---|---|---|
| OpenCL | 128 | yes | 86495325 | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
| Cuda | 128 | yes | 88164353 | 36 / 42 good | 38 / 42 good | 194.13 s | 7.74 s |
| Cuda | 64 | yes | 77884078 | 37 / 42 good | 37 / 42 good | 214.27 s | 21.54 s |

@atillack
Member

While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup (if you normalize by the number of evaluations total).

@L30nardoSV
Member Author

> While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup (if you normalize by the number of evaluations total).

Thanks, I look forward to seeing whether the search efficiency is fine at least

@diogomart
Member

I'll get to this soon

@diogomart
Member

@L30nardoSV sorry for the long delay. I don't see an improvement, unfortunately. Very similar results to the previous commit.
[Plot: 79f13c7-ocl-128wi vs. PR252-b2ab3fe-cuda-tensor-128wi-overlap]

@L30nardoSV
Member Author

@diogomart thanks!
I just want to make sure: did you compile with both TENSOR=ON and TCEC=ON?

@diogomart
Member

With TENSOR=ON yes, but not with TCEC=ON

@L30nardoSV
Member Author

L30nardoSV commented Jul 17, 2024

> With TENSOR=ON yes, but not with TCEC=ON

Can you please try again with TENSOR=ON and TCEC=ON?

@diogomart
Member

TCEC fixed it 👍

[Plot: 79f13c7-ocl-128wi vs. PR252-b2ab3fe-cuda-tensor-tcec-128wi-overlap]

@atillack
Member

Thank you @diogomart! Glad the search performance is back to normal. There is no measurable performance benefit though and it adds a wrinkle between Cuda and OpenCL code paths. So, I am not sure what to do with this PR ...

@diogomart
Member

diogomart commented Jul 17, 2024

I think there's a small improvement over plain CUDA but OpenCL is still the fastest.
(This was run on a mix of different GPUs, so one should mentally blur the plot to interpret it meaningfully.)

[Plot: runtime, PR255-7007db8-cuda-128wi-overlap vs. PR252-b2ab3fe-cuda-tensor-tcec-128wi-overlap]
[Plot: runtime, PR255-7007db8-cuda-128wi-overlap vs. PR255-7007db8-ocl-128wi-overlap]

@atillack
Member

What is the overall runtime divided by the overall number of evals? (and what is the std.error?)

@atillack
Member

(and by runtime, I really mean docking time - although with overlap, runtime is probably good enough)

@atillack
Member

Also, did you run on the same type of GPU?

@diogomart
Member

No, this was a mix of RTX 5000 and RTX 6000 cards.

@atillack
Member

Here is the data for the reduced set of 42 (first row: the reference Cuda run from before, second row: this PR):

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
|---|---|---|---|---|---|---|---|
| Cuda | 128 | yes | 80037847 | 38 / 42 good | 38 / 42 good | 184.85 s | 0.20 s |
| Cuda | 128 | yes | 88164353 | 36 / 42 good | 38 / 42 good | 194.13 s | 7.74 s |

In other words, I've seen 184,850,000 microseconds / 80,037,847 evals = 2.31 microseconds per eval for the reference vs. 194,130,000 / 88,164,353 = 2.20 microseconds per eval for this PR, which would be about 5% faster ... except that that's right around what I would estimate AutoStop to fluctuate by ...

I'll run with a fixed number of evals later to get one more data point and try to get on one of our newest cards as well to see if newer tensor cores might be beneficial.

@L30nardoSV
Member Author

Thanks @diogomart and @atillack

Glad to see that by enabling the error correction code (via TCEC) the docking quality is back to normal.

> I'll run with a fixed number of evals later to get one more data point and try to get on one of our newest cards as well to see if newer tensor cores might be beneficial.

I look forward to seeing those numbers :)

@diogomart
Member

Each marker is a system and aggregates data from 32,000 to 8M evals.
[Plot: time per eval, tensor vs. OpenCL]

@atillack
Member

@L30nardoSV There is a speedup with increasingly more tensor core-heavy cards and with smaller ligands \o/

Please go ahead and remove the non-TCEC code.

Here is the data using varying combinations of AutoStop and Heuristics (NUMWI=128, TARGETS=86) for an RTX A5000:

| A5000 | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup |
|---|---|---|---|---|---|---|---|
| Reference | no & no | 105356778 | 37 / 42 good | 37 / 42 good | 110.19 s | 1.046 | |
| Tensor + TCEC | no & no | 105336788 | 36 / 42 good | 38 / 42 good | 97.01 s | 0.921 | 1.14x |
| Reference | no & yes | 110331676 | 37 / 42 good | 36 / 42 good | 237.74 s | 2.155 | |
| Tensor + TCEC | no & yes | 110389153 | 39 / 42 good | 38 / 42 good | 213.16 s | 1.931 | 1.12x |
| Reference | yes & yes | 81545345 | 36 / 42 good | 37 / 42 good | 203.22 s | 2.492 | |
| Tensor + TCEC | yes & yes | 76813659 | 38 / 42 good | 38 / 42 good | 184.35 s | 2.400 | 1.04x |

From this data it seems that small ligands benefit more than larger ones (the fraction of evals spent on larger ligands increases with heuristics and autostop, compared to every ligand getting 2.5M evals)

Data for an RTX A6000 Ada showing overall similar relative speedup (compiled same as above):

| A6000 Ada | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup |
|---|---|---|---|---|---|---|---|
| Reference | no & no | 105393255 | 35 / 42 good | 38 / 42 good | 49.70 s | 0.472 | |
| Tensor + TCEC | no & no | 105405382 | 37 / 42 good | 37 / 42 good | 43.53 s | 0.413 | 1.14x |
| Reference | no & yes | 110421843 | 37 / 42 good | 38 / 42 good | 102.46 s | 0.928 | |
| Tensor + TCEC | no & yes | 110388427 | 39 / 42 good | 36 / 42 good | 94.18 s | 0.853 | 1.09x |
| Reference | yes & yes | 83999741 | 39 / 42 good | 37 / 42 good | 91.50 s | 1.089 | |
| Tensor + TCEC | yes & yes | 78516735 | 39 / 42 good | 37 / 42 good | 81.78 s | 1.042 | 1.05x |

Data for an H100 (compiled same as above, except TARGETS=90):

| H100 | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup |
|---|---|---|---|---|---|---|---|
| Reference | no & no | 105332904 | 37 / 42 good | 36 / 42 good | 93.13 s | 0.884 | |
| Tensor + TCEC | no & no | 105406910 | 37 / 42 good | 37 / 42 good | 68.27 s | 0.648 | 1.36x |
| Reference | no & yes | 110399211 | 38 / 42 good | 38 / 42 good | 183.82 s | 1.665 | |
| Tensor + TCEC | no & yes | 110451270 | 39 / 42 good | 38 / 42 good | 152.55 s | 1.381 | 1.21x |
| Reference | yes & yes | 82377612 | 39 / 42 good | 38 / 42 good | 160.04 s | 1.943 | |
| Tensor + TCEC | yes & yes | 81215241 | 37 / 42 good | 37 / 42 good | 133.76 s | 1.647 | 1.18x |

So with a card with very strong tensor cores there's a bit more of a benefit ... Interestingly, for docking the A6000 Ada cards of the newer Ada generation are still much better overall.

@L30nardoSV
Member Author

@atillack Many thanks for the detailed evaluation!

I will get the non-TCEC code removed in the next few days

@L30nardoSV
Member Author

@atillack @diogomart Please test commit 162a850

Only TENSOR=ON is required, because TCEC is now enabled by default (the non-TCEC code was removed).

@atillack
Member

@L30nardoSV I am in favor of merging this PR as an option to "future-proof" the Cuda branch a bit. There are still a couple of modifications needed - those should be minor, but I have since unified the host code, for example, and made some other small performance tweaks.

There is no free lunch though: theoretical FP32 flops w/ vs. w/o tensor cores and memory bandwidth stay the same; it's just a question of which implementation is more efficient (unsurprisingly, Nvidia's WMMA does a good job). So even with this PR and the tensor cores in use, the Cuda path is currently at best as fast as the OpenCL path (which typically is about 5% faster). There is a chance, though, that some of the changes I made will also speed up this PR ;-)

OpenCL runs of the most recent code (261c91f), run on the same nodes as above:

| GPU | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup vs PR TCEC |
|---|---|---|---|---|---|---|---|
| A5000 | no & no | 105421266 | 36 / 42 good | 38 / 42 good | 91.58 s | 0.869 | 1.06x |
| A5000 | no & yes | 110360358 | 39 / 42 good | 38 / 42 good | 212.89 s | 1.929 | 1.00x |
| A5000 | yes & yes | 80855441 | 36 / 42 good | 38 / 42 good | 184.47 s | 2.281 | 1.05x |
| A6000 Ada | no & no | 105393969 | 37 / 42 good | 36 / 42 good | 40.29 s | 0.382 | 1.08x |
| A6000 Ada | no & yes | 110372962 | 38 / 42 good | 39 / 42 good | 91.14 s | 0.826 | 1.03x |
| A6000 Ada | yes & yes | 78186199 | 37 / 42 good | 36 / 42 good | 79.45 s | 1.016 | 1.03x |
| H100 | no & no | 105462969 | 36 / 42 good | 37 / 42 good | 64.67 s | 0.613 | 1.06x |
| H100 | no & yes | 110356319 | 39 / 42 good | 37 / 42 good | 145.83 s | 1.321 | 1.05x |
| H100 | yes & yes | 88132921 | 37 / 42 good | 37 / 42 good | 136.91 s | 1.553 | 1.06x |

@atillack
Member

@L30nardoSV @diogomart The code is now updated to the current develop branch; here is a quick benchmark on the A6000 Ada showing a speedup (!) of TENSOR=ON over OpenCL of 1-4%.

| GPU | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup vs OpenCL |
|---|---|---|---|---|---|---|---|
| A6000 Ada | no & no | 105308281 | 36 / 42 good | 36 / 42 good | 38.77 s | 0.368 | 1.04x |
| A6000 Ada | no & yes | 110376381 | 38 / 42 good | 37 / 42 good | 89.92 s | 0.815 | 1.01x |
| A6000 Ada | yes & yes | 79465317 | 38 / 42 good | 37 / 42 good | 78.25 s | 0.985 | 1.03x |

@atillack
Member

@diogomart Please rerun verification :-)

@atillack
Member

@L30nardoSV @diogomart I optimized and cleaned up the WMMA code a bit, added checks to make sure the device we're running on is able to run the tensor core sum reductions, and now also automatically set the minimum compute capability to 8.0 so make TENSOR=ON is all that's needed now.
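A minimal sketch of such a device check, assuming the compute-capability-8.0 requirement mentioned above (the function name is a placeholder, not the code in the commit):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder sketch: verify that the selected device can run the tensor-core
// sum reductions (compute capability 8.0 or newer) before launching the
// TENSOR kernels.
static bool device_supports_tensor_reduction(int device_id)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess)
        return false;
    if (prop.major < 8) {
        fprintf(stderr, "Device %s (sm_%d%d) cannot run the tensor core path.\n",
                prop.name, prop.major, prop.minor);
        return false;
    }
    return true;
}
```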

From my end that's all the code changes and I'll approve/merge when Diogo's regression check is successful.

@L30nardoSV
Member Author

Thank you @atillack.
I look forward to seeing the results of @diogomart's check :)

@hwcopeland

> @L30nardoSV I am in favor of merging this PR as an option to "future-proof" the Cuda branch a bit. There are still a couple of modifications needed - those should be minor, but I have since unified the host code, for example, and made some other small performance tweaks.
>
> There is no free lunch though: theoretical FP32 flops w/ vs. w/o tensor cores and memory bandwidth stay the same; it's just a question of which implementation is more efficient (unsurprisingly, Nvidia's WMMA does a good job). So even with this PR and the tensor cores in use, the Cuda path is currently at best as fast as the OpenCL path (which typically is about 5% faster). There is a chance, though, that some of the changes I made will also speed up this PR ;-)
>
> OpenCL runs of the most recent code (261c91f), run on the same nodes as above:
>
> | GPU | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup vs PR TCEC |
> |---|---|---|---|---|---|---|---|
> | A5000 | no & no | 105421266 | 36 / 42 good | 38 / 42 good | 91.58 s | 0.869 | 1.06x |
> | A5000 | no & yes | 110360358 | 39 / 42 good | 38 / 42 good | 212.89 s | 1.929 | 1.00x |
> | A5000 | yes & yes | 80855441 | 36 / 42 good | 38 / 42 good | 184.47 s | 2.281 | 1.05x |
> | A6000 Ada | no & no | 105393969 | 37 / 42 good | 36 / 42 good | 40.29 s | 0.382 | 1.08x |
> | A6000 Ada | no & yes | 110372962 | 38 / 42 good | 39 / 42 good | 91.14 s | 0.826 | 1.03x |
> | A6000 Ada | yes & yes | 78186199 | 37 / 42 good | 36 / 42 good | 79.45 s | 1.016 | 1.03x |
> | H100 | no & no | 105462969 | 36 / 42 good | 37 / 42 good | 64.67 s | 0.613 | 1.06x |
> | H100 | no & yes | 110356319 | 39 / 42 good | 37 / 42 good | 145.83 s | 1.321 | 1.05x |
> | H100 | yes & yes | 88132921 | 37 / 42 good | 37 / 42 good | 136.91 s | 1.553 | 1.06x |

Thank you, I've been looking everywhere for numbers like this.
