Speeding up sum reductions in ADADELTA by using Tensor Cores #252
base: develop
Conversation
…ased reduction method to work correctly.
@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.
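A minimal sketch of what such compile-time encapsulation could look like (the TENSOR macro and helper names here are illustrative placeholders, not the PR's actual symbols):

```cuda
#include <cuda_fp16.h>

// Existing reduction path: pairwise sums via warp shuffles (always compiled).
__device__ __forceinline__ float warp_shuffle_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// The WMMA-based reduction is only compiled when the build requests it
// (here assumed to define TENSOR) and when the target architecture supports
// half-precision WMMA (compute capability 7.0 or newer), so older CUDA
// toolkits and cards keep building and running the original path unchanged.
#if defined(TENSOR) && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
__device__ float tensor_core_sum(const half* data, const half* ones,
                                 float* scratch);  // declaration only in this sketch
#endif
```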
…eductions in ADADELTA.
OK, let me know if the …
@L30nardoSV I tested on one of our Nvidia Quadro RTX A5000 cards and I do see a nice speedup for the 3ce3 example input. Docking time of the PR with …

To evaluate a bit further I used Diogo's test set of 42 ligands, here are the results:

Reference: …

This PR: …

For multiple differently sized ligands with the typical settings … it turns out that for larger systems the speedup can turn into a slowdown. It looks like the average number of evals w/ AutoStop changed in the PR, which could potentially point to a minute difference in calculation (I did test multiple times to make sure this wasn't just an unlucky run). @diogomart Please run your E50 tests for the Cuda version.
@L30nardoSV Thank you for the encapsulation :-)
…nsor Cores and error correction (TCEC).
Can you please check commit b2ab3fe that incorporates the WMMA Extension for single-precision matmul on Tensor Cores + error correction (TCEC)?
Ref: https://github.com/wmmae/wmma_extension/blob/main/docs/mma_f32.md
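For context, the core idea behind such an error-corrected scheme is to split each fp32 value into a half-precision main part plus a half-precision residual before the tensor-core multiply, and to add the partial products back in fp32. The snippet below is only a rough sketch of that splitting idea, not the wmma_extension library's actual implementation or API:

```cuda
#include <cuda_fp16.h>

// Rough sketch of the error-correction idea (hypothetical helper, not the
// library's code): the fp32 value x is represented as hi + lo, where both
// parts fit in fp16.  A reduction J*D can then be evaluated as
// J*D_hi + J*D_lo (one extra multiply per fragment), recovering most of the
// precision that a plain fp32->fp16 conversion would lose.
__device__ void split_fp32(float x, half& hi, half& lo)
{
    hi = __float2half(x);                     // main part
    lo = __float2half(x - __half2float(hi));  // residual missed by the main part
}
```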
@L30nardoSV I ran the newest version and here are the results (with OpenCL from before as comparison; note: I compiled w/o OVERLAP so the last column takes a bit longer, but compute times are unaffected):
While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup (if you normalize by the total number of evaluations).
Thanks, I look forward to seeing whether the search efficiency is fine at least.
I'll get to this soon.
@L30nardoSV sorry for the long delay. I don't see an improvement, unfortunately. Very similar results to the previous commit.
@diogomart thanks!
With TENSOR=ON yes, but not with TCEC=ON.
Can you please try again with TCEC=ON?
Thank you @diogomart! Glad the search performance is back to normal. There is no measurable performance benefit though, and it adds a wrinkle between the Cuda and OpenCL code paths. So, I am not sure what to do with this PR ...
What is the overall runtime divided by the overall number of evals? (and what is the std. error?)
(and by runtime, I really mean docking time - although with overlap, runtime is probably good enough)
Also, did you run on the same type of GPU?
No, this was a mix of RTX 5000 and RTX 6000 cards.
Here is the data for the reduced set of 42:
In other words, I've seen 184,850,000 microseconds / 80,037,847 evals = 2.31 microseconds per eval vs 2.20 microseconds per eval for this PR, which would be about 5% faster ... except that that's right around what I would estimate AutoStop to fluctuate by ... I'll run with a fixed number of evals later to get one more data point, and try to get on one of our newest cards as well to see if newer tensor cores might be beneficial.
Thanks @diogomart and @atillack! Glad to see that by enabling the error correction code (via TCEC) the docking quality is back to normal.

I look forward to seeing those numbers :)
@L30nardoSV There is a speedup with increasingly tensor-core-heavy cards and with smaller ligands \o/ Please go ahead and remove the non-TCEC code. Here is the data using varying combinations of AutoStop and Heuristics (NUMWI=128, TARGETS=86) for an RTX A5000:

From this data it seems that small ligands benefit more than larger ones (the fraction of evals spent on larger ligands increases with Heuristics and AutoStop, compared to every ligand getting 2.5M evals).

Data for an RTX A6000 Ada showing an overall similar relative speedup (compiled same as above):

Data for an H100 (compiled same as above, except TARGETS=90):

So with a card with very strong tensor cores there's a bit more of a benefit ... Interestingly, for docking, the newer Ada-generation A6000 cards are still much better overall.
@atillack Many thanks for the detailed evaluation! I will get the non-TCEC code removed in the next few days.
@atillack @diogomart Please test commit 162a850. Only …
@L30nardoSV I am in favor of merging this PR as an option to "future-proof" the Cuda branch a bit. There are still a couple of modifications needed - those should be minor, but I have since unified the host code, for example, and made some other small performance tweaks. There is no free lunch though: theoretical FP32 flops w/ vs. w/o tensor cores and memory bandwidth stay the same; it's just a question of which implementation is more efficient (unsurprisingly, Nvidia's WMMA does a good job). So even with this PR and the tensor cores in use, the Cuda path is currently still at best as fast as the OpenCL path (which typically is about 5% faster). There is a chance, though, that some of the changes I made will also speed up this PR ;-)

OpenCL runs of the most recent code (261c91f) run on the same nodes as above:
@L30nardoSV @diogomart Code is now updated to the current develop branch; here is a quick benchmark on the A6000 Ada showing a speedup (!) of TENSOR=ON over OpenCL of between 1 and 4%.
@diogomart Please rerun verification :-)
…d error message if device is below.
@L30nardoSV @diogomart I optimized and cleaned up the WMMA code a bit, added checks to make sure the device we're running on is able to run the tensor core sum reductions, and now also automatically set the minimum compute capability to 8.0 so …

From my end that's all the code changes, and I'll approve/merge when Diogo's regression check is successful.
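A minimal sketch of what such a host-side capability check might look like (illustrative helper name and message; the PR's actual check may differ), using the standard CUDA runtime device query:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative check (not the PR's actual code): the tensor-core sum
// reductions are assumed to require compute capability 8.0 or newer, so the
// device is queried and a clear error message is printed if it is below that.
bool device_supports_tensor_reduction(int device_id)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess)
        return false;
    if (prop.major < 8) {
        fprintf(stderr,
                "Device %d (%s, CC %d.%d) cannot run the tensor-core sum "
                "reductions; rebuild without TENSOR=ON or choose another GPU.\n",
                device_id, prop.name, prop.major, prop.minor);
        return false;
    }
    return true;
}
```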
Thank you @atillack. |
Thank you! I've been looking everywhere for numbers like this.
Hi,
This PR aims to increase the performance of the CUDA version by leveraging the Tensor Core Units (TCUs) present in recent NVIDIA GPUs.
The idea is to re-implement the sum reductions as matrix operations (i.e., by using NVIDIA Warp Matrix Functions), which can be offloaded to TCUs.
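As a rough illustration of the approach (a minimal sketch under assumed names and layout, not the PR's actual kernel code): a warp can sum 256 values laid out as a 16x16 half-precision matrix D by computing J*D with a single WMMA multiply, where J is the all-ones matrix; each row of the result then holds the 16 column sums of D, which a handful of warp shuffles collapse into the final total.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Hypothetical sketch: sum 256 half values (row-major 16x16 matrix in shared
// memory) with one tensor-core multiply.  smem_ones holds 256 halves all set
// to 1.  Compiling the WMMA intrinsics requires compute capability 7.0+.
__device__ float warp_sum_256(const half* smem_data, const half* smem_ones,
                              float* smem_scratch /* 256 floats */)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, smem_ones, 16);   // A = J (all ones)
    wmma::load_matrix_sync(b_frag, smem_data, 16);   // B = D (the data)
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C = J*D: column sums of D
    wmma::store_matrix_sync(smem_scratch, c_frag, 16, wmma::mem_row_major);
    __syncwarp();

    // Every row of C is identical, so lanes 0..15 each grab one column sum
    // and a shuffle reduction yields the grand total, broadcast to all lanes.
    const int lane = threadIdx.x & 31;
    float v = (lane < 16) ? smem_scratch[lane] : 0.0f;
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return __shfl_sync(0xffffffff, v, 0);
}
```

The precision cost of converting the fp32 partial values to half for the fragments is what the TCEC error correction discussed earlier in the thread compensates for.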
Experiments on A100 GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 test):

Experiments on RTX3050Ti GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test):

The baseline implementation for this PR has been taken from this paper: Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores. The contribution of both authors, Gabin Schieffer (@gabin-s) and Ivy Peng (@bopkth), is acknowledged in this PR as well.