
Seems OpenCL is faster than CUDA #239

Open
daylight-00 opened this issue Aug 23, 2023 · 5 comments
Comments

daylight-00 commented Aug 23, 2023

I compared the time required per ligand and the time required for the whole job, using AutoDock-GPU built with CUDA and with OpenCL respectively, and CUDA consistently took about 30~40% more time than OpenCL. However, the description in the repository says CUDA is faster than OpenCL, contrary to my results. I have obtained similar results under different conditions on the same system, but I have not tried other systems, so I hope others can check whether CUDA is really faster than OpenCL.

System:

  • AMD EPYC 7542 32-Core Processor
  • NVIDIA GeForce RTX 3090 × 1
  • CUDA Toolkit 11.8 (Conda Package)

Docking (an illustrative command line follows this list):

  • ligand batch size: 10K
  • nrun: 10
  • iterations of entire job: 10
  • random seed: 100
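A run with these settings might be launched roughly as follows. This is a sketch only: the binary name, the batch file, and the exact flags (`--filelist`, `--nrun`, `--seed`) are assumptions to verify against the `--help` output of your AutoDock-GPU build.

```sh
# Hypothetical batch invocation approximating the settings above;
# batch_10k.txt would list the grid field file and the 10K ligand files.
./autodock_gpu_128wi \
    --filelist batch_10k.txt \
    --nrun 10 \
    --seed 100
```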

[attached: screenshot and output file with the timing results]

daylight-00 changed the title from "OpenCL is faster than CUDA" to "Seems OpenCL is faster than CUDA" on Aug 23, 2023
atillack (Member) commented

@daylight-00 Thank you and yes, OpenCL is about 5-15% faster in our own testing on the same hardware (RTX A5000). Newer versions should narrow the gap a little bit, as Cuda now requests a smaller chunk of memory (sized to the memory actually needed rather than the maximums, similar to OpenCL) - so if this isn't the current develop branch, it may be worthwhile to test again.

I suspect the remaining difference may be caused by pre-allocated memory at compile time (OpenCL) vs dynamically allocated memory at runtime (Cuda) for variables in shared memory - as, other than this, the Cuda and OpenCL paths use exactly the same algorithms and, as much as possible, even the same implementations ...
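For illustration, the two shared-memory strategies contrasted above look roughly like this in CUDA. A minimal sketch; `MAX_POP`, `pop_size`, and the kernel names are invented for the example, not AutoDock-GPU identifiers.

```cuda
// Compile-time sizing (the OpenCL-style path): the array size is a
// constant, so each block's shared-memory footprint is known up front.
#define MAX_POP 256  // hypothetical compile-time maximum

__global__ void kernel_static(float init) {
    __shared__ float energies[MAX_POP];  // fixed size, baked into the binary
    energies[threadIdx.x] = init;
    __syncthreads();
    // ... rest of the kernel works on energies ...
}

// Runtime sizing (the Cuda-style path): the size is supplied at launch,
// which is more flexible but can cost a little performance.
__global__ void kernel_dynamic(float init) {
    extern __shared__ float energies[];  // sized by the launch configuration
    energies[threadIdx.x] = init;
    __syncthreads();
    // ... rest of the kernel works on energies ...
}

// Launch site: the third <<<...>>> argument is the dynamic size in bytes.
// kernel_dynamic<<<num_blocks, pop_size, pop_size * sizeof(float)>>>(0.0f);
```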

Since OpenCL exists on Nvidia and many more devices (all the way to Android), that's ultimately good news though :-)

atillack (Member) commented

Found the culprit: it looks like I wrote that Cuda was faster in our README.md about three years ago. That was probably true at the time, before I merged the integer gradient from Cuda into OpenCL as well. I'll fix README.md by taking that sentence out.

daylight-00 (Author) commented Aug 23, 2023

Thank you for your answer. I did use the develop branch, though.
I'm using AutoDock-GPU on a cluster with varying types and numbers of GPUs (A5000, A6000, 3090, ...), and I wonder whether there could be a problem if I run it on a different node than the one I compiled on.

atillack (Member) commented

For Cuda this should only be an issue if you were to compile with the wrong architecture(s) - for the 3090/A5000/A6000 you want to compile with TARGETS="86".
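On the Cuda side the build might look like the following; a sketch, assuming the usual AutoDock-GPU make variables (`DEVICE`, `NUMWI`, `TARGETS`) - check the repository README for the exact flow in your checkout.

```sh
# TARGETS lists the CUDA compute capabilities to generate code for;
# 8.6 covers the RTX 3090, A5000, and A6000.
make DEVICE=CUDA NUMWI=128 TARGETS="86"
```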

One more thing: I would only compare overall runtimes on the same machine, as the per-kernel performance timers - even when placed at the same code locations - may still capture different tasks depending on what Cuda and OpenCL do at kernel cleanup time.
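To make the distinction concrete, here is a minimal standalone sketch of the two kinds of measurement (the dummy kernel and all names are invented for illustration): a per-kernel event timer versus a whole-job wall clock. Only the latter is directly comparable across the Cuda and OpenCL builds.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}

int main() {
    // Whole-job wall clock: measures everything, including cleanup,
    // so it is fair to compare across backends.
    auto t0 = std::chrono::steady_clock::now();

    // Per-kernel event timer: brackets only what sits between the two
    // events, which may differ between the Cuda and OpenCL code paths.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    dummy_kernel<<<1, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float kernel_ms = 0.0f;
    cudaEventElapsedTime(&kernel_ms, start, stop);

    auto t1 = std::chrono::steady_clock::now();
    double total_ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("kernel: %.3f ms, whole job: %.3f ms\n", kernel_ms, total_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```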

atillack (Member) commented

I just realized that PR #233 should close the Cuda performance gap a bit more, as it contains the code to allocate the same amount of memory as OpenCL ...
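The change boils down to sizing device buffers from the actual job instead of a compile-time ceiling. A hypothetical sketch of the pattern; none of these names come from PR #233 or AutoDock-GPU:

```cuda
#include <cuda_runtime.h>

#define MAX_NUM_OF_RUNS 1000  // hypothetical compile-time maximum

// Old pattern: always reserve the worst case.
//   cudaMalloc((void**)&d_results, MAX_NUM_OF_RUNS * pop_size * sizeof(float));

// OpenCL-style pattern: reserve only what this job actually needs.
cudaError_t allocate_results(float** d_results, int nruns, int pop_size) {
    size_t bytes = (size_t)nruns * (size_t)pop_size * sizeof(float);
    return cudaMalloc((void**)d_results, bytes);
}
```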
