GPU support via CuArrays.jl #86
Hi @kose-y, I’m not familiar with CuArrays, but from what I understand the array type it provides is compatible with AbstractArray, and that should be sufficient for many operators in ProximalOperators to work. Could you provide a specific example of a case where that doesn’t work as expected?
Looping with direct elementwise access (e.g. https://github.com/kul-forbes/ProximalOperators.jl/blob/master/src/functions/normL2.jl#L37) is expected to be very slow on GPUs.
Well, in that specific case I think it makes sense to just remove that loop and instead rely on broadcasting. Other cases may be just as simple. Thanks for pointing this out!
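To make the idea concrete, here is a minimal sketch (function names and structure are mine, not the package's actual code) of the NormL2 prox. The prox of f(x) = λ‖x‖₂ with step γ is the radial shrinkage y = max(0, 1 - γλ/‖x‖) · x, written below once with an explicit loop and once with a fused broadcast that a CuArray could also execute:

```julia
using LinearAlgebra

# Loop version: fast for serial CPU execution, but elementwise
# indexing like this is very slow on GPU array types.
function prox_normL2_loop!(y, x, lambda, gamma)
    scal = max(0.0, 1 - gamma * lambda / norm(x))
    for i in eachindex(x, y)
        y[i] = scal * x[i]
    end
    return y
end

# Broadcast version: one fused elementwise kernel; the same line
# should dispatch to a GPU kernel when y and x are CuArrays.
function prox_normL2_bcast!(y, x, lambda, gamma)
    scal = max(0.0, 1 - gamma * lambda / norm(x))
    y .= scal .* x
    return y
end
```

Since `scal` is a scalar, the broadcast is a single pass over the data, so on the CPU the two versions do the same amount of work.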
Thank you for the comment. I have to point out that this is not one specific case: I could quickly find many of them under the directory
That would be nice, thanks. You can add the list as items to this issue, by editing the original post, so that progress towards solving this is tracked. |
@kose-y I’ve edited your post, adding a list of proxes that could be improved. I’m not 100% sure about all of them, but I guess they’re worth a try. I guess any attempt at solving this issue would need to come with evidence that performance improves with CuArray on GPU but doesn’t degrade with Array on CPU.
That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU). |
How did you write the broadcast? The simple version on the CPU, `y .= max.(f.lb, min.(f.ub, x))`, takes for me with scalar `lb`/`ub`:

Edit: interestingly enough, with Float32 I get
@mfalt I used exactly the line you mentioned, with vectors `lb`/`ub`. I tried with sizes 10k and 100k, and observed a slowdown similar to yours. In the 100k case:
My takeaway so far: for serial computation, looping through the data just once wins, and should be preferred over broadcasting.

With scalar bounds, things are a bit faster in all cases:
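A quick way to reproduce the comparison above is a micro-benchmark along these lines (a sketch with hypothetical helper names; the loop mirrors the elementwise IndBox projection and the broadcast is the one-liner discussed above):

```julia
# Loop version: clamp each element into [lb[i], ub[i]] with direct indexing.
function clamp_loop!(y, x, lb, ub)
    for i in eachindex(x)
        y[i] = x[i] < lb[i] ? lb[i] : (x[i] > ub[i] ? ub[i] : x[i])
    end
    return y
end

# Broadcast version: a single fused kernel, also valid for CuArrays.
clamp_bcast!(y, x, lb, ub) = (y .= max.(lb, min.(ub, x)); y)

n = 100_000
x = randn(n); lb = fill(-1.0, n); ub = fill(1.0, n)
y1, y2 = similar(x), similar(x)

clamp_loop!(y1, x, lb, ub)    # warm up, so compilation is excluded
clamp_bcast!(y2, x, lb, ub)
@time clamp_loop!(y1, x, lb, ub)
@time clamp_bcast!(y2, x, lb, ub)
```

For sharper numbers than `@time`, the BenchmarkTools.jl package (`@btime`) would average over many runs; the two versions produce identical results, so correctness is easy to check alongside the timing.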
It may be helpful to use, for example: ProximalOperators.jl/src/functions/normL1.jl, line 127 in 5b2845a
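For normL1 specifically, the soft-thresholding prox can be written as a single fused broadcast, which should map to one elementwise GPU kernel for CuArray inputs (the function name below is mine, for illustration):

```julia
# prox of t·‖x‖₁ is elementwise soft-thresholding:
# y_i = sign(x_i) * max(|x_i| - t, 0), here as one fused broadcast.
softthresh!(y, x, t) = (y .= sign.(x) .* max.(abs.(x) .- t, 0); y)
```

The same expression works unchanged for scalar or elementwise (vector) thresholds `t`, thanks to broadcasting.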
Thanks for the nice package. I think it would be good to have GPU support (along with the multi-threaded evaluation on CPU discussed in #82).
Edited by @lostella on Dec 7th, 2019
Note: similar optimizations could be used in the computation of `f` and `gradient`, besides `prox`.
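As a sketch of that note (hypothetical names, using the squared L2 norm purely as an example): the function value can be expressed as a reduction and the gradient as a broadcast, both of which CuArrays implements without scalar indexing:

```julia
# f(x) = (λ/2)‖x‖₂²: a single reduction instead of an accumulation loop.
f_sqnorm(x, lambda) = (lambda / 2) * sum(abs2, x)

# ∇f(x) = λx: a single fused broadcast instead of an elementwise loop.
grad_sqnorm!(g, x, lambda) = (g .= lambda .* x; g)
```

Rewriting `f` and `gradient` this way would give the same GPU benefit as the prox changes, since loops with direct elementwise access are the common bottleneck in all three.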