Added speed improvements and CuArray support to CircShiftedArray #66
base: master
Conversation
Also added broadcasting support for the standard …
… will now try to fix the base-type problem and then also move the shift back to a variable inside the class.
This is done now. The benchmarking showed only minor changes, which are acceptable, and at least the combinatorial-explosion problem is reduced. Most importantly, type stability seems to be restored now.
Yes, this seems like a cleaner solution if one wants to share functionality. The current implementation seems a bit odd, with the value of the default element encoded in a type parameter. Having both types inherit from AbstractShiftedArray{T, N} <: AbstractArray{T, N} seems a cleaner way to reuse functionality. I would also ideally like to keep the meaning of …
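A minimal sketch of the hierarchy suggested here (the struct names and fields are illustrative placeholders, not the package's actual design):

```julia
# Hypothetical shared supertype; both wrappers reuse generic methods defined on it.
abstract type AbstractShiftedArray{T,N} <: AbstractArray{T,N} end

struct MyShiftedArray{T,N,A<:AbstractArray} <: AbstractShiftedArray{T,N}
    parent::A
    shifts::NTuple{N,Int}
end

struct MyCircShiftedArray{T,N,A<:AbstractArray} <: AbstractShiftedArray{T,N}
    parent::A
    shifts::NTuple{N,Int}
end

# Shared functionality is then written once against the supertype:
Base.size(x::AbstractShiftedArray) = size(x.parent)
shifts(x::AbstractShiftedArray) = x.shifts
```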
To be honest, I think I would prefer to keep this package very lightweight, so unless there are strong advantages, I would avoid using a combination of other wrapper types. On a more general note, the PR seems pretty large (and so also a bit difficult to review); is there a way it could be split into smaller, more manageable PRs? Though maybe switching from the more complex type to the inheritance approach to share functionality would reduce the diff significantly.
I doubt it. The size of this PR is due to the details of the broadcasting mechanism, which one would probably need just as much (if not more) if an abstract array type were introduced.
Of course it can be helpful if @roflmaostc can review the code, but I think it would greatly simplify things if we tried to break this PR into more manageable chunks, as the diff really is quite large (especially given how small the package is). As far as I can see, a few things are happening: …
I would propose to first figure out point 1. I think we should make the smallest possible change to the type hierarchy, as this package is a dependency of the JuliaStats stack.
I guess this could have been done either way; both approaches have a "shift" parameter.
If it is part of the type, it can be determined at compile time rather than runtime. This is why I implemented it this way.
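To illustrate the trade-off being discussed (a hedged sketch with made-up type names, not the PR's code):

```julia
# Shift encoded as a type parameter: the value is known to the compiler, so
# shift arithmetic can constant-fold, but every distinct shift triggers a new
# specialization (the combinatorial-explosion concern mentioned above).
struct TypeShifted{T,N,Shifts,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
end
shifts(::TypeShifted{T,N,Shifts}) where {T,N,Shifts} = Shifts

# Shift stored as a field: one compiled method covers all shifts, with the
# value looked up at runtime.
struct FieldShifted{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
    shifts::NTuple{N,Int}
end
shifts(x::FieldShifted) = x.shifts
```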
I agree. This could have been done separately.
Yes. This is the main and most important point. Without the broadcasting support it cannot sensibly be used in …
Can we make some progress on this pull request?
Can you be more specific? I'm not familiar with this package; what used to be CuArray-incompatible, and what changes did you have to make that are specific to CUDA.jl?
@maleadt Sorry for pulling you into this here. In short: this module wraps an array into a shifted or circularly shifted array. Specialities are the support of types such as … For this reason I implemented a bunch of necessary broadcast overloads for this module, such that direct calls to … @piever luckily implemented a quite extensive set of tests, which helped enormously in implementing the broadcasting schemes, which now fully pass these tests (and some more) both with and without … A (minor?) detail is that the class structure was slightly changed (avoiding two independent classes) to avoid code duplication, but this should still be compatible with the original classes for all practical purposes.
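For readers unfamiliar with the package, a quick illustration of the two wrappers (values shown follow the documented semantics of ShiftedArrays.jl):

```julia
using ShiftedArrays

a = collect(1:5)

s = ShiftedArray(a, 1)       # lazy shift; out-of-range entries read as `missing`
c = CircShiftedArray(a, 1)   # lazy circular shift, matching `circshift`

s[1], s[2]   # (missing, 1)
c[1], c[2]   # (5, 1)
```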
I guess you're talking about scalar indexing here? Without looking too closely at the source code (where you seem to materialize the array wrapper eagerly before broadcasting), I wonder if it wouldn't have been possible to have …
Thanks for this interesting idea. This may also be an interesting approach. Of course, you would then not get the speed improvement that the full broadcasting support in the current implementation provides for non-CUDA usage.
If you pass a …
This sounds like quite an interesting strategy, but I have to admit that I cannot fully follow it, as I am not acquainted well enough with the internals of …
Ok. This is actually an impressive scheme, and it basically works. However, as I feared, there is a problem with the "non-concrete element type":

```julia
using ShiftedArrays
using CUDA
using Adapt

CUDA.allowscalar(false);

a = rand(100, 100);
# csa = circshift(a, (10, 11))
csa = CircShiftedArray(a, (10, 11));
ca = CuArray(a);
csca = CircShiftedArray(ca, (10, 11));

# NB: `shifts(csca)` refers to the global variable rather than the argument `x`;
# `shifts(x)` is what is meant here (this is the mistake spotted further down
# in the thread).
Adapt.adapt_structure(to, x::CircShiftedArray{T, D, CT}) where {T, D, CT<:CuArray} =
    CircShiftedArray(adapt(to, parent(x)), shifts(csca));

q = adapt(CuArray, csca);
q = cudaconvert(csca);
typeof(q)

# before defining a BroadcastStyle rule, this broadcast falls back to the
# generic AbstractArray path:
qc = csca .+ 1.0;

function Base.Broadcast.BroadcastStyle(::Type{T}) where {T<:CircShiftedArray{<:Any,<:Any,<:CuArray}}
    CUDA.CuArrayStyle{ndims(T)}()
end

qc = csca .+ 1.0;
q = csa .+ 1.0;
Array(qc) == q  # check for identical result
# Great!

sa = ShiftedArray(a, (22, 23))
sca = ShiftedArray(ca, (22, 23));

# let's do this for the ShiftedArray type as well
function Base.Broadcast.BroadcastStyle(::Type{T}) where {T<:ShiftedArray{<:Any,<:Any,<:Any,<:CuArray}}
    CUDA.CuArrayStyle{ndims(T)}()
end

q = sa .+ 1.0;

# Here is the problem:
qc = sca .+ 1.0;
# ERROR: GPU broadcast resulted in non-concrete element type Union{Missing, Float64}.
# This probably means that the function you are broadcasting contains an error or type instability.
```
Eh, that probably just needs to be relaxed. As I mentioned above, CuArray supports union-isbits storage nowadays. Removing that error as a test, and adding the necessary Adapt rule, makes everything work:

```julia
julia> Adapt.adapt_structure(to, x::ShiftedArray{T, M, N, <:CuArray}) where {T, M, N} =
           ShiftedArray(adapt(to, parent(x)), shifts(x); default=ShiftedArrays.default(x))

julia> typeof(cudaconvert(sca))
ShiftedArray{Float64, Missing, 2, CuDeviceMatrix{Float64, 1}}

julia> sca .+ 1.0
100×100 CuArray{Union{Missing, Float64}, 2, CUDA.Mem.DeviceBuffer}:
 missing  missing  missing  missing  …  missing  missing  missing
 ...
```

Also:

That …
Thanks a lot. Silly me, forgetting the second adapt rule. Thanks also for spotting the wrong …
So, here comes the surprise. I did some benchmarking using … The question is therefore how to proceed from here? @piever
@piever see above. Any comments on how to proceed?
@RainerHeintzmann sorry for the very slow reply rate in this PR! I can see that a lot of work has already gone into this, and it'd definitely be very useful to get ShiftedArrays of CuArrays to work well.
Maybe a trivial comment, but did you make sure to also use …
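As a general note (a hedged sketch only, since the question above is cut off and may refer to something else): timing GPU broadcasts fairly usually requires synchronizing the device and interpolating globals, e.g. with BenchmarkTools.jl:

```julia
using BenchmarkTools, CUDA, ShiftedArrays

a = CUDA.rand(1024, 1024)
c = CircShiftedArray(a, (17, 5))

# `CUDA.@sync` waits for the asynchronous kernel to finish; `$` interpolation
# avoids timing global-variable overhead.
@btime CUDA.@sync($c .+ 1.0f0);
@btime CUDA.@sync(circshift($a, (17, 5)) .+ 1.0f0);
```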
Having the adapt rule in the package seems perfectly reasonable: it's just a lightweight dependency plus two lines of code. I would agree to use the new weakdeps machinery (just as OffsetArrays.jl plans to do: JuliaArrays/OffsetArrays.jl#331). Given that the …
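For concreteness, a hedged sketch of what such a weakdep extension could look like (the module and file names are illustrative; the two method definitions are the ones worked out earlier in this thread):

```julia
# ext/ShiftedArraysCUDAExt.jl -- loaded automatically once both ShiftedArrays
# and CUDA are in the environment (Julia >= 1.9 package extensions)
module ShiftedArraysCUDAExt

using ShiftedArrays, CUDA, Adapt
using ShiftedArrays: shifts

# swap the inner CuArray for its device-side counterpart, keeping the wrapper
Adapt.adapt_structure(to, x::CircShiftedArray) =
    CircShiftedArray(adapt(to, parent(x)), shifts(x))

# let the CUDA broadcast style claim broadcasts over the wrapper
Base.Broadcast.BroadcastStyle(::Type{T}) where {T<:CircShiftedArray{<:Any,<:Any,<:CuArray}} =
    CUDA.CuArrayStyle{ndims(T)}()

end
```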
Thanks for your hints. Indeed, I did not consider the …
I think the first thing to do is to turn the snippet from #66 (comment) (and the analogous one for …) into … I imagine that if you need the extra performance of this PR, you could start developing a FastShiftedArraysBroadcast.jl package (essentially a copy-paste of this PR) that "pirates" the broadcasting methods, and start battle-testing it in your projects and polishing it. Over time we can figure out whether the code-complexity / performance trade-off is worth it.
OK. I already wrote that code and I am happy to open a new PR, but only in about two weeks, if that's OK.
Sounds great!
This is achieved by implementing broadcasting for CircShiftedArray. Note that all tests pass except for the @inferred macro. Also, show() was overloaded in such a way that a wrapped CuArray displays without an error or warning, but the package does not need to depend on CUDA.jl. Where possible, the base circshift is used upon materialization.
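The description above leaves the mechanics implicit; here is a hedged sketch of what the last two points could look like (not necessarily the PR's exact code):

```julia
using ShiftedArrays, Adapt
using ShiftedArrays: shifts

# Materialize via Base.circshift, which has fast specialized methods
# (including a GPU one when the parent is a CuArray):
Base.collect(x::CircShiftedArray) = circshift(parent(x), shifts(x))

# Display without scalar indexing and without depending on CUDA.jl: copy the
# parent to the CPU first (Array(::CuArray) is provided by CUDA.jl itself),
# then defer to the generic AbstractArray printing via `invoke` so this
# method does not recurse into itself.
function Base.show(io::IO, mime::MIME"text/plain", x::CircShiftedArray)
    cpu = CircShiftedArray(Array(parent(x)), shifts(x))
    invoke(show, Tuple{IO, MIME"text/plain", AbstractArray}, io, mime, cpu)
end
```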