
Inverse Complex-to-Real FFT allocates GPU memory #2249

Closed
navdeeprana opened this issue Jan 23, 2024 · 5 comments

Describe the bug

Inverse Complex-to-Real FFT allocates GPU memory, whereas inverse Complex-to-Complex FFT does not.

To reproduce

The Minimal Working Example (MWE) for this bug:

using AbstractFFTs, CUDA, LinearAlgebra
CUDA.allowscalar(false)

u = CuArray(rand(512,512))
uk = rfft(u)
pfor = plan_rfft(u)
pinv = plan_irfft(uk, 512)
mul!(u, pinv, uk)
println("Complex-to-Real")
CUDA.@time mul!(u, pinv, uk);

u = CuArray(rand(ComplexF64,512,512))
uk = fft(u)
pfor = plan_fft(u)
pinv = plan_ifft(uk)
mul!(u, pinv, uk)
println("Complex-to-Complex")
CUDA.@time mul!(u, pinv, uk);
Complex-to-Real
  0.000091 seconds (20 CPU allocations: 800 bytes) (1 GPU allocation: 2.008 MiB, 13.43% memmgmt time)
Complex-to-Complex
  0.000168 seconds (132 CPU allocations: 11.141 KiB)
Manifest.toml

CUDA v5.1.2
GPUCompiler v0.25.0
LLVM v6.4.2

Expected behavior

No allocations?

Version info

Details on Julia:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 2 on 48 virtual cores
Environment:
  JULIA_DEPOT_PATH = /data.lmp/nrana/.julia
  JULIA_NUM_THREADS = 1

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 510.108.3, originally for CUDA 11.6

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 11.0.0+510.108.3

Julia packages: 
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

4 devices:
  0: NVIDIA A100-PCIE-40GB (sm_80, 37.391 GiB / 40.000 GiB available)
  1: NVIDIA A100-PCIE-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  2: NVIDIA A100-PCIE-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  3: NVIDIA A100-PCIE-40GB (sm_80, 38.363 GiB / 40.000 GiB available)


navdeeprana added the bug label Jan 23, 2024
@maleadt
Member

maleadt commented Jan 23, 2024

Known and expected; this is a bug in CUFFT, and NVIDIA has since updated the documentation to indicate that these operations are expected to mutate their inputs, so we need to take a copy of the input. That copy is the GPU allocation you are seeing.
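
As a quick check of that behaviour, here is a minimal sketch (assuming CUDA.jl keeps taking the defensive copy described above): the complex input survives the inverse C2R transform, at the cost of one GPU allocation per call.

using AbstractFFTs, CUDA, LinearAlgebra

u    = CuArray(rand(512, 512))
uk   = rfft(u)
uk0  = copy(uk)              # reference copy of the complex input
pinv = plan_irfft(uk, 512)

mul!(u, pinv, uk)            # CUDA.jl copies uk internally before handing it to CUFFT
norm(uk - uk0)               # ≈ 0: the original input is left untouched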

maleadt closed this as completed Jan 23, 2024
maleadt removed the bug label Jan 23, 2024
@jipolanco
Contributor

Sorry for commenting on this closed issue, but in some applications one does not need the input data after the transform has been computed.

Would it make sense to add an "advanced" interface allowing a user to explicitly specify that they are OK with CUFFT overwriting input arrays? For example, by adding an optional keyword argument to plan_{r,br,ir}fft, something like allow_overwriting_input::Bool, which would default to false.

I can make a PR with the changes if that's an acceptable solution.
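
For illustration, a minimal sketch of how such a keyword could look from the caller's side (allow_overwriting_input is only the name suggested above, not an existing CUDA.jl API):

pinv = plan_irfft(uk, 512; allow_overwriting_input = true)  # hypothetical keyword
mul!(u, pinv, uk)   # no defensive copy, hence no per-call GPU allocation; uk may be clobbered by CUFFT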

@maleadt
Member

maleadt commented Oct 7, 2024

I think that would be fine. Maybe it would make sense to coordinate such a change with AbstractFFTs.jl, though; @stevengj, does this kind of problem (where computing an FFT mutates its inputs) happen with other FFT back-ends as well?

@jipolanco
Contributor

jipolanco commented Oct 7, 2024

Thanks for your answer. I agree, this would be better coordinated at the level of AbstractFFTs.jl.

Just note that the mutating behaviour of CUFFT on complex-to-real transforms also exists in FFTW:

[...] As noted above, the c2r transform destroys its input array even for out-of-place transforms. This can be prevented, if necessary, by including FFTW_PRESERVE_INPUT in the flags, with unfortunately some sacrifice in performance. This flag is also not currently supported for multi-dimensional real DFTs (next section).

In FFTW.jl this is also the case when using the non-allocating interface (mul! / ldiv!):

using FFTW
using LinearAlgebra

û = rand(ComplexF64, 21, 30)
û_orig = copy(û)
# p = plan_brfft(û, 40; flags = FFTW.PRESERVE_INPUT)  # only works for 1D inputs
p = plan_brfft(û, 40)

v = p * û         # allocating product; always preserves the input
norm(û - û_orig)  # = 0 (input preserved)

mul!(v, p, û)     # destroys input
norm(û - û_orig)  # ≠ 0 (input was modified)
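
For completeness, a small follow-up sketch reusing the variables above: copying the input by hand before the non-allocating mul! restores the preserving behaviour, at the cost of exactly the kind of allocation discussed in this issue.

û .= û_orig            # restore the input clobbered by the previous mul!
mul!(v, p, copy(û))    # FFTW destroys the temporary copy instead of û
norm(û - û_orig)       # = 0 (input preserved again)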

@maleadt
Member

maleadt commented Oct 7, 2024

In FFTW.jl this is also the case when using the non-allocating interface (mul! / ldiv!)

In that case, I guess we shouldn't default to making a preserving copy unless the user requested that on plan creation? That would be a breaking change, though.
