
Inverse Complex-to-Real FFT allocates GPU memory #2249

Closed
navdeeprana opened this issue Jan 23, 2024 · 5 comments

Describe the bug

Inverse Complex-to-Real FFT allocates GPU memory, whereas inverse Complex-to-Complex FFT does not.

To reproduce

The Minimal Working Example (MWE) for this bug:

using AbstractFFTs, CUDA, LinearAlgebra
CUDA.allowscalar(false)

u = CuArray(rand(512,512))
uk = rfft(u)
pfor = plan_rfft(u)
pinv = plan_irfft(uk, 512)
mul!(u, pinv, uk)
println("Complex-to-Real")
CUDA.@time mul!(u, pinv, uk);

u = CuArray(rand(ComplexF64,512,512))
uk = fft(u)
pfor = plan_fft(u)
pinv = plan_ifft(uk)
mul!(u, pinv, uk)
println("Complex-to-Complex")
CUDA.@time mul!(u, pinv, uk);
Complex-to-Real
  0.000091 seconds (20 CPU allocations: 800 bytes) (1 GPU allocation: 2.008 MiB, 13.43% memmgmt time)
Complex-to-Complex
  0.000168 seconds (132 CPU allocations: 11.141 KiB)
Manifest.toml

CUDA v5.1.2
GPUCompiler v0.25.0
LLVM v6.4.2

Expected behavior

No allocations?

Version info

Details on Julia:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 2 on 48 virtual cores
Environment:
  JULIA_DEPOT_PATH = /data.lmp/nrana/.julia
  JULIA_NUM_THREADS = 1

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 510.108.3, originally for CUDA 11.6

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 11.0.0+510.108.3

Julia packages: 
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

4 devices:
  0: NVIDIA A100-PCIE-40GB (sm_80, 37.391 GiB / 40.000 GiB available)
  1: NVIDIA A100-PCIE-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  2: NVIDIA A100-PCIE-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  3: NVIDIA A100-PCIE-40GB (sm_80, 38.363 GiB / 40.000 GiB available)


navdeeprana added the bug label Jan 23, 2024
@maleadt
Member

maleadt commented Jan 23, 2024

Known and expected; this is a bug in CUFFT, and NVIDIA has since updated the documentation to indicate that these operations are expected to mutate their inputs, so we need to take a copy of the input. That copy is the GPU allocation you are seeing.
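
As a quick check of that behaviour, here is a minimal sketch (assuming CUDA.jl keeps taking the defensive copy described above): the complex input survives the inverse C2R transform, at the cost of one GPU allocation per call.

using AbstractFFTs, CUDA, LinearAlgebra

u    = CuArray(rand(512, 512))
uk   = rfft(u)
uk0  = copy(uk)              # reference copy of the complex input
pinv = plan_irfft(uk, 512)

mul!(u, pinv, uk)            # CUDA.jl copies uk internally before handing it to CUFFT
norm(uk - uk0)               # ≈ 0: the original input is left untouched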

maleadt closed this as completed Jan 23, 2024
maleadt removed the bug label Jan 23, 2024
@jipolanco
Contributor

Sorry for commenting on this closed issue, but in some applications one does not need the input data after the transform has been computed.

Would it make sense to add an "advanced" interface allowing a user to explicitly specify that they are OK with CUFFT overwriting input arrays? For example, by adding an optional keyword argument to plan_{r,br,ir}fft, something like allow_overwriting_input::Bool, which would default to false.

I can make a PR with the changes if that's an acceptable solution.
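
For illustration, a minimal sketch of how such a keyword could look from the caller's side (allow_overwriting_input is only the name suggested above, not an existing CUDA.jl API):

pinv = plan_irfft(uk, 512; allow_overwriting_input = true)  # hypothetical keyword
mul!(u, pinv, uk)   # no defensive copy, hence no per-call GPU allocation; uk may be clobbered by CUFFT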

@maleadt
Member

maleadt commented Oct 7, 2024

I think that would be fine. Maybe it would make sense to coordinate such a change with AbstractFFTs.jl, though; @stevengj, does this kind of problem (where computing an FFT mutates its inputs) happen with other FFT back-ends as well?

@jipolanco
Contributor

jipolanco commented Oct 7, 2024

Thanks for your answer. I agree, this would be better coordinated at the level of AbstractFFTs.jl.

Just note that the mutating behaviour of CUFFT on complex-to-real transforms also exists in FFTW:

[...] As noted above, the c2r transform destroys its input array even for out-of-place transforms. This can be prevented, if necessary, by including FFTW_PRESERVE_INPUT in the flags, with unfortunately some sacrifice in performance. This flag is also not currently supported for multi-dimensional real DFTs (next section).

In FFTW.jl this is also the case when using the non-allocating interface (mul! / ldiv!):

using FFTW
using LinearAlgebra

û = rand(ComplexF64, 21, 30)
û_orig = copy(û)
# p = plan_brfft(û, 40; flags = FFTW.PRESERVE_INPUT)  # only works for 1D inputs
p = plan_brfft(û, 40)

v = p * û         # allocating product; always preserves the input
norm(û - û_orig)  # = 0 (input preserved)

mul!(v, p, û)     # destroys input
norm(û - û_orig)  # ≠ 0 (input was modified)
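
For completeness, a small follow-up sketch reusing the variables above: copying the input by hand before the non-allocating mul! restores the preserving behaviour, at the cost of exactly the kind of allocation discussed in this issue.

û .= û_orig            # restore the input clobbered by the previous mul!
mul!(v, p, copy(û))    # FFTW destroys the temporary copy instead of û
norm(û - û_orig)       # = 0 (input preserved again)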

@maleadt
Member

maleadt commented Oct 7, 2024

In FFTW.jl this is also the case when using the non-allocating interface (mul! / ldiv!)

In that case, I guess we shouldn't default to making a preserving copy unless the user requested that on plan creation? That would be a breaking change, though.
