Improve efficiency of NFFT direct transformation in one dimension. #142
Small change to improve the efficiency of the direct NFFT trafo in one dimension:

- Instead of using `memset` to initialize the target vector with zeros, it's better to just write the final value to the target in the outer loop. Rationale: Using `memset` requires an extra pass over the entire target vector. This can be costly if the vector is large and does not fit inside the CPU cache.
- Instead of updating `f[j]` in the inner loop in each iteration, it may be better to accumulate the value in a local variable and write only the final value to `f[j]`. Rationale: This may reduce potentially slow memory accesses to `f[j]`, but if `f[j]` is in the CPU cache and/or the compiler is smart, it may not make any difference.
- Instead of using `cexp` to calculate e^{-i*omega}, use the real-valued `sin` and `cos` functions. Rationale: `cexp` supports complex-valued arguments, but the actual argument is always purely imaginary.

It was difficult to test this one because there's no simple benchmark I could quickly run. Also, I tested this on arm64/v8, where `cycle.h` currently doesn't work for me, so I had to set up a scratch file (not part of this PR) to run a quick check. On my platform, the number of cycles for the direct transform drops to 60-80% of what it was before.

It would be good if someone could test this separately and on a different architecture (e.g. amd64) as well.