Output driven parallelism #663


Open · DiamonDinoia wants to merge 79 commits into master
Conversation

DiamonDinoia (Collaborator) commented Apr 21, 2025

  • Output-driven initial implementation (see the illustrative gather-style sketch below)
  • Bin-size tuning
  • Parameter tuning
  • Optimized interpolation
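For readers unfamiliar with the term, the sketch below illustrates the general gather-style idea behind output-driven spreading: each output grid cell is owned by exactly one thread, which accumulates contributions from nearby nonuniform points, so no atomic updates are needed. This is only a simplified 1D illustration with assumed names (`spread_output_driven_1d`, separate real/imaginary arrays, a placeholder Gaussian weight); it is not the PR's implementation, which additionally bins points so each cell only scans its neighbouring bins.

```cuda
// Illustrative only: a naive 1D "output driven" spreading kernel.
// One thread per output cell gathers from every point inside the window,
// so each output location is written by a single thread (no atomics).
// A real implementation bins/sorts the points so the inner loop only
// visits nearby candidates instead of all M points.
template<typename T>
__global__ void spread_output_driven_1d(int N, int M, const T *kx,
                                        const T *c_re, const T *c_im,
                                        T *u_re, T *u_im, int ns) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x; // output cell index
    if (i >= N) return;
    T acc_re = 0, acc_im = 0;
    const T half_width = ns / (T)2;
    for (int j = 0; j < M; ++j) {
        const T dx = kx[j] - (T)i;           // distance from point to cell centre
        if (fabs(dx) <= half_width) {        // inside the spreading window
            const T w = exp(-dx * dx);       // placeholder kernel weight
            acc_re += w * c_re[j];
            acc_im += w * c_im[j];
        }
    }
    u_re[i] = acc_re;                        // each cell written by exactly one thread
    u_im[i] = acc_im;
}
```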

DiamonDinoia (Collaborator, Author) commented May 5, 2025

Performance summary:

| dim | type | method      | mean (ms) | NU pts/s |
|----:|-----:|:------------|----------:|---------:|
| 1   | 1    | 1           | 51.785    | 1.93e+09 |
| 1   | 1    | 2           | 47.124    | 2.12e+09 |
| 1   | 1    | 3           | 89.269    | 1.12e+09 |
| 1   | 2    | 1           | 58.341    | 1.71e+09 |
| 2   | 1    | 1           | 346.492   | 2.89e+08 |
| 2   | 1    | 2           | 239.713   | 4.17e+08 |
| 2   | 1    | 3           | 103.491   | 9.66e+08 |
| 2   | 2    | 1           | 97.006    | 1.03e+09 |
| 2   | 2    | 2           | 96.725    | 1.03e+09 |
| 3   | 1    | 1           | 2780.466  | 3.60e+07 |
| 3   | 1    | 2           | 10323.879 | 9.69e+06 |
| 3   | 1    | 3           | 769.989   | 1.30e+08 |
| 3   | 2    | 1 (tweaked) | 660.690   | 1.51e+08 |
| 3   | 2    | 1 (master)  | 1913.930  | 5.22e+07 |
| 3   | 2    | 2           | 1140.149  | 8.77e+07 |

DiamonDinoia (Collaborator, Author) commented
[image attached]

blackwer (Member) commented May 6, 2025 via email

ahbarnett added this to the 2.5 milestone May 27, 2025
ahbarnett mentioned this pull request Apr 30, 2025
DiamonDinoia (Collaborator, Author) commented
@blackwer, @janden can you review?

* multiple threads, improving cache efficiency and reducing memory latency.
*/
template<typename T> __device__ __forceinline__ T loadReadOnly(const T *ptr) {
#ifdef __CUDA_ARCH__
DiamonDinoia (Collaborator, Author) commented
replace with nvcc
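For context, a plausible completion of the truncated helper above, assuming it wraps `__ldg` (the CUDA read-only data cache load intrinsic) when compiled for the device; the fallback branch and the exact guard are assumptions, and the review note suggests keying the guard on the compiler (nvcc) rather than on `__CUDA_ARCH__`:

```cuda
// Hypothetical completion for illustration; the PR's actual body may differ.
// __ldg() routes loads through the read-only cache on supporting devices;
// the other compilation pass falls back to a plain dereference. T is assumed
// to be one of the built-in types __ldg is overloaded for
// (float, double, float2, double2, int, ...).
template<typename T> __device__ __forceinline__ T loadReadOnly(const T *ptr) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    return __ldg(ptr);   // read-only cache load (SM 3.5+)
#else
    return *ptr;         // fallback
#endif
}
```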

blackwer (Member) left a comment
This is a lot to take in with just code review. Once the issue with #701 is resolved, and the corresponding raw index lookups in method 3 are handled, I think we should merge it as an experimental feature. It barely touches old code paths, so I doubt any existing functionality will be affected. Any docs and code notes should be updated to reflect that there is a new method. I noticed cuperftest had a stale reference to method 4 and didn't mention 3 at all. I'm about to create a separate PR for independent changes to cuperftest, so don't worry about touching that.

threadsPerBlock.x = std::min(256u, (unsigned)M);
threadsPerBlock.y = 1;
blocks.x = (M + threadsPerBlock.x - 1) / threadsPerBlock.x;
blocks.y = 1;
blackwer (Member) commented
This seems high from my older tests, where I generally found 64/128 reasonable. Is this more targeted for newer hardware?

DiamonDinoia (Collaborator, Author) commented
Even on my laptop this is what gives the best performance. I'm not sure about older GPUs; maybe it's worth having a macro that depends on __CUDA_ARCH__? Do we have older GPUs to test this on?

DiamonDinoia (Collaborator, Author) commented Jun 20, 2025
/**
 * Return an architecture-specific “good enough” thread-block size.
 * – Each branch is resolved at compile time (if constexpr + __CUDA_ARCH__).
 * – Host-only translation units get the fallback value.
 * Rationale (rule of thumb):
 *   SM 8x/9x : 256 threads (8 warps)
 *   SM 7x    : 128 threads (4 warps)
 *   SM 6x-   :  64 threads (2 warps)
 */
constexpr int optimal_block_threads() noexcept
{
#if defined(__CUDA_ARCH__)
    if constexpr (__CUDA_ARCH__ >= 800)        // Ampere (SM 80/86) / Hopper (SM 90+)
        return 256;                            // 8 warps
    else if constexpr (__CUDA_ARCH__ >= 700)   // Volta/Turing (SM 70-75)
        return 128;                            // 4 warps
    else
        return 64;                             // 2 warps
#else
    // Host compilation pass: no device architecture available, return 0 as a sentinel
    return 0;
#endif
}

blackwer (Member) commented
We have some V100s and A100s to test on if you want. This seems like a reasonable enough heuristic, though.

DiamonDinoia (Collaborator, Author) commented
I implemented it with one change: since this value needs to be known in CPU code rather than GPU code, I used the runtime API.
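A minimal sketch of the runtime-API approach described here; the function name is illustrative (not the PR's actual code), and the thresholds simply mirror the compile-time heuristic above:

```cuda
#include <cuda_runtime.h>

// Illustrative host-side helper: query the compute capability through the
// CUDA runtime so the launch configuration can be chosen from CPU code,
// mirroring the compile-time heuristic above.
inline int optimal_block_threads_host(int device = 0) {
    int major = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    if (major >= 8) return 256;  // Ampere / Hopper: 8 warps
    if (major >= 7) return 128;  // Volta / Turing:  4 warps
    return 64;                   // older architectures: 2 warps
}
```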

Comment on lines 157 to 163
const int ix = xstart + idx + ns_2;
// separable window weights
const auto kervalue = window_vals(i, idx);

// accumulate
const cuda_complex<T> res{cnow.x * kervalue, cnow.y * kervalue};
u_local[ix] += res;
blackwer (Member) commented
This code block can segfault, as per the discussion in #701. Remediation probably depends on the solution to that issue.
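For concreteness only, a hedged sketch of the kind of bounds guard that could prevent the out-of-range accumulate; `u_local_size` is a hypothetical name for the local buffer length, and, as noted above, the actual remediation depends on how #701 is resolved:

```cuda
// Hypothetical guard, for illustration only; assumes a local buffer
// u_local of length u_local_size. The real fix depends on issue #701.
const int ix = xstart + idx + ns_2;
if (ix >= 0 && ix < u_local_size) {
    const auto kervalue = window_vals(i, idx);
    const cuda_complex<T> res{cnow.x * kervalue, cnow.y * kervalue};
    u_local[ix] += res;
}
```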

DiamonDinoia (Collaborator, Author) commented
@ahbarnett, @blackwer
I addressed the review comments. The algorithm documentation is correct; I can answer questions at the next meeting.
gpu_np / np is the only variable name that might not be 100% cufinufft-style, but batch size and nupts are already taken.

DiamonDinoia requested a review from blackwer June 20, 2025 18:34
ahbarnett self-requested a review June 24, 2025 19:45