Output driven parallelism #663


Open · DiamonDinoia wants to merge 79 commits into master
Conversation

DiamonDinoia (Collaborator) commented Apr 21, 2025

  • Output-driven initial implementation (see the illustrative gather-style sketch below)
  • Bin-size tuning
  • Parameter tuning
  • Optimized interpolation
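For readers unfamiliar with the term, the sketch below illustrates the general gather-style idea behind output-driven spreading: each output grid cell is owned by exactly one thread, which accumulates contributions from nearby nonuniform points, so no atomic updates are needed. This is only a simplified 1D illustration with assumed names (`spread_output_driven_1d`, separate real/imaginary arrays, a placeholder Gaussian weight); it is not the PR's implementation, which additionally bins points so each cell only scans its neighbouring bins.

```cuda
// Illustrative only: a naive 1D "output driven" spreading kernel.
// One thread per output cell gathers from every point inside the window,
// so each output location is written by a single thread (no atomics).
// A real implementation bins/sorts the points so the inner loop only
// visits nearby candidates instead of all M points.
template<typename T>
__global__ void spread_output_driven_1d(int N, int M, const T *kx,
                                        const T *c_re, const T *c_im,
                                        T *u_re, T *u_im, int ns) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x; // output cell index
    if (i >= N) return;
    T acc_re = 0, acc_im = 0;
    const T half_width = ns / (T)2;
    for (int j = 0; j < M; ++j) {
        const T dx = kx[j] - (T)i;           // distance from point to cell centre
        if (fabs(dx) <= half_width) {        // inside the spreading window
            const T w = exp(-dx * dx);       // placeholder kernel weight
            acc_re += w * c_re[j];
            acc_im += w * c_im[j];
        }
    }
    u_re[i] = acc_re;                        // each cell written by exactly one thread
    u_im[i] = acc_im;
}
```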

DiamonDinoia (Collaborator, Author) commented May 5, 2025

Performance summary:

| dim | type | method      | mean (ms) | NU pts/s |
|----:|-----:|:------------|----------:|---------:|
| 1   | 1    | 1           | 51.785    | 1.93e+09 |
| 1   | 1    | 2           | 47.124    | 2.12e+09 |
| 1   | 1    | 3           | 89.269    | 1.12e+09 |
| 1   | 2    | 1           | 58.341    | 1.71e+09 |
| 2   | 1    | 1           | 346.492   | 2.89e+08 |
| 2   | 1    | 2           | 239.713   | 4.17e+08 |
| 2   | 1    | 3           | 103.491   | 9.66e+08 |
| 2   | 2    | 1           | 97.006    | 1.03e+09 |
| 2   | 2    | 2           | 96.725    | 1.03e+09 |
| 3   | 1    | 1           | 2780.466  | 3.60e+07 |
| 3   | 1    | 2           | 10323.879 | 9.69e+06 |
| 3   | 1    | 3           | 769.989   | 1.30e+08 |
| 3   | 2    | 1 (tweaked) | 660.690   | 1.51e+08 |
| 3   | 2    | 1 (master)  | 1913.930  | 5.22e+07 |
| 3   | 2    | 2           | 1140.149  | 8.77e+07 |

DiamonDinoia (Collaborator, Author) commented
[image attached]

blackwer (Member) commented May 6, 2025 via email

ahbarnett added this to the 2.5 milestone May 27, 2025
ahbarnett mentioned this pull request Apr 30, 2025
DiamonDinoia (Collaborator, Author) commented
@blackwer, @janden can you review?

* multiple threads, improving cache efficiency and reducing memory latency.
*/
template<typename T> __device__ __forceinline__ T loadReadOnly(const T *ptr) {
#ifdef __CUDA_ARCH__
DiamonDinoia (Collaborator, Author) commented
replace with nvcc
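For context, a plausible completion of the truncated helper above, assuming it wraps `__ldg` (the CUDA read-only data cache load intrinsic) when compiled for the device; the fallback branch and the exact guard are assumptions, and the review note suggests keying the guard on the compiler (nvcc) rather than on `__CUDA_ARCH__`:

```cuda
// Hypothetical completion for illustration; the PR's actual body may differ.
// __ldg() routes loads through the read-only cache on supporting devices;
// the other compilation pass falls back to a plain dereference. T is assumed
// to be one of the built-in types __ldg is overloaded for
// (float, double, float2, double2, int, ...).
template<typename T> __device__ __forceinline__ T loadReadOnly(const T *ptr) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    return __ldg(ptr);   // read-only cache load (SM 3.5+)
#else
    return *ptr;         // fallback
#endif
}
```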

blackwer (Member) left a comment
This is a lot to take in with just code review. Once the issue with #701 is resolved, and the corresponding raw index lookups in method 3 are handled, I think we should merge it as an experimental feature. It barely touches old code paths, so I doubt any existing functionality will be affected. Any docs and code notes should be updated to reflect that there is a new method. I noticed cuperftest had a stale reference to method 4 and didn't mention 3 at all. I'm about to create a separate PR for independent changes to cuperftest, so don't worry about touching that.

threadsPerBlock.x = std::min(256u, (unsigned)M);
threadsPerBlock.y = 1;
blocks.x = (M + threadsPerBlock.x - 1) / threadsPerBlock.x;
blocks.y = 1;
blackwer (Member) commented
This seems high from my older tests, where I generally found 64/128 reasonable. Is this more targeted for newer hardware?

DiamonDinoia (Collaborator, Author) commented
Even on my laptop this is what gives the best performance. I'm not sure about older GPUs; maybe it's worth having a macro that depends on __CUDA_ARCH__? Do we have older GPUs to test this on?

DiamonDinoia (Collaborator, Author) commented Jun 20, 2025
/**
 * Return an architecture-specific “good enough” thread-block size.
 * – Each branch is resolved at compile time (if constexpr + __CUDA_ARCH__).
 * – Host-only translation units get the fallback value.
 * Rationale (rule of thumb):
 *   SM 8x/9x : 256 threads (8 warps)
 *   SM 7x    : 128 threads (4 warps)
 *   SM 6x-   :  64 threads (2 warps)
 */
constexpr int optimal_block_threads() noexcept
{
#if defined(__CUDA_ARCH__)
    if constexpr (__CUDA_ARCH__ >= 800)        // Ampere (SM 80/86) / Hopper (SM 90+)
        return 256;                            // 8 warps
    else if constexpr (__CUDA_ARCH__ >= 700)   // Volta/Turing (SM 70-75)
        return 128;                            // 4 warps
    else
        return 64;                             // 2 warps
#else
    // Host compilation pass: no device architecture available, return 0 as a sentinel
    return 0;
#endif
}

blackwer (Member) commented
We have some V100s and A100s to test on if you want. This seems like a reasonable enough heuristic, though.

DiamonDinoia (Collaborator, Author) commented
I implemented it with one change: since this value needs to be known in CPU code rather than GPU code, I used the runtime API.
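A minimal sketch of the runtime-API approach described here; the function name is illustrative (not the PR's actual code), and the thresholds simply mirror the compile-time heuristic above:

```cuda
#include <cuda_runtime.h>

// Illustrative host-side helper: query the compute capability through the
// CUDA runtime so the launch configuration can be chosen from CPU code,
// mirroring the compile-time heuristic above.
inline int optimal_block_threads_host(int device = 0) {
    int major = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    if (major >= 8) return 256;  // Ampere / Hopper: 8 warps
    if (major >= 7) return 128;  // Volta / Turing:  4 warps
    return 64;                   // older architectures: 2 warps
}
```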

Comment on lines 157 to 163
const int ix = xstart + idx + ns_2;
// separable window weights
const auto kervalue = window_vals(i, idx);

// accumulate
const cuda_complex<T> res{cnow.x * kervalue, cnow.y * kervalue};
u_local[ix] += res;
blackwer (Member) commented
This code block can segfault, as per the discussion in #701. Remediation probably depends on the solution to that issue.
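For concreteness only, a hedged sketch of the kind of bounds guard that could prevent the out-of-range accumulate; `u_local_size` is a hypothetical name for the local buffer length, and, as noted above, the actual remediation depends on how #701 is resolved:

```cuda
// Hypothetical guard, for illustration only; assumes a local buffer
// u_local of length u_local_size. The real fix depends on issue #701.
const int ix = xstart + idx + ns_2;
if (ix >= 0 && ix < u_local_size) {
    const auto kervalue = window_vals(i, idx);
    const cuda_complex<T> res{cnow.x * kervalue, cnow.y * kervalue};
    u_local[ix] += res;
}
```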

DiamonDinoia (Collaborator, Author) commented
@ahbarnett, @blackwer
I addressed the review comments. The algorithm documentation is correct; I can answer questions at the next meeting.
gpu_np / np is the only variable name that might not be 100% cufinufft-style, but batch size and nupts are already taken.

DiamonDinoia requested a review from blackwer June 20, 2025 18:34
ahbarnett self-requested a review June 24, 2025 19:45