Output driven parallelism #663
Conversation
* Fixed test timing issue
* Some tests in CMake were asking for too many digits given the size of the transform
* Cufinufft suffered from an off-by-one error since the original implementation. Fixed for now.
…driven-parallelism
Performance summary:
Just noting that I am unavailable for reviews until about May 15. Looking forward to seeing the performance when back in office!
(Quoting DiamonDinoia's comment on #663, which attached a performance-summary image: https://github.com/user-attachments/assets/015186df-b704-4e06-a1e7-90dd1358fe94)
include/cufinufft/intrinsics.h
Outdated
 * multiple threads, improving cache efficiency and reducing memory latency.
 */
template<typename T> __device__ __forceinline__ T loadReadOnly(const T *ptr) {
#ifdef __CUDA_ARCH__
replace with nvcc
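For context, a minimal sketch of how such a read-only-cache load can be guarded (an assumption based on the signature shown in the diff, not the PR's actual body; __ldg needs a supported element type and SM 3.5+):

template<typename T> __device__ __forceinline__ T loadReadOnly(const T *ptr) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
  // Device compilation pass: route the load through the read-only (texture) cache.
  // T must be one of the types __ldg is overloaded for (e.g. float, double, float2).
  return __ldg(ptr);
#else
  // Host compilation pass (no __CUDA_ARCH__): fall back to a plain load so the
  // header still parses cleanly.
  return *ptr;
#endif
}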
This is a lot to take in with just code review. Once the issue with #701 is resolved, and the corresponding raw index lookups in method 3 are handled, I think we should merge it as an experimental feature. It barely touches old codepaths, so I doubt any existing functionality will be affected. Any docs and code notes should be updated to reflect that there is a new method. I noticed cuperftest had a stale reference to method 4, and didn't mention 3 at all. I'm about to create a separate PR for independent changes to cuperftest, so don't worry about touching that.
src/cuda/1d/interp1d_wrapper.cu
Outdated
threadsPerBlock.y = 1;
blocks.x = (M + threadsPerBlock.x - 1) / threadsPerBlock.x;
blocks.y = 1;
threadsPerBlock.x = std::min(256u, (unsigned)M);
This seems high from my older tests, where I generally found 64/128 reasonable. Is this more targeted for newer hardware?
Even on my laptop this is what gives the best performance. Not sure about older GPUs, maybe worth having a macro that depends on __CUDA_ARCH__? Do we have older GPUs to test this on?
/**
 * Return an architecture-specific “good enough” thread-block size.
 * – Each branch is resolved at compile time (if-constexpr + __CUDA_ARCH__).
 * – Host compilation passes (no __CUDA_ARCH__) get a sentinel value of 0.
 * Rationale (rule-of-thumb):
 *   SM 9x / 8x : 16 warps = 256 threads
 *   SM 7x      :  8 warps = 128 threads
 *   SM 6x-     :  4 warps =  64 threads
 */
constexpr int optimal_block_threads() noexcept
{
#if defined(__CUDA_ARCH__)
  if constexpr (__CUDA_ARCH__ >= 800)      // Ampere (SM 80/86) / Hopper (SM 90+)
    return 256;                            // 16 warps
  else if constexpr (__CUDA_ARCH__ >= 700) // Volta / Turing (SM 70-75)
    return 128;                            // 8 warps
  else
    return 64;                             // 4 warps
#else
  // Host code path – architecture unknown at compile time; return 0 as a sentinel.
  return 0;
#endif
}
We have some v100s and a100s to test if you want. This seems like a reasonable enough heuristic though
I implemented it with one change: since this value needs to be known in CPU code, not GPU code, I used a runtime API.
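A minimal sketch of that approach (a hypothetical helper, not the PR's actual code), assuming the CUDA runtime API is used to query the compute capability from host code:

#include <cuda_runtime.h>

// Hypothetical host-side helper: choose a thread-block size from the device's
// compute capability, queried at runtime instead of via __CUDA_ARCH__.
inline int optimal_block_threads(int device = 0) {
  int major = 0;
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  if (major >= 8) return 256; // Ampere / Hopper: 16 warps
  if (major >= 7) return 128; // Volta / Turing:   8 warps
  return 64;                  // older parts:      4 warps
}

The launch configuration above could then use something like threadsPerBlock.x = std::min((unsigned)optimal_block_threads(), (unsigned)M).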
src/cuda/1d/spreadinterp1d.cuh
Outdated
const int ix = xstart + idx + ns_2;
// separable window weights
const auto kervalue = window_vals(i, idx);

// accumulate
const cuda_complex<T> res{cnow.x * kervalue, cnow.y * kervalue};
u_local[ix] += res;
This code block can segfault, as per the discussion in #701. Remediation probably depends on the solution to that issue.
@ahbarnett, @blackwer
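For illustration only (the actual remediation is deferred to #701), the kind of guard such an accumulate would need is a bounds check before the write; u_local_size below is an assumed name for the extent of u_local, not taken from the PR:

// Sketch: skip out-of-range indices instead of writing past the end of u_local.
if (ix >= 0 && ix < u_local_size) {
  const cuda_complex<T> res{cnow.x * kervalue, cnow.y * kervalue};
  u_local[ix] += res;
}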
…driven-parallelism