-
Notifications
You must be signed in to change notification settings - Fork 222
c.parallel: enable UBLKCP in transform #4847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test 5d98349 |
cc @bernhardmgruber for an initial look-over |
/ok to test 298eb95 |
🟨 CI finished in 2h 33m: Pass: 92%/188 | Total: 3d 15h | Avg: 28m 03s | Max: 1h 25m | Hits: 87%/273863
|
Project | |
---|---|
+/- | CCCL Infrastructure |
libcu++ | |
+/- | CUB |
Thrust | |
+/- | CUDA Experimental |
stdpar | |
python | |
+/- | CCCL C Parallel Library |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
+/- | CCCL Infrastructure |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | stdpar |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 188)
# | Runner |
---|---|
129 | linux-amd64-cpu16 |
15 | windows-amd64-cpu16 |
12 | linux-arm64-cpu16 |
12 | linux-amd64-gpu-rtxa6000-latest-1 |
11 | linux-amd64-gpu-rtx2080-latest-1 |
6 | linux-amd64-gpu-h100-latest-1 |
3 | linux-amd64-gpu-rtx4090-latest-1 |
/ok to test 2b4b62b |
This reverts commit 2b4b62b.
_CCCL_ASSERT(reinterpret_cast<uintptr_t>(dst) % alignof(T) == 0, ""); | ||
|
||
const int bytes_to_copy = static_cast<int>(sizeof(T)) * tile_size; | ||
cuda::memcpy_async(this_thread_block(), dst, src, bytes_to_copy, pipe); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely need to benchmark this new fallback path. I remember I chose the CG version for a reason. I could be that it worked better for badly aligned inputs or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked at the PTX generated for int sized data and it was identical (actually, one SHL was hoisted out of the loop in our case, and was inside the loop in the CG case for some reason), but yes, agreed. The reason I'm changing this is that we do have the libcu++ versions readily available and redistributable, whereas using CG in the runtime compilation in c.parallel would require a fair number of additional hoops to jump through.
template <Algorithm AlgorithmV, int BlockThreads, int MinBif, int BulkCopyAlignment> | ||
struct transform_agent_policy_t | ||
{ | ||
static constexpr int block_threads = BlockThreads; | ||
static constexpr Algorithm algorithm = AlgorithmV; | ||
static constexpr int min_bif = MinBif; | ||
static constexpr int block_threads = BlockThreads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, please do not unify the policies for the various algorithms. I am currently working on a vectorized policy and it will look substantially different again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. This was the easiest (and shortest, code-wise) way for me to tackle this, but I can go back.
@@ -197,13 +208,13 @@ struct dispatch_t<StableAddress, | |||
|
|||
elem_counts last_counts{}; | |||
// Increase the number of output elements per thread until we reach the required bytes in flight. | |||
static_assert(policy_t::min_items_per_thread <= policy_t::max_items_per_thread, ""); // ensures the loop below | |||
_CCCL_ASSERT_HOST(policy.MinItemsPerThread() <= policy.MaxItemsPerThread(), ""); // ensures the loop below |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important: It's really sad if we have to demote compile-time checks to runtime checks. I think there is a macro defined when CUB is built for cuda.parallel
. I think we should use it to generate a runtime check for cuda.parallel
and stay with the compile-time check for CUB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I guess I could key this off of CUB_DEFINE_RUNTIME_POLICIES
, since that is pretty much only relevant then. I am less convinced about the assert, to be quite honest, but the argument about doing something similar for the if constexpr
change you mentioned below makes perfect sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use CCCL_C_EXPERIMENTAL
as a macro. There is precedence for it.
#ifdef _CUB_HAS_TRANSFORM_UBLKCP | ||
if constexpr (Algorithm::ublkcp == wrapped_policy.GetAlgorithm()) | ||
if (Algorithm::ublkcp == wrapped_policy.GetAlgorithm()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: IIUC, this will now lead CUB to instantiate and generate both algorithms for each template instantiation of the public CUB API. @gevtushenko and I jumped through hoops last summer to prevent this at all cost to not increase binary sizes. I really believe we need to keep the compile-time dispatch for the non cuda.parallel CUB version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this at the code review hour and conclude that we can use a workaround like this for now:
if (Algorithm::ublkcp == wrapped_policy.GetAlgorithm()) | |
if | |
#ifndef CCCL_C_EXPERIMENTAL | |
constexpr | |
#endif | |
(Algorithm::ublkcp == wrapped_policy.GetAlgorithm()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have something equivalent in my working tree, just keyed off of CUB_DEFINE_RUNTIME_POLICIES
.
Description
This PR makes the UBLKCP path of transform compatible with c.parallel and makes c.parallel use the new machinery for computing the policy for transform, which, in effect, means that c.parallel is now able to use the of the UBLKCP path.
Resolves #4506
Resolves #4361
Checklist