c.parallel: enable UBLKCP in transform #4847

Draft: wants to merge 4 commits into main

griwes
Contributor

@griwes griwes commented May 29, 2025

Description

This PR makes the UBLKCP path of transform compatible with c.parallel and makes c.parallel use the new machinery for computing the transform policy. In effect, c.parallel can now use the UBLKCP path.

Resolves #4506
Resolves #4361

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented May 29, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL May 29, 2025
@griwes
Contributor Author

griwes commented May 29, 2025

/ok to test 5d98349

@griwes
Contributor Author

griwes commented May 29, 2025

cc @bernhardmgruber for an initial look-over

@griwes
Contributor Author

griwes commented May 29, 2025

/ok to test 298eb95

Contributor

🟨 CI finished in 2h 33m: Pass: 92%/188 | Total: 3d 15h | Avg: 28m 03s | Max: 1h 25m | Hits: 87%/273863
  • 🟨 thrust: Pass: 87%/47 | Total: 1d 08h | Avg: 40m 58s | Max: 1h 25m | Hits: 81%/78328

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  86%/45  | Total:  1d 06h | Avg: 41m 00s | Max:  1h 25m | Hits:  82%/74507 
      🟩 arm64              Pass: 100%/2   | Total:  1h 20m | Avg: 40m 19s | Max: 42m 26s | Hits:  79%/3821  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 57m 31s | Avg: 28m 45s | Max: 29m 58s | Hits:  79%/3820  
      🔍 nvcc               Pass:  86%/45  | Total:  1d 07h | Avg: 41m 31s | Max:  1h 25m | Hits:  82%/74508 
    🔍 sm: 90 🔍
      🔍 90                 Pass:  50%/2   | Total: 36m 07s | Avg: 18m 03s | Max: 23m 38s | Hits:  79%/1911  
      🟩 90;90a;100         Pass: 100%/1   | Total: 49m 20s | Avg: 49m 20s | Max: 49m 20s | Hits:  79%/1911  
    🟨 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 57m 31s | Avg: 28m 45s | Max: 29m 58s | Hits:  79%/3820  
      🟨 nvcc12.0           Pass:  80%/5   | Total:  3h 13m | Avg: 38m 44s | Max: 58m 38s | Hits:  79%/7642  
      🟨 nvcc12.9           Pass:  87%/40  | Total:  1d 03h | Avg: 41m 51s | Max:  1h 25m | Hits:  82%/66866 
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 30m | Avg: 37m 37s | Max: 44m 46s | Hits:  79%/7640  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 25m | Avg: 42m 40s | Max: 43m 25s | Hits:  79%/3820  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 34m | Avg: 47m 24s | Max: 51m 05s | Hits:  79%/3820  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 36m | Avg: 48m 18s | Max: 49m 14s | Hits:  79%/3820  
      🟩 Clang18            Pass: 100%/2   | Total:  1h 23m | Avg: 41m 58s | Max: 42m 02s | Hits:  79%/3820  
      🟩 Clang19            Pass: 100%/7   | Total:  3h 23m | Avg: 29m 02s | Max: 44m 44s | Hits:  85%/13370 
      🟩 GCC7               Pass: 100%/2   | Total:  1h 20m | Avg: 40m 10s | Max: 45m 26s | Hits:  79%/3822  
      🟩 GCC8               Pass: 100%/1   | Total: 44m 06s | Avg: 44m 06s | Max: 44m 06s | Hits:  79%/1911  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 29m | Avg: 44m 57s | Max: 53m 15s | Hits:  79%/3822  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 38m | Avg: 49m 24s | Max: 52m 49s | Hits:  79%/3822  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 29m | Avg: 44m 43s | Max: 45m 58s | Hits:  79%/3822  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 29m | Avg: 44m 50s | Max: 47m 40s | Hits:  79%/3822  
      🟨 GCC13              Pass:  90%/10  | Total:  4h 53m | Avg: 29m 22s | Max: 49m 20s | Hits:  86%/17199 
      🟥 MSVC14.29          Pass:   0%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 07m
      🟥 MSVC14.43          Pass:   0%/3   | Total:  2h 19m | Avg: 46m 26s | Max:  1h 12m
      🟩 NVHPC25.5          Pass: 100%/2   | Total:  2h 40m | Avg:  1h 20m | Max:  1h 25m | Hits:  72%/3818  
    🟨 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 11h 54m | Avg: 37m 36s | Max: 51m 05s | Hits:  82%/36290 
      🟨 GCC                Pass:  95%/21  | Total: 13h 06m | Avg: 37m 26s | Max: 53m 15s | Hits:  82%/38220 
      🟥 MSVC               Pass:   0%/5   | Total:  4h 25m | Avg: 53m 01s | Max:  1h 12m
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 40m | Avg:  1h 20m | Max:  1h 25m | Hits:  72%/3818  
    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 47m 34s | Avg: 23m 47s | Max: 33m 52s | Hits:  89%/3822  
    🟨 ctk
      🟨 12.0               Pass:  80%/5   | Total:  3h 13m | Avg: 38m 44s | Max: 58m 38s | Hits:  79%/7642  
      🟨 12.9               Pass:  88%/42  | Total:  1d 04h | Avg: 41m 14s | Max:  1h 25m | Hits:  82%/70686 
    🟨 gpu
      🟨 h100               Pass:  50%/2   | Total: 36m 07s | Avg: 18m 03s | Max: 23m 38s | Hits:  79%/1911  
      🟨 rtx2080            Pass:  91%/35  | Total:  1d 03h | Avg: 46m 42s | Max:  1h 25m | Hits:  79%/61132 
      🟨 rtx4090            Pass:  80%/10  | Total:  4h 14m | Avg: 25m 28s | Max:  1h 12m | Hits:  92%/15285 
    🟨 jobs
      🟨 Build              Pass:  90%/40  | Total:  1d 06h | Avg: 46m 25s | Max:  1h 25m | Hits:  79%/68775 
      🟨 TestCPU            Pass:  66%/3   | Total: 18m 13s | Avg:  6m 04s | Max:  9m 55s | Hits:  99%/3821  
      🟨 TestGPU            Pass:  75%/4   | Total: 50m 20s | Avg: 12m 35s | Max: 13m 42s | Hits:  99%/5732  
    🟨 std
      🟨 17                 Pass:  85%/21  | Total: 16h 41m | Avg: 47m 40s | Max:  1h 14m | Hits:  79%/34388 
      🟨 20                 Pass:  87%/24  | Total: 14h 37m | Avg: 36m 33s | Max:  1h 25m | Hits:  83%/40118 
    
  • 🟨 cub: Pass: 89%/47 | Total: 1d 18h | Avg: 54m 16s | Max: 1h 14m | Hits: 73%/51896

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/45  | Total:  1d 16h | Avg: 54m 12s | Max:  1h 14m | Hits:  73%/49398 
      🟩 arm64              Pass: 100%/2   | Total:  1h 51m | Avg: 55m 33s | Max: 57m 40s | Hits:  68%/2498  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 03m | Avg: 31m 41s | Max: 33m 10s | Hits:  74%/2151  
      🔍 nvcc               Pass:  88%/45  | Total:  1d 17h | Avg: 55m 16s | Max:  1h 14m | Hits:  73%/49745 
    🔍 sm: 90 🔍
      🔍 90                 Pass:  66%/3   | Total:  1h 28m | Avg: 29m 26s | Max: 33m 15s | Hits:  83%/2500  
      🟩 90;90a;100         Pass: 100%/1   | Total: 58m 07s | Avg: 58m 07s | Max: 58m 07s | Hits:  67%/1250  
    🟨 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total:  1h 03m | Avg: 31m 41s | Max: 33m 10s | Hits:  74%/2151  
      🟨 nvcc12.0           Pass:  80%/5   | Total:  4h 43m | Avg: 56m 45s | Max:  1h 04m | Hits:  68%/4997  
      🟨 nvcc12.9           Pass:  90%/40  | Total:  1d 12h | Avg: 55m 05s | Max:  1h 14m | Hits:  74%/44748 
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 52m | Avg: 58m 05s | Max:  1h 04m | Hits:  68%/4998  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 56m | Avg: 58m 29s | Max:  1h 02m | Hits:  68%/2495  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 53m | Avg: 56m 38s | Max: 56m 45s | Hits:  68%/2495  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 57m | Avg: 58m 46s | Max:  1h 00m | Hits:  68%/2495  
      🟩 Clang18            Pass: 100%/2   | Total:  1h 57m | Avg: 58m 56s | Max:  1h 03m | Hits:  68%/2495  
      🟩 Clang19            Pass: 100%/7   | Total:  4h 49m | Avg: 41m 23s | Max:  1h 03m | Hits:  79%/8390  
      🟩 GCC7               Pass: 100%/2   | Total:  1h 56m | Avg: 58m 00s | Max: 59m 16s | Hits:  67%/2498  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:  67%/1249  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 04m | Hits:  67%/2498  
      🟩 GCC10              Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 05m | Hits:  67%/2499  
      🟩 GCC11              Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 05m | Hits:  67%/2495  
      🟩 GCC12              Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 11m | Hits:  67%/2495  
      🟨 GCC13              Pass:  90%/11  | Total:  7h 38m | Avg: 41m 42s | Max:  1h 07m | Hits:  83%/12497 
      🟥 MSVC14.29          Pass:   0%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 08m
      🟥 MSVC14.43          Pass:   0%/2   | Total:  2h 18m | Avg:  1h 09m | Max:  1h 14m
      🟩 NVHPC25.5          Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 13m | Hits:  68%/2297  
    🟨 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 16h 27m | Avg: 51m 58s | Max:  1h 04m | Hits:  72%/23368 
      🟨 GCC                Pass:  95%/22  | Total: 19h 05m | Avg: 52m 03s | Max:  1h 11m | Hits:  75%/26231 
      🟥 MSVC               Pass:   0%/4   | Total:  4h 31m | Avg:  1h 07m | Max:  1h 14m
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 13m | Hits:  68%/2297  
    🟨 gpu
      🟨 h100               Pass:  66%/3   | Total:  1h 28m | Avg: 29m 26s | Max: 33m 15s | Hits:  83%/2500  
      🟨 rtx2080            Pass:  88%/36  | Total:  1d 12h | Avg:  1h 00m | Max:  1h 14m | Hits:  68%/39402 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 56m | Avg: 37m 04s | Max:  1h 02m | Hits:  91%/9994  
    🟨 jobs
      🟨 Build              Pass:  89%/39  | Total:  1d 14h | Avg: 59m 22s | Max:  1h 14m | Hits:  68%/43150 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 33m 14s | Avg: 33m 14s | Max: 33m 14s | Hits:  99%/1250  
      🟩 GraphCapture       Pass: 100%/1   | Total: 26m 55s | Avg: 26m 55s | Max: 26m 55s | Hits:  99%/1250  
      🟨 HostLaunch         Pass:  66%/3   | Total:  1h 31m | Avg: 30m 36s | Max: 31m 13s | Hits:  99%/2498  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 23m | Avg: 27m 45s | Max: 33m 03s | Hits:  99%/3748  
    🟨 ctk
      🟨 12.0               Pass:  80%/5   | Total:  4h 43m | Avg: 56m 45s | Max:  1h 04m | Hits:  68%/4997  
      🟨 12.9               Pass:  90%/42  | Total:  1d 13h | Avg: 53m 58s | Max:  1h 14m | Hits:  74%/46899 
    🟨 std
      🟨 17                 Pass:  85%/21  | Total: 21h 17m | Avg:  1h 00m | Max:  1h 13m | Hits:  68%/22191 
      🟨 20                 Pass:  92%/26  | Total: 21h 12m | Avg: 48m 57s | Max:  1h 14m | Hits:  77%/29705 
    
  • 🟨 python: Pass: 83%/12 | Total: 1h 35m | Avg: 7m 55s | Max: 19m 03s

    🚨 jobs: Test cuda.parallel 🚨
      🟩 Build cuda.cccl    Pass: 100%/2   | Total:  6m 50s | Avg:  3m 25s | Max:  3m 30s
      🟩 Build cuda.cooperative Pass: 100%/2   | Total:  6m 54s | Avg:  3m 27s | Max:  3m 29s
      🟩 Build cuda.parallel Pass: 100%/2   | Total: 16m 12s | Avg:  8m 06s | Max:  8m 06s
      🟩 Test cuda.cccl     Pass: 100%/2   | Total:  9m 01s | Avg:  4m 30s | Max:  4m 49s
      🟩 Test cuda.cooperative Pass: 100%/2   | Total: 37m 39s | Avg: 18m 49s | Max: 19m 03s
      🔥 Test cuda.parallel Pass:   0%/2   | Total: 18m 32s | Avg:  9m 16s | Max:  9m 17s
    🟨 cpu
      🟨 amd64              Pass:  83%/12  | Total:  1h 35m | Avg:  7m 55s | Max: 19m 03s
    🟨 ctk
      🟨 12.9               Pass:  83%/12  | Total:  1h 35m | Avg:  7m 55s | Max: 19m 03s
    🟨 cudacxx
      🟨 nvcc12.9           Pass:  83%/12  | Total:  1h 35m | Avg:  7m 55s | Max: 19m 03s
    🟨 cudacxx_family
      🟨 nvcc               Pass:  83%/12  | Total:  1h 35m | Avg:  7m 55s | Max: 19m 03s
    🟨 cxx
      🟨 GCC13              Pass:  83%/12  | Total:  1h 35m | Avg:  7m 55s | Max: 19m 03s
    🟨 cxx_family
      🟨 GCC                Pass:  83%/12  | Total:  1h 35m | Avg:  7m 55s | Max: 19m 03s
    🟨 gpu
      🟨 rtxa6000           Pass:  83%/12  | Total:  1h 35m | Avg:  7m 55s | Max: 19m 03s
    🟨 py_version
      🟨 3.10               Pass:  83%/6   | Total: 46m 58s | Avg:  7m 49s | Max: 18m 36s
      🟨 3.13               Pass:  83%/6   | Total: 48m 10s | Avg:  8m 01s | Max: 19m 03s
    
  • 🟨 cudax: Pass: 96%/26 | Total: 3h 12m | Avg: 7m 23s | Max: 13m 27s | Hits: 90%/14181

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  95%/22  | Total:  2h 48m | Avg:  7m 40s | Max: 13m 27s | Hits:  90%/11817 
      🟩 arm64              Pass: 100%/4   | Total: 23m 30s | Avg:  5m 52s | Max:  6m 10s | Hits:  90%/2364  
    🔍 ctk: 12.9 🔍
      🟩 12.0               Pass: 100%/3   | Total: 23m 40s | Avg:  7m 53s | Max: 13m 15s | Hits:  88%/1478  
      🔍 12.9               Pass:  95%/23  | Total:  2h 48m | Avg:  7m 20s | Max: 13m 27s | Hits:  91%/12703 
    🔍 cudacxx: nvcc12.9 🔍
      🟩 nvcc12.0           Pass: 100%/3   | Total: 23m 40s | Avg:  7m 53s | Max: 13m 15s | Hits:  88%/1478  
      🔍 nvcc12.9           Pass:  95%/23  | Total:  2h 48m | Avg:  7m 20s | Max: 13m 27s | Hits:  91%/12703 
    🔍 cxx: GCC13 🔍
      🟩 Clang14            Pass: 100%/2   | Total: 10m 35s | Avg:  5m 17s | Max:  5m 42s | Hits:  90%/1186  
      🟩 Clang15            Pass: 100%/1   | Total:  5m 40s | Avg:  5m 40s | Max:  5m 40s | Hits:  90%/591   
      🟩 Clang16            Pass: 100%/1   | Total:  6m 21s | Avg:  6m 21s | Max:  6m 21s | Hits:  90%/591   
      🟩 Clang17            Pass: 100%/1   | Total:  5m 40s | Avg:  5m 40s | Max:  5m 40s | Hits:  90%/591   
      🟩 Clang18            Pass: 100%/1   | Total:  5m 37s | Avg:  5m 37s | Max:  5m 37s | Hits:  90%/591   
      🟩 Clang19            Pass: 100%/4   | Total: 26m 42s | Avg:  6m 40s | Max:  9m 21s | Hits:  93%/2364  
      🟩 GCC10              Pass: 100%/2   | Total: 11m 59s | Avg:  5m 59s | Max:  6m 27s | Hits:  90%/1186  
      🟩 GCC11              Pass: 100%/1   | Total:  6m 11s | Avg:  6m 11s | Max:  6m 11s | Hits:  90%/591   
      🟩 GCC12              Pass: 100%/1   | Total:  7m 04s | Avg:  7m 04s | Max:  7m 04s | Hits:  90%/591   
      🔍 GCC13              Pass:  87%/8   | Total: 55m 54s | Avg:  6m 59s | Max: 11m 20s | Hits:  91%/4137  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 13m 15s | Avg: 13m 15s | Max: 13m 15s | Hits:  78%/292   
      🟩 MSVC14.43          Pass: 100%/1   | Total: 13m 27s | Avg: 13m 27s | Max: 13m 27s | Hits:  79%/292   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 23m 57s | Avg: 11m 58s | Max: 12m 08s | Hits:  88%/1178  
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/10  | Total:  1h 00m | Avg:  6m 03s | Max:  9m 21s | Hits:  91%/5914  
      🔍 GCC                Pass:  91%/12  | Total:  1h 21m | Avg:  6m 45s | Max: 11m 20s | Hits:  91%/6505  
      🟩 MSVC               Pass: 100%/2   | Total: 26m 42s | Avg: 13m 21s | Max: 13m 27s | Hits:  78%/584   
      🟩 NVHPC              Pass: 100%/2   | Total: 23m 57s | Avg: 11m 58s | Max: 12m 08s | Hits:  88%/1178  
    🔍 gpu: h100 🔍
      🔍 h100               Pass:  50%/2   | Total: 15m 39s | Avg:  7m 49s | Max: 10m 42s | Hits:  90%/591   
      🟩 rtx2080            Pass: 100%/24  | Total:  2h 56m | Avg:  7m 21s | Max: 13m 27s | Hits:  90%/13590 
    🔍 jobs: Test 🔍
      🟩 Build              Pass: 100%/23  | Total:  2h 40m | Avg:  6m 59s | Max: 13m 27s | Hits:  89%/12999 
      🔍 Test               Pass:  66%/3   | Total: 31m 23s | Avg: 10m 27s | Max: 11m 20s | Hits:  99%/1182  
    🔍 sm: 90 🔍
      🔍 90                 Pass:  66%/3   | Total: 20m 43s | Avg:  6m 54s | Max: 10m 42s | Hits:  90%/1182  
      🟩 90a                Pass: 100%/1   | Total:  5m 00s | Avg:  5m 00s | Max:  5m 00s | Hits:  90%/591   
    🔍 std: 20 🔍
      🟩 17                 Pass: 100%/4   | Total: 29m 23s | Avg:  7m 20s | Max: 12m 08s | Hits:  90%/2362  
      🔍 20                 Pass:  95%/22  | Total:  2h 42m | Avg:  7m 24s | Max: 13m 27s | Hits:  90%/11819 
    🟨 cudacxx_family
      🟨 nvcc               Pass:  96%/26  | Total:  3h 12m | Avg:  7m 23s | Max: 13m 27s | Hits:  90%/14181 
    
  • 🟨 cccl_c_parallel: Pass: 66%/3 | Total: 36m 47s | Avg: 12m 15s | Max: 19m 23s | Hits: 98%/328

    🚨 gpu: h100 🚨
      🔥 h100               Pass:   0%/1   | Total: 19m 23s | Avg: 19m 23s | Max: 19m 23s
      🟩 rtx2080            Pass: 100%/2   | Total: 17m 24s | Avg:  8m 42s | Max: 12m 42s | Hits:  98%/328   
    🔍 jobs: Test 🔍
      🟩 Build              Pass: 100%/1   | Total:  4m 42s | Avg:  4m 42s | Max:  4m 42s | Hits:  97%/164   
      🔍 Test               Pass:  50%/2   | Total: 32m 05s | Avg: 16m 02s | Max: 19m 23s | Hits:  98%/164   
    🟨 cpu
      🟨 amd64              Pass:  66%/3   | Total: 36m 47s | Avg: 12m 15s | Max: 19m 23s | Hits:  98%/328   
    🟨 ctk
      🟨 12.9               Pass:  66%/3   | Total: 36m 47s | Avg: 12m 15s | Max: 19m 23s | Hits:  98%/328   
    🟨 cudacxx
      🟨 nvcc12.9           Pass:  66%/3   | Total: 36m 47s | Avg: 12m 15s | Max: 19m 23s | Hits:  98%/328   
    🟨 cudacxx_family
      🟨 nvcc               Pass:  66%/3   | Total: 36m 47s | Avg: 12m 15s | Max: 19m 23s | Hits:  98%/328   
    🟨 cxx
      🟨 GCC13              Pass:  66%/3   | Total: 36m 47s | Avg: 12m 15s | Max: 19m 23s | Hits:  98%/328   
    🟨 cxx_family
      🟨 GCC                Pass:  66%/3   | Total: 36m 47s | Avg: 12m 15s | Max: 19m 23s | Hits:  98%/328   
    
  • 🟩 libcudacxx: Pass: 100%/45 | Total: 7h 12m | Avg: 9m 36s | Max: 28m 43s | Hits: 95%/129130

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  7h 02m | Avg:  9m 50s | Max: 28m 43s | Hits:  95%/122443
      🟩 arm64              Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  4m 38s | Hits:  99%/6687  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 42m 52s | Avg:  8m 34s | Max: 25m 34s | Hits:  99%/16354 
      🟩 12.9               Pass: 100%/40  | Total:  6h 29m | Avg:  9m 43s | Max: 28m 43s | Hits:  94%/112776
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 54m 19s | Avg: 27m 09s | Max: 28m 43s | Hits:  26%/6651  
      🟩 nvcc12.0           Pass: 100%/5   | Total: 42m 52s | Avg:  8m 34s | Max: 25m 34s | Hits:  99%/16354 
      🟩 nvcc12.9           Pass: 100%/38  | Total:  5h 35m | Avg:  8m 48s | Max: 28m 25s | Hits:  98%/106125
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 54m 19s | Avg: 27m 09s | Max: 28m 43s | Hits:  26%/6651  
      🟩 nvcc               Pass: 100%/43  | Total:  6h 17m | Avg:  8m 47s | Max: 28m 25s | Hits:  99%/122479
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 19m 22s | Avg:  4m 50s | Max:  5m 29s | Hits:  99%/13258 
      🟩 Clang15            Pass: 100%/2   | Total: 10m 37s | Avg:  5m 18s | Max:  5m 34s | Hits:  98%/6647  
      🟩 Clang16            Pass: 100%/2   | Total: 10m 29s | Avg:  5m 14s | Max:  5m 20s | Hits:  99%/6647  
      🟩 Clang17            Pass: 100%/2   | Total: 10m 57s | Avg:  5m 28s | Max:  5m 55s | Hits:  99%/6647  
      🟩 Clang18            Pass: 100%/2   | Total: 10m 35s | Avg:  5m 17s | Max:  5m 30s | Hits:  98%/6647  
      🟩 Clang19            Pass: 100%/6   | Total:  1h 19m | Avg: 13m 15s | Max: 28m 43s | Hits:  70%/16641 
      🟩 GCC7               Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  4m 50s | Hits:  99%/6583  
      🟩 GCC8               Pass: 100%/1   | Total:  4m 46s | Avg:  4m 46s | Max:  4m 46s | Hits:  99%/3302  
      🟩 GCC9               Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  4m 43s | Hits:  99%/6595  
      🟩 GCC10              Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  4m 52s | Hits:  99%/6649  
      🟩 GCC11              Pass: 100%/2   | Total: 10m 41s | Avg:  5m 20s | Max:  5m 30s | Hits:  98%/6645  
      🟩 GCC12              Pass: 100%/2   | Total: 10m 39s | Avg:  5m 19s | Max:  5m 28s | Hits:  99%/6649  
      🟩 GCC13              Pass: 100%/10  | Total:  1h 42m | Avg: 10m 14s | Max: 22m 28s | Hits:  99%/16885 
      🟩 MSVC14.29          Pass: 100%/2   | Total: 53m 59s | Avg: 26m 59s | Max: 28m 25s | Hits:  99%/6323  
      🟩 MSVC14.43          Pass: 100%/2   | Total: 54m 14s | Avg: 27m 07s | Max: 27m 41s | Hits:  99%/6375  
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 26m 14s | Avg: 13m 07s | Max: 14m 02s | Hits:  95%/6637  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/18  | Total:  2h 21m | Avg:  7m 51s | Max: 28m 43s | Hits:  90%/56487 
      🟩 GCC                Pass: 100%/21  | Total:  2h 36m | Avg:  7m 26s | Max: 22m 28s | Hits:  99%/53308 
      🟩 MSVC               Pass: 100%/4   | Total:  1h 48m | Avg: 27m 03s | Max: 28m 25s | Hits:  99%/12698 
      🟩 NVHPC              Pass: 100%/2   | Total: 26m 14s | Avg: 13m 07s | Max: 14m 02s | Hits:  95%/6637  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 20m 20s | Avg: 10m 10s | Max: 15m 07s | Hits:  99%/3426  
      🟩 rtx2080            Pass: 100%/43  | Total:  6h 51m | Avg:  9m 34s | Max: 28m 43s | Hits:  95%/125704
    🟩 jobs
      🟩 Build              Pass: 100%/39  | Total:  5h 46m | Avg:  8m 52s | Max: 28m 43s | Hits:  95%/129090
      🟩 NVRTC              Pass: 100%/2   | Total: 41m 29s | Avg: 20m 44s | Max: 22m 28s | Hits:  90%/40    
      🟩 Test               Pass: 100%/3   | Total: 40m 29s | Avg: 13m 29s | Max: 15m 07s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  4m 08s | Avg:  4m 08s | Max:  4m 08s
    🟩 sm
      🟩 75                 Pass: 100%/2   | Total: 41m 29s | Avg: 20m 44s | Max: 22m 28s | Hits:  90%/40    
      🟩 90                 Pass: 100%/2   | Total: 20m 20s | Avg: 10m 10s | Max: 15m 07s | Hits:  99%/3426  
      🟩 90;90a;100         Pass: 100%/1   | Total:  5m 43s | Avg:  5m 43s | Max:  5m 43s | Hits:  99%/3426  
    🟩 std
      🟩 17                 Pass: 100%/22  | Total:  3h 38m | Avg:  9m 54s | Max: 28m 25s | Hits:  95%/68924 
      🟩 20                 Pass: 100%/22  | Total:  3h 30m | Avg:  9m 32s | Max: 28m 43s | Hits:  94%/60206 
    
  • 🟩 cccl: Pass: 100%/4 | Total: 19m 51s | Avg: 4m 57s | Max: 5m 25s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 19m 51s | Avg:  4m 57s | Max:  5m 25s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  4m 57s
      🟩 12.9               Pass: 100%/2   | Total: 10m 42s | Avg:  5m 21s | Max:  5m 25s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  4m 57s
      🟩 nvcc12.9           Pass: 100%/2   | Total: 10m 42s | Avg:  5m 21s | Max:  5m 25s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 19m 51s | Avg:  4m 57s | Max:  5m 25s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  4m 12s | Avg:  4m 12s | Max:  4m 12s
      🟩 Clang19            Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 GCC12              Pass: 100%/1   | Total:  4m 57s | Avg:  4m 57s | Max:  4m 57s
      🟩 GCC13              Pass: 100%/1   | Total:  5m 25s | Avg:  5m 25s | Max:  5m 25s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  9m 29s | Avg:  4m 44s | Max:  5m 17s
      🟩 GCC                Pass: 100%/2   | Total: 10m 22s | Avg:  5m 11s | Max:  5m 25s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 19m 51s | Avg:  4m 57s | Max:  5m 25s
    🟩 jobs
      🟩 Infra              Pass: 100%/4   | Total: 19m 51s | Avg:  4m 57s | Max:  5m 25s
    
  • 🟩 stdpar: Pass: 100%/4 | Total: 21m 28s | Avg: 5m 22s | Max: 6m 00s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 41s | Avg:  5m 50s | Max:  6m 00s
      🟩 arm64              Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  5m 17s
    🟩 ctk
      🟩 12.9               Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  6m 00s
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  6m 00s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  6m 00s
    🟩 cxx
      🟩 NVHPC25.5          Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  6m 00s
    🟩 cxx_family
      🟩 NVHPC              Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  6m 00s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  6m 00s
    🟩 jobs
      🟩 Build              Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  6m 00s
    🟩 std
      🟩 17                 Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  6m 00s
      🟩 20                 Pass: 100%/2   | Total: 10m 11s | Avg:  5m 05s | Max:  5m 41s
    

👃 Inspect Changes

Modifications in project?

Project
+/- CCCL Infrastructure
libcu++
+/- CUB
Thrust
+/- CUDA Experimental
stdpar
python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
+/- CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- stdpar
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 188)

# Runner
129 linux-amd64-cpu16
15 windows-amd64-cpu16
12 linux-arm64-cpu16
12 linux-amd64-gpu-rtxa6000-latest-1
11 linux-amd64-gpu-rtx2080-latest-1
6 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1

@griwes
Contributor Author

griwes commented May 30, 2025

/ok to test 2b4b62b

This reverts commit 2b4b62b.
_CCCL_ASSERT(reinterpret_cast<uintptr_t>(dst) % alignof(T) == 0, "");

const int bytes_to_copy = static_cast<int>(sizeof(T)) * tile_size;
cuda::memcpy_async(this_thread_block(), dst, src, bytes_to_copy, pipe);
Contributor

We definitely need to benchmark this new fallback path. I remember I chose the CG version for a reason. It could be that it worked better for badly aligned inputs or something.

Contributor Author

I've looked at the PTX generated for int-sized data and it was identical (actually, one SHL was hoisted out of the loop in our case, and was inside the loop in the CG case for some reason), but yes, agreed. The reason I'm changing this is that we have the libcu++ versions readily available and redistributable, whereas using CG in the runtime compilation in c.parallel would require a fair number of additional hoops to jump through.

Comment on lines +74 to +79
template <Algorithm AlgorithmV, int BlockThreads, int MinBif, int BulkCopyAlignment>
struct transform_agent_policy_t
{
static constexpr int block_threads = BlockThreads;
static constexpr Algorithm algorithm = AlgorithmV;
static constexpr int min_bif = MinBif;
static constexpr int block_threads = BlockThreads;
Contributor

If possible, please do not unify the policies for the various algorithms. I am currently working on a vectorized policy and it will look substantially different again.

Contributor Author

Alright. This was the easiest (and shortest, code-wise) way for me to tackle this, but I can go back.

@@ -197,13 +208,13 @@ struct dispatch_t<StableAddress,

elem_counts last_counts{};
// Increase the number of output elements per thread until we reach the required bytes in flight.
- static_assert(policy_t::min_items_per_thread <= policy_t::max_items_per_thread, ""); // ensures the loop below
+ _CCCL_ASSERT_HOST(policy.MinItemsPerThread() <= policy.MaxItemsPerThread(), ""); // ensures the loop below
Contributor

Important: It's really sad if we have to demote compile-time checks to runtime checks. I think there is a macro defined when CUB is built for cuda.parallel. I think we should use it to generate a runtime check for cuda.parallel and stay with the compile-time check for CUB.

Contributor Author

Hmm, I guess I could key this off of CUB_DEFINE_RUNTIME_POLICIES, since that is pretty much only relevant then. I am less convinced about the assert, to be quite honest, but the argument about doing something similar for the if constexpr change you mentioned below makes perfect sense to me.

Contributor

We can use CCCL_C_EXPERIMENTAL as a macro. There is precedent for it.
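A minimal sketch of the idea discussed in this thread, in plain C++: one macro selects between the compile-time and the runtime form of the same check. `DEMO_RUNTIME_POLICIES`, `demo_policy`, and `items_per_thread_candidates` are hypothetical names used only for illustration. The macro stands in for whichever of `CCCL_C_EXPERIMENTAL` or `CUB_DEFINE_RUNTIME_POLICIES` ends up being used, and none of this is the actual CUB code.

```cpp
#include <cassert>

// Hypothetical stand-in for CCCL_C_EXPERIMENTAL / CUB_DEFINE_RUNTIME_POLICIES.
// Leave it undefined to model the regular CUB build with compile-time policies.
// #define DEMO_RUNTIME_POLICIES

// Hypothetical policy exposing the bounds both as static members and as accessors.
struct demo_policy
{
  static constexpr int min_items_per_thread = 1;
  static constexpr int max_items_per_thread = 32;

  constexpr int MinItemsPerThread() const { return min_items_per_thread; }
  constexpr int MaxItemsPerThread() const { return max_items_per_thread; }
};

// Returns how many items-per-thread values a tuning loop could try.
// The range check is compile-time by default and a runtime assert otherwise.
template <typename PolicyT>
constexpr int items_per_thread_candidates(PolicyT policy)
{
#ifdef DEMO_RUNTIME_POLICIES
  // Policy values are only known at run time (runtime-compiled build).
  assert(policy.MinItemsPerThread() <= policy.MaxItemsPerThread());
#else
  // Policy values are compile-time constants, so a bad policy fails the build.
  static_assert(PolicyT::min_items_per_thread <= PolicyT::max_items_per_thread,
                "min_items_per_thread must not exceed max_items_per_thread");
#endif
  return policy.MaxItemsPerThread() - policy.MinItemsPerThread() + 1;
}
```

With the macro left undefined, an inverted range is rejected when the code is compiled. Defining it turns the same condition into an assert that only fires when the function actually runs.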

#ifdef _CUB_HAS_TRANSFORM_UBLKCP
- if constexpr (Algorithm::ublkcp == wrapped_policy.GetAlgorithm())
+ if (Algorithm::ublkcp == wrapped_policy.GetAlgorithm())
Contributor

@bernhardmgruber bernhardmgruber Jun 2, 2025

Critical: IIUC, this will now lead CUB to instantiate and generate both algorithms for each template instantiation of the public CUB API. @gevtushenko and I jumped through hoops last summer to prevent this at all costs, so as not to increase binary sizes. I really believe we need to keep the compile-time dispatch for the non-cuda.parallel CUB version.

Contributor

We discussed this at the code review hour and concluded that we can use a workaround like this for now:

Suggested change
- if (Algorithm::ublkcp == wrapped_policy.GetAlgorithm())
+ if
+ #ifndef CCCL_C_EXPERIMENTAL
+ constexpr
+ #endif
+ (Algorithm::ublkcp == wrapped_policy.GetAlgorithm())

Contributor Author

I have something equivalent in my working tree, just keyed off of CUB_DEFINE_RUNTIME_POLICIES.
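To make the binary-size concern concrete, here is a small self-contained sketch (C++17) of the conditional `constexpr` workaround. All names (`DEMO_RUNTIME_POLICIES`, `demo_policy`, `launch_ublkcp`, `launch_prefetch`, `dispatch`) are hypothetical and only mimic the shape of the dispatch code. The `static_assert` inside `launch_ublkcp` is a demo device that turns an unwanted instantiation into a compile error, not something assumed to exist in CUB.

```cpp
// Hypothetical stand-in for CCCL_C_EXPERIMENTAL / CUB_DEFINE_RUNTIME_POLICIES.
// Leave it undefined to model the regular (non-cuda.parallel) CUB build.
// #define DEMO_RUNTIME_POLICIES

enum class Algorithm
{
  prefetch,
  ublkcp
};

// Hypothetical policy wrapper whose algorithm is a compile-time constant.
template <Algorithm A>
struct demo_policy
{
  static constexpr Algorithm GetAlgorithm() { return A; }
};

// Stand-in for launching the UBLKCP kernel. The static_assert exists only for
// this demo: it makes an instantiation for the wrong algorithm visible.
template <typename PolicyT>
constexpr int launch_ublkcp(PolicyT)
{
  static_assert(PolicyT::GetAlgorithm() == Algorithm::ublkcp,
                "ublkcp path instantiated for another algorithm");
  return 2;
}

// Stand-in for launching the prefetch kernel.
template <typename PolicyT>
constexpr int launch_prefetch(PolicyT)
{
  return 1;
}

// Dispatch with `if constexpr` by default and a plain `if` in runtime-policy builds.
template <typename PolicyT>
constexpr int dispatch(PolicyT wrapped_policy)
{
  if
#ifndef DEMO_RUNTIME_POLICIES
    constexpr
#endif
    (Algorithm::ublkcp == wrapped_policy.GetAlgorithm())
  {
    return launch_ublkcp(wrapped_policy);
  }
  else
  {
    return launch_prefetch(wrapped_policy);
  }
}
```

In this sketch, `dispatch(demo_policy<Algorithm::prefetch>{})` compiles only because the `ublkcp` branch is discarded and never instantiated. With the macro defined, the plain `if` instantiates both branches and the demo guard fires, which is the extra instantiation the review comment warns about.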

Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA]: Expose transform UBLKCP algorithm via cuda.parallel
Refactor transform in c.parallel to use json magic for reusing CUB tuning policies
2 participants