[mlir][nvgpu] update commit group and wait async ops #130482

Open · wants to merge 3 commits into main

Conversation

linuxlonelyeagle
Member

There seem to be some errors in the current definitions of nvgpu.device_async_create_group and nvgpu.device_async_wait.
nvgpu.device_async_wait should not have operands. If it had operands, the semantics would be to wait for the specific copy operations named by those operands, but it actually waits on groups, so it should not take an operand; the lowering pattern in nvgpu-to-nvvm shows that the operand is unused. Since nvgpu.device_async_wait no longer has operands, this PR also removes the result token from nvgpu.device_async_create_group.
In addition, the documentation and examples were corrected.
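
For reference, a minimal before/after sketch of the changed ops (mirroring the roundtrip test in the diff below; %0 is a token produced by nvgpu.device_async_copy):

// Before: the group op returns a token and the wait op consumes it, even though
// the lowering ignores that token.
%token = nvgpu.device_async_create_group %0
nvgpu.device_async_wait %token {numGroups = 1 : i32}

// After: no token; the wait is controlled only by the numGroups attribute.
nvgpu.device_async_create_group %0
nvgpu.device_async_wait {numGroups = 1 : i32}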

@llvmbot
Member

llvmbot commented Mar 9, 2025

@llvm/pr-subscribers-mlir-nvgpu

@llvm/pr-subscribers-mlir

Author: lonely eagle (linuxlonelyeagle)

Changes

There seem to be some errors in the current definitions of nvgpu.device_async_create_group and nvgpu.device_async_wait.
nvgpu.device_async_wait should not have operands. If it had operands, the semantics would be to wait for the specific copy operations named by those operands, but it actually waits on groups, so it should not take an operand; the lowering pattern in nvgpu-to-nvvm shows that the operand is unused. Since nvgpu.device_async_wait no longer has operands, this PR also removes the result token from nvgpu.device_async_create_group.
In addition, the documentation and examples were corrected.


Full diff: https://github.com/llvm/llvm-project/pull/130482.diff

8 Files Affected:

  • (modified) mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td (+21-16)
  • (modified) mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp (+1-6)
  • (modified) mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp (+2-5)
  • (modified) mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir (+6-6)
  • (modified) mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir (+14-14)
  • (modified) mlir/test/Dialect/NVGPU/roundtrip.mlir (+4-4)
  • (modified) mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir (+6-6)
  • (modified) mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir (+7-12)
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td
index eb0fb90d271ed..03a9485e26bc7 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td
@@ -216,15 +216,13 @@ def NVGPU_DeviceAsyncCopyOp : NVGPU_Op<"device_async_copy", [
     // copy 2.
     %cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
     // group 1 contains copy 1 and copy 2.
-    %token1 = nvgpu.device_async_create_group %cp1, %cp2
+    nvgpu.device_async_create_group %cp1, %cp2
     // copy 3.
     %cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
     // group 2 contains copy 3.
-    %token2 = nvgpu.device_async_create_group %cp3
-    // after the wait copy 1 and copy 2 are complete.
-    nvgpu.device_async_wait %token1
-    // after the wait copy 3 is complete.
-    nvgpu.device_async_wait %token2
+    nvgpu.device_async_create_group %cp3
+    // after the wait copy 1, copy 2 and copy 3 are complete.
+    nvgpu.device_async_wait
     ```
 
     Example:
@@ -255,9 +253,7 @@ def NVGPU_DeviceAsyncCreateGroupOp : NVGPU_Op<"device_async_create_group", []> {
     The `nvgpu.device_async_create_group` op creates a group of memory accesses
     containing all the pending `device_async_copy` operations associated with
     argument tokens. Each token can only be part of one group.
-
-    It returns a token that can be use to wait until the group fully completes.
-
+ 
     This is meant to be used with `nvgpu.device_async_wait` to synchronize copies
     as explained in those ops descriptions.
 
@@ -266,10 +262,10 @@ def NVGPU_DeviceAsyncCreateGroupOp : NVGPU_Op<"device_async_create_group", []> {
     Example:
 
     ```mlir
-    %0 = nvgpu.device_async_create_group
-  ```
+    %cp = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+    nvgpu.device_async_create_group %cp
+    ```
   }];
-  let results = (outs NVGPU_DeviceAsyncToken:$asyncToken);
   let arguments = (ins Variadic<NVGPU_DeviceAsyncToken>:$inputTokens);
   let assemblyFormat = [{
     $inputTokens attr-dict
@@ -291,13 +287,22 @@ def NVGPU_DeviceAsyncWaitOp : NVGPU_Op<"device_async_wait", []> {
     Example:
 
     ```mlir
-    nvgpu.device_async_wait %0
+    // copy 1.
+    %cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
+    nvgpu.device_async_create_group %cp1
+    // copy 2.
+    %cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+    nvgpu.device_async_create_group %cp2
+    // copy 3.
+    %cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+    nvgpu.device_async_create_group %cp3
+    // after the wait copy 1 and copy 2 are complete.
+    nvgpu.device_async_wait {numGroups = 1 : i32}
     ```
   }];
-  let arguments = (ins NVGPU_DeviceAsyncToken:$asyncDependencies,
-                       OptionalAttr<I32Attr>:$numGroups);
+  let arguments = (ins OptionalAttr<I32Attr>:$numGroups);
   let assemblyFormat = [{
-    $asyncDependencies attr-dict
+    attr-dict
   }];
 }
 
diff --git a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
index f53de416f2abd..3bf1fd04d1759 100644
--- a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
+++ b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
@@ -733,12 +733,7 @@ struct NVGPUAsyncCreateGroupLowering
   LogicalResult
   matchAndRewrite(nvgpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,
                   ConversionPatternRewriter &rewriter) const override {
-    rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
-    // Drop the result token.
-    Value zero = rewriter.create<LLVM::ConstantOp>(
-        op->getLoc(), IntegerType::get(op.getContext(), 32),
-        rewriter.getI32IntegerAttr(0));
-    rewriter.replaceOp(op, zero);
+    rewriter.replaceOpWithNewOp<NVVM::CpAsyncCommitGroupOp>(op);
     return success();
   }
 };
diff --git a/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp b/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
index 10bc1993ffd96..08794b2b328fa 100644
--- a/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
+++ b/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
@@ -265,11 +265,8 @@ void nvgpu::createAsyncGroups(RewriterBase &rewriter, Operation *op,
     }
 
     // Create the group and wait for it right after.
-    Value groupToken = rewriter.create<nvgpu::DeviceAsyncCreateGroupOp>(
-        op->getLoc(), nvgpu::DeviceAsyncTokenType::get(op->getContext()),
-        tokens);
-    rewriter.create<nvgpu::DeviceAsyncWaitOp>(op->getLoc(), groupToken,
-                                              nullptr);
+    rewriter.create<nvgpu::DeviceAsyncCreateGroupOp>(op->getLoc(), tokens);
+    rewriter.create<nvgpu::DeviceAsyncWaitOp>(op->getLoc(), nullptr);
     // Clean up old stores.
     for (Operation *writeOp : group)
       rewriter.eraseOp(writeOp);
diff --git a/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir b/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
index 6b59b5e4343b4..524eb3e1fa7b1 100644
--- a/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
+++ b/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
@@ -241,9 +241,9 @@ func.func @async_cp(
   // CHECK-DAG: nvvm.cp.async.shared.global %[[ADDRESSDST]], %[[CAST2]], 16, cache = ca
   %0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4 : memref<128x128xf32> to memref<3x16x128xf32, 3>
   // CHECK: nvvm.cp.async.commit.group
-  %1 = nvgpu.device_async_create_group %0
+  nvgpu.device_async_create_group %0
   // CHECK: nvvm.cp.async.wait.group 1
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32 }
+  nvgpu.device_async_wait { numGroups = 1 : i32 }
 
   // CHECK: nvvm.cp.async.shared.global %{{.*}}, %{{.*}}, 16, cache = cg
   %2 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4 {bypassL1}: memref<128x128xf32> to memref<3x16x128xf32, 3>
@@ -299,9 +299,9 @@ func.func @async_cp_zfill_f32_align4(
   // CHECK-DAG: nvvm.cp.async.shared.global %[[ADDRESSDST]], %[[CAST2]], 16, cache = cg, %[[c5]]
   %0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4, %srcElements {bypassL1}: memref<128x128xf32> to memref<3x16x128xf32, 3>
   // CHECK: nvvm.cp.async.commit.group
-  %1 = nvgpu.device_async_create_group %0
+  nvgpu.device_async_create_group %0
   // CHECK: nvvm.cp.async.wait.group 1
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32 }
+  nvgpu.device_async_wait { numGroups = 1 : i32 }
 
   return
 }
@@ -334,9 +334,9 @@ func.func @async_cp_zfill_f32_align1(
   // CHECK-DAG: nvvm.cp.async.shared.global %[[ADDRESSDST]], %[[CAST2]], 4, cache = ca, %[[c5]]
   %0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 1, %srcElements : memref<128x128xf32> to memref<3x16x128xf32, 3>
   // CHECK: nvvm.cp.async.commit.group
-  %1 = nvgpu.device_async_create_group %0
+  nvgpu.device_async_create_group %0
   // CHECK: nvvm.cp.async.wait.group 1
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32 }
+  nvgpu.device_async_wait { numGroups = 1 : i32 }
 
   return
 }
diff --git a/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir b/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir
index 7477e18728677..610afb56d3175 100644
--- a/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir
+++ b/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir
@@ -19,8 +19,8 @@ func.func @optimize_128x32xf16_32x128xf16(%arg0: memref<128x128xf16>,
   // CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stColPerm]]]
   %0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 8
       : memref<128x128xf16> to memref<128x32xf16, 3>
-  %1 = nvgpu.device_async_create_group %0
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+  nvgpu.device_async_create_group %0
+  nvgpu.device_async_wait { numGroups = 1 : i32}
 
   // CHECK: [[c6:%.+]] = arith.constant 6 : index
   // CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c6]]
@@ -39,8 +39,8 @@ func.func @optimize_128x32xf16_32x128xf16(%arg0: memref<128x128xf16>,
   // CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shmB]][[[stRow]], [[stColPerm]]]
   %2 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shmB[%stRow, %stCol], 8
       : memref<128x128xf16> to memref<32x128xf16, 3>
-  %3 = nvgpu.device_async_create_group %0
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+  nvgpu.device_async_create_group %0
+  nvgpu.device_async_wait { numGroups = 1 : i32}
 
   // CHECK: [[c15:%.+]] = arith.constant 15 : index
   // CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c15]]
@@ -76,8 +76,8 @@ func.func @optimize_64x16xf32_16x64xf32(%arg0: memref<128x128xf32>,
   // CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stColPerm]]]
   %0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 4
       : memref<128x128xf32> to memref<64x16xf32, 3>
-  %1 = nvgpu.device_async_create_group %0
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+  nvgpu.device_async_create_group %0
+  nvgpu.device_async_wait { numGroups = 1 : i32}
 
   // CHECK: [[c6:%.+]] = arith.constant 6 : index
   // CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c6]]
@@ -132,8 +132,8 @@ func.func @optimize_64x16xf32_16x64xf32(%arg0: memref<128x128xf32>,
   // CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shmB]][[[stRow]], [[stColPerm]]]
   %2 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shmB[%stRow, %stCol], 4
       : memref<128x128xf32> to memref<16x64xf32, 3>
-  %3 = nvgpu.device_async_create_group %0
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+  nvgpu.device_async_create_group %0
+  nvgpu.device_async_wait { numGroups = 1 : i32}
 
   // CHECK: [[c15:%.+]] = arith.constant 15 : index
   // CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c15]]
@@ -177,8 +177,8 @@ func.func @small_column_size_f64(%arg0: memref<32x32xf64>,
   // CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stColPerm]]]
   %0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 2
       : memref<32x32xf64> to memref<32x4xf64, 3>
-  %1 = nvgpu.device_async_create_group %0
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+  nvgpu.device_async_create_group %0
+  nvgpu.device_async_wait { numGroups = 1 : i32}
 
   // CHECK: [[c6:%.+]] = arith.constant 4 : index
   // CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c6]]
@@ -203,8 +203,8 @@ func.func @too_small_column_size_f16(%arg0: memref<128x128xf16>,
   // CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stCol]]]
   %0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 8
       : memref<128x128xf16> to memref<128x8xf16, 3>
-  %1 = nvgpu.device_async_create_group %0
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+  nvgpu.device_async_create_group %0
+  nvgpu.device_async_wait { numGroups = 1 : i32}
 
   // CHECK: nvgpu.ldmatrix [[shm]][[[fragRow]], [[fragCol]]]
   %mat = nvgpu.ldmatrix %shm[%fragRow, %fragCol] {numTiles = 1 : i32, transpose = false}
@@ -229,8 +229,8 @@ func.func @abort_if_subview(%arg0: memref<128x128xf16>,
   // CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stCol]]]
   %0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 8
       : memref<128x128xf16> to memref<128x32xf16, 3>
-  %1 = nvgpu.device_async_create_group %0
-  nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+  nvgpu.device_async_create_group %0
+  nvgpu.device_async_wait { numGroups = 1 : i32}
 
   // CHECK: nvgpu.ldmatrix [[shmView]][[[fragRow]], [[fragCol]]]
   %mat = nvgpu.ldmatrix %shmView[%fragRow, %fragCol] {numTiles = 1 : i32, transpose = false}
diff --git a/mlir/test/Dialect/NVGPU/roundtrip.mlir b/mlir/test/Dialect/NVGPU/roundtrip.mlir
index ad516b4d2c200..dbd9c368d9e47 100644
--- a/mlir/test/Dialect/NVGPU/roundtrip.mlir
+++ b/mlir/test/Dialect/NVGPU/roundtrip.mlir
@@ -63,9 +63,9 @@ func.func @async_cp(%dst : memref<2x7x5xf32, 3>, %src : memref<4x5xf32>){
   %c0 = arith.constant 0 : index
   // CHECK: nvgpu.device_async_copy %{{.*}}[{{.*}}, {{.*}}], %{{.*}}[{{.*}}, {{.*}}, {{.*}}], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
   %0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
-  // CHECK: %{{.*}} = nvgpu.device_async_create_group
-  %token = nvgpu.device_async_create_group %0
-  // CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1 : i32}
-  nvgpu.device_async_wait %token {numGroups = 1 : i32}
+  // CHECK: nvgpu.device_async_create_group
+  nvgpu.device_async_create_group %0
+  // CHECK: nvgpu.device_async_wait {numGroups = 1 : i32}
+  nvgpu.device_async_wait {numGroups = 1 : i32}
   return
 }
diff --git a/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir b/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir
index 8290001c45856..aaaeb50854dc4 100644
--- a/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir
+++ b/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir
@@ -17,8 +17,8 @@ builtin.module {
     // CHECK: %[[CP1:.*]] = nvgpu.device_async_copy {{.*}}, {{.*}}, 1
     %2 = vector.transfer_read %a[%c0, %c4], %cst_0 {in_bounds = [true]} : memref<1024x1024xf32>, vector<1xf32>
     vector.transfer_write %2, %0[%c0, %c4, %c0] {in_bounds = [true]} : vector<1xf32>, memref<4x32x16xf32, #gpu.address_space<workgroup>>
-    // CHECK: %[[G:.*]] = nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
-    // CHECK: nvgpu.device_async_wait %[[G]]
+    // CHECK: nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
+    // CHECK: nvgpu.device_async_wait
     return
   }
 
@@ -51,8 +51,8 @@ builtin.module {
     // CHECK: %[[CP1:.*]] = nvgpu.device_async_copy {{.*}}, {{.*}}, 1 :
     %2 = vector.transfer_read %a[%c0, %c4], %cst_0 {in_bounds = [true]} : memref<1024x1024xf32>, vector<1xf32>
     vector.transfer_write %2, %0[%c0, %c4, %c0] {in_bounds = [true]} : vector<1xf32>, memref<4x32x16xf32, #gpu.address_space<workgroup>>
-    // CHECK: %[[G:.*]] = nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
-    // CHECK: nvgpu.device_async_wait %[[G]]
+    // CHECK: nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
+    // CHECK: nvgpu.device_async_wait
     return
   }
 
@@ -83,8 +83,8 @@ builtin.module {
     // CHECK: %[[CP1:.*]] = nvgpu.device_async_copy {{.*}}, {{.*}}, 1 :
     %2 = vector.load %a[%c0, %c4] : memref<1024x1024xf32>, vector<1xf32>
     vector.store %2, %0[%c0, %c4, %c0] : memref<4x32x16xf32, #gpu.address_space<workgroup>>, vector<1xf32>
-    // CHECK: %[[G:.*]] = nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
-    // CHECK: nvgpu.device_async_wait %[[G]]
+    // CHECK: nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
+    // CHECK: nvgpu.device_async_wait
     return
   }
 
diff --git a/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir b/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
index e959949babd9e..e93a6a40391bb 100644
--- a/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
+++ b/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
@@ -94,13 +94,11 @@ func.func @async_depth_2_predicated(%global: memref<?xf32>, %alloc_size: index)
   %c0f = arith.constant 0.0 : f32
   // CHECK: %[[TOKEN0:.+]] = nvgpu.device_async_copy
   // CHECK: %[[TOKEN1:.+]] = nvgpu.device_async_copy
-  // CHECK: scf.for %[[I:.+]] = {{.*}} iter_args
-  // CHECK-SAME: %[[ITER_ARG0:.+]] = %[[TOKEN0]]
-  // CHECK-SAME: %[[ITER_ARG1:.+]] = %[[TOKEN1]]
+  // CHECK: scf.for %[[I:.+]] = {{.*}}
   scf.for %i = %c0 to %c98 step %c4 {
     // Condition for the predication "select" below.
     // CHECK:   %[[CMP0:.+]] = arith.cmpi slt, %[[I]], %[[C90]]
-    // CHECK:   nvgpu.device_async_wait %[[ITER_ARG0]] {numGroups = 1
+    // CHECK:   nvgpu.device_async_wait {numGroups = 1
     // Original "select" with updated induction variable.
     // CHECK:   %[[I_PLUS_8:.+]] = arith.addi %[[I]], %[[C8]]
     // CHECK:   %[[CMP1:.+]] = arith.cmpi slt, %[[I_PLUS_8]], %[[C96]]
@@ -122,9 +120,7 @@ func.func @async_depth_2_predicated(%global: memref<?xf32>, %alloc_size: index)
     %token = nvgpu.device_async_copy %global[%i], %shared[%i], 4, %read_size
       : memref<?xf32> to memref<?xf32, #gpu.address_space<workgroup>>
 
-    nvgpu.device_async_wait %token
-
-    // CHECK: scf.yield %[[ITER_ARG1]], %[[ASYNC_TOKEN]]
+    nvgpu.device_async_wait
   }
   // There is no need to wait for the last copies as it it was fully predicated
   // out and doesn't load the original data.
@@ -156,12 +152,11 @@ func.func @async_depth_2_peeled(%global: memref<?xf32>) {
   // CHECK: nvgpu.device_async_copy
   // CHECK: nvgpu.device_async_copy
   // CHECK: scf.for
-  // CHECK:   nvgpu.device_async_wait %{{.*}} {numGroups = 1
+  // CHECK:   nvgpu.device_async_wait {numGroups = 1
   // CHECK:   arith.select
   // CHECK:   nvgpu.device_async_copy
-  // CHECK:   scf.yield
-  // CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1
-  // CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 0
+  // CHECK: nvgpu.device_async_wait {numGroups = 1
+  // CHECK: nvgpu.device_async_wait {numGroups = 0
   scf.for %i = %c0 to %c98 step %c4 {
     %c96 = arith.constant 96 : index
     %cond = arith.cmpi slt, %i, %c96 : index
@@ -169,7 +164,7 @@ func.func @async_depth_2_peeled(%global: memref<?xf32>) {
     %read_size = arith.select %cond, %c4, %c2 : index
     %token = nvgpu.device_async_copy %global[%i], %shared[%i], 4, %read_size
       : memref<?xf32> to memref<?xf32, #gpu.address_space<workgroup>>
-    nvgpu.device_async_wait %token
+    nvgpu.device_async_wait
   }
   return
 }

@llvmbot
Member

llvmbot commented Mar 9, 2025

@llvm/pr-subscribers-mlir-gpu

@grypp
Member

grypp commented Mar 11, 2025

I think this PR is breaking SSA semantics; can this cause problems?
Currently we have these ops:

%1 = nvgpu.device_async_create_group %0
nvgpu.device_async_wait %1 { numGroups = 1 : i32 }

The PR makes the ops look like the following. There is no SSA token anymore.

nvgpu.device_async_create_group %0
nvgpu.device_async_wait { numGroups = 1 : i32 }

@linuxlonelyeagle
Member Author

I think this PR is breaking SSA semantic, this can cause problems right? Currently we have this ops:

%1 = nvgpu.device_async_create_group %0
nvgpu.device_async_wait %1 { numGroups = 1 : i32 }

PR makes the ops like the following. There is no SSA anymore.

nvgpu.device_async_create_group %0
nvgpu.device_async_wait { numGroups = 1 : i32 }

Yes. But I don't think it will cause any problems. You just need to think of it as writing PTX inline assembly.
Here are my reasons:
1. https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-wait-group
As you can see there, the PTX instruction actually has no operands. To be honest, I think cp.async.wait_group has been misunderstood here.
The following code comes from https://mlir.llvm.org/docs/Dialects/NVGPU/#nvgpudevice_async_copy-nvgpudeviceasynccopyop

// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2

I think the use of the two waits is wrong. The first wait actually waits for all outstanding async copies; the token has no effect.
Here two groups are created. You need to wait for cp1 and cp2 to complete, that is, you need to wait for group 1 to complete while ignoring group 2. You should say:

// numGroups indicates the maximum number of groups allowed to remain unfinished.
nvgpu.device_async_wait {numGroups = 1 : i32}

One thing that needs to be made clear is that the wait waits on groups, not on the copy ops inside a particular group.
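
To make the numGroups semantics concrete, here is a sketch that combines the documentation example above with the nvgpu-to-nvvm conversion tests in the diff (the memref operands are the same placeholders used in the docs):

%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2; lowers to nvvm.cp.async.commit.group.
nvgpu.device_async_create_group %cp1, %cp2
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3; lowers to nvvm.cp.async.commit.group.
nvgpu.device_async_create_group %cp3
// Lowers to nvvm.cp.async.wait.group 1: block until at most one group (the most
// recent one, group 2) is still pending, i.e. group 1 has completed.
nvgpu.device_async_wait {numGroups = 1 : i32}
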
2. Consider the following structure, based on the concept above. Isn't this structure a bit strange?
The fundamental problem is that the current implementation assumes the wait is for the copies in a specific group, but it is not.

 // some async copy op...
%group = nvgpu.create_group ....
 nvgpu.async_wait_group %group { num_groups = nstage - 2 }
for xxx {
  for xxx {
    // some async copy op...
    %group = nvgpu.create_group ....
    nvgpu.async_wait_group %group { num_groups = nstage - 2 }
  } 
}

If I want to describe the semantics I mentioned above.

 // some async copy op...
%group = nvgpu.create_group ....
 nvgpu.async_wait_group %group { num_groups = nstage - 2 }
for xxx {
  for xxx {
    // some async copy op...
    %group1 = nvgpu.create_group ....
     %group2 = nvgpu.push_group %group, %group1
    nvgpu.async_wait_group %group2 { num_groups = nstage - 2 }
  } 
}

Doing this implies that we wait on the entire merged group. So the simplest approach is to remove the operand of the wait op, which really only needs the attribute.

@linuxlonelyeagle
Member Author

I think this PR is breaking SSA semantic, this can cause problems right? Currently we have this ops:

%1 = nvgpu.device_async_create_group %0
nvgpu.device_async_wait %1 { numGroups = 1 : i32 }

PR makes the ops like the following. There is no SSA anymore.

nvgpu.device_async_create_group %0
nvgpu.device_async_wait { numGroups = 1 : i32 }

As a side note, the group generated here by nvgpu.device_async_create_group %0 is actually a transparent group: producing an SSA value and then passing it to the wait op doesn't actually control anything.

@linuxlonelyeagle
Member Author

I think my second reason in the comment above was not very clear, but I have now found a good example that shows the CuTe abstraction:
https://github.com/NVIDIA/cutlass/blob/06e560d98a5fe8acb975db2c4c26817b6c90acb1/examples/cute/tutorial/sgemm_sm80.cu#L250

@grypp
Member

grypp commented Mar 13, 2025

When you break the SSA semantics like the following, is reordering the ops allowed?

nvgpu.device_async_create_group %0
nvgpu.device_async_wait { numGroups = 1 : i32 }

@linuxlonelyeagle
Member Author

When you break SSA semantic like following, is reordering the OPs allowed?

nvgpu.device_async_create_group %0
nvgpu.device_async_wait { numGroups = 1 : i32 }

They cannot be reordered; nvgpu.device_async_wait should behave like gpu.barrier.

@grypp
Member

grypp commented Mar 14, 2025

When you break SSA semantic like following, is reordering the OPs allowed?

nvgpu.device_async_create_group %0

nvgpu.device_async_wait { numGroups = 1 : i32 }

It cannot be reordered, nvgpu.device_async_wait should be similar to gpu.barrier.

They should be reordered yes. But what stops the compiler from reordering them? With the SSA semantics, the reordering wasn't possible.

@joker-eph
Collaborator

But what stops the compiler to reorder them?

They aren't marked as having no side effects.

@linuxlonelyeagle
Member Author

When you break SSA semantic like following, is reordering the OPs allowed?

nvgpu.device_async_create_group %0

nvgpu.device_async_wait { numGroups = 1 : i32 }

It cannot be reordered, nvgpu.device_async_wait should be similar to gpu.barrier.

They should be reordered yes. But what stops the compiler to reorder them? When there was SSA semantic, the reordering wasn't possible.

I think they are not Pure ops. Something like a constant op is Pure, so running -canonicalize can reorder it; since these ops are not Pure, they will not be reordered.

@grypp
Member

grypp commented Mar 15, 2025

They should be reordered yes.

Typo: they should not be reordered.

To prevent them from being reordered, we need to make them side-effecting. Otherwise, MLIR—or any downstream compiler—is free to reorder them, which is not what we want.

That said, making them side-effecting is a big hammer. Preserving SSA semantics is, in my opinion, a less intrusive approach.

Your PR makes sense if we didn’t have SSA. I can totally understand the reasoning if these ops are lowered directly to PTX, where there are no tokens. However, PTX isn’t SSA anyway, so I’m not sure it’s the best comparison.

If you look at vectors or structs at the NVGPU level, you’ll see the same thing: SSA semantics are preserved.

@grypp
Member

grypp commented Mar 15, 2025

By the way, it looks like you are doing interesting work in nvgpu/nvvm. Here is my Discord if you want to discuss things at higher bandwidth: 'guraypp#5865'

@joker-eph
Collaborator

To prevent them from being reordered, we need to make them side-effecting.

To be clear: they are already side-effecting.

@grypp
Member

grypp commented Mar 16, 2025

To prevent them from being reordered, we need to make them side-effecting.

To be clear: they are already side-effecting.

I appreciate the clarification. I misread your comment yesterday, sorry, my bad. In that case, my reordering concern is not an issue.

I'll need to take a look at this and the lowered IR to understand why this was implemented with SSA in the first place.

@linuxlonelyeagle
Member Author

The original implementation using SSA values should be traceable from here. This example contains the original intended usage (it assumes that the wait is for a copy operation, not for the groups that have been committed):
https://mlir.llvm.org/docs/Dialects/NVGPU/#nvgpudevice_async_copy-nvgpudeviceasynccopyop

// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2
Example:

%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
  memref<4x5xf32> to memref<2x7x5xf32, 3>

@grypp
Member

grypp commented Mar 17, 2025

Here is the RFC and code review discussion:
https://discourse.llvm.org/t/modeling-gpu-async-copy-ampere-feature/4924
https://reviews.llvm.org/D119191

It seems this PR's approach (without token) was previously discussed there. However, since it's close to low-level PTX, the design decision was to avoid this approach in favor of using SSA and tokens.

In my personal opinion, there's not much practical difference with or without the token (I've learned that reordering isn't an issue). Having SSA tokens is slightly better as it makes dependencies clearer.

Overall, there doesn't seem to be a clear advantage to adopting this PR. Let me know what you think.

@linuxlonelyeagle
Member Author

  • Introducing the token was presumably meant to enable better analysis, but I think it is too much of a hack and makes normal usage awkward.
    The following is an nStage pipelining algorithm.
// prefetch data
%token = async.cp
%token_0 = commit_group %token

// cp op and commit op should be nstage - 1 times.

loop args(%token_0,...) { // nstage - 1 tokens
   wait [%token_0, ...]
   compute Ops
   // prefetch data
   %token_0 = async.cp
   %token_1 = commit_group %token_0
   // nstage - 1 times.
   yield %token_1, ...
}

It seems to me that there is no need to put so many parameters in the loop.

  • The semantic confusion
%token = async.cp
%token_0 = commit_group %token

%token_1 = async.cp
%token_2 = commit_group %token_1

Regarding the above situation, aren't the following usages all the same?

// wait all cp1 and cp2
async_wait {num_groups = 0}
async_wait [%token_1] {num_groups = 0}
async_wait [%token_1, %token_2] {num_groups = 0}
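
Based on the nvgpu-to-nvvm conversion test in the diff above, only the numGroups attribute drives the generated code, so each of the three forms above should lower to roughly the same thing:

nvvm.cp.async.wait.group 0  // wait until zero groups remain pending, i.e. everything has completed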

As with the example I left in a recent comment, it also has problems. The current changes are closer to the original semantics of PTX; people familiar with PTX will immediately know how to use these ops.

  • Taking a step back, if we drop the group dependencies, would it be difficult to analyze which groups the current async_wait op is waiting for? Maybe it is not difficult to do; enough information is kept in the IR.

@linuxlonelyeagle
Member Author

I read the comments in the review (there is a lot of content), but one thing I can tell is that the wait seems to have been assumed from the start to apply to copy operations, not to a number of groups.

@linuxlonelyeagle
Member Author

Can we go forward with this PR?

@grypp
Member

grypp commented Mar 27, 2025

I think the way your PR implements it is also fine. But as we read in the previous discussion, folks chose the current approach. I don’t see much difference between the two. The current way is easier to follow in terms of dependencies, while your PR makes that a little harder.

It looks like the generated PTX will be the same either way.

I'm wondering: does the current approach block you from implementing an algorithm? If that's the case, can you elaborate on what that case is?

@linuxlonelyeagle
Member Author

It looks like the generated PTX will look the same in either way.

Even though the generated PTX is the same, that's because the group token was replaced by creating a ConstantOp, as you can see in the lowering pattern. I have changed that in this PR.

I’m wondering—does the current approach block you from implementing an algorithm? If that’s the case, can you elaborate what is this case?

It really does affect the nStage algorithm. Maybe the example I gave wasn't detailed enough.

// prefetch data
%token = async.cp
%token_0 = commit_group %token

// cp op and commit op should be nstage - 1 times.

loop args(%token_0,...) { // nstage - 1 tokens
   wait [%token_0, ...]
   compute Ops
   // prefetch data
   %token_0 = async.cp
   %token_1 = commit_group %token_0
   // nstage - 1 times.
   yield %token_1, ...
}

Now we assume that n is 3.

// prefetch data 2 times. The async copy op and commit op should appear nstage - 1 times.

%token = async.cp
%token_0 = commit_group %token

%token_1 = async.cp
%token_2 = commit_group %token_1

// Ignore register prefetching.

loop args(%token_0, %token_2) { // nstage - 1 tokens
   // numGroups = n - 2: wait until the %token_0 group has finished copying data from global memory to shared memory.
   wait [%token_0, %token_2] {numGroups = 1}
   compute ops
   // prefetch data
   %token_3 = async.cp
   %token_4 = commit_group %token_3
   // The number of operands of the yield op should equal n - 1, representing the currently live groups.
   yield %token_2, %token_4
}

As n gets larger, the number of parameters iterated in the loop grows linearly. But it's not really necessary to thread the result of commit_group through the loop, and doing so makes it harder for me to generate GPU code. For now, this makes sense to me personally.
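
For what it's worth, a rough sketch of the same 3-stage pipeline written with the ops as changed by this PR; no tokens need to be threaded through the loop (the memrefs, indices, and bounds below are illustrative placeholders, not code from the PR):

// Prefetch nstage - 1 = 2 stages before the loop.
%cp0 = nvgpu.device_async_copy %global[%i0], %smem[%s0], 4 : memref<?xf32> to memref<?xf32, 3>
nvgpu.device_async_create_group %cp0
%cp1 = nvgpu.device_async_copy %global[%i1], %smem[%s1], 4 : memref<?xf32> to memref<?xf32, 3>
nvgpu.device_async_create_group %cp1
scf.for %i = %c0 to %cN step %c1 {
  // Wait until at most nstage - 2 = 1 group is still in flight, i.e. the oldest
  // prefetched stage has landed in shared memory.
  nvgpu.device_async_wait {numGroups = 1 : i32}
  // ... compute on that stage ...
  // Prefetch the next stage; there is no token result, so nothing to yield.
  %cp = nvgpu.device_async_copy %global[%inext], %smem[%snext], 4 : memref<?xf32> to memref<?xf32, 3>
  nvgpu.device_async_create_group %cp
}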

On top of that, the current nvgpu.device_async_wait is semantically incorrect.

%token = async.cp
%token_0 = commit_group %token

%token_1 = async.cp
%token_2 = commit_group %token_1

// wait for all of cp1 and cp2; the following three wait ops do practically the same thing.
async_wait {num_groups = 0}
async_wait [%token_1] {num_groups = 0}
async_wait [%token_1, %token_2] {num_groups = 0}

I think we need to fix this; the thing that really acts as the control is the numGroups attribute. This makes sense for MLIR.
You can look at the current MLIR documentation for this op: the semantics described there are contrary to what it actually does.

@linuxlonelyeagle
Member Author

Do you guys have any questions about my reply above? I'm honestly looking forward to further discussion.

@grypp
Member

grypp commented Mar 30, 2025

As n gets larger, the parameters to be iterated in the loop increase linearly.But it's not really necessary to introduce the result of the commit_group in the loop.This made it more difficult for me to codegen gpu code.

So, you're saying that generating iter_args for the token is difficult.
I'm not sure this is a sufficient argument to change the semantics of the op. Also, generating an extra iter_arg doesn't make the codegen very complicated. Additionally, when it's lowered, there's no downside because it results in the same PTX, whether using the current approach or the proposed one.

I think it’s better to focus on whether the current approach actually blocks you.

@linuxlonelyeagle
Member Author

linuxlonelyeagle commented Mar 30, 2025

As n gets larger, the parameters to be iterated in the loop increase linearly.But it's not really necessary to introduce the result of the commit_group in the loop.This made it more difficult for me to codegen gpu code.

So, you're saying that generating iter_args for the token is difficult. However, I don't find it challenging to generate iter_args. When it's lowered, there's no downside because it results in the same PTX, whether using the current approach or the proposed one.

I think it’s better to focus on whether the current approach actually blocks you.

This approach wouldn't really stop me, but it certainly adds to the complexity of the problem. Keep in mind that there are other parameters in the loop that need to be iterated.
Two parameters are always required in the loop: smem_read, which indicates where the current iteration prefetches data into shared memory, and smem_write, which indicates the shared-memory location to be read into registers.

There are also three possible parameters: the A, B, and C register fragments of the matrix. If the k dimension of the warp tile to be computed equals the k size of the tensor core, we do not need to prefetch registers, and these three parameters do not exist.

Now we are also discussing the parameters for the wait group. Honestly, the issue is already complicated enough.
This algorithm is similar to https://github.com/NVIDIA/cutlass/blob/06e560d98a5fe8acb975db2c4c26817b6c90acb1/examples/cute/tutorial/sgemm_sm80.cu

A key point is that I'm not the only one who would benefit from this PR, though I have been waiting for it to merge. You didn't answer my second question.

// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2

It comes from https://mlir.llvm.org/docs/Dialects/NVGPU/#nvgpudevice_async_create_group-nvgpudeviceasynccreategroupop. It's wrong, isn't it?
The following line actually waits for all the groups.

// Actually waits for group 1 and group 2.
nvgpu.device_async_wait %token1

@linuxlonelyeagle
Member Author

I edited my last comment to fix the second problem; some of the details were not correct.
One more thing: even if the groups being waited on are not reflected in the IR, you can still analyze which groups are currently being waited on.
And one thing I must clarify is that the wait op does not specify which groups it waits on. This was a bit of a misnomer when these ops were first designed, as can be seen in the code review; the actual behavior does not match the semantics, which can also be seen in the documentation.

@joker-eph @grypp Those are all my reasons for this PR. If you still think it isn't needed, I'll close it.

@grypp
Member

grypp commented Mar 30, 2025

I think your first case isn't problematic, as I mentioned above. We also discussed that having SSA semantics is a design choice.

However, you have found a second case that could be an issue in terms of the semantics of the nvgpu abstraction. Let me take a closer look at it tomorrow.

Let's keep the PR open until we all agree.

@linuxlonelyeagle
Member Author

I think your first case isn't problematic, as I mentioned above. We also discussed that having an SSA semantic is a design choice.

However, the you have found the second case that could be an issue in terms of semantic of the nvgpu abstraction. Let me take a closer look at it tomorrow.

Let's keep the PR open until we all agree.

If it were only the first issue, I wouldn't have opened this PR; it's the second issue, which benefits everyone, that made me open it.

@joker-eph
Collaborator

We also discussed that having an SSA semantic is a design choice.

SSA semantics to me imply that these tokens are tracked and honored in the lowering: is that the case, though?

@linuxlonelyeagle
Member Author

We also discussed that having an SSA semantic is a design choice.

SSA semantics to me implies that these are tracked and honored in the lowering: is it the case though?

The tracking is not really done during lowering; the token is just replaced with a zero constant. So you are tracking something that carries no meaning.

  LogicalResult
  matchAndRewrite(nvgpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
    // Drop the result token.
    Value zero = rewriter.create<LLVM::ConstantOp>(
        op->getLoc(), IntegerType::get(op.getContext(), 32),
        rewriter.getI32IntegerAttr(0));
    rewriter.replaceOp(op, zero);
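    // (This PR replaces the token-dropping code above with the single rewrite below.)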
    rewriter.replaceOpWithNewOp<NVVM::CpAsyncCommitGroupOp>(op);
    return success();
  }
};
