[mlir][nvgpu] update commit group and wait async ops #130482
Conversation
@llvm/pr-subscribers-mlir-nvgpu @llvm/pr-subscribers-mlir @llvm/pr-subscribers-mlir-gpu

Author: lonely eagle (linuxlonelyeagle)

Changes: There seem to be some errors in the current definitions of `nvgpu.device_async_create_group` and `nvgpu.device_async_wait`.

Full diff: https://github.com/llvm/llvm-project/pull/130482.diff

8 Files Affected:
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td
index eb0fb90d271ed..03a9485e26bc7 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUOps.td
@@ -216,15 +216,13 @@ def NVGPU_DeviceAsyncCopyOp : NVGPU_Op<"device_async_copy", [
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
- %token1 = nvgpu.device_async_create_group %cp1, %cp2
+ nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
- %token2 = nvgpu.device_async_create_group %cp3
- // after the wait copy 1 and copy 2 are complete.
- nvgpu.device_async_wait %token1
- // after the wait copy 3 is complete.
- nvgpu.device_async_wait %token2
+ nvgpu.device_async_create_group %cp3
+ // after the wait copy 1, copy 2 and copy 3 are complete.
+ nvgpu.device_async_wait
```
Example:
@@ -255,9 +253,7 @@ def NVGPU_DeviceAsyncCreateGroupOp : NVGPU_Op<"device_async_create_group", []> {
The `nvgpu.device_async_create_group` op creates a group of memory accesses
containing all the pending `device_async_copy` operations associated with
argument tokens. Each token can only be part of one group.
-
- It returns a token that can be use to wait until the group fully completes.
-
+
This is meant to be used with `nvgpu.device_async_wait` to synchronize copies
as explained in those ops descriptions.
@@ -266,10 +262,10 @@ def NVGPU_DeviceAsyncCreateGroupOp : NVGPU_Op<"device_async_create_group", []> {
Example:
```mlir
- %0 = nvgpu.device_async_create_group
- ```
+ %cp = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+ nvgpu.device_async_create_group %cp
+ ```
}];
- let results = (outs NVGPU_DeviceAsyncToken:$asyncToken);
let arguments = (ins Variadic<NVGPU_DeviceAsyncToken>:$inputTokens);
let assemblyFormat = [{
$inputTokens attr-dict
@@ -291,13 +287,22 @@ def NVGPU_DeviceAsyncWaitOp : NVGPU_Op<"device_async_wait", []> {
Example:
```mlir
- nvgpu.device_async_wait %0
+ // copy 1.
+ %cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
+ nvgpu.device_async_create_group %cp1
+ // copy 2.
+ %cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+ nvgpu.device_async_create_group %cp2
+ // copy 3.
+ %cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+ nvgpu.device_async_create_group %cp3
+ // after the wait copy 1 and copy 2 are complete.
+ nvgpu.device_async_wait {numGroups = 1 : i32}
```
}];
- let arguments = (ins NVGPU_DeviceAsyncToken:$asyncDependencies,
- OptionalAttr<I32Attr>:$numGroups);
+ let arguments = (ins OptionalAttr<I32Attr>:$numGroups);
let assemblyFormat = [{
- $asyncDependencies attr-dict
+ attr-dict
}];
}
diff --git a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
index f53de416f2abd..3bf1fd04d1759 100644
--- a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
+++ b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
@@ -733,12 +733,7 @@ struct NVGPUAsyncCreateGroupLowering
LogicalResult
matchAndRewrite(nvgpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {
- rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
- // Drop the result token.
- Value zero = rewriter.create<LLVM::ConstantOp>(
- op->getLoc(), IntegerType::get(op.getContext(), 32),
- rewriter.getI32IntegerAttr(0));
- rewriter.replaceOp(op, zero);
+ rewriter.replaceOpWithNewOp<NVVM::CpAsyncCommitGroupOp>(op);
return success();
}
};
diff --git a/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp b/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
index 10bc1993ffd96..08794b2b328fa 100644
--- a/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
+++ b/mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
@@ -265,11 +265,8 @@ void nvgpu::createAsyncGroups(RewriterBase &rewriter, Operation *op,
}
// Create the group and wait for it right after.
- Value groupToken = rewriter.create<nvgpu::DeviceAsyncCreateGroupOp>(
- op->getLoc(), nvgpu::DeviceAsyncTokenType::get(op->getContext()),
- tokens);
- rewriter.create<nvgpu::DeviceAsyncWaitOp>(op->getLoc(), groupToken,
- nullptr);
+ rewriter.create<nvgpu::DeviceAsyncCreateGroupOp>(op->getLoc(), tokens);
+ rewriter.create<nvgpu::DeviceAsyncWaitOp>(op->getLoc(), nullptr);
// Clean up old stores.
for (Operation *writeOp : group)
rewriter.eraseOp(writeOp);
diff --git a/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir b/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
index 6b59b5e4343b4..524eb3e1fa7b1 100644
--- a/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
+++ b/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
@@ -241,9 +241,9 @@ func.func @async_cp(
// CHECK-DAG: nvvm.cp.async.shared.global %[[ADDRESSDST]], %[[CAST2]], 16, cache = ca
%0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4 : memref<128x128xf32> to memref<3x16x128xf32, 3>
// CHECK: nvvm.cp.async.commit.group
- %1 = nvgpu.device_async_create_group %0
+ nvgpu.device_async_create_group %0
// CHECK: nvvm.cp.async.wait.group 1
- nvgpu.device_async_wait %1 { numGroups = 1 : i32 }
+ nvgpu.device_async_wait { numGroups = 1 : i32 }
// CHECK: nvvm.cp.async.shared.global %{{.*}}, %{{.*}}, 16, cache = cg
%2 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4 {bypassL1}: memref<128x128xf32> to memref<3x16x128xf32, 3>
@@ -299,9 +299,9 @@ func.func @async_cp_zfill_f32_align4(
// CHECK-DAG: nvvm.cp.async.shared.global %[[ADDRESSDST]], %[[CAST2]], 16, cache = cg, %[[c5]]
%0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4, %srcElements {bypassL1}: memref<128x128xf32> to memref<3x16x128xf32, 3>
// CHECK: nvvm.cp.async.commit.group
- %1 = nvgpu.device_async_create_group %0
+ nvgpu.device_async_create_group %0
// CHECK: nvvm.cp.async.wait.group 1
- nvgpu.device_async_wait %1 { numGroups = 1 : i32 }
+ nvgpu.device_async_wait { numGroups = 1 : i32 }
return
}
@@ -334,9 +334,9 @@ func.func @async_cp_zfill_f32_align1(
// CHECK-DAG: nvvm.cp.async.shared.global %[[ADDRESSDST]], %[[CAST2]], 4, cache = ca, %[[c5]]
%0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 1, %srcElements : memref<128x128xf32> to memref<3x16x128xf32, 3>
// CHECK: nvvm.cp.async.commit.group
- %1 = nvgpu.device_async_create_group %0
+ nvgpu.device_async_create_group %0
// CHECK: nvvm.cp.async.wait.group 1
- nvgpu.device_async_wait %1 { numGroups = 1 : i32 }
+ nvgpu.device_async_wait { numGroups = 1 : i32 }
return
}
diff --git a/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir b/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir
index 7477e18728677..610afb56d3175 100644
--- a/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir
+++ b/mlir/test/Dialect/NVGPU/optimize-shared-memory.mlir
@@ -19,8 +19,8 @@ func.func @optimize_128x32xf16_32x128xf16(%arg0: memref<128x128xf16>,
// CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stColPerm]]]
%0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 8
: memref<128x128xf16> to memref<128x32xf16, 3>
- %1 = nvgpu.device_async_create_group %0
- nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+ nvgpu.device_async_create_group %0
+ nvgpu.device_async_wait { numGroups = 1 : i32}
// CHECK: [[c6:%.+]] = arith.constant 6 : index
// CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c6]]
@@ -39,8 +39,8 @@ func.func @optimize_128x32xf16_32x128xf16(%arg0: memref<128x128xf16>,
// CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shmB]][[[stRow]], [[stColPerm]]]
%2 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shmB[%stRow, %stCol], 8
: memref<128x128xf16> to memref<32x128xf16, 3>
- %3 = nvgpu.device_async_create_group %0
- nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+ nvgpu.device_async_create_group %0
+ nvgpu.device_async_wait { numGroups = 1 : i32}
// CHECK: [[c15:%.+]] = arith.constant 15 : index
// CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c15]]
@@ -76,8 +76,8 @@ func.func @optimize_64x16xf32_16x64xf32(%arg0: memref<128x128xf32>,
// CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stColPerm]]]
%0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 4
: memref<128x128xf32> to memref<64x16xf32, 3>
- %1 = nvgpu.device_async_create_group %0
- nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+ nvgpu.device_async_create_group %0
+ nvgpu.device_async_wait { numGroups = 1 : i32}
// CHECK: [[c6:%.+]] = arith.constant 6 : index
// CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c6]]
@@ -132,8 +132,8 @@ func.func @optimize_64x16xf32_16x64xf32(%arg0: memref<128x128xf32>,
// CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shmB]][[[stRow]], [[stColPerm]]]
%2 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shmB[%stRow, %stCol], 4
: memref<128x128xf32> to memref<16x64xf32, 3>
- %3 = nvgpu.device_async_create_group %0
- nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+ nvgpu.device_async_create_group %0
+ nvgpu.device_async_wait { numGroups = 1 : i32}
// CHECK: [[c15:%.+]] = arith.constant 15 : index
// CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c15]]
@@ -177,8 +177,8 @@ func.func @small_column_size_f64(%arg0: memref<32x32xf64>,
// CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stColPerm]]]
%0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 2
: memref<32x32xf64> to memref<32x4xf64, 3>
- %1 = nvgpu.device_async_create_group %0
- nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+ nvgpu.device_async_create_group %0
+ nvgpu.device_async_wait { numGroups = 1 : i32}
// CHECK: [[c6:%.+]] = arith.constant 4 : index
// CHECK: [[srcBits:%.+]] = arith.andi [[fragRow]], [[c6]]
@@ -203,8 +203,8 @@ func.func @too_small_column_size_f16(%arg0: memref<128x128xf16>,
// CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stCol]]]
%0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 8
: memref<128x128xf16> to memref<128x8xf16, 3>
- %1 = nvgpu.device_async_create_group %0
- nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+ nvgpu.device_async_create_group %0
+ nvgpu.device_async_wait { numGroups = 1 : i32}
// CHECK: nvgpu.ldmatrix [[shm]][[[fragRow]], [[fragCol]]]
%mat = nvgpu.ldmatrix %shm[%fragRow, %fragCol] {numTiles = 1 : i32, transpose = false}
@@ -229,8 +229,8 @@ func.func @abort_if_subview(%arg0: memref<128x128xf16>,
// CHECK: nvgpu.device_async_copy [[arg0]][[[ldRow]], [[ldCol]]], [[shm]][[[stRow]], [[stCol]]]
%0 = nvgpu.device_async_copy %arg0[%ldRow, %ldCol], %shm[%stRow, %stCol], 8
: memref<128x128xf16> to memref<128x32xf16, 3>
- %1 = nvgpu.device_async_create_group %0
- nvgpu.device_async_wait %1 { numGroups = 1 : i32}
+ nvgpu.device_async_create_group %0
+ nvgpu.device_async_wait { numGroups = 1 : i32}
// CHECK: nvgpu.ldmatrix [[shmView]][[[fragRow]], [[fragCol]]]
%mat = nvgpu.ldmatrix %shmView[%fragRow, %fragCol] {numTiles = 1 : i32, transpose = false}
diff --git a/mlir/test/Dialect/NVGPU/roundtrip.mlir b/mlir/test/Dialect/NVGPU/roundtrip.mlir
index ad516b4d2c200..dbd9c368d9e47 100644
--- a/mlir/test/Dialect/NVGPU/roundtrip.mlir
+++ b/mlir/test/Dialect/NVGPU/roundtrip.mlir
@@ -63,9 +63,9 @@ func.func @async_cp(%dst : memref<2x7x5xf32, 3>, %src : memref<4x5xf32>){
%c0 = arith.constant 0 : index
// CHECK: nvgpu.device_async_copy %{{.*}}[{{.*}}, {{.*}}], %{{.*}}[{{.*}}, {{.*}}, {{.*}}], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
- // CHECK: %{{.*}} = nvgpu.device_async_create_group
- %token = nvgpu.device_async_create_group %0
- // CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1 : i32}
- nvgpu.device_async_wait %token {numGroups = 1 : i32}
+ // CHECK: nvgpu.device_async_create_group
+ nvgpu.device_async_create_group %0
+ // CHECK: nvgpu.device_async_wait {numGroups = 1 : i32}
+ nvgpu.device_async_wait {numGroups = 1 : i32}
return
}
diff --git a/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir b/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir
index 8290001c45856..aaaeb50854dc4 100644
--- a/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir
+++ b/mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir
@@ -17,8 +17,8 @@ builtin.module {
// CHECK: %[[CP1:.*]] = nvgpu.device_async_copy {{.*}}, {{.*}}, 1
%2 = vector.transfer_read %a[%c0, %c4], %cst_0 {in_bounds = [true]} : memref<1024x1024xf32>, vector<1xf32>
vector.transfer_write %2, %0[%c0, %c4, %c0] {in_bounds = [true]} : vector<1xf32>, memref<4x32x16xf32, #gpu.address_space<workgroup>>
- // CHECK: %[[G:.*]] = nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
- // CHECK: nvgpu.device_async_wait %[[G]]
+ // CHECK: nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
+ // CHECK: nvgpu.device_async_wait
return
}
@@ -51,8 +51,8 @@ builtin.module {
// CHECK: %[[CP1:.*]] = nvgpu.device_async_copy {{.*}}, {{.*}}, 1 :
%2 = vector.transfer_read %a[%c0, %c4], %cst_0 {in_bounds = [true]} : memref<1024x1024xf32>, vector<1xf32>
vector.transfer_write %2, %0[%c0, %c4, %c0] {in_bounds = [true]} : vector<1xf32>, memref<4x32x16xf32, #gpu.address_space<workgroup>>
- // CHECK: %[[G:.*]] = nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
- // CHECK: nvgpu.device_async_wait %[[G]]
+ // CHECK: nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
+ // CHECK: nvgpu.device_async_wait
return
}
@@ -83,8 +83,8 @@ builtin.module {
// CHECK: %[[CP1:.*]] = nvgpu.device_async_copy {{.*}}, {{.*}}, 1 :
%2 = vector.load %a[%c0, %c4] : memref<1024x1024xf32>, vector<1xf32>
vector.store %2, %0[%c0, %c4, %c0] : memref<4x32x16xf32, #gpu.address_space<workgroup>>, vector<1xf32>
- // CHECK: %[[G:.*]] = nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
- // CHECK: nvgpu.device_async_wait %[[G]]
+ // CHECK: nvgpu.device_async_create_group %[[CP0]], %[[CP1]]
+ // CHECK: nvgpu.device_async_wait
return
}
diff --git a/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir b/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
index e959949babd9e..e93a6a40391bb 100644
--- a/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
+++ b/mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
@@ -94,13 +94,11 @@ func.func @async_depth_2_predicated(%global: memref<?xf32>, %alloc_size: index)
%c0f = arith.constant 0.0 : f32
// CHECK: %[[TOKEN0:.+]] = nvgpu.device_async_copy
// CHECK: %[[TOKEN1:.+]] = nvgpu.device_async_copy
- // CHECK: scf.for %[[I:.+]] = {{.*}} iter_args
- // CHECK-SAME: %[[ITER_ARG0:.+]] = %[[TOKEN0]]
- // CHECK-SAME: %[[ITER_ARG1:.+]] = %[[TOKEN1]]
+ // CHECK: scf.for %[[I:.+]] = {{.*}}
scf.for %i = %c0 to %c98 step %c4 {
// Condition for the predication "select" below.
// CHECK: %[[CMP0:.+]] = arith.cmpi slt, %[[I]], %[[C90]]
- // CHECK: nvgpu.device_async_wait %[[ITER_ARG0]] {numGroups = 1
+ // CHECK: nvgpu.device_async_wait {numGroups = 1
// Original "select" with updated induction variable.
// CHECK: %[[I_PLUS_8:.+]] = arith.addi %[[I]], %[[C8]]
// CHECK: %[[CMP1:.+]] = arith.cmpi slt, %[[I_PLUS_8]], %[[C96]]
@@ -122,9 +120,7 @@ func.func @async_depth_2_predicated(%global: memref<?xf32>, %alloc_size: index)
%token = nvgpu.device_async_copy %global[%i], %shared[%i], 4, %read_size
: memref<?xf32> to memref<?xf32, #gpu.address_space<workgroup>>
- nvgpu.device_async_wait %token
-
- // CHECK: scf.yield %[[ITER_ARG1]], %[[ASYNC_TOKEN]]
+ nvgpu.device_async_wait
}
// There is no need to wait for the last copies as it it was fully predicated
// out and doesn't load the original data.
@@ -156,12 +152,11 @@ func.func @async_depth_2_peeled(%global: memref<?xf32>) {
// CHECK: nvgpu.device_async_copy
// CHECK: nvgpu.device_async_copy
// CHECK: scf.for
- // CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1
+ // CHECK: nvgpu.device_async_wait {numGroups = 1
// CHECK: arith.select
// CHECK: nvgpu.device_async_copy
- // CHECK: scf.yield
- // CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1
- // CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 0
+ // CHECK: nvgpu.device_async_wait {numGroups = 1
+ // CHECK: nvgpu.device_async_wait {numGroups = 0
scf.for %i = %c0 to %c98 step %c4 {
%c96 = arith.constant 96 : index
%cond = arith.cmpi slt, %i, %c96 : index
@@ -169,7 +164,7 @@ func.func @async_depth_2_peeled(%global: memref<?xf32>) {
%read_size = arith.select %cond, %c4, %c2 : index
%token = nvgpu.device_async_copy %global[%i], %shared[%i], 4, %read_size
: memref<?xf32> to memref<?xf32, #gpu.address_space<workgroup>>
- nvgpu.device_async_wait %token
+ nvgpu.device_async_wait
}
return
}
I think this PR is breaking SSA semantics, and this can cause problems, right?
The PR makes the ops look like the following. There is no SSA anymore.
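For reference, a minimal sketch of the post-PR form, based on the diff above (the copy's operands and memref shapes are illustrative):

```mlir
// The copy still returns a token.
%cp = nvgpu.device_async_copy %src[%c0], %dst[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// No result token on the group anymore.
nvgpu.device_async_create_group %cp
// No token operand on the wait anymore; only the attribute remains.
nvgpu.device_async_wait {numGroups = 1 : i32}
```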
Yes. But I don't think it will cause any problems. You just need to think of it as writing PTX inline assembly.
I think the use of the two waits is wrong. The first wait should wait for all async ops; the token here has no effect.
One thing that needs to be made clear is that the wait waits for groups, not the cp ops in a certain group.
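For context, the test updates in the diff show what these map to in NVVM:

```mlir
// Commits all pending device_async_copy ops into one group.
nvvm.cp.async.commit.group
// Blocks until at most 1 of the committed groups is still in flight.
nvvm.cp.async.wait.group 1
```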
If I want to describe the semantics I mentioned above: doing this implies that we wait on the entire group. So the simplest approach is to remove the operand of the wait op, which actually only needs the attribute.
As a side note, the group generated here by
I think the second reason in the comment above is not very clear. But I did find a good example now, which represents the abstraction of
When you break SSA semantics like the following, is reordering the ops allowed?
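A sketch of the concern, with illustrative values: without tokens, no use-def edges connect these ops, so only side effects could pin their order:

```mlir
// No result/operand links between the groups and the wait.
nvgpu.device_async_create_group %cp1
nvgpu.device_async_create_group %cp2
nvgpu.device_async_wait {numGroups = 1 : i32}
```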
It cannot be reordered.
They should be reordered yes. But what stops the compiler from reordering them? When there were SSA semantics, the reordering wasn't possible.
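For comparison, in the pre-PR form the token's use-def chain makes the dependency explicit (a sketch based on the old examples in the diff):

```mlir
%g = nvgpu.device_async_create_group %cp
// The use of %g keeps this wait anchored after the group creation.
nvgpu.device_async_wait %g {numGroups = 1 : i32}
```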
They aren't marked as having no side effects.
I think they are not Pure ops. A constant op, by contrast, is Pure, so by running -canonicalize it can be reordered.
Typo: they should not be reordered. To prevent them from being reordered, we need to make them side-effecting. Otherwise MLIR, or any downstream compiler, is free to reorder them, which is not what we want. That said, making them side-effecting is a big hammer; preserving SSA semantics is, in my opinion, a less intrusive approach. Your PR would make sense if we didn't have SSA. I can totally understand the reasoning if these ops are lowered directly to PTX, where there are no tokens. However, PTX isn't SSA anyway, so I'm not sure it's the best comparison. If you look at vectors or structs at the NVGPU level, you'll see the same thing: SSA semantics are preserved.
By the way, it looks like you are doing interesting work in nvgpu/nvvm. Here is my Discord if you want to discuss things at higher bandwidth: 'guraypp#5865'
To be clear: they are already side-effecting.
Given this comment, I guess this PR looks good to me.
I appreciate the clarification. I misread your comment yesterday; sorry, my bad. In that case, my reordering concern is not an issue. I'll need to take a look at this and the lowered IR to understand why this was implemented with SSA in the first place.
The original implementation using SSA values should be traceable from here. This example shows the original usage. (It should be understood that the wait was for a cp operation, not for the groups being committed.)
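A sketch of that original token-based usage, following the pre-PR examples in the diff:

```mlir
%cp = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
%token = nvgpu.device_async_create_group %cp
// Reads as "wait for %token", i.e. for that specific copy/group.
nvgpu.device_async_wait %token
```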
Here is the RFC and code review discussion: it seems this PR's approach (without a token) was previously discussed there. However, since it's close to low-level PTX, the design decision was to avoid this approach in favor of using SSA and tokens. In my personal opinion, there's not much practical difference with or without tokens. Overall, there doesn't seem to be a clear advantage to adopting this PR. Let me know what you think.
It seems to me that there is no need to put so many parameters in the loop.
Regarding the above situation, aren't the following usages the same?
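A sketch of the comparison (operands illustrative); per the conversion tests in the diff, both forms lower to the same nvvm.cp.async ops:

```mlir
// Current form, with a token that the lowering discards:
%t = nvgpu.device_async_create_group %cp
nvgpu.device_async_wait %t {numGroups = 1 : i32}

// This PR's form, without the token:
nvgpu.device_async_create_group %cp
nvgpu.device_async_wait {numGroups = 1 : i32}
```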
Like the example I left in a recent comment, it also has problems. The current changes are closer to the original semantics of PTX. People familiar with PTX will immediately know how to use these ops.
I read the comments in the review (there's just too much content), but one thing I can tell is this: it seems the wait was assumed to be for a copy operation in the first place, not for a number of groups.
Can we go forward with this PR?
I think the way your PR implements it is also fine. But as we read in the previous discussion, folks chose the current approach. I don't see much difference between the two. The current way is easier to follow in terms of dependencies, while your PR makes that a little harder. It looks like the generated PTX will be the same either way. I'm wondering: does the current approach block you from implementing an algorithm? If that's the case, can you elaborate on it?
It really is affecting the nStage algorithm. Maybe the example I gave wasn't detailed enough.
Now we assume that n is 3.
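A hypothetical sketch of such a pipeline (all names, bounds, and memref types are illustrative): with tokens, each in-flight group's token has to be threaded through iter_args alongside the loop's other carried values.

```mlir
%t0 = nvgpu.device_async_create_group %cpPrologue0
%t1 = nvgpu.device_async_create_group %cpPrologue1
scf.for %i = %c0 to %c96 step %c4
    iter_args(%w0 = %t0, %w1 = %t1)
    -> (!nvgpu.device.async.token, !nvgpu.device.async.token) {
  // Wait for the oldest in-flight group.
  nvgpu.device_async_wait %w0 {numGroups = 1 : i32}
  %cp = nvgpu.device_async_copy %global[%i], %shared[%i], 4
      : memref<?xf32> to memref<?xf32, 3>
  %t = nvgpu.device_async_create_group %cp
  // Rotate the window of pending tokens.
  scf.yield %w1, %t : !nvgpu.device.async.token, !nvgpu.device.async.token
}
```

With this PR, the two iter_args disappear and the wait inside the loop becomes nvgpu.device_async_wait {numGroups = 1 : i32}.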
On top of that, the current nvgpu.device_async_wait is semantically incorrect.
I think we need to fix it; the thing that really acts as the control is the numGroups attribute. This makes sense for MLIR.
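A sketch grounded in the conversion tests above: the token operand does not influence the generated code; the attribute alone does.

```mlir
nvgpu.device_async_wait %token {numGroups = 1 : i32}
// lowers to (regardless of what %token refers to):
nvvm.cp.async.wait.group 1
```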
Do you guys have any questions about my reply above? Honestly, I'm looking forward to further communication.
So, you're saying that generating iter_args for the token is difficult. I think it's better to focus on whether the current approach actually blocks you.
This method wouldn't really stop me, but it certainly adds to the complexity of the problem. Keep in mind that there are other parameters in the loop that need to be iterated over, three possible parameters, and now we are also discussing the wait-group parameter. The issue is honestly complicated enough already. A key point is that I'm not the only one who benefits from this PR, though I've been waiting for it to merge. You didn't answer my second question.
It comes from https://mlir.llvm.org/docs/Dialects/NVGPU/#nvgpudevice_async_create_group-nvgpudeviceasynccreategroupop. It's wrong, isn't it?
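The documentation example in question, also visible as a removed line in the diff above, creates a group with no input tokens at all:

```mlir
%0 = nvgpu.device_async_create_group
```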
I fixed the second problem: I made a change to the content in the last email, since a few of the details were not correct. @joker-eph @grypp Given all of the above reasons for this PR, if you still think it isn't needed, I'll close it.
I think your first case isn't problematic, as I mentioned above. We also discussed that having SSA semantics is a design choice. However, you have found a second case that could be an issue in terms of the semantics of the nvgpu abstraction. Let me take a closer look at it tomorrow. Let's keep the PR open until we all agree.
If it were just the first issue, I wouldn't have brought up this PR; I brought it up because there is a second issue whose fix benefits everyone.
SSA semantics to me imply that these tokens are tracked and honored in the lowering: is that the case, though?
The tracking was not really done during the lowering; the token was just replaced with zero. So you're tracking content that doesn't mean anything.
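This refers to the pattern in NVGPUToNVVM.cpp that this PR removes (quoted from the diff above):

```cpp
rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
// Drop the result token.
Value zero = rewriter.create<LLVM::ConstantOp>(
    op->getLoc(), IntegerType::get(op.getContext(), 32),
    rewriter.getI32IntegerAttr(0));
rewriter.replaceOp(op, zero);
```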
There seem to be some errors in the current definitions of `nvgpu.device_async_create_group` and `nvgpu.device_async_wait`. `nvgpu.device_async_wait` should not have operands: if it had operands, the semantics would be to wait for the cp operations behind those operands, but it actually waits for all groups, so it shouldn't take an operand, and you can see from the pattern in nvgpu-to-nvvm that its operand is unused. Since `nvgpu.device_async_wait` no longer has operands, this PR also removes the result from `nvgpu.device_async_create_group`. In addition to this, corrections were made to the documentation and examples.