From 0dc2ce20f02f00871ed5c6d0e7860b8cd34b9656 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Wed, 26 Feb 2025 23:21:47 -0800 Subject: [PATCH 01/13] Update XeGPU.md save work --- docs/rfcs/XeGPU.md | 166 ++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 156 insertions(+), 10 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 16e52b48f..c0c9596ab 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -2,7 +2,9 @@ ## Summary The XeGPU dialect provides an abstraction that closely models Xe instructions to support high-performance GEMM code generation. -The matrix instructions at this level exactly match the hardware instructions’ semantics including the matrix sizes. +The XeGPU operations are designed to support tile based programming. The same set of operations work at multiple levels, including workgroup, subgroup, and work item. +The workgroup level operation can be decomposed and unrolled to multipel XeGPU operations at subgroup level, which can be further decomposed to work item level. +Along the way, the tensor size is partitioned to smaller size, and the subgroup and work item level XeGPU operations exactly match the hardware instructions’ semantics including the matrix sizes. The lowering and optimizations built on top of the XeGPU dialect are target-specific. ## Proposal @@ -12,6 +14,9 @@ XeGPU operations are introduced when there is a special Xe instruction not model load and store. In some cases, one XeGPU op may lower to a sequence of instructions for a dedicated and performance-critical function. For example, create_tdesc is mapped to a fixed sequence of instructions to create an address description. +The operation definition is general and works for workgroup, subgroup, or work item level. When working at workgroup level, the operation must +attach `wg_map` attribute, and work item level operation must attach `sg_map` attribute. + Below is a summary. | Ops | Syntax | Example | @@ -34,10 +39,6 @@ Below is a summary. |nbarrier_wait | operation ::= xegpu.nbarrier_wait $nbarrier : type($nbarrier) | xegpu.nbarrier_wait %nbarrier : !xegpu.nbarrier | |fence | operation ::= xegpu.fence attr-dict | xegpu.fence {scope = gpu, memory_kind = global} | -The XeGPU dialect supports lowering from [XeTile dialects]{./XeTile.md}. The tile-based XeTile operation can be further decomposed to -multiple XeGPU ops. For example, XeTile.load_tile operation is lowered to XeGPU’s load_nd or load operations. Compared with the -XeTile dialect, the XeGPU dialect works with even smaller matrix sizes, since XeGPU operations map to one hardware instruction in most cases. - XeGPU supports two flavors of load/store operations: n-dimension load (nd load) and scattered load. Both need a tensor descriptor to describe the addresses/offsets to a data block. The descriptor is used for load/store/prefetch, and then updated for reuse with the next data block. Nd_load can be used to map to 1D load, 2D load, or nd load. Scattered load requires a special tensor descriptor, which @@ -488,7 +489,7 @@ The load with chunk_size pack the low-precision data to 32-bit data using wi_dat User must use legal sg_map value for the WI data distribution for certain operations on PVC and ARC. It includes load_nd/store_nd, load/store with chunk_size, and DPAS. -## Rules of sg_map setting for load and store on PVC and ARC +**Rules of sg_map setting for load and store on PVC and ARC** The WI data distribution requires the following sg_map for the 2D block load and store to work with DPAS on PVC. 
Not using the sg_map value defined here leads to undefined behavior. ```mlir # assert (wi_layout[0] x wi_layout[1] == subgroup_size) // PVC subgroup_size = 16 @@ -643,9 +644,7 @@ users must use for the WI data distribution of 1D block load and regular load wi #sg_map_t = xegpu.sg_map // for 8-bit data element like uint8, sint8 ``` - - -## sg_map use case - 2D load +**sg_map use case - 2D load** An example on how to load a 2D block, perform dpas, and store back to memory. @@ -678,7 +677,7 @@ An example on how to load a 2D block, perform dpas, and store back to memory. ``` -## sg_map use case - regular load: +**sg_map use case - regular load** An example on how to perform transpose using load with chunk_size in SIMT flavor. ```mlir @@ -701,6 +700,153 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor ``` +## Workgroup level XeGPU Operations + +By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler as the loop nest level for subgroup and work item level can be removed. To eanble XeGPU operate the workgroup level, `wg_map` attribute is introduced to specify how the data is distributed across subgroups. `wg_mmap` enables tensor compiler to express the cooperative operation among subgroups by specifying a `wg_mapping` to parition data among subgroups without modifying the IR representation other required when using loop nest IR. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critial preformance knobs. + +**Attribute xegpu.wg_map** +xegpu.wg_map specifies how a ND tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. wg_map consists of two parameters: + * sg_layout: Defines the ND arrangement of subgroups within the workgroup. + * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. + +When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The wg_map attribute must be specified when creating the tensor descriptor. + +**Constraints** + +Given these definitions: +```mlir +sg_data_size = sg_data[0] × sg_data[1] +workgroup_size = sg_layout[0] × sg_layout[1] +tensor_size = tensor_desc[0] × tensor_desc[1] +``` + +the following conditions must hold: + +* workgroup_size must represent the number of subgroups in a workgroup for a kernel. +* tensor_desc[0] must be evenly divisible by sg_layout[0] × sg_data[0], or vice versa. +* tensor_desc[1] must be evenly divisible by sg_layout[1] × sg_data[1], or vice versa. + +**distribution rule** + +The wg_tile is distributed to sg_data x sg_layout in a round-robin fashion. If sg_data[i] x sg_layout[i] < wg_tile[i], we have data left after all subgroups are assigned for the first round. In this case, we continue to assign the rest data starting from the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] >= wg_tile[i], we may have already used up all the data before all subgroups are assigned. In this case, we wrap around the wg_tile and continue the assignment, and the rest subgroups along that dimension share the same data. 
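As a concrete illustration of this round-robin rule, the Python sketch below (plain pseudocode, not dialect IR; the helper name and the concrete sizes are our own choices) lists the start offsets of the sg_data blocks that one subgroup owns along each dimension. With a [128, 128] tensor, sg_layout [2, 2] and sg_data [32, 128], subgroup [0, 0] owns the row blocks starting at 0 and 64 and the single column block starting at 0, which matches the workgroup distribution table shown later in this document.

```python
# Illustrative sketch of the round-robin distribution described above.
# Not XeGPU IR: plain Python used only to make the indexing rule concrete.

def owned_offsets(wg_tile, sg_layout, sg_data, sg_id):
    """Per dimension, start offsets of the sg_data blocks assigned to sg_id."""
    offsets = []
    for size, layout, data, idx in zip(wg_tile, sg_layout, sg_data, sg_id):
        num_blocks = max(size // data, 1)
        starts = set()
        # Block b of this dimension goes to subgroup (b % layout); when there are
        # more blocks than subgroups the assignment wraps around the subgroups,
        # and when there are more subgroups than blocks it wraps around the data.
        for block in range(max(num_blocks, layout)):
            if block % layout == idx:
                starts.add((block % num_blocks) * data)
        offsets.append(sorted(starts))
    return offsets

print(owned_offsets([128, 128], [2, 2], [32, 128], [0, 0]))  # [[0, 64], [0]]
print(owned_offsets([128, 128], [2, 2], [32, 128], [1, 1]))  # [[32, 96], [0]]
```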
+ +data_index_i = sg_id_in_wg_x % (tensor[i]/sg_data[i]) +data_index_j = sg_id_in_wg_x % (tensor[j]/sg_data[j]) +sg_data_assigned[data_index_i, data_index_j] = (tensor[i]/sg_data[i], tensor[j]/sg_data[j], sg_data[i], sg_data[j]) + + +**Resulting WI Data Fragment** + +Each work item’s fragment of the distributed tensor is represented by a 2D vector (e.g., a SPIR-V or LLVM vector) with the shape [n_distribution_units, wi_data_size]. The result 2D vector will be further lowered to a 1D “SIMT-flavored” vector, such as a SPIR-V vector or LLVM vector, as the elements in the inner dimension being packed to a single packed data unit. + +**Examples of workgroup distribution with wg_map** + +over the lowering process so that the user can tune for optimal performance. + +Below is an example. +```mlir + #wg_map_a = #xetile.wg_map + #tile_attr = #xetile.tile_attr + + %wg_tile = xetile.init_tile %A[%m, %c0] : memref<1024x1024xf16> -> !xetile.tile<128x128xf16, #tile_attr> +``` +Within the `xetile.wg_map`, `sg_layout` specifies the subgroup layout, and `sg_data` specifies the tile size owned by each subgroup. The tile created by init_tile is a workgroup-level tile. In the example above, sg_layout [2,2] means that each workgroup has 4 subgroups with 2 rows and 2 columns. When mapping sg_layout to linear subgroup id, sg_layout is always mapped to subgroup id in row-major ordering. sg_data [32,128] means that each subgroup works on a submatrix [32, 128]. The data elements assigned to each subgroup thread must be contiguous. + + + +For example, for the tile size [128, 128] and sg_data [32, 128], along the second dimension, there is no more data left to assign after the first subgroup, it wraps around and moves to the beginning of the tile and continues the assignment. Instead, for the first dimension, there is more data left after the first round of distribution, so it move to the next subtile and continue the assignement. As a result, the tile would be sliced to four subtiles with size [32,128], with the following mapping for sg_layout [2,2]: + +| subgroup tensor | 2D subgroup id | Linearized subgroup id +| :--- | :---- | :---- | +| [ 0:31, 0:127] | [0, 0] , [0, 1] | 0 , 1 | +| [ 32:63, 0:127] | [1, 0] , [1, 1] | 2 , 3 | +| [ 64:95, 0:127] | [0, 0] , [0, 1] | 0 , 1 | +| [96:127, 0:127] | [1, 0] , [1, 1] | 2 , 3 | + +With the `xetile.wg_map` attribute being included in the tile data type, the tile memory related operations (xxx_tile) can be distributed to subgroup. The vector based operations (tile_xxx) requires extra handling, since we can't attatch the the `xetile.wg_map` attribute to MLIR vector data type. + +The proposal is to attach the `xetile.wg_map` attribute to the vector based XeTile operations as illustrated below. The attribute applies only to the output value of each operation. The input values `xetile.wg_map` are determined by their respective defining operations. 
+| Ops | Syntax | Example | +| :--- | :---- | :--- | +|tile_mma | operation ::= xetile.tile_mma $matA, $matB, $matC attr_dict: type($matA), type($matB), type($matC)-> type($res) | %vector_c = xetile.tile_mma %vector_a, %vector_b, %vector_c {#mp_c} : vector<64x32xbf16>, vector<32x128xbf16>, vector<64x128xfloat> into vector<64x128xfloat> | +|transpose | operation ::= xetile.transpose attr_dict $vec : type($vec) -> type($res) | %vector_a = xetile.transpose %vector_b {#mp_a}: vector<64x32xfloat> into vector<32x64xfloat> | +|reduction | operation ::= xetile.reduction $kind $src attr_dict: type($value) -> type($res) | %vector_a = xetile.reduction %vector_b [1] {#mp_a}: vector<64x32xfloat> into vector<64x1xfloat> | +|broadcast | operation ::= xetile.broadcast $src attr_dict : type($value) -> type($res) | %vector_a = xetile.broadcast %vector_b [0] {#mp_a}: vector<1x32xfloat> into vector<64x32xfloat> | +|convert_layout | operation ::= xetile.conv_layout $src attr_dict: type($value) -> type($res) | %vector_a = xetile.convert_layout %vector_b {#mp_a} : vector<256x256xfloat> into vector<256x256xfloat> | + +With the `wg_map` attribute attached for the output vector, `tile_mma` does a matrix multiplication at a work group level vector. +```mlir + #wg_map_d = #xetile.wg_map + + %vector_d = xetile.tile_mma %vector_a, %vector_b, %vector_c {#wg_map_d}: + vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16> + into vector<256x256xfloat> +``` +The `wg_map` attribute of input vector operands can be derived from the wg_map_d. They must have the same sg_layout, and sg_data for m and n dimenion must be same as wg_map_d, and sg_data for k dimension must be same as operand A and B. These attributes may be retrieved from their producer ops, and the retrieved attributes must be consistent with the derived ones. Below is the derived wg_map for the three vector operands in the example above. +```mlir + #wg_map_a = #xetile.wg_map //wg_map for %vector_a + #wg_map_b = #xetile.wg_map //wg_map for %vector_b + #wg_map_c = #xetile.wg_map //wg_map for %vector_c +``` + +`reduction` with `wg_map` does the reduction over a workgroup level vector. +```mlir + #wg_map_a = #xetile.wg_map + %vector_a = xetile.reduction %vector_b [1] {#wg_map_a}: vector<256x128xfloat> into vector<256x1xfloat> +``` +`reduction_size` attribute is used to support paritial reduction. +```mlir + #wg_map_a = #xetile.wg_map + #wg_map_b = #xetile.wg_map + %vector_a = math.exp %input {#wg_map_a} : vector<256x128xf32> + %vector_b = xetile.reduction %vector_a [0] {$reduction_size = [32]} {#wg_map_b}: vector<256x128xfloat> into vector<8x128xfloat> +``` + +The `wg_map` attribute of the input vector can be derived from the wg_map_a. sg_layout must be same, sg_data for the dimension being reduced must be same as the input vector, and the other dimension must be same as the wg_map_a. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the derived one. Below is the derived wg_map for the input vector in the example above. +```mlir + #wg_map_b = #xetile.wg_map //wg_map for %vector_b +``` + +`broadcast` with `wg_map` attribute broadcast at workgroup level. +```mlir + #wg_map_a = #xetile.wg_map + %vector_a = xetile.broadcast %vector_b [1] {#wg_map_a}: vector<256x1xfloat> into vector<256x256xfloat> +``` +The `wg_map` attribute of the input vector can be derived from the wg_map_a. 
sg_layout must be same, sg_data for the dimension being broadcast must be "1", and the other dimension must be same as the wg_map_a. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the derived one. Below is the derived wg_map for the input vector in the example above. +```mlir + #wg_map_b = #xetile.wg_map //wg_map for %vector_b +``` + +`transpose` with `wg_map` attribute transpose a workgroup level vector. +```mlir + #wg_map_a = #xetile.wg_map + %vector_a = xetile.transpose %vector_b {#wg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> +``` + +The `wg_map` attribute of the input vector can be derived from the wg_map_a. The two dimension of sg_layout and sg_data must be swapped. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the derived one. Below is the derived wg_map for the input vector in the example above. +```mlir + #wg_map_b = #xetile.wg_map //wg_map for %vector_b +``` +The transpose can be implemented by saving and restoring from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to to shared memory with the #wg_map_b mapping assuming row_major and 2) use wg_map_a mapping to load the data from shared memory to vector assuming column_major. To support this, we relax the restriction of tile_load and tile_store so that they can load 2D from share local memory. + +An optimization is to analyze the load op which produces %vector_b, carefully arrange its mapping so that each subgroup thread loads its corresponding subgroup tile, and then either combine transpose function to the load op or do an in-register transpose. + +`convert_layout` with `wg_map` attributes remaps the workgroup level vector to subgroup threads. The second `wg_map` attribute is optional and describes the input operand. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the second `wg_map` attribute if it is present. + +Example with the wg_map specified for both input and output operands. +```mlir + #wg_map_b = #xetile.wg_map // used for cooperative load/prefetch + #wg_map_a = #xetile.wg_map // used as mma's input matrix A + %vector_a = xetile.convert_layout %vector_b {#wg_map_a #wg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> +``` +Example without the wg_map specified for the input operand. +```mlir + #wg_map_a = #xetile.wg_map // used as mma's input matrix A + %vector_a = xetile.convert_layout %vector_b {#wg_map_a}: vector<256x256xfloat> into vector<256x256xfloat> +``` +The convert_layout could be implemented by saving and restoring from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to to shared memory with the #wg_map_b mapping assuming row_major and 2) use wg_map_a mapping to load the data from shared memory to vector assuming same row_major. To support this, we relax the restriction of tile_load and tile_store so that they can load 2D from share local memory. 
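To make the store-then-reload lowering concrete, the Python simulation below (not dialect IR; the two map values, the 256x256 size, and the helper name are assumptions chosen only so that both layouts exactly tile the data) lets every subgroup write its #wg_map_b block into a shared-local-memory image and then read back its #wg_map_a block. The data itself is unchanged by the round trip; only the per-subgroup ownership changes, which is exactly what a layout conversion has to achieve.

```python
# Sketch of the layout-conversion lowering described above: store the workgroup
# tile with one wg_map, reload it with another.  Hypothetical maps: a 256x256
# tile goes from sg_layout=[4,8], sg_data=[64,32] to sg_layout=[8,4], sg_data=[32,64].
import numpy as np

def block(sg, sg_layout, sg_data):
    """Row/col slice owned by linear subgroup id sg (row-major subgroup ordering)."""
    r, c = sg // sg_layout[1], sg % sg_layout[1]
    return (slice(r * sg_data[0], (r + 1) * sg_data[0]),
            slice(c * sg_data[1], (c + 1) * sg_data[1]))

src = np.arange(256 * 256).reshape(256, 256)   # workgroup-level vector
slm = np.zeros_like(src)                       # shared local memory image

# 1) every subgroup stores its #wg_map_b fragment
for sg in range(32):
    rows, cols = block(sg, (4, 8), (64, 32))
    slm[rows, cols] = src[rows, cols]

# 2) every subgroup loads its #wg_map_a fragment
for sg in range(32):
    rows, cols = block(sg, (8, 4), (32, 64))
    assert (slm[rows, cols] == src[rows, cols]).all()

assert (slm == src).all()   # the round trip only re-partitions ownership
```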
+ + ## Notes Currently, there is no lower-level dialect for the Intel GPU compiler toolchain to represent GPU ops with values based on LLVM data types such as NVVM From b86bc26dc604035f122df021b957b334f19b5e13 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 27 Feb 2025 18:34:38 -0800 Subject: [PATCH 02/13] Update XeGPU.md save work --- docs/rfcs/XeGPU.md | 149 +++++++++++++++++++-------------------------- 1 file changed, 64 insertions(+), 85 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index c0c9596ab..d8d12f048 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -702,12 +702,15 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor ## Workgroup level XeGPU Operations -By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler as the loop nest level for subgroup and work item level can be removed. To eanble XeGPU operate the workgroup level, `wg_map` attribute is introduced to specify how the data is distributed across subgroups. `wg_mmap` enables tensor compiler to express the cooperative operation among subgroups by specifying a `wg_mapping` to parition data among subgroups without modifying the IR representation other required when using loop nest IR. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critial preformance knobs. +By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler instead of multiple level nested loop IR for subgroup and work item level operation. To enable XeGPU operate the workgroup level, we introduce `wg_map` attribute to specify how the data is distributed across subgroups. `wg_map` enables tensor compiler to express the cooperative operation among subgroups by specifying a `wg_map` to partition data among subgroups without modifying the IR representation other required when using loop nest IR. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critical performance knobs. **Attribute xegpu.wg_map** -xegpu.wg_map specifies how a ND tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. wg_map consists of two parameters: - * sg_layout: Defines the ND arrangement of subgroups within the workgroup. - * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. +`wg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. wg_map consists of two parameters: + * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimension can up to 3d array. + * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. + * sg_order: The dimension order used to linearize n-d subgroup ids to 1-d. The first dimension in the sg_order list is the fastest-changing dimension. + +Given a 3-d sg_layout with and dimension sizes as dim_0, dim_1, dim_2, sg_order[2, 1, 0] maps subgroup thread [x, y, z] to linear subgroup thread [z + dim_2*y + dim_2*dim_1*x ], sg_order[1, 2, 0] maps to [y + dim_2*z + dim_2*dim_1*x]. When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. 
The wg_map attribute must be specified when creating the tensor descriptor. @@ -723,128 +726,104 @@ tensor_size = tensor_desc[0] × tensor_desc[1] the following conditions must hold: * workgroup_size must represent the number of subgroups in a workgroup for a kernel. -* tensor_desc[0] must be evenly divisible by sg_layout[0] × sg_data[0], or vice versa. -* tensor_desc[1] must be evenly divisible by sg_layout[1] × sg_data[1], or vice versa. +* tensor_desc[0] must be either evenly divisible by sg_layout[0] × sg_data[0], or vice versa. +* tensor_desc[1] must be either evenly divisible by sg_layout[1] × sg_data[1], or vice versa. **distribution rule** -The wg_tile is distributed to sg_data x sg_layout in a round-robin fashion. If sg_data[i] x sg_layout[i] < wg_tile[i], we have data left after all subgroups are assigned for the first round. In this case, we continue to assign the rest data starting from the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] >= wg_tile[i], we may have already used up all the data before all subgroups are assigned. In this case, we wrap around the wg_tile and continue the assignment, and the rest subgroups along that dimension share the same data. - -data_index_i = sg_id_in_wg_x % (tensor[i]/sg_data[i]) -data_index_j = sg_id_in_wg_x % (tensor[j]/sg_data[j]) -sg_data_assigned[data_index_i, data_index_j] = (tensor[i]/sg_data[i], tensor[j]/sg_data[j], sg_data[i], sg_data[j]) - +The tensor_desc is distributed to sg_data x sg_layout along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], we have data left after all subgroups are assigned for the first round, we continue to assign the rest data starting from the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], we may have already used up all the data before all subgroups are assigned. In this case, we wrap around the tensor data and continue the assignment, and the rest subgroups along that dimension share the same data. **Resulting WI Data Fragment** -Each work item’s fragment of the distributed tensor is represented by a 2D vector (e.g., a SPIR-V or LLVM vector) with the shape [n_distribution_units, wi_data_size]. The result 2D vector will be further lowered to a 1D “SIMT-flavored” vector, such as a SPIR-V vector or LLVM vector, as the elements in the inner dimension being packed to a single packed data unit. +The distributed tensor for each subgroup has the same dimension as work group level tensor. **Examples of workgroup distribution with wg_map** -over the lowering process so that the user can tune for optimal performance. +The workgroup creates a tensor descriptor [128, 128] and distributes to 4 subgroups with `sg_layout` [2,2], and each subgroup gets `sg_data` [32,128]. The first dimension is split and distributed to subgroups in two rounds, and the second dimension is assigned as whole to multiple subgroup threads. -Below is an example. ```mlir - #wg_map_a = #xetile.wg_map - #tile_attr = #xetile.tile_attr - - %wg_tile = xetile.init_tile %A[%m, %c0] : memref<1024x1024xf16> -> !xetile.tile<128x128xf16, #tile_attr> + #wg_map_a = #xegpu.wg_map + %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<128x128xf16, #wg_map_a> ``` -Within the `xetile.wg_map`, `sg_layout` specifies the subgroup layout, and `sg_data` specifies the tile size owned by each subgroup. The tile created by init_tile is a workgroup-level tile. 
In the example above, sg_layout [2,2] means that each workgroup has 4 subgroups with 2 rows and 2 columns. When mapping sg_layout to linear subgroup id, sg_layout is always mapped to subgroup id in row-major ordering. sg_data [32,128] means that each subgroup works on a submatrix [32, 128]. The data elements assigned to each subgroup thread must be contiguous. - - +The table below shows the result tensor for each subgroup thread and its linear subgroup thread id. -For example, for the tile size [128, 128] and sg_data [32, 128], along the second dimension, there is no more data left to assign after the first subgroup, it wraps around and moves to the beginning of the tile and continues the assignment. Instead, for the first dimension, there is more data left after the first round of distribution, so it move to the next subtile and continue the assignement. As a result, the tile would be sliced to four subtiles with size [32,128], with the following mapping for sg_layout [2,2]: - -| subgroup tensor | 2D subgroup id | Linearized subgroup id +| subgroup tensor | 2D subgroup id | Linearized subgroup id | :--- | :---- | :---- | -| [ 0:31, 0:127] | [0, 0] , [0, 1] | 0 , 1 | -| [ 32:63, 0:127] | [1, 0] , [1, 1] | 2 , 3 | -| [ 64:95, 0:127] | [0, 0] , [0, 1] | 0 , 1 | -| [96:127, 0:127] | [1, 0] , [1, 1] | 2 , 3 | - -With the `xetile.wg_map` attribute being included in the tile data type, the tile memory related operations (xxx_tile) can be distributed to subgroup. The vector based operations (tile_xxx) requires extra handling, since we can't attatch the the `xetile.wg_map` attribute to MLIR vector data type. +| [ 0:31, 0:127] | [0, 0], [0, 1] | 0 , 1 | +| [ 32:63, 0:127] | [1, 0], [1, 1] | 2 , 3 | +| [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 | +| [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 | -The proposal is to attach the `xetile.wg_map` attribute to the vector based XeTile operations as illustrated below. The attribute applies only to the output value of each operation. The input values `xetile.wg_map` are determined by their respective defining operations. -| Ops | Syntax | Example | -| :--- | :---- | :--- | -|tile_mma | operation ::= xetile.tile_mma $matA, $matB, $matC attr_dict: type($matA), type($matB), type($matC)-> type($res) | %vector_c = xetile.tile_mma %vector_a, %vector_b, %vector_c {#mp_c} : vector<64x32xbf16>, vector<32x128xbf16>, vector<64x128xfloat> into vector<64x128xfloat> | -|transpose | operation ::= xetile.transpose attr_dict $vec : type($vec) -> type($res) | %vector_a = xetile.transpose %vector_b {#mp_a}: vector<64x32xfloat> into vector<32x64xfloat> | -|reduction | operation ::= xetile.reduction $kind $src attr_dict: type($value) -> type($res) | %vector_a = xetile.reduction %vector_b [1] {#mp_a}: vector<64x32xfloat> into vector<64x1xfloat> | -|broadcast | operation ::= xetile.broadcast $src attr_dict : type($value) -> type($res) | %vector_a = xetile.broadcast %vector_b [0] {#mp_a}: vector<1x32xfloat> into vector<64x32xfloat> | -|convert_layout | operation ::= xetile.conv_layout $src attr_dict: type($value) -> type($res) | %vector_a = xetile.convert_layout %vector_b {#mp_a} : vector<256x256xfloat> into vector<256x256xfloat> | +Similarly to `sg_map`, the `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the the `wg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. 
The `wg_map` attribute propagation can be performance from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. -With the `wg_map` attribute attached for the output vector, `tile_mma` does a matrix multiplication at a work group level vector. +For `dpas`, the `wg_map` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimenion as output, and `sg_data` for k dimension must be same as operand A and B. `sg_order` must be same as output. ```mlir - #wg_map_d = #xetile.wg_map + #wg_map_d = #xegpu.wg_map - %vector_d = xetile.tile_mma %vector_a, %vector_b, %vector_c {#wg_map_d}: + %vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#wg_map_d}: vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16> into vector<256x256xfloat> -``` -The `wg_map` attribute of input vector operands can be derived from the wg_map_d. They must have the same sg_layout, and sg_data for m and n dimenion must be same as wg_map_d, and sg_data for k dimension must be same as operand A and B. These attributes may be retrieved from their producer ops, and the retrieved attributes must be consistent with the derived ones. Below is the derived wg_map for the three vector operands in the example above. -```mlir - #wg_map_a = #xetile.wg_map //wg_map for %vector_a - #wg_map_b = #xetile.wg_map //wg_map for %vector_b - #wg_map_c = #xetile.wg_map //wg_map for %vector_c -``` -`reduction` with `wg_map` does the reduction over a workgroup level vector. -```mlir - #wg_map_a = #xetile.wg_map - %vector_a = xetile.reduction %vector_b [1] {#wg_map_a}: vector<256x128xfloat> into vector<256x1xfloat> -``` -`reduction_size` attribute is used to support paritial reduction. -```mlir - #wg_map_a = #xetile.wg_map - #wg_map_b = #xetile.wg_map - %vector_a = math.exp %input {#wg_map_a} : vector<256x128xf32> - %vector_b = xetile.reduction %vector_a [0] {$reduction_size = [32]} {#wg_map_b}: vector<256x128xfloat> into vector<8x128xfloat> + //derived wg_map for input operands + #wg_map_a = #xegpu.wg_map //wg_map for %vector_a + #wg_map_b = #xegpu.wg_map //wg_map for %vector_b + #wg_map_c = #xegpu.wg_map //wg_map for %vector_c ``` -The `wg_map` attribute of the input vector can be derived from the wg_map_a. sg_layout must be same, sg_data for the dimension being reduced must be same as the input vector, and the other dimension must be same as the wg_map_a. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the derived one. Below is the derived wg_map for the input vector in the example above. +For `reduction`, `wg_map` of the input operand hads an additional dimension to represent the dimension being reduced. `sg_layout` must be same and the new dimension as `1`. The new dimension of `sg_data` must be same as the input tensor size, and the other dimension must be same as the output's `wg_map`. The new dimension of `sg_order` should not change the existing ordering specified by the output's `wg_map`. + ```mlir - #wg_map_b = #xetile.wg_map //wg_map for %vector_b + #wg_map_a = #xegpu.wg_map + %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#wg_map_a}: vector<256x128xfloat> into vector<256xfloat> + + //derived wg_map for input operand + #wg_map_b = #xegpu.wg_map ``` -`broadcast` with `wg_map` attribute broadcast at workgroup level. +The rule also applies to reduction from 3d to 2d. 
```mlir - #wg_map_a = #xetile.wg_map - %vector_a = xetile.broadcast %vector_b [1] {#wg_map_a}: vector<256x1xfloat> into vector<256x256xfloat> + #wg_map_a = #xegpu.wg_map + %%vector_a = vector.multi_reduction , %vector_b, %cst_0 [1] {#wg_map_a}: vector<8x32x128xf32> to vector<8x128xf32> + + //derived wg_map for input operand + #wg_map_b = #xegpu.wg_map ``` -The `wg_map` attribute of the input vector can be derived from the wg_map_a. sg_layout must be same, sg_data for the dimension being broadcast must be "1", and the other dimension must be same as the wg_map_a. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the derived one. Below is the derived wg_map for the input vector in the example above. + +For `shape_cast`, it first determines the dimensions being reduced or expanded. The input's `wg_map` needs to expand or reduce the value accordingly for related dimension in `sg_layout` and `sg_data`. `sg_order` should be consistent between input and output. ```mlir - #wg_map_b = #xetile.wg_map //wg_map for %vector_b + wg_map_a = #xegpu.wg_map + %vector_a = vector.shape_cast %vector_b {#wg_map_a} : vector<256x128xf32> to vector<8x32x128xf32> + + //derived wg_map for input operand + #wg_map_b = #xegpu.wg_map ``` -`transpose` with `wg_map` attribute transpose a workgroup level vector. +For `broadcast`, `wg_map` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the ouptut wg_map and must be removed for the input operand. The corresponding dimension in `sg_data` and `sg_order` must be removed also. + ```mlir - #wg_map_a = #xetile.wg_map - %vector_a = xetile.transpose %vector_b {#wg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> + #wg_map_a = #xegpu.wg_map + %vector_a = vector.broadcast %vector_b [1] {#wg_map_a}: vector<256xfloat> into vector<256x256xfloat> + + //derived wg_map for input operand + #wg_map_b = #xegpu.wg_map ``` -The `wg_map` attribute of the input vector can be derived from the wg_map_a. The two dimension of sg_layout and sg_data must be swapped. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the derived one. Below is the derived wg_map for the input vector in the example above. +For `transpose`, the values in `wg_map` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `sg_order`. ```mlir - #wg_map_b = #xetile.wg_map //wg_map for %vector_b + #wg_map_a = #xegpu.wg_map + %vector_a = vector.transpose %vector_b {#wg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> ``` -The transpose can be implemented by saving and restoring from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to to shared memory with the #wg_map_b mapping assuming row_major and 2) use wg_map_a mapping to load the data from shared memory to vector assuming column_major. To support this, we relax the restriction of tile_load and tile_store so that they can load 2D from share local memory. -An optimization is to analyze the load op which produces %vector_b, carefully arrange its mapping so that each subgroup thread loads its corresponding subgroup tile, and then either combine transpose function to the load op or do an in-register transpose. 
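The derivation rules above are mechanical enough to be written down directly. The sketch below (plain Python, not part of the dialect; the class name, field names, and the sample map values are our own) encodes the transpose, broadcast, and reduction derivations as small helpers, which is essentially what a propagation pass has to compute for each operand.

```python
# Plain-Python recap of the output-to-input wg_map derivations described above.
# A wg_map is modeled as three lists; this mirrors the attribute, not its syntax.
from dataclasses import dataclass
from typing import List

@dataclass
class WgMap:
    sg_layout: List[int]
    sg_data: List[int]
    sg_order: List[int]

def for_transpose(out: WgMap) -> WgMap:
    # 2-d case: swap the two transposed dimensions in every field
    return WgMap(out.sg_layout[::-1], out.sg_data[::-1], out.sg_order[::-1])

def for_broadcast(out: WgMap, dim: int) -> WgMap:
    # drop the broadcast dimension (whose sg_layout entry is expected to be 1)
    drop = lambda xs: [x for i, x in enumerate(xs) if i != dim]
    order = [o - (o > dim) for o in out.sg_order if o != dim]   # renumber the rest
    return WgMap(drop(out.sg_layout), drop(out.sg_data), order)

def for_reduction(out: WgMap, dim: int, input_dim_size: int) -> WgMap:
    # re-insert the reduced dimension with sg_layout 1 and the full input size;
    # appending it as the slowest order entry is one placement that keeps the
    # existing ordering intact.
    ins = lambda xs, v: xs[:dim] + [v] + xs[dim:]
    order = [o + (o >= dim) for o in out.sg_order] + [dim]
    return WgMap(ins(out.sg_layout, 1), ins(out.sg_data, input_dim_size), order)

out = WgMap([8, 4], [32, 64], [1, 0])                 # hypothetical output map
print(for_transpose(out))                             # [4, 8] / [64, 32] / [0, 1]
print(for_reduction(out, dim=1, input_dim_size=128))  # [8, 1, 4] / [32, 128, 64]
```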
+`wg_layout` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `wg_layout` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `wg_map` to the other. -`convert_layout` with `wg_map` attributes remaps the workgroup level vector to subgroup threads. The second `wg_map` attribute is optional and describes the input operand. The input vector's wg_map attribute may be retrieved from its producer op, and the retrieved attribute must be consistent with the second `wg_map` attribute if it is present. +The example below represent the `wg_map` conversion with unrealized_conversion_cast. -Example with the wg_map specified for both input and output operands. -```mlir - #wg_map_b = #xetile.wg_map // used for cooperative load/prefetch - #wg_map_a = #xetile.wg_map // used as mma's input matrix A - %vector_a = xetile.convert_layout %vector_b {#wg_map_a #wg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> -``` -Example without the wg_map specified for the input operand. ```mlir - #wg_map_a = #xetile.wg_map // used as mma's input matrix A - %vector_a = xetile.convert_layout %vector_b {#wg_map_a}: vector<256x256xfloat> into vector<256x256xfloat> + #wg_map_b = #xegpu.wg_map // used for cooperative load/prefetch + #wg_map_a = #xegpu.wg_map // used as mma's input matrix A + %vector_a = unrealized_conversion_cast %vector_b {#wg_map_a #wg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> ``` -The convert_layout could be implemented by saving and restoring from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to to shared memory with the #wg_map_b mapping assuming row_major and 2) use wg_map_a mapping to load the data from shared memory to vector assuming same row_major. To support this, we relax the restriction of tile_load and tile_store so that they can load 2D from share local memory. +The `wg_map` conversion can be lowered to storing and loading from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory with the #wg_map_b and 2) use wg_map_a mapping to load the data from shared local memory. ## Notes From 6533478bf39320a3ba575a8455fd86f4eddfa1c6 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 27 Feb 2025 18:56:28 -0800 Subject: [PATCH 03/13] Update XeGPU.md --- docs/rfcs/XeGPU.md | 186 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 186 insertions(+) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index d8d12f048..d94c4d990 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -825,6 +825,192 @@ The example below represent the `wg_map` conversion with unrealized_conversion_ ``` The `wg_map` conversion can be lowered to storing and loading from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory with the #wg_map_b and 2) use wg_map_a mapping to load the data from shared local memory. +## Appendix 1 - Code examples for work group level XeGPU using wg_map attribute + +## Appendix 1.1 Simple Gemm with prefetch +The first example shows a simple gemm. It demonstrates the different wg_map we used for prefetch and load. 
+```mlir +Pseudo code for simple gemm +C[4096, 4096] = matmul (A[4096, 4096], B[4096, 4096]) +``` + +```mlir +#mp_a = #wg_map +#mp_a_pfh = #wg_map +#mp_b = #wg_map +#mp_b_pfh = #wg_map +#mp_c = #wg_map + +func.func @test_gemm(%a : memref<4096x4096xf16>, +     %b: memref<4096x4096xf16>, + %c: memref<4096xf32> ) { + scf.for %i = %c0 to %c4096 step %c256 { + scf.for %j = %c0 to %c4096 step %c256 { +   %1 = create_nd_tdesc %a[%i, %c0] : memref<4096x4096xf16> -> tensor_desc<256x32xf16, #mp_a> // sg_layout=[8,4], sg_data=[32,32] +   %2 = create_nd_tdesc %b[%c0, %j] : memref<4096x4096xf16> -> tensor_desc<32x256xf16, #mp_b> // sg_layout=[8,4], sg_data=[32,64] +   %1p = create_nd_tdesc %a[%i, %c96] : memref<4096x4096xf16> -> tensor_desc<256x32xf16, #mp_a_pfh]> // sg_layout=[32,1] +   %2p = create_nd_tdesc %b[%c96, %j] : memref<4096x4096xf16> -> tensor_desc<32x256xf16, #mp_b_pfh> // sg_layout=[4,8] + + %3 = create_nd_tdesc %c[%i, %j] : memref<4096x4096xf32> -> tensor_desc<256x256xf32, #mp_c> // sg_layout=[32, 1] + + scf.for %k= %c0 to %c4096 step %c32 { +   %4 = load_nd %1 : tensor_desc<256x32xf16 #mp_a > -> vector<256x32xf16> // sg_layout=[8,4], sg_data=[32,32] + %10 = load_nd %2 : tensor_desc<32x256xf16 #mp_b> -> vector<32x256xf16> // sg_layout=[8,4], sg_data=[32,64] + +   prefetch_nd %1 : tensor_desc<256x32xf16, #mp_a_pfh>              // sg_layout=[32,1] + prefetch_nd %2 : tensor_desc<32x256xf16, #mp_a_pfh>              // sg_layout=[4,8] + %6 = dpas %4, %10 {#mp_a #mp_b #mp_c} : (vector<256x32xf16>, vector<32x256xf16>) -> vector<256x256xf32> //sg_layout=[8,4] + %1 = update_nd_offset%1, %c0, %c32 : tensor_desc<256x32xf16, #mp_a> + %2 = update_nd_offset%2, %c32, %c0 : tensor_desc<32x256xf16, #mp_b> + %1p = update_nd_offset%1p, %c0, %c32 : tensor_desc<256x32xf16, #mp_a_pft> + %2p = update_nd_offset%2p, %c32, %c0 : tensor_desc<32x256xf16, #mp_b_pft> + } +   store_nd %3, %6: (tensor_desc<256x256xf32, #mp_c>, vector<256x256xf32>)          // sg_layout=[8, 4] + } + } +``` +## Appendix 1.2 Gemm with transpose, broadcast, and reduction +The second example contains transpose, broadcast, and reduction. +```mlir +Pseduo code for the original problem. 
+C[4096, 4096] = matmul (A[4096, 4096], BT[4096, 4096]) + broad_cast(bcast[4096], dim=0) +Reduce[4096] = reduce_add(C[4096, 4096], dim=1) +``` + +```mlir +#mp_a = #wg_map +#mp_a_pfh = #wg_map +#mp_b = #wg_map +#mp_bt = #wg_map +#mp_bt_pfh = #wg_map +#mp_c = #wg_map + +#mp_bcast = #wg_map +#mp_reduce= #wg_map +#mp_reduce2= #wg_map + +func.func @test_gemm(%a : memref<4096x4096xf16>, +     %b: memref<4096x4096xf16>, + %bcast: memref<4096xf32> +     %res: memref<4096xf32> ) { + scf.for %i = %c0 to %c4096 step %c256 { + scf.for %j = %c0 to %c4096 step %c256 { +   %1 = create_nd_tdesc %a[%i, %c0] : memref<4096x4096xf16> -> tensor_desc<256x32xf16, #mp_a> // sg_layout=[8,4], sg_data=[32,32] +   %2 = create_nd_tdesc %bt[%j, %c0] : memref<4096x4096xf16> -> tensor_desc<256x32xf16, #mp_bt> // sg_layout=[4,8], sg_data=[64,32] +   %1p = create_nd_tdesc %a[%i, %c192] : memref<4096x4096xf16> -> tensor_desc<256x32xf16, #mp_a_pfh]> // sg_layout=[32,1] +   %2p = create_nd_tdesc %bt[%j, %c192] : memref<4096x4096xf16> -> tensor_desc<256x32xf16, #mp_bt_pfh> // sg_layout=[32,1] + + %bcast'= memref.cast %bcast: memref<4096xf32> -> memref<1x4096xf32> +   %7 = create_nd_tdesc %bcast'[%j] : memref<1x4096xf32> -> tensor_desc<1x256xf32, #mp_bast> // sg_layout=[4, 8], sg_data=[1,32] + + %res'= memref.cast %res: memref<4096xf32> -> memref<4096x1xf32> +   %3 = create_nd_tdesc %res'[%i] : memref<4096x1xf32> -> tensor_desc<256x1xf32, #mp_reduce> // sg_layout=[32, 1] + + scf.for %k= %c0 to %c4096 step %c32 { +   %4 = load_nd %1 : tensor_desc<256x32xf16 #mp_a > -> vector<256x32xf16> // sg_layout=[8,4], sg_data=[32,32] + %10 = load_nd %2 : tensor_desc<256x32xf16 #mp_bt> -> vector<256x32xf16> // sg_layout=[4,8], sg_data=[64,32] + %5 = vector.transpose %10 {#mp_bt #mp_b}: vector<256x32xf16> -> vector<32x256xf16> // sg_layout=[4,8] -> sg_layout=[8,4] + +   prefetch_nd %1 : tensor_desc<256x32xf16, #mp_a_pfh>              // sg_layout=[32,1] + prefetch_nd %2 : tensor_desc<256x32xf16, #mp_a_pfh>              // sg_layout=[32,1] + %6 = dpas %4, %5 {#mp_a #mp_b #mp_c} : (vector<256x32xf16>, vector<32x256xf16>) -> vector<256x256xf32> //sg_layout=[8,4] + %1 = update_nd_offset%1, %c0, %c32 : tensor_desc<256x32xf16, #mp_a> + %2 = update_nd_offset%2, %c0, %c32 : tensor_desc<256x32xf16, #mp_bt> + %1p = update_nd_offset%1p, %c0, %c32 : tensor_desc<256x32xf16, #mp_a_pft> + %2p = update_nd_offset%2p, %c32, %c0 : tensor_desc<256x32xf16, #mp_bt_pft> + } + + %12 = load_nd %7 : tensor_desc<256xf32, #mp_bcast> -> vector<256xf16>     // sg_layout=[32], sg_data=[64] +   %13 = broadcast {#mp_bcast #mp_c} %12 [0]: vector<256xf32> => vector<256x256xf32>     // sg_layout=[8, 4], sg_data=[32,64] + %14 = add %6, %13 : vector<256x256xf32> +    %15 = convert_layout {#mp_c #mp_reduce2} %14 : vector<256x256xf32>   // sg_layout=[8, 4] -> sg_layout=[32, 1] +    %16 = vector.reduction {#mp_reduce2 #mp_reduce} %15 [1]: vector<256x256xf32> => vector<256xf32>  // sg_layout=[32] +   store_nd %3, %7: (tensor_desc<256xf32, #mp_reduce>, vector<256xf32>)          // sg_layout=[32] + } + } +``` + +## Appendix 1.3 Gemm implementation with two cache levels +For GPU support high-performance prefetch through two level of caches. 
+```mlir +#mp_a = #wg_map +#mp_b = #wg_map +#mp_c = #wg_map + +#mp_a_copl2 = #wg_map +#mp_b_copl2 = #wg_map< sg_layout=[16,2], sg_data=[8,128]> + +#mp_a_copl1 = #wg_map +#mp_b_copl1 = #wg_map< sg_layout=[4, 8], sg_data=[8,32]> + +func.func @test_gemm(%a : memref<4096x4096xf16>, +     %b: memref<4096x4096xf16>, + %c: memref<4096xf32> ) { + scf.for %i = %c0 to %c4096 step %c256 { + scf.for %j = %c0 to %c4096 step %c256 { +   %a1_l2 = create_nd_tdesc %a[%i, %c0] : memref<4096x4096xf16> -> tensor_desc<512x128xf16, #mp_a_copl2> +   %b1_l2 = create_nd_tdesc %b[%c0, %j] : memref<4096x4096xf16> -> tensor_desc<128x256xf16, #mp_b_copl2> +   %a2_l2 = create_nd_tdesc %a[%i, %c256] : memref<4096x4096xf16> -> tensor_desc<512x128xf16, #mp_a_copl2> +   %b2_l2 = create_nd_tdesc %b[%c256, %j] : memref<4096x4096xf16> -> tensor_desc<128x256xf16, #mp_b_copl2> + + prefetch_nd %a1_l2 locality<2>: tensor_desc<512x128xf16, #mp_a_copl2> + prefetch_nd %b1_l2 locality<2>: tensor_desc<128x256xf16, #mp_b_copl2> + prefetch_nd %a2_l2 locality<2>: tensor_desc<512x128xf16, #mp_a_copl2> + prefetch_nd %b2_l2 locality<2>: tensor_desc<128x256xf16, #mp_b_copl2> + %a2_l2’ = update_nd_offset%a2_l2, %c0, %c32 : tensor_desc<512x128xf16, #mp_b_copl2> + %b2_l2’ = update_nd_offset%b2_l2, %c32, %c0 : tensor_desc<128x256xf16, #mp_b_copl2> + +   %a1_l1 = create_nd_tdesc %a[%i, %c0] : memref<4096x4096xf16> -> tensor_desc<512x32xf16, #mp_a_copl1> +   %b1_l1 = create_nd_tdesc %b[%c0, %j] : memref<4096x4096xf16> -> tensor_desc<32x256xf16, #mp_b_copl1> +   %a2_l1 = create_nd_tdesc %a[%i, %c32] : memref<4096x4096xf16> -> tensor_desc<512x32xf16, #mp_a_copl1> +   %b2_l1 = create_nd_tdesc %b[%c32, %j] : memref<4096x4096xf16> -> tensor_desc<32x256xf16, #mp_b_copl1> +   %a3_l1 = create_nd_tdesc %a[%i, %c64] : memref<4096x4096xf16> -> tensor_desc<512x32xf16, #mp_a_copl1> +   %b3_l1 = create_nd_tdesc %b[%c64, %j] : memref<4096x4096xf16> -> tensor_desc<32x256xf16, #mp_b_copl1> +   %a4_l1 = create_nd_tdesc %a[%i, %c96] : memref<4096x4096xf16> -> tensor_desc<512x32xf16, #mp_a_copl1> +   %b4_l1 = create_nd_tdesc %b[%c96, %j] : memref<4096x4096xf16> -> tensor_desc<32x256xf16, #mp_b_copl1> + + prefetch_nd %a1_l1 locality<3>: tensor_desc<512x32xf16, #mp_a_copl1> + prefetch_nd %b1_l1 locality<3>: tensor_desc<32x256xf16, #mp_b_copl1> + prefetch_nd %a2_l1 locality<3>: tensor_desc<512x32xf16, #mp_a_copl1> + prefetch_nd %b2_l1 locality<3>: tensor_desc<32x256xf16, #mp_b_copl1> + prefetch_nd %a3_l1 locality<3>: tensor_desc<512x32xf16, #mp_a_copl1> + prefetch_nd %b3_l1 locality<3>: tensor_desc<32x256xf16, #mp_b_copl1> + prefetch_nd %a4_l1 locality<3>: tensor_desc<512x32xf16, #mp_a_copl1> + prefetch_nd %b4_l1 locality<3>: tensor_desc<32x256xf16, #mp_b_copl1> + %a4_l1’ = update_nd_offset% a4_l1, %c0, %c128 : tensor_desc<512x32xf16, #mp_a_copl1> + %b4_l1’ = update_nd_offset% b4_l1, %c128, %c0 : tensor_desc<32x256xf16, #mp_b_copl1> + +   %a1_load = create_nd_tdesc %a[%i, %c0] : memref<4096x4096xf16> -> tensor_desc<512x32xf16, #mp_a> +   %b1_load = create_nd_tdesc %b[%c0, %j] : memref<4096x4096xf16> -> tensor_desc<32x256xf16, #mp_b> + + %c_tile = create_nd_tdesc %c[%i, %j] : memref<4096x4096xf32> -> tensor_desc<512x256xf32, #mp_c> + + scf.for %k= %c0 to %c4096 step %c32 { + %a1_r = load_nd %a1_load : tensor_desc<256x32xf16 #mp_a > -> vector<512x32xf16> + %b1_r = load_nd %b1_load : tensor_desc<32x256xf16 #mp_b> -> vector<32x256xf16> + + Scf.if (%k %4 == 0) { + gpu.barrier + prefetch_nd %a2_l2’ locality<2>: tensor_desc<512x128xf16, #mp_a_copl2> + prefetch_nd %b2_l2’ 
locality<2>: tensor_desc<128x256xf16, #mp_b_copl2> + %a2_l2’ = update_nd_offset%a2_l2’, %c0, %c128 : tensor_desc<512x128xf16, #mp_a_copl2> + %b2_l2’ = update_nd_offset%b2_l2’, %c128, %c0 : tensor_desc<128x256xf16, #mp_b_copl2> + } + prefetch_nd %a4_l1’ locality<3>: tensor_desc<512x32xf16, #mp_a_copl1> + prefetch_nd %b4_l1’ locality<3>: tensor_desc<32x256xf16, #mp_b_copl1> + %a4_l1’ = update_nd_offset%a4_l1’, %c0, %c32 : tensor_desc<512x32xf16, #mp_a_copl1> + %b4_l1’ = update_nd_offset%b4_l1’, %c32, %c0 : tensor_desc<32x256xf16, #mp_b_copl1> + + %a1_load = update_nd_offset%a1_load, %c0, %c32 : tensor_desc<512x32xf16, #mp_a> + %a2_load = update_nd_offset%b1_load, %c32, %c0 : tensor_desc<32x256xf16, #mp_b> + + %6 = dpas %a1_r, %b1_r {#mp_a #mp_b #mp_c} : (vector<512x32xf16>, vector<32x256xf16>) -> vector<512x256xf32> + } +  store_nd %c_tile, %6: (tensor_desc<512x256xf32, #mp_c>, vector<512x256xf32>) + } + } +} +``` ## Notes From 42f5f150502da6e0ee7681dad3594405ea7448aa Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 27 Feb 2025 19:11:57 -0800 Subject: [PATCH 04/13] Update XeGPU.md --- docs/rfcs/XeGPU.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index d94c4d990..971fac748 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -814,14 +814,14 @@ For `transpose`, the values in `wg_map` must be swapped for the two dimensions b %vector_a = vector.transpose %vector_b {#wg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> ``` -`wg_layout` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `wg_layout` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `wg_map` to the other. +`wg_map` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `wg_map` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `wg_map` to the other. -The example below represent the `wg_map` conversion with unrealized_conversion_cast. +`convert_layout` is introduced to represent the `wg_map` conversion. ```mlir #wg_map_b = #xegpu.wg_map // used for cooperative load/prefetch #wg_map_a = #xegpu.wg_map // used as mma's input matrix A - %vector_a = unrealized_conversion_cast %vector_b {#wg_map_a #wg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> + %vector_a = xegpu.convert_layout %vector_b {#wg_map_a #wg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> ``` The `wg_map` conversion can be lowered to storing and loading from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory with the #wg_map_b and 2) use wg_map_a mapping to load the data from shared local memory. 
@@ -886,7 +886,8 @@ Reduce[4096] = reduce_add(C[4096, 4096], dim=1) #mp_c = #wg_map #mp_bcast = #wg_map -#mp_reduce= #wg_map +#mp_bcast2 = #wg_map +#mp_reduce= #wg_map #mp_reduce2= #wg_map func.func @test_gemm(%a : memref<4096x4096xf16>, @@ -920,11 +921,12 @@ func.func @test_gemm(%a : memref<4096x4096xf16>, %2p = update_nd_offset%2p, %c32, %c0 : tensor_desc<256x32xf16, #mp_bt_pft> } - %12 = load_nd %7 : tensor_desc<256xf32, #mp_bcast> -> vector<256xf16>     // sg_layout=[32], sg_data=[64] -   %13 = broadcast {#mp_bcast #mp_c} %12 [0]: vector<256xf32> => vector<256x256xf32>     // sg_layout=[8, 4], sg_data=[32,64] + %12 = load_nd %7 : tensor_desc<256xf32, #mp_bcast2> -> vector<256xf16>     // sg_layout=[32], sg_data=[8] + %12' = convert_layout {#mp_bcast2 #mp_bcast} %12 : vector<256x256xf32> // sg_layout=[32] -> sg_layout=[8, 4] +   %13 = vector.broadcast {#mp_c} %12' [0]: vector<256xf32> => vector<256x256xf32>     // sg_layout=[8, 4], sg_data=[32,64] %14 = add %6, %13 : vector<256x256xf32> -    %15 = convert_layout {#mp_c #mp_reduce2} %14 : vector<256x256xf32>   // sg_layout=[8, 4] -> sg_layout=[32, 1] -    %16 = vector.reduction {#mp_reduce2 #mp_reduce} %15 [1]: vector<256x256xf32> => vector<256xf32>  // sg_layout=[32] +    %14' = convert_layout {#mp_c #mp_reduce2} %14 : vector<256x256xf32>   // sg_layout=[8, 4] -> sg_layout=[32] +    %16 = vector.reduction {#mp_reduce2 #mp_reduce} %14' [1]: vector<256x256xf32> => vector<256xf32>  // sg_layout=[32]   store_nd %3, %7: (tensor_desc<256xf32, #mp_reduce>, vector<256xf32>)          // sg_layout=[32] } } From 1c0bd4aa95c6a382545df05e1d12449e27716e38 Mon Sep 17 00:00:00 2001 From: Igor Zamyatin Date: Fri, 28 Feb 2025 12:18:49 -0600 Subject: [PATCH 05/13] Some trivial changes --- docs/rfcs/XeGPU.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 971fac748..4f951d6c3 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -705,6 +705,7 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler instead of multiple level nested loop IR for subgroup and work item level operation. To enable XeGPU operate the workgroup level, we introduce `wg_map` attribute to specify how the data is distributed across subgroups. `wg_map` enables tensor compiler to express the cooperative operation among subgroups by specifying a `wg_map` to partition data among subgroups without modifying the IR representation other required when using loop nest IR. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critical performance knobs. **Attribute xegpu.wg_map** + `wg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. wg_map consists of two parameters: * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimension can up to 3d array. * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. @@ -735,7 +736,7 @@ The tensor_desc is distributed to sg_data x sg_layout along each dimension in a **Resulting WI Data Fragment** -The distributed tensor for each subgroup has the same dimension as work group level tensor. +The distributed tensor for each subgroup has the same dimension as the work group level tensor. 
**Examples of workgroup distribution with wg_map** @@ -754,9 +755,9 @@ The table below shows the result tensor for each subgroup thread and its linear | [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 | | [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 | -Similarly to `sg_map`, the `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the the `wg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performance from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. +Similarly to `sg_map`, the `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the `wg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performed from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. -For `dpas`, the `wg_map` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimenion as output, and `sg_data` for k dimension must be same as operand A and B. `sg_order` must be same as output. +For `dpas`, the `wg_map` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimension as output, and `sg_data` for k dimension must be same as operand A and B. `sg_order` must be same as output. ```mlir #wg_map_d = #xegpu.wg_map @@ -770,7 +771,7 @@ For `dpas`, the `wg_map` attribute of input operands must have the same `sg_layo #wg_map_c = #xegpu.wg_map //wg_map for %vector_c ``` -For `reduction`, `wg_map` of the input operand hads an additional dimension to represent the dimension being reduced. `sg_layout` must be same and the new dimension as `1`. The new dimension of `sg_data` must be same as the input tensor size, and the other dimension must be same as the output's `wg_map`. The new dimension of `sg_order` should not change the existing ordering specified by the output's `wg_map`. +For `reduction`, `wg_map` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same and the new dimension as `1`. The new dimension of `sg_data` must be the same as the input tensor size, and the other dimension must be the same as the output's `wg_map`. The new dimension of `sg_order` should not change the existing ordering specified by the output's `wg_map`. ```mlir #wg_map_a = #xegpu.wg_map From bc78f3a3e37a878f2d27d9de4d90597a0fd18f60 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 28 Feb 2025 15:25:46 -0800 Subject: [PATCH 06/13] Update XeGPU.md save work --- docs/rfcs/XeGPU.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 4f951d6c3..dfd2ab10e 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -706,13 +706,21 @@ By allowing XeGPU operating on workgroup level data size, it provides a concise **Attribute xegpu.wg_map** -`wg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. 
wg_map consists of two parameters: +`wg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. wg_map consists of three parameters: * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimension can up to 3d array. * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. * sg_order: The dimension order used to linearize n-d subgroup ids to 1-d. The first dimension in the sg_order list is the fastest-changing dimension. Given a 3-d sg_layout with and dimension sizes as dim_0, dim_1, dim_2, sg_order[2, 1, 0] maps subgroup thread [x, y, z] to linear subgroup thread [z + dim_2*y + dim_2*dim_1*x ], sg_order[1, 2, 0] maps to [y + dim_2*z + dim_2*dim_1*x]. +Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. +| sg_layout[4, 4] | order[1, 0] | order[0, 1] +| :---- | :---- | :---- | +| [0, 0], [0, 1], [0, 2], [0, 3] | 0 , 1, 2, 3 | 0, 4, 8, 12 | +| [1, 0], [1, 1], [1, 2], [1, 3] | 4 , 5 , 6, 7| 1, 5, 9, 13 | +| [2, 0], [2, 1], [2, 2], [2, 3] | 8, 9, 10 , 11 | 2, 6, 10, 14 | +| [3, 0], [3, 1], [3, 2], [3, 3] | 12 , 13, 14, 15 | 3, 7, 11, 15 | + When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The wg_map attribute must be specified when creating the tensor descriptor. **Constraints** From aee72dd2427b1d8464af911e466eb9533e808412 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 28 Feb 2025 16:00:29 -0800 Subject: [PATCH 07/13] Update XeGPU.md --- docs/rfcs/XeGPU.md | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index dfd2ab10e..9599728a3 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -707,19 +707,19 @@ By allowing XeGPU operating on workgroup level data size, it provides a concise **Attribute xegpu.wg_map** `wg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. wg_map consists of three parameters: - * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimension can up to 3d array. + * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimensions can be a 3d array as [dim_0, dim_1, dim_2]. * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. * sg_order: The dimension order used to linearize n-d subgroup ids to 1-d. The first dimension in the sg_order list is the fastest-changing dimension. -Given a 3-d sg_layout with and dimension sizes as dim_0, dim_1, dim_2, sg_order[2, 1, 0] maps subgroup thread [x, y, z] to linear subgroup thread [z + dim_2*y + dim_2*dim_1*x ], sg_order[1, 2, 0] maps to [y + dim_2*z + dim_2*dim_1*x]. - Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. 
| sg_layout[4, 4] | order[1, 0] | order[0, 1] | :---- | :---- | :---- | -| [0, 0], [0, 1], [0, 2], [0, 3] | 0 , 1, 2, 3 | 0, 4, 8, 12 | -| [1, 0], [1, 1], [1, 2], [1, 3] | 4 , 5 , 6, 7| 1, 5, 9, 13 | +| [0, 0], [0, 1], [0, 2], [0, 3] | 0, 1, 2, 3 | 0, 4, 8, 12 | +| [1, 0], [1, 1], [1, 2], [1, 3] | 4, 5 , 6, 7| 1, 5, 9, 13 | | [2, 0], [2, 1], [2, 2], [2, 3] | 8, 9, 10 , 11 | 2, 6, 10, 14 | -| [3, 0], [3, 1], [3, 2], [3, 3] | 12 , 13, 14, 15 | 3, 7, 11, 15 | +| [3, 0], [3, 1], [3, 2], [3, 3] | 12, 13, 14, 15 | 3, 7, 11, 15 | + +For a subgroup threads in 3-d sg_layout [dim_0, dim_1, dim_2], sg_order[2, 1, 0] maps a subgroup thread with 3-d index [x, y, z] to a linear subgroup thread index [z + dim_2*y + dim_2*dim_1*x ], sg_order[1, 2, 0] maps to [y + dim_2*z + dim_2*dim_1*x]. When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The wg_map attribute must be specified when creating the tensor descriptor. @@ -732,7 +732,7 @@ workgroup_size = sg_layout[0] × sg_layout[1] tensor_size = tensor_desc[0] × tensor_desc[1] ``` -the following conditions must hold: +The following conditions must hold: * workgroup_size must represent the number of subgroups in a workgroup for a kernel. * tensor_desc[0] must be either evenly divisible by sg_layout[0] × sg_data[0], or vice versa. @@ -763,7 +763,7 @@ The table below shows the result tensor for each subgroup thread and its linear | [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 | | [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 | -Similarly to `sg_map`, the `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the `wg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performed from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. +The `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the `wg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performed from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. For `dpas`, the `wg_map` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimension as output, and `sg_data` for k dimension must be same as operand A and B. `sg_order` must be same as output. ```mlir @@ -810,22 +810,25 @@ For `shape_cast`, it first determines the dimensions being reduced or expanded. For `broadcast`, `wg_map` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the ouptut wg_map and must be removed for the input operand. The corresponding dimension in `sg_data` and `sg_order` must be removed also. 
```mlir - #wg_map_a = #xegpu.wg_map + #wg_map_a = #xegpu.wg_map %vector_a = vector.broadcast %vector_b [1] {#wg_map_a}: vector<256xfloat> into vector<256x256xfloat> //derived wg_map for input operand - #wg_map_b = #xegpu.wg_map + #wg_map_b = #xegpu.wg_map ``` For `transpose`, the values in `wg_map` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `sg_order`. ```mlir - #wg_map_a = #xegpu.wg_map + #wg_map_a = #xegpu.wg_map %vector_a = vector.transpose %vector_b {#wg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> + + //derived wg_map for input operand + #wg_map_b = #xegpu.wg_map ``` `wg_map` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `wg_map` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `wg_map` to the other. -`convert_layout` is introduced to represent the `wg_map` conversion. +`convert_layout` is introduced to convert two inconsistent `wg_map`. ```mlir #wg_map_b = #xegpu.wg_map // used for cooperative load/prefetch From ebf115fdc6acdf9c8b9da546deef5b72c7ef6c4d Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Wed, 12 Mar 2025 22:32:00 -0700 Subject: [PATCH 08/13] Update XeGPU.md --- docs/rfcs/XeGPU.md | 169 ++++++++++++++++++++++----------------------- 1 file changed, 84 insertions(+), 85 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 9599728a3..15b2882a4 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -3,7 +3,7 @@ ## Summary The XeGPU dialect provides an abstraction that closely models Xe instructions to support high-performance GEMM code generation. The XeGPU operations are designed to support tile based programming. The same set of operations work at multiple levels, including workgroup, subgroup, and work item. -The workgroup level operation can be decomposed and unrolled to multipel XeGPU operations at subgroup level, which can be further decomposed to work item level. +The workgroup level operation can be decomposed and unrolled to multiple XeGPU operations at subgroup level, which can be further decomposed to work item level. Along the way, the tensor size is partitioned to smaller size, and the subgroup and work item level XeGPU operations exactly match the hardware instructions’ semantics including the matrix sizes. The lowering and optimizations built on top of the XeGPU dialect are target-specific. @@ -14,10 +14,7 @@ XeGPU operations are introduced when there is a special Xe instruction not model load and store. In some cases, one XeGPU op may lower to a sequence of instructions for a dedicated and performance-critical function. For example, create_tdesc is mapped to a fixed sequence of instructions to create an address description. -The operation definition is general and works for workgroup, subgroup, or work item level. When working at workgroup level, the operation must -attach `wg_map` attribute, and work item level operation must attach `sg_map` attribute. - -Below is a summary. +Below is a summary of operation definition. The operation definition is general and works for workgroup, subgroup, or work item level. 
| Ops | Syntax | Example | | :--- | :---- | :--- | @@ -700,16 +697,20 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor ``` -## Workgroup level XeGPU Operations +## extending sg_map attributes to support work group level semantic + +By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler instead of multiple level nested loop IR for subgroup and work item level operation. To enable XeGPU operate the workgroup level, we introduce `sg_layout` and `sg_data` parameters into `sg_map` attribute to specify how the data is distributed across subgroups. With this extension, `sg_map` enables tensor compiler to express the cooperative operation among subgroups by specifying a `sg_map` to partition data among subgroups without manipulating a nested IR representation. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critical performance knobs. -By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler instead of multiple level nested loop IR for subgroup and work item level operation. To enable XeGPU operate the workgroup level, we introduce `wg_map` attribute to specify how the data is distributed across subgroups. `wg_map` enables tensor compiler to express the cooperative operation among subgroups by specifying a `wg_map` to partition data among subgroups without modifying the IR representation other required when using loop nest IR. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critical performance knobs. +we also propose to extend the sg_map to cover multiple dimension instead of 2D only and add `order` attribute. -**Attribute xegpu.wg_map** +**Extended xegpu.sg_map** -`wg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. wg_map consists of three parameters: +The extended `sg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. sg_map consists of four parameters: * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimensions can be a 3d array as [dim_0, dim_1, dim_2]. * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. - * sg_order: The dimension order used to linearize n-d subgroup ids to 1-d. The first dimension in the sg_order list is the fastest-changing dimension. + * wi_layout: Defines the 2D arrangement of WIs within the subgroup. + * wi_data: Specifies the shape of the tensor fragment that each WI loads or stores as a single packed data unit (16/32-bit). + * order: The dimension order used to linearize n-d subgroup ids to 1-d. The first dimension in the order list is the fastest-changing dimension. Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. | sg_layout[4, 4] | order[1, 0] | order[0, 1] @@ -719,9 +720,9 @@ Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. 
| [2, 0], [2, 1], [2, 2], [2, 3] | 8, 9, 10 , 11 | 2, 6, 10, 14 | | [3, 0], [3, 1], [3, 2], [3, 3] | 12, 13, 14, 15 | 3, 7, 11, 15 | -For a subgroup threads in 3-d sg_layout [dim_0, dim_1, dim_2], sg_order[2, 1, 0] maps a subgroup thread with 3-d index [x, y, z] to a linear subgroup thread index [z + dim_2*y + dim_2*dim_1*x ], sg_order[1, 2, 0] maps to [y + dim_2*z + dim_2*dim_1*x]. +For a subgroup in 3-d sg_layout [dim_0, dim_1, dim_2], order[2, 1, 0] maps a subgroup with 3-d index [x, y, z] to a linear subgroup index [z + dim_2 * y + dim_2 * dim_1 * x ], order[1, 2, 0] maps to [y + dim_2 * z + dim_2 * dim_1 * x]. -When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The wg_map attribute must be specified when creating the tensor descriptor. +When a sg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The sg_map attribute must be specified when creating the tensor descriptor. **Constraints** @@ -740,21 +741,21 @@ The following conditions must hold: **distribution rule** -The tensor_desc is distributed to sg_data x sg_layout along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], we have data left after all subgroups are assigned for the first round, we continue to assign the rest data starting from the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], we may have already used up all the data before all subgroups are assigned. In this case, we wrap around the tensor data and continue the assignment, and the rest subgroups along that dimension share the same data. +The tensor_desc is distributed to sg_data x sg_layout along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], there is data left after all subgroups are assigned for the first round, the rest data will wrap around and be assigned to the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], the data may be used up before all subgroups are assigned. In this case, we broadcast the tensor data to multiple subgroups by repeating the data assignment to the rest subgroups along that dimension until the all subgroups get data. -**Resulting WI Data Fragment** +**Resulting subgroup Data size** The distributed tensor for each subgroup has the same dimension as the work group level tensor. -**Examples of workgroup distribution with wg_map** +**Examples of workgroup distribution with extended sg_map** -The workgroup creates a tensor descriptor [128, 128] and distributes to 4 subgroups with `sg_layout` [2,2], and each subgroup gets `sg_data` [32,128]. The first dimension is split and distributed to subgroups in two rounds, and the second dimension is assigned as whole to multiple subgroup threads. +The workgroup creates a tensor descriptor [128, 128] and distributes to 4 subgroups with `sg_layout` [2,2], and each subgroup gets `sg_data` [32,128]. The first dimension is split and distributed to subgroups in two rounds, and the second dimension is assigned as whole to multiple subgroups. 
```mlir - #wg_map_a = #xegpu.wg_map - %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<128x128xf16, #wg_map_a> + #sg_map_a = #xegpu.sg_map + %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #sg_map_a> ``` -The table below shows the result tensor for each subgroup thread and its linear subgroup thread id. +The table below shows the result tensor for each subgroup and its linear subgroup id. | subgroup tensor | 2D subgroup id | Linearized subgroup id | :--- | :---- | :---- | @@ -763,95 +764,93 @@ The table below shows the result tensor for each subgroup thread and its linear | [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 | | [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 | -The `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the `wg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performed from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. +The `sg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the `sg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. The `sg_map` attribute propagation can be performed from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. -For `dpas`, the `wg_map` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimension as output, and `sg_data` for k dimension must be same as operand A and B. `sg_order` must be same as output. +For `dpas`, the `sg_map` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimension as output, and `sg_data` for k dimension must be same as operand A and B. `order` must be same as output. ```mlir - #wg_map_d = #xegpu.wg_map - - %vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#wg_map_d}: + #sg_map_d = #xegpu.sg_map + %vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}: vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16> into vector<256x256xfloat> - - //derived wg_map for input operands - #wg_map_a = #xegpu.wg_map //wg_map for %vector_a - #wg_map_b = #xegpu.wg_map //wg_map for %vector_b - #wg_map_c = #xegpu.wg_map //wg_map for %vector_c + //derived sg_map for input operands + #sg_map_a = #xegpu.sg_map //sg_map for %vector_a + #sg_map_b = #xegpu.sg_map //sg_map for %vector_b + #sg_map_c = #xegpu.sg_map //sg_map for %vector_c ``` -For `reduction`, `wg_map` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same and the new dimension as `1`. The new dimension of `sg_data` must be the same as the input tensor size, and the other dimension must be the same as the output's `wg_map`. The new dimension of `sg_order` should not change the existing ordering specified by the output's `wg_map`. +For `reduction`, `sg_map` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same and the new dimension as `1`. 
The new dimension of `sg_data` must be the same as the input tensor size, and the other dimension must be the same as the output's `sg_map`. The new dimension of `order` should not change the existing ordering specified by the output's `sg_map`. ```mlir - #wg_map_a = #xegpu.wg_map - %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#wg_map_a}: vector<256x128xfloat> into vector<256xfloat> + #sg_map_a = #xegpu.sg_map + %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#sg_map_a}: vector<256x128xfloat> into vector<256xfloat> - //derived wg_map for input operand - #wg_map_b = #xegpu.wg_map + //derived sg_map for input operand + #sg_map_b = #xegpu.sg_map ``` The rule also applies to reduction from 3d to 2d. ```mlir - #wg_map_a = #xegpu.wg_map - %%vector_a = vector.multi_reduction , %vector_b, %cst_0 [1] {#wg_map_a}: vector<8x32x128xf32> to vector<8x128xf32> + #sg_map_a = #xegpu.sg_map + %%vector_a = vector.multi_reduction , %vector_b, %cst_0 [1] {#sg_map_a}: vector<8x32x128xf32> to vector<8x128xf32> - //derived wg_map for input operand - #wg_map_b = #xegpu.wg_map + //derived sg_map for input operand + #sg_map_b = #xegpu.sg_map ``` -For `shape_cast`, it first determines the dimensions being reduced or expanded. The input's `wg_map` needs to expand or reduce the value accordingly for related dimension in `sg_layout` and `sg_data`. `sg_order` should be consistent between input and output. +For `shape_cast`, it first determines the dimensions being reduced or expanded. The input's `sg_map` needs to expand or reduce the value accordingly for related dimension in `sg_layout` and `sg_data`. `order` should be consistent between input and output. ```mlir - wg_map_a = #xegpu.wg_map - %vector_a = vector.shape_cast %vector_b {#wg_map_a} : vector<256x128xf32> to vector<8x32x128xf32> + sg_map_a = #xegpu.sg_map + %vector_a = vector.shape_cast %vector_b {#sg_map_a} : vector<256x128xf32> to vector<8x32x128xf32> - //derived wg_map for input operand - #wg_map_b = #xegpu.wg_map + //derived sg_map for input operand + #sg_map_b = #xegpu.sg_map ``` -For `broadcast`, `wg_map` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the ouptut wg_map and must be removed for the input operand. The corresponding dimension in `sg_data` and `sg_order` must be removed also. +For `broadcast`, `sg_map` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the ouptut sg_map and must be removed for the input operand. The corresponding dimension in `sg_data` and `order` must be removed also. ```mlir - #wg_map_a = #xegpu.wg_map - %vector_a = vector.broadcast %vector_b [1] {#wg_map_a}: vector<256xfloat> into vector<256x256xfloat> + #sg_map_a = #xegpu.sg_map + %vector_a = vector.broadcast %vector_b [1] {#sg_map_a}: vector<256xfloat> into vector<256x256xfloat> - //derived wg_map for input operand - #wg_map_b = #xegpu.wg_map + //derived sg_map for input operand + #sg_map_b = #xegpu.sg_map ``` -For `transpose`, the values in `wg_map` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `sg_order`. +For `transpose`, the values in `sg_map` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `order`. 
```mlir - #wg_map_a = #xegpu.wg_map - %vector_a = vector.transpose %vector_b {#wg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> + #sg_map_a = #xegpu.sg_map + %vector_a = vector.transpose %vector_b {#sg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> - //derived wg_map for input operand - #wg_map_b = #xegpu.wg_map + //derived sg_map for input operand + #sg_map_b = #xegpu.sg_map ``` -`wg_map` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `wg_map` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `wg_map` to the other. +`sg_map` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `sg_map` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `sg_map` to the other. -`convert_layout` is introduced to convert two inconsistent `wg_map`. +`convert_layout` is introduced to convert two inconsistent `sg_map`. ```mlir - #wg_map_b = #xegpu.wg_map // used for cooperative load/prefetch - #wg_map_a = #xegpu.wg_map // used as mma's input matrix A - %vector_a = xegpu.convert_layout %vector_b {#wg_map_a #wg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> + #sg_map_b = #xegpu.sg_map // used for cooperative load/prefetch + #sg_map_a = #xegpu.sg_map // used as mma's input matrix A + %vector_a = xegpu.convert_layout %vector_b {#sg_map_a #sg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> ``` -The `wg_map` conversion can be lowered to storing and loading from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory with the #wg_map_b and 2) use wg_map_a mapping to load the data from shared local memory. +The `sg_map` conversion can be lowered to storing and loading from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory with the #sg_map_b and 2) use sg_map_a mapping to load the data from shared local memory. -## Appendix 1 - Code examples for work group level XeGPU using wg_map attribute +## Appendix 1 - Code examples for work group level XeGPU using sg_map attribute ## Appendix 1.1 Simple Gemm with prefetch -The first example shows a simple gemm. It demonstrates the different wg_map we used for prefetch and load. +The first example shows a simple gemm. It demonstrates the different sg_map we used for prefetch and load. The sg_map doesn't show wi_layout and wi_data for simplicity. ```mlir Pseudo code for simple gemm C[4096, 4096] = matmul (A[4096, 4096], B[4096, 4096]) ``` ```mlir -#mp_a = #wg_map -#mp_a_pfh = #wg_map -#mp_b = #wg_map -#mp_b_pfh = #wg_map -#mp_c = #wg_map +#mp_a = #sg_map +#mp_a_pfh = #sg_map +#mp_b = #sg_map +#mp_c = #sg_map func.func @test_gemm(%a : memref<4096x4096xf16>,      %b: memref<4096x4096xf16>, @@ -882,7 +881,7 @@ func.func @test_gemm(%a : memref<4096x4096xf16>, } ``` ## Appendix 1.2 Gemm with transpose, broadcast, and reduction -The second example contains transpose, broadcast, and reduction. +The second example contains transpose, broadcast, and reduction. The sg_map doesn't show wi_layout and wi_data for simplicity. ```mlir Pseduo code for the original problem. 
C[4096, 4096] = matmul (A[4096, 4096], BT[4096, 4096]) + broad_cast(bcast[4096], dim=0) @@ -890,17 +889,17 @@ Reduce[4096] = reduce_add(C[4096, 4096], dim=1) ``` ```mlir -#mp_a = #wg_map -#mp_a_pfh = #wg_map -#mp_b = #wg_map -#mp_bt = #wg_map -#mp_bt_pfh = #wg_map -#mp_c = #wg_map - -#mp_bcast = #wg_map -#mp_bcast2 = #wg_map -#mp_reduce= #wg_map -#mp_reduce2= #wg_map +#mp_a = #sg_map +#mp_a_pfh = #sg_map +#mp_b = #sg_map +#mp_bt = #sg_map +#mp_bt_pfh = #sg_map +#mp_c = #sg_map + +#mp_bcast = #sg_map +#mp_bcast2 = #sg_map +#mp_reduce= #sg_map +#mp_reduce2= #sg_map func.func @test_gemm(%a : memref<4096x4096xf16>,      %b: memref<4096x4096xf16>, @@ -947,15 +946,15 @@ func.func @test_gemm(%a : memref<4096x4096xf16>, ## Appendix 1.3 Gemm implementation with two cache levels For GPU support high-performance prefetch through two level of caches. ```mlir -#mp_a = #wg_map -#mp_b = #wg_map -#mp_c = #wg_map +#mp_a = #sg_map +#mp_b = #sg_map +#mp_c = #sg_map -#mp_a_copl2 = #wg_map -#mp_b_copl2 = #wg_map< sg_layout=[16,2], sg_data=[8,128]> +#mp_a_copl2 = #sg_map +#mp_b_copl2 = #sg_map< sg_layout=[16,2], sg_data=[8,128], order=[1,0]> -#mp_a_copl1 = #wg_map -#mp_b_copl1 = #wg_map< sg_layout=[4, 8], sg_data=[8,32]> +#mp_a_copl1 = #sg_map +#mp_b_copl1 = #sg_map< sg_layout=[4, 8], sg_data=[8,32], order=[1,0]> func.func @test_gemm(%a : memref<4096x4096xf16>,      %b: memref<4096x4096xf16>, From ba353bae174f5e66def39b3af4e2646e56b5a0ef Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 21 Mar 2025 23:29:25 -0700 Subject: [PATCH 09/13] Update XeGPU.md --- docs/rfcs/XeGPU.md | 168 +++++++++++++++++++++++---------------------- 1 file changed, 85 insertions(+), 83 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 15b2882a4..38c190ea5 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -59,9 +59,6 @@ innermost dimension (base_stride[0]) must be 1. into tensor_desc<8x16xbf16> ``` -XeGPU op is carried out by all the work items within a subgroup. The `sg_map` attribute specifies the mapping of each work item to the -data fragments and will be introduced in the next section in details. XeGPU operation without `sg_map` attribute works on the vectors as a whole. - `create_nd_tdesc` can also accept an optional `block_tdesc_attr` to extend its capablity. The `block_tdesc_attr` could encode the following optional attributes: - `memory_space`. It describes where the data block being described is located. `global` means device memory, or `slm` means shared local memory. @@ -301,7 +298,6 @@ In case that certain Xe GPU target does not support atomic operation for a certa xegpu.alloc_nbarrier %total_nbarrier_num: i8 ``` - `init_nbarrier` returns one named barrier with the specified barrier ID to the current thread. Multiple threads may bind to the same named barrier, and the input specifies the number of total participant threads. The returned nbarrier object holds a description of the specified barrier, which encodes all the barrier information. @@ -325,8 +321,6 @@ which encodes all the barrier information. Attribute `scope` describes the scope of fence. "workgroup" means that the scope is within each work group. "gpu" means the scope is across work groups within the gpu. Attribute `Memory_kind` describes the memory kind. "global" means the global memory, "shared" means the shared local memory. -`nbarrier` and `fence` operations lower to uniform instructions, so there is no need to specify the `sg_map`. 
- ## XeGPU Attributes to support Work Item Level semantics **Attribute xegpu.sg_map** @@ -697,20 +691,22 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor ``` -## extending sg_map attributes to support work group level semantic +## layout attributes to support work group level semantic -By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler instead of multiple level nested loop IR for subgroup and work item level operation. To enable XeGPU operate the workgroup level, we introduce `sg_layout` and `sg_data` parameters into `sg_map` attribute to specify how the data is distributed across subgroups. With this extension, `sg_map` enables tensor compiler to express the cooperative operation among subgroups by specifying a `sg_map` to partition data among subgroups without manipulating a nested IR representation. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critical performance knobs. +By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler instead of multiple level nested loop IR for subgroup and work item level operation. To enable XeGPU operate the workgroup level, we introduce `layout` attribute to specify how the data is distributed across subgroups. `layout` enables tensor compiler to express the cooperative operation among subgroups by specifying a `layout` to partition data among subgroups without manipulating a nested IR representation. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critical performance knobs. -we also propose to extend the sg_map to cover multiple dimension instead of 2D only and add `order` attribute. +`layout` attribute can be viewed as upgraded version of `sg_map`. It includes all the parameters in `sg_map`, and adds a few more to support workgroup semantics. `layout` attribute supports multiple dimension, comparing `sg_map` supporting 2D only. -**Extended xegpu.sg_map** -The extended `sg_map` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. sg_map consists of four parameters: +**xegpu.layout (upgraded from xegpu.sg_map)** + +`layout` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. `layout` consists of six parameters: * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimensions can be a 3d array as [dim_0, dim_1, dim_2]. * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. - * wi_layout: Defines the 2D arrangement of WIs within the subgroup. - * wi_data: Specifies the shape of the tensor fragment that each WI loads or stores as a single packed data unit (16/32-bit). + * lane_layout: Defines the 2D arrangement of WIs within the subgroup. It has same semantics as `wi_layout`. + * lane_data: Specifies the shape of the tensor fragment that each WI loads or stores as a single packed data unit (16/32-bit). It has same semantics as `wi_data`. * order: The dimension order used to linearize n-d subgroup ids to 1-d. The first dimension in the order list is the fastest-changing dimension. + * scope: indicate the scope of the data. It can be `wg`, `sg`, or `lane`. 
`wg` stands for workgroup, and `sg` stands for subgroup. Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. | sg_layout[4, 4] | order[1, 0] | order[0, 1] @@ -722,7 +718,9 @@ Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. For a subgroup in 3-d sg_layout [dim_0, dim_1, dim_2], order[2, 1, 0] maps a subgroup with 3-d index [x, y, z] to a linear subgroup index [z + dim_2 * y + dim_2 * dim_1 * x ], order[1, 2, 0] maps to [y + dim_2 * z + dim_2 * dim_1 * x]. -When a sg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The sg_map attribute must be specified when creating the tensor descriptor. +When `layout` of a tensor descriptor has `scope=wg`, load/store/dpas operates at the workgroup level. User must specify `sg_layout`, `lane_layout`, `lane_data`, an `order` for tensor descriptor creation, and an additional `sg_data` is required if the tensor is consumed by daps operation for matrix A and B. +After workgroup to subgroup distribution, the `scope` is replaced as `sg`. `sg_layout` and `sg_data` will be droped out once the `scope` becomes `sg`. +After work item distribution, the `scope` becomes `lane`. `lane_layout`, `lane_data`, and `lane_order` are preserved to describe how the tensor shape in tensor descriptor is decomposed to the data fragments being loaded. **Constraints** @@ -743,17 +741,21 @@ The following conditions must hold: The tensor_desc is distributed to sg_data x sg_layout along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], there is data left after all subgroups are assigned for the first round, the rest data will wrap around and be assigned to the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], the data may be used up before all subgroups are assigned. In this case, we broadcast the tensor data to multiple subgroups by repeating the data assignment to the rest subgroups along that dimension until the all subgroups get data. -**Resulting subgroup Data size** +When user doesn't sg_data, the distribution will automatically decide sg_data by picking `i` and `j`. +sg_data[0] = lane_layout[0] x lane_data[0] x i +sg_data[1] = lane_layout[0] x lane_data[0] x j +i <= tensor_desc[0] / sg_layout[0] / lane_layout[0] / lane_data[0] +j <= tensor_desc[1] / sg_layout[1] / lane_layout[0] / lane_data[0] -The distributed tensor for each subgroup has the same dimension as the work group level tensor. +The distribution reduce the tensor shape to `sg_data` shape, and then the tensor is further distributed by `lane_layout` and `lane_data` using the distribution rule of `sg_map`. -**Examples of workgroup distribution with extended sg_map** +**Examples of workgroup distribution with xegpu.layout** The workgroup creates a tensor descriptor [128, 128] and distributes to 4 subgroups with `sg_layout` [2,2], and each subgroup gets `sg_data` [32,128]. The first dimension is split and distributed to subgroups in two rounds, and the second dimension is assigned as whole to multiple subgroups. ```mlir - #sg_map_a = #xegpu.sg_map - %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #sg_map_a> + #layout_a = #xegpu.layout + %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #layout_a> ``` The table below shows the result tensor for each subgroup and its linear subgroup id. 
@@ -764,93 +766,93 @@ The table below shows the result tensor for each subgroup and its linear subgrou | [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 | | [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 | -The `sg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the `sg_map` attribute to MLIR vector data type, we attach the attribute to vector type-based operations temporarily within the workgroup distribution pass. The `sg_map` attribute propagation can be performed from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. +The `layout` attribute propagates from the matrix multiplication ops to other ops. Since we can't attatch the `layout` attribute to MLIR vector data type, we attach the attribute to vector type-based operations within the workgroup distribution pass. The `layout` attribute propagation can be performed from output to input, or the other direction. We describes below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose. -For `dpas`, the `sg_map` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimension as output, and `sg_data` for k dimension must be same as operand A and B. `order` must be same as output. +For `dpas`, the `layout` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimension as output, and `sg_data` for k dimension must be same as operand A and B. `order` must be same as output. ```mlir - #sg_map_d = #xegpu.sg_map - %vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}: + #layout_d = #xegpu.layout + %vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#layout_d}: vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16> into vector<256x256xfloat> - //derived sg_map for input operands - #sg_map_a = #xegpu.sg_map //sg_map for %vector_a - #sg_map_b = #xegpu.sg_map //sg_map for %vector_b - #sg_map_c = #xegpu.sg_map //sg_map for %vector_c + //derived layout for input operands + #layout_a = #xegpu.layout //layout for %vector_a + #layout_b = #xegpu.layout //layout for %vector_b + #layout_c = #xegpu.layout //layout for %vector_c ``` -For `reduction`, `sg_map` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same and the new dimension as `1`. The new dimension of `sg_data` must be the same as the input tensor size, and the other dimension must be the same as the output's `sg_map`. The new dimension of `order` should not change the existing ordering specified by the output's `sg_map`. +For `reduction`, `layout` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same and the new dimension as `1`. The new dimension of `sg_data` must be the same as the input tensor size, and the other dimension must be the same as the output's `layout`. The new dimension of `order` should not change the existing ordering specified by the output's `layout`. 
```mlir - #sg_map_a = #xegpu.sg_map - %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#sg_map_a}: vector<256x128xfloat> into vector<256xfloat> + #layout_a = #xegpu.layout + %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a}: vector<256x128xfloat> into vector<256xfloat> - //derived sg_map for input operand - #sg_map_b = #xegpu.sg_map + //derived layout for input operand + #layout_b = #xegpu.layout ``` The rule also applies to reduction from 3d to 2d. ```mlir - #sg_map_a = #xegpu.sg_map - %%vector_a = vector.multi_reduction , %vector_b, %cst_0 [1] {#sg_map_a}: vector<8x32x128xf32> to vector<8x128xf32> + #layout_a = #xegpu.layout + %%vector_a = vector.multi_reduction , %vector_b, %cst_0 [1] {#layout_a}: vector<8x32x128xf32> to vector<8x128xf32> - //derived sg_map for input operand - #sg_map_b = #xegpu.sg_map + //derived layout for input operand + #layout_b = #xegpu.layout ``` -For `shape_cast`, it first determines the dimensions being reduced or expanded. The input's `sg_map` needs to expand or reduce the value accordingly for related dimension in `sg_layout` and `sg_data`. `order` should be consistent between input and output. +For `shape_cast`, it first determines the dimensions being reduced or expanded. The input's `layout` needs to expand or reduce the value accordingly for related dimension in `sg_layout` and `sg_data`. `order` should be consistent between input and output. ```mlir - sg_map_a = #xegpu.sg_map - %vector_a = vector.shape_cast %vector_b {#sg_map_a} : vector<256x128xf32> to vector<8x32x128xf32> + layout_a = #xegpu.layout + %vector_a = vector.shape_cast %vector_b {#layout_a} : vector<256x128xf32> to vector<8x32x128xf32> - //derived sg_map for input operand - #sg_map_b = #xegpu.sg_map + //derived layout for input operand + #layout_b = #xegpu.layout ``` -For `broadcast`, `sg_map` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the ouptut sg_map and must be removed for the input operand. The corresponding dimension in `sg_data` and `order` must be removed also. +For `broadcast`, `layout` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the ouptut layout and must be removed for the input operand. The corresponding dimension in `sg_data` and `order` must be removed also. ```mlir - #sg_map_a = #xegpu.sg_map - %vector_a = vector.broadcast %vector_b [1] {#sg_map_a}: vector<256xfloat> into vector<256x256xfloat> + #layout_a = #xegpu.layout + %vector_a = vector.broadcast %vector_b [1] {#layout_a}: vector<256xfloat> into vector<256x256xfloat> - //derived sg_map for input operand - #sg_map_b = #xegpu.sg_map + //derived layout for input operand + #layout_b = #xegpu.layout ``` -For `transpose`, the values in `sg_map` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `order`. +For `transpose`, the values in `layout` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `order`. 
```mlir - #sg_map_a = #xegpu.sg_map - %vector_a = vector.transpose %vector_b {#sg_map_a}: vector<512x128xfloat> into vector<128x512xfloat> + #layout_a = #xegpu.layout + %vector_a = vector.transpose %vector_b {#layout_a}: vector<512x128xfloat> into vector<128x512xfloat> - //derived sg_map for input operand - #sg_map_b = #xegpu.sg_map + //derived layout for input operand + #layout_b = #xegpu.layout ``` -`sg_map` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `sg_map` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `sg_map` to the other. +`layout` may be assinged for certain operation before the workgroup layout propagation, for example, the cooperative load pass may specify `layout` for certain load to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `layout` to the other. -`convert_layout` is introduced to convert two inconsistent `sg_map`. +`convert_layout` is introduced to convert two inconsistent `layout`. ```mlir - #sg_map_b = #xegpu.sg_map // used for cooperative load/prefetch - #sg_map_a = #xegpu.sg_map // used as mma's input matrix A - %vector_a = xegpu.convert_layout %vector_b {#sg_map_a #sg_map_b}: vector<256x256xfloat> into vector<256x256xfloat> + #layout_b = #xegpu.layout // used for cooperative load/prefetch + #layout_a = #xegpu.layout // used as mma's input matrix A + %vector_a = xegpu.convert_layout %vector_b {#layout_a #layout_b}: vector<256x256xfloat> into vector<256x256xfloat> ``` -The `sg_map` conversion can be lowered to storing and loading from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory with the #sg_map_b and 2) use sg_map_a mapping to load the data from shared local memory. +The `layout` conversion can be lowered to storing and loading from the shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory with the #layout_b and 2) use layout_a mapping to load the data from shared local memory. -## Appendix 1 - Code examples for work group level XeGPU using sg_map attribute +## Appendix 1 - Code examples for work group level XeGPU using layout attribute ## Appendix 1.1 Simple Gemm with prefetch -The first example shows a simple gemm. It demonstrates the different sg_map we used for prefetch and load. The sg_map doesn't show wi_layout and wi_data for simplicity. +The first example shows a simple gemm. It demonstrates the different layout we used for prefetch and load. The layout doesn't show lane_layout and lane_data for simplicity. ```mlir Pseudo code for simple gemm C[4096, 4096] = matmul (A[4096, 4096], B[4096, 4096]) ``` ```mlir -#mp_a = #sg_map -#mp_a_pfh = #sg_map -#mp_b = #sg_map -#mp_c = #sg_map +#mp_a = #layout +#mp_a_pfh = #layout +#mp_b = #layout +#mp_c = #layout func.func @test_gemm(%a : memref<4096x4096xf16>,      %b: memref<4096x4096xf16>, @@ -881,7 +883,7 @@ func.func @test_gemm(%a : memref<4096x4096xf16>, } ``` ## Appendix 1.2 Gemm with transpose, broadcast, and reduction -The second example contains transpose, broadcast, and reduction. The sg_map doesn't show wi_layout and wi_data for simplicity. +The second example contains transpose, broadcast, and reduction. The layout doesn't show lane_layout and lane_data for simplicity. ```mlir Pseduo code for the original problem. 
C[4096, 4096] = matmul (A[4096, 4096], BT[4096, 4096]) + broad_cast(bcast[4096], dim=0) @@ -889,17 +891,17 @@ Reduce[4096] = reduce_add(C[4096, 4096], dim=1) ``` ```mlir -#mp_a = #sg_map -#mp_a_pfh = #sg_map -#mp_b = #sg_map -#mp_bt = #sg_map -#mp_bt_pfh = #sg_map -#mp_c = #sg_map - -#mp_bcast = #sg_map -#mp_bcast2 = #sg_map -#mp_reduce= #sg_map -#mp_reduce2= #sg_map +#mp_a = #layout +#mp_a_pfh = #layout +#mp_b = #layout +#mp_bt = #layout +#mp_bt_pfh = #layout +#mp_c = #layout + +#mp_bcast = #layout +#mp_bcast2 = #layout +#mp_reduce= #layout +#mp_reduce2= #layout func.func @test_gemm(%a : memref<4096x4096xf16>,      %b: memref<4096x4096xf16>, @@ -946,15 +948,15 @@ func.func @test_gemm(%a : memref<4096x4096xf16>, ## Appendix 1.3 Gemm implementation with two cache levels For GPU support high-performance prefetch through two level of caches. ```mlir -#mp_a = #sg_map -#mp_b = #sg_map -#mp_c = #sg_map +#mp_a = #layout +#mp_b = #layout +#mp_c = #layout -#mp_a_copl2 = #sg_map -#mp_b_copl2 = #sg_map< sg_layout=[16,2], sg_data=[8,128], order=[1,0]> +#mp_a_copl2 = #layout +#mp_b_copl2 = #layout< sg_layout=[16,2], sg_data=[8,128], order=[1,0]> -#mp_a_copl1 = #sg_map -#mp_b_copl1 = #sg_map< sg_layout=[4, 8], sg_data=[8,32], order=[1,0]> +#mp_a_copl1 = #layout +#mp_b_copl1 = #layout< sg_layout=[4, 8], sg_data=[8,32], order=[1,0]> func.func @test_gemm(%a : memref<4096x4096xf16>,      %b: memref<4096x4096xf16>, From 7f3d3580c48ac8b58718dadf741c0d4eda5ded64 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 28 Mar 2025 22:08:55 -0700 Subject: [PATCH 10/13] Update XeGPU.md add inst_data remove scope remove the statements about lane_data implies packed data unit change the result of WI distribution being 1D. packing happens on 1D WI level code, not related to layout. --- docs/rfcs/XeGPU.md | 101 +++++++++++++++++++++++++++------------------ 1 file changed, 60 insertions(+), 41 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 38c190ea5..4f4cd6b69 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -323,7 +323,7 @@ Attribute `Memory_kind` describes the memory kind. "global" means the global mem ## XeGPU Attributes to support Work Item Level semantics -**Attribute xegpu.sg_map** +**Attribute xegpu.sg_map (To be deprecated)** xegpu.sg_map specifies how a 2D tensor (defined by the tensor descriptor) is partitioned among work items (WIs) within a subgroup. sg_map consists of two parameters: * wi_layout: Defines the 2D arrangement of WIs within the subgroup. @@ -695,18 +695,17 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor By allowing XeGPU operating on workgroup level data size, it provides a concise IR for tensor compiler instead of multiple level nested loop IR for subgroup and work item level operation. To enable XeGPU operate the workgroup level, we introduce `layout` attribute to specify how the data is distributed across subgroups. `layout` enables tensor compiler to express the cooperative operation among subgroups by specifying a `layout` to partition data among subgroups without manipulating a nested IR representation. The attribute allows tensor compiler to control the block size for both the workgroup and subgroup and perform autotuning as the number of subgroups, layout, and tensor size per subgroups are critical performance knobs. -`layout` attribute can be viewed as upgraded version of `sg_map`. It includes all the parameters in `sg_map`, and adds a few more to support workgroup semantics. 
`layout` attribute supports multiple dimension, comparing `sg_map` supporting 2D only. - +`layout` attribute can be viewed as upgraded version of `sg_map`. It includes all the parameters in `sg_map`, and adds a few more to support workgroup semantics. `layout` attribute supports multiple dimensions, whereas `sg_map` is limited to 2D. **xegpu.layout (upgraded from xegpu.sg_map)** `layout` specifies how a n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. `layout` consists of six parameters: - * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. The dimensions can be a 3d array as [dim_0, dim_1, dim_2]. - * sg_data: Specifies the shape of the tensor size for each subgroup after decomposition. - * lane_layout: Defines the 2D arrangement of WIs within the subgroup. It has same semantics as `wi_layout`. - * lane_data: Specifies the shape of the tensor fragment that each WI loads or stores as a single packed data unit (16/32-bit). It has same semantics as `wi_data`. - * order: The dimension order used to linearize n-d subgroup ids to 1-d. The first dimension in the order list is the fastest-changing dimension. - * scope: indicate the scope of the data. It can be `wg`, `sg`, or `lane`. `wg` stands for workgroup, and `sg` stands for subgroup. + * sg_layout: Defines the n-d arrangement of subgroups within the workgroup. + * sg_data: Specifies the shape of the tensor for each subgroup after decomposition. + * inst_data: Specifies the shape of the tensor for each instruction at subgroup level. It maybe identical to sg_data. + * lane_layout: Defines the n-d arrangement of WIs within the subgroup. It was known as `wi_layout`. + * lane_data: Specifies the shape of the tensor fragment that each WI owns. The lane_data must be contiguous. One instruction may owns multiple lane_data. It was known as `wi_data`. + * order: The dimension order used to linearize n-d subgroup ids and lane ids. The first dimension in the order list is the fastest-changing dimension. Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. | sg_layout[4, 4] | order[1, 0] | order[0, 1] @@ -716,48 +715,68 @@ Example of linerized subgourp id regarding order[1, 0] vs. order [0, 1]. | [2, 0], [2, 1], [2, 2], [2, 3] | 8, 9, 10 , 11 | 2, 6, 10, 14 | | [3, 0], [3, 1], [3, 2], [3, 3] | 12, 13, 14, 15 | 3, 7, 11, 15 | -For a subgroup in 3-d sg_layout [dim_0, dim_1, dim_2], order[2, 1, 0] maps a subgroup with 3-d index [x, y, z] to a linear subgroup index [z + dim_2 * y + dim_2 * dim_1 * x ], order[1, 2, 0] maps to [y + dim_2 * z + dim_2 * dim_1 * x]. +For a subgroup in 3-d sg_layout [dim_0, dim_1, dim_2], order[2, 1, 0] maps a subgroup with 3-d index [x, y, z] to a linear subgroup index [z + dim_2 * y + dim_2 * dim_1 * x ], order[1, 2, 0] maps to [y + dim_2 * z + dim_2 * dim_1 * x]. The same order applies to lane ids with the same formula. + +User may specify all these parameters and expect xegpu mechanically and gradually lowers to xevm dialect. After subgroup distribution, `sg_layout` and `sg_data` will be droped. After work item distribution, `lane_layout`, `lane_data`, and `lane_order` will be droped. -When `layout` of a tensor descriptor has `scope=wg`, load/store/dpas operates at the workgroup level. User must specify `sg_layout`, `lane_layout`, `lane_data`, an `order` for tensor descriptor creation, and an additional `sg_data` is required if the tensor is consumed by daps operation for matrix A and B. 
-After workgroup to subgroup distribution, the `scope` is replaced as `sg`. `sg_layout` and `sg_data` will be droped out once the `scope` becomes `sg`. -After work item distribution, the `scope` becomes `lane`. `lane_layout`, `lane_data`, and `lane_order` are preserved to describe how the tensor shape in tensor descriptor is decomposed to the data fragments being loaded. +User may just specify `sg_layout`,`sg_data`, and `order` attributes, and use xegpu passes to automatically fill the rest parameters before lowering. **Constraints** Given these definitions: ```mlir +lane_data_size = lane_data[0] × lane_data[1] +subgroup_size = lane_layout[0] × lane_layout[1] sg_data_size = sg_data[0] × sg_data[1] workgroup_size = sg_layout[0] × sg_layout[1] tensor_size = tensor_desc[0] × tensor_desc[1] ``` The following conditions must hold: - +```mlir +* subgroup_size must represent the number of work items (lanes) in a subgroup for a kernel. * workgroup_size must represent the number of subgroups in a workgroup for a kernel. -* tensor_desc[0] must be either evenly divisible by sg_layout[0] × sg_data[0], or vice versa. -* tensor_desc[1] must be either evenly divisible by sg_layout[1] × sg_data[1], or vice versa. +* for dimension i, tensor_desc[i] must be either evenly divisible by sg_layout[i] × sg_data[i], or equal to sg_data[i]. +* for dimension i, sg_data[i] must be evenly divisible by inst_data[i]. +* for dimension i, inst_data[i] must be evenly divisible by lane_layout[i] x lane_data[i]. +* When lane_data contains multiple elements, they must be contiguous and come from a single dimension. +``` **distribution rule** -The tensor_desc is distributed to sg_data x sg_layout along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], there is data left after all subgroups are assigned for the first round, the rest data will wrap around and be assigned to the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], the data may be used up before all subgroups are assigned. In this case, we broadcast the tensor data to multiple subgroups by repeating the data assignment to the rest subgroups along that dimension until the all subgroups get data. +The workgroup level tensor is first distributed to subgroup and then work item level. -When user doesn't sg_data, the distribution will automatically decide sg_data by picking `i` and `j`. -sg_data[0] = lane_layout[0] x lane_data[0] x i -sg_data[1] = lane_layout[0] x lane_data[0] x j -i <= tensor_desc[0] / sg_layout[0] / lane_layout[0] / lane_data[0] -j <= tensor_desc[1] / sg_layout[1] / lane_layout[0] / lane_data[0] +***subgroup distribution rule*** +The tensor is first distributed to sg_data along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], after all subgroups are assigned for the first round, the rest data will wrap around and be assigned to the first subgroup until the data is completely assigned. If sg_data[i] is equal to tensor_desc[i], the tensor data is broadcasted to all subgroups along the dimension i. -The distribution reduce the tensor shape to `sg_data` shape, and then the tensor is further distributed by `lane_layout` and `lane_data` using the distribution rule of `sg_map`. +***work item distribution rule*** +As sg_data is evenly divisible by distribution_unit_size (i.e., sg_data % distribution_unit_size == 0), and each work item will recieve the distribution unit multiple times, with each unit having lane_data_size. 
+ +Conceptually, the work item (WI) distribution process can be broken down into two steps. The first step divides the sg_data tensor according to `lane_layout` to obtain a subtensor. The second step linerize the elements as a 1D tensor. The order of elements with the linerized tensor are determined by the order attribute and the blocking effects of inst_data within sg_data. **Examples of workgroup distribution with xegpu.layout** -The workgroup creates a tensor descriptor [128, 128] and distributes to 4 subgroups with `sg_layout` [2,2], and each subgroup gets `sg_data` [32,128]. The first dimension is split and distributed to subgroups in two rounds, and the second dimension is assigned as whole to multiple subgroups. +The workgroup creates a tensor descriptor [64, 16] and distributes to 4 subgroups with `sg_layout` [4, 8], and each subgroup gets `sg_data` [16, 16]. The first dimension is split and distributed to 4 subgroups, and the second dimension is broadcast to 8 subgroups. Then it is further distributed to work item level as a vector of 16 bf16 values. + +```mlir + #layout_a = #xegpu.layout + %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<64x16xf16, #layout_a> +   %wg_vec = load_nd %wg_tdesc : tensor_desc<64x16xf16, #layout_a> -> vector<64x16xf16xf16> +// after subgroup distribution +// #layout_a_sg = #xegpu.layout +// %sg_vec_sg = load_nd %sg_tdesc : tensor_desc<16x16xf16, #layout_a_sg> -> vector<16x16xf16xf16> +// after work item distribution +// %sg_vec_wi_0 = load_nd %sg_tdesc : tensor_desc<8x16xf16> -> vector<8xf16> +// %sg_vec_wi_1 = load_nd %sg_tdesc : tensor_desc<8x16xf16> -> vector<8xf16> +// %sg_vec_wi = vector.shuffle %sg_vec_wi_0, %sg_vec_wi_0, [0..165] : vector<16xf16> +``` +The example below shows a workgroup tensor [128, 128] being distributed to 4 subgroups with `sg_layout` [2,2], with each subgroup assigned `sg_data` [32,128]. ```mlir #layout_a = #xegpu.layout %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #layout_a> ``` -The table below shows the result tensor for each subgroup and its linear subgroup id. +The table below illustrates the result tensor for each subgroup and its linear subgroup id. | subgroup tensor | 2D subgroup id | Linearized subgroup id | :--- | :---- | :---- | @@ -770,61 +789,61 @@ The `layout` attribute propagates from the matrix multiplication ops to other op For `dpas`, the `layout` attribute of input operands must have the same `sg_layout`, and `sg_data` for m and n dimension as output, and `sg_data` for k dimension must be same as operand A and B. `order` must be same as output. ```mlir - #layout_d = #xegpu.layout + #layout_d = #xegpu.layout %vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#layout_d}: vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16> into vector<256x256xfloat> //derived layout for input operands - #layout_a = #xegpu.layout //layout for %vector_a - #layout_b = #xegpu.layout //layout for %vector_b - #layout_c = #xegpu.layout //layout for %vector_c + #layout_a = #xegpu.layout //layout for %vector_a + #layout_b = #xegpu.layout //layout for %vector_b + #layout_c = #xegpu.layout //layout for %vector_c ``` For `reduction`, `layout` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same and the new dimension as `1`. The new dimension of `sg_data` must be the same as the input tensor size, and the other dimension must be the same as the output's `layout`. 
The new dimension of `order` should not change the existing ordering specified by the output's `layout`.

 ```mlir
- #layout_a = #xegpu.layout
+ #layout_a = #xegpu.layout
  %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a}: vector<256x128xfloat> into vector<256xfloat>

  //derived layout for input operand
- #layout_b = #xegpu.layout
+ #layout_b = #xegpu.layout
 ```

 The rule also applies to reduction from 3d to 2d.
 ```mlir
- #layout_a = #xegpu.layout
+ #layout_a = #xegpu.layout
  %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a}: vector<8x32x128xf32> to vector<8x128xf32>

  //derived layout for input operand
- #layout_b = #xegpu.layout
+ #layout_b = #xegpu.layout
 ```

 For `shape_cast`, it first determines the dimensions being reduced or expanded. The input's `layout` needs to be expanded or reduced accordingly for the related dimensions in `sg_layout` and `sg_data`. `order` should be consistent between input and output.

 ```mlir
- layout_a = #xegpu.layout
+ layout_a = #xegpu.layout
  %vector_a = vector.shape_cast %vector_b {#layout_a} : vector<256x128xf32> to vector<8x32x128xf32>

  //derived layout for input operand
- #layout_b = #xegpu.layout
+ #layout_b = #xegpu.layout
 ```

 For `broadcast`, `layout` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the output layout and must be removed for the input operand. The corresponding dimension in `sg_data` and `order` must also be removed.

 ```mlir
- #layout_a = #xegpu.layout
+ #layout_a = #xegpu.layout
  %vector_a = vector.broadcast %vector_b [1] {#layout_a}: vector<256xfloat> into vector<256x256xfloat>

  //derived layout for input operand
- #layout_b = #xegpu.layout
+ #layout_b = #xegpu.layout
 ```

 For `transpose`, the values in `layout` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `order`.

 ```mlir
- #layout_a = #xegpu.layout
+ #layout_a = #xegpu.layout
  %vector_a = vector.transpose %vector_b {#layout_a}: vector<512x128xfloat> into vector<128x512xfloat>

  //derived layout for input operand
- #layout_b = #xegpu.layout
+ #layout_b = #xegpu.layout
 ```

 `layout` may be assigned to certain operations before the workgroup layout propagation; for example, the cooperative load pass may specify `layout` for certain loads to be cooperated. In this case, the propagation may insert an operation to express the conversion of one `layout` to the other.

@@ -832,8 +851,8 @@ For `transpose`, the values in `layout` must be swapped for the two dimensions b
 `convert_layout` is introduced to convert between two inconsistent `layout` values.

 ```mlir
- #layout_b = #xegpu.layout // used for cooperative load/prefetch
- #layout_a = #xegpu.layout // used as mma's input matrix A
+ #layout_b = #xegpu.layout // used for cooperative load/prefetch
+ #layout_a = #xegpu.layout // used as mma's input matrix A
  %vector_a = xegpu.convert_layout %vector_b {#layout_a #layout_b}: vector<256x256xfloat> into vector<256x256xfloat>
 ```
 The `layout` conversion can be lowered to storing to and loading from shared local memory. It can be conceptually viewed as a composition of two operations: 1) store the vector to shared local memory using the #layout_b mapping, and 2) load the data from shared local memory using the #layout_a mapping.
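+
+As an illustration only, the toy model below mimics that composition on a small tensor: every subgroup stores the fragment it owns under #layout_b at its global coordinates, then reads back the fragment it owns under #layout_a. The shapes and layout values are invented for the sketch and are not taken from the example above.
+
+```python
+# Toy model of convert_layout lowering through shared local memory (SLM).
+def fragment(sg_data, coord):
+    """Global (row, col) elements owned by the subgroup at 2d coordinate `coord`."""
+    return {(coord[0] * sg_data[0] + r, coord[1] * sg_data[1] + c)
+            for r in range(sg_data[0]) for c in range(sg_data[1])}
+
+# Invented layouts on an 8x8 tensor:
+#   #layout_b (cooperative load): sg_layout = [4, 1], sg_data = [2, 8]
+#   #layout_a (mma operand):      sg_layout = [2, 2], sg_data = [4, 4]
+slm = {}                                  # stands in for the shared local memory
+for row in range(4):                      # step 1: every subgroup stores with #layout_b
+    for elem in fragment((2, 8), (row, 0)):
+        slm[elem] = ("written by", (row, 0))
+
+reload = fragment((4, 4), (0, 0))         # step 2: subgroup (0, 0) loads with #layout_a
+assert reload.issubset(slm)               # everything it needs is already in SLM
+print({slm[e] for e in reload})           # -> written by subgroups (0, 0) and (1, 0), so the
+                                          #    conversion needs the round trip through SLM
+```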
From a5dab91c7d33845063c45ef48f5c8de486a65bc6 Mon Sep 17 00:00:00 2001
From: Jianhui Li
Date: Fri, 9 May 2025 19:09:17 -0700
Subject: [PATCH 11/13] Update XeGPU.md

---
 docs/rfcs/XeGPU.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md
index 4f4cd6b69..71282d31e 100644
--- a/docs/rfcs/XeGPU.md
+++ b/docs/rfcs/XeGPU.md
@@ -703,8 +703,8 @@ By allowing XeGPU operating on workgroup level data size, it provides a concise
  * sg_layout: Defines the n-d arrangement of subgroups within the workgroup.
  * sg_data: Specifies the shape of the tensor for each subgroup after decomposition.
  * inst_data: Specifies the shape of the tensor for each instruction at subgroup level. It may be identical to sg_data.
- * lane_layout: Defines the n-d arrangement of WIs within the subgroup. It was known as `wi_layout`.
- * lane_data: Specifies the shape of the tensor fragment that each WI owns. The lane_data must be contiguous. One instruction may owns multiple lane_data. It was known as `wi_data`.
+ * lane_layout: Defines the n-d arrangement of WIs within the subgroup. It is renamed from sg_map's `wi_layout`.
+ * lane_data: Specifies the shape of the tensor fragment that each WI owns. The lane_data must be contiguous. One instruction may own multiple lane_data. It is renamed from sg_map's `wi_data`.
 * order: The dimension order used to linearize n-d subgroup ids and lane ids. The first dimension in the order list is the fastest-changing dimension. Example of linearized subgroup id regarding order [1, 0] vs. order [0, 1].
@@ -736,9 +736,9 @@ The following conditions must hold:
 ```mlir
 * subgroup_size must represent the number of work items (lanes) in a subgroup for a kernel.
 * workgroup_size must represent the number of subgroups in a workgroup for a kernel.
-* for dimension i, tensor_desc[i] must be either evenly divisible by sg_layout[i] × sg_data[i], or equal to sg_data[i].
-* for dimension i, sg_data[i] must be evenly divisible by inst_data[i].
-* for dimension i, inst_data[i] must be evenly divisible by lane_layout[i] x lane_data[i].
+* for any dimension i, tensor_desc[i] must be either evenly divisible by sg_layout[i] × sg_data[i], or equal to sg_data[i].
+* for any dimension i, sg_data[i] must be evenly divisible by inst_data[i].
+* for any dimension i, inst_data[i] must be evenly divisible by lane_layout[i] x lane_data[i].
 * When lane_data contains multiple elements, they must be contiguous and come from a single dimension.
 ```
@@ -774,7 +774,7 @@ The workgroup creates a tensor descriptor [64, 16] and distributes to 4 subgroup
 The example below shows a workgroup tensor [128, 128] being distributed to 4 subgroups with `sg_layout` [2,2], with each subgroup assigned `sg_data` [32,128].
 ```mlir
  #layout_a = #xegpu.layout
- %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #layout_a>
+ %wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<128x128xf16, #layout_a>
 ```

 The table below illustrates the result tensor for each subgroup and its linear subgroup id.
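+
+For reference, the sketch below (illustrative, not part of the spec) computes the same assignment: the row blocks follow the round-robin rule, the column dimension is broadcast, and the linearized id assumes `order` = [1, 0] (dimension 1 fastest-changing).
+
+```python
+def owned_blocks(dim_size, sg_layout_i, sg_data_i, coord_i):
+    """Index blocks owned along one dimension: round-robin, or broadcast when
+    sg_data covers the whole dimension."""
+    if sg_data_i == dim_size:
+        return [(0, dim_size)]
+    blocks, start, stride = [], coord_i * sg_data_i, sg_layout_i * sg_data_i
+    while start < dim_size:
+        blocks.append((start, start + sg_data_i))
+        start += stride
+    return blocks
+
+tensor, sg_layout, sg_data = (128, 128), (2, 2), (32, 128)
+for i in range(sg_layout[0]):
+    for j in range(sg_layout[1]):
+        linear_id = i * sg_layout[1] + j          # assumes order = [1, 0]
+        rows = owned_blocks(tensor[0], sg_layout[0], sg_data[0], i)
+        cols = owned_blocks(tensor[1], sg_layout[1], sg_data[1], j)
+        print(linear_id, (i, j), rows, cols)
+# Subgroup (0, 0) gets rows [0, 32) and [64, 96); subgroup (1, 0) gets rows [32, 64)
+# and [96, 128); the full column range [0, 128) is shared by both column subgroups.
+```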
From 3a4be5390352e9f2108c2d2ffe8043edb9e8ae56 Mon Sep 17 00:00:00 2001
From: Jianhui Li
Date: Mon, 12 May 2025 21:19:48 -0700
Subject: [PATCH 12/13] Update XeGPU.md

---
 docs/rfcs/XeGPU.md | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md
index 71282d31e..d446ba838 100644
--- a/docs/rfcs/XeGPU.md
+++ b/docs/rfcs/XeGPU.md
@@ -729,14 +729,14 @@ lane_data_size = lane_data[0] × lane_data[1]
 subgroup_size = lane_layout[0] × lane_layout[1]
 sg_data_size = sg_data[0] × sg_data[1]
 workgroup_size = sg_layout[0] × sg_layout[1]
-tensor_size = tensor_desc[0] × tensor_desc[1]
+tensor_size = tensor_shape[0] × tensor_shape[1]
 ```
 The following conditions must hold:
 ```mlir
 * subgroup_size must represent the number of work items (lanes) in a subgroup for a kernel.
 * workgroup_size must represent the number of subgroups in a workgroup for a kernel.
-* for any dimension i, tensor_desc[i] must be either evenly divisible by sg_layout[i] × sg_data[i], or equal to sg_data[i].
+* for any dimension i, tensor_shape[i] must be either evenly divisible by sg_layout[i] × sg_data[i], or equal to sg_data[i].
 * for any dimension i, sg_data[i] must be evenly divisible by inst_data[i].
 * for any dimension i, inst_data[i] must be evenly divisible by lane_layout[i] x lane_data[i].
 * When lane_data contains multiple elements, they must be contiguous and come from a single dimension.
@@ -747,12 +747,12 @@ The following conditions must hold:
 The workgroup-level tensor is first distributed to the subgroup level and then to the work item level.

 ***subgroup distribution rule***
-The tensor is first distributed to sg_data along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], there is data remaining after all subgroups are assigned in the first round; the remaining data wraps around and is assigned again starting from the first subgroup until it is completely assigned. If sg_data[i] is equal to tensor_desc[i], the tensor data is broadcast to all subgroups along dimension i.
+The tensor is first distributed to sg_data along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_shape[i], there is data remaining after all subgroups are assigned in the first round; the remaining data wraps around and is assigned again starting from the first subgroup until it is completely assigned. If sg_data[i] is equal to tensor_shape[i], the tensor data is broadcast to all subgroups along dimension i.

 ***work item distribution rule***
 Since sg_data is evenly divisible by distribution_unit_size (i.e., sg_data % distribution_unit_size == 0), each work item receives the distribution unit multiple times, with each unit containing lane_data_size elements.

-Conceptually, the work item (WI) distribution process can be broken down into two steps. The first step divides the sg_data tensor according to `lane_layout` to obtain a subtensor. The second step linearizes the elements as a 1D tensor. The order of elements within the linearized tensor is determined by the order attribute and the blocking effects of inst_data within sg_data.
+Conceptually, the work item (WI) distribution process can be broken down into two steps. The first step divides the sg_data tensor according to `lane_layout` to obtain a subtensor. The second step linearizes the elements as a 1D tensor. The order of elements within the linearized tensor follows this nesting: lane_data first, then the distribution of lane_data within inst_data, and then inst_data within sg_data, according to the order attribute.
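+
+The simplified sketch below (illustrative only) implements just the first step for the [64, 16] example used earlier, with an assumed lane_layout = [1, 16] and lane_data = [1, 1]; inst_data blocking and the order attribute, which only affect the final element order, are left out.
+
+```python
+def lane_elements(sg_data, lane_layout, lane_data, lane_coord):
+    """(row, col) elements owned by one lane: lane_data blocks assigned round-robin."""
+    elems = []
+    for r in range(lane_coord[0] * lane_data[0], sg_data[0], lane_layout[0] * lane_data[0]):
+        for c in range(lane_coord[1] * lane_data[1], sg_data[1], lane_layout[1] * lane_data[1]):
+            elems += [(r + dr, c + dc) for dr in range(lane_data[0]) for dc in range(lane_data[1])]
+    return elems
+
+# Lane 3 of the subgroup owns column 3 of its 16x16 sg_data tile, rows 0..15 --
+# the 16 f16 values that end up in that lane's vector<16xf16> fragment.
+print(lane_elements((16, 16), (1, 16), (1, 1), (0, 3)))
+```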
**Examples of workgroup distribution with xegpu.layout**

@@ -799,11 +799,12 @@ For `dpas`, the `layout` attribute of input operands must have the same `sg_layo
  #layout_c = #xegpu.layout //layout for %vector_c
 ```

-For `reduction`, `layout` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same and the new dimension as `1`. The new dimension of `sg_data` must be the same as the input tensor size, and the other dimension must be the same as the output's `layout`. The new dimension of `order` should not change the existing ordering specified by the output's `layout`.
+For `reduction`, `xegpu.slice` is introduced to represent the `layout` of the reduced tensor. It inherits a regular `layout` and specifies the dimension being reduced.

 ```mlir
- #layout_a = #xegpu.layout
- %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a}: vector<256x128xfloat> into vector<256xfloat>
+ #layout_a = #xegpu.layout
+ #layout_a_reduce = #xegpu.slice<{dim = 1, parent = #layout_a}>
+ %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a_reduce}: vector<256x128xfloat> into vector<256xfloat>

  //derived layout for input operand
  #layout_b = #xegpu.layout
@@ -811,8 +812,9 @@
 The rule also applies to reduction from 3d to 2d.
 ```mlir
- #layout_a = #xegpu.layout
- %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a}: vector<8x32x128xf32> to vector<8x128xf32>
+ #layout_a = #xegpu.layout
+ #layout_a_reduce = #xegpu.slice<{dim = 1, parent = #layout_a}>
+ %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a_reduce}: vector<8x32x128xf32> to vector<8x128xf32>

  //derived layout for input operand
  #layout_b = #xegpu.layout
@@ -820,21 +822,30 @@ The rule also applies to reduction from 3d to 2d.
 For `shape_cast`, it first determines the dimensions being reduced or expanded. The input's `layout` needs to be expanded or reduced accordingly for the related dimensions in `sg_layout` and `sg_data`. `order` should be consistent between input and output.

 ```mlir
- layout_a = #xegpu.layout
+ #layout_a = #xegpu.layout
  %vector_a = vector.shape_cast %vector_b {#layout_a} : vector<256x128xf32> to vector<8x32x128xf32>

  //derived layout for input operand
  #layout_b = #xegpu.layout
 ```
+When the expanded dimension has size 1, as in the example below, the input operand's `layout` can be expressed as an `xegpu.slice` of the output's 3d `layout`:
+```mlir
+ #layout_a = #xegpu.layout
+ %vector_a = vector.shape_cast %vector_b {#layout_a} : vector<256x128xf32> to vector<256x1x128xf32>
+
+ //derived layout for input operand
+ #layout_b = #xegpu.layout
+ #layout_b_reduce = #xegpu.slice<{dim = 1, parent = #layout_b}>
+```

 For `broadcast`, `layout` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the output layout and must be removed for the input operand. The corresponding dimension in `sg_data` and `order` must also be removed.

 ```mlir
  #layout_a = #xegpu.layout
- %vector_a = vector.broadcast %vector_b [1] {#layout_a}: vector<256xfloat> into vector<256x256xfloat>
+ #layout_a_reduce = #xegpu.slice<{dim = 1, parent = #layout_a}>
+ %vector_a = vector.broadcast %vector_b [1] {#layout_a_reduce}: vector<256xfloat> into vector<256x256xfloat>

  //derived layout for input operand
- #layout_b = #xegpu.layout
+ #layout_b = #xegpu.layout
 ```

 For `transpose`, the values in `layout` must be swapped for the two dimensions being transposed, including `sg_layout`, `sg_data`, and `order`.
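+
+The sketch below (illustrative, not normative) shows how a propagation pass might derive operand layouts from a result layout, modeling `#xegpu.layout` as a plain dictionary and `#xegpu.slice` as a (parent, dim) pair; the field values are invented for the example.
+
+```python
+def transpose_layout(layout, d0=0, d1=1):
+    """Input layout for a 2d transpose: swap the two dims in sg_layout, sg_data, and order."""
+    out = {k: list(layout[k]) for k in ("sg_layout", "sg_data", "order")}
+    for k in ("sg_layout", "sg_data"):
+        out[k][d0], out[k][d1] = out[k][d1], out[k][d0]
+    out["order"] = [d1 if o == d0 else d0 if o == d1 else o for o in layout["order"]]
+    return out
+
+def slice_layout(parent, dim):
+    """#xegpu.slice: reuse the parent layout with dimension `dim` squeezed out."""
+    return {"parent": parent, "dim": dim}
+
+layout_a = {"sg_layout": [8, 4], "sg_data": [16, 128], "order": [1, 0]}  # result layout (invented)
+layout_b = transpose_layout(layout_a)        # derived input layout for vector.transpose
+layout_a_reduce = slice_layout(layout_a, 1)  # layout of a tensor reduced along dim 1
+```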
From f7d7e980e8bbc99007c562e7052f90116375fbf5 Mon Sep 17 00:00:00 2001
From: Jianhui Li
Date: Tue, 8 Jul 2025 08:44:20 -0700
Subject: [PATCH 13/13] Update XeGPU.md

---
 docs/rfcs/XeGPU.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md
index d446ba838..7faa6332b 100644
--- a/docs/rfcs/XeGPU.md
+++ b/docs/rfcs/XeGPU.md
@@ -799,11 +799,12 @@ For `dpas`, the `layout` attribute of input operands must have the same `sg_layo
  #layout_c = #xegpu.layout //layout for %vector_c
 ```

-For `reduction`, `xegpu.slice` is introduced to represent the `layout` of the reduced tensor. It inherits a regular `layout` and specifies the dimension being reduced.
+For `reduction`, `xegpu.slice` is introduced to represent the `layout` of the reduced tensor. It inherits a regular `layout` and specifies the dimension being reduced.
+It conceptually squeezes the threads along that dimension, and all threads share the same mapping as their representative thread.

 ```mlir
  #layout_a = #xegpu.layout
- #layout_a_reduce = #xegpu.slice<{dim = 1, parent = #layout_a}>
+ #layout_a_reduce = #xegpu.slice<#layout_a, 1>
  %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a_reduce}: vector<256x128xfloat> into vector<256xfloat>

  //derived layout for input operand
@@ -813,7 +814,7 @@ For `reduction`, `xegpu.slice` is introduced to represent the `layout` of the re
 The rule also applies to reduction from 3d to 2d.
 ```mlir
  #layout_a = #xegpu.layout
- #layout_a_reduce = #xegpu.slice<{dim = 1, parent = #layout_a}>
+ #layout_a_reduce = #xegpu.slice<#layout_a, 1>
  %vector_a = vector.multi_reduction %vector_b, %cst_0 [1] {#layout_a_reduce}: vector<8x32x128xf32> to vector<8x128xf32>

  //derived layout for input operand
@@ -834,14 +835,14 @@ For `shape_cast`, it first determines the dimensions being reduced or expanded.

  //derived layout for input operand
  #layout_b = #xegpu.layout
- #layout_b_reduce = #xegpu.slice<{dim = 1, parent = #layout_b}>
+ #layout_b_reduce = #xegpu.slice<#layout_b, 1>
 ```

 For `broadcast`, `layout` of the input operand has one less dimension for the broadcast dimension. `sg_layout` for that dimension must be `1` in the output layout and must be removed for the input operand.

 ```mlir
  #layout_a = #xegpu.layout
- #layout_a_reduce = #xegpu.slice<{dim = 1, parent = #layout_a}>
+ #layout_a_reduce = #xegpu.slice<#layout_a, 1>
  %vector_a = vector.broadcast %vector_b [1] {#layout_a_reduce}: vector<256xfloat> into vector<256x256xfloat>

  //derived layout for input operand