If I generate a kernel with the following command in the CLI tool:
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor
then the generated code declares the parameter C twice in the compute (launcher) function. Here is the generated code:
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)
__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C, taco_tensor_t * __restrict__ w){
  int A2_dimension = (int)(A->dimensions[1]);
  int A3_dimension = (int)(A->dimensions[2]);
  float* __restrict__ A_vals = (float*)(A->vals);
  int B2_dimension = (int)(B->dimensions[1]);
  int B3_dimension = (int)(B->dimensions[2]);
  int B4_dimension = (int)(B->dimensions[3]);
  float* __restrict__ B_vals = (float*)(B->vals);
  int C1_dimension = (int)(C->dimensions[0]);
  int C2_dimension = (int)(C->dimensions[1]);
  int C3_dimension = (int)(C->dimensions[2]);
  float* __restrict__ C_vals = (float*)(C->vals);
  int w1_dimension = (int)(w->dimensions[0]);
  int w2_dimension = (int)(w->dimensions[1]);
  float* __restrict__ w_vals = (float*)(w->vals);

  int32_t i161 = blockIdx.x;
  int32_t i162 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }
  int32_t i = i161 * 256 + i162;
  if (i >= C1_dimension)
    return;

  for (int32_t j = 0; j < C2_dimension; j++) {
    int32_t jC = i * C2_dimension + j;
    for (int32_t b = 0; b < C3_dimension; b++) {
      int32_t bC = jC * C3_dimension + b;
      float tl_val = 0.0;
      for (int32_t l = 0; l < B3_dimension; l++) {
        int32_t jA = l * A2_dimension + j;
        int32_t bA = jA * A3_dimension + b;
        float tk_val = 0.0;
        for (int32_t k = 0; k < w1_dimension; k++) {
          int32_t kB = i * B2_dimension + k;
          int32_t lB = kB * B3_dimension + l;
          int32_t bB = lB * B4_dimension + b;
          int32_t bw = k * w2_dimension + b;
          tk_val = tk_val + (A_vals[bA] * B_vals[bB]) * w_vals[bw];
        }
        tl_val = tl_val + tk_val;
      }
      C_vals[bC] = C_vals[bC] + tl_val;
    }
  }
}

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w, taco_tensor_t *C) {
  int C1_dimension = (int)(C->dimensions[0]);
  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}
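For comparison, this is the launcher signature I would have expected, with C passed only once (my own sketch, not compiler output; the body is unchanged):

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w) {
  int C1_dimension = (int)(C->dimensions[0]);
  // Same launch as above; only the duplicate C parameter is removed.
  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}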
I built taco from source with gcc-11 (optimization flags -fPIC -O3), with pybind11, and with the CUDA toolkit bundled with nvhpc 23.9. The commit hash (output of git log -n 1; I hope that is the correct command to use here) is:
git log -n 1
commit 2b8ece4
One more question: I wanted to use the index "b" to express batched tensor contractions here, and I hoped the kernel would distribute the workload over the last dimension of C (C->dimensions[2]), since that index is by far the largest. I also provided its size on the command line hoping it would be used. What am I doing wrong? Should I provide a schedule, or what else is needed for that?
I tried:
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="parallelize(b, GPUBlock, NoRaces)"
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)
terminate called after throwing an instance of 'taco::TacoException'
what(): Compiler bug at /home/primrose/Installed/taco/src/codegen/codegen_cuda.cpp:374 in visit
Please report it to developers
Condition failed: blockIDVars.size() == threadIDVars.size()
No matching GPUThread parallelize
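Reading the failed condition, my guess is that every GPUBlock parallelize has to be paired with a matching GPUThread parallelize, for example by splitting b first. This is roughly what I would try next (an untested sketch on my part; the variable names b0/b1 and the split factor 256 are my own assumptions, not anything taco produced):

taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="split(b, b0, b1, 256),parallelize(b0, GPUBlock, NoRaces),parallelize(b1, GPUThread, NoRaces)"

Is that the intended way to get the batch index b mapped to blocks and threads, or is there a better schedule for this case?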