Double Occurrence of Parameters in Kernels Generated with the CLI Tool #557

@ThrudPrimrose

Description

If I generate a kernel with the following command in the CLI tool:
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor

The generated kernel has the parameter C twice in the launcher function compute. Here is the generated code:


// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C, taco_tensor_t * __restrict__ w){
  int A2_dimension = (int)(A->dimensions[1]);
  int A3_dimension = (int)(A->dimensions[2]);
  float* __restrict__ A_vals = (float*)(A->vals);
  int B2_dimension = (int)(B->dimensions[1]);
  int B3_dimension = (int)(B->dimensions[2]);
  int B4_dimension = (int)(B->dimensions[3]);
  float* __restrict__ B_vals = (float*)(B->vals);
  int C1_dimension = (int)(C->dimensions[0]);
  int C2_dimension = (int)(C->dimensions[1]);
  int C3_dimension = (int)(C->dimensions[2]);
  float* __restrict__ C_vals = (float*)(C->vals);
  int w1_dimension = (int)(w->dimensions[0]);
  int w2_dimension = (int)(w->dimensions[1]);
  float* __restrict__ w_vals = (float*)(w->vals);

  int32_t i161 = blockIdx.x;
  int32_t i162 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }

  int32_t i = i161 * 256 + i162;
  if (i >= C1_dimension)
    return;

  for (int32_t j = 0; j < C2_dimension; j++) {
    int32_t jC = i * C2_dimension + j;
    for (int32_t b = 0; b < C3_dimension; b++) {
      int32_t bC = jC * C3_dimension + b;
      float tl_val = 0.0;
      for (int32_t l = 0; l < B3_dimension; l++) {
        int32_t jA = l * A2_dimension + j;
        int32_t bA = jA * A3_dimension + b;
        float tk_val = 0.0;
        for (int32_t k = 0; k < w1_dimension; k++) {
          int32_t kB = i * B2_dimension + k;
          int32_t lB = kB * B3_dimension + l;
          int32_t bB = lB * B4_dimension + b;
          int32_t bw = k * w2_dimension + b;
          tk_val = tk_val + (A_vals[bA] * B_vals[bB]) * w_vals[bw];
        }
        tl_val = tl_val + tk_val;
      }
      C_vals[bC] = C_vals[bC] + tl_val;
    }
  }
}

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w, taco_tensor_t *C) {
  int C1_dimension = (int)(C->dimensions[0]);

  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}
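
For reference, this is the launcher I compile after deleting the duplicate parameter by hand; I assume this is what the generator intended, but I have not verified that against the codegen:

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w) {
  int C1_dimension = (int)(C->dimensions[0]);

  // same launch configuration as generated: 256 threads per block over the i dimension
  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}

With the duplicate removed, the signature matches the four tensors the kernel actually uses.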

I have built taco from source with gcc-11 (optimization flags -fPIC -O3), with pybind11 and the CUDA toolkit bundled with NVHPC 23.9. The commit hash (output of git log -n 1) is:

commit 2b8ece4

Also, one more question: I want to use the index "b" to express batched tensor contractions here. I hoped that the kernel would distribute the workload over C->dimensions[2] (the b index), since that last dimension is by far the largest, and I provided its size on the command line in the hope that it would be used. What am I doing wrong? Should I provide a schedule, and if so, what should it look like?

I tried:

taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="parallelize(b, GPUBlock, NoRaces)"
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

terminate called after throwing an instance of 'taco::TacoException'
  what():  Compiler bug at /home/primrose/Installed/taco/src/codegen/codegen_cuda.cpp:374 in visit
Please report it to developers
 Condition failed: blockIDVars.size() == threadIDVars.size()
 No matching GPUThread parallelize 
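
Reading the assertion, it looks like the CUDA backend expects every GPUBlock parallelize to be paired with a GPUThread parallelize. Below is an untested guess at a schedule that might satisfy that invariant, splitting b into a block variable b0 and a thread variable b1 (256 threads per block, matching the generated kernel's launch configuration); the split/parallelize directives and the comma-separated -s syntax are my assumptions about the scheduling language:

taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="split(b,b0,b1,256),parallelize(b0,GPUBlock,NoRaces),parallelize(b1,GPUThread,NoRaces)"

Is this roughly the intended usage, or does batching over b require a different set of directives?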
