
[RISCV] custom scmp(x,0) and scmp(0,x) lowering for RVV #151753

Open

camel-cdr wants to merge 2 commits into main from rvv_scmp

Conversation

camel-cdr

I noticed that the current codegen for scmp(x,0) and scmp(0,x), also known as sign(x) and -sign(x), isn't optimal for RVV.
It produces a four-instruction sequence

    vmsgt.vi + vmslt.vi + vmerge.vim + vmerge.vim

for SEW<=32 and three instructions for SEW=64.

    scmp(0,x): vmsgt.vi + vsra.vx + vor.vi
    scmp(x,0): vmsgt.vi + vsrl.vx + vmerge.vim

This patch introduces a new lowering for all values of SEW which expresses the above in SelectionDAG Nodes.
This maps to two arithmetic instructions and a vector register move:

    scmp(0,x): vmv.v.i/v + vmsgt.vi + masked vsra.vi/vx
    scmp(x,0): vmv.v.i/v + vmsgt.vi + masked vsrl.vi/vx

These clobber v0, need a destination register different from the input, and need an additional GPR for SEW=64.
For the SEW<=32 scmp(x,0) case a slightly different lowering was chosen:

    scmp(x,0): vmin.vx + vsra.vi + vor.vv

This doesn't clobber v0, but it uses a single GPR.
I deemed using a single GPR slightly better than clobbering v0 (SEW<=32), and using two GPRs worse than using one GPR and clobbering v0, but I haven't done any empirical tests.
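For reference, the scalar identity behind the SEW<=32 scmp(x,0) lowering is sign(x) = (x >> (SEW-1)) | min(x, 1), using an arithmetic right shift. A minimal standalone check of that identity (illustrative C++ only, not part of the patch):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    // scmp(x, 0) == sign(x) for a 32-bit element: the arithmetic shift
    // yields 0 or -1, min(x, 1) yields 1 for positive x and x otherwise,
    // and OR-ing the two gives 1, 0 or -1.
    static int32_t scmp_x_zero(int32_t x) {
      return (x >> 31) | std::min(x, int32_t{1});
    }

    int main() {
      assert(scmp_x_zero(42) == 1);
      assert(scmp_x_zero(0) == 0);
      assert(scmp_x_zero(-7) == -1);
      return 0;
    }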


I'm not sure why the fixed-vectors tests are so messed up, so I marked this as a draft for now.

This type of lowering is also advantageous for SVE and AVX512 and could be implemented generically in TargetLowering::expandCMP, but you need to know whether integer shift instructions of the element size are available.
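For illustration, a generic version might gate on shift (and smin) availability roughly like this; the locals (LHS, RHS, VT, dl, DAG) are assumed to be the usual operands inside TargetLowering::expandCMP, so treat this as a sketch rather than the actual implementation:

    // Sketch only: use the shift-based scmp(lhs, 0) expansion when the
    // target has element-wide shifts and smin for this vector type.
    if (isOperationLegalOrCustom(ISD::SRA, VT) &&
        isOperationLegalOrCustom(ISD::SMIN, VT) &&
        ISD::isConstantSplatVectorAllZeros(RHS.getNode())) {
      SDValue Shift = DAG.getConstant(VT.getScalarSizeInBits() - 1, dl, VT);
      SDValue Sra = DAG.getNode(ISD::SRA, dl, VT, LHS, Shift);
      SDValue Min =
          DAG.getNode(ISD::SMIN, dl, VT, LHS, DAG.getConstant(1, dl, VT));
      return DAG.getNode(ISD::OR, dl, VT, Sra, Min);
    }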

Here are the alive2 transforms:


github-actions bot commented Aug 1, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Olaf Bernstein (camel-cdr)

Changes

Patch is 29.99 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151753.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+31)
  • (added) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-scmp.ll (+596)
  • (added) llvm/test/CodeGen/RISCV/rvv/scmp.ll (+200)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index adbfbeb4669e7..f36f134fff452 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -880,6 +880,7 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
       setOperationAction({ISD::SMIN, ISD::SMAX, ISD::UMIN, ISD::UMAX}, VT,
                          Legal);
 
+      setOperationAction(ISD::SCMP, VT, Custom);
       setOperationAction({ISD::ABDS, ISD::ABDU}, VT, Custom);
 
       // Custom-lower extensions and truncations from/to mask types.
@@ -1361,6 +1362,7 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
         setOperationAction(
             {ISD::SMIN, ISD::SMAX, ISD::UMIN, ISD::UMAX, ISD::ABS}, VT, Custom);
 
+        setOperationAction(ISD::SCMP, VT, Custom);
         setOperationAction({ISD::ABDS, ISD::ABDU}, VT, Custom);
 
         // vXi64 MULHS/MULHU requires the V extension instead of Zve64*.
@@ -8223,6 +8225,35 @@ SDValue RISCVTargetLowering::LowerOperation(SDValue Op,
   case ISD::SADDSAT:
   case ISD::SSUBSAT:
     return lowerToScalableOp(Op, DAG);
+  case ISD::SCMP: {
+    SDLoc DL(Op);
+    EVT VT = Op->getValueType(0);
+    SDValue LHS = DAG.getFreeze(Op->getOperand(0));
+    SDValue RHS = DAG.getFreeze(Op->getOperand(1));
+    unsigned SEW = VT.getScalarSizeInBits();
+
+    SDValue Shift = DAG.getConstant(SEW-1, DL, VT);
+    SDValue Zero = DAG.getConstant(0, DL, VT);
+    SDValue One = DAG.getConstant(1, DL, VT);
+    SDValue MinusOne = DAG.getAllOnesConstant(DL, VT);
+
+    if (ISD::isConstantSplatVectorAllZeros(RHS.getNode())) {
+      SDValue Sra = DAG.getNode(ISD::SRA, DL, VT, LHS, Shift);
+      if (SEW <= 32) {
+        // scmp(lhs, 0) -> vor.vv(vsra.vi(lhs,SEW-1), vmin.vx(lhs,1))
+        SDValue Min = DAG.getNode(ISD::SMIN, DL, VT, LHS, One);
+        return DAG.getNode(ISD::OR, DL, VT, Sra, Min);
+      }
+      // scmp(lhs, 0) -> vmerge.vi(vmsgt.vi(lhs,0), vsra.vx(lhs,SEW-1), 1)
+      return DAG.getSelectCC(DL, LHS, Zero, One, Sra, ISD::SETGT);
+    } else if (ISD::isConstantSplatVectorAllZeros(LHS.getNode())) {
+      // scmp(0, rhs) -> vmerge.vi(vmsgt.vi(rhs,0), vsrl.vi/vx(rhs,SEW-1), -1)
+      SDValue Srl = DAG.getNode(ISD::SRL, DL, VT, RHS, Shift);
+      return DAG.getSelectCC(DL, RHS, Zero, MinusOne, Srl, ISD::SETGT);
+    }
+
+    return SDValue();
+  }
   case ISD::ABDS:
   case ISD::ABDU: {
     SDLoc dl(Op);
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-scmp.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-scmp.ll
new file mode 100644
index 0000000000000..444d3a08216c9
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-scmp.ll
@@ -0,0 +1,596 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
+; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
+
+define <16 x i8> @scmp_i8i8(<16 x i8> %a, <16 x i8> %b) {
+; CHECK-LABEL: scmp_i8i8:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 16, e8, m1, ta, ma
+; CHECK-NEXT:    vmslt.vv v0, v9, v8
+; CHECK-NEXT:    vmv.v.i v10, 0
+; CHECK-NEXT:    vmerge.vim v10, v10, 1, v0
+; CHECK-NEXT:    vmslt.vv v0, v8, v9
+; CHECK-NEXT:    vmerge.vim v8, v10, -1, v0
+; CHECK-NEXT:    ret
+entry:
+  %c = call <16 x i8> @llvm.scmp(<16 x i8> %a, <16 x i8> %b)
+  ret <16 x i8> %c
+}
+
+define <16 x i8> @scmp_z8i8(<16 x i8> %a) {
+; RV32-LABEL: scmp_z8i8:
+; RV32:       # %bb.0: # %entry
+; RV32-NEXT:    addi sp, sp, -16
+; RV32-NEXT:    .cfi_def_cfa_offset 16
+; RV32-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
+; RV32-NEXT:    .cfi_offset s0, -4
+; RV32-NEXT:    vsetivli zero, 16, e8, m1, ta, ma
+; RV32-NEXT:    vslidedown.vi v10, v8, 9
+; RV32-NEXT:    vsrl.vi v9, v8, 7
+; RV32-NEXT:    vslidedown.vi v11, v8, 8
+; RV32-NEXT:    vslidedown.vi v12, v8, 10
+; RV32-NEXT:    vmv.x.s a0, v10
+; RV32-NEXT:    vslidedown.vi v10, v8, 11
+; RV32-NEXT:    vmv.x.s a1, v11
+; RV32-NEXT:    vslidedown.vi v11, v8, 12
+; RV32-NEXT:    vmv.x.s a2, v12
+; RV32-NEXT:    vslidedown.vi v12, v8, 13
+; RV32-NEXT:    vmv.x.s a3, v10
+; RV32-NEXT:    vslidedown.vi v10, v8, 14
+; RV32-NEXT:    vmv.x.s a4, v11
+; RV32-NEXT:    vslidedown.vi v11, v8, 15
+; RV32-NEXT:    vmv.x.s a5, v12
+; RV32-NEXT:    vslidedown.vi v12, v8, 1
+; RV32-NEXT:    vmv.x.s t4, v8
+; RV32-NEXT:    vmv.x.s a6, v10
+; RV32-NEXT:    vslidedown.vi v10, v8, 2
+; RV32-NEXT:    vmv.x.s a7, v11
+; RV32-NEXT:    vslidedown.vi v11, v8, 3
+; RV32-NEXT:    vmv.x.s t0, v12
+; RV32-NEXT:    vslidedown.vi v12, v8, 4
+; RV32-NEXT:    vmv.x.s t1, v10
+; RV32-NEXT:    vslidedown.vi v10, v8, 5
+; RV32-NEXT:    vmv.x.s t2, v11
+; RV32-NEXT:    vslidedown.vi v11, v8, 6
+; RV32-NEXT:    vslidedown.vi v8, v8, 7
+; RV32-NEXT:    li t6, 255
+; RV32-NEXT:    vmv.x.s t3, v12
+; RV32-NEXT:    vslidedown.vi v12, v9, 9
+; RV32-NEXT:    vmv.x.s t5, v10
+; RV32-NEXT:    vslidedown.vi v10, v9, 8
+; RV32-NEXT:    sgtz s0, t4
+; RV32-NEXT:    vmv.x.s t4, v11
+; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; RV32-NEXT:    vmv.s.x v0, t6
+; RV32-NEXT:    vsetvli zero, zero, e8, m1, ta, mu
+; RV32-NEXT:    vmv.x.s t6, v9
+; RV32-NEXT:    addi s0, s0, -1
+; RV32-NEXT:    or t6, s0, t6
+; RV32-NEXT:    vmv.x.s s0, v12
+; RV32-NEXT:    vslidedown.vi v11, v9, 10
+; RV32-NEXT:    sgtz a0, a0
+; RV32-NEXT:    addi a0, a0, -1
+; RV32-NEXT:    or a0, a0, s0
+; RV32-NEXT:    vmv.x.s s0, v10
+; RV32-NEXT:    vslidedown.vi v10, v9, 11
+; RV32-NEXT:    sgtz a1, a1
+; RV32-NEXT:    addi a1, a1, -1
+; RV32-NEXT:    or a1, a1, s0
+; RV32-NEXT:    vmv.x.s s0, v11
+; RV32-NEXT:    vslidedown.vi v11, v9, 12
+; RV32-NEXT:    sgtz a2, a2
+; RV32-NEXT:    addi a2, a2, -1
+; RV32-NEXT:    or a2, a2, s0
+; RV32-NEXT:    vmv.x.s s0, v10
+; RV32-NEXT:    vslidedown.vi v10, v9, 13
+; RV32-NEXT:    sgtz a3, a3
+; RV32-NEXT:    addi a3, a3, -1
+; RV32-NEXT:    or a3, a3, s0
+; RV32-NEXT:    vmv.x.s s0, v11
+; RV32-NEXT:    vslidedown.vi v11, v9, 14
+; RV32-NEXT:    sgtz a4, a4
+; RV32-NEXT:    addi a4, a4, -1
+; RV32-NEXT:    or a4, a4, s0
+; RV32-NEXT:    vmv.x.s s0, v10
+; RV32-NEXT:    vslidedown.vi v10, v9, 15
+; RV32-NEXT:    sgtz a5, a5
+; RV32-NEXT:    addi a5, a5, -1
+; RV32-NEXT:    or a5, a5, s0
+; RV32-NEXT:    vmv.x.s s0, v11
+; RV32-NEXT:    vslidedown.vi v11, v9, 1
+; RV32-NEXT:    sgtz a6, a6
+; RV32-NEXT:    addi a6, a6, -1
+; RV32-NEXT:    or a6, a6, s0
+; RV32-NEXT:    vmv.x.s s0, v10
+; RV32-NEXT:    vslidedown.vi v10, v9, 2
+; RV32-NEXT:    sgtz a7, a7
+; RV32-NEXT:    addi a7, a7, -1
+; RV32-NEXT:    or a7, a7, s0
+; RV32-NEXT:    vmv.x.s s0, v11
+; RV32-NEXT:    vslidedown.vi v11, v9, 3
+; RV32-NEXT:    sgtz t0, t0
+; RV32-NEXT:    addi t0, t0, -1
+; RV32-NEXT:    or t0, t0, s0
+; RV32-NEXT:    vmv.x.s s0, v8
+; RV32-NEXT:    vmv.v.x v8, t6
+; RV32-NEXT:    vmv.x.s t6, v10
+; RV32-NEXT:    vslidedown.vi v10, v9, 4
+; RV32-NEXT:    sgtz t1, t1
+; RV32-NEXT:    addi t1, t1, -1
+; RV32-NEXT:    or t1, t1, t6
+; RV32-NEXT:    vmv.x.s t6, v11
+; RV32-NEXT:    vslidedown.vi v11, v9, 5
+; RV32-NEXT:    sgtz t2, t2
+; RV32-NEXT:    addi t2, t2, -1
+; RV32-NEXT:    or t2, t2, t6
+; RV32-NEXT:    vmv.x.s t6, v10
+; RV32-NEXT:    vslidedown.vi v10, v9, 6
+; RV32-NEXT:    vslidedown.vi v9, v9, 7
+; RV32-NEXT:    sgtz t3, t3
+; RV32-NEXT:    sgtz t5, t5
+; RV32-NEXT:    addi t3, t3, -1
+; RV32-NEXT:    or t3, t3, t6
+; RV32-NEXT:    vmv.x.s t6, v11
+; RV32-NEXT:    sgtz t4, t4
+; RV32-NEXT:    addi t5, t5, -1
+; RV32-NEXT:    or t5, t5, t6
+; RV32-NEXT:    vmv.x.s t6, v10
+; RV32-NEXT:    sgtz s0, s0
+; RV32-NEXT:    addi t4, t4, -1
+; RV32-NEXT:    or t4, t4, t6
+; RV32-NEXT:    vmv.x.s t6, v9
+; RV32-NEXT:    addi s0, s0, -1
+; RV32-NEXT:    or t6, s0, t6
+; RV32-NEXT:    vmv.v.x v9, a1
+; RV32-NEXT:    vslide1down.vx v8, v8, t0
+; RV32-NEXT:    vslide1down.vx v9, v9, a0
+; RV32-NEXT:    vslide1down.vx v8, v8, t1
+; RV32-NEXT:    vslide1down.vx v9, v9, a2
+; RV32-NEXT:    vslide1down.vx v8, v8, t2
+; RV32-NEXT:    vslide1down.vx v9, v9, a3
+; RV32-NEXT:    vslide1down.vx v8, v8, t3
+; RV32-NEXT:    vslide1down.vx v9, v9, a4
+; RV32-NEXT:    vslide1down.vx v8, v8, t5
+; RV32-NEXT:    vslide1down.vx v9, v9, a5
+; RV32-NEXT:    vslide1down.vx v10, v8, t4
+; RV32-NEXT:    vslide1down.vx v8, v9, a6
+; RV32-NEXT:    vslide1down.vx v8, v8, a7
+; RV32-NEXT:    vslide1down.vx v9, v10, t6
+; RV32-NEXT:    vslidedown.vi v8, v9, 8, v0.t
+; RV32-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
+; RV32-NEXT:    .cfi_restore s0
+; RV32-NEXT:    addi sp, sp, 16
+; RV32-NEXT:    .cfi_def_cfa_offset 0
+; RV32-NEXT:    ret
+;
+; RV64-LABEL: scmp_z8i8:
+; RV64:       # %bb.0: # %entry
+; RV64-NEXT:    addi sp, sp, -16
+; RV64-NEXT:    .cfi_def_cfa_offset 16
+; RV64-NEXT:    sd s0, 8(sp) # 8-byte Folded Spill
+; RV64-NEXT:    .cfi_offset s0, -8
+; RV64-NEXT:    vsetivli zero, 16, e8, m1, ta, ma
+; RV64-NEXT:    vslidedown.vi v10, v8, 9
+; RV64-NEXT:    vsrl.vi v9, v8, 7
+; RV64-NEXT:    vslidedown.vi v11, v8, 8
+; RV64-NEXT:    vslidedown.vi v12, v8, 10
+; RV64-NEXT:    vmv.x.s a0, v10
+; RV64-NEXT:    vslidedown.vi v10, v8, 11
+; RV64-NEXT:    vmv.x.s a1, v11
+; RV64-NEXT:    vslidedown.vi v11, v8, 12
+; RV64-NEXT:    vmv.x.s a2, v12
+; RV64-NEXT:    vslidedown.vi v12, v8, 13
+; RV64-NEXT:    vmv.x.s a3, v10
+; RV64-NEXT:    vslidedown.vi v10, v8, 14
+; RV64-NEXT:    vmv.x.s a4, v11
+; RV64-NEXT:    vslidedown.vi v11, v8, 15
+; RV64-NEXT:    vmv.x.s a5, v12
+; RV64-NEXT:    vslidedown.vi v12, v8, 1
+; RV64-NEXT:    vmv.x.s t4, v8
+; RV64-NEXT:    vmv.x.s a6, v10
+; RV64-NEXT:    vslidedown.vi v10, v8, 2
+; RV64-NEXT:    vmv.x.s a7, v11
+; RV64-NEXT:    vslidedown.vi v11, v8, 3
+; RV64-NEXT:    vmv.x.s t0, v12
+; RV64-NEXT:    vslidedown.vi v12, v8, 4
+; RV64-NEXT:    vmv.x.s t1, v10
+; RV64-NEXT:    vslidedown.vi v10, v8, 5
+; RV64-NEXT:    vmv.x.s t2, v11
+; RV64-NEXT:    vslidedown.vi v11, v8, 6
+; RV64-NEXT:    vslidedown.vi v8, v8, 7
+; RV64-NEXT:    li t6, 255
+; RV64-NEXT:    vmv.x.s t3, v12
+; RV64-NEXT:    vslidedown.vi v12, v9, 9
+; RV64-NEXT:    vmv.x.s t5, v10
+; RV64-NEXT:    vslidedown.vi v10, v9, 8
+; RV64-NEXT:    sgtz s0, t4
+; RV64-NEXT:    vmv.x.s t4, v11
+; RV64-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; RV64-NEXT:    vmv.s.x v0, t6
+; RV64-NEXT:    vsetvli zero, zero, e8, m1, ta, mu
+; RV64-NEXT:    vmv.x.s t6, v9
+; RV64-NEXT:    addi s0, s0, -1
+; RV64-NEXT:    or t6, s0, t6
+; RV64-NEXT:    vmv.x.s s0, v12
+; RV64-NEXT:    vslidedown.vi v11, v9, 10
+; RV64-NEXT:    sgtz a0, a0
+; RV64-NEXT:    addi a0, a0, -1
+; RV64-NEXT:    or a0, a0, s0
+; RV64-NEXT:    vmv.x.s s0, v10
+; RV64-NEXT:    vslidedown.vi v10, v9, 11
+; RV64-NEXT:    sgtz a1, a1
+; RV64-NEXT:    addi a1, a1, -1
+; RV64-NEXT:    or a1, a1, s0
+; RV64-NEXT:    vmv.x.s s0, v11
+; RV64-NEXT:    vslidedown.vi v11, v9, 12
+; RV64-NEXT:    sgtz a2, a2
+; RV64-NEXT:    addi a2, a2, -1
+; RV64-NEXT:    or a2, a2, s0
+; RV64-NEXT:    vmv.x.s s0, v10
+; RV64-NEXT:    vslidedown.vi v10, v9, 13
+; RV64-NEXT:    sgtz a3, a3
+; RV64-NEXT:    addi a3, a3, -1
+; RV64-NEXT:    or a3, a3, s0
+; RV64-NEXT:    vmv.x.s s0, v11
+; RV64-NEXT:    vslidedown.vi v11, v9, 14
+; RV64-NEXT:    sgtz a4, a4
+; RV64-NEXT:    addi a4, a4, -1
+; RV64-NEXT:    or a4, a4, s0
+; RV64-NEXT:    vmv.x.s s0, v10
+; RV64-NEXT:    vslidedown.vi v10, v9, 15
+; RV64-NEXT:    sgtz a5, a5
+; RV64-NEXT:    addi a5, a5, -1
+; RV64-NEXT:    or a5, a5, s0
+; RV64-NEXT:    vmv.x.s s0, v11
+; RV64-NEXT:    vslidedown.vi v11, v9, 1
+; RV64-NEXT:    sgtz a6, a6
+; RV64-NEXT:    addi a6, a6, -1
+; RV64-NEXT:    or a6, a6, s0
+; RV64-NEXT:    vmv.x.s s0, v10
+; RV64-NEXT:    vslidedown.vi v10, v9, 2
+; RV64-NEXT:    sgtz a7, a7
+; RV64-NEXT:    addi a7, a7, -1
+; RV64-NEXT:    or a7, a7, s0
+; RV64-NEXT:    vmv.x.s s0, v11
+; RV64-NEXT:    vslidedown.vi v11, v9, 3
+; RV64-NEXT:    sgtz t0, t0
+; RV64-NEXT:    addi t0, t0, -1
+; RV64-NEXT:    or t0, t0, s0
+; RV64-NEXT:    vmv.x.s s0, v8
+; RV64-NEXT:    vmv.v.x v8, t6
+; RV64-NEXT:    vmv.x.s t6, v10
+; RV64-NEXT:    vslidedown.vi v10, v9, 4
+; RV64-NEXT:    sgtz t1, t1
+; RV64-NEXT:    addi t1, t1, -1
+; RV64-NEXT:    or t1, t1, t6
+; RV64-NEXT:    vmv.x.s t6, v11
+; RV64-NEXT:    vslidedown.vi v11, v9, 5
+; RV64-NEXT:    sgtz t2, t2
+; RV64-NEXT:    addi t2, t2, -1
+; RV64-NEXT:    or t2, t2, t6
+; RV64-NEXT:    vmv.x.s t6, v10
+; RV64-NEXT:    vslidedown.vi v10, v9, 6
+; RV64-NEXT:    vslidedown.vi v9, v9, 7
+; RV64-NEXT:    sgtz t3, t3
+; RV64-NEXT:    sgtz t5, t5
+; RV64-NEXT:    addi t3, t3, -1
+; RV64-NEXT:    or t3, t3, t6
+; RV64-NEXT:    vmv.x.s t6, v11
+; RV64-NEXT:    sgtz t4, t4
+; RV64-NEXT:    addi t5, t5, -1
+; RV64-NEXT:    or t5, t5, t6
+; RV64-NEXT:    vmv.x.s t6, v10
+; RV64-NEXT:    sgtz s0, s0
+; RV64-NEXT:    addi t4, t4, -1
+; RV64-NEXT:    or t4, t4, t6
+; RV64-NEXT:    vmv.x.s t6, v9
+; RV64-NEXT:    addi s0, s0, -1
+; RV64-NEXT:    or t6, s0, t6
+; RV64-NEXT:    vmv.v.x v9, a1
+; RV64-NEXT:    vslide1down.vx v8, v8, t0
+; RV64-NEXT:    vslide1down.vx v9, v9, a0
+; RV64-NEXT:    vslide1down.vx v8, v8, t1
+; RV64-NEXT:    vslide1down.vx v9, v9, a2
+; RV64-NEXT:    vslide1down.vx v8, v8, t2
+; RV64-NEXT:    vslide1down.vx v9, v9, a3
+; RV64-NEXT:    vslide1down.vx v8, v8, t3
+; RV64-NEXT:    vslide1down.vx v9, v9, a4
+; RV64-NEXT:    vslide1down.vx v8, v8, t5
+; RV64-NEXT:    vslide1down.vx v9, v9, a5
+; RV64-NEXT:    vslide1down.vx v10, v8, t4
+; RV64-NEXT:    vslide1down.vx v8, v9, a6
+; RV64-NEXT:    vslide1down.vx v8, v8, a7
+; RV64-NEXT:    vslide1down.vx v9, v10, t6
+; RV64-NEXT:    vslidedown.vi v8, v9, 8, v0.t
+; RV64-NEXT:    ld s0, 8(sp) # 8-byte Folded Reload
+; RV64-NEXT:    .cfi_restore s0
+; RV64-NEXT:    addi sp, sp, 16
+; RV64-NEXT:    .cfi_def_cfa_offset 0
+; RV64-NEXT:    ret
+entry:
+  %c = call <16 x i8> @llvm.scmp(<16 x i8> zeroinitializer, <16 x i8> %a)
+  ret <16 x i8> %c
+}
+
+define <16 x i8> @scmp_i8z8(<16 x i8> %a) {
+; CHECK-LABEL: scmp_i8z8:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a0, 1
+; CHECK-NEXT:    vsetivli zero, 16, e8, m1, ta, ma
+; CHECK-NEXT:    vmin.vx v9, v8, a0
+; CHECK-NEXT:    vsra.vi v8, v8, 7
+; CHECK-NEXT:    vor.vv v8, v8, v9
+; CHECK-NEXT:    ret
+entry:
+  %c = call <16 x i8> @llvm.scmp(<16 x i8> %a, <16 x i8> zeroinitializer)
+  ret <16 x i8> %c
+}
+
+
+define <8 x i16> @scmp_i16i16(<8 x i16> %a, <8 x i16> %b) {
+; CHECK-LABEL: scmp_i16i16:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
+; CHECK-NEXT:    vmslt.vv v0, v9, v8
+; CHECK-NEXT:    vmv.v.i v10, 0
+; CHECK-NEXT:    vmerge.vim v10, v10, 1, v0
+; CHECK-NEXT:    vmslt.vv v0, v8, v9
+; CHECK-NEXT:    vmerge.vim v8, v10, -1, v0
+; CHECK-NEXT:    ret
+entry:
+  %c = call <8 x i16> @llvm.scmp(<8 x i16> %a, <8 x i16> %b)
+  ret <8 x i16> %c
+}
+
+define <8 x i16> @scmp_z16i16(<8 x i16> %a) {
+; CHECK-LABEL: scmp_z16i16:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, mu
+; CHECK-NEXT:    vslidedown.vi v10, v8, 5
+; CHECK-NEXT:    vsrl.vi v9, v8, 15
+; CHECK-NEXT:    vslidedown.vi v11, v8, 4
+; CHECK-NEXT:    vslidedown.vi v12, v8, 6
+; CHECK-NEXT:    vmv.x.s a0, v10
+; CHECK-NEXT:    vslidedown.vi v10, v8, 7
+; CHECK-NEXT:    vmv.x.s a1, v11
+; CHECK-NEXT:    vslidedown.vi v11, v8, 1
+; CHECK-NEXT:    vmv.x.s a2, v8
+; CHECK-NEXT:    vmv.x.s a3, v12
+; CHECK-NEXT:    vslidedown.vi v12, v8, 2
+; CHECK-NEXT:    vslidedown.vi v8, v8, 3
+; CHECK-NEXT:    vmv.x.s a4, v10
+; CHECK-NEXT:    vslidedown.vi v10, v9, 5
+; CHECK-NEXT:    vmv.x.s a5, v11
+; CHECK-NEXT:    vslidedown.vi v11, v9, 4
+; CHECK-NEXT:    vmv.x.s a6, v12
+; CHECK-NEXT:    vslidedown.vi v12, v9, 6
+; CHECK-NEXT:    sgtz a2, a2
+; CHECK-NEXT:    vmv.x.s a7, v9
+; CHECK-NEXT:    addi a2, a2, -1
+; CHECK-NEXT:    or a2, a2, a7
+; CHECK-NEXT:    vmv.x.s a7, v10
+; CHECK-NEXT:    vslidedown.vi v10, v9, 7
+; CHECK-NEXT:    sgtz a0, a0
+; CHECK-NEXT:    addi a0, a0, -1
+; CHECK-NEXT:    or a0, a0, a7
+; CHECK-NEXT:    vmv.x.s a7, v11
+; CHECK-NEXT:    vslidedown.vi v11, v9, 1
+; CHECK-NEXT:    sgtz a1, a1
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    or a1, a1, a7
+; CHECK-NEXT:    vmv.x.s a7, v12
+; CHECK-NEXT:    vslidedown.vi v12, v9, 2
+; CHECK-NEXT:    sgtz a3, a3
+; CHECK-NEXT:    sgtz a4, a4
+; CHECK-NEXT:    addi a3, a3, -1
+; CHECK-NEXT:    or a3, a3, a7
+; CHECK-NEXT:    vmv.x.s a7, v10
+; CHECK-NEXT:    sgtz a5, a5
+; CHECK-NEXT:    addi a4, a4, -1
+; CHECK-NEXT:    or a4, a4, a7
+; CHECK-NEXT:    vmv.x.s a7, v11
+; CHECK-NEXT:    addi a5, a5, -1
+; CHECK-NEXT:    or a5, a5, a7
+; CHECK-NEXT:    vmv.x.s a7, v8
+; CHECK-NEXT:    vslidedown.vi v8, v9, 3
+; CHECK-NEXT:    sgtz a6, a6
+; CHECK-NEXT:    vmv.v.x v9, a2
+; CHECK-NEXT:    vmv.x.s a2, v12
+; CHECK-NEXT:    sgtz a7, a7
+; CHECK-NEXT:    addi a6, a6, -1
+; CHECK-NEXT:    or a2, a6, a2
+; CHECK-NEXT:    vmv.x.s a6, v8
+; CHECK-NEXT:    addi a7, a7, -1
+; CHECK-NEXT:    or a6, a7, a6
+; CHECK-NEXT:    vmv.v.x v8, a1
+; CHECK-NEXT:    vslide1down.vx v9, v9, a5
+; CHECK-NEXT:    vslide1down.vx v8, v8, a0
+; CHECK-NEXT:    vslide1down.vx v9, v9, a2
+; CHECK-NEXT:    vmv.v.i v0, 15
+; CHECK-NEXT:    vslide1down.vx v8, v8, a3
+; CHECK-NEXT:    vslide1down.vx v8, v8, a4
+; CHECK-NEXT:    vslide1down.vx v9, v9, a6
+; CHECK-NEXT:    vslidedown.vi v8, v9, 4, v0.t
+; CHECK-NEXT:    ret
+entry:
+  %c = call <8 x i16> @llvm.scmp(<8 x i16> zeroinitializer, <8 x i16> %a)
+  ret <8 x i16> %c
+}
+
+define <8 x i16> @scmp_i16z16(<8 x i16> %a) {
+; CHECK-LABEL: scmp_i16z16:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a0, 1
+; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
+; CHECK-NEXT:    vmin.vx v9, v8, a0
+; CHECK-NEXT:    vsra.vi v8, v8, 15
+; CHECK-NEXT:    vor.vv v8, v8, v9
+; CHECK-NEXT:    ret
+entry:
+  %c = call <8 x i16> @llvm.scmp(<8 x i16> %a, <8 x i16> zeroinitializer)
+  ret <8 x i16> %c
+}
+
+
+define <4 x i32> @scmp_i32i32(<4 x i32> %a, <4 x i32> %b) {
+; CHECK-LABEL: scmp_i32i32:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vmslt.vv v0, v9, v8
+; CHECK-NEXT:    vmv.v.i v10, 0
+; CHECK-NEXT:    vmerge.vim v10, v10, 1, v0
+; CHECK-NEXT:    vmslt.vv v0, v8, v9
+; CHECK-NEXT:    vmerge.vim v8, v10, -1, v0
+; CHECK-NEXT:    ret
+entry:
+  %c = call <4 x i32> @llvm.scmp(<4 x i32> %a, <4 x i32> %b)
+  ret <4 x i32> %c
+}
+
+define <4 x i32> @scmp_z32i32(<4 x i32> %a) {
+; CHECK-LABEL: scmp_z32i32:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v9, v8, 1
+; CHECK-NEXT:    vsrl.vi v10, v8, 31
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    vslidedown.vi v11, v8, 2
+; CHECK-NEXT:    vslidedown.vi v8, v8, 3
+; CHECK-NEXT:    vmv.x.s a1, v9
+; CHECK-NEXT:    vslidedown.vi v9, v10, 1
+; CHECK-NEXT:    sgtz a0, a0
+; CHECK-NEXT:    vmv.x.s a2, v10
+; CHECK-NEXT:    sgtz a1, a1
+; CHECK-NEXT:    addi a0, a0, -1
+; CHECK-NEXT:    or a0, a0, a2
+; CHECK-NEXT:    vmv.x.s a2, v9
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    or a1, a1, a2
+; CHECK-NEXT:    vmv.x.s a2, v11
+; CHECK-NEXT:    vslidedown.vi v9, v10, 2
+; CHECK-NEXT:    sgtz a2, a2
+; CHECK-NEXT:    vmv.v.x v11, a0
+; CHECK-NEXT:    vmv.x.s a0, v9
+; CHECK-NEXT:    addi a2, a2, -1
+; CHECK-NEXT:    or a0, a2, a0
+; CHECK-NEXT:    vmv.x.s a2, v8
+; CHECK-NEXT:    vslidedown.vi v8, v10, 3
+; CHECK-NEXT:    sgtz a2, a2
+; CHECK-NEXT:    vslide1down.vx v9, v11, a1
+; CHECK-NEXT:    vmv.x.s a1, v8
+; CHECK-NEXT:    addi a2, a2, -1
+; CHECK-NEXT:    vslide1down.vx v8, v9, a0
+; CHECK-NEXT:    or a1, a2, a1
+; CHECK-NEXT:    vslide1down.vx v8, v8, a1
+; CHECK-NEXT:    ret
+entry:
+  %c = call <4 x i32> @llvm.scmp(<4 x i32> zeroinitializer, <4 x i32> %a)
+  ret <4 x i32> %c
+}
+
+define <4 x i32> @scmp_i32z32(<4 x i32> %a) {
+; CHECK-LABEL: scmp_i32z32:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a0, 1
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vmin.vx v9, v8, a0
+; CHECK-NEXT:    vsra.vi v8, v8, 31
+; CHECK-NEXT:    vor.vv v8, v8, v9
+; CHECK-NEXT:    ret
+entry:
+  %c = call <4 x i32> @llvm.scmp(<4 x i32> %a, <4 x i32> zeroinitializer)
+  ret <4 x i32> %c
+}
+
+
+define <2 x i64> @scmp_i64i64(<2 x i64> %a, <2 x i64> %b) {
+; CHECK-LABEL: scmp_...
[truncated]

@camel-cdr camel-cdr marked this pull request as draft August 1, 2025 19:21
@camel-cdr camel-cdr changed the title [RISCV] custom scmp(x,0) and scmp(0,x) lowering for RVV [draft] [RISCV] custom scmp(x,0) and scmp(0,x) lowering for RVV Aug 1, 2025
} else if (ISD::isConstantSplatVectorAllZeros(LHS.getNode())) {
// scmp(0, rhs) -> vmerge.vi(vmsgt.vi(rhs,0), vsrl.vi/vx(rhs,SEW-1), -1)
SDValue Srl = DAG.getNode(ISD::SRL, DL, VT, RHS, Shift);
return DAG.getSelectCC(DL, RHS, Zero, Srl, MinusOne, ISD::SETGT);
Collaborator

@topperc topperc Aug 1, 2025

Don't use getSelectCC for vectors. It creates ISD::SELECT_CC which is always scalarized for fixed vectors. Make a separate setcc and select.

Author

Ok, I tried:

SDValue Setcc = DAG.getSetCC(DL, VT, LHS, Zero, ISD::SETGT);
return DAG.getSelect(DL, VT, Setcc, Sra, One);

but that didn't seem to work:

LLVM ERROR: Cannot select: t19: nxv2i64 = vselect t18, t16, t22
  t18: nxv2i64 = setcc t2, t24, setgt:ch
    t2: nxv2i64,ch = CopyFromReg t0, Register:nxv2i64 %0
    t24: nxv2i64 = RISCVISD::VMV_V_X_VL undef:nxv2i64, Constant:i64<0>, Register:i64 $x0
  t16: nxv2i64 = sra t2, t23
    t2: nxv2i64,ch = CopyFromReg t0, Register:nxv2i64 %0
    t23: nxv2i64 = RISCVISD::VMV_V_X_VL undef:nxv2i64, Constant:i64<63>, Register:i64 $x0
  t22: nxv2i64 = RISCVISD::VMV_V_X_VL undef:nxv2i64, Constant:i64<1>, Register:i64 $x0
In function: scmp_i64z64

I also tried using ISD::VSELECT instead of getSelect, but that didn't work either; not sure what else I can do.

Collaborator

You need to use an i1 vector type for the setcc result. You can call getSetCCResultType.
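Concretely, that suggestion might look something like this (an untested sketch reusing the names from the snippet above; getSetCCResultType is the standard TargetLowering hook):

    EVT CCVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
    SDValue Setcc = DAG.getSetCC(DL, CCVT, LHS, Zero, ISD::SETGT);
    return DAG.getSelect(DL, VT, Setcc, Sra, One);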

@topperc
Copy link
Collaborator

topperc commented Aug 1, 2025

Please put the tests in a separate commit within the PR so we can see the original codegen and the change.

}
// scmp(lhs, 0) -> vmerge.vi(vmsgt.vi(rhs,0), vsra.vx(lhs,SEW-1), 1)
return DAG.getSelectCC(DL, LHS, Zero, Sra, One, ISD::SETGT);
} else if (ISD::isConstantSplatVectorAllZeros(LHS.getNode())) {
Collaborator

No else after return.
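In other words, the second branch can be a plain if, since the first branch always returns (structural sketch only, bodies elided):

    if (ISD::isConstantSplatVectorAllZeros(RHS.getNode())) {
      // scmp(lhs, 0) lowering; every path here ends in a return.
    }
    if (ISD::isConstantSplatVectorAllZeros(LHS.getNode())) {
      // scmp(0, rhs) lowering.
    }
    return SDValue();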

@camel-cdr camel-cdr force-pushed the rvv_scmp branch 2 times, most recently from ca76290 to cad7e00 on August 1, 2025 21:02
@camel-cdr
Author

camel-cdr commented Aug 1, 2025

I initially had a mistake in the conversion; it should now match the alive2 code.

The current codegen for scmp(x,0) and scmp(0,x), also known as sign(x)
and -sign(x), isn't optimal for RVV.
It produces a four-instruction sequence
    vmsgt.vi + vmslt.vi + vmerge.vim + vmerge.vim
for SEW<=32 and three instructions for SEW=64.
    scmp(0,x): vmsgt.vi + vsra.vx + vor.vi
    scmp(x,0): vmsgt.vi + vsrl.vx + vmerge.vim

This patch introduces a new lowering for all values of SEW which
expresses the above in SelectionDAG Nodes.
This maps to two arithmetic instructions and a vector register move:
    scmp(0,x): vmv.v.i/v + vmsgt.vi + masked vsra.vi/vx
    scmp(x,0): vmv.v.i/v + vmsgt.vi + masked vsrl.vi/vx
These clobber v0, need to have a different destination than the input
and need to use an additional GPR for SEW=64.
For the SEW<=32 scmp(x,0) case a slightly different
lowering was chosen:
    scmp(x,0): vmin.vx + vsra.vi + vor.vv
This doesn't clobber v0, but uses a single GPR.
We deemed using a single GPR slightly better than clobbering v0
(SEW<=32), and using two GPRs worse than using one GPR and
clobbering v0.
@camel-cdr camel-cdr marked this pull request as ready for review August 1, 2025 21:47