Add support for Classify, CompressStore, ExpandLoad, MaskLoad, MaskStore, and MoveMask #116708

Merged
merged 11 commits into from
Jun 18, 2025

Conversation

tannergooding
Member

@tannergooding tannergooding commented Jun 16, 2025

This makes additional progress towards #87097. What remains are the Test, Gather, and Scatter APIs.

This provides support for:

  • Classify - allows identifying the floating-point kind (nan, subnormal, infinity, finite, zero, negative, etc)
  • CompressStore - the store counterpart to Compress, allowing sequential writing of selected elements
  • ExpandLoad - the load counterpart to Expand, allowing sequential reading of elements to selected positions
  • MaskLoad - allows reading only the selected element positions (suppresses faults)
  • MaskStore - allows writing only the selected element positions (suppresses faults)
  • MoveMask - allows extracting the most significant bits from an AVX512 kmask register
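
The difference between the sequential (CompressStore/ExpandLoad) and positional (MaskLoad/MaskStore) memory operations listed above can be sketched with a scalar reference model. This is only an illustration of the element-level behaviour as described in the list, not the runtime's or the JIT's implementation. The function names, the fixed 8-lane width, and the `merge` parameter (the value kept in unselected lanes) are assumptions made for the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model: 8 lanes of 32-bit elements, one mask bit per lane.
constexpr std::size_t kLanes = 8;
using Vec = std::array<std::int32_t, kLanes>;

inline bool selected(std::uint8_t mask, std::size_t lane) {
    return ((mask >> lane) & 1u) != 0;
}

// CompressStore model: selected lanes are written to dst back to back.
// Returns the number of elements written (the popcount of the mask).
inline std::size_t compress_store(std::int32_t* dst, std::uint8_t mask, const Vec& src) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (selected(mask, i)) {
            dst[written++] = src[i];
        }
    }
    assert(written <= kLanes); // never more than one full vector
    return written;
}

// ExpandLoad model: consecutive elements are read from src and placed into
// the selected lanes; unselected lanes keep the value from merge (assumed).
inline Vec expand_load(const std::int32_t* src, std::uint8_t mask, const Vec& merge) {
    Vec result = merge;
    std::size_t read = 0;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (selected(mask, i)) {
            result[i] = src[read++];
        }
    }
    return result;
}

// MaskLoad model: lane i is read from src[i] only when selected, so memory
// behind unselected positions is never accessed.
inline Vec mask_load(const std::int32_t* src, std::uint8_t mask, const Vec& merge) {
    Vec result = merge;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (selected(mask, i)) {
            result[i] = src[i];
        }
    }
    return result;
}

// MaskStore model: lane i is written to dst[i] only when selected.
inline void mask_store(std::int32_t* dst, std::uint8_t mask, const Vec& src) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (selected(mask, i)) {
            dst[i] = src[i];
        }
    }
}
```

In this model, the compress/expand pair advances through memory by one element per selected lane, while the mask pair always addresses memory at the lane's own index.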

This also covers the various AVX512 intrinsic variants for 128/256-bit paths where that intrinsic guarantees mask usage. For example, Sse41.BlendVariable uses xmm0 for the mask while Avx512F.VL.BlendVariable uses k1. This allows devs to force kmask usage for 128/256-bit code paths.
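
To make the vector-mask versus kmask distinction concrete, here is a scalar sketch in C++. It assumes the 32-bit element case, where a vector-register mask selects by the most significant bit of each element and a kmask carries one bit per element. The names `move_mask`, `blend_vector_mask`, and `blend_kmask` are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a 128-bit vector holding four 32-bit elements.
constexpr std::size_t kLanes4 = 4;
using Vec4 = std::array<std::int32_t, kLanes4>;

// Collect the most significant bit of each element into a compact bitmask,
// one bit per lane. For a signed 32-bit value the top bit is set exactly
// when the value is negative.
inline std::uint8_t move_mask(const Vec4& v) {
    std::uint8_t bits = 0;
    for (std::size_t i = 0; i < kLanes4; ++i) {
        if (v[i] < 0) {
            bits = static_cast<std::uint8_t>(bits | (1u << i));
        }
    }
    return bits;
}

// Blend driven by a vector-register mask: the top bit of each mask element
// decides whether the lane comes from right (set) or left (clear).
inline Vec4 blend_vector_mask(const Vec4& left, const Vec4& right, const Vec4& mask) {
    Vec4 result{};
    for (std::size_t i = 0; i < kLanes4; ++i) {
        result[i] = (mask[i] < 0) ? right[i] : left[i];
    }
    return result;
}

// Blend driven by a kmask-style bitmask: bit i decides lane i.
inline Vec4 blend_kmask(const Vec4& left, const Vec4& right, std::uint8_t k) {
    assert((k >> kLanes4) == 0); // only the low four bits are meaningful here
    Vec4 result{};
    for (std::size_t i = 0; i < kLanes4; ++i) {
        result[i] = ((k >> i) & 1u) ? right[i] : left[i];
    }
    return result;
}
```

Under this model, `blend_kmask(left, right, move_mask(mask))` yields the same lanes as `blend_vector_mask(left, right, mask)`; the two forms differ in where the selection bits live, not in which elements are chosen.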

@tannergooding
Member Author

As is typical, the JIT side changes are small at around 260 lines.

The bulk of the change is the managed API surface since it requires defining the managed signatures + comments and duplicating it across 2 files for each API.

The remaining 1100 lines are the test updates and the new test template (most of that being the test template).

@tannergooding tannergooding marked this pull request as ready for review June 16, 2025 18:44
@tannergooding tannergooding requested review from Copilot and EgorBo June 16, 2025 18:44
@tannergooding tannergooding added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and avx512 (Related to the AVX-512 architecture) labels and removed the area-System.Runtime.Intrinsics label Jun 16, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds support for several new AVX512 hardware intrinsic APIs including Classify, CompressStore, ExpandLoad, MaskLoad, MaskStore, and MoveMask to facilitate more fine‐grained vector operations and improved intrinsics support on 128/256-bit paths.

  • Adds new test methods in the HardwareIntrinsics test suite to validate the new API behavior.
  • Updates the System.Private.CoreLib intrinsic implementations and corresponding JIT lowering/codegen logic to incorporate the new intrinsics with proper EVEX embedded mask handling.
  • Modifies the HW intrinsic lists and lower/emit methods to ensure the new operations are recognized and processed correctly.

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/tests/JIT/HardwareIntrinsics/X86/Shared/SseVerify.cs | New MoveMask overloads added for generic, float, and double arrays. |
| src/tests/JIT/HardwareIntrinsics/X86/Shared/LoadTernOpTest.template | New test cases introduced to exercise the new intrinsic APIs. |
| src/tests/JIT/HardwareIntrinsics/X86/Shared/Avx512Verify.cs | New intrinsic APIs added for Classify, MaskLoad, and MaskStore operations. |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Vbmi2.cs | Added overloads for CompressStore and ExpandLoad and updated corresponding documentation comments. |
| src/coreclr/jit/lowerxarch.cpp, hwintrinsicxarch.cpp, and related files | Extended JIT lowering and emitter logic to support the new AVX512 intrinsic cases and EVEX embedded masking. |
| src/coreclr/jit/hwintrinsic*.{cpp,h} | Adjusted intrinsic lookup and lists to include the new intrinsic IDs. |

Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@@ -711,8 +711,8 @@ INST3(vcmppd, "cmppd", IUM_WR, BAD_CODE, BAD_
 INST3(vcmpps, "cmpps", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0xC2), INS_TT_FULL, Input_32Bit | KMask_Base4 | REX_W0 | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare packed singles
 INST3(vcmpsd, "cmpsd", IUM_WR, BAD_CODE, BAD_CODE, SSEDBL(0xC2), INS_TT_TUPLE1_SCALAR, Input_64Bit | KMask_Base1 | REX_W1 | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare scalar doubles
 INST3(vcmpss, "cmpss", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0xC2), INS_TT_TUPLE1_SCALAR, Input_32Bit | KMask_Base1 | REX_W0 | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare scalar singles
-INST3(vcompresspd, "compresspd", IUM_WR, SSE38(0x8A), BAD_CODE, BAD_CODE, INS_TT_TUPLE1_SCALAR, Input_64Bit | KMask_Base2 | REX_W1 | Encoding_EVEX) // Store sparse packed doubles into dense memory
-INST3(vcompressps, "compressps", IUM_WR, SSE38(0x8A), BAD_CODE, BAD_CODE, INS_TT_TUPLE1_SCALAR, Input_32Bit | KMask_Base4 | REX_W0 | Encoding_EVEX) // Store sparse packed singles into dense memory
+INST3(vcompresspd, "compresspd", IUM_WR, SSE38(0x8A), BAD_CODE, BAD_CODE, INS_TT_FULL_MEM, Input_64Bit | REX_W1 | Encoding_EVEX) // Store sparse packed doubles into dense memory
Member Author

The formal definition of these in the hardware manuals is INS_TT_TUPLE1_SCALAR. However, for the purposes of how the JIT uses this information, it should be INS_TT_FULL_MEM.

We use this for both containment purposes and for disassembly output. While compress/expand can touch as few as 0 bytes of memory, they can also touch as much as a full vector of memory. We can't statically know how much they'll touch and so we want to presume they could touch the whole amount.

Similarly, we don't want to specify the KMask_Base* amount since we can't automatically support embedded masking. Developers wanting the masking support need to use the explicit CompressStore/ExpandLoad APIs rather than ConditionalSelect + Compress/Expand.
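
The memory-footprint argument above can be illustrated with a small calculation. The sketch below is not JIT code; `bytes_touched` and `max_bytes_touched` are hypothetical helpers that show why a mask known only at run time forces a conservative full-vector assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes a compress/expand memory access touches for a
// given run-time mask. Each selected lane contributes one element.
inline std::size_t bytes_touched(std::uint64_t mask, std::size_t lanes, std::size_t element_size) {
    assert(lanes <= 64);
    std::size_t selected = 0;
    for (std::size_t i = 0; i < lanes; ++i) {
        selected += static_cast<std::size_t>((mask >> i) & 1u);
    }
    return selected * element_size;
}

// Conservative bound that needs no knowledge of the mask: every lane selected.
inline std::size_t max_bytes_touched(std::size_t lanes, std::size_t element_size) {
    return lanes * element_size;
}
```

For eight 64-bit lanes, `bytes_touched` ranges from 0 (empty mask) to 64 (all lanes selected), and only `max_bytes_touched` is available when the mask is not a compile-time constant.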

Member

@jakobbotsch jakobbotsch left a comment

JIT changes LGTM

@tannergooding
Member Author

/ba-g unrelated dead letter failure for ios/tvos

@tannergooding tannergooding merged commit 76490ed into dotnet:main Jun 18, 2025
153 of 158 checks passed
@tannergooding tannergooding deleted the fix-87097 branch June 18, 2025 21:03
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture