Add support for Classify, CompressStore, ExpandLoad, MaskLoad, MaskStore, and MoveMask #116708
Conversation
As is typical, the JIT side changes are small at around 260 lines. The bulk of the change is the managed API surface, since it requires defining the managed signatures + comments and duplicating them across 2 files for each API. The remaining 1100 lines are the test updates and new test template (most of that being the test template).
Pull Request Overview
This PR adds support for several new AVX512 hardware intrinsic APIs including Classify, CompressStore, ExpandLoad, MaskLoad, MaskStore, and MoveMask to facilitate more fine‐grained vector operations and improved intrinsics support on 128/256-bit paths.
- Adds new test methods in the HardwareIntrinsics test suite to validate the new API behavior.
- Updates the System.Private.CoreLib intrinsic implementations and corresponding JIT lowering/codegen logic to incorporate the new intrinsics with proper EVEX embedded mask handling.
- Modifies the HW intrinsic lists and lower/emit methods to ensure the new operations are recognized and processed correctly.
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/tests/JIT/HardwareIntrinsics/X86/Shared/SseVerify.cs | New MoveMask overloads added for generic, float, and double arrays. |
src/tests/JIT/HardwareIntrinsics/X86/Shared/LoadTernOpTest.template | New test cases introduced to exercise the new intrinsic APIs. |
src/tests/JIT/HardwareIntrinsics/X86/Shared/Avx512Verify.cs | New intrinsic APIs added for Classify, MaskLoad, and MaskStore operations. |
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Vbmi2.cs | Added overloads for CompressStore and ExpandLoad and updated corresponding documentation comments. |
src/coreclr/jit/lowerxarch.cpp, hwintrinsicxarch.cpp, and related files | Extended JIT lowering and emitter logic to support the new AVX512 intrinsic cases and EVEX embedded masking. |
src/coreclr/jit/hwintrinsic*.{cpp,h} | Adjusted intrinsic lookup and lists to include the new intrinsic IDs. |
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
```diff
@@ -711,8 +711,8 @@
 INST3(vcmpps, "cmpps", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0xC2), INS_TT_FULL, Input_32Bit | KMask_Base4 | REX_W0 | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare packed singles
 INST3(vcmpsd, "cmpsd", IUM_WR, BAD_CODE, BAD_CODE, SSEDBL(0xC2), INS_TT_TUPLE1_SCALAR, Input_64Bit | KMask_Base1 | REX_W1 | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare scalar doubles
 INST3(vcmpss, "cmpss", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0xC2), INS_TT_TUPLE1_SCALAR, Input_32Bit | KMask_Base1 | REX_W0 | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare scalar singles
-INST3(vcompresspd, "compresspd", IUM_WR, SSE38(0x8A), BAD_CODE, BAD_CODE, INS_TT_TUPLE1_SCALAR, Input_64Bit | KMask_Base2 | REX_W1 | Encoding_EVEX) // Store sparse packed doubles into dense memory
-INST3(vcompressps, "compressps", IUM_WR, SSE38(0x8A), BAD_CODE, BAD_CODE, INS_TT_TUPLE1_SCALAR, Input_32Bit | KMask_Base4 | REX_W0 | Encoding_EVEX) // Store sparse packed singles into dense memory
+INST3(vcompresspd, "compresspd", IUM_WR, SSE38(0x8A), BAD_CODE, BAD_CODE, INS_TT_FULL_MEM, Input_64Bit | REX_W1 | Encoding_EVEX) // Store sparse packed doubles into dense memory
```
The formal definition of these from the hardware manuals is `INS_TT_TUPLE1_SCALAR`. However, for the purposes of how the JIT uses this information it should be `INS_TT_FULL_MEM`.

We use this for both containment purposes and for disassembly output. While `compress`/`expand` can touch as few as 0 bytes of memory, they can also touch as much as a full vector of memory. We can't statically know how much they'll touch, so we want to presume they could touch the whole amount.

Similarly, we don't want to specify the `KMask_Base*` amount since we can't automatically support embedded masking. Developers wanting the masking support need to use the explicit `CompressStore`/`ExpandLoad` APIs rather than `ConditionalSelect` + `Compress`/`Expand`.
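To make the distinction concrete, here is a rough C# sketch (not code from the PR; the `Compress`/`CompressStore` overload shapes shown are assumptions used only for illustration): compressing in a register and then storing always writes a full vector of memory, whereas the explicit `CompressStore` form writes only as many elements as the mask selects.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class CompressStoreSketch
{
    // Register form + plain store: the store always writes a full 64 bytes,
    // so there is no partial or fault-suppressed write to memory here.
    public static void ViaCompressThenStore(float* destination, Vector512<float> mask, Vector512<float> source)
    {
        Vector512<float> packed = Avx512F.Compress(mask, source); // assumed signature
        Avx512F.Store(destination, packed);
    }

    // Explicit memory form added by this PR (overload shape assumed): the
    // hardware may touch anywhere from 0 bytes up to a full vector, which is
    // why the instruction table entry uses INS_TT_FULL_MEM and drops the
    // KMask_Base* flag; embedded masking is not applied automatically here.
    public static void ViaCompressStore(float* destination, Vector512<float> mask, Vector512<float> source)
    {
        Avx512F.CompressStore(destination, mask, source); // assumed signature
    }
}
```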
JIT changes LGTM
/ba-g unrelated dead letter failure for ios/tvos
This makes additional progress towards #87097. What remains is the `Test`, `Gather`, and `Scatter` APIs.

This provides support for:
- `Classify` - allows identifying the floating-point kind (nan, subnormal, infinity, finite, zero, negative, etc)
- `CompressStore` - the store counterpart to `Compress`, allowing sequential writing of selected elements
- `ExpandLoad` - the load counterpart to `Expand`, allowing sequential reading of elements to selected positions
- `MaskLoad` - allows reading only the selected element positions (suppresses faults)
- `MaskStore` - allows writing only the selected element positions (suppresses faults)
- `MoveMask` - allows extracting the most significant bits from an AVX512 kmask register

This also covers the various AVX512 intrinsic variants for 128/256-bit paths where that intrinsic guarantees mask usage. For example, `Sse41.BlendVariable` uses `xmm0` for the mask while `Avx512F.VL.BlendVariable` uses `k1`. This allows devs to force kmask usage for 128/256-bit code paths.
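To give a feel for how this surface area might be consumed, here is a hypothetical usage sketch. Only the API names come from this PR; the overload shapes and return types are assumptions modeled on the existing `Avx`/`Avx2` `MaskLoad`/`MaskStore` and `MoveMask` patterns.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class Avx512MaskUsageSketch
{
    // Reads and writes only the lanes selected by the mask; unselected lanes
    // are not touched and any faults they would cause are suppressed
    // (overload shapes assumed).
    public static void CopySelected(float* source, float* destination, Vector512<float> mask)
    {
        Vector512<float> selected = Avx512F.MaskLoad(source, mask);  // assumed signature
        Avx512F.MaskStore(destination, mask, selected);              // assumed signature
    }

    // Produces one bit per element from the mask, analogous to Avx.MoveMask
    // for ymm vectors but sourced from the kmask form (return type assumed).
    public static int MaskBits(Vector512<float> mask)
    {
        return Avx512F.MoveMask(mask);                               // assumed signature
    }
}
```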