Skip to content

Conversation

madhav-madhusoodanan
Copy link
Contributor

Context

This is a redo of PR #1814, since a lot of details have changed with PRs #1863, #1862, #1861, #1852.

r? @folkertdev
cc: @Amanieu

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch 2 times, most recently from 41db5a8 to 5fc0f3b Compare August 5, 2025 10:16
Comment on lines +199 to +407
match str::parse::<u32>(etype_processed.as_str()) {
Ok(value) => data.bit_len = Some(value),
Err(_) => {
data.bit_len = match data.kind() {
TypeKind::Char(_) => Some(8),
TypeKind::BFloat => Some(16),
TypeKind::Int(_) => Some(32),
TypeKind::Float => Some(32),
_ => None,
};
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are only some type kinds covered here? Maybe this could be a method on TypeKind?

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch from 9e28106 to 111cd5d Compare August 5, 2025 16:22
Copy link
Contributor

@folkertdev folkertdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should rebase on top of the upstream master branch instead of merging it in. That keeps the git history clean.

Comment on lines 188 to 204
x86_64-unknown-linux-gnu*)
CPPFLAGS="${TEST_CPPFLAGS}" RUSTFLAGS="${HOST_RUSTFLAGS}" RUST_LOG=warn \
cargo run "${INTRINSIC_TEST}" "${PROFILE}" \
--bin intrinsic-test -- intrinsics_data/x86-intel.xml \
--runner "${TEST_RUNNER}" \
--cppcompiler "${TEST_CXX_COMPILER}" \
--target "${TARGET}"
;;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll have to see how to do this exactly, but we want to split these out of the main CI job to speed it up

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch from f3f87f2 to 2ec747c Compare August 9, 2025 12:20
@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch from 2ec747c to a8313d0 Compare September 5, 2025 08:16
@madhav-madhusoodanan
Copy link
Contributor Author

Seems like the CI run at this point failed due to this error. I'll retry shortly:

#6 39.45   Could not connect to archive.ubuntu.com:80 (185.125.190.82), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.83), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.83), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.81), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.81), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.82), connection timed out
#6 39.45   Unable to connect to archive.ubuntu.com:80:
#6 39.45 Fetched 126 kB in 39s (3208 B/s)
#6 39.45 Reading package lists...
#6 39.46 W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/questing/InRelease  Unable to connect to archive.ubuntu.com:80:
#6 39.46 W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/questing-updates/InRelease  Unable to connect to archive.ubuntu.com:80:
#6 39.46 W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/questing-backports/InRelease  Unable to connect to archive.ubuntu.com:80:
#6 39.46 W: Some index files failed to download. They have been ignored, or old ones used instead.

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch 13 times, most recently from 126a3ce to e79129a Compare September 10, 2025 05:58
@folkertdev
Copy link
Contributor

Can you rebase this (and remove the merge commits) sometime? That would make it a lot easier to see what has actually changed.

Also, why do you push to then have CI fail? Running this locally is a lot faster than having CI do it, because you can skip the earlier steps and run just the intrinsic tests.

Copy link
Contributor

@folkertdev folkertdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, wait, it probably is up to date? it's just that xml file is enormous. that's crazy. Should we like, try to trim that down?

@madhav-madhusoodanan
Copy link
Contributor Author

@folkertdev I've been syncing using the below command whenever I'm updating from master:

git rebase master

Am I doing it correctly?

@madhav-madhusoodanan
Copy link
Contributor Author

About the XML file specification, may I ask what "trim down" means?

@folkertdev
Copy link
Contributor

depends a bit on your git setup, but it seems to work allright. I was just confused by the enormous number of lines changed, but that's all due to that XML file.

@folkertdev
Copy link
Contributor

What i mean is, could we process that XML file into a smaller XML file that only stores what we need? That would reduce the size of the repo (not sure how big that file is, maybe it compresses well?) and the speed of the intrinsic tests.

There are risks too, e.g. what we generate could go out of date with the official XML file. So maybe it's fine this way.

@madhav-madhusoodanan
Copy link
Contributor Author

madhav-madhusoodanan commented Sep 10, 2025

Ahh, makes sense.

I think it might be best to keep the source of truth as unchanged as possible, since there is no direct way to obtain the sources sometimes.

For example, the XML file originally existed in the stdarch-verify crate, and I've had to check it by manually downloading the x86 reference (which comes in a folder containing HTML, CSS and JS files) at the link called Download: Offline Intel® Intrinsics Guide in the Intel intrinsics reference site

The entire XML data can be found stored as a hardcoded string that is assigned to a variable in a JS file.

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch 3 times, most recently from c9ea3bd to 18aed56 Compare October 15, 2025 06:45
@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch from 18aed56 to 5162772 Compare October 15, 2025 07:03
@madhav-madhusoodanan
Copy link
Contributor Author

madhav-madhusoodanan commented Oct 15, 2025

@folkertdev I've added the ability to sample intrinsics. Currently it's implemented for x86 and about 5% of intrinsics are randomly chosen for testing.

This is intended as a temporary solution really, we can remove it once we restructure the C++/Rust testfiles to run all intrinsics in one go.

@folkertdev
Copy link
Contributor

Neat!

So, this does increase the longest CI job to 20 minutes, up from 12 minutes. I think we should try to make this its own CI job at least. I'm also not sure that we want to randomly sample the tests on each run. If we do hit a failure, that will be hard to reproduce. The status quo is no testing at all, so I think testing even a fixed subset is an improvement.

@madhav-madhusoodanan there is already a version of the x86-intel.xml file in crates/stdarch-verify, you can just move that file to intrinsics-data (and change the path in the crates/stdarch-verify/tests/x86-intel.rs)

did this happen? It still looks like the file is added again here.

@madhav-madhusoodanan
Copy link
Contributor Author

Ohh no, let me do that too.

@madhav-madhusoodanan
Copy link
Contributor Author

About the hit failure, hmm yes you have a point.
Let's do a naive partitioning in that case.

to char 2. including variable names in template strings instead of
passing them as arguments to macros
@madhav-madhusoodanan
Copy link
Contributor Author

I'll need to rename the variables in arm/config.rs and x86/config.rs. I'll do that.

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch from 524bc52 to 6df5eca Compare October 16, 2025 07:01
@madhav-madhusoodanan
Copy link
Contributor Author

madhav-madhusoodanan commented Oct 16, 2025

Okay, so I've split the intrinsic-test run onto separate CI processes.
The x86 run of the same is the longest among them and takes 10 mins.

Now the longest process among all the CI processes, is the Test CI process for x86 which takes 12 mins.

Copy link
Contributor

@folkertdev folkertdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, but the CI changes are quite large, so could you split those out into their own PR (and then just add the x86 stuff here)?

ci/run.sh Outdated
Comment on lines 94 to 95
PATH="$PATH":"$(pwd)"/c_programs
export PATH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? it really should not be I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so it turns out that for aarch64, qemu's executable resolution algorithm looks into the current directory too.
However, sde64 looks only at PATH.

This was intended to fix that discrepancy, but now that I think about it, it might be cleaner to use ./intrinsic-test-programs in compare_outputs.rs.

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch from 6df5eca to 1abb103 Compare October 17, 2025 16:57
Comment on lines +41 to +47
uint16_t temp = 0;
memcpy(&temp, &value, sizeof(float16_t));
std::stringstream ss;
ss << "0x" << std::setfill('0') << std::setw(4) << std::hex << temp;
os << ss.str();
return os;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@folkertdev do we need stringstream here? Seems to me that we can make this definition shorter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My c++ is not good enough to know for sure. If there is something shorter that works I'm all for it

@madhav-madhusoodanan
Copy link
Contributor Author

Okay, so I've reverted that last one commit (that split the CI job) and the PATH update.
I'll make that PR as soon as we merge this one.

@folkertdev
Copy link
Contributor

Maybe this was not clear but my intention was to merge the ci changes separately first, them rebase this pr on top

@madhav-madhusoodanan
Copy link
Contributor Author

madhav-madhusoodanan commented Oct 18, 2025

Ohh ohh, my bad.
Sure let me do that.

Created a PR for that: #1941

@madhav-madhusoodanan
Copy link
Contributor Author

Turns out std::hex is sticky for the stream that it is applied on.
So stringstream's usage actually makes sense.

@madhav-madhusoodanan madhav-madhusoodanan force-pushed the intrinsic-test-x86-addition branch from 212e16d to 1abb103 Compare October 18, 2025 07:13
// compile this cpp file into a .o file
info!("compiling main.cpp");
trace!("compiling main.cpp");
let output = cpp_compiler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble running this locally without the intrinsic sampling on, I get an apparent segfault at this clang invocation. Were you seeing that too?

Copy link
Contributor Author

@madhav-madhusoodanan madhav-madhusoodanan Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show me the stdout/stderr? Perhaps I could help

indentation: Indentation,
loads: u32,
) -> std::io::Result<()> {
for arg in self.iter().filter(|&arg| !arg.has_constraint()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note that arguments of the same type seem to have identical _vals arrays. Not required for this but as a future optimisation we could just emit one array for each unique argument type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's something I want to work on (when this PR is merged)


pub fn map_constraints(imm_type: &String, imm_width: u32) -> Option<Constraint> {
if imm_width > 0 {
let max: i64 = 2i64.pow(imm_width);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some x86 intrinsics have a lot of possible immediate values, for example _mm256_fpclass_pd_mask has an immediate of width 8 giving 256*20 loop iterations just for that one intrinsic.

We never really saw so many constraint values on aarch64 - perhaps testing every possible constraint value isn't a reasonable strategy here

RUST_LOG=warn RUST_BACKTRACE=1 \
cargo run "${INTRINSIC_TEST}" "${PROFILE}" \
--bin intrinsic-test -- intrinsics_data/x86-intel.xml \
--runner "${TEST_RUNNER}" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest trying just this tool with qemu-x86_64 -cpu=max rather than intel SDE which I expect would run a lot quicker, potentially at the cost of less coverage

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qemu does not support (most of?) avx-512, which is where most of the complexity is. So we'll have to split that out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which I'd rather not do in this PR

Copy link
Contributor

@adamgemmell adamgemmell Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, none of my comments should be blocking - this PR is useful without them

Maybe the situation is better today - these are the qemu cpu flags containing avx from 10.1 which is included in 25.10:

avx avx-ifma
avx-ne-convert avx-vnni avx-vnni-int16 avx-vnni-int8 avx10 avx10-128
avx10-256 avx10-512 avx2 avx512-4fmaps avx512-4vnniw avx512-bf16
avx512-fp16 avx512-vp2intersect avx512-vpopcntdq avx512bitalg avx512bw
avx512cd avx512dq avx512er avx512f avx512ifma avx512pf avx512vbmi
avx512vbmi2 avx512vl avx512vnni

And here's what lscpu reports with -cpu=max

Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch

QEMU is pretty quick to build to build a single backend from source if we wanted something newer than what's bundled with ubuntu too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Qemu-user supports avx512 with only the kvm backend, but as we run it in docker in CI, we don't have kvm. Idk about qemu-system though, it should be the same

@madhav-madhusoodanan
Copy link
Contributor Author

madhav-madhusoodanan commented Oct 20, 2025

Just so that the review process is easier, would it make sense to break down the ARM and common changes into their own PRs?

A couple of them were dependent on the changes that arose to accommodate x86 in an attempt to maintain a clean separation of concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants