-
Notifications
You must be signed in to change notification settings - Fork 299
intrinsic-test
: Adding x86 behavioural testing.
#1894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
intrinsic-test
: Adding x86 behavioural testing.
#1894
Conversation
41db5a8
to
5fc0f3b
Compare
match str::parse::<u32>(etype_processed.as_str()) { | ||
Ok(value) => data.bit_len = Some(value), | ||
Err(_) => { | ||
data.bit_len = match data.kind() { | ||
TypeKind::Char(_) => Some(8), | ||
TypeKind::BFloat => Some(16), | ||
TypeKind::Int(_) => Some(32), | ||
TypeKind::Float => Some(32), | ||
_ => None, | ||
}; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are only some type kinds covered here? Maybe this could be a method on TypeKind
?
9e28106
to
111cd5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should rebase on top of the upstream master branch instead of merging it in. That keeps the git history clean.
x86_64-unknown-linux-gnu*) | ||
CPPFLAGS="${TEST_CPPFLAGS}" RUSTFLAGS="${HOST_RUSTFLAGS}" RUST_LOG=warn \ | ||
cargo run "${INTRINSIC_TEST}" "${PROFILE}" \ | ||
--bin intrinsic-test -- intrinsics_data/x86-intel.xml \ | ||
--runner "${TEST_RUNNER}" \ | ||
--cppcompiler "${TEST_CXX_COMPILER}" \ | ||
--target "${TARGET}" | ||
;; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll have to see how to do this exactly, but we want to split these out of the main CI job to speed it up
f3f87f2
to
2ec747c
Compare
2ec747c
to
a8313d0
Compare
Seems like the CI run at this point failed due to this error. I'll retry shortly:
|
126a3ce
to
e79129a
Compare
Can you rebase this (and remove the merge commits) sometime? That would make it a lot easier to see what has actually changed. Also, why do you push to then have CI fail? Running this locally is a lot faster than having CI do it, because you can skip the earlier steps and run just the intrinsic tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, wait, it probably is up to date? it's just that xml file is enormous. that's crazy. Should we like, try to trim that down?
@folkertdev I've been syncing using the below command whenever I'm updating from master: git rebase master Am I doing it correctly? |
About the XML file specification, may I ask what "trim down" means? |
depends a bit on your git setup, but it seems to work allright. I was just confused by the enormous number of lines changed, but that's all due to that XML file. |
What i mean is, could we process that XML file into a smaller XML file that only stores what we need? That would reduce the size of the repo (not sure how big that file is, maybe it compresses well?) and the speed of the intrinsic tests. There are risks too, e.g. what we generate could go out of date with the official XML file. So maybe it's fine this way. |
Ahh, makes sense. I think it might be best to keep the source of truth as unchanged as possible, since there is no direct way to obtain the sources sometimes. For example, the XML file originally existed in the The entire XML data can be found stored as a hardcoded string that is assigned to a variable in a JS file. |
c9ea3bd
to
18aed56
Compare
18aed56
to
5162772
Compare
@folkertdev I've added the ability to sample intrinsics. Currently it's implemented for x86 and about 5% of intrinsics are randomly chosen for testing. This is intended as a temporary solution really, we can remove it once we restructure the C++/Rust testfiles to run all intrinsics in one go. |
Neat! So, this does increase the longest CI job to 20 minutes, up from 12 minutes. I think we should try to make this its own CI job at least. I'm also not sure that we want to randomly sample the tests on each run. If we do hit a failure, that will be hard to reproduce. The status quo is no testing at all, so I think testing even a fixed subset is an improvement.
did this happen? It still looks like the file is added again here. |
Ohh no, let me do that too. |
About the hit failure, hmm yes you have a point. |
to char 2. including variable names in template strings instead of passing them as arguments to macros
I'll need to rename the variables in |
524bc52
to
6df5eca
Compare
Okay, so I've split the Now the longest process among all the CI processes, is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good, but the CI changes are quite large, so could you split those out into their own PR (and then just add the x86 stuff here)?
ci/run.sh
Outdated
PATH="$PATH":"$(pwd)"/c_programs | ||
export PATH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? it really should not be I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so it turns out that for aarch64, qemu
's executable resolution algorithm looks into the current directory too.
However, sde64
looks only at PATH.
This was intended to fix that discrepancy, but now that I think about it, it might be cleaner to use ./intrinsic-test-programs
in compare_outputs.rs
.
6df5eca
to
1abb103
Compare
uint16_t temp = 0; | ||
memcpy(&temp, &value, sizeof(float16_t)); | ||
std::stringstream ss; | ||
ss << "0x" << std::setfill('0') << std::setw(4) << std::hex << temp; | ||
os << ss.str(); | ||
return os; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@folkertdev do we need stringstream here? Seems to me that we can make this definition shorter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My c++ is not good enough to know for sure. If there is something shorter that works I'm all for it
Okay, so I've reverted that last one commit (that split the CI job) and the PATH update. |
Maybe this was not clear but my intention was to merge the ci changes separately first, them rebase this pr on top |
Ohh ohh, my bad. Created a PR for that: #1941 |
Turns out |
212e16d
to
1abb103
Compare
// compile this cpp file into a .o file | ||
info!("compiling main.cpp"); | ||
trace!("compiling main.cpp"); | ||
let output = cpp_compiler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble running this locally without the intrinsic sampling on, I get an apparent segfault at this clang invocation. Were you seeing that too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show me the stdout/stderr? Perhaps I could help
indentation: Indentation, | ||
loads: u32, | ||
) -> std::io::Result<()> { | ||
for arg in self.iter().filter(|&arg| !arg.has_constraint()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I note that arguments of the same type seem to have identical _vals
arrays. Not required for this but as a future optimisation we could just emit one array for each unique argument type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's something I want to work on (when this PR is merged)
|
||
pub fn map_constraints(imm_type: &String, imm_width: u32) -> Option<Constraint> { | ||
if imm_width > 0 { | ||
let max: i64 = 2i64.pow(imm_width); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some x86 intrinsics have a lot of possible immediate values, for example _mm256_fpclass_pd_mask
has an immediate of width 8 giving 256*20 loop iterations just for that one intrinsic.
We never really saw so many constraint values on aarch64 - perhaps testing every possible constraint value isn't a reasonable strategy here
RUST_LOG=warn RUST_BACKTRACE=1 \ | ||
cargo run "${INTRINSIC_TEST}" "${PROFILE}" \ | ||
--bin intrinsic-test -- intrinsics_data/x86-intel.xml \ | ||
--runner "${TEST_RUNNER}" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest trying just this tool with qemu-x86_64 -cpu=max
rather than intel SDE which I expect would run a lot quicker, potentially at the cost of less coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qemu
does not support (most of?) avx-512, which is where most of the complexity is. So we'll have to split that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which I'd rather not do in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, none of my comments should be blocking - this PR is useful without them
Maybe the situation is better today - these are the qemu cpu flags containing avx from 10.1 which is included in 25.10:
avx avx-ifma
avx-ne-convert avx-vnni avx-vnni-int16 avx-vnni-int8 avx10 avx10-128
avx10-256 avx10-512 avx2 avx512-4fmaps avx512-4vnniw avx512-bf16
avx512-fp16 avx512-vp2intersect avx512-vpopcntdq avx512bitalg avx512bw
avx512cd avx512dq avx512er avx512f avx512ifma avx512pf avx512vbmi
avx512vbmi2 avx512vl avx512vnni
And here's what lscpu
reports with -cpu=max
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch
QEMU is pretty quick to build to build a single backend from source if we wanted something newer than what's bundled with ubuntu too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Qemu-user supports avx512 with only the kvm backend, but as we run it in docker in CI, we don't have kvm. Idk about qemu-system though, it should be the same
Just so that the review process is easier, would it make sense to break down the ARM and common changes into their own PRs? A couple of them were dependent on the changes that arose to accommodate x86 in an attempt to maintain a clean separation of concerns. |
Context
This is a redo of PR #1814, since a lot of details have changed with PRs #1863, #1862, #1861, #1852.
r? @folkertdev
cc: @Amanieu