add new module to core called Time::HiRes (a real benchmark framework) #23389
Open
bulk88 wants to merge 20 commits into Perl:blead from bulk88:timehires_cleanup
+994
−241
More efficient. This is a static, there are no binary compat concerns. The dTHX is from initial commit of hrstatns() in commit: 75d5269 - Steve Peters - 10/13/2006 10:11:04 AM Upgrade to Time-HiRes-1.92.
-XPUSHs() requires saving the SV* retval of sv_2mortal(newSVsv()) around a possible Perl_stack_grow(). Split the EXTEND from the PUSH, so the SV* is held only in volatile registers (liveness).
-over-EXTEND to 13 elements instead of 1 element. Why not? pp_stat()/pp_lstat() have to do the Perl_stack_grow() call if we don't do it.
-remove Zero() macro and use a function-call-free struct initializer. Just b/c GCC and its offshoot Clang will inline a fixed length memset() doesn't make it part of ISO C. MSVC compiler never inlines memset() calls on WinPerl (b/c P5P never added the magic sauce to ask for that feature). More portably, P5P has never verified the machine code output of all known commercial Unix CCs on all CPU archs regarding inlining memset().
-when filling out the fake OP, do some instruction level parallelism, like filling in fakeop with 0s while digging through my_perl->Iop->op_flags, my_perl->Icurstackinfo->si_cxsubix, my_perl->Icurstackinfo->si_cxstack, etc., as part of the GIMME_V macro, which used to be a libperl.so exported function call a very long time ago IIRC. Another example: translate ix?OP_LSTAT:OP_STAT while translating gm==G_LIST?OPf_WANT_LIST:gm==G_SCALAR?OPf_WANT_SCALAR:OPf_WANT_VOID. Dig through the PLT/GOT/PE sym table as part of PL_ppaddr[op_type] while writing to C stk mem as part of fakeop.op_type = op_type
-change fakeop.op_ppaddr(aTHX); to ppaddr(aTHX); b/c some CCs have a low IQ and can't prove statement "PL_op = &fakeop;" won't modify field fakeop.op_ppaddr in our C auto storage OP struct var.
-don't execute the Perl_sv_2uv_flags() getter method pointlessly inside UV atime = SvUV(ST( 8)); if static function hrstatns() is a NOOP and is inlined and totally optimized away, since in some build configs hrstatns() only does atime_nsec = 0; mtime_nsec = 0; ctime_nsec = 0; Windows is an example.
-change SvUV(ST( 8)); to SvUV(SPBASE[ 8]); don't deref my_perl->Istack_base over and over
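The "split the EXTEND from the PUSH" idea can be sketched outside of Perl. The block below is only an analogy in plain C, not the XS code from this commit: `demo_stack`, `demo_extend()`, `demo_push()` and `demo_push_13()` are hypothetical names standing in for the Perl argument stack, EXTEND() and PUSHs(). It reserves room for all 13 values once, so the pushes that follow never have to check for growth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable stack; stands in for the Perl argument stack. */
typedef struct {
    int   *base;
    size_t len;
    size_t cap;
} demo_stack;

/* Analogous to EXTEND(): the only place where the buffer may move.
 * Returns 0 on success, -1 on allocation failure. */
static int demo_extend(demo_stack *s, size_t n)
{
    if (s->cap - s->len >= n)
        return 0;
    size_t new_cap = s->len + n;
    int *p = realloc(s->base, new_cap * sizeof *p);
    if (p == NULL)
        return -1;
    s->base = p;
    s->cap  = new_cap;
    return 0;
}

/* Analogous to PUSHs(): unchecked, the caller must have reserved room. */
static void demo_push(demo_stack *s, int v)
{
    assert(s->len < s->cap);
    s->base[s->len++] = v;
}

/* Reserve once for all 13 stat-like fields, then push with no growth checks. */
static int demo_push_13(demo_stack *s)
{
    if (demo_extend(s, 13) != 0)
        return -1;
    for (int i = 0; i < 13; i++)
        demo_push(s, i * 10);
    return 0;
}
```

Because the reservation happens before any value is created, nothing has to be kept alive across a call that might reallocate the stack.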
-strings "Time::HiRes::clock", "Time::HiRes::clock_nanosleep", etc will be inside HiRes.dll.so no matter what, b/c BOOT: and newXS_flags() require them no matter what
-type NV_DIE is an un-invasive LOC-wise quick fix to get rid of the tons of EU::PXS injected dXSTARG; statements which execute Perl_sv_newmortal() right before executing croak("%s(): unimplemented in this platform","Time::HiRes::clock"); The retval types could be changed to void or SV* instead, to eliminate the Perl_sv_newmortal() before croak() calls. But for some hysterical raisens, Time::HiRes.xs is confusing Perl_warn() with Perl_croak() in dozens of places. Fixing that is out of scope for this patch.
-It is a boot time constant. It will not change without a motherboard or CPU swap and then rebooting. The actual 64 bit integer returned reflects whether the NT Kernel wants to use Intel's APIC Timer, Intel's 8253/8254 PIT Timer, or Intel's RDTSC instruction. NT Kernel will only use the RDTSC backend if both the CPU and Northbridge swear upon a holy book that they will fire an interrupt at every Intel/AMD SpeedSwitch/TurboBoost transition. The dynamic CPU speed correction factor logic lives inside the machine code of QueryPerformanceCounter(), not inside QueryPerformanceFrequency(), which has been part of MS's frozen Public API since 1993.
-the test: if (!QueryPerformanceFrequency(&l_tick_frequency)) croak("WT???"); can probably be removed one day. Only Win2K or NT4 or Win95/98, running on any 32-bit CISC or 32-bit RISC CPU arch, are capable of retval FALSE. The test is added out of paranoia. IDK what in real life on real HW can cause retval FALSE.
-calc and save var unsigned __int64 qpc_res_ns; and unsigned __int64 qpc_res_ns_realtime; exactly once instead of re-calcing in the runloop, why not? HiRes.dll's .data section is only 0x650 bytes long and granularity is 0x1000/4096 bytes.
-the BOOT: initialization code of the 3 true C static global vars is written to assume 2 ithreads, or 2 my_perl ptrs, or 2 different embedding consumers of perl5XX.dll inside 1 OS process, can simultaneously call DynaLoader::bootstrap() or Time::HiRes::bootstrap() on 2 different CPU cores. This is unrealistic paranoia IMO, but CPU ops lock xchg reg, [addr]; and mov [addr], reg; are both 7 bytes long. Maybe Windows >= 8.0 on ARM32/ARM64 wants its memory fence/barrier formalities when writing to an aligned 64 bit integer. So why not?
-#define S_InterlockedExchange64(_d,_s) has an S_ prefix, so no assumptions are made on MSVC and Mingw GCC about whether InterlockedExchange64() is a macro or a symbol, for any age, any version, any build number, any FOSS project code owner, or any FOSS binary packager of those 2 C compiler families.
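A rough, portable sketch of the "compute once at BOOT, publish with an atomic exchange" pattern described above. It uses C11 `<stdatomic.h>` instead of InterlockedExchange64(), and every `demo_` name is hypothetical. The formula for the cached resolution (1e9 divided by the tick frequency) is an assumption made for illustration; the commit does not spell out how `qpc_res_ns` is derived.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the static globals named in the commit. */
static _Atomic uint64_t demo_tick_frequency; /* ticks per second, 0 = unset */
static _Atomic uint64_t demo_qpc_res_ns;     /* nanoseconds per tick, 0 = unset */

/* Called from every "BOOT". Two racing callers compute the same values,
 * so whichever exchange lands last is harmless.
 * `freq` is what the OS reported; 0 is treated as failure. */
static int demo_boot_init(uint64_t freq)
{
    if (freq == 0)
        return -1;
    /* Assumed formula, for illustration only. */
    uint64_t res_ns = UINT64_C(1000000000) / freq;
    (void)atomic_exchange(&demo_tick_frequency, freq);
    (void)atomic_exchange(&demo_qpc_res_ns, res_ns);
    return 0;
}

/* Run-loop side: one load of the cached value, no re-computation. */
static uint64_t demo_ticks_to_ns(uint64_t ticks)
{
    uint64_t res_ns = atomic_load(&demo_qpc_res_ns);
    assert(res_ns != 0); /* demo_boot_init() must have succeeded first */
    return ticks * res_ns;
}
```

With a 10 MHz counter the cached resolution is 100 ns per tick, so the run loop only needs one load and one multiply per conversion.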
Macros SvIV()/SvNV()/SvUV() contain getter function calls. Don't execute the getters, if we will croak() no matter what. The end user doesn't need to see an "Uninitialized variable in" STDERR warning right before croak("unimplemented"); executes. Same goes for SvGETMAGIC() methods firing right before croak("unimplemented"); I picked "int die_t" vs "int_die_t" so IDE syntax highlight keeps working on token "int".
…ime() To summarize, MS's FILETIME type is an 8 byte long, 64 bit integer, that might be aligned to 4 bytes, not 8. SW E-Attorneys will vigorously argue that MS's FILETIME type is an 8 byte long C struct, wrapping a union that wraps a U8 array[8]; string that is 8 bytes long. Claiming type FILETIME is a 64 bit int is libel and slander. Since P5P does not publish a C compiler or C linker, that alignment detail for Windows on RISC machine code is irrelevant. This commit was written to prevent redundant re-reads of a C auto U64 from C stack memory to a CPU register around any possible function call, if they exist, and to narrow down the peak width of each caller function's callstack frame on the C stack.
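A minimal sketch of the "read the FILETIME out of stack memory exactly once" idea, written in portable C so it compiles without `<windows.h>`. `demo_filetime` only mirrors the two-halves layout of the Win32 struct, and `demo_get_time()` is a hypothetical stand-in for an API that writes its result through a pointer; neither is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the Win32 FILETIME layout: two 32-bit halves,
 * so the struct only guarantees 4-byte alignment. */
typedef struct {
    uint32_t low;  /* like dwLowDateTime  */
    uint32_t high; /* like dwHighDateTime */
} demo_filetime;

static_assert(sizeof(demo_filetime) == 8, "two 32-bit halves, no padding");

/* Combine the halves with shifts, so no 8-byte-aligned load is required. */
static uint64_t demo_filetime_to_u64(const demo_filetime *ft)
{
    return ((uint64_t)ft->high << 32) | ft->low;
}

/* Hypothetical out-parameter API; it always reports 1970-01-01 here. */
static void demo_get_time(demo_filetime *out)
{
    out->low  = 0xD53E8000u;
    out->high = 0x019DB1DEu;
}

static uint64_t demo_now_100ns(void)
{
    demo_filetime ft;
    demo_get_time(&ft);                           /* callee writes to our stack slot */
    const uint64_t t = demo_filetime_to_u64(&ft); /* single read into a local */
    /* From here on only `t` is used, so the compiler is free to keep the
     * value in a register instead of re-reading `ft` after later calls. */
    return t;
}
```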
…zers -all C branches/CPP branches in these 2 XSUBs return and set "int status"
-remove align padding bytes from struct my_cxt_t{}. unsigned long run_count; is always 4 bytes, the other 3 members are always 8 bytes
-cleanup ABI/machine code gen of Win32-only static fn _gettimeofday(). It never leaves this TU as a fn ptr. MSVC 2022 -O1/-O2 optimizer can only create uninitialized reg/C stk "holes" for args that are unused in all callers and unused in the callee. It can't shift left or collapse any registers/C arguments that are unused on both sides, in 1 TU, even if no fn ptr is taken of a static function. The new macro remains POSIX-like.
-In _GetSystemTimePreciseAsFileTime(), immediately copy the contents of our " &C_auto_u64 " var to a new C auto var, so the 64-bit value "outputs" or pseudo-retvals of the MS Win API funcs can be manipulated for the rest of the function's body completely in CPU registers, with 0% chance of re-reading or pointlessly writing back to the C stack memory address.
-Do the same for _gettimeofday_x() when _gettimeofday_x() calls the MS public Win API funcs.
-Inside _GetSystemTimePreciseAsFileTime(), hoist/combine/factor out the 2 different callsites of QueryPerformanceCounter() to the root block. All branches will execute QueryPerformanceCounter() anyways. MSVC 2022 refused to hoist the QueryPerformanceCounter() call around the statement if(MY_CXT.run_count++==0 ||MY_CXT.base_systime_as_filetime.ft_i64>MY_CXT.reset_time){
-add PERL_STATIC_FORCE_INLINE for static funcs like _clock_gettime() that have exactly 1 caller/callsite. Usually this is an XSUB function with a CV* argument.
-add PERL_STATIC_FORCE_INLINE to _gettimeofday(), even though it has 8 different callers/callsites. The reason is that _gettimeofday() has a huge amount of U64 math at its bottom. All the callers then do a huge amount of mostly FP NV/double math before saving the final NV value to a SV* with NOK_on. To allow the CC to optimize/combine/simplify these 2 large groups of U64 math and NV math, they must be in the same function. So add PERL_STATIC_FORCE_INLINE to _gettimeofday().
sort unsigned long run_count
…erefs
-each reference to a global var like qpc_res_ns or tick_frequency is 7 bytes in machine code, or a couple more bytes than 7. BOOT:{} runs only once, the chance of 2 parallel BOOT:{} XSUBs in 2 different my_perls is almost zero, and even if there are 2 parallel OS threads executing, 1 OS thread isn't going to help shave time off the 2nd OS thread. So to reduce the number of 7 byte opcodes that are reading from the global vars, maximize C auto vars as much as possible. QueryPerformanceFrequency() internally on Win7 is around 1-3 ptr derefs into NT's "VDSO" aka KUSER_SHARED_DATA. On Win2k, QPF() is a ring 0 call.
-slide indent level to the left b/c the Win32 code block is nested too deep and almost every statement would exceed 80 chars
-cache PL_modglobal to a register, PL_modglobal is at a big U32 offset 0x698 into the my_perl struct " 48 8B 9F 98 06 00 00 mov rbx, [rdi+698h] "
… COW
-we don't need to map values 0/1 to OP_STAT/OP_LSTAT at runtime, it can be done once at CC time / BOOT:{} time
-IDK why $_[0] is being duped, the pp_stat*() functions aren't supposed to modify incoming @_ args, but if we are going to dupe $_[0], at least try to use COW semantics if available
croak("%s(): unimplemented in this platform", "Time::HiRes::ualarm"); This can be estimated at 6 + 7 + 7 = 20 bytes of machine code on Intel. My guess on a RISC CPU is 3 * 2 * 4 = 24 bytes. On any CPU arch, the asm code will look like: mov rel_U32; mov rel_U32; call rel_U32; So create a dedicated static croak func, so these unimplemented stubs are smaller, and will look like: mov reg, reg; call rel_U32; RISC: 4 + (4 || 8) Intel: 3 + (5 || 6)
-gettimeofday() EXTEND is only needed if > 1 retval b/c pp_entersub promises @_ 1 slot. Lift C stack memory var values to registers, this way if gettimeofday() is a static P5P written polyfill, and if the CC decides to inline it, the struct timeval Tp; C stack var will optimize away
-setitimer() min 2 incoming args + PPCODE: is proof we have at least 2 retval slots
-getitimer() 1 in arg + PPCODE: is proof we have at least 1 retval slot
-utime() don't execute SvNV() over and over, don't exec sv_2io() 2x, add SvPV_const() for anti-de-COW future-proofing
-I measured S_croak_xs_unimplemented() at 0x88 bytes of MSVC 2022 -O1 x64 machine code. The optimization probably isn't worth it if break even is 0x88/(7*3) = 6.47 unimpl stubs. Just use exported function cv_name(), we don't need to perfectly match croak_xs_usage()'s text/logic.
…prmt)
-TMHR has a fancy Perl maintained Win32 high precision GTOD() polyfill impl inside it. But it can't be used for actual benchmarking by CPAN authors b/c it does a very slow Perl_get_context() call every time to get access to the MY_CXT struct. So add a pTHX_ version of myNVtime(). Add tests that prove TMHR's C level public API for CPAN authors actually exists and works. Nothing inside the P5P repo ever tries to use TMHR's C level Time::HiRes::myNVtime / Time::HiRes::myU2time function pointers.
-The 3 XSUBs for calling the TMHR C func ptrs really should be in a new .xs file inside ext/XS-APItest/ called "benchmark.xs" or "noplgetcxt.xs" that has #define NO_PERL_GET_CONTEXT at the top, UNLIKE all the other XS-APItest .xs files, which try to prove the very slow ithreads-unaware CPAN XS legacy src code compat mode actually works.
-POK and SvPVX() store the 2nd fn ptr in the same SV*. The POK flag can be used by CPAN XS authors to separate old TMHR releases w/o the new fn ptr from new TMHR releases that have it. NOK and SvNVX() and using union _xnvu { NV xnv_nv; HV * xgv_stash; <<<<<<<< line_t xnv_lines; bool xnv_bm_tail; }; is an alternative design, but I went with POK and SvPVX, because even with SvREADONLY(), I have paranoia that some C code on some OS on some CPU arch somewhere will do a random read -> round_and_or_fire_IEEE_OS_signals -> write to SvNVX() operation on the SvNVX() slot, for no good reason, b/c of academic purity/standards body compliance/ABI requirements of that CPU/OS arch, and the function ptr is now gibberish, or was converted from a denormal NaN to a normal NaN, or SIG_DIV0-ed.
-future expansion provision exists: if SvPOK_on && SvCUR() > sizeof(void*), SvPVX() is now a pointer to a C struct/C array, with the 1st 4/8 bytes being a header, and not a fn ptr.
-TODO return by copy version of Time::U2time fn ptr, more efficient on certain ABIs (__vectorcall/SysV) that allow 128 bit structs/arrays to be returned in 2 registers back to the caller, and not secret pointers as a secret 1st arg
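The POK/SvPVX()/SvCUR() scheme above can be modelled in plain C to show how a consumer might tell the three cases apart. This is a sketch under assumptions, not the layout the patch actually uses: `demo_sv` is a hypothetical stand-in for the SV fields involved, and treating "length no larger than a pointer" as "the slot is a plain function pointer" is one reading of the commit text.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*demo_nvtime_fn)(void);

/* Hypothetical model of the relevant SV fields. */
typedef struct {
    int    pok;               /* like SvPOK(): set only by new releases */
    size_t cur;               /* like SvCUR() */
    union {
        demo_nvtime_fn fn;    /* plain function pointer */
        const void    *ext;   /* future: pointer to a struct with a header */
    } pvx;                    /* like SvPVX() */
} demo_sv;

static double demo_fixed_time(void) { return 42.5; }

/* Producer side: publish the second function pointer. */
static void demo_publish(demo_sv *sv, demo_nvtime_fn fn)
{
    assert(fn != NULL);
    sv->pvx.fn = fn;
    sv->cur    = sizeof(void *);
    sv->pok    = 1;
}

/* Consumer side: NULL means "not available in a form this code understands". */
static demo_nvtime_fn demo_lookup(const demo_sv *sv)
{
    if (!sv->pok)
        return NULL;          /* old release, no second pointer */
    if (sv->cur > sizeof(void *))
        return NULL;          /* extended layout, needs newer consumer code */
    return sv->pvx.fn;
}
```

A consumer that gets NULL back can fall back to the older, slower entry point.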
-reason: make these XSUBs as fast as possible so these XSUBs are more accurate for benchmarking, or contribute less overhead to the final numeric time deltas vs the time of whatever PP code was being measured. The sv_newmortal()+sv_set_i_u_n_v_mg() permutation is unacceptable. Stepping into sv_upgrade() to do SVt_NULL->SVt_IV is unacceptable.
-TMR_TARG***(rsv, RETVAL, 1); macros could be further optimized here vs pp.h's impl of TARG***(RETVAL,1), but that is left for the future.
…n loss
-add NV retval variants nv_gettimeofday() and nv_clock_gettime(clock_id, &status). The splitting of the solo U64 into 2 IVs/UVs (64b IVs/UVs on my system), then recombining those 2 integers with integer or FP double logic, was very messy and verbose machine code, and no, MSVC didn't "algebra" const fold away the splitting and recombining logic, so just create polyfills that always return NVs from the start
-do "- ((U64)EPOCH_BIAS" with U64 logic, for maximum chance of no rounding/no precision loss, then do division with FP logic for maximum fractional number precision
-"NV nv = nv_clock_gettime(clock_id, &status);" is inlined away, var bool status; has no C stack or register representation in mach code with MSVC 2022 -O1. Returning a pass by copy struct {NV nv; bool success;}; was considered, but never tried, b/c of Win64 AMD64 ABI's "rule" of all retval types > 8 bytes becoming secret ptrs and a secret 1st arg. Maybe MSVC would inline and fold away the struct, maybe it would not. I didn't try it. Current impl is working as intended.
-remember, nv_clock_gettime() still needs to reject junk values in clock_id
-add tick_frequency_nv, so U64 -> NV is done 1x at startup, not in the run loop
-S_croak_xs_unimplemented(const CV *const cv) silence CC warning, cv_name() doesn't want a const CV* head struct
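The ordering described above (subtract the epoch bias in 64-bit integer math first, convert to floating point only for the final division) can be sketched as follows. `demo_filetime_to_unix_nv()` is a hypothetical name, `double` stands in for NV, and the constant is the usual count of 100 ns intervals between 1601-01-01 and 1970-01-01.

```c
#include <assert.h>
#include <stdint.h>

/* 100 ns intervals between 1601-01-01 and 1970-01-01. */
#define DEMO_EPOCH_BIAS UINT64_C(116444736000000000)

/* Convert a 64-bit FILETIME-style tick count to fractional Unix seconds. */
static double demo_filetime_to_unix_nv(uint64_t ft_100ns)
{
    assert(ft_100ns >= DEMO_EPOCH_BIAS); /* pre-1970 input would wrap around */

    /* Integer subtraction is exact. Doing it in double first would round
     * away low bits, since a present-day tick count needs more than 53 bits. */
    const uint64_t since_unix_epoch = ft_100ns - DEMO_EPOCH_BIAS;

    /* One conversion and one division in floating point at the very end. */
    return (double)since_unix_epoch / 1e7;
}
```

For example, a tick count 15,000,000 past the bias comes out as 1.5 seconds.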
-EU::Constant already has all these AUTOLOAD macro const C strings in the binary, and they aren't going away any time soon. So use those C strings to make SVPV HEK* COWs, and stick them in @EXPORT_OK, instead of @EXPORT_OK holding SVPV Newx() non-COW strings. Besides, most or all of these C strings will become HV* stash HEs, CV*s, or GV*s, and all of those hold PL_strtab HEK*s, so let's save private bytes of phy/virtual memory of a Perl proc at runtime b/c @EXPORT_OK's SV*s are all COWs. And speed up Time::HiRes initial load time, since yylex/ck_op*() doesn't have to parse, alloc OPs, alloc pad consts, then run BEGIN, then DTOR all the OPs and pad consts.
bc456d4 to 02c65aa
Read the commits.
Goals of this branch:
-Make myNVtime() actually usable for 3rd party CPAN authors who want to benchmark something. It's not usable because of the dTHX; / Perl_get_context() call. Removing 1 Perl_get_context() DOUBLED the speed of C function myNVtime() in this very broken/flawed benchmark. It's flawed b/c XSUB XS::APItest::XSUB::Time::HiRes::myNVtime and XSUB XS::APItest::XSUB::Time::HiRes::myNVtime_cxt are Grade F XS code that did not write #define PERL_NO_GET_CONTEXT and have 1 or 2 dozen Perl_get_context() calls inside them. The fact that their TU is Grade F XS code that did not write #define PERL_NO_GET_CONTEXT is not a bug, and is as designed.
-Next goal. Discreetly add the rdtsc CPU instruction and Win32's QueryPerformanceCounter() and GetSystemTimePreciseAsFileTime() func calls to blead Perl/Perl core, without any visible public API changes, without any visible C code changes to HiRes.xs, and without documenting that they are secretly available in stock P5P WinPerl core now. No need to use Win32::API or go hunting for quick, dirty, and poorly maintained Win32::* CPAN mods just to get access to QueryPerformanceCounter().
If Linux's glibc's C grammar token gettimeofday() becomes actual inline assembly or a builtin or an intrinsic (a ptr deref into the vdso struct https://elixir.bootlin.com/linux/v4.7/source/arch/x86/entry/vdso/vclock_gettime.c#L104 ) inside libperl/hires.so, then that platform will be faster too.
GetSystemTimePreciseAsFileTime (rdtsc + utc time) is reachable with
QueryPerformanceCounter (rdtsc with speedswitch/turboboost/hypervisor correction, no comment when 0 nanosecs happened) is reachable with
Exposing these CPU features means stripping as much perl XS "glue" from the xsub as possible without segving or doing black box breaking like manually delinking SV heads and SV bodies, or manually growing/reallocing the mortal stack.
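The first goal comes down to the difference between looking the interpreter context up on every call (dTHX / Perl_get_context() style) and receiving it as an argument (pTHX style). The plain C sketch below is only an analogy: `demo_interp`, the `_Thread_local` slot, and both `demo_tick_*()` functions are hypothetical, and a thread-local variable stands in for however the real context pointer is stored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-interpreter state. */
typedef struct {
    unsigned long run_count;
} demo_interp;

/* Stand-in for the per-thread slot that a context getter would read. */
static _Thread_local demo_interp *demo_current;

static demo_interp *demo_get_context(void)
{
    return demo_current;
}

/* dTHX style: pays for a context lookup on every single call. */
static unsigned long demo_tick_implicit(void)
{
    demo_interp *my = demo_get_context();
    assert(my != NULL);
    return ++my->run_count;
}

/* pTHX style: the caller already holds the context and passes it along. */
static unsigned long demo_tick_explicit(demo_interp *my)
{
    assert(my != NULL);
    return ++my->run_count;
}
```

Both variants do the same work on the same state. The explicit one simply skips the lookup, which matters when the function being called is itself the timer.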
There were large areas of very poor quality XS code, and sometimes even bugs where a sv_newmortal() was executed and the retval was never used, and HiRes's XS code did an accidental sv_newmortal() and a few lines later did sv_2mortal(newSViv()). dXSTARG contains a sv_newmortal() call.
The only perf optimizations not done in these commits were:
-manually delinking SV heads/bodies (making sv_2mortal(newSViv()) faster)
-not using call checker to rewrite the caller's optree.
-POPMARK macro/inline static's pointer aliasing violation, only me and tonyc know about it
-not writing any asm code,
-not adding any new OS specific function calls that were not previously linked into T::HR
-no reaching into the windows vdso page/C struct or linux vdso page/C struct
I DID handle and remove most of the XS_RETURN() macros since XS_RETURN's C code violates x86 pointer aliasing (memory barrier/fence
AFTER inventing
nv_gettimeofday();