Rebase with AOCL5.1 #32

kvaragan · 2025-05-27T06:21:50Z

Rebase with AOCL 5.1

Details: - Renamed herk macrokernels and supporting files and functions to gemmt, which is possible since at the macrokernel level they are identical. Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal functions rather than cpp macros that instantiate multiple functions. Thanks to Devin Matthews for his efforts on this issue (flame#531). - Check that the maximum stack buffer size is sufficiently large relative to the register blocksizes for each datatype, and do so when the context is initialized rather than when an operation is called. Note that with this change, users who pass in their own contexts into the expert interfaces currently will *not* have any checks performed. Thanks to Devin Matthews for suggesting this change.
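The stack-buffer check described above can be pictured with a minimal sketch. It assumes the check amounts to "an MR x NR microtile of each datatype must fit in the maximum stack buffer"; the limit `EXAMPLE_STACK_BUF_MAX_SIZE`, the struct, and both function names are hypothetical and are not BLIS identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit in bytes; not the real BLIS constant. */
#define EXAMPLE_STACK_BUF_MAX_SIZE 4096

/* Hypothetical per-datatype register blocksizes. */
typedef struct {
    size_t mr;        /* register blocksize in the m dimension */
    size_t nr;        /* register blocksize in the n dimension */
    size_t elem_size; /* size of one element in bytes */
} example_blksz_t;

/* Returns true if an MR x NR microtile of this datatype fits in the buffer. */
bool example_tile_fits(const example_blksz_t *b)
{
    assert(b != NULL);
    return b->mr * b->nr * b->elem_size <= EXAMPLE_STACK_BUF_MAX_SIZE;
}

/* Intended to run once, when a context is initialized, rather than on
   every operation call: every datatype must pass. */
bool example_check_all(const example_blksz_t *types, size_t n_types)
{
    for (size_t i = 0; i < n_types; ++i)
        if (!example_tile_fits(&types[i]))
            return false;
    return true;
}
```

Running the check once at initialization keeps it off the hot path, which is also why a user-supplied context that bypasses initialization would skip it.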

Details: - Expanded the BLAS compatibility layer to include support for ?axpby_() and ?gemm_batch_(). The former is a straightforward BLAS-like interface into the axpbyv operation while the latter implements a batched gemm via loops over bli_?gemm(). Also expanded the CBLAS compatibility layer to include support for cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari for submitting these new APIs via flame#566. - Fixed a long-standing bug in common.mk that for some reason never manifested until now. Previously, CBLAS source files were compiled *without* the location of cblas.h being specified via a -I flag. I'm not sure why this worked, but it may be due to the fact that the cblas.h file resided in the same directory as all of the CBLAS source, and perhaps compilers implicitly add a -I flag for the directory that corresponds to the location of the source file being compiled. This bug only showed up because some CBLAS-like source code was moved into an 'extra' subdirectory of that frame/compat/cblas/src directory. After moving the code, compilation for those files failed (because the cblas.h header file, presumably, could not be found in the same location). This bug was fixed within common.mk by explicitly adding the cblas.h directory to the list of -I flags passed to the compiler. - Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory, and updated test/Makefile to build those drivers. - Fixed typo in error message string in cblas_sgemm.c.
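As a rough illustration of "a batched gemm via loops over gemm", the sketch below loops over independent problems and calls a naive reference gemm for each one. All names and signatures here are illustrative stand-ins, not the `?gemm_batch_()` interface itself, and the sketch assumes column-major storage with a single shared problem size.

```c
#include <assert.h>
#include <stddef.h>

/* Naive column-major C := alpha*A*B + beta*C; a stand-in for a real gemm. */
void example_dgemm(size_t m, size_t n, size_t k, double alpha,
                   const double *a, size_t lda,
                   const double *b, size_t ldb,
                   double beta, double *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
}

/* Batched gemm expressed as a plain loop over independent problems. */
void example_dgemm_batch(size_t batch, size_t m, size_t n, size_t k,
                         double alpha, const double *const *a, size_t lda,
                         const double *const *b, size_t ldb,
                         double beta, double *const *c, size_t ldc)
{
    assert(batch == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t g = 0; g < batch; ++g)
        example_dgemm(m, n, k, alpha, a[g], lda, b[g], ldb, beta, c[g], ldc);
}
```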

Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon=<name> or -a <name> syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files.

Details: - Inserted a new 'Example Code' section into the README.md immediately after the 'Getting Started' section. Thanks to Devin Matthews for recommending this addition. - Moved the 'Performance' section of the README down slightly so that it appears after the 'Documentation' section.

Details: - Annotated the code blocks that represent shell commands and output as 'bash' in README.md and BuildSystem.md.

Details: - Reverted the annotation of some markdown code blocks with 'bash' after realizing that the in-browser syntax highlighting was not worthwhile.

Details: - Added a new 'zen3' subconfiguration targeting support for the AMD Zen3 microarchitecture (flame#561). Thanks to AMD for this contribution. - Restructured clang and AOCC support for zen, zen2, and zen3 make_defs.mk files. The clang and AOCC version detection now happens in configure, not in the subconfigurations' makefile fragments. That is, we've added logic to configure that detects the version of clang/AOCC, outputs an appropriate variable to config.mk (ie: CLANG_OT_*, AOCC_OT_*), and then checks for it within the makefile fragment (as is currently done for the GCC_OT_* variables). - Added configure support for a GCC_OT_10_1_0 variable (and associated substitution anchor) to communicate whether the gcc version is older than 10.1.0, and use this variable to check for recent enough versions of gcc to use -march=znver3 in the zen3 subconfig. - Inlined the contents of config/zen/amd_config.mk into the zen and zen2 make_defs.mk so that the files are self-contained, harmonizing the format of all three Zen-based subconfigurations' make_defs.mk files. - Added indenting (with spaces) of GNU make conditionals for easier reading in zen, zen2, and zen3 make_defs.mk files. - Adjusted the range of models checked by bli_cpuid_is_zen() (which was previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is completely disjoint from the models checked by bli_cpuid_is_zen2() (0x30 ~ 0xff). This is normally necessary because Zen and Zen2 microarchitectures share the same family (23, or 0x17), and so the model code is the only way to differentiate the two. But in our case, fixing the model range for zen *wasn't* actually necessary since we checked for zen2 first, and therefore the wide zen range acted like the 'else' of an 'if-else' statement. That said, the change helps improve clarity for the reader by encoding useful knowledge, which was obtained from https://en.wikichip.org/wiki/amd/cpuid . - Added zen2.def and zen3.def files to the collection in travis/cpuid. 
Note that support for zen, zen2, and zen3 is now present, and while all the three microarchitectures have identical instruction sets from the perspective of BLIS microkernels, they each correspond to different subconfigurations and therefore merit separate testing. Thanks to Devin Matthews for his guidance in hacking these files as slight modifications of zen.def. - Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh. Now, zen, zen2, and zen3 are tested through the SDE via Travis CI builds. - Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils repository on GitHub rather than on Intel's website. This change was made in an attempt to circumvent recent troubles with Travis CI not being able to download the SDE directly from Intel's website via curl. Thanks to Devin Matthews for suggesting the idea. - Updated travis/do_sde.sh to grab the latest version (8.69.1) of the Intel SDE from the flame/ci-utils repository. - Updated .travis.yml to use gcc 9. The file was previously using gcc 8, which did not support -march=znver2. - Created amd64_legacy umbrella family in config_registry for targeting older (bulldozer, piledriver, steamroller, and excavator) microarchitectures and moved those same subconfigs out of the amd64 umbrella family. However, x86_64 retains amd64_legacy as a constituent member. - Fixed a bug in configure related to the building of the so-called config list. When processing the contents of config_registry, configure creates a series of structures and lists that allow for various mappings related to configuration families, subconfigs, and kernel sets. Two of those lists are built via substitution of umbrella families with their subconfig members, and one of those lists was improperly performing the substitution in a way that would erroneously match on partial umbrella family names. That code was changed to match the code that was already doing the substitution properly, via substitute_words(). 
Also added comments noting the importance of using substitute_words() in both instances. - Comment updates.
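The disjoint model ranges for Zen and Zen2 can be sketched as below. This only models the family/model comparison quoted in the message (family 0x17, models 0x00 ~ 0x2f versus 0x30 ~ 0xff); the function names are hypothetical, and any other checks the real detection code performs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Family 0x17 (23) is shared by Zen and Zen2, per the commit message. */
#define EXAMPLE_FAMILY_ZEN 0x17u

/* Zen: models 0x00 through 0x2f. */
bool example_is_zen(uint32_t family, uint32_t model)
{
    assert(model <= 0xffu);
    return family == EXAMPLE_FAMILY_ZEN && model <= 0x2fu;
}

/* Zen2: models 0x30 through 0xff. */
bool example_is_zen2(uint32_t family, uint32_t model)
{
    assert(model <= 0xffu);
    return family == EXAMPLE_FAMILY_ZEN && model >= 0x30u;
}
```

With disjoint ranges, the result no longer depends on which of the two predicates is evaluated first.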

Details: - Replaced the hard-coded calls to double-precision real syr, syr2, syrk, and syr2k in the corresponding standalone test drivers in the 'test' directory with conditional branches that will call the appropriate BLAS interface depending on which datatype is enabled. Thanks to Madan mohan Manokar for this improvement. - CREDITS file update.

Details: - Add a blurb about the new addons feature to the "Documentation for BLIS developers" section of the README.md, which also links to the Addons.md document.

Details: - Add additional mentions of addons to README.md, including in the "What's New" section. - Removed mention of sandboxes from the long list of advantages provided by BLIS. - Very minor description update to opening line of Addons.md.

Details: - Added a recursive sed script to the 'build' directory.

Details: - Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and .ker_params. These fields store pointers to functions and data that will allow the user to more flexibly create custom operations while recycling BLIS's existing partitioning infrastructure. - Updated typed API to packm variant and structure-aware kernels to replace the diagonal offset with panel offsets, and changed strides of both C and P to inc/ldim semantics. Updated object API to the packm variant to include rntm_t*. - Removed the packm variant function pointer from the packm cntl_t node definition since it has been replaced by the .pack_fn pointer in the obj_t. - Updated bli_packm_int() to read the new packm variant function pointer from the obj_t and call it instead of from the cntl_t node. - Moved some of the logic of bli_l3_packm.c to a new file, bli_packm_alloc.c. - Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers instead of typed pointers, allowing a single function to be used regardless of datatype. This obviated having a separate implementation in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a new function, bli_packm_scalar(). - Employed a new standard whereby right-hand matrix operands ("B") are always packed as column-stored row panels -- that is, identically to that of left-hand matrix operands ("A"). This means that while we pack matrix A normally, we actually pack B in a transposed state. This allowed us to simplify a lot of code throughout the framework, and also affected some of the logic in bli_l3_packa() and _packb(). - Simplified bli_packm_init.c in light of the new B^T convention described above. bli_packm_init()--which is now called from within bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns a bool that indicates whether packing should be performed (or skipped). 
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(), which, among other things, defaults the new .pack_fn field of the obj_t to bli_packm_blk_var1() if the field is NULL. - Defined a new function, bli_obj_reset_origin(), which permanently refocuses the view of an object so that it "forgets" any offsets from its original pointer. This function also sets the object's root field to itself. Calls to bli_obj_reset_origin() for each matrix operand appear in the _front() functions, after the obj_t's are aliased. This resetting of the underlying matrices' origins is needed in preparation for more advanced features from within custom packm kernels. - Redefined bli_pba_rntm_set_pba() from a regular function to a static inline function. - Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use libblis_test_pobj_create() to create local packed objects. Previously, these packed objects were created by calling lower-level functions.
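The idea of using byte (`char*`) pointers so that one packing routine serves every datatype can be sketched as follows. This is a simplified illustration, not the code in bli_packm_blk_var1.c: the function name is hypothetical, it ignores scalars, panels, and conjugation, and it simply copies a strided m x n matrix into a contiguous column-stored buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy an m x n matrix with row/column strides (counted in elements) into
   a contiguous column-stored buffer. The element type is described only by
   its size in bytes, so the same function works for any datatype. */
void example_pack_bytes(size_t m, size_t n, size_t elem_size,
                        const void *src, size_t rs, size_t cs, void *dst)
{
    assert(src != NULL && dst != NULL && elem_size > 0);
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            memcpy(d + (i + j * m) * elem_size,
                   s + (i * rs + j * cs) * elem_size,
                   elem_size);
}
```

Because the strides are arguments, passing swapped strides packs the transpose, which is one way to picture packing B "in a transposed state" with the same routine used for A.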

Details: - Added previously-deleted cpp macro block to bli_cntx_init_zen.c targeting the Naples microarchitecture that enabled different cache blocksizes when the number of threads exceeds 16. This commit represents PR flame#573.

Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather than one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. 
- Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations.
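A minimal sketch of edge-case handling inside a microkernel: the kernel always computes a full MR x NR tile into a temporary buffer and then writes back only the m x n part that actually exists. The blocksizes `EX_MR`/`EX_NR`, the function name, and the packing layout are assumptions made for illustration; the real kernels do this through the macros in bli_edge_case_macro_defs.h.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MR 4 /* hypothetical register blocksize, m dimension */
#define EX_NR 4 /* hypothetical register blocksize, n dimension */

/* C(m x n) += A * B, where A is a packed EX_MR x k micropanel (stored one
   column of EX_MR elements per k step) and B is a packed k x EX_NR
   micropanel (stored one row of EX_NR elements per k step).
   Requires m <= EX_MR and n <= EX_NR. */
void example_gemm_ukr(size_t m, size_t n, size_t k,
                      const double *a, const double *b,
                      double *c, size_t rs_c, size_t cs_c)
{
    assert(m <= EX_MR && n <= EX_NR);
    double ab[EX_MR * EX_NR] = { 0 };

    /* Always compute the full tile, as an optimized kernel would. */
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < EX_NR; ++j)
            for (size_t i = 0; i < EX_MR; ++i)
                ab[i + j * EX_MR] += a[i + p * EX_MR] * b[j + p * EX_NR];

    /* Write back only the m x n region; arbitrary row and column strides
       of C are handled by the same loop. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            c[i * rs_c + j * cs_c] += ab[i + j * EX_MR];
}
```

The write-back loop is also where general-stride output falls out for free, which matches the note that the dedicated general-stride assembly could be removed.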

For 8<= GCC < 10 compatibility.

Add `%=` tag to branch labels, which expands to a unique identifier for each inline assembly block. This prevents duplicate symbol errors on Apple Silicon (flame#594). Fixes flame#594. [ci skip] since we can't test Apple Silicon anyways...

Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates.

Remove alignment of temporary AB buffer in edge case handling macros unless alignment is specifically requested (e.g. Core2, SDB/IVB). Fixes flame#595.

@egaudry

@egaudry and I both saw this issue on Linux with Clang 10.

```
Compiling obj/thunderx2/kernels/armv8a/3/sup/bli_gemmsup_rv_armv8a_asm_d4x8m.o ('thunderx2' CFLAGS for kernels)
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:171:49: fatal error: invalid symbol redefinition
" \n\t"
^
<inline asm>:90:5: note: instantiated into assembly here
.SLOOPKITER:
^
1 error generated.
```

Signed-off-by: Jeff Hammond <[email protected]>

Details: - In config/zen3/bli_family_zen3.h, renamed: BLIS_SMALL_MATRIX_A_THRES_M_GEMMT -> _M_SYRK BLIS_SMALL_MATRIX_A_THRES_N_GEMMT -> _N_SYRK Thanks to Jeff Diamond for helping spot the stale _GEMMT naming.

armclang is treated as regular clang. Fixes flame#606. [ci skip]

No need to query MR during kernel runtime.

For clang (& armclang?) compilation. Hopefully solves flame#609.

Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]

Fixes flame#611.

Details: - Added a new option to 'configure' that allows the user to specify a list of symbols to omit from the library. The format of the option is --omit-symbols=LIST where LIST is a comma-separated list of symbol names (excluding any trailing underscore). This list is parsed into a list of #define directives that causes the relevant parts of BLIS to be ignored (or not). As such, the nature of this option is to only support omitting symbols which have been pre-identified as potential troublemakers when linking BLIS with other libraries such as LAPACK or ScaLAPACK. (This list may grow in the future as additional symbols are identified.) Note: we leave lsame_() and xerbla_() prototypes enabled even when their respective symbols are omitted from the library. - Re-implemented the --enable-scalapack-compat configure option to utilize the underlying --omit-symbols=LIST infrastructure. - Implemented an --enable-lapack-compat option, which omits all of the known problematic symbols currently supported for omission. - This commit addresses Issue flame#816. Thanks to Timo Betcke for bringing it to our attention and to Devin Matthews for his advice and for his initial implementation of --enable-scalapack-compat (PR flame#813). - CREDITS file update.

@myeh01

Details: - Replace all assembly kernels in the `sifive_x280` kernel set with intrinsic versions. - Fixes bug encountered in flame#805. - Update the RISC-V toolchain used in CI testing. - Special thanks to Michael Yeh (@myeh01) and SiFive.

Details: - Search for Intel ifx and NVIDIA/PGI Fortran compilers. - Correctly determine the Fortran compiler vendor for Intel ifx and NVIDIA/PGI compilers. - Determine the compiler version and correct Fortran complex return type for NVIDIA/PGI.

Add documentation for the plugin system and for modifying the control tree to make custom operations. Details: - `docs/PluginHowTo.md` describes in a "tutorial style" how to implement a custom BLAS-like operation by creating a plugin and then modifying the `gemm` control tree to achieve the desired effect. - Briefly, plugins allow users to add new kernels and associated block sizes/preferences to BLIS without modifying the BLIS source code. User-provided kernels are compiled using the BLIS build system for configured architectures and selected at runtime based on the actual hardware. - To implement custom operations, users can combine their own kernels (and/or existing BLIS kernels) with a customized control tree, which represents the specific algorithmic steps. Users can customize the kernels to be used for packing and for computation, extra information passed to kernels (e.g. additional parameters or data), block sizes, etc. An API is provided for modifying the default `gemm` control tree (also used for other level-3 operations, except `trsm`).

…lame#841) Details: - Currently, all enums used to represent built-in kernel IDs, blocksizes, preferences, and operation IDs have a special member equal to `BLIS_VA_END`, which in turn is `(siz_t)-1`. In principle, this would force the underlying type used to represent the enum values to be as wide as `siz_t`, particularly when passed to the variadic function `bli_cntx_set_ukrs` and friends. User-registered kernel IDs and such are of type `siz_t` explicitly. However, gcc (12 and older), clang, and icx pass literal enum constants (e.g. `BLIS_MR`) that are small enough as `int` when 32-bit mode is used (`-m32`). This causes a misalignment of the parameters on the stack and ultimately a segfault. The problem also exists in 64-bit mode with clang and icx and on aarch64 with clang, as parameters far enough down the list to go on the stack do not get the upper 4 bytes initialized. - This commit introduces a new type `kerid_t` which is always `uint32_t`. This type is used for all kernel, blocksize, preference, and operation IDs (including user-registered ones). It is also used for `BLIS_VA_END`. - Now all enum values are always passed as 32-bit ints on all architectures. - Fixes flame#839.
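To make the variadic-argument issue concrete, here is a small sketch of a sentinel-terminated variadic function in which every ID, including the terminator, has one fixed 32-bit type. The type `example_kerid_t`, the sentinel, the enum, and the function are hypothetical stand-ins and not the BLIS definitions; the point is only that the caller passes and the callee reads the same type.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t example_kerid_t;            /* stand-in for kerid_t */
#define EXAMPLE_VA_END ((example_kerid_t)-1) /* sentinel, same 32-bit type */

enum { EXAMPLE_MR = 0, EXAMPLE_NR = 1, EXAMPLE_KC = 2 };

/* Counts the IDs passed before the sentinel. Every variadic argument is
   read as example_kerid_t, so callers must pass exactly that type. */
size_t example_count_ids(example_kerid_t first, ...)
{
    size_t n = 0;
    va_list ap;
    va_start(ap, first);
    for (example_kerid_t id = first; id != EXAMPLE_VA_END;
         id = va_arg(ap, example_kerid_t)) {
        ++n;
        assert(n < 1024); /* soft guard against a forgotten sentinel */
    }
    va_end(ap);
    return n;
}
```

A call might look like `example_count_ids((example_kerid_t)EXAMPLE_MR, (example_kerid_t)EXAMPLE_NR, EXAMPLE_VA_END)`. If the callee instead read a wider type such as `size_t` while the caller pushed `int`-sized enum constants, the two sides would disagree about argument width, which is the mismatch the commit message describes.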

Details: - Rename `RELEASING` to `RELEASING.md`. - Add additional structure and Markdown notation to `RELEASING.md`. - Add a section on the overall release and branching strategy. - Clarify and tweak instructions for making release candidates and releases. - Add instructions for making point releases and back-porting bug fixes. - Rename `build/start-new-rc.sh` to `build/do-release.sh`. - Tweak `do-release.sh` to do only common tasks for rcs, major releases, and point releases. - Add `-b` option to `do-release.sh` which does a "bare" release without a new branch or tag (for "dev releases" on master). - Update the version file on `master` to `3.0-dev` to reflect the new guidelines.

Details: - Update release notes for flame#841, should have been done in the PR. - [ci skip]

Details: - Removed/relaxed the deprecation warning for `OMP_NUM_THREADS`. - Clarified how `OMP_NUM_THREADS` is used and added a simple example on how to do different regions of thread-counts.

Details: - Implemented an option (`-i LIST`) to `gen-make-frag.sh` that allows the caller to optionally ignore additional directories when walking the source directory. (Note that previously the standard -- and only -- way to ignore directories was to add them to the `ignore_list` file, which is a required argument to the script.) - I implemented this feature for something but then ended up not needing it, but figured it might be helpful in the future. - Multiple `-i` options are allowed.

Details: - Added a `sifive_rvv` configuration which is `VLEN`-agnostic but takes advantage of optimized microkernels for SiFive (and other) RISC-V architectures. - This configuration does not currently participate in automatic configuration selection during BLIS configure. - `VLEN` is detected at runtime to properly make use of available vectorization.

Details: - Previously, the tests using Intel SDE ran the BLIS testsuite manually. Now, the full `make check` suite is run using SDE as a wrapper for execution.

Details: - Fixes to the documentation: 1. Some integer-based types were missed. 2. Some function parameters were missed. 3. Many interfaces were missing `const`. - Improved formatting and consistency, removed trailing whitespace. - Added several missed global constants.

…ally found successfully. (flame#842) Details: - If the examples are built out-of-tree then `BLIS_INSTALL_PATH` needs to be set to find the header, library, and build system files. Also, if the examples are attempted to be built before configuring blis then `common.mk` will be missing. - Current behavior silently ignores the failed import of `common.mk` which causes various difficult-to-diagnose problems. - The Android/Bionic detection in common.mk has also been changed to not rely on an external file. This allows examples to be compiled in isolation. Details: - When building examples out-of-tree (or potentially other external code using `common.mk`), `DIST_PATH` will not be set and so `common.mk` will not be able to locate `build/detect/android/bionic.h`, causing a compiler error in some cases. - This has been fixed by including the contents of `bionic.h` in the shell statement executing the compiler check. - Fixes flame#840.

Details: - GCC 15 drops support for Xeon Phi architectures such as KNL. - This PR blacklists the `knl` configuration for GCC 15+.

) Details: - Alias `?gemmt_` as `?gemmtr_` to fix lapack 3.12.1 compatibility. (Fixes flame#848) - Add the `?gemmtr_` and `cblas_?gemmtr` aliases to symbol list. - Also alias `cblas_?gemmt` as `cblas_?gemmtr` for lapack 3.12.1 compatibility.

Details: - See flame#850 for details on the problem. - This is a temporary fix which should work for sdcz data types. - Altra architectures may still not fully work for MP/MD as the stack buffer size is hard-coded.

Details: - clang 14.0.0 apparently makes some invalid assumptions about whether or not the AB microtile is initialized in the `gemm` reference microkernel. This leads to the "scale by alpha" part doing something strange (all sorts of random and even NaN values pop up). I do not know why this only manifested for `ztrsm` on `skx` (in `zgemm_skx_ref` via `zgemmtrsm_skx_ref`). See flame#852. - Aliasing the AB microtile (in the proper datatype) as a pointer to a raw character array, and then initializing the character array with `= { 0 }` convinces the compiler to do the right thing. - The problem did not occur in 14.0.6 or 15.0.7. It may only be a narrow band of versions which are problematic. - This commit adds the char array workaround and fixes flame#852.

Details: - This PR adds CircleCI testing in addition to TravisCI and Appveyor. - All of the same tests as on Travis are run, except that different hardware typically ends up being used (usually Zen on Travis, Xeon Platinum on Circle). This has actually exposed a couple of bugs (see flame#850 and flame#852). - The `travis` directory has been renamed to `ci` as it is now shared. - Running SDE on CircleCI is a bit problematic because glibc changed how CPUID detection is done. This requires running some architectures with different hardware definition files and forcing a config via `BLIS_ARCH_TYPE`.

Details: - The BLAS/CBLAS function `?gemmtr` is currently implemented as a symbol alias of the already-existing `?gemmt`. This does not work on macOS/Darwin. - Instead, use a minimal wrapper function which calls the appropriate existing BLAS/CBLAS function. - Also clean up the CBLAS prototypes a bit.
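The "minimal wrapper function" approach can be sketched like this. The routine below is a toy with an invented signature, standing in for an existing BLAS/CBLAS function; only the forwarding pattern is the point.

```c
#include <assert.h>

/* Stand-in for an existing routine; name and signature are illustrative. */
int example_gemmt(int n, const double *x, double *y)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        y[i] += x[i];
    return n;
}

/* The new name is provided as a thin forwarding function instead of a
   symbol alias, so it does not depend on linker-level alias support. */
int example_gemmtr(int n, const double *x, double *y)
{
    return example_gemmt(n, x, y);
}
```

The cost is one extra call per invocation, which is negligible next to a level-3 operation.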

Details: - Add status badge for CircleCI. - [ci skip]

@fgvanzee

Details: - Developed by @fgvanzee and @devinamatthews. - Level-0 scalar macros have moved from a name-based system (e.g. `bli_dcopys( ... )`) to a macro argument-based system (`bli_tcopys( d,d, ... )`). - All macros are explicitly mixed-type. - All input and output operands can have a distinct type (precision and/or domain). Unnecessary computations and spurious NaN/Inf propagation are avoided in mixed-domain cases. - All macros which do math (i.e. not copy/set/etc.) take an additional computational precision. - Tile-level macros, 1m, broadcast-B, and other extensions are also included. - All macros should correctly handle aliasing of input and output operands (this needs to be rigorously checked). - The macros work generically over the defined types -- new types only need limited support (primarily conversion to other types and basic math). - For code outside of core BLIS (optimized kernels, sandboxes, etc.), a selection of legacy macros have been added which translate to the new level-0 macros. Behavior is unchanged. - A standalone, templated C++ testsuite for the level-0 macros has been added. It is currently included as part of the CircleCI tests. - Const-correctness of level-0 macros is also checked.
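One way to picture an argument-based macro system is token pasting on a type character, as in the sketch below. It covers only real `float`/`double`, and every macro name and argument order is made up for the example; the actual `bli_t*` macros are considerably richer (complex domains, tiles, and so on) and may be organized differently.

```c
#include <assert.h>

/* Map a one-letter type character to a C type via token pasting. */
#define EX_CTYPE_s float
#define EX_CTYPE_d double
#define EX_CTYPE(ch) EX_CTYPE_##ch

/* y := x, converting the value to the type selected by chy.
   chx documents the source type and is otherwise unused here. */
#define ex_tcopys(chx, chy, x, y) \
    do { (y) = (EX_CTYPE(chy))(x); } while (0)

/* y := y + a*x, computed in precision chc and stored as type chy.
   cha and chx document the operand types and are otherwise unused here. */
#define ex_taxpys(cha, chx, chy, chc, a, x, y) \
    do { \
        EX_CTYPE(chc) ex_t_ = (EX_CTYPE(chc))(a) * (EX_CTYPE(chc))(x) \
                            + (EX_CTYPE(chc))(y); \
        (y) = (EX_CTYPE(chy))ex_t_; \
    } while (0)

/* Mixed-type use: float inputs, double accumulation and output. */
double ex_demo_accumulate(float a, float x, double y)
{
    /* The chosen computation precision is at least as wide as the inputs. */
    assert(sizeof(EX_CTYPE(d)) >= sizeof(EX_CTYPE(s)));
    ex_taxpys(s, s, d, d, a, x, y);
    return y;
}
```

Compared with one macro per type name, a single macro parameterized by type characters lets input, output, and computation types vary independently without multiplying the number of macro definitions.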

Details: - When adjusting the buffer to point to the first imaginary element, the function `bli_obj_buffer_at_off` was used, which includes any currently set offsets, but the result was then stored with `bli_obj_set_buffer`, which sets the buffer pointer as it is *before* offsets are applied. - Now a matching `bli_obj_buffer` call is used to avoid any offsets.

…me#859) * Fix check for SVE instructions which caused problems on Windows. Details: - The context initialization for `armsve` was using the HWCAP functionality of Linux to check if SVE instructions are actually available, since these are used to determine the register blocksizes. Naturally, this causes problems on Windows. - Instead, use functions from `bli_cpuid.c` to check for SVE. On Windows, no check is actually done and SVE is never detected. - In the case that the user specifically requests the `armsve` config on Windows, only enable this check for the whole `arm64` family and just assume SVE is available otherwise. * Blacklist armsve on Windows.

Details: - Add tests for the `generic` config, including forcing broadcast-A,B which uses a different reference kernel. This uncovered a number of bugs, especially in `trsm`/`gemmtrsm` reference kernels, as well as diagonal packing. - Move threaded builds into main build and run `make check` once for each enabled backend. - Fix unused variable warnings in level-0 macros. - Fix `bli_tbastbbs_mxn` and add `bli_tcompressbbs_mxn`. The latter was missing from the reference `gemmtrsm` microkernel and is needed since the B11 block is accumulated to but, for complex datatypes, the effective imaginary stride is non-unit if B is broadcast packed. - Run all BLAS tests single-threaded.

Details: - This avoids possible misinterpretation of computation results printed on stdout (thanks Mason McBride for reporting it in flame#864). - Also force space for positive numbers to help with alignment.
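The effect of the format change can be seen with a tiny sketch that formats a single value through `snprintf`. The helper name is hypothetical; the two format strings are assumed from the commit title and the "force space for positive numbers" note (`%4.1f` before, `% 4.3f` after).

```c
#include <assert.h>
#include <stdio.h>

/* Format one matrix element with a caller-supplied printf format.
   Returns the number of characters that would have been written. */
int example_format_elem(char *buf, size_t len, const char *fmt, double v)
{
    assert(buf != NULL && fmt != NULL);
    return snprintf(buf, len, fmt, v);
}
```

With one decimal place, a small value such as 0.04 prints as ` 0.0`, which can be mistaken for an exact zero; with three decimals and the space flag it prints as ` 0.040`, and a negative value of the same magnitude prints as `-0.040`, so columns stay aligned.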

Details: - In some cases, macOS was improperly detected as Windows due to a builtin preprocessor definition `#define TARGET_OS_WINDOWS 0`. - Update the detection to specifically look for `#define _WIN32` which more robustly detects Windows.
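The pitfall can be reproduced in plain C preprocessor terms. The sketch below simulates the macOS situation by defining `TARGET_OS_WINDOWS` as 0 when it is not already defined, then contrasts a defined-ness test with a test for `_WIN32`. The `EXAMPLE_*` names are hypothetical, and the actual build system performs its detection in its own way; this only illustrates why "is the macro defined" is the wrong question for a macro that can be defined as 0.

```c
#include <assert.h>

/* Simulate the situation described above: the macro exists but is 0. */
#ifndef TARGET_OS_WINDOWS
#define TARGET_OS_WINDOWS 0
#endif

/* Fragile: true whenever the macro is defined, even when its value is 0. */
#ifdef TARGET_OS_WINDOWS
#define EXAMPLE_DEFINED_CHECK 1
#else
#define EXAMPLE_DEFINED_CHECK 0
#endif

/* More robust: _WIN32 is predefined by compilers that target Windows. */
#if defined(_WIN32)
#define EXAMPLE_IS_WINDOWS 1
#else
#define EXAMPLE_IS_WINDOWS 0
#endif

/* The defined-ness test fires here regardless of the macro's value. */
static_assert(EXAMPLE_DEFINED_CHECK == 1, "defined-ness check always fires");

int example_defined_check(void) { return EXAMPLE_DEFINED_CHECK; }
int example_is_windows(void) { return EXAMPLE_IS_WINDOWS; }
```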

[ci skip]

BLIS-specific setting of threading takes precedence over OpenMP thread count ICV values, and if the BLIS-specific threading APIs are used, there was no way for the program to revert to OpenMP settings. This patch implements a function bli_thread_reset() to do this. This is similar to that implemented in upstream BLIS in commit 6dcf766.

More specifically, it reverts the internal threading data to that which existed when the program was launched, subject where appropriate to any changes in the OpenMP ICVs. In other words:
- It will undo changes to threading set by previous calls to bli_thread_set_num_threads or bli_thread_set_ways.
- If the environment variable BLIS_NUM_THREADS was used, this will NOT be cleared, as the initial state of the program is restored.
- Changes to OpenMP ICVs from previous calls to omp_set_num_threads() will still be in effect, but can be overridden by further calls to omp_set_num_threads().

Note: the internal BLIS data structure updated by the threading APIs, including bli_thread_reset(), is thread-local to each user (e.g. application) thread.

Example usage:

    omp_set_num_threads(4);
    bli_thread_set_num_threads(7);
    dgemm(...); // 7 threads will be used
    bli_thread_reset();
    dgemm(...); // 4 threads will be used

devinamatthews and others added 30 commits November 10, 2021 12:34

Marked some markdown shell code blocks as 'bash'.

cbc88fe

Details: - Annotated the code blocks that represent shell commands and output as 'bash' in README.md and BuildSystem.md.

Reverted cbc88fe.

74c0c62

Details: - Reverted the annotation of some markdown code blocks with 'bash' after realizing that the in-browser syntax highlighting was not worthwhile.

Merge branch 'dev'

b727645

Brief mention/link to Addons.md in README.md.

a4bc03b

Details: - Add a blurb about the new addons feature to the "Documentation for BLIS developers" section of the README.md, which also links to the Addons.md document.

Minor updates to README.md, docs/Addons.md.

12c66a4

Details: - Add additional mentions of addons to README.md, including in the "What's New" section. - Removed mention of sandboxes from the long list of advantages provided by BLIS. - Very minor description update to opening line of Addons.md.

Added recu-sed.sh script to 'build' directory.

e229e04

Details: - Added a recursive sed script to the 'build' directory.

Re-add BLIS_ENABLE_ZEN_BLOCK_SIZES macro for 'zen'.

961d9d5

Details: - Added previously-deleted cpp macro block to bli_cntx_init_zen.c targeting the Naples microarchitecture that enabled different cache blocksizes when the number of threads exceeds 16. This commit represents PR flame#573.

Evict <arm_sve.h> Requirement for SVE GEMM

08174a2

For 8<= GCC < 10 compatibility.

CREDITS file update.

864bfab

Relax alignment constraints

268ce1f

Remove alignment of temporary AB buffer in edge case handling macros unless alignment is specifically requested (e.g. Core2, SDB/IVB). Fixes flame#595.

Fix row-/column-major pref. in 16x8 haswell sgemm ukr (unused)

81f93be

Updated zen3 macro constant names.

0be9282

Details: - In config/zen3/bli_family_zen3.h, renamed: BLIS_SMALL_MATRIX_A_THRES_M_GEMMT -> _M_SYRK BLIS_SMALL_MATRIX_A_THRES_N_GEMMT -> _N_SYRK Thanks to Jeff Diamond for helping spot the stale _GEMMT naming.

Add armclang detection to configure.

35195bb

armclang is treated as regular clang. Fixes flame#606. [ci skip]

Armv8a, ArmSVE: Simplify Gen-C

b5df181

Fix SVE Compil.

9cc897f

ArmSVE Use Predicate in M-Direction

72089bb

No need to query MR during kernel runtime.

ArmSVE Adopts Label Wrapper

2f3872e

For clang (& armclang?) compilation. Hopefully solves flame#609.

Update CC_VENDOR logic

2674291

Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]

Use -flat_namespace option to link on macOS

5a4d3f5

Fixes flame#611.

fgvanzee and others added 29 commits October 16, 2024 16:45

ReleaseNotes.md update.

4bc4a1c

Details: - Update release notes for flame#841, should have been done in the PR. - [ci skip]

Clarified OMP_NUM_THREADS (flame#835)

534d52b

Details: - Removed/relaxed the deprecation warning for `OMP_NUM_THREADS`. - Clarified how `OMP_NUM_THREADS` is used and added a simple example on how to do different regions of thread-counts.

CREDITS file update.

967d29d

Run full "make check" for SDE tests. (flame#818)

d161545

Details: - Previously, the tests using Intel SDE ran the BLIS testsuite manually. Now, the full `make check` suite is run using SDE as a wrapper for execution.

Blacklist KNL with GCC 15+ (flame#844)

7e8a589

Details: - GCC 15 drops support for Xeon Phi architectures such as KNL. - This PR blacklists the `knl` configuration for GCC 15+.

Increase the max size for stack buffers. (flame#851)

5ad37a8

Details: - See flame#850 for details on the problem. - This is a temporary fix which should work for sdcz data types. - Altra architectures may still not fully work for MP/MD as the stack buffer size is hard-coded.

Update README.md

3c71737

Details: - Add status badge for CircleCI. - [ci skip]

Fix for plugins without explicit optimized kernels.

53d21cb

Examples: replace all 4.1f printm format by 4.3f (flame#865)

5d9e110

Details: - This avoids possible misinterpretation of computation results printed on stdout (thanks Mason McBride for reporting it in flame#864). - Also force space for positive numbers to help with alignment.

Update CREDITS

5097c59

[ci skip]
