Rebase with AOCL5.1 #32

Open
wants to merge 231 commits into
base: master

kvaragan
Collaborator

Rebase with AOCL 5.1

devinamatthews and others added 30 commits November 10, 2021 12:34
Details:
- Renamed herk macrokernels and supporting files and functions to gemmt, 
  which is possible since at the macrokernel level they are identical. 
  Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
  level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
  functions rather than cpp macros that instantiate multiple functions.
  Thanks to Devin Matthews for his efforts on this issue (flame#531).
- Check that the maximum stack buffer size is sufficiently large
  relative to the register blocksizes for each datatype, and do so when
  the context is initialized rather than when an operation is called.
  Note that with this change, users who pass in their own contexts into
  the expert interfaces currently will *not* have any checks performed.
  Thanks to Devin Matthews for suggesting this change.
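The stack-buffer check described above can be pictured with a minimal sketch. All names here (`SK_STACK_BUF_MAX_SIZE`, `sk_blksz_t`, `sk_stack_buf_ok`) and the limit value are hypothetical; the actual BLIS constant, its value, and the exact inequality it tests are not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit in bytes; not the real BLIS constant or value. */
#define SK_STACK_BUF_MAX_SIZE 2048

/* Register blocksizes and element size for one datatype (illustrative). */
typedef struct { size_t mr, nr, elem_size; } sk_blksz_t;

/* Returns 1 if an MR x NR microtile of every listed datatype fits in the
   stack buffer, 0 otherwise. Intended to run once, at context setup. */
static int sk_stack_buf_ok(const sk_blksz_t *dts, size_t n_dts)
{
    assert(dts != NULL);
    for (size_t i = 0; i < n_dts; ++i)
        if (dts[i].mr * dts[i].nr * dts[i].elem_size > SK_STACK_BUF_MAX_SIZE)
            return 0;
    return 1;
}
```

Running the check at initialization, as the commit describes, means a misconfigured blocksize is reported once rather than on every operation call.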
Details:
- Expanded the BLAS compatibility layer to include support for 
  ?axpby_() and ?gemm_batch_(). The former is a straightforward
  BLAS-like interface into the axpbyv operation while the latter
  implements a batched gemm via loops over bli_?gemm(). Also
  expanded the CBLAS compatibility layer to include support for
  cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to 
  the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari
  for submitting these new APIs via flame#566.
- Fixed a long-standing bug in common.mk that for some reason never
  manifested until now. Previously, CBLAS source files were compiled
  *without* the location of cblas.h being specified via a -I flag.
  I'm not sure why this worked, but it may be due to the fact that
  the cblas.h file resided in the same directory as all of the CBLAS
  source, and perhaps compilers implicitly add a -I flag for the
  directory that corresponds to the location of the source file being
  compiled. This bug only showed up because some CBLAS-like source code
  was moved into an 'extra' subdirectory of that frame/compat/cblas/src
  directory. After moving the code, compilation for those files failed
  (because the cblas.h header file, presumably, could not be found in
  the same location). This bug was fixed within common.mk by explicitly
  adding the cblas.h directory to the list of -I flags passed to the
  compiler.
- Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory,
  and updated test/Makefile to build those drivers.
- Fixed typo in error message string in cblas_sgemm.c.
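For readers unfamiliar with axpby, the operation computes y := alpha * x + beta * y. The following is a plain reference sketch, not the BLIS or BLAS-compatibility implementation; the name `sk_daxpby` is hypothetical and only positive strides are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sketch: y := alpha * x + beta * y over n elements.
   incx and incy are element strides; this sketch assumes they are positive. */
static void sk_daxpby(size_t n, double alpha, const double *x, size_t incx,
                      double beta, double *y, size_t incy)
{
    assert(incx > 0 && incy > 0);
    for (size_t i = 0; i < n; ++i)
        y[i * incy] = alpha * x[i * incx] + beta * y[i * incy];
}
```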
Details:
- Implemented a new feature called addons, which are similar to
  sandboxes except that there is no requirement to define gemm or any
  other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
  for requesting an addon be included within a BLIS build. configure now
  outputs the list of enabled addons into config.mk. It also outputs the
  corresponding #include directives for the addons' headers to a new
  companion to the bli_config.h header file named bli_addon.h. Because
  addons may wish to make use of existing BLIS types within their own
  definitions, the addons' headers must be included sometime after that
  of bli_config.h (which currently is #included before bli_type_defs.h).
  This is why the #include directives needed to go into a new top-level
  header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
  build with them, and what assumptions their authors should keep in
  mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
  as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
  functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Details:
- Inserted a new 'Example Code' section into the README.md immediately
  after the 'Getting Started' section. Thanks to Devin Matthews for
  recommending this addition.
- Moved the 'Performance' section of the README down slightly so that it
  appears after the 'Documentation' section.
Details:
- Annotated the code blocks that represent shell commands and output as
  'bash' in README.md and BuildSystem.md.
Details:
- Reverted the annotation of some markdown code blocks with 'bash'
  after realizing that the in-browser syntax highlighting was not
  worthwhile.
Details:
- Added a new 'zen3' subconfiguration targeting support for the AMD Zen3
  microarchitecture (flame#561). Thanks to AMD for this contribution.
- Restructured clang and AOCC support for zen, zen2, and zen3
  make_defs.mk files. The clang and AOCC version detection now happens
  in configure, not in the subconfigurations' makefile fragments. That
  is, we've added logic to configure that detects the version of
  clang/AOCC, outputs an appropriate variable to config.mk
  (ie: CLANG_OT_*, AOCC_OT_*), and then checks for it within the
  makefile fragment (as is currently done for the GCC_OT_* variables).
- Added configure support for a GCC_OT_10_1_0 variable (and associated
  substitution anchor) to communicate whether the gcc version is older
  than 10.1.0, and use this variable to check for recent enough versions
  of gcc to use -march=znver3 in the zen3 subconfig.
- Inlined the contents of config/zen/amd_config.mk into the zen and zen2
  make_defs.mk so that the files are self-contained, harmonizing the
  format of all three Zen-based subconfigurations' make_defs.mk files.
- Added indenting (with spaces) of GNU make conditionals for easier
  reading in zen, zen2, and zen3 make_defs.mk files.
- Adjusted the range of models checked by bli_cpuid_is_zen() (which was
  previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is
  completely disjoint from the models checked by bli_cpuid_is_zen2()
  (0x30 ~ 0xff). This is normally necessary because Zen and Zen2
  microarchitectures share the same family (23, or 0x17), and so the
  model code is the only way to differentiate the two. But in our case,
  fixing the model range for zen *wasn't* actually necessary since we
  checked for zen2 first, and therefore the wide zen range acted like
  the 'else' of an 'if-else' statement. That said, the change helps
  improve clarity for the reader by encoding useful knowledge, which
  was obtained from https://en.wikichip.org/wiki/amd/cpuid .
- Added zen2.def and zen3.def files to the collection in travis/cpuid.
  Note that support for zen, zen2, and zen3 is now present, and while
  all three microarchitectures have identical instruction sets from
  the perspective of BLIS microkernels, they each correspond to
  different subconfigurations and therefore merit separate testing.
  Thanks to Devin Matthews for his guidance in hacking these files as
  slight modifications of zen.def.
- Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh.
  Now, zen, zen2, and zen3 are tested through the SDE via Travis CI
  builds.
- Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils
  repository on GitHub rather than on Intel's website. This change was
  made in an attempt to circumvent recent troubles with Travis CI not
  being able to download the SDE directly from Intel's website via curl.
  Thanks to Devin Matthews for suggesting the idea.
- Updated travis/do_sde.sh to grab the latest version (8.69.1) of the
  Intel SDE from the flame/ci-utils repository.
- Updated .travis.yml to use gcc 9. The file was previously using gcc 8,
  which did not support -march=znver2.
- Created amd64_legacy umbrella family in config_registry for targeting
  older (bulldozer, piledriver, steamroller, and excavator)
  microarchitectures and moved those same subconfigs out of the amd64
  umbrella family. However, x86_64 retains amd64_legacy as a constituent
  member.
- Fixed a bug in configure related to the building of the so-called
  config list. When processing the contents of config_registry,
  configure creates a series of structures and lists that allow for
  various mappings related to configuration families, subconfigs, and
  kernel sets. Two of those lists are built via substitution of
  umbrella families with their subconfig members, and one of those
  lists was improperly performing the substitution in a way that would
  erroneously match on partial umbrella family names. That code was
  changed to match the code that was already doing the substitution
  properly, via substitute_words(). Also added comments noting the
  importance of using substitute_words() in both instances.
- Comment updates.
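The disjoint model ranges described above for family 0x17 can be sketched as follows. This is only an illustration of the range split (0x00 ~ 0x2f for Zen, 0x30 ~ 0xff for Zen2); the names are hypothetical, and any other checks that `bli_cpuid_is_zen()` and `bli_cpuid_is_zen2()` perform are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SK_ARCH_OTHER, SK_ARCH_ZEN, SK_ARCH_ZEN2 } sk_arch_t;

/* Classify a family-0x17 part by its model code. Because the two ranges
   do not overlap, the result does not depend on the order of the checks. */
static sk_arch_t sk_classify_zen(uint32_t family, uint32_t model)
{
    if (family != 0x17) return SK_ARCH_OTHER;
    if (model <= 0x2f)  return SK_ARCH_ZEN;   /* 0x00 ~ 0x2f */
    if (model <= 0xff)  return SK_ARCH_ZEN2;  /* 0x30 ~ 0xff */
    return SK_ARCH_OTHER;
}
```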
Details:
- Replaced the hard-coded calls to double-precision real syr, syr2, 
  syrk, and syr2k in the corresponding standalone test drivers in the 
  'test' directory with conditional branches that will call the 
  appropriate BLAS interface depending on which datatype is enabled. 
  Thanks to Madan mohan Manokar for this improvement.
- CREDITS file update.
Details:
- Add a blurb about the new addons feature to the "Documentation for
  BLIS developers" section of the README.md, which also links to the
  Addons.md document.
Details:
- Add additional mentions of addons to README.md, including in the
  "What's New" section.
- Removed mention of sandboxes from the long list of advantages
  provided by BLIS.
- Very minor description update to opening line of Addons.md.
Details:
- Added a recursive sed script to the 'build' directory.
Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
  .ker_params. These fields store pointers to functions and data that
  will allow the user to more flexibly create custom operations while  
  recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to 
  replace the diagonal offset with panel offsets, and changed strides 
  of both C and P to inc/ldim semantics. Updated object API to the packm
  variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
  definition since it has been replaced by the .pack_fn pointer in the 
  obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
  from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
  bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
  instead of typed pointers, allowing a single function to be used
  regardless of datatype. This obviated having a separate implementation
  in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a 
  new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
  always packed as column-stored row panels -- that is, identically to 
  that of left-hand matrix operands ("A"). This means that while we pack
  matrix A normally, we actually pack B in a transposed state. This
  allowed us to simplify a lot of code throughout the framework, and
  also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
  described above. bli_packm_init()--which is now called from within
  bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
  a bool that indicates whether packing should be performed (or
  skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
  which, among other things, defaults the new .pack_fn field of the 
  obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
  refocuses the view of an object so that it "forgets" any offsets from 
  its original pointer. This function also sets the object's root field 
  to itself. Calls to bli_obj_reset_origin() for each matrix operand
  appear in the _front() functions, after the obj_t's are aliased. This
  resetting of the underlying matrices' origins is needed in preparation
  for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static 
  inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
  libblis_test_pobj_create() to create local packed objects. Previously,
  these packed objects were created by calling lower-level functions.
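The byte-pointer idea behind the rewritten packm variant can be illustrated with a small sketch: one routine packs a matrix of any datatype because it only needs the element size. The function name and signature below are hypothetical and much simpler than the real `bli_packm_blk_var1()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy an m x k matrix with row stride rs and column stride cs (both in
   elements) into a column-stored panel p with leading dimension ldp.
   Only byte pointers and elem_size are used, so the same code serves
   float, double, or any other fixed-size element type. */
static void sk_pack_panel(size_t m, size_t k, size_t elem_size,
                          const void *a, size_t rs, size_t cs,
                          void *p, size_t ldp)
{
    const char *a_bytes = (const char *)a;
    char       *p_bytes = (char *)p;

    assert(ldp >= m);
    for (size_t j = 0; j < k; ++j)
        for (size_t i = 0; i < m; ++i)
            memcpy(p_bytes + (j * ldp + i) * elem_size,
                   a_bytes + (i * rs + j * cs) * elem_size,
                   elem_size);
}
```

Packing a row-stored operand this way yields a column-stored panel, which is the layout the commit describes for both A and (transposed) B.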
Details:
- Added previously-deleted cpp macro block to bli_cntx_init_zen.c 
  targeting the Naples microarchitecture that enabled different cache 
  blocksizes when the number of threads exceeds 16. This commit 
  represents PR flame#573.
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels' 
  directory to take m and n dimensions, and implemented edge-case 
  handling within those microkernels via a collection of new C 
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride 
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of 
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather than one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params 
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
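A rough sketch of what "edge-case handling inside the microkernel" can look like: the kernel always computes a full MR x NR product into a temporary buffer, then writes back only the m x n part. The names, the fixed 4 x 4 blocksizes, and the packed layouts below are assumptions for illustration; the real kernels use the macros in bli_edge_case_macro_defs.h, which are not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

enum { SK_MR = 4, SK_NR = 4 };  /* hypothetical register blocksizes */

/* C[0:m,0:n] := beta * C + alpha * A * B, with m <= MR and n <= NR.
   a holds k packed columns of MR elements; b holds k packed rows of NR
   elements (both zero-padded to full width by the packing step). */
static void sk_dgemm_ukr(size_t m, size_t n, size_t k,
                         double alpha, const double *a, const double *b,
                         double beta, double *c, size_t rs_c, size_t cs_c)
{
    double ab[SK_MR * SK_NR] = { 0 };  /* temporary full microtile */

    assert(m <= SK_MR && n <= SK_NR);
    for (size_t l = 0; l < k; ++l)
        for (size_t j = 0; j < SK_NR; ++j)
            for (size_t i = 0; i < SK_MR; ++i)
                ab[i + j * SK_MR] += a[l * SK_MR + i] * b[l * SK_NR + j];

    /* Write back only the m x n region; general strides come for free. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double *cij = &c[i * rs_c + j * cs_c];
            *cij = (beta == 0.0) ? alpha * ab[i + j * SK_MR]
                                 : beta * (*cij) + alpha * ab[i + j * SK_MR];
        }
}
```

Because the write-back loop already takes arbitrary `rs_c` and `cs_c`, the same path covers general-stride output, which matches the commit's note about removing the separate general-stride assembly.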
For 8<= GCC < 10 compatibility.
Add `%=` tag to branch labels, which expands to a unique identifier for each inline assembly block. This prevents duplicate symbol errors on Apple Silicon (flame#594). Fixes flame#594. [ci skip] since we can't test Apple Silicon anyways...
Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
  microkernel calling sequence, which now includes m and n dimensions so
  that the microkernel has all the information necessary to handle edge
  cases. Thanks to Jeff Diamond for catching this, which ideally would
  have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
  removed their corresponding prototypes. In both cases, var2 was a
  variant of the block-panel algorithm where edge-case handling was
  abstracted away to a microkernel wrapper. (Since this is now the
  official behavior of BLIS microkernels, I saw no need to have it
  included as a separate code path.)
- Comment updates.
Remove alignment of temporary AB buffer in edge case handling macros unless alignment is specifically requested (e.g. Core2, SDB/IVB). Fixes flame#595.
@egaudry and I both saw this issue on Linux with Clang 10.

```
Compiling obj/thunderx2/kernels/armv8a/3/sup/bli_gemmsup_rv_armv8a_asm_d4x8m.o ('thunderx2' CFLAGS for kernels)
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:171:49: fatal error: invalid symbol redefinition
        "                                            \n\t"
                                                       ^
<inline asm>:90:5: note: instantiated into assembly here
           .SLOOPKITER:
           ^
1 error generated.
```

Signed-off-by: Jeff Hammond
Details:
- In config/zen3/bli_family_zen3.h, renamed:
    BLIS_SMALL_MATRIX_A_THRES_M_GEMMT -> _M_SYRK
    BLIS_SMALL_MATRIX_A_THRES_N_GEMMT -> _N_SYRK
  Thanks to Jeff Diamond for helping spot the stale _GEMMT naming.
armclang is treated as regular clang. Fixes flame#606. [ci skip]
No need to query MR during kernel runtime.
For clang (& armclang?) compilation.

Hopefully solves flame#609 .
Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]
fgvanzee and others added 29 commits October 16, 2024 16:45
Details:
- Added a new option to 'configure' that allows the user to specify a
  list of symbols to omit from the library. The format of the option is
  --omit-symbols=LIST where LIST is a comma-separated list of symbol
  names (excluding any trailing underscore). This list is parsed into
  a list of #define directives that causes the relevant parts of BLIS
  to be ignored (or not). As such, the nature of this option is to only
  support omitting symbols which have been pre-identified as potential
  troublemakers when linking BLIS with other libraries such as LAPACK
  or ScaLAPACK. (This list may grow in the future as additional symbols
  are identified.) Note: we leave lsame_() and xerbla_() prototypes 
  enabled even when their respective symbols are omitted from the 
  library.
- Re-implemented the --enable-scalapack-compat configure option to
  utilize the underlying --omit-symbols=LIST infrastructure.
- Implemented an --enable-lapack-compat option, which omits all of the
  known problematic symbols currently supported for omission.
- This commit addresses Issue flame#816. Thanks to Timo Betcke for bringing
  it to our attention and to Devin Matthews for his advice and for
  his initial implementation of --enable-scalapack-compat (PR flame#813).
- CREDITS file update.
Details:
- Replace all assembly kernels in the `sifive_x280` kernel set with intrinsic versions.
- Fixes bug encountered in flame#805.
- Update the RISC-V toolchain used in CI testing.
- Special thanks to Michael Yeh (@myeh01) and SiFive.
Details:
- Search for Intel ifx and NVIDIA/PGI Fortran compilers.
- Correctly determine the Fortran compiler vendor for Intel ifx and NVIDIA/PGI compilers.
- Determine the compiler version and correct Fortran complex return type for NVIDIA/PGI.
Add documentation for the plugin system and for modifying the control tree to make custom operations.

Details:
- `docs/PluginHowTo.md` describes in a "tutorial style" how to implement a custom BLAS-like operation by creating a plugin and then modifying the `gemm` control tree to achieve the desired effect.
- Briefly, plugins allow users to add new kernels and associated block sizes/preferences to BLIS without modifying the BLIS source code. User-provided kernels are compiled using the BLIS build system for configured architectures and selected at runtime based on the actual hardware.
- To implement custom operations, users can combine their own kernels (and/or existing BLIS kernels) with a customized control tree, which represents the specific algorithmic steps. Users can customize the kernels to be used for packing and for computation, extra information passed to kernels (e.g. additional parameters or data), block sizes, etc. An API is provided for modifying the default `gemm` control tree (also used for other level-3 operations, except `trsm`).
…lame#841)

Details:
- Currently, all enums used to represent built-in kernel IDs, blocksizes, preferences, and operation IDs have a special member equal to `BLIS_VA_END`, which in turn is `(siz_t)-1`. In principle, this would force the underlying type used to represent the enum values to be as wide as `siz_t`, particularly when passed to the variadic function `bli_cntx_set_ukrs` and friends. User-registered kernel IDs and such are of type `siz_t` explicitly. However, gcc (12 and older), clang, and icx pass literal enum constants (e.g. `BLIS_MR`) that are small enough as `int` when 32-bit mode is used (`-m32`). This causes a misalignment of the parameters on the stack and ultimately a segfault. The problem also exists in 64-bit mode with clang and icx and on aarch64 with clang, as parameters far enough down the list to go on the stack do not get the upper 4 bytes initialized.
- This commit introduces a new type `kerid_t` which is always `uint32_t`. This type is used for all kernel, blocksize, preference, and operation IDs (including user-registered ones). It is also used for `BLIS_VA_END`.
- Now all enum values are always passed as 32-bit ints on all architectures.
- Fixes flame#839.
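The fix described above amounts to making every variadic ID argument the same fixed width on both the caller and callee side. The sketch below shows that pattern with hypothetical names (`sk_kerid_t`, `SK_VA_END`, `sk_count_ids`); it is not the signature of `bli_cntx_set_ukrs`, and it assumes `uint32_t` is not narrower than `int`, which holds on common platforms.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

typedef uint32_t sk_kerid_t;             /* stand-in for the new kerid_t */
#define SK_VA_END ((sk_kerid_t)-1)       /* sentinel, same width as the IDs */

/* Count IDs in a sentinel-terminated variadic list. Each argument is read
   back with exactly the type the caller passed, so the stack layout matches. */
static unsigned sk_count_ids(sk_kerid_t first, ...)
{
    unsigned count = 0;
    va_list  ap;

    va_start(ap, first);
    for (sk_kerid_t id = first; id != SK_VA_END; id = va_arg(ap, sk_kerid_t))
        ++count;
    va_end(ap);
    return count;
}
```

Callers should pass values already typed as `sk_kerid_t` (for example via a cast or typed constants) rather than bare enum literals, so caller and callee agree on the argument width.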
Details:
- Rename `RELEASING` to `RELEASING.md`.
- Add additional structure and Markdown notation to `RELEASING.md`.
- Add a section on the overall release and branching strategy.
- Clarify and tweak instructions for making release candidates and releases.
- Add instructions for making point releases and back-porting bug fixes.
- Rename `build/start-new-rc.sh` to `build/do-release.sh`.
- Tweak `do-release.sh` to do only common tasks for rcs, major releases, and point releases.
- Add `-b` option to `do-release.sh` which does a "bare" release without a new branch or tag (for "dev releases" on master).
- Update the version file on `master` to `3.0-dev` to reflect the new guidelines.
Details:
- Update release notes for flame#841, should have been done in the PR.
- [ci skip]
Details:
- Removed/relaxed the deprecation warning for `OMP_NUM_THREADS`.
- Clarified how `OMP_NUM_THREADS` is used and added a simple example showing how to use different thread counts in different regions.
Details:
- Implemented an option (`-i LIST`) to `gen-make-frag.sh` that allows the caller to optionally ignore additional directories when walking the source directory. (Note that previously the standard -- and only -- way to ignore directories was to add them to the `ignore_list` file, which is a required argument to the script.)
- I implemented this feature for something but then ended up not needing it, but figured it might be helpful in the future.
- Multiple `-i` options are allowed.
Details:
- Added a `sifive_rvv` configuration which is `VLEN`-agnostic but takes advantage of optimized microkernels for SiFive (and other) RISC-V architectures.
- This configuration does not currently participate in automatic configuration selection during BLIS configure.
- `VLEN` is detected at runtime to properly make use of available vectorization.
Details:
- Previously, the tests using Intel SDE ran the BLIS testsuite manually. Now, the full `make check` suite is run using SDE as a wrapper for execution.
Details:
- Fixes to the documentation:
    1. Some integer-based types were missed.
    2. Some function parameters were missed.
    3. Many interfaces were missing `const`.
- Improved formatting and consistency, removed trailing whitespace.
- Added several missed global constants.
…ally found successfully. (flame#842)

Details:
- If the examples are built out-of-tree then `BLIS_INSTALL_PATH` needs to be set to find the header, library, and build system files. Also, if the examples are attempted to be built before configuring blis then `common.mk` will be missing.
- Current behavior silently ignores the failed import of `common.mk` which causes various difficult-to-diagnose problems.
- The Android/Bionic detection in common.mk has also been changed to not rely on an external file. This allows examples to be compiled in isolation.

Details:
- When building examples out-of-tree (or potentially other external code using `common.mk`), `DIST_PATH` will not be set and so `common.mk` will not be able to locate `build/detect/android/bionic.h`, causing a compiler error in some cases.
- This has been fixed by including the contents of `bionic.h` in the shell statement executing the compiler check.
- Fixes flame#840.
Details:
- GCC 15 drops support for Xeon Phi architectures such as KNL.
- This PR blacklists the `knl` configuration for GCC 15+.
)

Details:
- Alias `?gemmt_` as `?gemmtr_` to fix lapack 3.12.1 compatibility. (Fixes flame#848)
- Add the `?gemmtr_` and `cblas_?gemmtr` aliases to symbol list.
- Also alias `cblas_?gemmt` as `cblas_?gemmtr` for lapack 3.12.1 compatibility.
Details:
- See flame#850 for details on the problem.
- This is a temporary fix which should work for sdcz data types.
- Altra architectures may still not fully work for MP/MD as the stack buffer size is hard-coded.
Details:
- clang 14.0.0 apparently makes some invalid assumptions about whether
  or not the AB microtile is initialized in the `gemm` reference
  microkernel. This leads to the "scale by alpha" part doing something
  strange (all sorts of random and even NaN values pop up). I do not
  know why this only manifested for `ztrsm` on `skx` (in
  `zgemm_skx_ref` via `zgemmtrsm_skx_ref`). See flame#852.
- Aliasing the AB microtile (in the proper datatype) as a pointer to
  a raw character array, and then initializing the character array
  with `= { 0 }` convinces the compiler to do the right thing.
- The problem did not occur in 14.0.6 or 15.0.7. It may only be a narrow
  band of versions which are problematic.
- This commit adds the char array workaround and fixes flame#852.
Details:
- This PR adds CircleCI testing in addition to TravisCI and Appveyor.
- All of the same tests as on Travis are run, except that different hardware typically ends up being used (usually Zen on Travis, Xeon Platinum on Circle). This has actually exposed a couple of bugs (see flame#850 and flame#852).
- The `travis` directory has been renamed to `ci` as it is now shared.
- Running SDE on CircleCI is a bit problematic because glibc changed how CPUID detection is done. This requires running some architectures with different hardware definition files and forcing a config via `BLIS_ARCH_TYPE`.
Details:
- The BLAS/CBLAS function `?gemmtr` is currently implemented as a symbol alias of the already-existing `?gemmt`. This does not work on macOS/Darwin.
- Instead, use a minimal wrapper function which calls the appropriate existing BLAS/CBLAS function.
- Also clean up the CBLAS prototypes a bit.
Details:
- Add status badge for CircleCI.
- [ci skip]
Details:
- Developed by @fgvanzee and @devinamatthews.
- Level-0 scalar macros have moved from a name-based system (e.g. `bli_dcopys( ... )`) to a macro argument-based system (`bli_tcopys( d,d, ... )`).
- All macros are explicitly mixed-type.
- All input and output operands can have a distinct type (precision and/or domain). Unnecessary computations and spurious NaN/Inf propagation are avoided in mixed-domain cases.
- All macros which do math (i.e. not copy/set/etc.) take an additional computational precision.
- Tile-level macros, 1m, broadcast-B, and other extensions are also included.
- All macros should correctly handle aliasing of input and output operands (this needs to be rigorously checked).
- The macros work generically over the defined types -- new types only need limited support (primarily conversion to other types and basic math).
- For code outside of core BLIS (optimized kernels, sandboxes, etc.), a selection of legacy macros have been added which translate to the new level-0 macros. Behavior is unchanged.
- A standalone, templated C++ testsuite for the level-0 macros has been added. It is currently included as part of the CircleCI tests.
- Const-correctness of level-0 macros is also checked.
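To give a feel for the argument-based style, here is a toy sketch in which type characters are passed as macro arguments and resolved by token pasting. The `sk_` macros, their argument order, and the two supported types are all hypothetical; the real `bli_t*` macros are not reproduced in this log and may differ substantially.

```c
#include <assert.h>

/* Map a type character to a C type (real domain only in this sketch). */
#define sk_ctype_s float
#define sk_ctype_d double

/* y := x, converting from type chx to type chy. The temporary keeps the
   macro safe when x and y name the same object. */
#define sk_tcopys(chx, chy, x, y) \
    do { sk_ctype_##chx sk_tmp_ = (x); (y) = (sk_ctype_##chy)sk_tmp_; } while (0)

/* y := y + alpha * x, evaluated in computational precision chc. */
#define sk_taxpys(cha, chx, chy, chc, alpha, x, y) \
    do { \
        sk_ctype_##chc sk_acc_ = (sk_ctype_##chc)(alpha) * (sk_ctype_##chc)(x) \
                               + (sk_ctype_##chc)(y); \
        (y) = (sk_ctype_##chy)sk_acc_; \
    } while (0)
```

For example, `sk_tcopys(s, d, xs, yd)` copies a `float` into a `double`, and `sk_taxpys(d, s, s, d, 2.0, xs, ys)` updates a `float` while accumulating in `double`.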
Details:
- When adjusting the buffer to point to the first imaginary element, the function `bli_obj_buffer_at_off` was used, which includes any currently set offsets, but then `bli_obj_set_buffer` was used, which sets the buffer *before* offsets are applied.
- Now a matching `bli_obj_buffer` call is used to avoid any offsets.
…me#859)

* Fix check for SVE instructions which caused problems on Windows.

Details:
- The context initialization for `armsve` was using the HWCAP functionality of Linux to check if SVE instructions are actually available, since these are used to determine the register blocksizes. Naturally, this causes problems on Windows.
- Instead, use functions from `bli_cpuid.c` to check for SVE. On Windows, no check is actually done and SVE is never detected.
- In the case that the user specifically requests the `armsve` config on Windows, only enable this check for the whole `arm64` family and just assume SVE is available otherwise.

* Blacklist armsve on Windows.
Details:
- Add tests for the `generic` config, including forcing broadcast-A,B which uses a different reference kernel. This uncovered a number of bugs, especially in `trsm`/`gemmtrsm` reference kernels, as well as diagonal packing.
- Move threaded builds into main build and run `make check` once for each enabled backend.
- Fix unused variable warnings in level-0 macros.
- Fix `bli_tbastbbs_mxn` and add `bli_tcompressbbs_mxn`. The latter was missing from the reference `gemmtrsm` microkernel and is needed since the B11 block is accumulated to but, for complex datatypes, the effective imaginary stride is non-unit if B is broadcast packed.
- Run all BLAS tests single-threaded.
Details:
- This avoids possible misinterpretation of computation results printed on stdout (thanks Mason McBride for reporting it in flame#864).
- Also force space for positive numbers to help with alignment.
Details:
- In some cases, macOS was improperly detected as Windows due to a builtin preprocessor definition `#define TARGET_OS_WINDOWS 0`.
- Update the detection to specifically look for `#define _WIN32` which more robustly detects Windows.
[ci skip]
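One way such a misdetection can arise is a test that only asks whether a macro exists, not what its value is. The sketch below assumes that kind of existence test, which this log does not confirm; `SK_TARGET_OS_WINDOWS` is a hypothetical stand-in for the builtin definition mentioned above.

```c
#include <assert.h>

/* Mimic a platform that predefines the macro with the value 0. */
#define SK_TARGET_OS_WINDOWS 0

/* Existence test: reports "Windows" even though the value is 0. */
static int sk_windows_by_ifdef(void)
{
#ifdef SK_TARGET_OS_WINDOWS
    return 1;
#else
    return 0;
#endif
}

/* Value test: correctly reports "not Windows" for this macro. */
static int sk_windows_by_value(void)
{
#if SK_TARGET_OS_WINDOWS
    return 1;
#else
    return 0;
#endif
}

/* _WIN32 is only defined by compilers targeting Windows. */
static int sk_windows_by_win32(void)
{
#ifdef _WIN32
    return 1;
#else
    return 0;
#endif
}
```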
sireeshasanga pushed a commit that referenced this pull request Jun 26, 2025
BLIS-specific setting of threading takes precedence over OpenMP
thread count ICV values, and if the BLIS-specific threading APIs
are used, there was no way for the program to revert to OpenMP
settings. This patch implements a function bli_thread_reset() to
do this. This is similar to that implemented in upstream BLIS in
commit 6dcf766

More specifically, it reverts the internal threading data to that
which existed when the program was launched, subject where appropriate
to any changes in the OpenMP ICVs. In other words:
- It will undo changes to threading set by previous calls to
  bli_thread_set_num_threads or bli_thread_set_ways.
- If the environment variable BLIS_NUM_THREADS was used, this will
  NOT be cleared, as the initial state of the program is restored.
- Changes to OpenMP ICVs from previous calls to omp_set_num_threads()
  will still be in effect, but can be overridden by further calls to
  omp_set_num_threads().

Note: the internal BLIS data structure updated by the threading APIs,
including bli_thread_reset(), is thread-local to each user
(e.g. application) thread.

Example usage:
omp_set_num_threads(4);
bli_thread_set_num_threads(7);
dgemm(...); // 7 threads will be used
bli_thread_reset();
dgemm(...); // 4 threads will be used
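
The precedence and reset behavior described in this commit message can be modeled with a toy sketch. Everything below is hypothetical (the `sk_` names, the single-integer state, the OpenMP stand-in); it does not call BLIS or OpenMP and only mirrors the rules stated above.

```c
#include <assert.h>

/* Stand-in for the OpenMP thread-count ICV. */
static int sk_omp_nt = 1;

/* Per-thread BLIS-specific state; a value <= 0 means "not set". */
static _Thread_local int sk_launch_nt = -1;  /* e.g. from BLIS_NUM_THREADS */
static _Thread_local int sk_blis_nt   = -1;

static void sk_omp_set_num_threads(int n)    { sk_omp_nt = n; }
static void sk_thread_set_num_threads(int n) { sk_blis_nt = n; }

/* Revert to the launch-time state; the OpenMP value is left untouched. */
static void sk_thread_reset(void)            { sk_blis_nt = sk_launch_nt; }

/* A BLIS-specific setting takes precedence over the OpenMP value. */
static int sk_effective_threads(void)
{
    return (sk_blis_nt > 0) ? sk_blis_nt : sk_omp_nt;
}
```

With this model, setting the OpenMP value to 4 and the BLIS-specific value to 7 gives 7 effective threads, and a reset falls back to 4, matching the example usage in the commit message.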