
Xcvmem #287

Open
wants to merge 576 commits into
base: main

Conversation

realqhc
Owner

@realqhc realqhc commented Jan 2, 2024

No description provided.

@ChunyuLiao ChunyuLiao force-pushed the xcvmem branch 2 times, most recently from 795a445 to 26f3a72 Compare January 5, 2024 06:51
jerryzj and others added 28 commits May 30, 2024 19:18
In -convert-vector-to-arm-sme the permutation_map is explicitly checked
for transpose when converting xfer ops, but for 2-D vector types the
only non-identity permutation map is transpose so this can be
simplified.
…llvm#93795)

The current way of lowering `llvm.clear_cache` is a bit unusual. As
suggested by Matt Arsenault we are better off using an ISD node.

This change introduces a new `ISD::CLEAR_CACHE`, registers a new libcall
by default named `__clear_cache` and the default legalisation is a
libcall.

This is preparatory work for a custom lowering of `ISD::CLEAR_CACHE`
needed by RISC-V on some platforms.
…t` (llvm#90754)

The "reduction" clause is not allowed on the "target" construct.
…m#93772)

If present, the optional second argument of the ieee_exceptions
intrinsic module procedure ieee_support_flag may be either a scalar or
an array. Change the signature of the routine that implements this
function so that it is processed as a transformational function, not an
elemental function, which accounts for this argument variant.
The `target reduction` combination is no longer accepted.
Disable the test to avoid build failures, until a better fix is ready.
…container-size-empty (llvm#93724)

Verify that size/length methods are called with no arguments.

Closes llvm#88203
Better design to put semantics on the ops, and in this case the ntt/intt
op can lower in multiple ways depending on the polynomial ring modulus
(it can need an nth root of unity for cyclic polymul -> ntt, or a 2nth
root for negacyclic polymul -> ntt)

---------

Co-authored-by: Jeremy Kun <[email protected]>
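The distinction between the two roots can be checked numerically. The following is a minimal sketch, not the dialect's lowering: it assumes a small prime modulus (17) and n = 4, and the helper name `hasExactOrder` is made up for illustration. A primitive n-th root of unity is what a cyclic product (mod x^n - 1) needs, while a negacyclic product (mod x^n + 1) needs a primitive 2n-th root.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when `w` has multiplicative order exactly `order` modulo `mod`,
// i.e. w^order == 1 and no smaller positive power of w equals 1.
// Sketch only: assumes values small enough that acc * w does not overflow.
bool hasExactOrder(std::uint64_t w, std::uint64_t order, std::uint64_t mod) {
  std::uint64_t acc = 1;
  for (std::uint64_t k = 1; k <= order; ++k) {
    acc = (acc * w) % mod;
    if (acc == 1)
      return k == order; // first time we hit 1 must be at k == order
  }
  return false; // w^order != 1
}

// Squaring a primitive 2n-th root gives a primitive n-th root.
std::uint64_t squareMod(std::uint64_t w, std::uint64_t mod) {
  return (w * w) % mod;
}
```

With modulus 17 and n = 4, the value 4 is a primitive 4th root (usable for the cyclic case) and 2 is a primitive 8th root (usable for the negacyclic case); 2 squared gives 4.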
…93816)

Reduce code bloat by checking test requirements in a common test fixture
…#93606)

Back to back `linalg.transpose` can be rewritten to a single transpose
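
Why two transposes collapse into one can be sketched with plain permutation vectors. This is an illustration only, not the MLIR pattern itself: it assumes the convention that result dimension `i` takes input dimension `perm[i]`, and the names `applyPerm` and `composePerms` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Perm = std::vector<std::size_t>;

// Assumed convention: result dimension i is input dimension perm[i].
Perm applyPerm(const Perm &dims, const Perm &perm) {
  Perm out(perm.size());
  for (std::size_t i = 0; i < perm.size(); ++i)
    out[i] = dims[perm[i]];
  return out;
}

// Fold "transpose by `first`, then transpose by `second`" into one permutation.
Perm composePerms(const Perm &first, const Perm &second) {
  assert(first.size() == second.size());
  Perm combined(second.size());
  for (std::size_t i = 0; i < second.size(); ++i)
    combined[i] = first[second[i]];
  return combined;
}
```

Applying `first` and then `second` to a shape gives the same result as applying `composePerms(first, second)` once, which is what lets the intermediate transpose be dropped.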
…ring (llvm#93592)

Handle lowering of a non-optional inquired argument in custom lowering.
Also fix an issue in the lowering of an associated optional argument where
a box was emboxed again, which led to a weird result.
macOS will automatically load dependent modules when creating a target,
resulting in more modules than the test expects.
kazutakahirata and others added 29 commits June 1, 2024 10:36
…en not poison

Since llvm#93182 we can call computeKnownBits inside getValidMaximumShiftAmount to determine the bounds of the shift amount, which ensures that it isn't poison. If we did freeze the shift amount, isGuaranteedNotToBeUndefOrPoison would then fail, as we can't call computeKnownBits through FREEZE for potentially poison values.

I'm still reducing a decent test case but wanted to get the buildbot fix ASAP.
…llvm#88712)

This commit adds an API (`tileAndFuseConsumerOfSlice`) to fuse a consumer into a producer within
an scf.for/scf.forall loop.

To support this, two new methods are added to the `TilingInterface`:
- `getIterationDomainTileFromOperandTile`
- `getTiledImplementationFromOperandTile`.

Consumer operations that implement these methods can be fused with tiled producer operands in a manner similar to (but essentially the inverse of) the fusion of an untiled producer with a tiled consumer.

Note that this only does one `tiled producer` -> `consumer` fusion. It can be called repeatedly to fuse multiple consumers. The current implementation is also conservative about when it kicks in (for example, it requires a single use of the value returned by the inter-tile loops that surround the tiled producer). These restrictions can be relaxed over time.

Signed-off-by: Abhishek Varma <[email protected]>

---------

Signed-off-by: Abhishek Varma <[email protected]>
Signed-off-by: Abhishek Varma <[email protected]>
Co-authored-by: cxy <[email protected]>
…vm#92783)

In `atomic::wait`, when we call the platform wait `ulock_wait`, we are
using UL_COMPARE_AND_WAIT. But we should use UL_COMPARE_AND_WAIT64
instead, as the address we are waiting on holds a 64-bit integer.

Fixes llvm#85107

It is rather hard to test directly because in `atomic::wait`, before
calling into the platform wait, our C++ code has some poll logic which
checks whether the value has changed. Thus in this patch, the test uses
the internal function.
Also drop errant header include from `Linalg` dialect into
`Dialect/SCF/Transforms/TileUsingInterface.cpp`
We can convert this to a select based on the `(icmp eq X, C)`, then
constant fold the addition, with the true arm being `(add C, (sext/zext 1))`
and the false arm being `(add X, 0)`, e.g.

    - `(select (icmp eq X, C), (add C, (sext/zext 1)), (add X, 0))`.

This is essentially a specialization of the only case that seems to
actually show up from llvm#89020

Closes llvm#93840
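
The equivalence behind this fold can be checked on plain integers. The sketch below assumes the original expression has the shape `X + ext(X == C)` on a 32-bit wrapping integer; the function names are made up for illustration and this is not the InstCombine code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed original form with zext: the i1 compare result extends to 0 or 1.
std::uint32_t addZextCmp(std::uint32_t x, std::uint32_t c) {
  return x + static_cast<std::uint32_t>(x == c);
}

// Assumed original form with sext: a true i1 extends to all ones (-1).
std::uint32_t addSextCmp(std::uint32_t x, std::uint32_t c) {
  return x + (x == c ? 0xFFFFFFFFu : 0u);
}

// Select form: the true arm folds to a constant (C + 1 or C - 1),
// the false arm is X + 0, which is just X.
std::uint32_t selectZext(std::uint32_t x, std::uint32_t c) {
  return x == c ? c + 1u : x;
}

std::uint32_t selectSext(std::uint32_t x, std::uint32_t c) {
  return x == c ? c - 1u : x;
}
```

Both forms agree whether or not `X == C`, including at the wrap-around boundaries, so the add can be replaced by a select whose true arm is a constant.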
)

Port SelectionDAG isel to the new pass manager.
Only `AMDGPU` and `X86` support the new pass version. In the new pass manager, `-verify-machineinstrs` belongs to the verify instrumentation and is enabled by default.
…llvm#94146)

This reverts commit de37c06 to
de37c06

It still breaks EXPENSIVE_CHECKS build. Sorry.
We were missing the PoisonOnly argument, so Depth + 1 was being passed in its place and the default Depth = 0 argument was then silently used.
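
This class of bug is easy to reproduce in isolation. The sketch below uses a hypothetical stand-in function, `recordArgs`, with a `bool` parameter followed by a defaulted depth; the parameter order is an assumption based on the description above, and nothing here is the real SelectionDAG API.

```cpp
#include <cassert>

// What the callee actually received.
struct QueryArgs {
  bool poisonOnly;
  unsigned depth;
};

// Hypothetical stand-in: a bool flag followed by a defaulted depth.
QueryArgs recordArgs(bool poisonOnly = false, unsigned depth = 0) {
  return {poisonOnly, depth};
}

// Buggy call shape: the depth expression converts implicitly to bool,
// so it binds to poisonOnly and depth silently keeps its default of 0.
QueryArgs buggyCall(unsigned depth) { return recordArgs(depth + 1); }

// Fixed call shape: the flag is passed explicitly, so depth lands where intended.
QueryArgs fixedCall(unsigned depth) { return recordArgs(false, depth + 1); }
```

Because the integer-to-bool conversion is implicit, the buggy call compiles without an error, which is why the mistake went unnoticed.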

Fixes llvm#94145 and serves as the test case for 9e22c7a
…eship coverage (llvm#94120)

Three unrelated, small improvements:

* `test_macros.h` was incorrectly saying `__has_include("<version>")`
instead of `__has_include(<version>)`.
  + This caused `<ciso646>` to always be included (noticed because MSVC's
STL emitted a deprecation warning).
  + I searched all of LLVM and found no other occurrences.
* `thread.condition.condvarany/wait_for_pred.pass.cpp` forgot to test
anything.
  + I followed what `wait_for.pass.cpp` is testing.
* Uncomment spaceship test coverage.
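
The difference between the two `__has_include` spellings in the first item can be sketched as follows. The `SKETCH_*` macro names and the two constants are made up for illustration; whether `<version>` itself is found depends on the toolchain, so the sketch records the result rather than assuming it.

```cpp
#include <cassert>

// __has_include(<version>) asks whether the header <version> can be found.
// __has_include("<version>") instead looks for a file whose name literally
// includes the angle brackets, which normally does not exist.
#ifdef __has_include
#  if __has_include(<version>)
#    include <version>
#    define SKETCH_ANGLE_FORM 1
#  else
#    define SKETCH_ANGLE_FORM 0
#  endif
#  if __has_include("<version>")
#    define SKETCH_QUOTED_FORM 1
#  else
#    define SKETCH_QUOTED_FORM 0
#  endif
#else
#  define SKETCH_ANGLE_FORM 0
#  define SKETCH_QUOTED_FORM 0
#endif

constexpr bool kAngleFormFound = SKETCH_ANGLE_FORM != 0;
constexpr bool kQuotedFormFound = SKETCH_QUOTED_FORM != 0;
```

With the quoted spelling the check is expected to fail even on toolchains that ship `<version>`, which matches the fallback header always being included.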
I noticed that these tests had empty `main` functions. Dropping them and
renaming the tests to `MEOW.compile.pass.cpp` will slightly improve test
throughput.
There is no reason to give any of the functions C linkage. This makes
all of the libc++ functions have C++ linkage, removing the need for
`_LIBCPP_HIDE_FROM_ABI_C`.
The type of *Iter here is "const IndexedMemProfRecord &" as defined in
RecordLookupTrait.  Assigning *Iter to a variable of type
"const IndexedMemProfRecord &" avoids a copy, reducing the cycle and
instruction counts by 1.8% and 0.2%, respectively, with
"llvm-profdata show" modified to deserialize all MemProfRecords.

Note that RecordLookupTrait has an internal copy of
IndexedMemProfRecord, so we don't have to worry about a dangling
reference to a temporary.
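
The saved copy can be illustrated with a copy-counting type. This is a minimal sketch with hypothetical names (`Record`, `Lookup`, `copiesMadeBy`); it stands in for the pattern described above and does not use the MemProf classes.

```cpp
#include <cassert>

int gCopies = 0; // counts copy-constructor calls

struct Record {
  int value;
  explicit Record(int v) : value(v) {}
  Record(const Record &other) : value(other.value) { ++gCopies; }
};

// Hypothetical lookup helper that owns a record and returns a const reference,
// so the reference stays valid for as long as the helper lives.
class Lookup {
  Record stored{42};

public:
  const Record &get() const { return stored; }
};

// Binding to a value copies the record; binding to a const reference does not.
int copiesMadeBy(bool useReference) {
  Lookup lookup;
  gCopies = 0;
  int seen = 0;
  if (useReference) {
    const Record &r = lookup.get(); // no copy
    seen = r.value;
  } else {
    Record r = lookup.get(); // one copy
    seen = r.value;
  }
  assert(seen == 42);
  return gCopies;
}
```

The reference is safe here only because the helper keeps its own copy of the record, which mirrors the note about the internal copy above.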
This avoids the pitfall where we set the uwtable to none:
```
func.setUWTableKind(llvm::UWTableKind::None)
```
`Attribute::getAsString()` would see an unknown attribute and fail an
assertion. In this patch, we assert that we do not see a None uwtable
kind.

This also skips the check of `UWTableKind::Async`. It is dominated by
the check of `UWTableKind::Default`, which has the same enum value
(nfc).
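
Why the `Async` check is redundant can be sketched with a local stand-in enum. Only the fact that `Async` and `Default` share a value comes from the text above; the other numeric values, the name `TableKind`, and the returned labels are assumptions for illustration.

```cpp
#include <cassert>
#include <string>

// Local stand-in enum; Default aliases Async by sharing its value.
enum class TableKind { None = 0, Sync = 1, Async = 2, Default = 2 };

// Sketch of a printer that rejects None up front and needs no separate
// Async branch, because the Default comparison already matches it.
std::string describe(TableKind kind) {
  assert(kind != TableKind::None && "None must not reach the printer");
  if (kind == TableKind::Default) // also true for TableKind::Async
    return "default/async";
  return "sync";
}
```

Since both enumerators compare equal, a second branch for `Async` after the `Default` check could never be reached.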
Each `LoopBlock` stored in `LoopWorkList` consists of a basic block and its
loop data information. When iterating over `LoopWorkList`, if the estimated
weight of a loop is not stored in `EstimatedLoopWeight`, `getLoopExitBlocks()`
is called to get all exit blocks of the loop. The estimated weight of a
loop is calculated by iterating over the edges leading from the basic block to
all exit blocks of the loop. If at least one edge has an unknown estimated
weight, the estimated weight of the loop is unknown and will not be stored
in `EstimatedLoopWeight`. `LoopWorkList` can contain different blocks in
the same loop, so work is wasted calling `getLoopExitBlocks()`
for the same loop multiple times.

Since computing the exit blocks of loop is expensive and the loop
structure is not mutated in Branch Probability Analysis, we can cache
the result and improve compile time.

With this change, the overall compile time for a file containing a very
large loop drops by around 82%.
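
The caching idea can be sketched independently of LLVM. The class below is a hypothetical illustration (`ExitBlockCache`, integer loop and block ids, and a fake expensive query are all assumptions), showing only that repeated worklist entries for the same loop trigger one computation.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

using LoopId = int;
using BlockId = int;

class ExitBlockCache {
  std::unordered_map<LoopId, std::vector<BlockId>> cache;
  int computeCalls = 0;

  // Stand-in for the expensive exit-block query; returns made-up block ids.
  std::vector<BlockId> computeExitBlocks(LoopId loop) {
    ++computeCalls;
    return {loop * 10, loop * 10 + 1};
  }

public:
  // Computes the exit blocks once per loop and reuses the stored result.
  // Valid only while the loop structure is not mutated.
  const std::vector<BlockId> &exitBlocks(LoopId loop) {
    auto it = cache.find(loop);
    if (it == cache.end())
      it = cache.emplace(loop, computeExitBlocks(loop)).first;
    return it->second;
  }

  int expensiveCalls() const { return computeCalls; }
};

// Simulates a worklist where each entry names the loop its block belongs to.
int expensiveCallsFor(const std::vector<LoopId> &worklist) {
  ExitBlockCache cache;
  for (LoopId loop : worklist)
    (void)cache.exitBlocks(loop);
  return cache.expensiveCalls();
}
```

A worklist with five entries spread over two loops performs the expensive query twice instead of five times.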