Suggestion Description
Hi,
As mentioned briefly in #104, I believe it would be valuable to add support for cache-flush control flags to the stream memory read/write operations, similar to what is already possible with events.
Thanks to stream memory operations, I have been able to implement a single-machine all-reduce with ~18 µs latency, which is already really good (better than RCCL, from what I see). I believe that being able to avoid cache flushes might help shave off even more.
Best,
Epliz
Operating System
No response
GPU
No response
ROCm Component
No response