Description
Currently, Rust CUDA does not support the standard atomic intrinsics used by `core` & `std`. Instead, Rust CUDA provides its own set of atomics in `cuda_std`.
From reading the codebase, the primary issue seems to be this:
https://github.com/Rust-GPU/Rust-CUDA/blob/33664c024336de23b7e912a56b9888f6c5708b2f/crates/rustc_codegen_nvvm/src/builder.rs#L1135
- CUDA only supports 32-bit, 64-bit and 128-bit atomics (no 8- or 16-bit ones), and
- CUDA atomics have some address-space restrictions (they only work in the `global` and `shared` address spaces).
Both of those problems can be solved (or at least heavily mitigated), at a very small performance cost.
Address spaces
Ignoring the `generic` address space (which is just a superset of all the other address spaces), CUDA has 4 address spaces: `global`, `shared`, `constant`, and `local`.
The `cmpxchg` instruction (and other atomic instructions, like `atomicrmw`) works fine with `global` & `shared`: the docs say the pointer must be in one of those address spaces.
https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#cmpxchg-instruction
This means that the instruction will not work with data in the `constant` or `local` address space. This seems to be the reason why this instruction is not used to implement atomics in Rust-CUDA.
Constant address space: UB in Rust anyway.
We can ignore the `constant` address space: using any atomics on constant data is UB in Rust anyway, so we are free to assume no sound code uses atomics on data in the `constant` address space.
With this, only the `local` address space remains.
Local address space
The `local` address space is a bit tricky. I assume it is illegal for anything but the current thread to access memory in the `local` address space. If that is the case, we can use non-atomic instructions here and soundly emulate atomics.
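As a rough illustration (a hedged sketch of the idea, not the actual codegen; `emulate_cmpxchg_local` is a hypothetical helper), a compare-exchange on `local` memory would only need plain loads and stores:

```rust
/// Sketch of what `cmpxchg` could lower to for `local` pointers, assuming
/// only the current thread can access `local` memory. Plain, non-atomic
/// loads and stores are then sufficient.
unsafe fn emulate_cmpxchg_local(ptr: *mut u32, expected: u32, new: u32) -> (u32, bool) {
    let current = ptr.read(); // plain load: no other thread can observe this memory
    if current == expected {
        ptr.write(new); // plain store
        (current, true)
    } else {
        (current, false)
    }
}
```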
Putting this all together:
We have now figured out what to do for each address space, and can use this information to make the right choice based on the address space a pointer belongs to.
Thankfully, NVVM IR provides us with helpful intrinsics to do so:
```llvm
i1 @llvm.nvvm.isspacep.const(i8*)
i1 @llvm.nvvm.isspacep.global(i8*)
i1 @llvm.nvvm.isspacep.local(i8*)
i1 @llvm.nvvm.isspacep.shared(i8*)
```
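For illustration only, these intrinsics could in principle be bound from nightly Rust via the `link_llvm_intrinsics` feature (an assumption on my part; in practice the codegen would just emit the calls directly):

```rust
#![feature(link_llvm_intrinsics)]

// Hypothetical bindings; the real implementation would emit these calls
// from rustc_codegen_nvvm rather than declaring them like this.
extern "C" {
    #[link_name = "llvm.nvvm.isspacep.global"]
    fn isspacep_global(ptr: *const u8) -> bool;
    #[link_name = "llvm.nvvm.isspacep.shared"]
    fn isspacep_shared(ptr: *const u8) -> bool;
    #[link_name = "llvm.nvvm.isspacep.local"]
    fn isspacep_local(ptr: *const u8) -> bool;
}
```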
I propose we implement standard Rust atomics as follows:
1. We check if the pointer is in the `global` or `shared` address space. If it is, we use the native atomic instructions (since they support those address spaces).
2. Otherwise, we check if the pointer is in the `local` address space. If it is, we emulate atomics with non-atomic instructions (since we are guaranteed to be the only thread accessing that memory).
3. If the pointer is in neither of those, we trap. This would usually mean we just stop execution on UB (e.g. somebody using atomics on constant data). This is also future-proof: if another address space is ever introduced, we will conservatively trap on it.
In pseudocode:
```c
if (llvm.nvvm.isspacep.global(ptr) || llvm.nvvm.isspacep.shared(ptr)) {
    cmpxchg(ptr, expected, new);
} else if (llvm.nvvm.isspacep.local(ptr)) {
    emulate_cmpxchg(ptr, expected, new);
} else {
    abort(); // Rust UB - somebody used atomics on constant data.
}
```
The address-space checks are pretty cheap (from what I understand), and will get optimized away whenever the compiler can prove which address space a pointer belongs to.
Smaller atomics
For smaller atomics, we are 100% free to just emulate them using larger ones:
> Atomic operations may be implemented at the instruction layer with larger-size atomics. For example some platforms use 4-byte atomic instructions to implement AtomicI8. Note that this emulation should not have an impact on correctness of code, it’s just something to be aware of.

(from the `std::sync::atomic` module documentation)
This is not trivial to do, and is a bit slower (it requires 2 "large" atomic ops per emulated op), but it can be done in a sound and reasonable way.
If we support `cmpxchg` for all the required sizes, we can then easily emulate any missing atomics with a compare-exchange loop (see the sketch below).
This would allow us to support all standard atomics fully.
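For instance, here is a minimal sketch (my own illustration, not code from the repository) of emulating an 8-bit `fetch_add` on top of a 32-bit compare-exchange loop; `word` is assumed to point at the aligned 32-bit word containing the byte:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Emulate an 8-bit `fetch_add` using 32-bit atomics: one load plus one
/// `compare_exchange` per (uncontended) emulated operation.
/// `shift` selects the byte within the 32-bit word (0, 8, 16 or 24).
fn emulated_fetch_add_u8(word: &AtomicU32, shift: u32, val: u8) -> u8 {
    let mask = 0xffu32 << shift;
    let mut old = word.load(Ordering::Relaxed);
    loop {
        let byte = ((old & mask) >> shift) as u8;
        let new_byte = byte.wrapping_add(val);
        let new = (old & !mask) | ((new_byte as u32) << shift);
        match word.compare_exchange(old, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => return byte,          // success: return the previous byte value
            Err(current) => old = current, // another thread raced us; retry
        }
    }
}
```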
Other caveats
The `weak` atomics are unsupported in CUDA. This is fine and changes nothing: `weak` atomics can always be soundly replaced with their non-`weak` variants.
The failure ordering is ignored. I will have to check the CUDA docs a bit more carefully, but I assume this means the same ordering is used for both success and failure. We thus just need to pick the stronger of the two orderings and use it for both (this is sound, because replacing a weaker ordering with a stronger one is always sound).
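A minimal sketch of that "pick the stronger one" step (my own illustration; `combined_ordering` is a hypothetical helper, not an existing API):

```rust
use std::sync::atomic::Ordering;

/// Combine the success and failure orderings of a compare-exchange into a
/// single ordering at least as strong as both. In Rust, the failure
/// ordering can only be Relaxed, Acquire, or SeqCst.
fn combined_ordering(success: Ordering, failure: Ordering) -> Ordering {
    use Ordering::*;
    match (success, failure) {
        (SeqCst, _) | (_, SeqCst) => SeqCst,
        // Release-on-success + Acquire-on-failure needs both: AcqRel.
        (AcqRel, _) | (Release, Acquire) => AcqRel,
        (Release, _) => Release,
        (Acquire, _) | (_, Acquire) => Acquire,
        (Relaxed, Relaxed) => Relaxed,
        // `Ordering` is non-exhaustive; be conservative for anything new.
        _ => SeqCst,
    }
}
```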
Finishing thoughts
This issue outlines a rough plan for full support for standard atomics in Rust CUDA.
With this, all existing uses of atomics in Rust would just work in CUDA. Standard types like `LazyCell`, `Mutex`, `Arc`, or even channels could be used in CUDA, out of the box.