
Adding support for standard atomic intrinsics #253

@FractalFir

Description

Currently, Rust CUDA does not support the standard atomic intrinsics used by core & std. Instead, it provides its own set of atomics in cuda_std.

From reading the codebase, the primary issue seems to be this:
https://github.com/Rust-GPU/Rust-CUDA/blob/33664c024336de23b7e912a56b9888f6c5708b2f/crates/rustc_codegen_nvvm/src/builder.rs#L1135

  1. CUDA only supports 32-bit, 64-bit and 128-bit atomics (no 8- or 16-bit ones), AND
  2. CUDA atomics have some address-space restrictions (they only work in the global and shared address spaces).

Both of those problems can be solved (or at least heavily mitigated) at a very tiny performance cost.

Address spaces

Ignoring the generic address space (which is just a superset of all the others), CUDA has 4 address spaces: global, shared, constant, and local.

The cmpxchg instruction (and other atomics, like atomicrmw) works fine with global & shared: the docs say the pointer must be in one of those address spaces.

https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#cmpxchg-instruction

This means that the instruction will not work with data in the constant or local address space. This seems to be the reason why this instruction is not used to implement atomics in Rust-CUDA.

Constant address space: UB in Rust anyway.

We can ignore the constant address space: using any atomics on constant data is UB in Rust anyway, so we are free to assume no sound code uses atomics on data in the constant address space.

With this, only the local address space remains.

Local address space

Local address space is a bit tricky. I assume it is illegal for anything but the current thread to access the local address space. If this is the case, we can use non-atomic instructions here, and soundly emulate atomics.
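If that assumption holds, the emulation for thread-local memory is just a plain read-compare-write. A minimal CPU-runnable sketch in Rust (the function name `emulate_cmpxchg_u32` is hypothetical; its `unsafe` contract is exactly the single-thread exclusivity assumed above, and the real implementation would be emitted by the codegen as NVVM IR):

```rust
/// Non-atomic compare-exchange. Sound ONLY if `ptr` points into memory
/// that no other thread can access (e.g. CUDA's local address space).
///
/// Returns the previous value and whether the swap happened, mirroring
/// the shape of `compare_exchange` in `core::sync::atomic`.
unsafe fn emulate_cmpxchg_u32(ptr: *mut u32, expected: u32, new: u32) -> (u32, bool) {
    let current = *ptr; // plain, non-atomic load
    if current == expected {
        *ptr = new; // plain, non-atomic store
        (current, true)
    } else {
        (current, false)
    }
}
```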

Putting this all together:

Now, we have figured out what to do for each address space. We can use this info to make the right choice based on the address space a pointer belongs to.

Thankfully, NVVM IR provides us with helpful intrinsics to do so:

    i1 @llvm.nvvm.isspacep.const(i8*)
    i1 @llvm.nvvm.isspacep.global(i8*)
    i1 @llvm.nvvm.isspacep.local(i8*)
    i1 @llvm.nvvm.isspacep.shared(i8*)

I propose we implement standard Rust atomics as follows:

  1. We check if the pointer is in the global or shared address space.
    1.a. If it is, we use the atomic instructions directly, since they support those address spaces.
  2. Otherwise, we check if the pointer is in the local address space. If it is, we emulate atomics with non-atomic instructions, since we are guaranteed to be the only thread accessing that memory.
  3. If the pointer is in neither of those, we trap. This usually means we just stop execution on UB (e.g. somebody using atomics on constant data). It is also future-proof: if another address space is ever introduced, we will conservatively trap on it.

In pseudocode:

    if (llvm.nvvm.isspacep.global(ptr) || llvm.nvvm.isspacep.shared(ptr)) {
        cmpxchg(ptr, expected, val)
    } else if (llvm.nvvm.isspacep.local(ptr)) {
        emulate_cmpxchg(ptr, expected, val)
    } else {
        abort(); // Rust UB - somebody used atomics on constant data.
    }

The address-space checks are pretty cheap (from what I understand), and will be optimized away when the compiler can prove which address space a pointer belongs to.

Smaller atomics

For smaller atomics, we are 100% free to just emulate them using larger ones. Quoting the Rust standard-library atomics documentation:

    Atomic operations may be implemented at the instruction layer with larger-size atomics. For example some platforms use 4-byte atomic instructions to implement AtomicI8. Note that this emulation should not have an impact on correctness of code, it’s just something to be aware of.

This is not trivial to do, and is a bit slower (it requires at least 2 "large" atomic ops per emulated op), but it can be done in a sound and reasonable way.

If we support cmpxchg for all the required sizes, we can then easily emulate any missing atomic operation with a compare-exchange loop.
This would allow us to support all standard atomics fully.
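As an illustration of the compare-exchange-loop approach, here is a CPU-runnable Rust sketch that emulates an 8-bit fetch_add on top of a 32-bit compare-exchange. The function name and the packed-word layout are mine, for illustration only; the real implementation would live in the codegen and emit NVVM IR:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Emulates `AtomicU8::fetch_add` using only 32-bit atomics:
/// the byte lives packed inside `word` at position `byte_index` (0..=3).
/// Returns the previous value of the byte, like `fetch_add` does.
fn emulated_fetch_add_u8(word: &AtomicU32, byte_index: usize, val: u8) -> u8 {
    let shift = (byte_index % 4) * 8;
    let mask = 0xFF_u32 << shift;
    let mut old_word = word.load(Ordering::Relaxed);
    loop {
        let old_byte = ((old_word & mask) >> shift) as u8;
        let new_byte = old_byte.wrapping_add(val);
        let new_word = (old_word & !mask) | (u32::from(new_byte) << shift);
        // Strong compare-exchange, matching what CUDA offers (no weak variant).
        match word.compare_exchange(old_word, new_word, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => return old_byte,
            // Another thread changed the word (possibly a *different* byte
            // in it); retry with the freshly observed value.
            Err(observed) => old_word = observed,
        }
    }
}
```

Note that a racing update to a *neighboring* byte in the same word only costs a retry; it never corrupts the byte being updated.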

Other caveats

Weak atomics are unsupported in CUDA - this is fine, and changes nothing: a weak compare-exchange can always be soundly replaced with the non-weak variant.
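To spell out why this is sound: the weak variant's contract is a superset of the strong one's - a weak compare-exchange is *allowed* to fail spuriously, but never *required* to. A sketch (the function name is mine):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// A valid implementation of `compare_exchange_weak` in terms of the strong
/// variant: never failing spuriously still satisfies the weak contract.
fn weak_via_strong(a: &AtomicU32, current: u32, new: u32) -> Result<u32, u32> {
    a.compare_exchange(current, new, Ordering::SeqCst, Ordering::SeqCst)
}
```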

The failure ordering is ignored. I will have to check the CUDA docs more carefully, but I assume this means the same ordering is used for both success and failure. We thus just need to pick the stronger of the two orderings and use it for both (this is sound, because replacing a weaker ordering with a stronger one is always sound).
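Because Acquire and Release are not ordered relative to each other, "pick the stronger one" concretely means combining the acquire/release effects of both orderings. A hedged sketch of that merge (the function name is mine; note that in Rust the failure ordering can only be SeqCst, Acquire, or Relaxed):

```rust
use std::sync::atomic::Ordering;

/// Merges a compare-exchange's success and failure orderings into a single
/// ordering that is at least as strong as both, for targets (like CUDA here)
/// that only take one ordering for the whole operation.
fn merged_ordering(success: Ordering, failure: Ordering) -> Ordering {
    use Ordering::*;
    if success == SeqCst || failure == SeqCst {
        return SeqCst;
    }
    // Rust only permits Relaxed or Acquire as the remaining failure orderings.
    let acquire = matches!(success, Acquire | AcqRel) || failure == Acquire;
    let release = matches!(success, Release | AcqRel);
    match (acquire, release) {
        (true, true) => AcqRel,
        (true, false) => Acquire,
        (false, true) => Release,
        (false, false) => Relaxed,
    }
}
```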

Finishing thoughts

This issue outlines a rough plan for full support for standard atomics in Rust CUDA.

With this, all existing uses of atomics in Rust would just work in CUDA. Standard types like LazyCell, Mutex, Arc, or even channels could be used in CUDA out of the box.
