Description
Currently, Rust CUDA does not support the standard atomic intrinsics used by `core` & `std`. Instead, Rust CUDA provides its own set of atomics in `cuda_std`.
From reading the codebase, the primary issue seems to be this:
https://github.com/Rust-GPU/Rust-CUDA/blob/33664c024336de23b7e912a56b9888f6c5708b2f/crates/rustc_codegen_nvvm/src/builder.rs#L1135
- CUDA only supports 32-bit, 64-bit and 128-bit atomics (no 8- or 16-bit ones), and
- CUDA atomics have some address-space restrictions (they only work in the `global` and `shared` address spaces).
Both of those problems can be solved (or at least heavily mitigated), at a very small performance cost.
Address spaces
Ignoring the `generic` address space (which is just a superset of all the other address spaces), CUDA has 4 address spaces: `global`, `shared`, `constant`, and `local`.
The `cmpxchg` instruction (and other atomic instructions, like `atomicrmw`) works fine with `global` & `shared`: the docs say the pointer must be in one of those address spaces.
https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#cmpxchg-instruction
This means that the instruction will not work with data in the `constant` or `local` address space. This seems to be the reason why this instruction is not used to implement atomics in Rust-CUDA.
Constant address space: UB in Rust anyway.
We can ignore the `constant` address space: using any atomics on constant data is UB in Rust anyway, so we are free to assume no sound code uses atomics on data in the `constant` address space.
With this, only the `local` address space remains.
Local address space
The `local` address space is a bit tricky. I assume it is illegal for anything but the current thread to access memory in the `local` address space. If that is the case, we can use non-atomic instructions here and soundly emulate atomics.
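As a rough illustration (a hedged sketch of the idea, not the actual codegen; `emulate_cmpxchg_local` is a hypothetical helper), a compare-exchange on `local` memory would only need plain loads and stores:

```rust
/// Sketch of what `cmpxchg` could lower to for `local` pointers, assuming
/// only the current thread can access `local` memory. Plain, non-atomic
/// loads and stores are then sufficient.
unsafe fn emulate_cmpxchg_local(ptr: *mut u32, expected: u32, new: u32) -> (u32, bool) {
    let current = ptr.read(); // plain load: no other thread can observe this memory
    if current == expected {
        ptr.write(new); // plain store
        (current, true)
    } else {
        (current, false)
    }
}
```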
Putting this all together:
We have now figured out what to do for each address space, and can use this information to make the right choice based on the address space a pointer belongs to.
Thankfully, NVVM IR provides us with helpful intrinsics to do so:
```llvm
i1 @llvm.nvvm.isspacep.const(i8*)
i1 @llvm.nvvm.isspacep.global(i8*)
i1 @llvm.nvvm.isspacep.local(i8*)
i1 @llvm.nvvm.isspacep.shared(i8*)
```
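For illustration only, these intrinsics could in principle be bound from nightly Rust via the `link_llvm_intrinsics` feature (an assumption on my part; in practice the codegen would just emit the calls directly):

```rust
#![feature(link_llvm_intrinsics)]

// Hypothetical bindings; the real implementation would emit these calls
// from rustc_codegen_nvvm rather than declaring them like this.
extern "C" {
    #[link_name = "llvm.nvvm.isspacep.global"]
    fn isspacep_global(ptr: *const u8) -> bool;
    #[link_name = "llvm.nvvm.isspacep.shared"]
    fn isspacep_shared(ptr: *const u8) -> bool;
    #[link_name = "llvm.nvvm.isspacep.local"]
    fn isspacep_local(ptr: *const u8) -> bool;
}
```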
I propose we implement standard Rust atomics as follows:
1. We check if the pointer is in the `global` or `shared` address space. If it is, we use the native atomic instructions (since they support those address spaces).
2. Otherwise, we check if the pointer is in the `local` address space. If it is, we emulate atomics with non-atomic instructions (since we are guaranteed to be the only thread accessing that memory).
3. If the pointer is in neither of those, we trap. This would usually mean we just stop execution on UB (e.g. somebody using atomics on constant data). This is also future-proof: if another address space is ever introduced, we will conservatively trap on it.
In pseudocode:
```c
if (llvm.nvvm.isspacep.global(ptr) || llvm.nvvm.isspacep.shared(ptr)) {
    cmpxchg(ptr, expected, new);
} else if (llvm.nvvm.isspacep.local(ptr)) {
    emulate_cmpxchg(ptr, expected, new);
} else {
    abort(); // Rust UB - somebody used atomics on constant data.
}
```
The address-space checks are pretty cheap (from what I understand), and will get optimized away whenever the compiler can prove which address space a pointer belongs to.
Smaller atomics
For smaller atomics, we are 100% free to just emulate them using larger ones:
> Atomic operations may be implemented at the instruction layer with larger-size atomics. For example some platforms use 4-byte atomic instructions to implement AtomicI8. Note that this emulation should not have an impact on correctness of code, it’s just something to be aware of.

(from the `std::sync::atomic` module documentation)
This is not trivial to do, and is a bit slower (it requires 2 "large" atomic ops per emulated op), but it can be done in a sound and reasonable way.
If we support `cmpxchg` for all the required sizes, we can then easily emulate any missing atomics with a compare-exchange loop (see the sketch below).
This would allow us to support all standard atomics fully.
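For instance, here is a minimal sketch (my own illustration, not code from the repository) of emulating an 8-bit `fetch_add` on top of a 32-bit compare-exchange loop; `word` is assumed to point at the aligned 32-bit word containing the byte:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Emulate an 8-bit `fetch_add` using 32-bit atomics: one load plus one
/// `compare_exchange` per (uncontended) emulated operation.
/// `shift` selects the byte within the 32-bit word (0, 8, 16 or 24).
fn emulated_fetch_add_u8(word: &AtomicU32, shift: u32, val: u8) -> u8 {
    let mask = 0xffu32 << shift;
    let mut old = word.load(Ordering::Relaxed);
    loop {
        let byte = ((old & mask) >> shift) as u8;
        let new_byte = byte.wrapping_add(val);
        let new = (old & !mask) | ((new_byte as u32) << shift);
        match word.compare_exchange(old, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => return byte,          // success: return the previous byte value
            Err(current) => old = current, // another thread raced us; retry
        }
    }
}
```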
Other caveats
The `weak` atomics are unsupported in CUDA. This is fine and changes nothing: `weak` atomics can always be soundly replaced with their non-`weak` variants.
The failure ordering is ignored. I will have to check the CUDA docs a bit more carefully, but I assume this means the same ordering is used for both success and failure. We thus just need to pick the stronger of the two orderings and use it for both (this is sound, because replacing a weaker ordering with a stronger one is always sound).
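A minimal sketch of that "pick the stronger one" step (my own illustration; `combined_ordering` is a hypothetical helper, not an existing API):

```rust
use std::sync::atomic::Ordering;

/// Combine the success and failure orderings of a compare-exchange into a
/// single ordering at least as strong as both. In Rust, the failure
/// ordering can only be Relaxed, Acquire, or SeqCst.
fn combined_ordering(success: Ordering, failure: Ordering) -> Ordering {
    use Ordering::*;
    match (success, failure) {
        (SeqCst, _) | (_, SeqCst) => SeqCst,
        // Release-on-success + Acquire-on-failure needs both: AcqRel.
        (AcqRel, _) | (Release, Acquire) => AcqRel,
        (Release, _) => Release,
        (Acquire, _) | (_, Acquire) => Acquire,
        (Relaxed, Relaxed) => Relaxed,
        // `Ordering` is non-exhaustive; be conservative for anything new.
        _ => SeqCst,
    }
}
```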
Finishing thoughts
This issue outlines a rough plan for full support for standard atomics in Rust CUDA.
With this, all existing uses of atomics in Rust would just work in CUDA. Standard types like `LazyCell`, `Mutex`, `Arc`, or even channels could be used in CUDA, out of the box.