
Thoughts on NUMA issues with DASH containers #615

@devreal

Description


I want to lay out some observations I made during the last couple of months.

Background

On systems with a high-performance network, MPI implementations register the memory backing windows with the network device in order to leverage RDMA capabilities (observed both on IB and Cray Aries with Open MPI). As part of this process, all pages in the registered memory range are pre-faulted (in the case of MPI_Win_allocate), which is done by the thread allocating the window (usually the master thread). Afterwards the pages are non-movable and non-swappable, and thus stay pinned to the NUMA domain on which the allocating thread was running. It is my understanding that pages that were already allocated (e.g., memory passed to MPI_Win_create) are not moved and remain on the NUMA domain on which they were allocated.
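
As a minimal illustration of this mechanism at the MPI level (a sketch only; LARGE_N is a placeholder size as in the example further below):

double  *baseptr;
MPI_Win  win;
MPI_Aint nbytes = (MPI_Aint)LARGE_N * sizeof(double);

// On these systems, the pages backing baseptr are registered with the NIC
// and pre-faulted inside this call, i.e., by the calling (usually: master)
// thread. Under first-touch placement they end up on that thread's NUMA
// domain and, being pinned, stay there.
MPI_Win_allocate(nbytes, sizeof(double), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &baseptr, &win);

// Any later "first touch" by other threads has no effect on placement,
// the physical pages already exist.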

I am strongly assuming that other communication libraries such as GASPI do the same.

The Problem

This can become a problem for two reasons:

  1. Open MPI allocates shared memory on each node running more than one process to speed up node-local communication (similar to our shared memory optimization). We found that the root process on the node registers the complete shared memory range, which leads to NUMA effects on all processes not running on the same socket. There is an open issue to fix this, so the problem will go away eventually, but it is something to keep in mind for the time being (other MPI implementations might be affected as well).

  2. More interesting is the scenario in which we have one process per node that uses threads to exploit multi-core performance. Here, the master thread will allocate and register the memory for the whole window (and thus the DASH container), essentially performing the first touch and thus nullifying any attempt to be NUMA-aware in the OpenMP part (see the sketch after this list). This is a severe problem as most of our memory comes from MPI windows.
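
To make scenario 2 concrete, here is a rough sketch of what a NUMA-aware initialization attempt looks like today, using the existing container API:

dash::Array<double> arr(LARGE_N); // window is allocated and registered here;
                                  // all pages are faulted by the master thread
#pragma omp parallel for
for (int i = 0; i < LARGE_N; ++i) {
  arr.local[i] = 0.0; // placement is already fixed at this point, so this
                      // loop no longer has any NUMA effect
}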

Possible Mitigation

We need to give DASH users a chance to properly initialize the memory of the container before that memory is registered with the network device. That means that we have to a) allocate memory ourselves, b) expose the memory to the user, and c) pass that memory into MPI_Win_create eventually (instead of using MPI_Win_allocate in the first place).
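
At the MPI level, the sequence a)-c) would look roughly as follows (a sketch only; MPI_Alloc_mem stands in for whatever local allocator we end up using):

double  *buf;
MPI_Win  win;
MPI_Aint nbytes = (MPI_Aint)LARGE_N * sizeof(double);

MPI_Alloc_mem(nbytes, MPI_INFO_NULL, &buf);   // a) allocate memory ourselves

#pragma omp parallel for                      // b) the user performs the
for (int i = 0; i < LARGE_N; ++i) {           //    first touch, NUMA-aware
  buf[i] = 0.0;
}

MPI_Win_create(buf, nbytes, sizeof(double),   // c) register the already placed
               MPI_INFO_NULL, MPI_COMM_WORLD, //    memory as a window; pages
               &win);                         //    are not moved at this point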

DASH containers today support allocation on construction and delayed allocation (both of which directly allocate the window). We could add a third option that constructs the container with local memory only, after which only local access is allowed (throwing an exception on any attempt to access global memory). The user would thus be able to use .local and .lbegin() on the container (or block/view/what have you). Eventually, the local memory is committed into global memory (by creating the window), and global memory access is possible from that point on. Here is an example:

dash::Array<double> arr;
arr.allocate_local(LARGE_N); // uses malloc/MPI_Alloc_mem
#pragma omp parallel for
for (int i = 0; i < LARGE_N; ++i) {
  arr.local[i] = 0.0;
}
arr.commit();

Whether local allocation is exposed through a separate function or controlled by a template parameter that changes the behavior of allocate is open for discussion; I just wanted to provide a rough sketch. (The check whether the memory has been committed is not free and might be an argument for a template parameter, to avoid that overhead on "regular" containers; see the sketch below.)
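
Purely as a strawman for the template-parameter variant (all names below are invented for illustration and are not existing DASH API), a policy tag could turn the committed-check into a compile-time no-op for regular containers:

struct immediate_alloc {}; // today's behavior, no commit state to check
struct deferred_commit {}; // local allocation first, commit() later

template <typename T, typename AllocPolicy = immediate_alloc>
class array_sketch {
  bool _committed = std::is_same<AllocPolicy, immediate_alloc>::value;

  void check_committed() const {
    // The condition is a compile-time constant: for immediate_alloc the
    // branch is dead code, so "regular" containers pay nothing per access.
    if (std::is_same<AllocPolicy, deferred_commit>::value) {
      assert(_committed && "global access before commit()");
    }
  }

public:
  void commit() { _committed = true; } // would create the MPI window here
  // ... every global accessor would call check_committed() first ...
};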

Any thoughts on this are welcome. I think this is something to discuss at the next F2F but I wanted to write it up here and be able to track this issue.
