Description
When investigating the performance of the DASH Cowichan implementation compared to TBB and Cilk I noticed that, strangely, access to DASH local memory (via .lbegin()) appears to be significantly slower than access to regular locally allocated memory obtained via malloc. Here is an example that demonstrates the behavior:
#include <unistd.h>
#include <iostream>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <libdash.h>
#include <sys/time.h>
#include <time.h>

using namespace std;

#define MYTIMEVAL( tv_ ) \
  ((tv_.tv_sec)+(tv_.tv_usec)*1.0e-6)

#define TIMESTAMP( time_ )      \
  {                             \
    static struct timeval tv;   \
    gettimeofday( &tv, NULL );  \
    time_ = MYTIMEVAL(tv);      \
  }

//
// do some work and measure how long it takes
//
double do_work(int *beg, int nelem, int repeat)
{
  const int LCG_A = 1664525, LCG_C = 1013904223;
  int seed = 31337;
  double start, end;
  TIMESTAMP(start);
  for( int j=0; j<repeat; j++ ) {
    for( int i=0; i<nelem; ++i ) {
      seed   = LCG_A * seed + LCG_C;
      beg[i] = ((unsigned)seed) % 100;
    }
  }
  TIMESTAMP(end);
  return end-start;
}

int main(int argc, char* argv[])
{
  dash::init(&argc, &argv);

  dash::Array<int> arr(100000000);
  int nelem = arr.local.size();
  int *mem  = (int*) malloc(sizeof(int)*nelem);

  double dur1 = do_work(arr.lbegin(), nelem, 1);
  double dur2 = do_work(mem, nelem, 1);

  cerr << "Unit " << dash::myid()
       << " DASH mem: "  << dur1 << " secs"
       << " Local mem: " << dur2 << " secs" << endl;

  free(mem);
  dash::finalize();
  return EXIT_SUCCESS;
}
On my machine, when run with two units, I get the following significant performance differences:
Unit 1 DASH mem: 0.346513 secs Local mem: 0.234078 secs
Unit 0 DASH mem: 0.35398 secs Local mem: 0.232012 secs
The difference appears to vanish if the repeat factor is increased, but that is of no help in the context of the Cowichan problems.
At the moment I am at a loss as to the root cause of this difference; alignment appears to play no role. @devreal: Any idea what could be behind the difference and what could make access to window memory slower? Maybe different MPI window creation options? I'll try to investigate with hardware counters in the coming days. NUMA and memory paging are two possible culprits. With NUMA I don't see how it could have an influence in this context; with paging I could imagine that window allocation causes the pages to be pinned and thus changes the access characteristics of the memory.