Description
When investigating the performance of the DASH Cowichan implementation compared to TBB and Cilk I noticed that, strangely, access to DASH local memory (via .lbegin()) appears to be significantly slower than access to regular locally allocated memory obtained via malloc. Here is an example that demonstrates the behavior:
#include <unistd.h>
#include <iostream>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <libdash.h>
#include <sys/time.h>
#include <time.h>

using namespace std;

#define MYTIMEVAL( tv_ ) \
  ((tv_.tv_sec)+(tv_.tv_usec)*1.0e-6)

#define TIMESTAMP( time_ )      \
  {                             \
    static struct timeval tv;   \
    gettimeofday( &tv, NULL );  \
    time_ = MYTIMEVAL(tv);      \
  }

//
// do some work and measure how long it takes
//
double do_work(int *beg, int nelem, int repeat)
{
  const int LCG_A = 1664525, LCG_C = 1013904223;
  int seed = 31337;
  double start, end;
  TIMESTAMP(start);
  for( int j=0; j<repeat; j++ ) {
    for( int i=0; i<nelem; ++i ) {
      seed   = LCG_A * seed + LCG_C;
      beg[i] = ((unsigned)seed) % 100;
    }
  }
  TIMESTAMP(end);
  return end-start;
}

int main(int argc, char* argv[])
{
  dash::init(&argc, &argv);

  dash::Array<int> arr(100000000);
  int nelem = arr.local.size();
  int *mem  = (int*) malloc(sizeof(int)*nelem);

  double dur1 = do_work(arr.lbegin(), nelem, 1);
  double dur2 = do_work(mem, nelem, 1);

  cerr << "Unit " << dash::myid()
       << " DASH mem: "  << dur1 << " secs"
       << " Local mem: " << dur2 << " secs" << endl;

  free(mem);
  dash::finalize();
  return EXIT_SUCCESS;
}
On my machine, when run with two units, I get the following significant performance differences:
Unit 1 DASH mem: 0.346513 secs Local mem: 0.234078 secs
Unit 0 DASH mem: 0.35398 secs Local mem: 0.232012 secs
The difference appears to vanish if the repeat factor is increased, but that is of no help in the context of the Cowichan problems.
At the moment I am at a loss as to the root cause of this difference; alignment appears to play no role. @devreal: Any idea what could be behind the difference and what could make access to window memory slower? Maybe different MPI window creation options? I'll try to investigate with hardware counters in the coming days. NUMA and memory paging are two possible culprits. With NUMA I don't see how it could have an influence in this context; with paging I could imagine that window allocation causes the pages to be pinned and thus changes the access characteristics of the memory.