I’m seeing some strange behavior on my TX2 board when testing basic OpenMP code. I wrote the following simple program:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int num_threads = atoi(argv[1]);
    int sum = 0;
    #pragma omp parallel num_threads(num_threads)
    {
        /* Repeat the reduction 16 times so the run is long enough to time.
           sum overflows a 32-bit int, hence the negative value printed below;
           only the timing matters here. */
        for (unsigned iter = 0; iter < 16; iter++) {
            #pragma omp for reduction(+:sum)
            for (unsigned index = 0; index < 32 * 1024 * 1024; index++) {
                sum += index;
            }
        }
    }
    printf("sum: %d\n", sum);
    return 0;
}
I compile it with:
gcc -Wall -fopenmp -Ofast test.c -o test
Running on the Cortex-A57 cores, I see the expected, roughly linear scaling:
nvidia@jetson:~/src/dust/src$ sudo nvpmodel -m 3
nvidia@jetson:~/src/dust/src$ sudo nvpmodel -q
NV Power Mode: MAXP_CORE_ARM
3
nvidia@jetson:~/src/dust/src$ time ./test 1
sum: -268435456
real 0m0.576s
user 0m0.568s
sys 0m0.004s
nvidia@jetson:~/src/dust/src$ time ./test 2
sum: -268435456
real 0m0.298s
user 0m0.580s
sys 0m0.008s
nvidia@jetson:~/src/dust/src$ time ./test 3
sum: -268435456
real 0m0.208s
user 0m0.604s
sys 0m0.004s
nvidia@jetson:~/src/dust/src$ time ./test 4
sum: -268435456
real 0m0.167s
user 0m0.632s
sys 0m0.004s
Running on the Denver cores, however, I actually see a slowdown when scaling from one thread to two:
nvidia@jetson:~/src/dust/src$ sudo nvpmodel -m 4
nvidia@jetson:~/src/dust/src$ sudo nvpmodel -q
NV Power Mode: MAXP_CORE_DENVER
4
nvidia@jetson:~/src/dust/src$ time ./test 1
sum: -268435456
real 0m0.477s
user 0m0.468s
sys 0m0.000s
nvidia@jetson:~/src/dust/src$ time ./test 2
sum: -268435456
real 0m0.601s
user 0m0.676s
sys 0m0.008s
Any idea what’s going on here? Is there a workaround for getting full utilization of both Denver cores?
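In case thread placement is the issue, one thing worth trying is explicit pinning via the standard OpenMP 4.0 affinity environment variables (sketch; that the Denver cluster is CPU 1 and CPU 2 in this mode is my assumption):

```shell
# Assumption: in MAXP_CORE_DENVER mode the Denver cores are CPU 1 and CPU 2.
export OMP_PLACES="{1},{2}"   # one place per Denver core
export OMP_PROC_BIND="close"  # bind threads to those places in order
```

and then rerun time ./test 2; restricting the whole process with taskset -c 1,2 ./test 2 should be roughly equivalent.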