[OpenMP] Performance bug in Denver cores?

I’m seeing some strange behavior on my TX2 board when testing basic OpenMP code. I wrote the following simple program:

#include <stdio.h>
#include <stdlib.h>

int main(int num_args, char * args[]) {
    int num_threads = atoi(args[1]);
    int sum = 0;
    #pragma omp parallel num_threads(num_threads)
    {
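        /* Every thread executes the outer loop; each pass of the inner
           loop is work-shared across the team with a reduction on sum. */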
        for (unsigned iter = 0; iter < 16; iter++) {
            #pragma omp for reduction(+:sum)
            for (unsigned index = 0; index < 32 * 1024 * 1024; index++) {
                sum += index;
            }
        }
    }
    printf("sum: %d\n", sum);
    return 0;
}
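
(As an aside: sum overflows a 32-bit int, since 16 * (0 + 1 + ... + 2^25 - 1) = 2^53 - 2^28, which wraps around to -2^28 = -268435456. That is why the printed sums below are negative; it doesn't affect the timing. If you want the true value, a 64-bit accumulator works:

    long long sum = 0;
    ...
    printf("sum: %lld\n", sum);
)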

I compile it with:

gcc -Wall -fopenmp -Ofast test.c -o test

Running on the ARM cores, I see the expected, roughly linear scaling in performance:

nvidia@jetson:~/src/dust/src$ sudo nvpmodel -m 3
nvidia@jetson:~/src/dust/src$ sudo nvpmodel -q
NV Power Mode: MAXP_CORE_ARM
3
nvidia@jetson:~/src/dust/src$ time ./test 1
sum: -268435456

real	0m0.576s
user	0m0.568s
sys	0m0.004s
nvidia@jetson:~/src/dust/src$ time ./test 2
sum: -268435456

real	0m0.298s
user	0m0.580s
sys	0m0.008s
nvidia@jetson:~/src/dust/src$ time ./test 3
sum: -268435456

real	0m0.208s
user	0m0.604s
sys	0m0.004s
nvidia@jetson:~/src/dust/src$ time ./test 4
sum: -268435456

real	0m0.167s
user	0m0.632s
sys	0m0.004s

Running on the Denver cores, however, I actually see a slowdown when scaling from one thread to two:

nvidia@jetson:~/src/dust/src$ sudo nvpmodel -m 4
nvidia@jetson:~/src/dust/src$ sudo nvpmodel -q
NV Power Mode: MAXP_CORE_DENVER
4
nvidia@jetson:~/src/dust/src$ time ./test 1
sum: -268435456

real	0m0.477s
user	0m0.468s
sys	0m0.000s
nvidia@jetson:~/src/dust/src$ time ./test 2
sum: -268435456

real	0m0.601s
user	0m0.676s
sys	0m0.008s

Any idea what’s going on here? Is there a workaround for getting full utilization of both Denver cores?

Hi,

Have you maximized the CPU/GPU performance before profiling?

sudo ./jetson_clocks.sh

We will also try to reproduce this issue in our environment.

Thanks.

Running the jetson_clocks.sh script before each test does improve performance across all of the tests (by 10-20%), but the trends are unchanged: approximately linear scaling on the ARM cores, and a slowdown on the Denver cores.

Hi,

1)
Here is our CPU placement for Jetson TX2:
>> [A57, Denver, Denver, A57, A57, A57]

When nvpmodel is set to MAXP_CORE_DENVER, only CPU0 and CPU1 are enabled, with CPU0 in low-frequency mode and CPU1 in performance mode.

As a result, all of the computational work is assigned to CPU1, while CPU0 handles only routine background tasks.

2)
When two threads are created, both are assigned to CPU1, which lowers performance due to multi-threading overhead (there is no extra hardware for the second thread).
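
If you want to confirm the thread placement yourself, a minimal sketch like the one below (assuming glibc, which provides sched_getcpu(), and GCC with -fopenmp) prints the CPU each OpenMP thread runs on; with MAXP_CORE_DENVER and two threads, you would expect both threads to report CPU1:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* sched_getcpu() returns the CPU the calling thread is currently on */
        printf("OpenMP thread %d is on CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}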

3)
There is no nvpmodel configuration that enables both Denver cores while disabling all of the A57 cores, but you can achieve this manually:

$ sudo -i
$ nvpmodel -m 4
$ echo 1 > /sys/devices/system/cpu/cpu2/online

This will enable CPU0-CPU2, with CPU1 and CPU2 in performance mode.
We get good acceleration with this setting:

nvidia@tegra-ubuntu:~$ time ./test 1
sum: 201326592

real    0m9.990s
user    0m9.960s
sys    0m0.004s
nvidia@tegra-ubuntu:~$ time ./test 2
sum: 201326592

real    0m5.071s
user    0m10.064s
sys    0m0.000s

4)
By the way, you can check the CPU status (on/off, clock rate) with tegrastats; each entry in the CPU list is load%@frequency in MHz, or "off" for a disabled core.

sudo ./tegrastats

RAM 956/7846MB (lfb 1522x4MB) CPU [6%@345,100%@2020,100%@2020,off,off,off] …

Thanks.

Ah, I overlooked that MAXP_CORE_DENVER actually only enables one Denver core. Your suggestion works: it brings both Denver cores online.

I found that in some cases OpenMP threads would still be scheduled on the throttled ARM core. The Linux taskset utility is quite helpful for avoiding this, e.g.:

taskset 6 time ./test 2

Here 6 is a bitmask corresponding to the CPU set {1,2} (the two Denver cores).
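
If you would rather pin the process from inside the program instead of wrapping it with taskset, sched_setaffinity() does the same job. A minimal sketch (Linux-specific; CPUs 1 and 2 hard-coded for the TX2 Denver pair), to be called at the start of main() before the first parallel region:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Restrict the calling thread to CPUs 1 and 2 (the Denver cores).
   OpenMP worker threads created afterwards inherit this mask by default,
   so the effect is equivalent to running the program under "taskset 6". */
static int pin_to_denver(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(1, &mask);
    CPU_SET(2, &mask);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}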

Incidentally, NVPModel Clock Configuration for Jetson TX2 has details on the nvpmodels.

Hello.

I would like to revive this thread, since in my case taskset isn't working. This is what top reports when using taskset mask 0x06:

Tasks: 302 total, 2 running, 300 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu1 :100,0 us, 0,0 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 6,7 us, 33,3 sy, 0,0 ni, 60,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 0,0 us, 0,0 sy, 0,0 ni, 93,3 id, 0,0 wa, 6,7 hi, 0,0 si, 0,0 st
%Cpu5 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 8048272 total, 5233800 free, 1455024 used, 1359448 buff/cache
KiB Swap: 4024128 total, 4024128 free, 0 used. 6232916 avail Mem

I can make either of the Denver cores (CPU1 and CPU2) run individually, but I can't make them both work simultaneously.

This isn't the case for the other ARM cores of the TX2; it only happens with the Denver cores. This is what top reports when using taskset mask 0x30 (ARM cores 4 and 5):

top - 13:01:20 up 11 days, 22:45, 2 users, load average: 0,12, 0,41, 0,93
Tasks: 305 total, 2 running, 303 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0,0 us, 0,4 sy, 0,0 ni, 98,5 id, 0,0 wa, 0,0 hi, 1,1 si, 0,0 st
%Cpu1 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 1,9 us, 1,5 sy, 0,0 ni, 96,7 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 99,3 us, 0,0 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,7 hi, 0,0 si, 0,0 st
%Cpu5 : 99,3 us, 0,0 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,7 hi, 0,0 si, 0,0 st
KiB Mem : 8048272 total, 5237364 free, 1451372 used, 1359536 buff/cache
KiB Swap: 4024128 total, 4024128 free, 0 used. 6236556 avail Mem

The application isn't inherently bound to any specific core; it's just a simple OpenMP program with the environment variable OMP_NUM_THREADS set to 2.

This is how I’m executing my application:

taskset 0x30 ./bin/matix_multiplication_omp_opt_float_16

Needless to say, both Denver cores were online while performing these experiments. Might this be a bug?

Best regards, and thank you for all the hard work!

Alvaro.