Why does Jetson TX1 outperform TX2 on cufft?

I benchmarked an out-of-place complex 2D cuFFT and found that the TX1 outperforms the TX2.
These are the numbers I got for matrix sizes N x N, where N = 2^n.

n     TX2 GFLOPS   TX1 GFLOPS
7     21.2         14.33
8     57.3         47.5
9     53.1         67.27
10    63.5         131.6
11    70.77        212
12    76.1         124

Can anyone explain why the TX1 outperforms the TX2 for values of n > 8? To me this is puzzling.

I am running 1000 iterations of an out-of-place complex cuFFT. I made sure the GPU clock speeds were maximized, which should give the TX2 30% more performance.
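The post does not say how the GFLOPS figures were derived from the measured time. As a rough sketch, assuming the common 5 * M * log2(M) flop convention for a complex FFT of M points (an assumption, not something stated in this thread), the conversion for an N x N transform with N = 2^n could look like the following. The function names are hypothetical.

```c
#include <assert.h>

/* Nominal flop count for one complex 2D FFT of size N x N, N = 2^n.
   Uses the 5 * M * log2(M) convention with M = N * N points.
   ASSUMPTION: the thread does not state which flop count was used. */
double fft2d_nominal_flops(int n)
{
    assert(n >= 0 && n < 31);
    double points = (double)(1ULL << (2 * n)); /* N * N = 2^(2n) */
    return 5.0 * points * (2.0 * n);           /* log2(N * N) = 2n */
}

/* Convert the total elapsed time of `iterations` transforms to GFLOPS. */
double fft2d_gflops(int n, int iterations, double elapsed_seconds)
{
    assert(iterations > 0 && elapsed_seconds > 0.0);
    return fft2d_nominal_flops(n) * iterations / elapsed_seconds / 1e9;
}
```

Under this convention, one transform with n = 10 counts as 104,857,600 flops. Because the elapsed time sits in the denominator, any timer that under-reports it inflates the GFLOPS figure directly.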

Hi abdo-abaco,

Could you share your code for me to profile? You could send it through a private message.

Hi abdo-abaco,

Could you try using CUDA events for profiling? I used this timer to measure the elapsed time on the GPU.

The result shows that the TX2 is faster than the TX1 when running cufftExecC2C.

clock() does not seem to return the correct value for profiling. Please use gettimeofday or another function that reports wall-clock time.
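To illustrate the difference, here is a minimal host-only sketch, assuming a POSIX system. clock() reports CPU time consumed by the process, so it may barely advance while the host thread is blocked, whereas gettimeofday follows the wall clock. The sleep is only a stand-in for a host thread waiting on the GPU, and the helper names are hypothetical. Whether this is exactly what skewed the TX1 numbers is not established in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock time in seconds, via gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* CPU time consumed by this process in seconds, via clock(). */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Stand-in for a host thread blocked on the GPU: sleep for `ms` milliseconds. */
void idle_wait_ms(long ms)
{
    assert(ms >= 0);
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}
```

Sampling both timers around idle_wait_ms(200) should show roughly 0.2 s of wall time but very little CPU time, which is why a clock()-based measurement can make a benchmark look faster than it is.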

Hi WayneWWW,

Thank you. I was in fact getting correct numbers for the TX2, but the values I was getting for the TX1 were incorrect.

Using CUDA events for profiling, I get the following results in GFLOPS:

n     TX2      TX1
7     21.3     14.4
8     57.8     35.5
9     54.4     32.8
10    64       37.08
11    71       34.06
12    75       27.45

Does a 70% performance increase sound right? Other than higher clock speeds and the improved memory hierarchy of the Pascal architecture, is there anything else that accounts for the improved performance?

Thank you,
Abdo

The TX2 has twice as much RAM. Other parts of the system may take advantage of that, e.g., file access can use extra RAM to cache file reads after the first read.