CUDA 7.5 on Maxwell 980Ti drops performance by 10x versus CUDA 7.0 and 6.5

Hi,

I have been developing a Monte Carlo simulation of photon transport behavior in CUDA for the last few months (GitHub: fninaparavecino/mcx - Monte Carlo eXtreme (MCX), a GPU-accelerated photon transport simulator). Since updating the CUDA RC to 7.5, I have experienced a 10x drop in performance compared to the previous CUDA RCs.

I have been exploring this issue for the last few weeks. I have read this thread: https://devtalk.nvidia.com/default/topic/871702/cuda-7-5-give-a-30-performance-loss-vs-cuda-6-5/ and tried its work-around, but it did not help. I know that my kernels are not spilling any registers. Here is the behavior of my main kernel (mcx_main_loop) across the different CUDA SDKs.

CUDA RC 6.5: 15290.52 Photons/ms, Regs: 70, Cmem[0]: 424, Cmem[2]: 48, lmem: 24. 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

CUDA RC 7.0: 15408.32 Photons/ms, Regs: 87, Cmem[0]: 424, Cmem[2]: 68, lmem: 24. 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

CUDA RC 7.5: 15290.52 Photons/ms, Regs: 77, Cmem[0]: 424, Cmem[2]: 72, lmem: 24. 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

So I tried the compilation flag -Xptxas -dlcm=cg to bypass the L1 cache for global loads, as you can see here:

nvcc -c -lineinfo -Xptxas -v,-dlcm=cg --maxrregcount 77 -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -gencode=arch=compute_52,code="sm_52,compute_52" -DMCX_TARGET_NAME='"Maxwell MCX"' -o mcx_core.o mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_utils.o mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_shapes.o mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o tictoc.o tictoc.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcextreme.o mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o cjson/cJSON.o cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o -o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++ -fopenmp

But my results are still drastically reduced: MCX simulation speed: 1463.27 photon/ms
Could anyone suggest a work-around? Why is CUDA 7.5 causing this issue, and why only on Maxwell?

The “RC” designation in the stated tool chain versions is confusing. “RC” stands for release candidate, and such early-access versions are always superseded by the final version of each tool chain, which is what you should be using if you use any of CUDA 6.5, 7.0, and 7.5.

A 10x performance difference appears way too large to be due to performance regressions caused by code generation differences. Double check that you are not comparing a debug build with a release build. When you run the application under control of cuda-memcheck, are any issues reported?

The generation of machine code (SASS) from PTX is architecture dependent, since the GPU architectures lack binary compatibility. It is therefore possible that a code generation issue affects only one particular architecture.

What happens when you change the ptxas optimization level, e.g. -Xptxas -O1 ? As a quick check whether there may be any issues in the front portion of the compiler pipeline, you could try compiling for another architecture, say sm_35, then JIT compile the resulting PTX to Maxwell. If you run the CUDA profiler with the app compiled with CUDA 7.0 vs CUDA 7.5, which performance metrics show significant differences?
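For the JIT experiment, something along these lines should work (an untested sketch that adapts your earlier nvcc invocation; embedding only compute_35 PTX forces the driver to JIT-compile for the Maxwell part at application start-up):

nvcc -c -Xptxas -v -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math \
   -DSAVE_DETECTORS -DUSE_CACHEBOX \
   -gencode=arch=compute_35,code=compute_35 \
   -DMCX_TARGET_NAME='"Maxwell MCX"' -o mcx_core.o mcx_core.cu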

I briefly looked at the GitHub repository but it is not immediately obvious where the code spends the bulk of its time. Side remark: There are some computational inefficiencies in the code, but it is not clear whether it is worth addressing those. For example, in one place the code computes sinf(acosf(x)), which can be computed faster and more accurately as sqrtf(fmaf(-x,x,1.0f)). int(floorf(x)) is equivalent to __float2int_rd(x), where the latter is one instruction while the former is two. There are also various instances of sinf() and sincosf() that could use sinpif() and sincospif() instead; admittedly the usefulness of this is limited if you routinely compile with -use_fast_math. Various functions in the code look like they could benefit from the use of the modifiers restrict and const restrict (as appropriate) for pointer arguments; see the CUDA Best Practices Guide.
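To make the side remarks concrete, here is an illustrative sketch (the identifiers are made up for the example and are not taken from mcx_core.cu):

__device__ float sin_of_acos(float x)      // instead of sinf(acosf(x))
{
    // valid for x in [-1,1]; one sqrt and one FMA instead of two transcendentals
    return sqrtf(fmaf(-x, x, 1.0f));
}

__device__ int cell_index(float x)         // instead of (int)floorf(x)
{
    // single convert instruction with round-toward-minus-infinity
    return __float2int_rd(x);
}

// const/__restrict__ pointer arguments tell the compiler the buffers do not
// alias, easing the constraints on load/store scheduling and caching
__global__ void scale(float * __restrict__ out,
                      const float * __restrict__ in, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}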

Thanks, njuffa, for your reply. Yes, I meant the final version of each tool chain.

I have double-checked it many times: it is the same source code and the same compilation mechanism; the only difference is the CUDA version.

I followed your suggestions, and my results are:

  1. Using -Xptxas -O1, my performance dropped even more. MCX simulation speed: 1258.34 photon/ms
  2. I compiled for sm_35 and JIT-compiled the PTX to Maxwell. MCX simulation speed: 1415.63 photon/ms
  3. Usually I simulate 10M photons per run, but in order to get the metrics from nvvp I ran 1M photons per run, first with CUDA 7.0 and then with CUDA 7.5. In general, global memory throughput and L2 throughput are better with CUDA 7.0, but what is most interesting is warp efficiency: warp execution efficiency under CUDA 7.5 is very poor compared to CUDA 7.0. Below are the metrics that show differences (CUDA 7.0 vs CUDA 7.5); one possible nvprof command for collecting them is sketched after this list:
    • Registers/Thread: 87 vs 77
    • Global Load Throughput(GB/sec): 123.82 vs 17.71
    • Global Store Throughput(MB/sec): 44.62 vs 6.26
    • Warp Execution Efficiency(%): 26.87 vs 3.23
    • Warp Non-Predicated Execution Efficiency(%): 25.84 vs 3.09
    • L2 Throughput (Reads)(GB/sec): 161.7 vs 23.03
    • L2 Throughput (Writes)(GB/sec): 37.93 vs 5.33
    • L2 Throughput (Atomic requests)(GB/sec): 37.88 vs 5.32
  4. Thanks for all your suggestions to improve compute efficiency. Right now, the bottleneck is execution dependency in the hitgrid function and the arithmetic operations. I would like to address all of your side remarks, and I will. But I would still like to understand why performance drops so drastically between the different CUDA tool chains, and mainly how I can avoid it.
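For reference, the metrics in item 3 can be gathered in one profiling pass with a command along these lines (the metric names are the ones used by the CUDA 7.x command-line profiler and may differ slightly between versions; the mcx arguments are placeholders):

nvprof --kernels mcx_main_loop \
   --metrics gld_throughput,gst_throughput,l2_read_throughput,l2_write_throughput,l2_atomic_throughput,warp_execution_efficiency,warp_nonpred_execution_efficiency \
   ../bin/mcx <usual mcx arguments, with the photon count reduced to 1M>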

The reason I asked you to double-check debug vs. release build is that this has come up repeatedly as the source of performance differences as large as 10x. Performance regressions based on code generation are typically no larger than 25% to 30% in bad cases.

If this turns out to be a compiler issue, it would appear to be local to ptxas which translates PTX into SASS (machine code). The fact that you lose additional performance with -Xptxas -O1 tells me that your normal build does not have all optimizations turned off.

Nothing in the source code suggests any particular critical sequences that could be affected by code generation issues to the tune of a factor of 10x. The profiler performance metrics are all consistent with the 10x application-level performance reduction. I am puzzled. A side-by-side comparison of the object code with cuobjdump --dump-sass between the executables from the CUDA 7.0 and CUDA 7.5 builds might be instructive. I don’t have the time to build the project, though.

Changing the focus briefly to hardware: Is this a machine with multiple GPUs? If so, double-check that you are running the application on the correct one. Does nvidia-smi show the GPU running at full speed (look at power state, and core frequencies) while the app is running?
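For example, something like this in a second terminal polls the performance state, clocks, and power draw once per second while the simulation runs:

nvidia-smi --query-gpu=index,name,pstate,clocks.sm,clocks.mem,power.draw \
   --format=csv -l 1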

If, after sufficient due diligence, you believe the problem is with an NVIDIA software component, rather than on your side, you might want to consider filing a bug report (form is linked from the CUDA registered developer website).

FYI, I previously reported this problem in the thread below:

https://devtalk.nvidia.com/default/topic/917213/maxwell-suddernly-becomes-10x-slower/

Most of my tests, including those using the latest CUDA 7.5.18, are documented in this issue tracker:

https://github.com/fangq/mcx/issues/18

At the beginning I suspected that my 980Ti was defective, but later on all the evidence pointed to the CUDA toolkit versions. We did all the tests carefully and are certain that we used the intended GPU hardware.

Interestingly, I have been compiling this code with the “-arch=sm_20” option and running the binary at the “good/expected” speed on Maxwell/Kepler/Fermi. This speed drop on Maxwell happened randomly at first, but since January it has become permanent when compiling with CUDA 7.5.

Another note: I found that my OpenCL version of this code took the same speed hit - the simulation speed on the 980Ti is 3x slower than on Fermi (GTX 590), which was not the case last year.

Can you give a set of instructions for building the code and running a benchmark? It wasn’t entirely obvious from the github repo. What is the difference if any between the FangQ project repo and the Fanny Nina Paravecino project repo? Also, the performance comparison in item #3 above surely looks to me as if debug mode was enabled (i.e. compiled with -G) for the “slow” case. The performance has cratered across the board, in all the metrics. I’m not sure why the warp efficiency metric should be singled out as “interesting”.

Fanny’s repo is 1 commit behind my master branch (but with many additional commits for debugging): see

but in terms of reproducing this issue, either repo should do.

Here is the procedure, using my master branch as an example. Assuming a Linux box, you need to run:

git clone https://github.com/fangq/mcx.git
cd mcx/src
make clean
make                         # compiles mcx binary with your current cuda
cd ../example/quicktest/
./listgpu.sh                 # this lists all available nvidia GPUs
./run_qtest.sh               # run benchmark; assume you use the first gpu (-G 1)

In my case, my first GPU is the 980Ti; I get the slow speed (1300 p/s) when compiling with CUDA 7.5 and -arch=sm_xx (xx can be anything from 20 up). My second GPU is one core of a GTX 590; using -G 2 I get 3000 p/s.

If you have CUDA 7.0 installed, relink /usr/local/cuda to cuda-7.0, and then run:

cd mcx/src
make clean
nvcc -c -lineinfo  -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math \
   -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=compute_20 \
   -code=sm_20 -code=sm_30 -code=sm_35 -code=sm_50 -code=sm_52 \
   -DMCX_TARGET_NAME='"Maxwell MCX"' -o mcx_core.o  mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o mcx_utils.o  mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o mcx_shapes.o  mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o tictoc.o  tictoc.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o mcextreme.o  mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o cjson/cJSON.o  cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o \
    -o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++ -fopenmp -fopenmp

cd ../example/quicktest/
./listgpu.sh                 # this lists all available nvidia GPUs
./run_qtest.sh               # edit the script and change -G x to use the desired GPU

This will give the good speed (16200 p/s) as well as the correct results (absorption fraction ~17.7%, printed near the end).

Just to make sure we did not confuse you with the -G flag: “-G N” or “--gpu N” is a flag we use in mcx to select the desired GPU by its ID. It is not the nvcc compilation flag. See

https://github.com/fangq/mcx/blob/master/src/mcx_utils.c#L1389

I think the culprit could be the use of -lineinfo. As I recall, to provide accurate matching of source code line numbers to machine code instructions the compiler needs to turn off most optimizations. With full optimization, even instructions from the same expression (let alone the same source line) will be strewn all over the code, the code from some source code lines will disappear entirely (e.g. absorbed by CSE), etc.

I would suggest that as a quick experiment you remove -lineinfo from your nvcc invocation.

I followed this sequence:

cd mcx/src
make clean
make                         
cd ../example/quicktest/
./listgpu.sh                 
./run_qtest.sh

and got this output:

[bob@fed20 src]$ make clean
rm -f mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o ../bin/mcx ../bin/mcx_atomic ../bin/mcx_det
[bob@fed20 src]$ make
nvcc -c -lineinfo  -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=sm_20 -DMCX_TARGET_NAME='"Fermi MCX"' -o mcx_core.o  mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o mcx_utils.o  mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o mcx_shapes.o  mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o tictoc.o  tictoc.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o mcextreme.o  mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99  -m64 -fopenmp -c -o cjson/cJSON.o  cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o -o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++ -fopenmp
[bob@fed20 src]$ cd ../example/quicktest
[bob@fed20 quicktest]$ ls
grid2x.inp  qtest.inp            README.txt     run_qtest.sh
grid3x.inp  qtest.json           run_grid2x.sh  run_qtest_silent.sh
listgpu.sh  qtest_widefield.inp  run_grid3x.sh
[bob@fed20 quicktest]$ ./listgpu.sh
=============================   GPU Infomation  ================================
Device 1 of 1:          GeForce GTX 960
Compute Capability:     5.2
Global Memory:          2146762752 B
Constant Memory:        65536 B
Shared Memory:          49152 B
Registers:              65536
Clock Speed:            1.24 GHz
Number of MPs:          8
Number of Cores:        1024
SMX count:              8
[bob@fed20 quicktest]$ ./run_qtest.sh
216+0 records in
216+0 records out
216000 bytes (216 kB) copied, 0.00110169 s, 196 MB/s

MCX ERROR(0):assert error in unit mcx_utils.c:237

real    0m0.003s
user    0m0.001s
sys     0m0.002s

Unfortunately that assert is in a non-inlined function call, so it’s nearly useless as a debugging aid. Without firing up a debugger, I have no idea which function called that assert, what the actual issue was, or which test failed.

Agreed, the message was not helpful at all. I meant to rewrite it along the lines of mcx_cu_assess() in mcx_core.cu, but did not get a chance to update it.

This is now fixed in my master. Please run “git pull” to get the updated code.

Nonetheless, an error thrown by mcx_assess means something is wrong in your input file and does not indicate a CUDA error (in that case, mcx_cu_assess would be called).

Please let me know which line in the input file (qtest.inp) triggered the error.

(By the way, I tried my sequence on three different Ubuntu boxes and did not see the error you mentioned.)

Tried that; no impact on the simulation speed.

In conjunction with the earlier thread, all I can say at this point is: this is getting curiouser and curiouser! I am hoping that txbob will be able to shed some light on the issue. There’s got to be a rational explanation for these observations …

Is it possible the 980 Ti is stalling so much that it’s running at relatively low clocks?

For example, the code with the three nested loops makes me nervous (depending on how big the loops are).

If you’re on Windows then I would recommend using Nsight as it’s very useful. Its instruction-level kernel profiler might help you locate the problem.

First of all, the 3-level nested loop you pointed out is no longer used by default. The “USE_CACHEBOX” blocks were hacks to avoid atomic operations on early NVIDIA hardware, but now the cost of atomic operations is negligible in this code, so I’ve switched to using atomic operations by default.

Nonetheless, I admit many places are not written in the most efficient way. Currently, we prioritize our code optimization using nvvp’s PC sampling profiling. The inefficient implementations mentioned earlier in this thread are, fortunately, not the hotspots.

The PC sampling report screenshot (taken before the code started behaving strangely on the 980Ti) is attached below. The biggest single-line hotspot is a device function called hitgrid(). This was recently accelerated by using a custom nextafterf() function (https://github.com/fangq/mcx/commit/29ea4261ff906b713b0b35a380300118747e6c52#diff-0083a506345d0d19caffd23f50b59bcdL125).

[Attachment: nvvp PC sampling screenshot]
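For those who do not want to dig through the commit: the general idea of the custom nextafterf() is to step the float by one ulp through its integer representation. A rough sketch (not the exact MCX implementation; zero/inf/NaN edge cases are ignored):

__device__ float nextafter_ulp(float a, int dir)   // dir = +1 toward +inf, -1 toward -inf
{
    int bits = __float_as_int(a);
    // for positive floats a larger bit pattern means a larger value;
    // for negative floats the ordering is reversed
    bits += (a >= 0.0f) ? dir : -dir;
    return __int_as_float(bits);
}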

My remaining issues are high “execution dependency” and “instruction fetch” stalls. I am not exactly sure how to reduce those.

Have you considered annotating your kernel with __launch_bounds__(…) instead of --maxrregcount?
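A sketch of the annotation (64 threads per block and 16 resident blocks per SM are placeholder numbers, not tuned values for mcx_main_loop):

// per-kernel register/occupancy hint instead of a global --maxrregcount;
// the compiler limits registers so 16 blocks of 64 threads can be resident
__global__ void __launch_bounds__(64, 16) demo_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * out[i];
}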

I would add to that: Have you considered using neither -maxrregcount nor __launch_bounds()? I find that their utility has diminished with modern CUDA compilers and GPU architectures, which doesn’t mean they cannot play a positive role here and there.

I am not an expert at interpreting the profiler metrics, but I think “execution dependency” points to a lack of ILP (instruction-level parallelism), and “instruction fetch” may point to issues either with branches or with exceeding the instruction cache in core processing loops. Such issues may be addressable through source code changes (e.g. rearranging computation, limiting unrolling and/or inlining).

You may want to look into the issues I noticed while perusing the code, starting with maximizing the use of “const”, “restrict”, and “const restrict” attributes for pointer arguments to functions (this really has to be done pervasively in order to ease the constraints on compiler code generation imposed by C++ language semantics).

I’m running this on Fedora 20. With your update, the error is reported on line 396 of mcx_utils.c:

MCX_ASSERT(fscanf(in,"%f %f %f", &(cfg->tstart),&(cfg->tend),&(cfg->tstep) )==3,__FILE__,__LINE__);

which appears to be reading this line from qtest.inp:

0.e+00 5.e-09 5.e-9  # time-gates(s): start, end, step

To work around that, I made the following changes to mcx_utils.c:

printf("fscanf: %d\n", fscanf(in,"%f %f %f", &(cfg->tstart),&(cfg->tend),&(cfg->tstep) ));
     printf("p1: %f, p2: %f, p3: %f\n", cfg->tstart, cfg->tend, cfg->tstep);
     cfg->tstart = 0.e+00;
     cfg->tend = 5.e-09;
     cfg->tstep = 5.e-9;
//  below is the original line 396, above code is added immediately prior to it
//     MCX_ASSERT(fscanf(in,"%f %f %f", &(cfg->tstart),&(cfg->tend),&(cfg->tstep) )==3,__FILE__,__LINE__);

With that, the extra output from the added printf statements above looks like this:

fscanf: 1
p1: 0.000000, p2: 0.000000, p3: 0.000000

I didn’t bother trying to debug that any further. There seems to be something messed up in the formatted input.
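If someone wants to dig further, one way to see exactly where the conversion stops would be to read the raw line with fgets() and parse it with strtof(). This is a hypothetical helper for debugging, not code from mcx_utils.c:

#include <stdio.h>
#include <stdlib.h>

/* hypothetical debugging helper: read the time-gate line and report where
   float conversion stops; returns the number of values parsed */
static int read_time_gates(FILE *in, float *tstart, float *tend, float *tstep)
{
    char line[256], *p, *end;
    float v[3];
    int i;
    if (fgets(line, sizeof(line), in) == NULL) return 0;
    printf("raw line: %s", line);
    p = line;
    for (i = 0; i < 3; i++) {
        v[i] = strtof(p, &end);
        if (end == p) {                 /* conversion stopped here */
            printf("parse failed at: '%s'\n", p);
            return i;
        }
        p = end;
    }
    *tstart = v[0]; *tend = v[1]; *tstep = v[2];
    return 3;
}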

Anyway, the above changes also put the “correct” values in for those parameters. With that, I get results like this compiled with CUDA 7.5 on a GTX960 (note that as already indicated in entry 10 above, your Makefile defaults to -arch=sm_20):

$ ./run_qtest.sh
fscanf: 1
p1: 0.000000, p2: 0.000000, p3: 0.000000
autopilot mode: setting thread number to 16384, block size to 64 and time gates to 1
###############################################################################
#                      Monte Carlo eXtreme (MCX) -- CUDA                      #
#          Copyright (c) 2009-2015 Qianqian Fang <q.fang at neu.edu>          #
#                                                                             #
#                    Computational Imaging Laboratory (CIL)                   #
#            Department of Bioengineering, Northeastern University            #
###############################################################################
$MCX $Rev::     $ Last Commit $Date::                     $ by $Author:: fangq$
###############################################################################
- variant name: [Fermi] compiled for GPU Capability [100] with CUDA [7050]
- compiled with: RNG [Logistic-Lattice] with Seed Length [5]
- this version CAN save photons at the detectors


GPU=1 threadph=610 oddphotons=5760 np=10000000 nthread=16384 maxgate=1 repetition=1
initializing streams ...        init complete : 0 ms
requesting 2560 bytes of shared memory
lauching MCX simulation for time window [0.00e+00ns 5.00e+00ns] ...
simulation run# 1 ...   kernel complete:        19294 ms
retrieving fields ...   detected 30045 photons, total: 30045    transfer complete:      19313 ms
data normalization complete : 19313 ms
normalizing raw data ...        normalization factor alpha=20.000000
saving data to file ... 216000 1        saving data complete : 19324 ms

simulated 10000000 photons (10000000) with 16384 threads (repeat x1)
MCX simulation speed: 518.48 photon/ms
total simulated energy: 10000000.00     absorbed: 17.69411%
(loss due to initial specular reflection is excluded in the total)

real    0m20.536s
user    0m13.647s
sys     0m5.997s
$

And with CUDA 6.5 I see this:

$ ./run_qtest.sh
fscanf: 1
p1: 0.000000, p2: 0.000000, p3: 0.000000
autopilot mode: setting thread number to 16384, block size to 64 and time gates to 1
###############################################################################
#                      Monte Carlo eXtreme (MCX) -- CUDA                      #
#          Copyright (c) 2009-2015 Qianqian Fang <q.fang at neu.edu>          #
#                                                                             #
#                    Computational Imaging Laboratory (CIL)                   #
#            Department of Bioengineering, Northeastern University            #
###############################################################################
$MCX $Rev::     $ Last Commit $Date::                     $ by $Author:: fangq$
###############################################################################
- variant name: [Fermi] compiled for GPU Capability [100] with CUDA [6050]
- compiled with: RNG [Logistic-Lattice] with Seed Length [5]
- this version CAN save photons at the detectors


GPU=1 threadph=610 oddphotons=5760 np=10000000 nthread=16384 maxgate=1 repetition=1
initializing streams ...        init complete : 0 ms
requesting 2560 bytes of shared memory
lauching MCX simulation for time window [0.00e+00ns 5.00e+00ns] ...
simulation run# 1 ...   kernel complete:        16116 ms
retrieving fields ...   detected 30051 photons, total: 30051    transfer complete:      16135 ms
data normalization complete : 16136 ms
normalizing raw data ...        normalization factor alpha=20.000000
saving data to file ... 216000 1        saving data complete : 16147 ms

simulated 10000000 photons (10000000) with 16384 threads (repeat x1)
MCX simulation speed: 620.77 photon/ms
total simulated energy: 10000000.00     absorbed: 17.69432%
(loss due to initial specular reflection is excluded in the total)

real    0m18.357s
user    0m12.568s
sys     0m4.860s
$

(I happen to be using GPU driver 361.28)

In any event, there seems to be about a 20% difference in performance, not 10x. The reported absorption seems to be approximately the same at ~17.7% in both cases. So at the moment I’m unable to reproduce the 10x claim.

Thanks, txbob, for testing this. Just to make sure: when you recompiled for CUDA 6.5, did you run “make”, or did you use my modified nvcc command as shown in the second block of my post:

https://devtalk.nvidia.com/default/topic/925630/cuda-programming-and-performance/cuda-7-5-on-maxwell-980ti-drops-performance-by-10x-versus-cuda-7-0-and-6-5/post/4841719/#4841719

To get the full-speed binary, you need to use the following command to recompile the .cu file:

nvcc -c -lineinfo  -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math \
   -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=compute_20 \
   -code=sm_20 -code=sm_30 -code=sm_35 -code=sm_50 -code=sm_52 \
   -DMCX_TARGET_NAME='"Maxwell MCX"' -o mcx_core.o  mcx_core.cu

You can simply copy and paste the remaining commands from my previous post, and you should see a different speed.

The NVIDIA GT 730 in my desktop can produce 900 photon/ms using CUDA 7/7.5, so I expect your GTX 960 to be much more capable (it should be above 3000 photon/ms).

Can you double-check?

BTW, I notice there is a slight difference in the number of detected photons between the two runs performed by txbob. Is there a ready explanation for this? I assume that by default each run should produce exactly the same results, since it uses the exact same sequence of random numbers to drive the MC computation, which would mean the code paths taken are identical as well.

Of course, small numeric differences in the results between CUDA versions could also be due to small changes in math functions, or to slightly different FMUL/FADD->FMA contraction choices by the compiler. But I figured I should ask about the PRNG control since reproducibility already seems to be an issue with this application.