CUDA 7.5 on Maxwell 980Ti drops performance by 10x versus CUDA 7.0 and 6.5
Hi,

I have been developing a "Monte Carlo simulation for photon tracing behavior" in CUDA for the last few months (GitHub: https://github.com/fninaparavecino/mcx). Since updating to CUDA 7.5, I have experienced a 10x drop in performance compared to previous CUDA releases.

I have been exploring this issue for the last few weeks. I have read this article: https://devtalk.nvidia.com/default/topic/871702/cuda-7-5-give-a-30-performance-loss-vs-cuda-6-5/ and tried the work-around described there, but it did not help. I know that my kernels are not spilling any registers. Here is the behavior of my main kernel (mcx_main_loop) across the different CUDA toolkit versions.

CUDA RC 6.5: 15290.52 Photons/ms, Regs: 70, Cmem[0]: 424, Cmem[2]: 48, lmem: 24. 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

CUDA RC 7.0: 15408.32 Photons/ms, Regs: 87, Cmem[0]: 424, Cmem[2]: 68, lmem: 24. 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

CUDA RC 7.5: Regs: 77, Cmem[0]: 424, Cmem[2]: 72, lmem: 24. 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads (for the simulation speed under 7.5, see below)

So I have tried this compilation flag -Xptxas -dlcm=cg to avoid L1 access, as you can see here:

nvcc -c -lineinfo -Xptxas -v,-dlcm=cg --maxrregcount 77 -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -gencode=arch=compute_52,code=\"sm_52,compute_52\" -DMCX_TARGET_NAME='"Maxwell MCX"' -o mcx_core.o mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_utils.o mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_shapes.o mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o tictoc.o tictoc.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcextreme.o mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o cjson/cJSON.o cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o -o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++ -fopenmp

But my performance is still drastically reduced: MCX simulation speed: 1463.27 photon/ms.
Could anyone suggest a work-around? Why is CUDA 7.5 causing this issue, and why only with Maxwell?

#1
Posted 03/21/2016 08:26 PM   
The "RC" designation in the stated tool chain versions is confusing. "RC" stands for release candidate, and such early-access versions are always superseded by the final version of each tool chain, which is what you should be using if you use any of CUDA 6.5, 7.0, and 7.5. A 10x performance difference [i]appears[/i] way too large to be due to performance regressions caused by code generation differences. Double check that you are not comparing a debug build with a release build. When you run the application under control of cuda-memcheck, are any issues reported? The generation of machine code (SASS) from PTX is architecture dependent, since the GPU architectures lack binary compatibility. It is therefore possible that a code generation issue affects only one particular architecture. What happens when you change the ptxas optimization level, e.g. -Xptxas -O1 ? As a quick check whether there may be any issues in the front portion of the compiler pipeline, you could try compiling for another architecture, say sm_35, then JIT compile the resulting PTX to Maxwell. If you run the CUDA profiler with the app compiled with CUDA 7.0 vs CUDA 7.5, which performance metrics show significant differences? I briefly looked at the GitHub repository but it is not immediately obvious where the code spends the bulk of its time. Side-remark: There are some computational inefficiencies in the code, but it is not clear whether it is worth addressing those. For example at one place the code computes sinf(acosf(x)), which can be computed faster and more accurately as sqrtf(fmaf(-x,x,1.0f)). int(floorf(x)) is equivalent to __float2int_rd(x), where the latter is one instruction by the former is two. There are also various instances of sinf() and sincosf() that could use sinpif() and sincospif() instead; admittedly the usefulness of this is limited if you routinely compile with -use_fast_math. Various functions in the code look like they could benefit from the use of the modifiers __restrict__ and const __restrict__ (as appropriate) for pointer arguments, see the CUDA Best Practices Guide.
The "RC" designation in the stated tool chain versions is confusing. "RC" stands for release candidate, and such early-access versions are always superseded by the final version of each tool chain, which is what you should be using if you use any of CUDA 6.5, 7.0, and 7.5.

A 10x performance difference appears way too large to be due to performance regressions caused by code generation differences. Double check that you are not comparing a debug build with a release build. When you run the application under control of cuda-memcheck, are any issues reported?
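
For the cuda-memcheck run, invoking the application directly under the tool should be sufficient; for example (the trailing arguments are placeholders for however you normally launch mcx):

cuda-memcheck ../bin/mcx <your usual mcx arguments>
cuda-memcheck --tool racecheck ../bin/mcx <your usual mcx arguments>    # checks shared-memory data hazards only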

The generation of machine code (SASS) from PTX is architecture dependent, since the GPU architectures lack binary compatibility. It is therefore possible that a code generation issue affects only one particular architecture.

What happens when you change the ptxas optimization level, e.g., -Xptxas -O1? As a quick check whether there may be any issues in the front portion of the compiler pipeline, you could try compiling for another architecture, say sm_35, then JIT-compile the resulting PTX to Maxwell. If you run the CUDA profiler on the app compiled with CUDA 7.0 vs. CUDA 7.5, which performance metrics show significant differences?
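
For the sm_35 experiment, a build line along these lines embeds only compute_35 PTX so the driver JIT-compiles it for the installed GPU at load time (the remaining flags are carried over from your existing nvcc invocation and may need adjusting):

nvcc -c -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -gencode arch=compute_35,code=compute_35 -DMCX_TARGET_NAME='"Maxwell MCX"' -o mcx_core.o mcx_core.cu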

I briefly looked at the GitHub repository but it is not immediately obvious where the code spends the bulk of its time. Side remark: There are some computational inefficiencies in the code, but it is not clear whether it is worth addressing those. For example, at one place the code computes sinf(acosf(x)), which can be computed faster and more accurately as sqrtf(fmaf(-x,x,1.0f)). int(floorf(x)) is equivalent to __float2int_rd(x), where the latter is one instruction while the former is two. There are also various instances of sinf() and sincosf() that could use sinpif() and sincospif() instead; admittedly the usefulness of this is limited if you routinely compile with -use_fast_math. Various functions in the code look like they could benefit from the use of the modifiers __restrict__ and const __restrict__ (as appropriate) for pointer arguments; see the CUDA Best Practices Guide.
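
Here is a minimal sketch of these substitutions (illustrative only; the kernel and variable names are made up and not taken from the MCX sources):

// Illustrative only: demonstrates the substitutions suggested above.
__global__ void scatter_demo(const float * __restrict__ cos_theta,
                             float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float c = cos_theta[i];                  // cos(theta), assumed in [-1, 1]
    // sinf(acosf(c)) computed faster and more accurately:
    float s = sqrtf(fmaf(-c, c, 1.0f));
    // int(floorf(x)) in a single instruction instead of two:
    int cell = __float2int_rd(10.0f * c);

    out[i] = s + (float)cell;
}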

#2
Posted 03/21/2016 08:58 PM   
Thanks, njuffa, for your reply. Yes, I meant the final version of each tool chain.

I double-checked it many times: it is the same source code and the same compilation mechanism; the only difference is the CUDA version.

I followed your suggestions, and my results are:

  1. Using -Xptxas -O1, my performance dropped even more. MCX simulation speed: 1258.34 photon/ms
  2. I compiled for sm_35 and JIT-compiled the PTX to Maxwell. MCX simulation speed: 1415.63 photon/ms
  3. Usually I simulate 10M photons per run, but in order to collect metrics from nvvp I ran 1M photons per run, first with CUDA 7.0 and then with CUDA 7.5. In general, global memory throughput and L2 throughput are better with CUDA 7.0, but what is very interesting is warp efficiency: warp execution efficiency under CUDA 7.5 is very poor compared to CUDA 7.0. Below are the metrics that show differences (CUDA 7.0 vs CUDA 7.5); a command-line sketch for collecting these metrics appears right after this list:
    • Registers/Thread: 87 vs 77
    • Global Load Throughput(GB/sec): 123.82 vs 17.71
    • Global Store Throughput(MB/sec): 44.62 vs 6.26
    • Warp Execution Efficiency(%): 26.87 vs 3.23
    • Warp Non-Predicated Execution Efficiency(%): 25.84 vs 3.09
    • L2 Throughput (Reads)(GB/sec): 161.7 vs 23.03
    • L2 Throughput (Writes)(GB/sec): 37.93 vs 5.33
    • L2 Throughput (Atomic requests)(GB/sec): 37.88 vs 5.32

  4. Thanks for all your suggestions for improving compute efficiency. Right now, the bottleneck is execution dependency in the hitgrid function and arithmetic operations. I intend to address all of your side remarks. But I would still like to understand why performance drops so drastically between CUDA tool chain versions, and mainly how I can avoid it.
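
As a side note, the same metrics can be collected from the command line with nvprof, e.g. (metric names may need checking against nvprof --query-metrics; replace the trailing arguments with whatever run_qtest.sh normally passes to mcx):

nvprof --metrics warp_execution_efficiency,warp_nonpred_execution_efficiency,gld_throughput,gst_throughput,l2_read_throughput,l2_write_throughput ../bin/mcx <usual mcx arguments>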

#3
Posted 03/22/2016 10:51 PM   
The reason I asked you to double-check debug vs. release build is that this has come up repeatedly as the source of performance differences as large as 10x. Performance regressions caused by code generation are typically no larger than 25% to 30% in bad cases.

If this turns out to be a compiler issue, it would appear to be local to ptxas which translates PTX into SASS (machine code). The fact that you lose additional performance with -Xptxas -O1 tells me that your normal build does not have all optimizations turned off.

Nothing in the source code suggests any particular critical sequences that could be affected by code generation issues to the tune of a factor of 10x. The profiler performance metrics are all consistent with the 10x application-level performance reduction. I am puzzled. A side-by-side comparison of the object code with cuobjdump --dump-sass between the executables from the CUDA 7.0 and CUDA 7.5 builds might be instructive. I don't have the time to build the project, though.
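
For the SASS comparison, something along these lines would do (the object file names are placeholders for objects built with CUDA 7.0 and 7.5, respectively):

cuobjdump --dump-sass mcx_core_cuda70.o > mcx_cuda70.sass
cuobjdump --dump-sass mcx_core_cuda75.o > mcx_cuda75.sass
diff -u mcx_cuda70.sass mcx_cuda75.sass | less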

Changing the focus briefly to hardware: Is this a machine with multiple GPUs? If so, double-check that you are running the application on the correct one. Does nvidia-smi show the GPU running at full speed (look at power state and core frequencies) while the app is running?
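
For example, something like this polls power state and clocks once per second while the simulation runs (field names per nvidia-smi --help-query-gpu):

nvidia-smi --query-gpu=index,pstate,clocks.sm,clocks.mem,power.draw,temperature.gpu --format=csv -l 1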

If, after sufficient due diligence, you believe the problem is with an NVIDIA software component, rather than on your side, you might want to consider filing a bug report (form is linked from the CUDA registered developer website).

#4
Posted 03/23/2016 12:18 AM   
FYI, I previously reported this problem in the thread below:

https://devtalk.nvidia.com/default/topic/917213/maxwell-suddernly-becomes-10x-slower/

Most of my tests, including with the latest CUDA 7.5.18, are documented in this issue tracker:

https://github.com/fangq/mcx/issues/18

At the beginning, I suspected that my 980Ti was defective, but later on all evidence pointed to the CUDA toolkit version. We did all tests carefully and are certain that we used the intended GPU hardware.

Interestingly, I have been compiling this code with the "-arch=sm_20" option and running the binary at the "good/expected" speed on Maxwell/Kepler/Fermi. This speed drop on Maxwell happened randomly at first, but since January it has become permanent when compiling with CUDA 7.5.
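
To double-check which code a given binary actually carries (precompiled SASS vs. PTX that the driver must JIT-compile on Maxwell), something like this can be run on the mcx executable:

cuobjdump --list-elf ../bin/mcx    # cubin (SASS) versions embedded in the fat binary
cuobjdump --list-ptx ../bin/mcx    # PTX versions embedded in the fat binary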

Another note: I found that my OpenCL version of this code also took the same speed hit; the simulation speed on the 980Ti is 3x slower than on Fermi (GTX 590), which was not the case last year.

#5
Posted 03/23/2016 03:21 AM   
Can you give a set of instructions for building the code and running a benchmark? It wasn't entirely obvious from the GitHub repo. What is the difference, if any, between the FangQ project repo and the Fanny Nina Paravecino project repo?

Also, the performance comparison in item #3 above surely looks to me as if debug mode was enabled (i.e., compiled with -G) for the "slow" case. The performance has cratered across the board, in all the metrics. I'm not sure why the warp efficiency metric should be singled out as "interesting".

#6
Posted 03/23/2016 05:10 AM   
Fanny's repo is one commit behind my master branch (but with many additional commits for debugging); see


https://github.com/fangq/mcx/network


but in terms of reproducing this issue, either repo should do.

Here is the procedure, using my master branch as an example. Assuming a Linux box, you need to run:

git clone https://github.com/fangq/mcx.git

cd mcx/src
make clean
make # compiles mcx binary with your current cuda
cd ../example/quicktest/
./listgpu.sh # this lists all available nvidia GPUs
./run_qtest.sh # run benchmark; assume you use the first gpu (-G 1)


In my case, my first GPU is a 980Ti; I got slow speed (1300 p/s) when compiling with CUDA 7.5 and -arch=sm_xx (xx can be anything above 20). My second GPU is one core of a 590; using -G 2 I got 3000 p/s.

If you have CUDA 7.0 installed, relink /usr/local/cuda to cuda-7.0 and then run:

cd mcx/src
make clean
nvcc -c -lineinfo -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math \
-DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=compute_20 \
-code=sm_20 -code=sm_30 -code=sm_35 -code=sm_50 -code=sm_52 \
-DMCX_TARGET_NAME='"Maxwell MCX"' -o mcx_core.o mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_utils.o mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_shapes.o mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o tictoc.o tictoc.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcextreme.o mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o cjson/cJSON.o cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o \
-o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++ -fopenmp -fopenmp

cd ../example/quicktest/
./listgpu.sh # this lists all available nvidia GPUs
./run_qtest.sh # edit the script and change -G x to use the desired GPU


This will give the good speed (16200 p/s) as well as the correct results (absorption fraction ~17.7%, printed near the end).

#7
Posted 03/23/2016 05:01 PM   
[quote=""]Also, the performance comparison in item #3 above surely looks to me as if debug mode was enabled (i.e. compiled with -G) for the "slow" case. [/quote] just want to make sure we did not confuse you with the -G flag. The "-G N" or "--gpu N" is a flag we used in mcx to select the desired GPU by its ID. It is not a flag used with nvcc for compilation. see [url]https://github.com/fangq/mcx/blob/master/src/mcx_utils.c#L1389[/url]
> Also, the performance comparison in item #3 above surely looks to me as if debug mode was enabled (i.e. compiled with -G) for the "slow" case.


Just want to make sure we did not confuse you with the -G flag. "-G N" or "--gpu N" is a flag we use in mcx to select the desired GPU by its ID; it is not an nvcc compilation flag. See

https://github.com/fangq/mcx/blob/master/src/mcx_utils.c#L1389

#8
Posted 03/23/2016 05:08 PM   
I think the culprit could be the use of -lineinfo. As I recall, to provide accurate matching of source code line numbers to machine code instructions, the compiler needs to turn off most optimizations. With full optimization, even instructions from the same expression (let alone the same source line) will be strewn all over the code, and the code from some source lines will disappear entirely (e.g. absorbed by CSE), etc.

I would suggest that as a quick experiment you remove -lineinfo from your nvcc invocation.

#9
Posted 03/23/2016 05:38 PM   
I followed this sequence:

cd mcx/src
make clean
make
cd ../example/quicktest/
./listgpu.sh
./run_qtest.sh




and got this output:

[bob@fed20 src]$ make clean
rm -f mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o ../bin/mcx ../bin/mcx_atomic ../bin/mcx_det
[bob@fed20 src]$ make
nvcc -c -lineinfo -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=sm_20 -DMCX_TARGET_NAME='"Fermi MCX"' -o mcx_core.o mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_utils.o mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcx_shapes.o mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o tictoc.o tictoc.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o mcextreme.o mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -O3 -std=c99 -m64 -fopenmp -c -o cjson/cJSON.o cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o -o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++ -fopenmp
[bob@fed20 src]$ cd ../example/quicktest
[bob@fed20 quicktest]$ ls
grid2x.inp qtest.inp README.txt run_qtest.sh
grid3x.inp qtest.json run_grid2x.sh run_qtest_silent.sh
listgpu.sh qtest_widefield.inp run_grid3x.sh
[bob@fed20 quicktest]$ ./listgpu.sh
============================= GPU Infomation ================================
Device 1 of 1: GeForce GTX 960
Compute Capability: 5.2
Global Memory: 2146762752 B
Constant Memory: 65536 B
Shared Memory: 49152 B
Registers: 65536
Clock Speed: 1.24 GHz
Number of MPs: 8
Number of Cores: 1024
SMX count: 8
[bob@fed20 quicktest]$ ./run_qtest.sh
216+0 records in
216+0 records out
216000 bytes (216 kB) copied, 0.00110169 s, 196 MB/s

MCX ERROR(0):assert error in unit mcx_utils.c:237

real 0m0.003s
user 0m0.001s
sys 0m0.002s


Unfortunately that assert is in a non-inlined function call, so it's nearly useless as a debugging aid. Without firing up a debugger, I have no idea which function called that assert, what the actual issue was, or which test failed.

#10
Posted 03/23/2016 05:52 PM   
[quote=""]Unfortunately that assert is in a non-inlined function call, so it's nearly useless as a debugging aid. I have no idea what function called that assert or what the actual issue was or test that failed without firing up a debugger. [/quote] agree, the message was not helpful at all. I meant to rewrite as the mcx_cu_assess() in mcx_core.cu, but did not get chance to update. this is now fixed in my master. please run "git pull" to get the updated code https://github.com/fangq/mcx/commit/0fbf48431d37d9e72058370f9f0b8f4acfb70b46 nonetheless, an error thrown by mcx_assess means something wrong in your input file and does not indicate a CUDA error (in that case, mcx_cu_assess will be called). please let me know which line in the input file (qtest.inp) triggered the error. (by the way, I tried my sequence on three different Ubuntu boxes, I did not see the error you mentioned)
> Unfortunately that assert is in a non-inlined function call, so it's nearly useless as a debugging aid. Without firing up a debugger, I have no idea which function called that assert, what the actual issue was, or which test failed.


Agreed, the message was not helpful at all. I meant to rewrite it like mcx_cu_assess() in mcx_core.cu, but did not get a chance to update it.

This is now fixed in my master branch. Please run "git pull" to get the updated code:

https://github.com/fangq/mcx/commit/0fbf48431d37d9e72058370f9f0b8f4acfb70b46

Nonetheless, an error thrown by mcx_assess means something is wrong in your input file; it does not indicate a CUDA error (in that case, mcx_cu_assess would be called).

Please let me know which line in the input file (qtest.inp) triggered the error.

(By the way, I tried my sequence on three different Ubuntu boxes and did not see the error you mentioned.)

#11
Posted 03/23/2016 09:04 PM   
[quote=""]I would suggest that as a quick experiment you remove -lineinfo from your nvcc invocation.[/quote] tried that, no impact to the simulation speed.
> I would suggest that as a quick experiment you remove -lineinfo from your nvcc invocation.


Tried that; no impact on the simulation speed.

#12
Posted 03/23/2016 09:05 PM   
In conjunction with the earlier thread, all I can say at this point is: this is getting curiouser and curiouser! I am hoping that txbob will be able to shed some light on the issue. There's got to be a rational explanation for these observations ...

#13
Posted 03/23/2016 09:12 PM   
Is it possible the 980 Ti is stalling so much that it's running at relatively low clocks?

For example, this code makes me nervous (depending on how big the 3 nested loops are):

[Image: http://i.imgur.com/qSJ9xCF.png]

If you're on Windows then I would recommend using Nsight as it's very useful. Its instruction-level kernel profiler might help you locate the problem.

#14
Posted 03/23/2016 09:45 PM   
[quote=""]Is it possible the 980 Ti is stalling so much that it's running at relatively low clocks? For example, this code makes me nervous (depending on how big the 3 nested loops are): If you're on Windows then I would recommend using Nsight as it's very useful. Its instruction-level kernel profiler might help you locate the problem. [/quote] first of all, the 3-level nested loop you pointed out is no longer used by default. The "USE_CACHEBOX" blocks were hacks to avoid using atomic operations in the early NVIDIA hardware, but now, the cost of atomic operations are negligible in this code. So I've switched to using atomic operations by default. nonetheless, I admit many places were not written in the most efficient way. Currently, we prioritize our code optimization using the nvvp PC sampling profiling. The inefficient implementations mentioned earlier in this thread were, fortunately, not the hotspot. The PC sampling report screenshot (before it behaves strangely on the 980Ti) is attached below. The biggest single-line hotspot is a device function called hitgrid(). This was recently accelerated by using a custom nextafterf() function ([url]https://github.com/fangq/mcx/commit/29ea4261ff906b713b0b35a380300118747e6c52#diff-0083a506345d0d19caffd23f50b59bcdL125[/url]). [img]http://www.nmr.mgh.harvard.edu/~fangq/temp/mcx_profiling.png[/img] My remaining issues are high "execution dependency" and "instruction fetch". I am not exactly sure how to reduce those.
> Is it possible the 980 Ti is stalling so much that it's running at relatively low clocks?
>
> For example, this code makes me nervous (depending on how big the 3 nested loops are):
>
> If you're on Windows then I would recommend using Nsight as it's very useful. Its instruction-level kernel profiler might help you locate the problem.


First of all, the 3-level nested loop you pointed out is no longer used by default. The "USE_CACHEBOX" blocks were hacks to avoid atomic operations on early NVIDIA hardware, but now the cost of atomic operations is negligible in this code, so I've switched to using atomic operations by default.

Nonetheless, I admit many places are not written in the most efficient way. Currently, we prioritize our code optimization using nvvp PC-sampling profiling. The inefficient implementations mentioned earlier in this thread were, fortunately, not the hotspots.

The PC sampling report screenshot (taken before the code started behaving strangely on the 980Ti) is attached below. The biggest single-line hotspot is a device function called hitgrid(), which was recently accelerated by using a custom nextafterf() function (https://github.com/fangq/mcx/commit/29ea4261ff906b713b0b35a380300118747e6c52#diff-0083a506345d0d19caffd23f50b59bcdL125).

[Image: http://www.nmr.mgh.harvard.edu/~fangq/temp/mcx_profiling.png]
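
For reference, the bit-stepping idea behind such a custom nextafterf() looks roughly like the following. This is a generic illustration only, not the actual MCX implementation in the linked commit, and it ignores the special cases (zero, negative values, infinities, NaN) that a full nextafterf() must handle:

// Generic illustration: for a positive, finite, nonzero float, adding 1 to
// its bit pattern yields the next representable float toward +infinity.
__device__ __forceinline__ float next_toward_inf(float a)
{
    return __int_as_float(__float_as_int(a) + 1);
}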

My remaining issues are high "execution dependency" and "instruction fetch". I am not exactly sure how to reduce those.

#15
Posted 03/23/2016 11:21 PM   