Hello everyone. After recompiling our program with CUDA 7.0, we noticed a strong slowdown (approximately a factor of two) compared with CUDA 6.5.
The program is widely known: it is the BT benchmark from the NASA NPB suite, version 3.3, written in Fortran. The high-level programming language Fortran-DVMH is used to parallelize the benchmark.
The source of the program has been optimized and extended with FDVMH directives.
Our compiler (Fortran DVMH) generates the following output files for this test:
- bt.DVMH.f - the base serial code of the program, extended with RTS-DVMH calls
- bt.DVMH_cuda.cu - CUDA handlers and CUDA kernels for each parallel loop
- bt.DVMH_cuda_info.c - special CUDA information for RTS-DVMH
Consider the compilation of bt.DVMH_cuda.cu. The commands are the following:
- /opt/cuda/cuda-6.5/bin/nvcc -arch=sm_35 -O3 -Xptxas -v -I/home/DVM/dvm_current/dvm_sys/include -c bt.DVMH_cuda.cu
- /opt/cuda/cuda-7.0/bin/nvcc -arch=sm_35 -O3 -Xptxas -v -I/home/DVM/dvm_current/dvm_sys/include -c bt.DVMH_cuda.cu
Our DVMH compiler also processes the CUDA ptxas information and converts it to a readable form. Below is the output of both compilation variants:
CUDA 6.5 ptxas output:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 120 registers, 832 bytes cmem[0]
ptxas info : Compiling entry function '_Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 77 registers, 480 bytes cmem[0]
ptxas info : Compiling entry function '_Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 223 registers, 408 bytes cmem[0]
ptxas info : Compiling entry function '_Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 252 registers, 524 bytes cmem[0], 32 bytes cmem[2]
ptxas info : Compiling entry function '_Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 544 bytes cmem[0], 32 bytes cmem[2]
ptxas info : Compiling entry function '_Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 248 registers, 536 bytes cmem[0], 32 bytes cmem[2]
ptxas info : Compiling entry function '_Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 247 registers, 524 bytes cmem[0], 36 bytes cmem[2]
ptxas info : Compiling entry function '_Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 436 bytes cmem[0]
ptxas info : Compiling entry function '_Z23loop_bt_282_cuda_kernelPdiiiiPi' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_282_cuda_kernelPdiiiiPi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 16 registers, 352 bytes cmem[0]
The same information in DVMH form (CUDA 6.5 DVMH PTX info):
Information of CUDA Ptx assembler for compiled module 'bt':
Compiled all kernels for sm_35 architecture
Used 0 bytes of global memory
Loop on line 282:
Used 16 registers
Used 352 bytes of constant memory in bank 0
Loop on line 294:
Used 223 registers
Used 408 bytes of constant memory in bank 0
Loop on line 811:
Used 25 registers
Used 544 bytes of constant memory in bank 0, 32 bytes of constant memory in bank 2
Loop on line 834:
Used 120 registers
Used 832 bytes of constant memory in bank 0
Loop on line 1053:
Used 247 registers
Used 524 bytes of constant memory in bank 0, 36 bytes of constant memory in bank 2
Loop on line 1677:
Used 252 registers
Used 524 bytes of constant memory in bank 0, 32 bytes of constant memory in bank 2
Loop on line 2300:
Used 248 registers
Used 536 bytes of constant memory in bank 0, 32 bytes of constant memory in bank 2
Loop on line 3177:
Used 77 registers
Used 480 bytes of constant memory in bank 0
Loop on line 3238:
Used 25 registers
Used 436 bytes of constant memory in bank 0
CUDA 7.0 ptxas output:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 86 registers, 832 bytes cmem[0]
ptxas info : Compiling entry function '_Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 42 registers, 480 bytes cmem[0]
ptxas info : Compiling entry function '_Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd
280 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 49 registers, 408 bytes cmem[0]
ptxas info : Compiling entry function '_Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
800 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 198 registers, 524 bytes cmem[0], 20 bytes cmem[2]
ptxas info : Compiling entry function '_Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 23 registers, 544 bytes cmem[0], 20 bytes cmem[2]
ptxas info : Compiling entry function '_Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd
840 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 214 registers, 536 bytes cmem[0], 20 bytes cmem[2]
ptxas info : Compiling entry function '_Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
800 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 198 registers, 524 bytes cmem[0], 24 bytes cmem[2]
ptxas info : Compiling entry function '_Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii' for 'sm_35'
ptxas info : Function properties for _Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 24 registers, 436 bytes cmem[0]
ptxas info : Compiling entry function '_Z23loop_bt_282_cuda_kernelPdiiiiPi' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_282_cuda_kernelPdiiiiPi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 16 registers, 352 bytes cmem[0]
The same information in DVMH form (CUDA 7.0 DVMH PTX info):
Information of CUDA Ptx assembler for compiled module 'bt':
Compiled all kernels for sm_35 architecture
Used 0 bytes of global memory
Loop on line 282:
Used 16 registers
Used 352 bytes of constant memory in bank 0
Loop on line 294:
Used 49 registers
Used 280 bytes stack frames
Used 408 bytes of constant memory in bank 0
Loop on line 811:
Used 23 registers
Used 544 bytes of constant memory in bank 0, 20 bytes of constant memory in bank 2
Loop on line 834:
Used 86 registers
Used 40 bytes stack frames
Used 832 bytes of constant memory in bank 0
Loop on line 1053:
Used 198 registers
Used 800 bytes stack frames
Used 524 bytes of constant memory in bank 0, 24 bytes of constant memory in bank 2
Loop on line 1677:
Used 198 registers
Used 800 bytes stack frames
Used 524 bytes of constant memory in bank 0, 20 bytes of constant memory in bank 2
Loop on line 2300:
Used 214 registers
Used 840 bytes stack frames
Used 536 bytes of constant memory in bank 0, 20 bytes of constant memory in bank 2
Loop on line 3177:
Used 42 registers
Used 40 bytes stack frames
Used 480 bytes of constant memory in bank 0
Loop on line 3238:
Used 24 registers
Used 436 bytes of constant memory in bank 0
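For readers who want to repeat the comparison without the DVM system, the conversion from raw ptxas output to a per-loop summary can be sketched in a few lines of Python. This is only an illustration and not the code used inside the DVMH compiler: the function name parse_ptxas and the regular expressions are made up for this post, and the sketch assumes kernel names follow the loop_<module>_<line>_cuda_kernel scheme seen in the listings. The sample text is copied from the loop 294 entries above.

```python
import re

# Patterns for the ptxas -v lines shown in this post.
ENTRY_RE = re.compile(r"Compiling entry function '(\S+)'")
STACK_RE = re.compile(r"(\d+) bytes stack frame")
REGS_RE = re.compile(r"Used (\d+) registers")
# Assumption: kernel names look like loop_<module>_<line>_cuda_kernel.
LOOP_RE = re.compile(r"loop_\w+?_(\d+)_cuda_kernel")


def parse_ptxas(text):
    """Map the source line of each loop to its register count and stack size."""
    loops = {}
    current = None
    for line in text.splitlines():
        entry = ENTRY_RE.search(line)
        if entry:
            loop = LOOP_RE.search(entry.group(1))
            current = int(loop.group(1)) if loop else None
            if current is not None:
                loops[current] = {"registers": 0, "stack": 0}
            continue
        if current is None:
            continue
        stack = STACK_RE.search(line)
        if stack:
            loops[current]["stack"] = int(stack.group(1))
        regs = REGS_RE.search(line)
        if regs:
            loops[current]["registers"] = int(regs.group(1))
    return loops


# Excerpts for the loop on line 294, copied from the listings above.
PTXAS_65 = """\
ptxas info : Compiling entry function '_Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 223 registers, 408 bytes cmem[0]
"""
PTXAS_70 = """\
ptxas info : Compiling entry function '_Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd' for 'sm_35'
ptxas info : Function properties for _Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd
280 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 49 registers, 408 bytes cmem[0]
"""

old = parse_ptxas(PTXAS_65)
new = parse_ptxas(PTXAS_70)
for src_line in sorted(old):
    print(
        f"Loop on line {src_line}: "
        f"registers {old[src_line]['registers']} -> {new[src_line]['registers']}, "
        f"stack frame {old[src_line]['stack']} -> {new[src_line]['stack']} bytes"
    )
```

Feeding the complete ptxas output of both compilers into parse_ptxas gives a side-by-side table of registers and stack frames for all nine kernels.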
I ran this test with CUDA 6.5 and CUDA 7.0 on a GTX Titan with driver version 346.47:
CUDA 6.5:
NAS Parallel Benchmarks 3.3.1 - DVMH version - BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 162x162x162
Iterations: 200 dt: 0.000100
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class C
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.6239811655176E+04 0.6239811655176E+04 0.7287837774866E-15
2 0.5079323919042E+03 0.5079323919042E+03 0.1119113877493E-15
3 0.1542353009301E+04 0.1542353009301E+04 0.4422599899090E-15
4 0.1330238792929E+04 0.1330238792929E+04 0.1709269618747E-15
5 0.1160408742844E+05 0.1160408742844E+05 0.1097279377060E-14
Comparison of RMS-norms of solution error
1 0.1646200836909E+03 0.1646200836909E+03 0.1035901894587E-14
2 0.1149710790382E+02 0.1149710790382E+02 0.3090093359582E-15
3 0.4120744620746E+02 0.4120744620746E+02 0.6897226604947E-15
4 0.3708765105969E+02 0.3708765105969E+02 0.1915847230703E-15
5 0.3621105305184E+03 0.3621105305184E+03 0.1412802795364E-14
Verification Successful
BT Benchmark Completed.
Class = C
Size = 162x162x162
Iterations = 200
Time in seconds = 27.70
Mop/s total = 103489.44
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3.1
CUDA 7.0:
NAS Parallel Benchmarks 3.3.1 - DVMH version - BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 162x162x162
Iterations: 200 dt: 0.000100
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class C
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.6239811655176E+04 0.6239811655176E+04 0.8745405329840E-15
2 0.5079323919042E+03 0.5079323919042E+03 0.2238227754985E-15
3 0.1542353009301E+04 0.1542353009301E+04 0.2948399932726E-15
4 0.1330238792929E+04 0.1330238792929E+04 0.1709269618747E-15
5 0.1160408742844E+05 0.1160408742844E+05 0.6270167868913E-15
Comparison of RMS-norms of solution error
1 0.1646200836909E+03 0.1646200836909E+03 0.1035901894587E-14
2 0.1149710790382E+02 0.1149710790382E+02 0.3090093359582E-15
3 0.4120744620746E+02 0.4120744620746E+02 0.6897226604947E-15
4 0.3708765105969E+02 0.3708765105969E+02 0.1915847230703E-15
5 0.3621105305184E+03 0.3621105305184E+03 0.1412802795364E-14
Verification Successful
BT Benchmark Completed.
Class = C
Size = 162x162x162
Iterations = 200
Time in seconds = 57.77
Mop/s total = 49615.84
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3.1
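To put a number on the slowdown, the ratio can be computed directly from the two reports. The snippet below is plain arithmetic on the "Time in seconds" and "Mop/s total" lines above; the variable names are chosen only for this illustration.

```python
# Figures copied from the two benchmark reports above.
time_65, time_70 = 27.70, 57.77          # seconds
mops_65, mops_70 = 103489.44, 49615.84   # Mop/s total

slowdown = time_70 / time_65
throughput_ratio = mops_65 / mops_70

print(f"run time ratio (7.0 / 6.5): {slowdown:.2f}")
print(f"Mop/s ratio (6.5 / 7.0):    {throughput_ratio:.2f}")
```

Both ratios come out at about 2.09, which matches the "approximately two times" estimate.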
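The numbers above speak for themselves. It should be emphasized, however, that the heaviest kernels use close to the maximum number of registers per thread (up to 252 with CUDA 6.5), while with CUDA 7.0 the same kernels get fewer registers but stack frames of 800 to 840 bytes. This may be why the CUDA 7.0 compiler does not work correctly for this code.
All the code, both the original and the converted version, is available for download at the following link: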
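As a rough cross-check of the register remark, one can estimate how many warps fit on one multiprocessor when only the register file is considered. The sketch below is a crude upper bound, not an occupancy calculation: it assumes the sm_35 limits of 65536 registers and 64 resident warps per multiprocessor, and it ignores register allocation granularity, shared memory and block size. The function name max_resident_warps is made up for this post; the register counts are taken from the listings above.

```python
REGISTERS_PER_SM = 65536   # 32-bit registers per multiprocessor on sm_35
WARP_SIZE = 32             # threads per warp
MAX_WARPS_PER_SM = 64      # resident warp limit on sm_35


def max_resident_warps(registers_per_thread):
    """Upper bound on resident warps when only registers are the limit."""
    by_registers = REGISTERS_PER_SM // (registers_per_thread * WARP_SIZE)
    return min(by_registers, MAX_WARPS_PER_SM)


# source line -> (registers with CUDA 6.5, registers with CUDA 7.0)
kernels = {294: (223, 49), 1053: (247, 198), 1677: (252, 198), 2300: (248, 214)}

for src_line, (regs_65, regs_70) in sorted(kernels.items()):
    print(
        f"Loop on line {src_line}: "
        f"{max_resident_warps(regs_65)} warps (6.5) vs "
        f"{max_resident_warps(regs_70)} warps (7.0)"
    )
```

By this crude measure the CUDA 7.0 build is no more register-bound than the CUDA 6.5 build, so the new stack frames may be the more interesting difference. This is only a guess from the compiler output, not a measurement.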
https://drive.google.com/file/d/0BwkVJGSs_ksSUURyTVJtTTNmSVk/view?usp=sharing
If you want to compile bt.fdv (Fortran 77 with FDVMH directives), you need to install the DVM system on your PC. If you have questions about this process, I am ready to help.