CUDA 7.5 gives a 30% performance loss vs CUDA 6.5

I have optimized ccminer to run fast on the Maxwell chips here:

Source code:

But when I compile with the latest CUDA version, 7.5, the performance is down 30% in most algorithms.
The problem seems to be that the latest CUDA compiler is spilling registers to the stack.

Take a look at the file:

ccminer/x11/cuda_x11_shavite512.cu

In cuda 6.5 this file compiles to 64 registers and no spills. In cuda 7.5 the stack usage explodes.

Same problem in:

cuda_x11_echo.cu

Compile in release mode (x86) and run with:

ccminer -a x11 --benchmark

(CUDA 7.0 gives low performance and broken results.)

I also get the best performance for most of my projects using 6.5. It seems code often needs to be adjusted from version to version.

Thanks for posting the link to the source code; it is good work and quite readable.

For performance regressions of this magnitude, I would suggest filing a bug with NVIDIA right away. The bug reporting form is linked from the CUDA registered developer website.

I have filed a bug report. If you compile with the latest CUDA version, a GTX 980 is reduced to a GTX 970.

When the GTX 980 came out last year, it was mining Digital Cash (DASH) at 6.6 MHash/s:

x11: 6.6 MHash/s
Quark: 12.2 MHash/s

http://cryptomining-blog.com/3503-crypto-mining-performance-of-the-new-nvidia-geforce-gtx-980/

One year later I have managed to make it 50% faster, and some of the algorithms are 200-300% faster.

http://cryptomining-blog.com/4861-crypto-mining-performance-of-the-new-nvidia-geforce-gtx-980-ti/

A number of people on this forum have reported spilling issues with CUDA 7.5 and have dutifully filed bugs.

It’s a real pain to deal with this issue. Hopefully an updated Toolkit gets released soon.

But until we see a new release of the Toolkit, here are a couple observations and possible workarounds…

One workaround that works some of the time is to make sure variables that are declared but might not be set are initialized with a default value.

I’ve made this change and in some instances the spills were removed.

There are two cases where I’ve seen this problem frequently appear:

  1. Initializing a variable with 1 thread and then broadcasting it to the rest of the warp:
int x; // <--- if not initialized sometimes results in spills

if (warp_lane_is_first())
  x = atomicAdd(y,z); // or some other operation

x = __shfl(x,0);      // broadcast lane 0 to rest of warp

This bug has been around a long time.
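For completeness, here is a minimal, self-contained sketch of the workaround applied to case (1). The counter/out buffers and the kernel name are made up for illustration; the point is simply that x gets a default value before the conditional assignment and the __shfl broadcast:

__global__ void reserve_slots(int *counter, int *out)
{
  int x = 0;                       // <--- default-initialized instead of a bare 'int x;'

  if ((threadIdx.x & 31) == 0)     // first lane of the warp
    x = atomicAdd(counter, 32);    // reserve 32 slots for the whole warp

  x = __shfl(x, 0);                // broadcast lane 0's value to the rest of the warp
  out[blockIdx.x * blockDim.x + threadIdx.x] = x + (int)(threadIdx.x & 31);
}

As noted above, this default value has removed the spills in some instances, at the cost of a (usually harmless) extra MOV.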

  2. Declaring several variables, initializing the first N out of M with loads from global or shared memory and then jumping to process the first N:
while (true)
{
  int a,b,c,d,e,f,g,h; // <--- if not initialized on SM_5x + 7.5 RC results in spills

  a = mem_ptr[offset+WARP_SIZE*0];
  if (rem == 1)
    goto process;

  b = mem_ptr[offset+WARP_SIZE*1];
  if (rem == 2)
    goto process;

  ...

  g = mem_ptr[offset+WARP_SIZE*6];
  if (rem == 7)
    goto process;

  h = mem_ptr[offset+WARP_SIZE*7];

process:

  // do something with a
  if (rem == 1)
    break;

  ...

  // do something with h
  if (rem == 8)
    break;

  offset += 8 * WARP_SIZE;
  rem    -= 8;
}

This Duff-like idiom is pretty common and if properly implemented doesn’t result in compiler warnings. Note that a switch statement generates similar SASS. [ When are we getting indirect branches? :) ]

Yet, on Maxwell + 7.5 RC I'm seeing spillage unless the declared variables are initialized. Although the initialization in (2) squelches the spills, the SASS then shows unnecessary initializations.

  3. If you're using 'pointer + constant offset' addressing and you're compiling for a 64-bit target, then double-check the SASS and verify that you're seeing those offsets.

If you're not, then make sure you're using a signed integer offset; otherwise, you'll see additional register pressure. (A hypothetical source-level sketch follows the SASS below.)

You should be seeing SASS like this:

LDG.E.64 R36, [R28+0x100];
LDG.E.64 R44, [R28+0x300];
LDG.E.64 R42, [R28+0x200];
LDG.E.64 R52, [R28+0x400];
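For illustration, here is a hypothetical source-level sketch of the kind of code that should produce offsets like the above (mem_ptr, out and the kernel name are made up); the signed int offset is the important part, since an unsigned/size_t offset on a 64-bit build may instead turn into separate address arithmetic and extra registers:

#define WARP_SIZE 32

__global__ void gather_kernel(const unsigned long long *mem_ptr, unsigned long long *out)
{
  int offset = blockIdx.x * blockDim.x + threadIdx.x;      // signed 32-bit offset
  // (declaring 'offset' as size_t/unsigned here may cost extra registers on 64-bit builds)

  unsigned long long a = mem_ptr[offset + WARP_SIZE * 1];  // expect LDG.E.64 [Rn+0x100]
  unsigned long long b = mem_ptr[offset + WARP_SIZE * 2];  // expect LDG.E.64 [Rn+0x200]
  unsigned long long c = mem_ptr[offset + WARP_SIZE * 3];  // expect LDG.E.64 [Rn+0x300]
  unsigned long long d = mem_ptr[offset + WARP_SIZE * 4];  // expect LDG.E.64 [Rn+0x400]

  out[offset] = a ^ b ^ c ^ d;
}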

Be safe out there. :)

I am building on Windows with Visual Studio 2013 (x86).

(cuda_x11_shavite512.cu)

With CUDA compilation tools release 6.5, V6.5.16, it spills 8 bytes.

With 7.5 it spills roughly 30 times as much.

cuda_x11_echo.cu is worse: it spills 128 bytes in CUDA 6.5 and over 1000 bytes in CUDA 7.5.

1>------ Build started: Project: ccminer, Configuration: Release Win32 ------
1> Compiling CUDA source file x11\cuda_x11_shavite512.cu…
1>
1> C:\code\ccminer>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.exe" -gencode=arch=compute_50,code="sm_50,compute_50" -gencode=arch=compute_52,code="sm_52,compute_52" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin" -I. -Icompat -I"compat\curl-for-windows\curl\include" -Icompat\jansson -Icompat\getopt -Icompat\pthreads -I"compat\curl-for-windows\openssl\openssl\include" -I"compat\curl-for-windows\zlib" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" --keep --keep-dir Release -maxrregcount=128 --ptxas-options=-v --machine 32 --compile -cudart static --ptxas-options="-O2" -DWIN32 -DNDEBUG -D_CONSOLE -D_CRT_SECURE_NO_WARNINGS -DCURL_STATICLIB -DUSE_WRAPNVML -DSCRYPT_KECCAK512 -DSCRYPT_CHACHA -DSCRYPT_CHOOSE_COMPILETIME -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /MT " -o Release\cuda_x11_shavite512.cu.obj "C:\code\ccminer\x11\cuda_x11_shavite512.cu"
1> ptxas info : Overriding global maxrregcount 128 with entry-specific value 64 computed using thread count
1> ptxas info : Overriding global maxrregcount 128 with entry-specific value 64 computed using thread count
1> ptxas info : 0 bytes gmem, 1152 bytes cmem[3]
1> ptxas info : Compiling entry function '_Z26x11_shavite512_gpu_hash_64jjPy' for 'sm_50'
1> ptxas info : Function properties for _Z26x11_shavite512_gpu_hash_64jjPy
1> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
1> ptxas info : Used 64 registers, 4096 bytes smem, 332 bytes cmem[0]

In the PTX file:

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Tue Aug 26 06:07:09 2014 (1409026029)
// Cuda compilation tools, release 6.5, V6.5.16
//

This looks curious: --ptxas-options="-O2"

What happens if you change this to "-O3", which is the compiler default? What happens if you switch it to "-O1"?

My rule of thumb has always been to use the lowest CUDA SDK version that provides the required feature set. I believe for CUDA Miner and initial CCMiner builds I’ve held onto the CUDA 5.5 release.

I wish there were a bit more tweakability of the optimizations that the compiler and ptxas perform, like shifting the balance between various optimization goals in favor of one or the other. This would ideally be available on a per-kernel basis, similar to how launch bounds are implemented.

Anyway, it’s great to see one’s project (ccminer in this case) being carried on by other smart fellows. There are various branches, but I think sp_'s branch is the most active one. I pretty much lost interest early this year, when profitability for mining most altcoins no longer covered even the energy costs of mining.

I still use CUDA professionally. My employer models radio channels with it. You can’t ever have enough GFlops for MIMO antenna and radio simulations. Just recently I’ve been able to use some lessons learnt during my crypto days in the implementation of some linear algebra subroutines. We’re now distributing complex matrices up to size 32x32 across threads to lower the register counts. Warp shuffle to the rescue! ;)
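For what it’s worth, a hedged sketch of that idea (not the actual production code, which is complex-valued; plain float is used here for brevity): a 32x32 matrix distributed one row per warp lane, with the vector elements owned by the other lanes fetched via warp shuffle instead of shared memory:

// Lane i passes its row A[i][0..31] (ideally register-resident in the caller,
// which the full unroll makes possible after inlining) and its own element x_i;
// it returns y_i = sum over j of A[i][j] * x_j.
__device__ __forceinline__ float warp_matvec(const float row[32], float x_lane)
{
  float y = 0.0f;
  #pragma unroll
  for (int j = 0; j < 32; ++j)
    y += row[j] * __shfl(x_lane, j);   // read x_j directly from lane j
  return y;
}

No shared memory or __syncthreads() is needed; the data never leaves the registers of the warp.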

You should check out the latest changes; one year has passed. Quark is 60% faster than your latest version on the Maxwell chips. All the hash algos have been optimized, even the killer Groestl.

On the 980 Ti, Quark is doing 26.5 MHash/s at standard clocks and 30 MHash/s with overclocking.
The open-source AMD Quark miner (sgminer) is doing 2 MHash/s on an R9 280X.

The global hashrate for Quark-based coins is 100 GHash/s (about 17,000 750 Tis).

Check this commit (groestl speedup):

I do the bitslicing with 10% of the instructions you used (there are more changesets in this file).

etc…

984 commits. :)

I’ll forward your post to ChrisH, the other (sort of anonymous) author of ccminer. We regard our bitsliced Groestl implementation as our best achievement. Both of us had to study hard to understand this algorithm and the arithmetic in the Galois field GF(2^8), having no prior academic experience in cryptography. Our only references were books and papers about AES (which is somewhat similar in concept to Groestl).
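For readers without that background, the field arithmetic itself is tiny. A generic illustration (not the bitsliced ccminer code) of multiplication by x in GF(2^8) with the AES/Groestl reduction polynomial x^8 + x^4 + x^3 + x + 1; the constant multiplications in MixBytes can be built from a few of these doubling steps plus XORs:

__device__ __forceinline__ unsigned int gf256_mul2(unsigned int a)
{
  // 'a' holds one field element in its low byte: shift left and, if the high
  // bit was set, XOR in the reduction polynomial (0x11B, truncated to 0x1B).
  return ((a << 1) ^ ((a & 0x80u) ? 0x1Bu : 0u)) & 0xFFu;
}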

By the way, there is another implementation of the bitsliced Gröstl that we never released in source code. The public version spreads its state across 4 threads; this was necessary to decently support Kepler Compute 3.0 devices with their maximum of 63 registers per thread. We released only this version because we knew it would be really hard to port to AMD devices, as it relies heavily on Kepler-specific instructions.

Yes, the Groestl implementation is impressive. The optimizations I have done are mostly just removing assembly instructions: the algorithm is intact, I just do it with fewer instructions. I also removed some conditional code that was slower, and changed the launch bounds (still using 64 regs). But in Quark, more work has been done in the other algos: uint2 rewrites of all routines that use the 64-bit rotates (Blake, Skein, Keccak, etc.), and register tuning for Maxwell. Quark was already pretty fast in ccminer 1.2, but now it is faster.

The GTX 970 is mining Quark 850% faster than an R9 280X (open-source sgminer), as of sp-mod release 63.

The problem is mainly in the 32-bit compilation. 64-bit seems better, but it is still 7-10% slower than CUDA 6.5.

Echo is spilling 888 bytes in 32-bit mode and 16 bytes in 64-bit mode.

1>------ Build started: Project: ccminer, Configuration: Release Win32 ------
1>  Compiling CUDA source file x11\cuda_x11_echo.cu...
1>  
1>  C:\code\ccminer>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.exe" -gencode=arch=compute_50,code=\"sm_50,compute_50\" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin"  -I. -Icompat -I"compat\curl-for-windows\curl\include" -Icompat\jansson -Icompat\getopt -Icompat\pthreads -I"compat\curl-for-windows\openssl\openssl\include" -I"compat\curl-for-windows\zlib" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include"    --keep --keep-dir Release -maxrregcount=80 --ptxas-options=-v --machine 32 --compile -cudart static --ptxas-options="-O3"     -DWIN32 -DNDEBUG -D_CONSOLE -D_CRT_SECURE_NO_WARNINGS -DCURL_STATICLIB -DUSE_WRAPNVML -DSCRYPT_KECCAK512 -DSCRYPT_CHACHA -DSCRYPT_CHOOSE_COMPILETIME -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Zi  /MT " -o Release\cuda_x11_echo.cu.obj "C:\code\ccminer\x11\cuda_x11_echo.cu" 
1>  ptxas info    : Overriding global maxrregcount 80 with entry-specific value 64 computed using thread count
1>  ptxas info    : 0 bytes gmem, 1216 bytes cmem[3]
1>  ptxas info    : Compiling entry function '_Z29x11_echo512_gpu_hash_64_finaljjPKyPjj' for 'sm_50'
1>  ptxas info    : Function properties for _Z29x11_echo512_gpu_hash_64_finaljjPKyPjj
1>      528 bytes stack frame, 616 bytes spill stores, 628 bytes spill loads
1>  ptxas info    : Used 80 registers, 4096 bytes smem, 340 bytes cmem[0], 4 bytes cmem[2]
1>  ptxas info    : Compiling entry function '_Z23x11_echo512_gpu_hash_64jjPy' for 'sm_50'
1>  ptxas info    : Function properties for _Z23x11_echo512_gpu_hash_64jjPy
1>      616 bytes stack frame, 888 bytes spill stores, 912 bytes spill loads
1>  ptxas info    : Used 64 registers, 4096 bytes smem, 332 bytes cmem[0], 112 bytes cmem[2]
1>  ptxas info    : Overriding global maxrregcount 80 with entry-specific value 64 computed using thread count
1>  ptxas info    : 0 bytes gmem, 1216 bytes cmem[3]
1>  ptxas info    : Compiling entry function '_Z29x11_echo512_gpu_hash_64_finaljjPKyPjj' for 'sm_52'
1>  ptxas info    : Function properties for _Z29x11_echo512_gpu_hash_64_finaljjPKyPjj
1>      528 bytes stack frame, 616 bytes spill stores, 628 bytes spill loads
1>  ptxas info    : Used 80 registers, 4096 bytes smem, 340 bytes cmem[0], 4 bytes cmem[2]
1>  ptxas info    : Compiling entry function '_Z23x11_echo512_gpu_hash_64jjPy' for 'sm_52'
1>  ptxas info    : Function properties for _Z23x11_echo512_gpu_hash_64jjPy
1>      616 bytes stack frame, 888 bytes spill stores, 912 bytes spill loads
1>  ptxas info    : Used 64 registers, 4096 bytes smem, 332 bytes cmem[0], 112 bytes cmem[2]
1>  cuda_x11_echo.cu
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========



64 bit:

1>  ptxas info    : Overriding global maxrregcount 80 with entry-specific value 64 computed using thread count
1>  ptxas info    : 0 bytes gmem, 1216 bytes cmem[3]
1>  ptxas info    : Compiling entry function '_Z29x11_echo512_gpu_hash_64_finaljjPKyPjj' for 'sm_50'
1>  ptxas info    : Function properties for _Z29x11_echo512_gpu_hash_64_finaljjPKyPjj
1>      32 bytes stack frame, 16 bytes spill stores, 20 bytes spill loads
1>  ptxas info    : Used 80 registers, 4096 bytes smem, 348 bytes cmem[0], 4 bytes cmem[2]
1>  ptxas info    : Compiling entry function '_Z23x11_echo512_gpu_hash_64jjPy' for 'sm_50'
1>  ptxas info    : Function properties for _Z23x11_echo512_gpu_hash_64jjPy
1>      128 bytes stack frame, 156 bytes spill stores, 120 bytes spill loads
1>  ptxas info    : Used 64 registers, 4096 bytes smem, 336 bytes cmem[0], 112 bytes cmem[2]
1>  cuda_x11_echo.cu
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

That is very strange. Other than for specifying -m32 or -m64, are the nvcc command lines exactly the same?

The usual, expected effect is that compiling for a 64-bit platform requires more variable storage and thus more registers than the same source code compiled for a 32-bit platform, since pointers and 'size_t' are then 64-bit data types (as is 'long' on non-Windows platforms).

The 32-bit vs. 64-bit result is probably unrelated to the root cause.

I’ve seen unexplained spills under a variety of configurations.

I filed a bug on this topic in March and it is reportedly fixed in development and will appear in 7.5 Final.

Hopefully others have posted their reproducers since 7.5 is probably imminent.

@sp_ — if you haven’t already, you should file a bug and point them at your GitHub repo since it’s public and looks like it’s full of good repro cases.

I’m really hoping these spill bugs get squashed in 7.5 Final.

Have you tried increasing the register count per thread using --maxrregcount=nn ?

I don’t use --maxrregcount but do use the more capable __launch_bounds__() qualifier.
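A minimal sketch of that qualifier (kernel name and numbers are made up, not taken from ccminer): the first argument is the largest block size the kernel will ever be launched with, the optional second one is the minimum number of blocks that must stay resident per multiprocessor. ptxas derives a per-kernel register limit from them, which is where the "Overriding global maxrregcount ... with entry-specific value 64 computed using thread count" lines in the build logs above come from:

__global__ void
__launch_bounds__(256, 2)   // at most 256 threads per block, at least 2 resident blocks per SM
example_hash_kernel(unsigned int threads, unsigned int start_nonce, unsigned long long *g_hash)
{
  // kernel body ...
}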

I have filed a bug report now, and I am in dialog with one of the developers.
It’s easy to test because there is a benchmark mode in ccminer…
just run it with ccminer -a x11 --benchmark

The kernel is tuned to run fast on CUDA 6.5.

I use launch bounds to set the registers:
Overriding global maxrregcount 80 with entry-specific value 64 computed using thread count

64 registers seems to be the fastest (CUDA 6.5).

CUDA 6.5, x86:

ccminer -a x11 --benchmark

(750 Ti) 3 MHash/s
(980 Ti) 13 MHash/s (Gigabyte Windforce OC)

CUDA 7.5, x86:

ccminer -a x11 --benchmark

(750 Ti) 1.9 MHash/s
(980 Ti) Not yet tested.

CUDA 7.5, x64:

(750 Ti) 2.6 MHash/s
(980 Ti) Not yet tested.

How is this going so far?
Haven’t seen any update :)

The support ticket is open, so I guess they need to find out if they want to spend time on this.

My fork is now working with CUDA 7.0 (x11 algo; head at GitHub), and the same performance problem exists there.

30% slower is a lot, and it seems to be a bug in the x86 compiler.