'ptxas' died with status 0xC00000FD (STACK_OVERFLOW)

What a pain in the ‘ptxas’…

Can’t catch a freaking break here. I’ve got a kernel that was working fine. I figured out a way to rework some chunks of the code to increase ILP and speed it up. The new code does the exact same work as the old code; it just layers the variables differently to minimize execution dependencies. I can replace a few chunks of the old math with the new chunk and everything is still fine. But if I replace all the old chunks with the new chunk, I suddenly get slammed with the following error when I try to build:

error : 'ptxas' died with status 0xC00000FD (STACK_OVERFLOW)	| CUDACOMPILE

That sounds… uh… not good.

I preemptively created as dirt simple a test kernel as I could manage, in the hopes I might repro and isolate the actual issue. (Hoping I could dance around it.) I’ve got it boiled down to basically a small handful of working variables (all unsigned ints, to be specific) that just do the same 16 lines of math over and over. Once I paste enough of those chunks of math into the kernel the compiler blows up, all the time, every time. Specifically, it seems to happen if I have anything that resembles a for() or while() loop after the compute chunks. It doesn’t even matter if I’m touching the working vars inside those loops or not: just the presence of a for() after all the math and the compiler fails. My actual kernel code also seems to blow up from any kind of if() statement that checks any of the working variable values after the math. (The test kernel for some reason isn’t bothered by the if(), but there can’t be that many stack overflow bugs hiding in the compiler to run into, so I assume it’s the same issue.)

If I remove the loop: suddenly worky, worky. If I remove a few lines of the math: also worky, worky.

After spending hours playing around with the test kernel I can’t seem to find a way around this so…
Has anybody run across a stack overflow in the ptxas part of the compiler before?

Other Info:
I’m on Win7 SP1 (x64) with VS2013. Compiling for CC2.0, CC3.0, and CC3.5 all blow up, but CC5.0 somehow makes it through. Also, I was on CUDA 8.0.44 when I hit this, but I just upgraded to CUDA 8.0.61.2 and still no joy. (My main GFX card is Fermi or I would have jumped to the 9.somethin’)

Does this happen in the course of compiling PTX produced from CUDA source code by NVVM, or when processing PTX code from some other source?

I am asking because the predominant use of PTXAS is in the former scenario, and that is probably how it gets tested for the most part, meaning it may be less robust for the wider variety of PTX idioms that might occur in hand-generated code (or code generated by a third-party domain-specific language).

Since GPU architectures are generally not binary compatible, PTXAS contains multiple machine-specific code generators, so it can definitely happen (and has happened) that things work fine for some architectures but fail for others.

This stack overflow error seems like an error internal in PTXAS and therefore a compiler bug. I have seen PTXAS bugs that caused segfaults. I don’t offhand remember one that resulted in a stack overflow, so this case seems a bit unusual.

Since support for compute capability 2.x has already been discontinued in CUDA, there is presumably no way to deliver a bug fix for that architecture, but you could file a bug with NVIDIA to get this fixed for compute capability 3.x, as that is still supported (and probably will be for a couple more years).

From the vague description it seems impossible to even guess what might trigger this stack overflow. I am speculating maybe large amounts of generated code or lots of variables. You could try simple source code changes such as declaring functions __noinline__ to prevent function inlining, or stopping loop unrolling with #pragma unroll 1. You could also try lowering the PTXAS optimization level from the default -O3 to something lower, e.g. -Xptxas -O2.
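One way to try these suggestions systematically is to sweep the target architecture and the PTXAS optimization level and note which combinations build. The following is a minimal sketch, not a tested build script: the helper names `build_nvcc_command` and `try_build`, the output file name, and the default architecture are assumptions made for illustration. Only the `-arch`, `-c`, `-o` and `-Xptxas -O<n>` flags are passed to nvcc.

```python
import subprocess


def build_nvcc_command(source, arch="sm_30", ptxas_opt=None):
    """Build an nvcc command line (hypothetical helper, for illustration).

    ptxas_opt=None leaves the PTXAS optimization level at its default.
    """
    cmd = ["nvcc", "-arch=" + arch, "-c", "-o", "overflow.obj"]
    if ptxas_opt is not None:
        # -Xptxas forwards the next option to ptxas, e.g. -O2
        cmd += ["-Xptxas", "-O%d" % ptxas_opt]
    cmd.append(source)
    return cmd


def try_build(cmd):
    """Run the command and report whether it exited with status 0."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return result.returncode == 0
```

A sweep could then loop over, say, `("sm_30", "sm_35", "sm_50")` and `(3, 2, 1, 0)`, call `try_build(build_nvcc_command("overflow.cu", arch, level))` for each pair, and print the results as a small table.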

Have you tried one of these?

If your shell is sh, bash or ksh, use:

ulimit -s unlimited

If your shell is csh, tcsh or zsh use:

limit stacksize unlimited

On Windows one might have to patch the ptxas.exe binary to allow for a bigger stack size.

This is assuming we’re not running into an infinite recursion, but rather an excessive one ;)

On Windows it’s this command:

editbin /STACK:reserve[,commit] ptxas.exe

I am not sure what the default value for reserved stack size is. Just play with the reserve value
to see if you find a setting that works.

Yeah, this is not hand-edited PTX code. I’m just coding in standard CUDA C++ (or is CUDA only C, I forget) using Visual Studio 2013 with the Nsight for Visual Studio package installed. For my repro kernel I literally just clicked the baked-in Nvidia CUDA 8 project template and then pasted in the first CUDA helloworld code I could find. Single device mem pointer, single kernel launch of 4 blocks of 512 threads. No function calls or loops or branching other than the one test for() loop that I’ve been adding after the math. I haven’t fiddled with any of the project settings either, and like I said I tried to keep it as dirt simple as I could. The one aspect that is not so simple is the repetitive chunk of math. I only see this issue when I add in enough of those lines, but it’s the exact same relatively simple lines over and over again.

Let me see if I can trim out some of the math rounds and I bet I can just post it up. I’ll add some comments to the code file too that way you all can double check I’m not doing anything wonky without realizing it.

I’m not familiar with that editbin command - I’ll look it up. Would I be correct to assume that in my case I would need to add that command into the project build settings somewhere? Also, I’m slightly confused: does that modify the stack size that ptxas utilizes when it’s doing its work, or does it simply tell ptxas what you want the kernel’s stack size to be in the code it’s creating?

(My initial take on the error was that the ptxas utility itself was experiencing a stack overflow and thus dying. I guess it could be saying that it’s detected the kernel it’s trying to build will overflow its stack and is just giving up. I’m not sure how I would tell the difference from that error message though.)

It’s an error that is actually a windows exception:

https://forums.asp.net/t/1811350.aspx?Stack+overflow+exception+0xC00000FD+Exception

It’s due to the ptxas program overrunning its stack, as detected by the (Windows) runtime harness.

It doesn’t have anything to do with the stack that your CUDA program is using or not using.

A number of compiler features are implemented via recursion. My guess is that this is a case of recursion run amok, but that guess is really irrelevant. The recommended steps are:

  1. Produce a simple but complete reproducer code that you would be willing to share with others.

  2. Attempt to reproduce the problem on the latest CUDA version (9.1)

  3. If it does not reproduce on CUDA 9.1, then fix it by moving forward to CUDA 9.1. If it does reproduce on CUDA 9.1, file a bug with your simple reproducer code at http://developer.nvidia.com
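For step 1, a generated reproducer can be easier to shrink than a hand-pasted one. The sketch below emits a kernel with a chosen number of repeated math rounds followed by a trailing for() loop, and binary-searches for the smallest round count that fails. The math in `CHUNK` is a placeholder and is not the code from this thread, the kernel name and helper names are hypothetical, and the search assumes that once some round count fails, every larger count fails too.

```python
# Placeholder math round; NOT the original poster's 16 lines.
CHUNK = """\
    // round {i}
    a = (a << 5 | a >> 27) + (b ^ c) + {i}u;
    b = (b << 7 | b >> 25) ^ (c + a);
    c = (c << 11 | c >> 21) + (a ^ b);
"""


def make_kernel_source(rounds):
    """Return CUDA source text with `rounds` copies of CHUNK and a trailing loop."""
    body = "".join(CHUNK.format(i=i) for i in range(rounds))
    return (
        "__global__ void test_kernel(unsigned int *out)\n"
        "{\n"
        "    unsigned int a = threadIdx.x, b = blockIdx.x, c = 1u;\n"
        + body +
        "    for (unsigned int k = 0; k < 4; ++k) {  // loop after the math\n"
        "        a += k;\n"
        "    }\n"
        "    out[blockIdx.x * blockDim.x + threadIdx.x] = a ^ b ^ c;\n"
        "}\n"
    )


def smallest_failing(lo, hi, fails):
    """Smallest n in [lo, hi] with fails(n) true.

    Assumes fails(hi) is true and that failure is monotonic in n.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Here `fails` would be a caller-supplied function that writes `make_kernel_source(n)` to a .cu file, runs the compiler on it, and returns True when the build dies.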

That’s exactly right. PTXAS was terminated abnormally for access outside the allocated stack and the appropriate Windows error code 0xC00000FD was reported back.

My memory is very hazy, but I seem to recall that stack overflow under Windows doesn’t necessarily mean a lot of stack was used. It can also mean that the stack increased very quickly and “jumped the guard page”. [Later:] Microsoft says:

https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debugging-a-stack-overflow
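For readers wondering how to read the number itself: 0xC00000FD is an NTSTATUS value, and its bit fields can be pulled apart as in the sketch below. The function name `decode_ntstatus` is made up for illustration. The build log later in this thread reports "exited with code 253", which matches the low byte 0xFD of the status; whether nvcc derives its exit code that way is an assumption here, not something the thread confirms.

```python
def decode_ntstatus(status):
    """Split a 32-bit NTSTATUS value into its documented bit fields."""
    return {
        "severity": (status >> 30) & 0x3,    # 0 success, 1 info, 2 warning, 3 error
        "customer": (status >> 29) & 0x1,    # 0 means Microsoft-defined
        "facility": (status >> 16) & 0xFFF,
        "code": status & 0xFFFF,
    }


fields = decode_ntstatus(0xC00000FD)
print(fields)             # severity 3 (error), facility 0, code 253 (0xFD)
print(0xC00000FD & 0xFF)  # low byte of the status: 253
```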

Let’s see if this blows up a forum post, lol…

–Place where I tried to post full code–

**EDIT: yep that blew it up haha! I’ll try again.
overflow.cu (344 KB)

So that overflow.cu file is the test kernel I’ve been playing with. I saved it there in the broken state. If anybody wants to take a look at it, you probably want to scroll down to the bottom of the kernel, after all the math test blocks. I added some comments on what I’ve tried that seems to make it suddenly start compiling again: removing some of the math, or removing the for() or while() loop.

Just re-checked and I get a ptxas StackOverflow with the above file when compiling it for CC20, CC30, and CC35. CC50 seems to not even blink at it though and compiles fine.

I’m not sure I feel like ripping out my CUDA 8 installation at the moment as I need that and the prospect of successfully downgrading back down from 9.1 seems … precarious.

The posted code compiles fine for me with CUDA 8 (with MSVS 2010) on 64-bit Windows 7. See log below.

C:\Users\Norbert\My Programs>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Jan__9_17:32:33_CST_2017
Cuda compilation tools, release 8.0, V8.0.60

C:\Users\Norbert\My Programs>nvcc -o overflow.exe overflow.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
overflow.cu
support for Microsoft Visual Studio 2010 has been deprecated!
   Creating library overflow.lib and object overflow.exp

C:\Users\Norbert\My Programs>nvcc -arch=sm_30 -o overflow.exe overflow.cu
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
overflow.cu
support for Microsoft Visual Studio 2010 has been deprecated!
   Creating library overflow.lib and object overflow.exp

C:\Users\Norbert\My Programs>nvcc -arch=sm_35 -o overflow.exe overflow.cu
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
overflow.cu
support for Microsoft Visual Studio 2010 has been deprecated!
   Creating library overflow.lib and object overflow.exp

Huzzzaaaah! Well if it works for you, that’s a good sign… I think. I actually had VS2010 still kicking around on my box, so I thought I’d do apples to apples and give it a try, but I get the same result as with VS2013…

(Note kernel.cu below is same code inside file as overflow.cu I posted)

1>------ Build started: Project: PtxasStackOverFlow, Configuration: Debug Win32 ------
1>  Compiling CUDA source file kernel.cu...
1>  
1>  D:\Code\PtxasStackOverFlow\PtxasStackOverFlow>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -gencode=arch=compute_30,code=\"sm_30,compute_30\" --use-local-env --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include"  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od  /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "D:\Code\PtxasStackOverFlow\PtxasStackOverFlow\kernel.cu" 
1>  Internal error
1>CUDACOMPILE : nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
1>  kernel.cu
1>CUDACOMPILE : nvcc error : 'ptxas' died with status 0xC00000FD (STACK_OVERFLOW)
1>C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\BuildCustomizations\CUDA 8.0.targets(689,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -gencode=arch=compute_30,code=\"sm_30,compute_30\" --use-local-env --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include"  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od  /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "D:\Code\PtxasStackOverFlow\PtxasStackOverFlow\kernel.cu"" exited with code 253.
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

Sooo one of these things is not like the other. What strikes me though is how much simpler your build command appears to be whereas mine has gobbs and gobbs of parameters and paths tacked onto it - I didn’t put any of it there mind you. That’s just how the File > New > Nvidia > CUDA 8.0 Runtime Project came out of the box.

Did you by chance manually trim off the “fluff” from the project settings there or fire off the compile manually somehow to get a minimalist compile command?

I don’t use IDEs, I use compilers straight from the command line (or a make file). I notice your build says “Debug Win32”. My builds were release builds, let me try a debug build to see what happens.

Yup, if I switch to a debug build I can reproduce the issue (see below). So the workaround is simple: use release builds.

C:\Users\Norbert\My Programs>nvcc -g -G -o overflow.exe overflow.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
overflow.cu
Internal error
nvcc error   : 'ptxas' died with status 0xC00000FD (STACK_OVERFLOW)

I looked at the kernel and it is very big as kernels go, at almost 15,000 lines of PTX for the release build and a whopping 315,000 lines of PTX for the debug build. So a reasonable conjecture is that the size of the code plays directly into this stack-overflow condition.

Aaaaaaaaaaaaaw sunofbiscuit… Release build just compiled for me too. Good catch man!

Apparently I forgot to fiddle with that switch.

So release mode could get real hairy if I ever need to see a variable’s value, but it’s definitely progress!

Thanks for taking the time to fire that off for me!

Also just for my own selfish gratifications… I’m not crazy or doing anything incorrectly there right? That is actually something out of my control mis-behaving under the hood?

What you are encountering is definitely a bug. No application should ever terminate abnormally with a segfault or stack overflow. At minimum it should throw an “out of resources” error and then shut down in an orderly fashion, and ideally it shouldn’t get that far in the first place.

Editbin changes a setting in the executable file header to allow for a different maximum stack size. The change is permanent once made (until you reinstall or upgrade your CUDA toolkit). It’s probably a good idea to keep a backup of the unmodified ptxas.exe file around.

Editbin can change the maximum allowed stack size that ptxas may use before encountering the error you’ve been seeing.
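To check what the header currently says before and after running editbin, the stack reserve and commit sizes can be read straight out of the PE optional header. This is a minimal sketch based on the published PE layout (both fields start at offset 72 of the optional header, 4 bytes each for PE32 and 8 bytes each for PE32+). The function name `read_stack_sizes` is made up for illustration and it has not been run against an actual ptxas.exe. For reference, the MSVC linker's default reserve is 1 MB unless the program was linked with a different /STACK value; which of those applies to ptxas.exe is not known here.

```python
import struct


def read_stack_sizes(data):
    """Return (reserve, commit) stack sizes from the bytes of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header stores the file offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # Skip the 4-byte signature and the 20-byte COFF file header.
    opt = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:   # PE32: 32-bit size fields
        return struct.unpack_from("<II", data, opt + 72)
    if magic == 0x20B:   # PE32+: 64-bit size fields
        return struct.unpack_from("<QQ", data, opt + 72)
    raise ValueError("unknown optional header magic")
```

Usage would be along the lines of `read_stack_sizes(open(path_to_ptxas, "rb").read())`, where `path_to_ptxas` is wherever the toolkit installed the binary.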