CUDA memory usage in debug (with GDB), debug (without GDB), and release differs; extra 2GB usage in release

I have a few questions I couldn't find answers to:

  1. Why is there a memory spike of 130MB when calling cudaDeviceSynchronize right after cudaSetDevice in debug with GDB? (minimal sketch below)
  2. Why does my process use 169MB more memory when run in debug with GDB than when run in debug without GDB?
  3. I have a large piece of code that, when compiled in debug mode, easily fits on the card with less than 1GB of memory usage (it compiles in under 20 seconds), but when I compile it in release it is much slower:
    • toolkit 6.5 - takes more than 13 minutes to compile (x40 slower)
    • toolkit 7.5 - takes more than 3.5 minutes to compile (x10 slower)

    The main problem: in release, it uses 2GB more memory. What is the reason for the extra 2GB of memory usage?

  • I have monitored the memory with NVIDIA-SMI 352.68
  • Linux 3.19.0-33-generic #38~14.04.1-Ubuntu SMP Fri Nov 6 18:17:28 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  • 4x GTX 780 Ti
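
For clarity, question 1 refers to this minimal sequence (a stripped-down sketch, not our real code):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaSetDevice(0);                    // select the device

    // the first runtime call that needs a context triggers context
    // creation; this is where the ~130MB spike shows up in nvidia-smi
    cudaError_t err = cudaDeviceSynchronize();
    printf("sync: %s\n", cudaGetErrorString(err));

    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem); // device memory after context creation
    printf("free: %zu MB, total: %zu MB\n",
           freeMem / (1024 * 1024), totalMem / (1024 * 1024));
    return 0;
}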

Debug builds are compiled without any optimizations. Release builds are compiled with full optimization. The CUDA compiler is quite sophisticated and “knows” and applies numerous code transformations when optimizing, and for some optimizations compile time grows significantly with code size.

The code as processed by the compiler may be many times larger than what is suggested by the source code, for example due to function inlining and loop unrolling. You can prevent functions from being inlined with the noinline attribute, and loops from unrolling with a pragma, but of course this may negatively impact performance. You may want to take a look at the size of the intermediate PTX code. If it comprises on the order of 10K lines, that is a pretty good indication that the lengthy compile times are due to code size.
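For illustration, here is a generic sketch (not taken from your code) of both mechanisms:

// __noinline__ keeps this device function from being inlined into
// every caller, at a possible performance cost
__noinline__ __device__ float accumulate(const float *data, int n)
{
    float sum = 0.0f;
    // an unroll factor of 1 tells the compiler not to unroll the loop
    #pragma unroll 1
    for (int i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}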

I do not see how your question about memory use is answerable without having access to the source code. It is a plausible hypothesis that memory usage is higher when running an app under the control of the debugger versus standalone without debugger because the debugger may need memory to store its internal state.

I can understand the compilation speed and the spikes… I had the same thoughts.
Though I have never seen such big differences between release and debug in CPU code.

Our code base is large, much more than 10K lines of C/C++ code.
BTW, where and how can I look at the intermediate PTX code?

It is not possible to share the code. Other solutions/ideas will be most appreciated.

I still cannot understand why it uses 2GB more in release.

If you used default compilation, or compiled with a simple -arch switch, PTX should be embedded in the executable produced by nvcc. You can use cuobjdump --dump-ptx to extract it.

By 10K lines I referred to the size of individual kernels (global functions), not the total size of your code base. If each of your kernels comprises more than 10K lines of code, long compilation times are not surprising at all. The CUDA compiler has been improved over time to shorten compilation times, as you can tell from your own data.

I forgot to mention one other frequent cause of long compilation times: building for many different GPU architectures. Since the different GPU architectures are not binary compatible, the compiler has to generate code separately for each architecture, and compilation time increases roughly linearly with the number of different compilation targets. So to cut down on build time, you could restrict your build to just the one target architecture you currently need.
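For example, if you currently only need to run on your GTX 780 Ti cards (sm_35), a build line along these lines (file names are placeholders) restricts code generation to that single target:

nvcc -gencode arch=compute_35,code=sm_35 -o myapp myapp.cu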

You don’t state how fast the system is that you use to compile your code, but if you find compilation times unacceptably long you could always file a bug report with NVIDIA (the form is linked from the registered CUDA developer site). Obviously, to investigate the issue the CUDA compiler team will need access to your code in some form, and they are accustomed to working with customers whose code base must remain confidential. There are various potential approaches for how to proceed in such cases, which you can discuss with NVIDIA once a bug has been filed.

Generally, for CUDA development I would recommend using a workstation with high single-thread performance (which I define as a CPU with >= 3.5 GHz) and the use of solid state storage (SSD).

cuobjdump --dump-ptx MyProj (in Release) gave the following:

Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

In debug it gives the same output, just with an identifier for each of the object files.

Only CUDA compute 3.5 is selected.

Our team is already in direct contact with NVIDIA about other bugs, but it goes slowly, so I am starting here in the forum before opening a bug.

  • Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
  • 64GB ddr4
  • no SSD

Your build machine seems fast enough, so my best working hypothesis is that the kernel code is pretty voluminous after templates have been expanded, functions inlined, and loops unrolled.

It is a best practice to have PTX for the latest GPU architecture embedded in a CUDA executable. If that is not the case for your build, please consult the documentation on the -arch and -gencode switches of the compiler. If I recall correctly, nvcc also used to have a --dump-ptx switch for dumping PTX during compilation. Not sure whether that still exists. I also recall it could cause trouble in complex builds, so I would advocate going the cuobjdump path.
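
For example (again with placeholder file names), the following generates SASS for sm_35 and additionally embeds the PTX, from which the driver can JIT-compile for future architectures:

nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -o myapp myapp.cu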

If you are already in direct contact with NVIDIA, the level of expertise accessible through that path is generally higher than what you will be able to access through this public forum. There may be an occasional exception, for example if you want to know low-level details of the Maxwell instruction set: you may get a more comprehensive answer from Scott Gray than from NVIDIA, as NVIDIA doesn’t like to discuss hardware implementation details.

Here is some sample output from cuobjdump --dump-ptx (look for the section entitled “Fatbin ptx code”):

Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = <unknown>
host = windows
compile_size = 64bit

Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_50
code version = [4,3]
producer = cuda
host = windows
compile_size = 64bit
compressed
ptxasOptions =

.version 4.3
.target sm_50
.address_size 64

.weak .func (.param .b32 func_retval0) cudaMalloc(
.param .b64 cudaMalloc_param_0,
.param .b64 cudaMalloc_param_1
)
{
.reg .b32 %r<2>;

mov.u32 %r1, 30;
st.param.b32    [func_retval0+0], %r1;
ret;
}
[...]

The following was tested in debug and release, with and without optimization, and gave the same results:

  • All the loops that might be big are set with
    #pragma unroll 1
    

    because unrolling caused a bug when there were a lot of breakpoints, so loops should not affect the size.

  • Templates should not affect it either, because they cause no problems in debug.
  • I thought that CUDA tries to inline the entire code into one big function, except when using inheritance and virtual methods? Is that correct?

I have added the -keep flag and got the following outputs:

  • Release: the largest PTX file was over 16MB, with approx. 700K lines of CUDA asm code
  • Debug: the same file was 7MB, with approx. 200K lines of CUDA asm code + 300K lines of something like ``` .b8 116 .b8 69 .b8 82 .b8 75 .b8 83 .b8 50 .b8 95 ```

When I tried to profile the code with the “Kernel profile instruction execution” analysis to see how many times each line was called, I saw that the java process (via top) started to grow beyond 150GB… eventually it got stuck.
All other profiling tests work and give results.

I have managed to find another weird behavior on toolkit 7.5 in release:

  • Release
    • a single kernel launch (with 1000 threads) works (on both 6.5 and 7.5)
    • multiple concurrent launches of the same kernel function (each with its own stream and memory) do not work in release on toolkit 7.5; it just freezes while processing (launch pattern sketched below). It is not a memory issue: I tested it on a GPU with twice the memory and it showed the same behavior.
  • Debug
    • all test cases work fine in both 6.5 and 7.5
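
Roughly, the failing launch pattern looks like this (a stripped-down sketch; the real kernel and buffer sizes are of course much bigger):

#include <cuda_runtime.h>

__global__ void myKernel(float *buf)        // placeholder for our real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] *= 2.0f;
}

int main()
{
    const int kStreams = 4;
    const size_t bufBytes = 1000 * sizeof(float);
    cudaStream_t streams[kStreams];
    float *buffers[kStreams];

    for (int i = 0; i < kStreams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], bufBytes);  // each launch gets its own memory
    }

    // same kernel function, one launch per stream
    for (int i = 0; i < kStreams; i++) {
        myKernel<<<1, 1000, 0, streams[i]>>>(buffers[i]);
    }
    cudaDeviceSynchronize();                // on 7.5 release, this is where it freezes
    return 0;
}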

For now we have moved back to toolkit 6.5; it seems more stable when running a large piece of code. Unfortunately, the code compiles x3 slower in release and runs 12% slower, but at least it runs in release mode.

Does anyone have an idea what can cause this or how to fix it?

I see you found the -keep flag of nvcc, I had erroneously referred to --dump-ptx on account of my hazy memory. Sorry for the confusion.

At several hundred thousand lines of PTX code, your kernels certainly are much bigger than most CUDA kernels, and easily fall into the “huge” category. They may therefore exercise rarely-used mechanisms in the compiler which in turn may well have exposed bugs. I would encourage the filing of bug reports with NVIDIA and working with their engineers to resolve the issues being encountered. After analysis, NVIDIA compiler engineers may be able to suggest a workaround.

For a quick experiment, you could try lowering the optimization level of the compiler backend component PTXAS, which compiles PTX into machine code (SASS). The default optimization level is -Xptxas -O3 (full optimization). I would suggest first trying -Xptxas -O2, then -Xptxas -O1.
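
For example, added to a build line like the ones above (file names again placeholders):

nvcc -Xptxas -O2 -gencode arch=compute_35,code=sm_35 -o myapp myapp.cu

Note that -Xptxas only passes the option through to the ptxas backend; the rest of the compilation pipeline keeps its defaults.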

That’s OK; the most important thing is that I got a direction…

Because of the large code base we have encountered many internal bugs; some of them got fixed in 7.5, some are fixed in the development version (the future 8.0, I think), and some are still open. I wish I could give it a try :)

We found some workarounds; some of them were posted by me here a few weeks/months ago.

It is very problematic for us to submit our code with a filed bug…

I'll give it a try.

Does it differ from the normal optimization flag?

I have made the test in 7.5 release.
The kernel memory signature (kernel + data) is a bit larger, by 25MB, from
2520MB to 2545MB in release; in debug, the signature with the same data is still less than 2000MB.

I think I found what causes the problems.

When compiling with -Xptxas -v, the stack frame for most functions is x4 larger in release, and the spill stores/loads are x2 larger in release.
So the heaviest function grew from a 400-byte stack frame to 1800 bytes.
I think this can explain the high kernel footprint, which goes from 500MB in debug to almost 2000MB in release.
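
As far as I understand (this is my assumption), the stack frame reported by ptxas is reserved per thread, and the device-wide reservation scales with the number of threads that can be resident at once, which is how a x4 stack frame could blow up the footprint. This is how I check and adjust the per-thread stack reservation:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackBytes = 0;
    // query the current per-thread stack size limit
    cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);
    printf("stack size per thread: %zu bytes\n", stackBytes);

    // the limit can also be set explicitly, e.g. doubled
    cudaDeviceSetLimit(cudaLimitStackSize, 2 * stackBytes);
    return 0;
}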

Any idea what could cause this?