"nvlink warning : Stack size for entry function...cannot be statically determined"
It was caused by a destructor declared virtual. When I removed the “virtual”, the warning disappeared,
but now when I try to run or debug the kernel I receive the following error:
warning: Cuda API error detected: cudaLaunch returned (0x2)
I tried to debug the launch and added the --keep flag,
but I only got as far as cuda_runtime.h:
template
How can I go deeper to find the reason for the error?
If I restore the virtual keyword that I removed, both release and debug builds work.
The profiler does not work with or without the warning.
Updates:
After some optimizations the profiler works with the 6.5 toolkit. (The same code does not profile in 7.5; it just freezes at kernel launch, though in both 6.5 and 7.5 the code runs well in release and debug.)
I found that the kernel uses many more registers per thread than ptxas reports:
the ptxas info indicates that X registers are required for the kernel,
but the properties in the profiler output show approximately 3x-10x more registers, depending on the kernel.
“cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.”
I don’t think you’re going to be able to further understand that with a debugger. Since the kernel is not launching, you won’t be able to learn anything by trying to debug device code anyway. You should take a look at the memory allocations required by your kernel launch. You may be able to begin this process by compiling your code with:
-Xptxas -v
and study the output associated with the kernel that is failing to launch.
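Some of the same numbers can also be read at run time, which may help compare the two builds. What follows is only a minimal sketch, not your code: myKernel is a hypothetical stand-in for the kernel that fails to launch, and the launch configuration is made up. It prints the per-thread register and local memory usage reported by cudaFuncGetAttributes and the current per-thread stack size limit, then checks the launch result explicitly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel that fails to launch.
__global__ void myKernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Print a readable message and stop if a runtime call failed.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        printf("%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    // Per-kernel resource usage as seen by the runtime.
    cudaFuncAttributes attr;
    if (!check(cudaFuncGetAttributes(&attr, myKernel), "cudaFuncGetAttributes"))
        return 1;
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("static shared memory:    %zu bytes\n", attr.sharedSizeBytes);

    // Current per-thread stack size limit for this device.
    size_t stackSize = 0;
    if (!check(cudaDeviceGetLimit(&stackSize, cudaLimitStackSize), "cudaDeviceGetLimit"))
        return 1;
    printf("stack size limit:        %zu bytes per thread\n", stackSize);

    int *d_out = NULL;
    if (!check(cudaMalloc(&d_out, 32 * sizeof(int)), "cudaMalloc"))
        return 1;

    // Made-up launch configuration: one block of 32 threads.
    myKernel<<<1, 32>>>(d_out);

    // Launch failures are reported here, before any device code runs.
    bool ok = check(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel executes are reported here.
    ok = ok && check(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Running this once with the virtual destructor and once without, and comparing the local memory and stack size lines, should show whether the two builds differ in what the launch has to reserve.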
I tried that, and made a lot of optimizations based on the output of -v:
I reduced stack usage and currently have 39 registers in debug mode and approximately 80 registers in release.
What bothers me is that with the virtual in place the code runs well, but with the warning.
When I run the kernel with the virtual, I have over 2 GB of the 3 GB of memory free (according to nvidia-smi).
What could be the connection between the warning and the error?