cudaMalloc hangs for several minutes on Titans on CentOS 5 x64

We have an application that works fine across many generations of CUDA devices and software versions. We currently have a test machine with 4 GTX Titans, another test machine with 5 GTX 680s, and some production machines, each with 4 Tesla C1070s. They are all running the same CentOS 5 64-bit with CUDA 5.5 RC and the included driver. The Titans have given us problems, taking ~5 minutes to start, and we tracked the issue down to a hang on the first cudaMalloc() call. We also have random crashes in CUDPP on the Titans, but that may be completely unrelated. The other machines operate without issue. Has anyone else observed this behavior? Any other ideas to help debug?

This kind of delay is not related to cudaMalloc() per se, but is due to the time taken by the implicit CUDA context creation that is triggered by the first call to cudaMalloc(). You can double-check this by putting a cudaFree(0) ahead of the first cudaMalloc(); you should then see the delay occurring on that CUDA call.
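For what it's worth, here is a minimal sketch of that check, assuming the CUDA runtime API on Linux; it just times cudaFree(0) (the implicit context creation) separately from the first real cudaMalloc():

```cpp
// Sketch: separate the implicit context-creation time from the first real
// allocation by forcing context creation with cudaFree(0).
#include <cstdio>
#include <time.h>
#include <cuda_runtime.h>

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   // POSIX; link with -lrt on older glibc
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = seconds();
    cudaError_t err = cudaFree(0);         // triggers implicit CUDA context creation
    double t1 = seconds();

    void *buf = 0;
    cudaMalloc(&buf, 1 << 20);             // should now return quickly
    double t2 = seconds();

    printf("cudaFree(0) / context creation: %.2f s (%s)\n", t1 - t0, cudaGetErrorString(err));
    printf("first cudaMalloc:               %.2f s\n", t2 - t1);

    cudaFree(buf);
    return 0;
}
```

On an old distribution like CentOS 5 you may need to add -lrt when linking because of clock_gettime().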

Initialization time can be long if the driver gets unloaded while the GPU is not in use and needs to be reloaded when the GPU is used the next time. To prevent unloading of the driver, activate persistence mode with nvidia-smi.
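If I remember correctly, persistence mode is enabled with nvidia-smi run as root, something along these lines (it does not survive a reboot, so people usually put it in an init script):

```
# run as root; keeps the driver loaded even when no client is using the GPUs
nvidia-smi -pm 1
```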

Thanks for the advice. As you suggested, a cudaFree(0) triggers the hang. Persistence mode helps, but only a little. Hopefully newer drivers will mitigate this.

If your Linux version is one of the versions supported by CUDA, I would suggest filing a bug using the bug reporting form linked from the registered developer website. From what I understand, the context initialization time increases with the number of GPUs connected, and possibly with the amount of memory on each of the GPUs, but with persistence mode enabled the initialization should not take minutes. It is also interesting that, among your three systems, all with multiple large GPUs, only the system with the GTX Titans shows this behavior.

Just figured out that if we change the CUDPP build script to also compile for the sm_35 architecture, the context creation times go back to normal. Not sure what to make of that just yet; we’re still poking around.

The driver is generating new code for sm_35 (it is JIT-compiling the PTX embedded in the binary). It should only do this once, but the default JIT cache size is only 32 MB. CUDPP is pretty big, so you will need to expand the cache size using the environment variable CUDA_CACHE_MAXSIZE. Setting it to 512 MB should fix your problem.
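If I recall correctly, the variable is specified in bytes, so 512 MB would look something like this in the environment of the process that creates the context:

```
# raise the CUDA JIT (PTX) cache limit to 512 MB; the value is in bytes
export CUDA_CACHE_MAXSIZE=536870912
```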

Sorry, I overlooked the JIT delay issue because in my mind I threw the GTX 680 and the GTX Titan into the same “Kepler” bucket, but of course they are two different Kepler architectures (sm_30 and sm_35, respectively). It would be best not to rely on the JIT compiler and simply build a fat binary that incorporates the sm_20, sm_30, and sm_35 binary machine code.
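With the CUDA 5.5 toolchain, that means passing one -gencode option per target architecture to nvcc in the CUDPP build, roughly like the sketch below (kernels.cu is a placeholder file name; the final compute_35 entry embeds PTX as a forward-compatibility fallback for future GPUs):

```
nvcc -O2 \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35 \
     -c kernels.cu -o kernels.o
```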