CUDA initialization takes too much time

thanasisGiannis · August 27, 2017, 1:06am

Hello,

I am using the MAGMA library and found that magma_init() takes a long time. This function is a wrapper for cudaGetDeviceCount(), so it is cudaGetDeviceCount() that takes surprisingly long. Any idea where the problem might be? I am using a K40.

Here is the link to the thread I started on the MAGMA forum:

http://icl.cs.utk.edu/magma/forum/viewtopic.php?f=2&t=1581&sid=dad6dd5822d7ce77f4b2fb9e2a9c7bc9

njuffa · August 27, 2017, 1:18am

How much time, exactly? Is this a system with multiple GPUs or large system memory? A Windows or a Linux system? If the latter: Is the CUDA driver in persistence mode? Other than calling cudaGetDeviceCount(), what does magma_init() do?

thanasisGiannis · August 27, 2017, 1:34am

Up to 2 seconds. It's a single K40c on a Linux system. Persistence mode is disabled.
Actually, it is the first CUDA call that takes too long. Here are some numbers from various servers:

k20: 1.35 sec
p100: 0.88 sec
k40: 6.21 sec
c2050: 1.15 sec

For more details, see the last answer in the linked thread (http://icl.cs.utk.edu/magma/forum/viewtopic.php?f=2&t=1581&sid=dad6dd5822d7ce77f4b2fb9e2a9c7bc9)

Robert_Crovella · August 27, 2017, 1:41am

Enable persistence mode.
If you only intend to use a single GPU, then use the CUDA_VISIBLE_DEVICES environment variable to restrict your CUDA runtime footprint to that GPU only:

Programming Guide :: CUDA Toolkit Documentation

That is about all I know of that you can reasonably do to reduce init time. Init time may vary based on your exact program, CUDA version, driver version, OS (e.g. Linux or Windows), exact GPU being used, size of system memory, number of GPUs in the system (although see above), and probably other factors.

In some cases, enabling persistence mode can make a substantial difference in init time.
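
To check whether either suggestion helps on a given server, something like the following rough sketch could be used. It times whole runs of a program with and without CUDA_VISIBLE_DEVICES set. The helper names `build_env` and `time_command` and the `./my_cuda_app` binary are made up for illustration, and the numbers are wall-clock time for the entire process, so CUDA init is only part of what is measured. Persistence mode itself is switched on separately (typically `nvidia-smi -pm 1` as root); rerunning the script before and after should show any difference.

```python
import os
import subprocess
import time


def build_env(visible_devices=None, base_env=None):
    """Copy the environment, optionally restricting CUDA to the given GPU(s)."""
    env = dict(os.environ if base_env is None else base_env)
    if visible_devices is not None:
        # The CUDA runtime reads CUDA_VISIBLE_DEVICES when it initializes.
        env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return env


def time_command(cmd, visible_devices=None, repeats=3):
    """Run cmd `repeats` times; return each run's wall-clock time in seconds."""
    env = build_env(visible_devices)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical usage (./my_cuda_app stands in for your own binary):
#   print(time_command(["./my_cuda_app"]))                       # all GPUs visible
#   print(time_command(["./my_cuda_app"], visible_devices="0"))  # GPU 0 only
```

The first run in each series is usually the interesting one when persistence mode is off, since the driver may be unloaded between runs.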

njuffa · August 27, 2017, 1:47am

Once you have addressed the items listed by txbob, I think you will find that CUDA initialization time is largely a function of the amount of system memory in each server, because all memory (attached to GPUs and CPUs) in the machine needs to be mapped into a single address space at CUDA startup. Because this mapping involves mostly single-threaded OS calls, you may find (other parameters being equal) that the server with the highest single-thread performance initializes in the shortest time. You may also find that higher system memory throughput reduces the initialization time.

thanasisGiannis · August 27, 2017, 1:56pm

Thank you for your replies! Indeed the time was reduced!