Varying cuBLAS initialization times across different OS configurations

Most users are aware that there is a 'one time' overhead to set up the context for a library such as cuBLAS or cuFFT.

I have been isolating these times and comparing three different cases:

  1. Windows 7 x64, Maxwell Titan X (in TCC mode), CUDA 8.0 => cuBLAS init time avg = 220 ms

  2. Windows 8.1 x64, Pascal GTX 1080 Ti (WDDM), CUDA 8.0 => cuBLAS init time avg = 353 ms

  3. Ubuntu 14.04 LTS, Pascal GTX 1080 Ti, CUDA 8.0 => cuBLAS init time avg = 460 ms

None of the GPUs tested is connected to a display, though even when they are, the initialization time is still roughly the same.

What surprises me here is that the longest init time comes from Ubuntu, while the shortest comes from Windows 7 using the TCC driver mode. The fact that the much-maligned WDDM driver model on Windows handily beats Ubuntu is also surprising.

The reason this matters is that when calling a DLL that uses both CUDA and cuBLAS, the init time is often longer than the computation performed by the application.

There probably is no way to avoid this overhead, but the difference across OSes is puzzling. The Windows 8.1 PC and the Ubuntu 14.04 PC are almost identical in terms of hardware, yet the Windows PC always seems to be a bit faster, and this is just one example.

Anyone have any insight to this issue?

My understanding was that the TCC driver mode on Windows generally matched the performance of a Linux OS, but obviously that is not the case here.

How were these times measured? I could imagine that I/O plays into this, e.g. file caching. The speed of memory mapping for UVM may be affected by existing fragmentation of page mappings, which is influenced by uptime. I assume the machines involved all have the same amount of system memory?

The way I would measure these times for comparison purposes is with a freshly booted machine, repeating the app execution ten times, then reporting the best of ten times.

I used the Windows timer, a typical Linux timer, and the times generated by nvprof. All these times were reasonable and consistent over many trials.
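For reference, a minimal sketch of the kind of harness one might use to separate the CUDA context setup from the cuBLAS handle creation (hypothetical code, not the exact harness used; it needs nvcc and `-lcublas`, and should be run repeatedly from a fresh boot, taking the best time):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Returns elapsed wall-clock milliseconds for a callable.
template <typename F>
static double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // cudaFree(0) forces the lazy CUDA context creation by itself.
    double ctx_ms = time_ms([] { cudaFree(0); });

    // With the context already up, this isolates cuBLAS's own init cost.
    cublasHandle_t handle;
    double blas_ms = time_ms([&] { cublasCreate(&handle); });

    printf("CUDA context init: %.1f ms, cublasCreate: %.1f ms\n",
           ctx_ms, blas_ms);
    cublasDestroy(handle);
    return 0;
}
```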

No, the Windows 7 system has 32 GB of 1666 MHz DDR3 DRAM while the other two PCs have 64 GB of 2400 MHz DDR4 DRAM.
On paper the Ubuntu system should be the fastest based on the hardware, but evidently it is more complicated than just looking at the hardware specs.

I would expect the startup overhead to be limited by single-thread activity, system memory throughput, and I/O throughput (with different aspects dominating in different phases). Ideally you would use a Windows / Linux dual-boot configuration to maintain hardware parity.

I would expect a good deal of the overhead to be OS-related. Maybe a system-level trace tool (strace, dtrace? haven't used those tools in ages) can pinpoint where the majority of the time goes at the OS level on Linux?

BTW, is the app linked statically, or dynamically? That could make a significant difference due to dynamic linking overhead.

I’m curious how you would be able to attribute a time to a specific lazy initialization process, and how you know what you are measuring.

Did you time cublasCreate() by itself? Or something else?

I would certainly expect a library initialization process to be OS-dependent. And if your measure of goodness for GPU computing is a 250 ms difference in initialization time, then it's not obvious to me why you are using a GPU. For example, if this statement is correct:

"The reason this matters is that when calling a DLL which uses both CUDA and cublas that init time often is longer than the computation performed during the application. "

then I can’t imagine why you would be using a GPU there.

Independent of this use case, it is useful to minimize initialization overhead in CUDA components, as this increases the versatility of CUDA-accelerated applications. Over the years, faster GPUs have drastically reduced the time spent on actual GPU work, while the associated CPU-side overhead has not seen similar reductions, and in some cases has increased. This is a worked example of Amdahl's Law: a 100x speedup of the parallelized components now exposes bottlenecks in the serial portion of the code.
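In symbols, with a fraction p of the runtime parallelized and sped up by a factor s, the overall speedup is bounded as

```latex
S = \frac{1}{(1 - p) + p/s}
```

For example, with p = 0.99 and s = 100, S = 1/(0.01 + 0.0099) ≈ 50: half of the potential 100x gain is already lost to the 1% serial portion.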

In some cases the overhead may be immutable, since it is inherent in underlying OS mechanisms, while in other cases it may be partially influenced by the manner in which CUDA uses or configures various OS mechanisms. It seems worthwhile to find out where that time is spent.

I am using a GPU because even with the overhead it is still 3-4 times faster than the equivalent MATLAB function calls (which do use the capabilities of the multi-core CPU).

Was simply curious what the timeline was for the operations; in this case the actual kernels plus cuBLAS function calls took ~150 ms, with the total initialization (CUDA setup plus creating a cuBLAS handle) being about 400 ms.

Just tested on a Windows 10 x64 system with 128 GB of DDR4 RAM and two GTX 1080 Ti GPUs; the first CUDA initialization took about 570 ms. I was calling via a MATLAB mex DLL, and subsequent calls to the CUDA mex DLL showed 0 ms of CUDA initialization time after the first 'cold start'.
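The usual way to keep that cold-start cost out of subsequent calls is to create the context-dependent objects once inside the DLL and reuse them, rather than recreating them per call. A sketch of the pattern (hypothetical, not the poster's actual mex code):

```cpp
#include <cublas_v2.h>

// Create the cuBLAS handle once and reuse it across DLL/mex entry points.
// The first call pays the CUDA + cuBLAS init cost; later calls do not.
// C++11 guarantees thread-safe initialization of the local static.
static cublasHandle_t get_handle() {
    static cublasHandle_t handle = [] {
        cublasHandle_t h;
        cublasCreate(&h);   // triggers context creation on first use
        return h;
    }();
    return handle;
}
```

The handle (and the CUDA context behind it) then lives for the lifetime of the hosting process, which matches the observed 0 ms on repeat calls.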

Interestingly, the initialization of cuFFT along with two very large calls to cufftPlanMany(), which allocated over 3 GB, took only 333 ms, which is rather low all things considered.

So yes, it does seem that the first CUDA initialization time is related to the amount of system memory.