CUDA execution times in multiples of 16 ms

I have a sequence of kernels that are executed hundreds to thousands of times, all of which operate on a single set of input data in order to produce a result. Everything functions fine and behaves as expected.

I am now focusing on performance and have noticed that the execution time of these kernels for each set of input data always differs by some multiple of 16 ms +/- 1 ms. This is, of course, very close to the refresh interval of my display devices.

I have changed the behavior of the kernels to modify their performance, and while the times change, the differences remain very close to a multiple of 16 ms.

The GPUs I am running on have no output device connected to them (I have a separate GPU strictly for Windows to display on my monitors). Our application does not use any graphics on the CUDA device and shouldn’t care about the display rate at all. I have gone through the NVIDIA control panel and disabled all of the graphics-specific settings for this device. In short, I am using the GPU strictly as a non-rendering coprocessor to compute a result and return it to the CPU.

Our application essentially just crunches numbers and produces a result. There is no need for it to sync to the display whatsoever.

  • Is there some setting I have missed to make this device not behave like it needs to sync with the display?
  • Is there potentially something else I haven't considered that might be causing this?

Are your GPUs in WDDM mode?

You may have better results under Windows if you can use GPUs that can be placed into TCC mode. Having said that, I'm not sure there's anything about WDDM that would enforce a 16 ms cycle. There may be something else going on.

Under WDDM on Windows, the CUDA driver batches kernel launches. Insert cudaStreamQuery(0) to force immediate submission of your kernels to the GPU.
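A minimal sketch of what this looks like (the kernel, data, and launch configuration below are placeholders, not your actual code):

#include <cuda_runtime.h>

// Placeholder kernel standing in for the real work.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchBatch(float *d_data, int n, int numLaunches)
{
    for (int k = 0; k < numLaunches; ++k) {
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaStreamQuery(0);      // under WDDM, nudges the driver to submit the queued work now
    }
    cudaDeviceSynchronize();     // wait for all submitted work before reading back results
}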

You don't describe the details of your measurement methodology. The timing granularity observed may also be the result of choosing a low-precision host-based timer. If I recall correctly, the default low-precision timer on Windows has a granularity very close to 16 ms. You would want to use a high-precision timing facility instead, such as gettimeofday() on Linux, which easily allows measurements with 1 usec resolution. Here is some code I have been using for many years; maybe it is helpful.

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
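For what it's worth, here is a minimal sketch (placeholder kernel, not from your application) of how second() might be used to time GPU work. Since kernel launches are asynchronous, synchronize before reading the timer on both sides of the measurement:

#include <cuda_runtime.h>
#include <cstdio>

extern double second(void);     // the routine defined above

// Trivial placeholder kernel standing in for the real workload.
__global__ void doWork(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void timeKernel(float *d_data, int n)
{
    cudaDeviceSynchronize();                      // drain previously queued work
    double start = second();
    doWork<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();                      // wait for the kernel to finish
    double stop = second();
    printf("kernel time = %.6f s\n", stop - start);
}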

Glad you’re back njuffa. I was starting to hyperventilate.

It’s refreshing to be completely off the grid every once in a while :-)

@njuffa Thanks for reminding me to double-check my timers, but yes, I am using the same timing routines shown in your example.

@tera I am not sure what cudaStreamQuery(0) is intended to do. According to the docs, this function simply queries the stream (which, in many APIs I have used, can stall the pipeline). Does this call have a side effect of flushing the stream or something? As an aside, I do eventually call cudaStreamQuery once all of the kernels have been inserted into the stream and the CPU thread is ready to read back the results.

When running CUDA with the WDDM driver on Windows, cudaStreamQuery(0) has the side effect of firing off the current accumulated batch of commands to the GPU. There are other CUDA API calls with the same side effect, cudaStreamQuery(0) is simply a safe, convenient, and low overhead one. The CUDA driver batches commands when used with the WDDM driver to mitigate the inherent cost of issuing commands to the GPU with that driver model. The batching reduces the average cost of issuing commands, but can cause other performance artifacts.

@njuffa Interesting. Makes sense.

Adding these calls does improve the overall performance of my system quite a bit, but it still exhibits the n*16 ms +/- 1 ms timing behaviour. In other words, adding cudaStreamQuery(0) seems to decrease the value of n for many of the inputs, but the times still seem to be locked to (very nearly) 16 ms increments.

I think at this point you would want to fire up the profiler to find out where the application time goes. If possible, consider splitting your state, e.g. according to a red-black scheme, to gain parallelism and the ability to use multiple streams.
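As a rough illustration (not a full red-black scheme; kernel and data names are placeholders), independent portions of the work can be issued into separate streams so the GPU has the opportunity to overlap them:

#include <cuda_runtime.h>

// Placeholder kernel standing in for one of your kernels.
__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void runSplit(float *d_data, int n)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int half = n / 2;
    process<<<(half + 255) / 256, 256, 0, s[0]>>>(d_data, half);                 // first half
    process<<<(n - half + 255) / 256, 256, 0, s[1]>>>(d_data + half, n - half);  // second half

    cudaStreamSynchronize(s[0]);     // wait for both halves to complete
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}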

While a “cycle time” of 16 ms may suggest some connection to display refresh rates, I believe this is a red herring based on the information you have provided in this thread.

If your GPU is capable of operating with the more efficient non-graphics TCC driver, I would suggest using that instead of the default WDDM driver, since your GPU is not used for display purposes in the first place.

OP, have you tried running and timing the application from the command prompt using nvprof? That should show you the correct times. The trace below, for example, runs the same __shfl()-based bitonic sort in the same stream (serial order) on 32 int2 values 16 times:

==6468== Profiling application: ConsoleApplication1.exe
==6468== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
283.90ms  1.6320us                    -               -         -         -         -      256B  156.86MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
283.90ms  4.4160us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [183]
283.96ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [188]
284.00ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [193]
284.05ms  2.9130us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [198]
284.10ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [203]
284.14ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [208]
284.19ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [213]
284.23ms  2.9440us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [218]
284.28ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [223]
284.33ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [228]
284.38ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [233]
284.42ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [238]
284.47ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [243]
284.52ms  2.8800us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [248]
284.56ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [253]
284.61ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [258]
284.65ms  2.2720us                    -               -         -         -         -      256B  112.68MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]

From the command line, use nvprof --print-gpu-trace yourApplication.exe

It turns out a CPU-side process was causing the 16 ms timing behaviour.

I am surprised how much time was saved by calling cudaStreamQuery(0), though. While it should be expected that efficiently flushing any queue would improve the timing of any system, I am seeing a 10-20% improvement in completion times. Considering that I was rearranging kernel memory usage patterns to get 5% improvements (which was time consuming), careful flushing of the stream actually yielded better gains.

Thanks for closing the loop. Sounds like the 16 ms is related to the duration of the Windows time slice?

Somebody was doing a Sleep(0) in a seemingly unrelated area of the code, which affected the timing of the CPU threads.

If you are using Windows, then maybe timeBeginPeriod() can increase the accuracy of your measurements.
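For example, a minimal sketch (assuming a Windows build linked against winmm.lib; the function name is a placeholder) of bracketing the timing-sensitive section with timeBeginPeriod()/timeEndPeriod() to request roughly 1 ms timer granularity:

#if defined(_WIN32)
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

void runTimedSection(void)
{
    timeBeginPeriod(1);     // request ~1 ms system timer resolution
    /* ... launch, synchronize, and time the CUDA work here ... */
    timeEndPeriod(1);       // always pair with the matching timeBeginPeriod call
}
#endif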