Reading globaltimer register or calling clock/clock64 in loop prevent concurrent kernel execution?

MingYang4 · March 18, 2017, 6:55pm

Hi, I’m trying to learn how GPU schedule concurrent kernels with the resource limitation (thread/register/shared mem). It allows normal computation to be concurrent, trying to schedule blocks from different kernels as much as possible. But when I want to control the running time for each block, by reading globaltimer register or calling clock/clock64 functions in a loop, kernels will be serialized, starting a new one after previous one finish. One thing to notice, two kernels referencing time will run concurrently if they are launched closely enough, but will be serialized if one is launched like 1 second later.

I only read in the documentation saying that configuration such as L1/shared mem changing could lead to serializing kernels, is reading globaltimer and so causing the similar serialization? In another word, is reading globaltimer/calling clock/clock64 preventing concurrent kernel execution?

Thanks,

Ming

Additional information about my experiment is below:

I launched the second kernel 1 second late, before the first finishes. Then they are not co-scheduled. Code below:

#include <stdio.h>

#define DELAY_VAL 5000000000ULL

long milliseconds()
{
    long            ms; // Milliseconds
    time_t          s;  // Seconds
    struct timespec spec;

    clock_gettime(CLOCK_REALTIME, &spec);

    s  = spec.tv_sec;
    ms = round(spec.tv_nsec / 1.0e6); // Convert nanoseconds to milliseconds
    return ms + s *1000;
}

__global__ void child(){

    unsigned long long start = clock64();
    while (clock64()< start+DELAY_VAL);
}

__global__ void parent(){

    child<<<1,1>>>();
}

int main(int argc, char* argv[]){

    cudaStream_t st1, st2;
    cudaStreamCreate(&st1);
    cudaStreamCreate(&st2);

    long start = milliseconds();
    long now = 0;
    parent<<<1,1,0,st1>>>();
    if (argc > 1){
        printf("running double kernel\n");
        while ( now < start + 1000) {
            now = milliseconds();
        }
        parent<<<1,1,0,st2>>>();
    }
    cudaDeviceSynchronize();
}

Results:

$ nvcc -arch=sm_53 -rdc=true -lcudadevrt t815.cu -o t815
$ time ./t815

real    0m5.451s
user    0m0.560s
sys     0m0.820s

$ time ./t815 1
running double kernel

real    0m10.467s
user    0m2.330s
sys     0m0.870s

$ time cuda-memcheck ./t815 1
========= CUDA-MEMCHECK
running double kernel
========= ERROR SUMMARY: 0 errors

real    0m12.028s
user    0m3.870s
sys     0m1.680s

These timing results show two kernels are not concurrent. I also confirmed that they are serialized with NVVP.

With normal computation (no time code), kernels would always run concurrently, at least every time I’ve seen. So it seems GPU is treating code referencing time with different policy.

Environment: Jetson TX1 with Ubuntu 16.04, CUDA 8.0
Full log of deviceQuery:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3994 MBytes (4188229632 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            72 MHz (0.07 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = NVIDIA Tegra X1
Result = PASS

This question was previously posted in CUDA programing and performance before: https://devtalk.nvidia.com/default/topic/1000091/cuda-programming-and-performance/reading-globaltimer-register-or-calling-clock-clock64-in-loop-prevent-concurrent-kernel-execution-/post/5109993/

AastaLLL · March 20, 2017, 9:54am

Hi,

Thanks for your question.

Please check following page first.
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

MingYang4 · March 20, 2017, 4:06pm

Thanks for your reference. I did check it. And also a few other references talking about stream and concurrency. https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwio-OWVuuXSAhUGPiYKHUu9CncQFggcMAA&url=http%3A%2F%2Fon-demand.gputechconf.com%2Fgtc-express%2F2011%2Fpresentations%2FStreamsAndConcurrencyWebinar.pdf&usg=AFQjCNHsStlFXW5xeKh6bfETkS8TOEO8bg&sig2=FOgO8Hp_vAZguRX2wjPI9w

Still thinking that the example program I gave should have kernels run concurrently, but they didn’t on Jetson TX1. Figuring why.

AastaLLL · March 22, 2017, 9:56am

Hi,

With a slightly modification, kernels can run concurrently.
Guess that original kernel code returns immediately rather than waiting.

Could you help us check it again? And please let us know the results.
Thanks.

__global__ void child(){
    unsigned long long start = clock64();
    while ( (clock64()-start)/1000<1000000);
}

MingYang4 · March 22, 2017, 1:03pm

AastaLLL:

Hi,

With a slightly modification, kernels can run concurrently.
Guess that original kernel code returns immediately rather than waiting.

Could you help us check it again? And please let us know the results.
Thanks.
__global__ void child(){
    unsigned long long start = clock64();
    while ( (clock64()-start)/1000<1000000);
}

Please take a look at the DELAY_VAL

#define DELAY_VAL 5000000000ULL

Which is 50x the value you give. So that’s not the problem.

Could you help try this on your Jetson TX1, if you have the device?

MingYang4 · March 25, 2017, 8:50pm

Problem solved. I didn’t stop lightdm service. Although I haven’t figured out how such small size kernel of <<<1,1>>> can be delayed, it doesn’t happen any more.

Thanks to everyone!