Hi, I’m trying to learn how the GPU schedules concurrent kernels under resource limits (threads/registers/shared memory). With ordinary computation, the GPU co-schedules blocks from different kernels as much as possible. But when I try to control each block’s running time, by reading the globaltimer register or calling clock()/clock64() in a loop, the kernels are serialized: the second one starts only after the first finishes. One thing to note: two kernels that reference time will still run concurrently if they are launched close together, but are serialized if one is launched about 1 second later.
The documentation only mentions that configuration changes such as the L1/shared-memory split can cause kernels to serialize. Does reading globaltimer cause similar serialization? In other words, does reading globaltimer or calling clock()/clock64() prevent concurrent kernel execution?
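For reference, the two device-side timer reads I’m asking about look like this (a minimal sketch; since there is no C-level wrapper for %globaltimer, it is read via inline PTX, which I believe is the usual way to access it):

```cuda
// clock64(): per-SM cycle counter, available as a CUDA built-in
__device__ unsigned long long cycles() {
    return clock64();
}

// %globaltimer: nanosecond-resolution global timer, read via inline PTX
__device__ unsigned long long global_timer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}
```

Either of these, used in a spin loop as a delay, reproduces the serialization I describe below.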
Thanks,
- Ming
Additional information about my experiment is below:
I launched the second kernel 1 second later, before the first finished. The two kernels were then not co-scheduled. Code below:
#include <stdio.h>
#include <math.h>   // round
#include <time.h>   // clock_gettime, struct timespec

#define DELAY_VAL 5000000000ULL

// Host-side wall-clock time in milliseconds
long milliseconds()
{
    long ms;   // milliseconds
    time_t s;  // seconds
    struct timespec spec;

    clock_gettime(CLOCK_REALTIME, &spec);
    s = spec.tv_sec;
    ms = round(spec.tv_nsec / 1.0e6); // convert nanoseconds to milliseconds
    return ms + s * 1000;
}

// Device-side busy-wait: spin until DELAY_VAL cycles have elapsed
__global__ void child(){
    unsigned long long start = clock64();
    while (clock64() < start + DELAY_VAL);
}

// Uses dynamic parallelism to launch child (hence -rdc=true -lcudadevrt)
__global__ void parent(){
    child<<<1,1>>>();
}
int main(int argc, char* argv[]){
    cudaStream_t st1, st2;
    cudaStreamCreate(&st1);
    cudaStreamCreate(&st2);
    long start = milliseconds();
    long now = 0;
    parent<<<1,1,0,st1>>>();
    if (argc > 1){
        printf("running double kernel\n");
        // host-side spin: delay the second launch by ~1 second
        while (now < start + 1000) {
            now = milliseconds();
        }
        parent<<<1,1,0,st2>>>();
    }
    cudaDeviceSynchronize();
}
Results:
$ nvcc -arch=sm_53 -rdc=true -lcudadevrt t815.cu -o t815
$ time ./t815
real 0m5.451s
user 0m0.560s
sys 0m0.820s
$ time ./t815 1
running double kernel
real 0m10.467s
user 0m2.330s
sys 0m0.870s
$ time cuda-memcheck ./t815 1
========= CUDA-MEMCHECK
running double kernel
========= ERROR SUMMARY: 0 errors
real 0m12.028s
user 0m3.870s
sys 0m1.680s
These timing results show that the two kernels are not concurrent: roughly 5 s for one kernel versus roughly 10 s for two. I also confirmed with NVVP that they are serialized.
With ordinary computation (no timing code), the kernels have run concurrently every time I’ve tried. So the GPU seems to apply a different scheduling policy to code that references time.
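For comparison, this is the kind of timer-free delay kernel I mean by “normal computation” (a sketch; the iteration count is an arbitrary placeholder, and the volatile accumulator is just there to keep the compiler from eliminating the loop):

```cuda
// Hypothetical timer-free delay: spin for a fixed number of iterations
// instead of polling clock64()/globaltimer.
__global__ void child_notime(){
    volatile unsigned long long acc = 0;
    for (unsigned long long i = 0; i < 1000000000ULL; i++)
        acc += i;
}
```

Two instances of a kernel like this, launched into separate streams with a 1-second gap, do co-schedule in my runs.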
Environment: Jetson TX1 with Ubuntu 16.04, CUDA 8.0
Full log of deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3994 MBytes (4188229632 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 72 MHz (0.07 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = NVIDIA Tegra X1
Result = PASS
This question was previously posted in the CUDA Programming and Performance forum: https://devtalk.nvidia.com/default/topic/1000091/cuda-programming-and-performance/reading-globaltimer-register-or-calling-clock-clock64-in-loop-prevent-concurrent-kernel-execution-/post/5109993/