Hi, I’m trying to learn how the GPU schedules concurrent kernels under resource limits (threads/registers/shared memory). Normal computation runs concurrently: the GPU schedules blocks from different kernels together as much as possible. But when I try to control the running time of each block, by reading the %globaltimer register or calling clock()/clock64() in a loop, the kernels are serialized: the second one starts only after the first finishes.
The documentation only says that configuration changes, such as the L1/shared-memory split, can serialize kernels. Does reading %globaltimer (or calling clock()/clock64()) cause a similar serialization? In other words, does reading %globaltimer or calling clock()/clock64() prevent concurrent kernel execution?
But how else can I explain that, with the following code in the kernel, no other kernel executes concurrently with it, while if I remove this code, concurrent kernel execution happens?
I couldn’t find the reason either. Ideally clock() shouldn’t influence concurrent kernel execution, but it seems to be the only thing preventing CKE here.
long long int start = clock64();
while (clock64() < start + 1000000000) continue;
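For reference, the %globaltimer register mentioned above has no C intrinsic and is read via inline PTX. A minimal sketch of an equivalent spin loop using it (nanosecond units per the PTX ISA; the function and kernel names are my own):

```cuda
__device__ unsigned long long globaltimer()
{
    unsigned long long t;
    // %globaltimer is a nanosecond-resolution global timer (PTX ISA special register)
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void timed_block(unsigned long long delay_ns)
{
    unsigned long long start = globaltimer();
    while (globaltimer() < start + delay_ns) continue;
}
```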
Running on a TX1 (Maxwell, 2 SMs, 2048 threads per SM, so 4 * 1024 threads total).
Comparison between two experiments:
Experiment 1: use clock/globaltimer to control run time for each block
Result: K2 starts after K1 finishes
Experiment 2: no clock/globaltimer code in the kernel, pure normal computation (assign the square root of the index to each array element)
Result: K2 is concurrent with K1
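A minimal sketch of the two experiment kernels as described above (the names and the exact delay constant are my assumptions; only the behavior comes from the thread):

```cuda
// Experiment 1: each block spins until clock64() has advanced by DELAY
// cycles. This is the version observed to serialize the two kernels.
#define DELAY 1000000000LL
__global__ void timed_kernel()
{
    long long int start = clock64();
    while (clock64() < start + DELAY) continue;
}

// Experiment 2: pure computation with no time reads. This version was
// observed to run concurrently with a second kernel.
__global__ void compute_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sqrtf((float)i);
}
```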
Thanks for the information. I tried it, and then realized I forgot to mention one important detail:
If the two kernels are launched close together in time, they are always concurrent. But if the second one is launched late enough, for example 1 second after the first (while the first is still running), they are not co-scheduled. Code below:
#include <stdio.h>
#include <math.h>   // round
#include <time.h>   // clock_gettime

#define DELAY_VAL 5000000000ULL

long milliseconds()
{
  long ms;             // milliseconds
  time_t s;            // seconds
  struct timespec spec;

  clock_gettime(CLOCK_REALTIME, &spec);
  s = spec.tv_sec;
  ms = round(spec.tv_nsec / 1.0e6);  // convert nanoseconds to milliseconds
  return ms + s * 1000;
}

__global__ void child(){
  unsigned long long start = clock64();
  while (clock64() < start + DELAY_VAL);
}

__global__ void parent(){
  child<<<1,1>>>();   // dynamic parallelism: needs -rdc=true and -lcudadevrt
}

int main(int argc, char* argv[]){
  cudaStream_t st1, st2;
  cudaStreamCreate(&st1);
  cudaStreamCreate(&st2);
  long start = milliseconds();
  long now = 0;
  parent<<<1,1,0,st1>>>();
  if (argc > 1){
    printf("running double kernel\n");
    // busy-wait on the host for ~1 second before the second launch
    while (now < start + 1000) {
      now = milliseconds();
    }
    parent<<<1,1,0,st2>>>();
  }
  cudaDeviceSynchronize();
}
With normal computation (no timing code), the kernels always run concurrently, at least every time I’ve looked. So it seems the GPU treats code that reads the time under a different policy.
Here is my test case, using exactly the code you posted, on CUDA 8, CentOS7, Tesla K20X:
$ nvcc -arch=sm_35 -rdc=true -o t958 t958.cu -lcudadevrt
$ time ./t958
real 0m12.794s
user 0m2.373s
sys 0m8.661s
$ time ./t958 1
running double kernel
real 0m13.768s
user 0m3.169s
sys 0m8.815s
$ time cuda-memcheck ./t958 1
========= CUDA-MEMCHECK
running double kernel
========= ERROR SUMMARY: 0 errors
real 0m19.088s
user 0m4.087s
sys 0m13.272s
$
This system has a lot of main memory and 4 GPUs in it, and I don’t have the GPUs in persistence mode, so there is a long start-up delay (5-6s) for any CUDA code running on this system.
In any event, we see above that the single kernel run takes 12.794s, the double kernel run takes approximately 1 second more at 13.768s, and the serialized run (cuda-memcheck serializes kernel launches) takes ~19s.
These numbers make sense to me: in the double kernel launch case, delaying the second kernel by 1 second adds almost exactly 1 additional second of total execution time, which is exactly what I would expect. The only way these numbers add up is if the two parent kernels (and their child kernels) are running concurrently in the 13.768s case.
Interesting! This could be specific to the Jetson TX1 then, or to architectures newer than Kepler, or to a newer compute capability (the TX1 is sm_53 Maxwell; the K20X is sm_35 Kepler). My results below show they are not concurrent. I also confirmed with nvvp, which shows them serialized.
$ time ./t815
real 0m5.451s
user 0m0.560s
sys 0m0.820s
$ time ./t815 1
running double kernel
real 0m10.467s
user 0m2.330s
sys 0m0.870s
$ time cuda-memcheck ./t815 1
========= CUDA-MEMCHECK
running double kernel
========= ERROR SUMMARY: 0 errors
real 0m12.028s
user 0m3.870s
sys 0m1.680s
Yes, Jetson TX1. CUDA 8.0 with Ubuntu 16.04. Full deviceQuery log is below:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3994 MBytes (4188229632 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 72 MHz (0.07 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = NVIDIA Tegra X1
Result = PASS
Jetson has a variety of settings that I’m not intimately familiar with. It’s possible your issue is specific to Jetson. You might get some additional folks to look at it by posting in the Jetson forum.
Problem solved. I hadn’t stopped the lightdm service. Although I still haven’t figured out how such a tiny <<<1,1>>> kernel could be delayed by it, the serialization doesn’t happen anymore.
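For anyone hitting the same thing, a sketch of the fix on a TX1 running Ubuntu 16.04 with systemd (the jetson_clocks.sh location is an assumption; it varies by L4T release):

```shell
# Stop the display manager so the desktop no longer shares the GPU
sudo systemctl stop lightdm

# Optionally pin the TX1 clocks to maximum for reproducible timing runs
# (script location differs between L4T releases)
sudo ~/jetson_clocks.sh
```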