I have a piece of code like this:(some parameters in functions are omitted)
cudaStream_t stream[MAXSTREAM];
for(int i = 0; i < n; i ++)
{
int *h_toDevice;
cudaMallocHost((void **)&h_toDevive);
memset(h_toDevive);
int *d_toDevice;
cudaMalloc(&d_toDevice);
for(int s = 0; s < MAXSTREAM; s ++)
cudaMemcpyAsync(d_toDevice,h_toDevive,stream[i%MAXSTREAM]);
for(int s = 0; s < MAXSTREAM; s ++)
kernelFunc<<<stream[i%MAXSTREAM]>>>();
for(int s = 0; s < MAXSTREAM; s ++)
{
cudaFreeHost(h_toDevice);
cudaFree(d_toDevice);
}
}
(The MAXSTREAM is about 50,and n is about 100)
For n times, I launch MAXSTREAM kernel functions at a time to process some tasks. But something I can’t understand come up. For example, I launch 20 blocks in a kernel function, but 10 of them are not launched according to output imformation! I can’t get any errors using cuda-memcheck or cudaGetLastError() after kernel function launch.
I don’t understand why some blocks in a kernel could launch while others can’t.
I guess the GPU may run out of compute resources.But I put compute tasks in streams, and max number of kernel launched simultaneously is limited, will that still happen? If so,how can I get the error information to arrange the tasks reasonably?