some blocks in kernel can't launch

hulin · April 16, 2018, 12:48pm

I have a piece of code like this:(some parameters in functions are omitted)

cudaStream_t stream[MAXSTREAM];
for(int i = 0; i < n; i ++)
{
    int *h_toDevice;
    cudaMallocHost((void **)&h_toDevive);
    memset(h_toDevive);

    int *d_toDevice;
    cudaMalloc(&d_toDevice);
    for(int s = 0; s < MAXSTREAM; s ++)
        cudaMemcpyAsync(d_toDevice,h_toDevive,stream[i%MAXSTREAM]);

    for(int s = 0; s < MAXSTREAM; s ++)
        kernelFunc<<<stream[i%MAXSTREAM]>>>();

    for(int s = 0; s < MAXSTREAM; s ++)
    {
        cudaFreeHost(h_toDevice);
        cudaFree(d_toDevice);
    }
}

(The MAXSTREAM is about 50,and n is about 100)
For n times, I launch MAXSTREAM kernel functions at a time to process some tasks. But something I can’t understand come up. For example, I launch 20 blocks in a kernel function, but 10 of them are not launched according to output imformation! I can’t get any errors using cuda-memcheck or cudaGetLastError() after kernel function launch.
I don’t understand why some blocks in a kernel could launch while others can’t.
I guess the GPU may run out of compute resources.But I put compute tasks in streams, and max number of kernel launched simultaneously is limited, will that still happen? If so,how can I get the error information to arrange the tasks reasonably?

cbuchner1 · April 16, 2018, 3:43pm

none of the calls to cudaMallocHost, memset, cudaMalloc make sense.

the calls miss required arguments.

hulin · April 17, 2018, 1:50am

The code I showed is just pseudo code.As I saied,some parameters of the functions are omitted.
The code can be compiled and run.

cbuchner1 · April 17, 2018, 8:51am

If you posted complete & self contained code examples that compile unmodified, plus a short statement of what the expected output is vs. what you actually received, then would be more likely that someone is willing to look into the details of what is happening here.

I don’t see a cudaDeviceSynchronize() after you launch all your kernels. It may be that you’re killing the GPU context before the kernel is done launching? On the other hand, the cudaFree() should perform implicit synchronization.