Killing all kernel threads

Hey, I apologize if this has been answered. I am not entirely sure how to formulate the question for searching (or more specifically, all my previous searches point in very wrong directions).

I have a series of kernels, each potentially doing substantial work (millions of threads). At a high level the code does some initialization (A, B), iterates a high-level loop (C, 10x max), makes a decision (D), and if the decision is successful performs another high-level loop (E, 10x max) and collects the results (F). So the serialization of this is as follows:

A, B, C, C, C, C, C, C, C, C, C, C, C, D, E, E, E, E, E, E, E, E, E, E, F

The high-level algorithm may need to do up to 10 iterations of C and E, but frequently needs far fewer (like 3-4 iterations).
With hardware that supports dynamic parallelism, each iteration of C and E could decide whether a subsequent iteration is required and issue the next kernel invocation, correct? Also, if D fails and decides E and F are unnecessary, it can simply not issue them to the stream, correct?
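
Something along these lines is what I mean for the dynamic parallelism version (a rough sketch only; C_iteration and d_done are made-up names, and it assumes a compute capability 3.5+ device built with relocatable device code):

__global__ void C_iteration( unsigned int *d_done, int ci );   // one refinement pass

__global__ void C_controller( unsigned int *d_done )
{
   // launched <<<1,1>>>: a single device thread plays the role of the host loop
   for( int ci = 0; ci < 10; ci++ )
   {
      C_iteration<<< 4096, 256 >>>( d_done, ci );   // launch parameters per pass
      cudaDeviceSynchronize();                      // device-side sync (dynamic parallelism)
      if( 0 != *d_done )                            // the IsCDone() equivalent
         break;                                     // no further passes are issued
   }
}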

The problem is that this is required to run on pre-dynamic parallelism hardware. So the real question is: Can I add a trivial check in C and E to decide if any work is necessary and simply early out (i.e. tell the GPU NOT to schedule any further threadblocks) for the current kernel invocation? (Or better yet for D, tell the currently active stream to kill and remove any future work items).

I found some posts describing asm("trap;"), but it seems this is more for exceptional behaviour. I am simply looking for a way to insert conditional behaviour into the high-level sequence of kernels.

So C and E are repeating kernel launches? In parallel? Or serially?

Repeating serial kernel launches (as is E). They are both used to refine a global solution at a high level. What is hidden is that C and E are each actually a serial sequence of kernels, each executing with different launch parameters (dimensions) that suit the work being done within that kernel.

You can think of the desired high level (sequential) behavior as (in pseudo C code):

void C()
{
   // These all use different launch parameters
   Ca();
   Cb();
   Cd();
}

void E()
{
   // These all use different launch parameters
   Ea();
   Eb();
   Ed();
}

main()
{
   // Initialization routines
   A();  // ~5ns
   B();  // ~10ns

   // Refine initial estimate
   for( int ci = 0; ci < 10; ci++ )
   {
      C();   // ~100-1500ns
      if( IsCDone() ) break;
   }

   // Check to see if refined initial estimate 
   // has potential final solution
   if( D() ) // ~5ns
   {
      // Refine final solution
      for( int ei = 0; ei < 10; ei++ )
      {
         E();  // ~100-1500ns
         if( IsEDone() ) break;
      }
   }

   // Gather relevant final solution and statistics
   F(); // ~5ns
}

We already have code within the kernels to perform the IsCDone()/IsEDone() checks, but each thread still needs to be scheduled, load the data from global memory, perform the check, and early out. This works, but adds a surprising amount of time to get to F when there isn't really anything useful going on.

Also, if D returns false, it would be nice to simply say "clear this stream, there is no more relevant work to do": don't schedule any further threads for E (with dynamic parallelism), or clear the existing schedule (without dynamic parallelism).

Unfortunately, the kernels that are doing the refinement do a significant amount of work and would greatly benefit from not running when they aren't necessary. (One invocation of C adds up to significantly more time than all the other non-C/E kernels, so stopping C/E at the earliest opportunity is very desirable.)

the ability to flush streams would indeed be the cheapest way
surprising myself, i have actually issued an rfe some time ago
i suppose it might get included in cuda 23, if we are lucky

i do not see why dynamic parallelism is a requirement
the other option is to use atomics
both D and F can set a global atomic that the host, as well as C and E, read
C and E would read the atomic at the very start of the kernel, and prematurely exit, if it is set
the host can use the atomic as such: it assumes it is not set (the iteration is not done), forward issues at least 1 iteration, copies the atomic, and simply stops forward issuing the moment it detects the atomic is set
this way, the host does not really need to break to evaluate conditionals; it only stops to wait for the device to catch up - to not run away from the device (via stream events)

Would it be feasible to kill kernel threads by triggering an exception (e.g. with an illegal memory access) that makes the entire CUDA kernel return with an error? This way, not all threads have to check the termination condition individually.

Of course the host code has to be ready to expect such a “dirty” kernel abort and be cool with it.

I had thought of that (trigger a device code assert, for example), but it means you’ll end up with a corrupted cuda context, so you’d better have copied whatever data you want back from the GPU before that happens, and then, as you say, the host code has to be prepared to do a full clean up e.g. cudaDeviceReset followed by re-allocation of all variables. Seems messy. little_jimmy did post a thread asking for something similar not long ago.

Is it really that bad? Say, making a read from address 0x00000000. Why would this leave the CUDA context in an undefined or corrupted state?

IMHO it should be enough to read (and thereby reset) the cuda error state with cudaGetLastError() and continue chugging along.

Is there any reliable (and definitive) documentation on recovery from failed kernel launches?

“This way, not all threads have to check the termination condition individually.”

even though i mentioned an atomic, i do not think it needs to be a ‘pure’ atomic
and, in a sense, not all threads need to read the termination condition - a single thread per kernel block can broadcast the global flag via shared memory
the flag should quickly cache too, i would think, as it does not need to be pure atomic
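
something like this, i would think - a rough sketch only, with made-up names:

__global__ void some_C_kernel( const unsigned int *g_done /*, ... */ )
{
   __shared__ unsigned int s_done;

   if( threadIdx.x == 0 )
      s_done = *g_done;          // one global read per block
   __syncthreads();

   if( s_done != 0 )
      return;                    // the whole block exits early

   // ... actual work ...
}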

still, in this case, control at the kernel level would be more optimal than control at the kernel block level

Since the whole pipeline is so complex, I would rather not cause an exception intentionally, as there are many places where a real exception could occur. Unless there is some way to define a unique exception identifier that I can disambiguate from real exceptions.

@little_jimmy I don't really need atomics since there is code in one of the C kernels that can simply report if there is no more computation to be done. Fortunately I am already doing this; all C/E kernels start like:

__global__ void Ca_Kernel( unsigned int  *g_ProcessComplete)
{
    if( 0 != g_ProcessComplete[ threadIdx.whatever ] ) return;
    ...

__global__ void Cb_Kernel( unsigned int  *g_ProcessComplete)
{
    if( 0 != g_ProcessComplete[ threadIdx.whatever ] ) return;
    ...

__global__ void Cd_Kernel( unsigned int  *g_ProcessComplete)
{
    if( 0 != g_ProcessComplete[ threadIdx.whatever ] ) return;
    ...

The real problem is that there are up to 30 of these (each), and if the launch parameters have millions (D) of data points, then each of these kernels gets called D/32 times. If the ith kernel determines there is no point in continuing, then there are (|C|+|E|-i)*D/32 unnecessary threadblock launches that do nothing but perform this check.

Say there are 1,000,000 data points and it is determined after the first Cd_Kernel launch that they can't yield a solution; then there are (30+30-3)*1,000,000/32 = 1,781,250 unnecessary calls to: if( 0 != g_ProcessComplete[ threadIdx.whatever ] ) return;

Again, this works but isn’t the most efficient.

If I had the ability to clear the stream, then the millions of threadblocks would never be scheduled, all remaining kernel launches would be avoided, and the system could return gracefully and immediately.

“I don’t really need atomics”

i should not have used the term ‘atomic’ - clearly it is now biting me in the … ankle
i meant a variable visible to multiple kernels, and the host
i personally like to refer to that as a sort of atomic, for lack of a better word; evidently, this is lacking

flushing streams would be most optimal; i think we agree on this
the problem is i do not see it implemented and thus as a viable solution for the foreseeable future

in light of this, premature termination - as you already do - seems the next best thing
i would only add that, if you waste that many kernel block launches - 1,781,250 unnecessary calls, as mentioned by you - due to conditionality, then perhaps you should revisit your issuing
i am sure one can forward issue too little, and too much
perhaps you are forward issuing too much
if you control the amount of work you forward issue, you can get the count down, i would think
for example, if the device can only seat 50 kernel blocks at a time, issuing more than 200 kernel blocks at a time seems unnecessary
you could easily issue the first 100 kernel blocks, an event, another 100 kernel blocks, another event, and thereafter only issue the next 100 kernel blocks once a set of 100 kernel blocks have been completed
this should both ensure the device is well fed, and at the same time limit the number of kernel blocks waiting on the wrong side of the conditionality flag
perhaps looping kernel blocks is another option - you may have work the equivalent of thousands of kernel blocks, but do you truly need thousands of kernel blocks to do the work?
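
by looping kernel blocks i mean something like this (a sketch only, names made up) - what is commonly called a grid-stride loop:

__global__ void looping_kernel( const unsigned int *g_done, float *data, int n )
{
   if( *g_done != 0 )
      return;                                    // nothing left to do

   // each thread strides over the whole data set, so the launch only needs as
   // many blocks as the device can keep resident, not n / 32 of them
   for( int i = blockIdx.x * blockDim.x + threadIdx.x;
        i < n;
        i += gridDim.x * blockDim.x )
   {
      // ... process data[i] ...
   }
}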

I have a long-lived search kernel that I wanted to abort when a thread found “the answer”. I had no problems using a trap instruction to abort the kernel. The code is really as simple as:

*global_answer = found_result;
__threadfence_system();
asm("trap;");

The threadfence works… the CUDA docs don’t really say what “trap” does to pending writes, but in practice the threadfence_system was enough.

What kind of error code is signalled when a trap instruction is encountered?

I don’t even check. My search kernel is infinite and the trap is the only way out.
On the host side, I have an infinite loop: it waits on cudaThreadSynchronize(), records the result, calls cudaDeviceReset(), and starts a new search.

I chose this method because it was far easier to code (3 lines!) than adding checks inside my multiple search loops, and it doesn't add any overhead to the loops themselves. My exceptions happen on the timescale of once every few hours, so kernel launch overhead is negligible.
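
One way to set this up (a simplified sketch; search_kernel, record_result and the launch dimensions are placeholders, not actual code) is to keep global_answer in mapped pinned host memory, so the result stays readable after the trap:

unsigned int *h_answer = 0, *d_answer = 0;

while( true )
{
   cudaSetDeviceFlags( cudaDeviceMapHost );   // allow mapped (zero-copy) host memory;
                                              // legal here since no context exists yet
   cudaHostAlloc( (void **)&h_answer, sizeof( *h_answer ), cudaHostAllocMapped );
   cudaHostGetDevicePointer( (void **)&d_answer, h_answer, 0 );
   *h_answer = 0;

   search_kernel<<< 1024, 256 >>>( d_answer );

   cudaDeviceSynchronize();      // comes back with an error once some thread traps
   cudaGetLastError();           // here the trap is the expected exit, not a failure

   record_result( *h_answer );   // plain host memory, still readable after the trap

   cudaDeviceReset();            // the context is unusable after the trap; reset and
                                 // redo the allocations at the top of the next pass
}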

@little_jimmy
I may have to have the host play a larger role; I was hoping this wouldn’t be necessary since it will diverge architecturally from the dynamic parallelism version. If that is the way it has to be, I guess that is how it goes.

I agree that this is not likely possible. The documentation states in several places that stream manipulation from within a kernel is not desirable/possible. Part of the intent of this thread is to highlight to NVIDIA why some basic stream control from within GPU execution may be desirable. (Maybe dynamic parallelism or drivers to support Vulkan will address this.)

One thing I forgot to mention is that we have a time limit of 17ms, and currently it is estimated that 70-90% of this time is being taken by this (potentially) pointless looping. While there isn't a huge desire to have the pre-dynamic-parallelism version differ, we have been bitten before by the overhead (breaking the pipeline) of communicating commands/results between the CPU and the GPU.

It sounds more and more like this might not be possible given our time vs compute requirements.

Another approach is to remove the idea of different kernels altogether. Combine them into a single uberkernel that decides itself, by a simple switch statement at launch time, which subkernel to run. This is easy and cheap to do.

The significant common drawback of an uberkernel shows up when the different kernels have significantly different resource requirements and/or launch configurations. The uberkernel will likely waste registers, potentially a large number of them. Your thread count is also fixed for all the kernels, since they share the same launch configuration.

But sometimes the inflexibility and resource waste isn't too bad, and if so, you can eliminate most of that scheduling hassle, complexity, and inefficiency since the CPU becomes irrelevant… it just fires off the "same" kernel a dozen times, and the kernel itself decides when to switch from the initialization to the refinement to the reporting subkernels.

Google “CUDA uberkernel” for more examples of the idea.
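
A rough sketch of the idea (the stage names and d_stage are made up, not a specific implementation): every launch runs the same kernel, and a device-side stage variable selects which piece of work is performed, so the CPU never needs to look at intermediate results.

enum Stage { STAGE_C, STAGE_D, STAGE_E, STAGE_F };

__global__ void uber_kernel( int *d_stage /*, ... shared arguments ... */ )
{
   switch( *d_stage )   // each launch reads the current stage
   {
   case STAGE_C:
      // ... one C refinement pass; once it converges, a designated thread
      //     (e.g. the last block to finish) sets *d_stage = STAGE_D, which
      //     takes effect on the next launch ...
      break;
   case STAGE_D:
      // ... decision: sets *d_stage to STAGE_E or STAGE_F ...
      break;
   case STAGE_E:
      // ... one E refinement pass; on convergence set *d_stage = STAGE_F ...
      break;
   case STAGE_F:
      // ... gather the final solution and statistics ...
      break;
   }
}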

you champion dynamic parallelism in a sense, and you are free to do so
however, in my mind, dp is in competition with the host, and i am not convinced that dp is winning
in many cases, the host can do what dp does, and in many cases better too

“If the ith kernel determines there is no point in continuing, then there are (|C|+|E|-i)*D/32 unnecessary kernel launches with these checks being done.”

“if( 0 != g_ProcessComplete[ threadIdx.whatever ] ) return;”

why a local flag per thread?
why not (have the host) preset a global flag, and have a kernel reset the value, if it “determines there is no point in continuing”?

h_done = 0;
h2d(d_done, h_done);                 // clear the device-side done flag

int lint[3];

lint[0] = 0;                         // 0 until the first batch has been issued
lint[1] = 0;                         // event index for the current batch

while (h_done == 0)
{
   // forward issue a limited batch of blocks (block_dim: whatever block size the kernel uses)
   kernel<<< portion_of_blocks, block_dim, 0, stream >>>( ... );

   d2hAsync(h_done, d_done);         // async read-back of the done flag on 'stream'

   cudaEventRecord(event[lint[1]], stream);

   if (lint[0] == 0)
   {
      lint[0] = 1;                   // first batch: nothing to wait for yet
   }

   else
   {
      cudaEventSynchronize(event[lint[2]]);   // wait for an earlier batch to drain
   }

   lint[2] = lint[1];                // remember the event of the batch just issued
   lint[1]++;

   if (lint[1] >= stream_cnt)        // stream_cnt likely 1 in this case
   {
      lint[1] = 0;
   }
}

from this, i do not see the excessive overhead you are referring to