Warp divergence: branching inside a loop
Hi,
please correct me if I'm wrong.

I have a kernel like this:

[code]
__global__ void someKernel(float* g_in, float* g_out){

const int tid = threadIdx.x; // thread index within the block

if(tid<32){
  //here the first whole warp does some work, the rest waits on the barrier below
  //each thread of the warp reads some data from global to shared memory here
  //(the data is aligned to 32 and every thread of the warp reads a consecutive entry)
  //- it should be a coalesced read, right?
}

__syncthreads(); //here all threads meet and continue

for(int g=0;g<64;++g){ //here we enter a big loop in each thread

  if(tid == 0){
    //here I need just one thread to do some preparations at the beginning
    //of every iteration. The rest waits on the barrier below.
  }

  __syncthreads(); //all-thread meeting here - let's continue

  //here all threads (warps) do some work in every 'for' iteration - no branching
  //here, so all warps are executing in parallel - right?
  //but I'm worried about that big loop and the solo execution of thread 0
  //at the beginning of every iteration :/ - does it serialize execution?

  //DO WORK

  __syncthreads(); //do I need this barrier here?

} //end of big loop

//we're out of the loop
//every thread writes some data from shared back to global
//and the kernel finishes

}
[/code]


Do I get the kernel flow right? Will it work like I wrote in the comments above?
Thank you!

Edit: I mean just the kernel structure - serialization of execution, parallelism. I'm not sure if I get it right.

#1
Posted 08/13/2008 06:52 AM   
Yep, you've got everything correct.

The threads of the first warp will diverge shortly at the beginning of the loop, but then converge so everything is parallel in the guts of the loop.

Here is another way to think of __syncthreads(): it is a barrier to use whenever a thread touches shared memory that will later be read by a different thread. In loops, you need __syncthreads() both before and after the memory is touched so that one thread doesn't get too far ahead and write a value that other threads are still reading, so your __syncthreads() at the end of the loop [b]is[/b] needed. I usually write it like this, so the memory that the __syncthreads() is protecting is more obvious than it is when a __syncthreads() is stuffed at the end of the loop:
[code]
__syncthreads();
if (threadIdx meets condition)
    update shared memory
__syncthreads();
[/code]
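
As a minimal concrete sketch of that pattern (the kernel name, s_val, and the loop bound are only illustrative, not from the posts above):
[code]
// Double-barrier pattern inside a loop: one thread writes shared memory,
// all threads read it, and the two __syncthreads() keep writes and reads apart.
__global__ void barrierPattern(float* g_out){
    __shared__ float s_val;                // written by thread 0, read by everyone
    const int tid = threadIdx.x;

    for(int g = 0; g < 64; ++g){
        __syncthreads();                   // nobody may still be reading s_val from the previous iteration
        if(tid == 0)
            s_val = (float)g;              // single-thread preparation for this iteration
        __syncthreads();                   // make the new value visible to all threads before they read it

        g_out[tid] += s_val;               // all threads/warps work in parallel here
    }
}
[/code]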

#2
Posted 08/13/2008 01:29 PM   
[quote name='MisterAnderson42' date='Aug 13 2008, 03:29 PM']Yep, you've got everything correct.

The threads of the first warp will diverge shortly at the beginning of the loop, but then converge so everything is parallel in the guts of the loop.
[/quote]

Oh, OK then - thanks!
I've modified my kernel a little - I found that in the at-the-beginning-of-the-loop initialization phase I can split my operations into two parallel parts. Please take a look once more; if it's OK, I'll be assured I finally understand this right.

Here it is:
[code]
__global__ void kernel(...){

const int tid = threadIdx.x; // thread index within the block

if(tid < 32){
  //some work done in parallel by the threads of the first warp - the rest waits
  //on the first barrier inside the loop
}

// no __syncthreads() here! I moved it down into the loop to avoid it at the
// end of the loop (according to what you wrote in your last post) - I hope that's OK,
// or do I need it here as well?

for(int i = 0; i<64; ++i) { //big loop begins

 __syncthreads();
 if(tid == 0){
   //some work done by thread 0 (warp 0)
 }
 if(tid == 1){ //OR SHOULD IT BE tid==32 in order to parallelize this initialization??
   //some work done by thread 1 (warp 0)  //or thread 32 (warp 1)
 }
 __syncthreads();

 //some work in parallel - all threads/warps
 //no __syncthreads() here! - it's at the beginning of the loop instead
}

__syncthreads(); //I don't know if I need it here, since I have one at the loop start

//all threads write to global

} //kernel finishes

[/code]

Is this code structure optimal for this kind of algorithm?
1. Initialize at kernel start - parallelizable across 32 threads
//LOOP START
2. Initialize at loop start - parallelizable across 2 threads
3. Parallel work
//LOOP END
4. Kernel "de-briefing" - parallelizable across all threads.


And two quick ones:
First:
What is the right way to pass an array (in shared memory) to a device function?
[code]
__device__ void devFun(float2 * arr){
 //...
}

__global__ void kernel(....){
 __shared__ float2 s_array[76];
 //...
 devFun(&s_array);
}
[/code]

devFun performs some work on the passed array, but the compiler protests when I try to do the above.

Second:

I have a float array of 3136 elements in shared memory - s_out. I need to write it back to global memory - g_out. I have 448 threads per block (14 warps),
so every thread has to write 7 values to global memory (448*7=3136).
3136 is divisible by 16, by 32, by 7, by 14, and by 448 - so I can do a coalesced write, right?
Does this do the trick?
[code]
//at the end of kernel
int addr = blockIdx.x*3136; //block starting address
g_out[addr] = s_out[addr];  //0 ... 447
g_out[addr+448] = s_out[addr+448];  //448 ...895
g_out[addr+896] = s_out[addr+896];  //896...1343
g_out[addr+1344] = s_out[addr+1344];  //1344...1791
g_out[addr+1792] = s_out[addr+1792];  //1792...2239
g_out[addr+2240] = s_out[addr+2240];  //2240...2687
g_out[addr+2688] = s_out[addr+2688];  //2688...3135
[/code]
or the same in a loop:
[code]
//at the end of kernel
int addr = blockIdx.x*3136; //block starting address
for(unsigned int offset=0;offset<3136;offset+=448){
 g_out[addr+offset] = s_out[addr+offset];
}
[/code]


Will the above 7 writes be coalesced? This one is really hard for me :/

Maybe I'm asking for too much, but if you could answer the above questions it would be of great help to me - I need someone to assure me that I got everything right, or prove me wrong. Thank you very, very much!

#3
Posted 08/14/2008 07:41 AM   
[quote name='indigo80' date='Aug 14 2008, 09:41 AM'][code]
__device__ void devFun(float2 * arr){
 //...
}

__global__ void kernel(....){
 __shared__ float2 s_array[76];
 //...
 devFun(&s_array);
}
[/code]
[/quote]

You should use
[code]
devFun(s_array);
[/code]
or
[code]
devFun(&s_array[0]);
[/code]

In your case you are passing the address of the array itself, not a pointer to its first element.
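
For reference, a minimal compilable sketch of the fixed call (the doubling work and the sizes are only illustrative):
[code]
// The array name decays to a pointer to its first element, so passing
// s_array (or &s_array[0]) gives devFun the float2* it expects.
__device__ void devFun(float2* arr, int n){
    for(int i = threadIdx.x; i < n; i += blockDim.x){
        arr[i].x *= 2.0f;   // illustrative work on the shared array
        arr[i].y *= 2.0f;
    }
}

__global__ void kernel(float2* g_in){
    __shared__ float2 s_array[76];
    for(int i = threadIdx.x; i < 76; i += blockDim.x)
        s_array[i] = g_in[i];
    __syncthreads();
    devFun(s_array, 76);    // not devFun(&s_array)
}
[/code]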


Vrah

#4
Posted 08/14/2008 08:03 AM   
[quote name='VrahoK' date='Aug 14 2008, 10:03 AM']In your case you are passing the address of the array itself, not a pointer to its first element.
Vrah
[/quote]

What a dumb mistake :argh: thanks!
So two things left - code structure and coalescing. Anyone? B)

#5
Posted 08/14/2008 08:15 AM   
[quote name='indigo80' date='Aug 14 2008, 09:41 AM'][code]
//at the end of kernel
int addr = blockIdx.x*3136; //block starting address
for(unsigned int offset=0;offset<3136;offset+=448){
 g_out[addr+offset] = s_out[addr+offset];
}
[/code]
[/quote]

Though I'm not 100% sure, I think this should work if you take threadIdx.x as the starting offset and your g_out is aligned. Then the writes within each warp follow each other, so they should be coalesced.

[code]
//at the end of kernel
int addr = blockIdx.x*3136; //block starting address
for(unsigned int offset = threadIdx.x; offset < 3136; offset += 448){
 g_out[addr+offset] = s_out[addr+offset];
}
[/code]

Edit:
You can also write your code more flexibly with respect to the block and grid dimensions:
[code]
//at the end of kernel
int addr = blockIdx.x * 7 * blockDim.x; //block starting address
for(unsigned int offset = threadIdx.x; offset < 7 * blockDim.x; offset += blockDim.x){
 g_out[addr+offset] = s_out[addr+offset];
}
[/code]

The only thing I'm not sure about is whether the starting address meets the coalescing requirements. (blockIdx.x * 7 * 14 * 32)
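
For what it's worth, here is how I would sketch the whole write-back (a sketch only, using the 448-thread/3136-float sizes from the posts above; note that s_out is the block's own shared array, so it is indexed with just the offset, without the block's global offset):
[code]
// Write-back sketch: 448 threads per block, 7 floats per thread (448*7 = 3136).
// g_out is indexed globally; s_out is per-block shared memory.
__global__ void writeBack(float* g_out){
    __shared__ float s_out[3136];
    // ... s_out is filled earlier in the kernel ...
    __syncthreads();

    const int base = blockIdx.x * 3136;              // block's first element in g_out
    for(unsigned int offset = threadIdx.x; offset < 3136; offset += blockDim.x){
        g_out[base + offset] = s_out[offset];        // consecutive threads write consecutive addresses
    }
}
[/code]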

#6
Posted 08/14/2008 08:33 AM   
There is an error in my code, it of course should be:
[code]
//at the end of kernel
int addr = blockIdx.x*3136+tid; //block starting address
for(unsigned int offset=0;offset<3136;offset+=448){
g_out[addr+offset] = s_out[addr+offset];
}
[/code]
I've added tid to addr.

So it's the same as what you (thanks for the correction) wrote:
[code]
//at the end of kernel
int addr = blockIdx.x*3136; //block starting address
for(unsigned int offset = threadIdx.x; offset < 3136; offset += 448){
g_out[addr+offset] = s_out[addr+offset];
}
[/code]

And the final question remains unanswered - is this coalesced?
You state that writing to global memory like the above should be OK, and the only uncertain thing is the block starting address (bid*7*14*32 = bid*3136) - so... when is the starting address OK? :huh:

According to Programming Guide:
[quote]HalfWarpBaseAddress should be aligned to 16*sizeof(type) bytes (i.e. be a multiple of 16*sizeof(type)). Any address BaseAddress of a variable residing in global memory or returned by one of the memory allocation routines from Sections D.5 or E.8 is always aligned to at least 256 bytes, so to satisfy the memory alignment constraint, HalfWarpBaseAddress-BaseAddress should be a multiple of 16*sizeof(type).[/quote]

So... the type is float, so 16*sizeof(float) is 64. The starting address for each block is divisible by 64: bid*3136/64 = bid*49. The starting offset of each of the 7 writes is also divisible by 64:
448/64 = 7
896/64 = 14
1344/64 = 21
1792/64 = 28
2240/64 = 35
2688/64 = 42
So far everything is divisible by 64, but...
[quote]HalfWarpBaseAddress-BaseAddress should be a multiple of 16*sizeof(type)[/quote]
OK, let's then take the first write (offset 0) in block 1.
The block starting address is then 1*3136 = 3136.
Take the second warp in this block and its first half-warp - the first thread of this half-warp is thread 32 -> so the half-warp starting address is 3136+32 = 3168.
So: HalfWarpBaseAddress - BaseAddress = 3168 - 3136 = (of course) 32.
32/(16*sizeof(float)) = 32/64 = 0.5 - so it isn't divisible... but sticking to my computations, coalesced access never occurs, so I think I don't get it :(

#7
Posted 08/14/2008 10:07 AM   
Anyone? :(

#8
Posted 08/15/2008 10:37 PM   
I think it will be coalesced. You're writing float values, which are 4 bytes each, and the Guide's addresses are in bytes: the block base is 3136*4 bytes and each half-warp starts a further 16*4 = 64 bytes along, so every half-warp base address is a multiple of 64 bytes and you're all set :)
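
Spelling the same check out with the numbers from the posts above (just a worked example in bytes):
[code]
// The Programming Guide's 16*sizeof(type) requirement is in bytes:
// 16 * sizeof(float)                    = 64 bytes
// block 1 base:       3136 floats * 4   = 12544 bytes -> 12544 / 64 = 196  (multiple of 64)
// half-warp at tid 32:  +32 floats * 4  = +128  bytes -> 128   / 64 = 2    (multiple of 64)
// So HalfWarpBaseAddress - BaseAddress = 128 bytes, a multiple of 16*sizeof(float);
// the earlier "32/64 = 0.5" divided an element count by a byte count.
[/code]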

#9
Posted 08/18/2008 10:50 PM   
You can always run your code with the visual profiler and actually measure if it is coalesced or not.

#10
Posted 08/19/2008 12:19 PM   
[quote name='MisterAnderson42' date='Aug 19 2008, 02:19 PM']You can always run your code with the visual profiler and actually measure if it is coalesced or not.
[/quote]

Man, you made my day - the simplest solutions are often the best ones. My fault that I didn't know about the profiler before :">

Checked, and yes, things are coalesced!

#11
Posted 08/20/2008 04:43 PM   
[quote name='indigo80' date='Aug 20 2008, 09:43 AM']Man, you made my day - the simplest solutions are often the best ones. My fault that I didn't know about the profiler before :">

Checked, and yes, things are coalesced!
[/quote]
Keep in mind that the profiler will report things as coalesced on GT200 even when they're clearly not...

#12
Posted 08/20/2008 04:56 PM   