I don't understand thread/block division
Hi,

I have a simple kernel:

[code]
__global__ void fwd_conv_shared( int *d_output, int nData )
{
    int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;

    bool invalid = ( x < nData );   // true when x is in range, despite the name

    if( invalid )
    {
        //d_output[ x ] = threadIdx.x;
        d_output[ x ] = blockIdx.x;
    }
}
[/code]

and when I call it this way:

[code]
dim3 block( 8, 1, 1 );
dim3 gridSize( 1, 1, 1 );

fwd_conv_shared<<< gridSize, block >>>( (int *)dDstData, nData );
[/code]

I obtain in dDstData (nData = 8):

0, 0, 0, 0, 0, 0, 0, 0
and when I store threadIdx.x instead:
0, 1, 2, 3, 4, 5, 6, 7

OK, but when I call it this way:

[code]
dim3 blockSize( 8, 1, 1 );
dim3 gridSize( 3, 1, 1 ); //!!

fwd_conv_shared<<< gridSize, blockSize >>>( (int *)dDstData, nData );
[/code]

I've got:

threadIdx.x = 0, 1, 2, 3, 4, 0, 1, 2

and:
blockIdx.x = 0, 0, 0, 0, 0, 1, 1, 1


I expected that with a block size of 8, the kernel would always run 8 threads per block.
I don't understand why the work was split into two blocks like this. How do I make CUDA run a fixed number of threads per block?

It's a nightmare when I try to use blockIdx.x to index shared memory.

I'll be very grateful for your answers,

Best regards,
Jakub

//CUDA 4.1, GTX480
#1
Posted 04/20/2012 01:51 PM   
You should allocate dDstData to match the size of the problem. When you launch 3 blocks of 8 threads each you have 24 threads, but nData is only 8. Also, I don't understand this line:

int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x ;

It is fine to just use int x = blockIdx.x*blockDim.x + threadIdx.x ;
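For example, here is a minimal sketch of the usual pattern (the kernel name fill_block_ids is just illustrative; nData and dDstData are from your post):

[code]
// One thread per element; the bounds check lets the extra threads in the
// last block fall through harmlessly.
__global__ void fill_block_ids( int *d_output, int nData )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if ( x < nData )
        d_output[ x ] = blockIdx.x;
}

// Host side: round the grid size up so that all nData elements are covered.
int threadsPerBlock = 8;
int blocksPerGrid   = ( nData + threadsPerBlock - 1 ) / threadsPerBlock;
fill_block_ids<<< blocksPerGrid, threadsPerBlock >>>( (int *)dDstData, nData );
[/code]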
#2
Posted 04/20/2012 02:26 PM   
Thank you pasoleatis,

dDstData is allocated as an 8-element array in device linear memory.

__umul24( blockIdx.x, blockDim.x ) + threadIdx.x ;

is equivalent to

int x = blockIdx.x*blockDim.x + threadIdx.x;

I know that in the second case I run 24 threads on an 8-element data table; however, I thought that all the data would be processed in the first block, and that in the second and third blocks "invalid" would be "false".
I don't know why blockDim.x appears to be 5 instead of 8.

regards,
J
#3
Posted 04/20/2012 02:55 PM   
[quote name='JKPie' date='20 April 2012 - 03:55 PM' timestamp='1334933758' post='1398742']
Thank you pasoleatis,

dDstData is allocated as an 8-element array in device linear memory.

__umul24( blockIdx.x, blockDim.x ) + threadIdx.x ;

is equivalent to

int x = blockIdx.x*blockDim.x + threadIdx.x;

I know that in the second case I run 24 threads on an 8-element data table; however, I thought that all the data would be processed in the first block, and that in the second and third blocks "invalid" would be "false".
I don't know why blockDim.x appears to be 5 instead of 8.

regards,
J
[/quote]
Hello,

Yes, you are right; there should be no problem with your code. Check the last error from the kernel call.
Just a thought: call the kernel like this:
[code] fwd_conv_shared<<< 3, 8 >>>( (int *)dDstData, nData ); [/code]


[code]
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err )
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg,
                cudaGetErrorString( err ) );
        exit(EXIT_FAILURE);
    }
}

// Check for any CUDA errors after the kernel launch
checkCUDAError("kernel invocation");
[/code]

Source: http://www.drdobbs.com/parallel/207603131?pgno=3
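One caveat: kernel launches are asynchronous, so cudaGetLastError() immediately after the launch only catches launch errors (bad configuration and the like). To also catch errors raised while the kernel runs, synchronize first; a small sketch reusing checkCUDAError from above:

[code]
fwd_conv_shared<<< 3, 8 >>>( (int *)dDstData, nData );
checkCUDAError("kernel invocation");   // launch-time errors

cudaDeviceSynchronize();               // wait for the kernel to finish
checkCUDAError("kernel execution");    // errors raised during execution
[/code]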
#4
Posted 04/20/2012 03:28 PM   
I've checked CUDA errors, with the result "no error".

However, I did not paste the whole kernel code above.

In fact I have:
[code]
__global__ void fwd_conv_shared( int *d_output, int nData )
{
    int shift = 3 * blockIdx.x;
    int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    x -= shift;

    bool invalid = ( x < nData );

    if( invalid )
    {
        //d_output[ x ] = threadIdx.x;
        d_output[ x ] = blockIdx.x;
    }
}
[/code]

But for the first block, shift should be 0, so I did not paste this part.
However, when I remove the "shift" part, or set shift manually to 0, everything seems to be OK.
#5
Posted 04/20/2012 04:16 PM   
The difference with shift is that "invalid" is set after the subtraction of shift. This means that the first three threads of the block with blockIdx.x=1 will also be valid, since they have x = 5, 6, 7. In your case, it appears that the block with blockIdx.x=1 is executed after the block with blockIdx.x=0 and thereby overwrites its results.
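If those overlap writes are unintended, one fix is to let each block skip the threads that land on elements already covered by the previous block. A minimal sketch under that assumption:

[code]
__global__ void fwd_conv_shared( int *d_output, int nData )
{
    int shift = 3 * blockIdx.x;
    int x = blockIdx.x * blockDim.x + threadIdx.x - shift;

    // The first 3 threads of every block after block 0 map onto elements
    // already written by the previous block, so skip them.
    bool overlapThread = ( blockIdx.x > 0 && threadIdx.x < 3 );

    if ( x < nData && !overlapThread )
        d_output[ x ] = blockIdx.x;
}
[/code]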
#6
Posted 04/20/2012 04:53 PM   
By the way, the __umul24 function is slower on Fermi than just doing a normal integer multiplication.
#7
Posted 04/21/2012 07:15 AM   
Thank you very, very much for the explanation. I've got one last question.

Now I have nData = 8.

I run the kernel:

[code]
__global__ void fwd_conv_shared( int *d_output, int nData )
{
    int x = (int)( blockIdx.x * ( blockDim.x - 3 ) + threadIdx.x );
    d_output[ x ] = blockIdx.x;
}
[/code]
(Each block overlaps the last three elements of the previous block.)
[code]
fwd_conv_shared<<< 5, 4 >>>( (int *)dDstData, nData );
[/code]

so I expected a result like:
0, 1, 2, 3, 4, 4, 4, 4
but I got:
0, 1, 2, 2, 2, 2, 4, 4

Why does block number 2 overwrite the data of block number 4 if it was executed before the last block?
I tried atomicExch(), __threadfence() and __syncthreads(), with the same result.
#8
Posted 04/23/2012 11:41 AM   
[quote name='JKPie' date='23 April 2012 - 12:41 PM' timestamp='1335181283' post='1399785']
Thank you very, very much for the explanation. I've got one last question.

Now I have nData = 8.

I run the kernel:

[code]
__global__ void fwd_conv_shared( int *d_output, int nData )
{
    int x = (int)( blockIdx.x * ( blockDim.x - 3 ) + threadIdx.x );
    d_output[ x ] = blockIdx.x;
}
[/code]
(Each block overlaps the last three elements of the previous block.)
[code]
fwd_conv_shared<<< 5, 4 >>>( (int *)dDstData, nData );
[/code]

so I expected a result like:
0, 1, 2, 3, 4, 4, 4, 4
but I got:
0, 1, 2, 2, 2, 2, 4, 4

Why does block number 2 overwrite the data of block number 4 if it was executed before the last block?
I tried atomicExch(), __threadfence() and __syncthreads(), with the same result.
[/quote]
Hello,

You expect a specific order of execution for the blocks. In practice there is no guaranteed order in which the blocks are executed. An atomic operation ensures that if an address is read, no other thread can access that address until the result is written back. The __threadfence() function makes sure a result is visible in global memory to the other threads, while __syncthreads() only synchronises the threads within a given block. None of them imposes an ordering such as "block 1 must finish before block 2".

In the programming guide there is an example of how to make the last block to finish perform a certain task on data produced by the other blocks, but I do not think it is possible to impose an order of execution such as block 0 runs, then block 1, and so on.

The code below is for a reduction such as a sum. Each block computes a partial sum and writes the result to global memory, and the last block to be executed performs the final summation.

[quote]
In general, when a thread issues a series of writes to memory in a particular order, other threads may see the effects of these memory writes in a different order. __threadfence_block(), __threadfence(), and __threadfence_system() can be used to enforce some ordering. One use case is when threads consume some data produced by other threads as illustrated by the following code sample of a kernel that computes the sum of an array of N numbers in one call. Each block first sums a subset of the array and stores the result in global memory. When all blocks are done, the last block done reads each of these partial sums from global memory and sums them to obtain the final result. In order to determine which block is finished last, each block atomically increments a counter to signal that it is done with computing and storing its partial sum (see Section B.11 about atomic functions). The last block is the one that receives the counter value equal to gridDim.x-1. If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and therefore, might reach gridDim.x-1 and let the last block start reading partial sums before they have been actually updated in memory.
[/quote]
[code]
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,
                    float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);
    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory
        result[blockIdx.x] = partialSum;
        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();
        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);
        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();
    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}
[/code]
#9
Posted 04/23/2012 12:00 PM   
OK, thank you, I understand it now.

The thing I wanted to do to implement my algorithm was something like:

[code]
int x = (int)( blockIdx.x * ( blockDim.x - 3 ) + threadIdx.x );
d_output[ x ] += something;
[/code]

Each block adds something to the last three elements of the previous block. It's something like a Gaussian forward operator, or any other forward pixel operator that spreads a pixel's value in a new coordinate system. Because I got an error, I wanted to understand how the blocks execute.

I tried calling __syncthreads() and memory synchronization functions because I thought there was something like a read-after-write hazard between blocks, which I couldn't explain.

Interestingly, in the code above atomicAdd instead of += works correctly; however, I don't want to use atomic functions because they can kill performance.
#10
Posted 04/23/2012 12:51 PM   
[quote name='JKPie' date='23 April 2012 - 01:51 PM' timestamp='1335185487' post='1399819']
OK, thank you, I understand it now.

The thing I wanted to do to implement my algorithm was something like:

[code]
int x = (int)( blockIdx.x * ( blockDim.x - 3 ) + threadIdx.x );
d_output[ x ] += something;
[/code]

Each block adds something to the last three elements of the previous block. It's something like a Gaussian forward operator, or any other forward pixel operator that spreads a pixel's value in a new coordinate system. Because I got an error, I wanted to understand how the blocks execute.

I tried calling __syncthreads() and memory synchronization functions because I thought there was something like a read-after-write hazard between blocks, which I couldn't explain.

Interestingly, in the code above atomicAdd instead of += works correctly; however, I don't want to use atomic functions because they can kill performance.
[/quote]
In the case above you have no choice, because you have a race condition. Either you use atomicAdd (if there are not many additions), or you need some other way to combine the partial results, such as defining a global temporary array sized to the number of blocks so that each block saves its partial result there.
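A minimal sketch of the atomicAdd variant, reusing the overlapping indexing from your post ("something" is a placeholder for whatever your operator produces):

[code]
__global__ void fwd_conv_shared( int *d_output, int nData )
{
    int x = blockIdx.x * ( blockDim.x - 3 ) + threadIdx.x;

    if ( x < nData )
    {
        int something = 1;   // placeholder contribution
        // atomicAdd serialises concurrent updates to the same element, so the
        // overlapping writes from neighbouring blocks no longer race.
        atomicAdd( &d_output[ x ], something );
    }
}
[/code]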
#11
Posted 04/23/2012 12:56 PM   