OpenCL: are there instruction limitations? opencl, instruction
Hi

I plan to code something in OpenCL, using the überkernel pattern.
That means a given kernel would have this structure:

[code]__kernel void my_uber_kernel(void)
{
    while(...)
    {
        if(stage==...)
        {
            device_function_0() ;
        }
        else if(stage==...)
        {
            device_function_1() ;
        }
        // etc...
        stage = stage + 1 ;
    }
}[/code]

Each [code]device_function_X()[/code] potentially contains a substantial amount of code.
I'm wondering whether there are known limits on the number of instructions supported (per thread?) before performance is impacted.

Does splitting the processing into small device function calls help optimization?
Or do I have to split the processing into several kernel calls (so that the above-mentioned [i]device_function_X[/i] become kernels)?

#1
Posted 01/31/2012 05:16 PM   
Doing this in separate kernel launches adds the overhead of reading from and writing to global memory (hundreds of cycles). The best thing to do here is to compute everything in one kernel, keep temporary results in registers, and write the results just once.
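
To make the trade-off concrete, here is a minimal host-side sketch in plain C. It is not OpenCL code; each loop iteration only emulates one work-item. The names `stage0`, `stage1`, `run_split`, `run_fused` and the counter `global_accesses` are hypothetical and exist only for this illustration. The sketch counts accesses to the arrays standing in for global memory, once for a split version (two "kernels" with an intermediate buffer) and once for a fused version that keeps the intermediate in a local variable, which a GPU compiler would normally try to hold in a register.

```c
#include <stddef.h>

/* Hypothetical per-stage computations, standing in for device_function_X(). */
static float stage0(float x) { return x * 2.0f; }
static float stage1(float x) { return x + 1.0f; }

/* Counts reads and writes to the arrays that stand in for global memory. */
static unsigned long global_accesses = 0;

static float load(const float *p, size_t i) { global_accesses++; return p[i]; }
static void store(float *p, size_t i, float v) { global_accesses++; p[i] = v; }

/* Split version: two "kernel launches"; the intermediate goes through tmp. */
static void run_split(const float *in, float *tmp, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)          /* "kernel" 0 */
        store(tmp, i, stage0(load(in, i)));
    for (size_t i = 0; i < n; i++)          /* "kernel" 1 */
        store(out, i, stage1(load(tmp, i)));
}

/* Fused version: one "kernel"; the intermediate stays in a local variable. */
static void run_fused(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float r = load(in, i);   /* one read per element */
        r = stage0(r);
        r = stage1(r);
        store(out, i, r);        /* one write per element */
    }
}
```

In this model the split version makes 4 accesses per element and the fused version makes 2. On a real device the actual cost also depends on caching, coalescing and register pressure, so treat the counts as a rough guide only.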

GPU Developer at AccelerEyes ([email="support@accelereyes.com"]Email me[/email])

#2
Posted 01/31/2012 05:54 PM   
[quote name='short' date='31 January 2012 - 06:54 PM' timestamp='1328032469' post='1363149']
Doing this in separate kernel launches will include reading / writing to global memory overheads ( 100's of cycles ). Best thing to do here is to compute all the stuff in one kernel, keep temporary results in registers and write results just once.
[/quote]

The memory overhead is one reason why I chose the überkernel approach, but what if, in the end, the kernel contains something like 10,000 lines of code (all calls inlined)?

#3
Posted 01/31/2012 07:09 PM   
The maximum kernel size (the limit is on the kernel, not the thread) is 2,000,000 assembly instructions (I don't think that changed with Fermi).

The thing you may need to watch is instruction cache pollution. You don't want too much code inside an if conditional where the block diverges, as that pollutes the instruction cache and can degrade performance. It can also cause issues if you have multiple blocks per multiprocessor and they diverge.

Whether it's better to split into multiple kernels or use a single überkernel depends on your actual code. Going to global memory is very expensive, generally much more so than instruction cache pollution, but there are exceptions.
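
As a rough way to reason about the divergence point above, here is a small plain C sketch (host-side, not OpenCL). It assumes the usual SIMT behaviour in which a group of work-items that take different branches executes every taken branch in turn, so the amount of code fetched grows with the number of distinct branches. The function name `branches_executed` and the constant `NUM_STAGES` are hypothetical.

```c
#include <stddef.h>

#define NUM_STAGES 4   /* assumed number of branches in the überkernel */

/* Returns how many distinct branches a group of work-items takes.
   stage[i] is the branch selected by work-item i, in [0, NUM_STAGES).
   Out-of-range values are ignored. */
static int branches_executed(const int *stage, size_t group_size)
{
    int seen[NUM_STAGES] = {0};
    int count = 0;
    for (size_t i = 0; i < group_size; i++) {
        int s = stage[i];
        if (s >= 0 && s < NUM_STAGES && !seen[s]) {
            seen[s] = 1;   /* first work-item to take this branch */
            count++;
        }
    }
    return count;
}
```

If every work-item in the group is at the same `stage`, the result is 1 and only one branch body has to be fetched. The more the stages differ inside a group, the more branch bodies compete for the instruction cache.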

#4
Posted 02/05/2012 01:17 PM   
And do you know the order of magnitude of the program cache (instruction cache) size?
Something like 64 KB?

#5
Posted 02/05/2012 06:04 PM   
The instruction cache is in the constant cache. If memory serves, it's 8 KB.
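
For a back-of-the-envelope feel of what 8 KB means, the plain C helper below divides a cache size by an instruction size. Both the function name `instructions_in_cache` and the 8-byte instruction encoding used in the note after it are assumptions for illustration; the real encoding depends on the architecture.

```c
/* Number of whole instructions that fit in a cache of cache_bytes,
   assuming every instruction takes bytes_per_instruction bytes.
   Returns 0 if bytes_per_instruction is 0. */
static unsigned instructions_in_cache(unsigned cache_bytes,
                                      unsigned bytes_per_instruction)
{
    if (bytes_per_instruction == 0)
        return 0;
    return cache_bytes / bytes_per_instruction;
}
```

With an assumed 8 bytes per instruction, `instructions_in_cache(8 * 1024, 8)` gives 1024, which is far less than a 10,000-line kernel with all calls inlined is likely to compile to.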

#6
Posted 02/05/2012 06:46 PM   
Best thing to do here is to compute all the stuff in one kernel.

#7
Posted 03/14/2012 04:22 AM   