How to do masking in the most efficient way in CUDA?

Hello dear CUDA-community,

I would like to do masking in CUDA. The problem is that my mask consists of many (up to 30) small AOIs (5x10 to 15x30 pixels) which differ in size and position from frame to frame. The image the mask has to be applied to is 2k x 2k.
I was thinking about uploading the AOIs as an array[30][4] to the device and using them directly in my kernel, but this will lead to quite inefficient code because of all the if-cases.
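Roughly what I had in mind (just a sketch with made-up names; I assume here that masked pixels are simply set to zero):

// up to 30 AOIs, each stored as {x, y, width, height}
__constant__ int aoi[30][4];

__global__ void maskKernel(unsigned short* image, int width, int height, int numAoi)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    for (int i = 0; i < numAoi; ++i)
    {
        // one if-case per AOI, evaluated for every pixel of the 2k x 2k image
        if (x >= aoi[i][0] && x < aoi[i][0] + aoi[i][2] &&
            y >= aoi[i][1] && y < aoi[i][1] + aoi[i][3])
        {
            image[y * width + x] = 0;
        }
    }
}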

I can also generate the full mask on the host and upload it to the device for every single frame, but is this really the best approach?

Thank you very much for your suggestions in advance.

Best
Greg

“I was thinking about uploading the AOIs as an array[30][4] to the device and using them directly in my kernel, but this will lead to quite inefficient code because of all the if-cases.”

you could perhaps avoid the if’s by localizing the mask data per frame (or whatever other construct/base you access and process the image by) and then scattering it ‘internally’ - in local or shared memory

after reading the image data:
  for frame x:
    for each of the mask elements of frame x,
    starting at the offset of frame x’s mask data:
      read the local address the mask element applies to
      update the image data at that local address
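
in code, the per-frame mask data could be laid out roughly like this (a sketch; the flat layout and all names are my own):

// per frame: a compact list of mask records, grouped per thread block
// each record: the block-local pixel index it applies to, plus the mask value
int   mask_count_per_block[NUM_BLOCKS];   // how many records each block owns
int   mask_offset_per_block[NUM_BLOCKS];  // where each block's records start
int   mask_address_data[MAX_RECORDS];     // within-block pixel indices
float mask_value_data[MAX_RECORDS];       // e.g. 0.0f = masked, 1.0f = pass-through

// uploading this per frame is cheap: 30 AOIs of at most 15x30 pixels is
// roughly 13500 records, far less than a full 2k x 2k mask
cudaMemcpy(d_mask_address_data, mask_address_data,
           num_records * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_mask_value_data, mask_value_data,
           num_records * sizeof(float), cudaMemcpyHostToDevice);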

“I can also generate the full mask on the host and upload it to the device for every single frame, but is this really the best approach?”

i think much may depend on how you access/utilize the image data
if this occurs in frames, and if the host is not overly occupied, it can always help with preprocessing
the host would be preprocessing the next frame whilst the device is processing the current frame
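
as a sketch of that overlap (standard stream handling; MaskRecord, build_mask_records and process_frame are made-up names):

// pinned host buffer, so the async copy can truly overlap
// (in practice you would double-buffer it)
MaskRecord* h_mask_records;
cudaMallocHost((void**)&h_mask_records, records_bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

for (int f = 0; f < num_frames; ++f)
{
    build_mask_records(f, h_mask_records);   // hypothetical host-side helper

    cudaMemcpyAsync(d_mask_records, h_mask_records, records_bytes,
                    cudaMemcpyHostToDevice, stream);
    process_frame<<<grid, block, 0, stream>>>(d_image, d_mask_records);

    // the host returns here immediately and is free to prepare frame f+1
}

cudaStreamSynchronize(stream);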

Perhaps use an “alpha” channel to do the masking. Express it as a 0.0 to 1.0 floating point value.

Alpha 1.0 would be opaque and Alpha 0.0 would be transparent.

Then it would be something like:
DestMask = 1 - Alpha;
SourceMask = Alpha;

Buffer[i] = Buffer[i] * DestMask + Input[i] * SourceMask;
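
As a CUDA kernel, that could look roughly like this (just a sketch; the buffer names are made up):

__global__ void alphaBlend(float* buffer, const float* input,
                           const float* alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float sourceMask = alpha[i];      // 1.0 = opaque, 0.0 = transparent
        float destMask   = 1.0f - sourceMask;
        buffer[i] = buffer[i] * destMask + input[i] * sourceMask;
    }
}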

Does that make sense? ;)

Hi,

thank you very much Skybuck and little_jimmy for your suggestions.
The alpha channel is great, but unfortunately it’s not going to work for me, because I’m uploading only raw grayscale images and doing the demosaicing on the GPU as well.

little_jimmy’s approach sounds more promising for my goals.
If I got it correctly: knowing the base address of the image in device memory, I should access the few masked pixels directly (or via shared memory) and set them to the expected value manually, right?

Cheers
greg

the point being:

if you are going to work on the data, at some point the thread block is going to read that data from device memory

you could then simply provide the thread block with additional (localized) data pertaining to the masking; something like below
i shall use shared memory, but it may be local memory just as well
it is a very elementary example, but the principle should be evident

__shared__ float shared_data[BLOCK_SIZE];   // BLOCK_SIZE == blockDim.x

// stage this block's slice of the image into shared memory
shared_data[threadIdx.x] = primary_data_in_gbl_memory[offset + threadIdx.x];

__syncthreads();

if (threadIdx.x < mask_elements_within_the_block)
{
    // i keep the image data in shared memory here; the mask records themselves
    // fit in registers/local memory just as well
    // i implement the mask as a multiplication, which may be more illustrative than practical

    // mask_offset points at this block's mask records; the stored addresses are block-local
    int   local_address = mask_address_data[mask_offset + threadIdx.x];
    float mask_value    = mask_value_data[mask_offset + threadIdx.x];

    shared_data[local_address] *= mask_value;
}

__syncthreads();

to localize the data, you have to recalculate the mask element addresses as within-thread-block addresses
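
i.e. when building the records on the host, something like (sketch):

// a mask pixel at global (flat) index g belongs to the block covering g;
// its within-thread-block address is just the remainder
int block_id      = g / BLOCK_SIZE;
int local_address = g % BLOCK_SIZE;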

I don’t understand your problem though - what do you mean by if-cases?