parallelization on an array of numbers with CUDA
I'm having trouble parallelizing an operation on an array of numbers with CUDA.

So, for example, if we have an array M containing the numbers (1, 2, 3, 4, 5)

and I were to remove the number 2 from the array and shift everything after it to the left,

the resulting array would be (1, 3, 4, 5, 5),

where M[1] = M[2], M[2] = M[3], M[3] = M[4].

My question is: how can we do this in parallel in CUDA? When we parallelize this there might be a race condition, because the number 2 (M[1]) might not be the first one to act: if M[2] were the first one to shift, the resulting array would become (1, 4, 4, 5, 5). Is there any method to handle this? I'm fairly new to CUDA, so I'm not sure what to do.

My current code is as follows:

```
__global__ void gpu_shiftSeam(int *MCEnergyMat, int *seam, int width, int height, int currRow)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    int index = i + width * j;

    if (i < width && j < height)
    {
        // shift values of -1 to the side of the image
        if (MCEnergyMat[i + width * j] == -1) // check if it's equal to -1, if so remove and start shifting
        {
            if (i + 1 != width) // check boundary
                MCEnergyMat[index] = MCEnergyMat[index + 1];
        }
        if (seam[j] < i) // check if it's past the "-1" shifting point, if so start shifting, else don't do anything
        {
            if (i + 1 != width) // check boundary
                MCEnergyMat[index] = MCEnergyMat[index + 1];
        }
    }
}
```

Here seam[j] contains the index I would like to remove in row j, and MCEnergyMat is just a 2D array flattened into a 1D array. However, my code does not work, and I believe a race condition is the problem.

Thanks!

#1
Posted 04/25/2012 08:59 PM   
Use different arrays for input and output.
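A minimal sketch of what I mean (untested, and the kernel/parameter names are just placeholders, using the same row-major layout as your code): every thread computes one element of a separate output buffer, so no element is ever read after another thread has overwritten it:

```
__global__ void gpu_shiftSeamOut(const int *in, int *out, const int *seam,
                                 int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row

    if (i < width && j < height)
    {
        // Columns left of the seam are copied unchanged; columns at or past
        // the seam read their right-hand neighbour. The last column simply
        // repeats itself, matching the (1, 3, 4, 5, 5) example above.
        int src = (i < seam[j]) ? i : min(i + 1, width - 1);
        out[j * width + i] = in[j * width + src];
    }
}
```

You can then either copy the result back or just swap the in/out pointers before the next pass.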

If you have to do this in place, use a single warp for each line (looping over i), so you have full control of the order of operations within each line. This might provide insufficient parallelism to fully load large GPUs, but this operation is memory bandwidth bound anyway so it doesn't matter that much.
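Again only a sketch, assuming one warp (32 threads) per row, launched e.g. as gpu_shiftSeamWarp<<<height, 32>>>(d_M, d_seam, width, height). Each lane reads its right-hand neighbour first, the warp synchronizes, and only then writes, so no element is overwritten before it has been read:

```
__global__ void gpu_shiftSeamWarp(int *M, const int *seam, int width, int height)
{
    int lane = threadIdx.x;      // 0..31, blockDim.x must be 32 here
    int row  = blockIdx.x;       // one row per block

    if (row >= height) return;   // row is uniform, so the whole warp exits together

    int *line  = M + row * width;
    int  start = seam[row];      // first column that has to be overwritten

    // Walk the row from left to right in chunks of 32 columns.
    for (int base = start; base < width - 1; base += 32)
    {
        int i = base + lane;
        int v = 0;
        if (i < width - 1)
            v = line[i + 1];     // every lane reads its neighbour first
        __syncwarp();            // all reads of this chunk are done...
        if (i < width - 1)
            line[i] = v;         // ...before anyone overwrites
    }
}
```

(__syncwarp() only exists in CUDA 9 and later; it makes the read-before-write ordering explicit, whereas older code relied on the implicit lockstep execution of a warp.)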

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
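For example, a small error-checking wrapper (just a sketch, the macro name is made up) could look like this:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information as soon as any CUDA runtime call fails.
#define CUDA_CHECK(call)                                                     \
    do {                                                                     \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice));
//   gpu_shiftSeam<<<grid, block>>>(d_M, d_seam, width, height, row);
//   CUDA_CHECK(cudaGetLastError());        // launch configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // errors raised by the kernel itself
```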

#2
Posted 04/25/2012 10:42 PM   
[quote name='tera' date='25 April 2012 - 10:42 PM' timestamp='1335393774' post='1401020']
Use different arrays for input and output.

If you have to do this in place, use a single warp for each line (looping over i), so you have full control of the order of operations within each line. This might provide insufficient parallelism to fully load large GPUs, but this operation is memory bandwidth bound anyway so it doesn't matter that much.
[/quote]


Hmm, could you explain a little bit more? I'm not familiar with warps, and how do I use one to loop over i?

Thanks!

#3
Posted 04/26/2012 04:29 AM   