I have written my own small separable convolution program with two separate kernels, one for row convolution and one for column convolution. It is very similar to the C version in the SDK, except written in parallel (the loop-unrolling version in the SDK did not fit what I needed exactly). The user calls the program and tells it how many times to run the convolution. Currently it copies the array (call it arrayA) to the GPU, the first kernel saves its result in arrayB on the GPU, then arrayB is passed to the second kernel, which saves its result back to arrayA. arrayA is then copied back to the host and returned to the user. If the user wants more than one run, the whole thing starts over (copying the new arrayA to the GPU, and so on). This means that for 600 convolutions (the low end of what I am doing) there are 1200 copies (600 each way) between the GPU and CPU. Is there a better way to deal with the memory for this?
I don't know how big the images are beforehand, so I couldn't figure out a way to keep everything in global memory, but maybe I am overthinking it; I am very new to CUDA.
This is the loop:
    int counter;
    for (counter = 0; counter < numSmooths; counter++)
    {
        cutilSafeCall( cudaMemcpy(d_DataA, h_DataA, DATA_SIZE, cudaMemcpyHostToDevice) );
        cutilSafeCall( cudaThreadSynchronize() );

        convolutionRowGPU<<<blocks, threads>>>(
            d_DataB,
            d_DataA,
            d_Kernel,
            DATA_W,
            DATA_H,
            KERNEL_R
        );
        cutilCheckMsg("convolutionRowGPU() execution failed\n");

        convolutionColumnGPU<<<blocks, threads>>>(
            d_DataA,
            d_DataB,
            d_Kernel,
            DATA_W,
            DATA_H,
            KERNEL_R
        );
        cutilCheckMsg("convolutionColumnGPU() execution failed\n");

        cutilSafeCall( cudaThreadSynchronize() );
        cutilSafeCall( cudaMemcpy(h_DataA, d_DataA, DATA_SIZE, cudaMemcpyDeviceToHost) );
    }
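
For comparison, here is a minimal sketch of the kind of restructuring the question is asking about: upload once, run every pass on the GPU, download once. It assumes the intermediate results are not needed on the host between runs, and that the image size is known at run time, so the buffers can be allocated with cudaMalloc after the size has been read. The names rowPass, columnPass, smoothOnDevice and CHECK are hypothetical, and the two kernels are naive zero-padded stand-ins, not the convolutionRowGPU and convolutionColumnGPU kernels from the post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Stand-in row pass (assumption: samples outside the image count as zero).
__global__ void rowPass(float *dst, const float *src, const float *kernel,
                        int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -r; k <= r; ++k) {
        int xx = x + k;
        if (xx >= 0 && xx < w) sum += src[y * w + xx] * kernel[r - k];
    }
    dst[y * w + x] = sum;
}

// Stand-in column pass, same border handling as rowPass.
__global__ void columnPass(float *dst, const float *src, const float *kernel,
                           int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -r; k <= r; ++k) {
        int yy = y + k;
        if (yy >= 0 && yy < h) sum += src[yy * w + x] * kernel[r - k];
    }
    dst[y * w + x] = sum;
}

// Runs numSmooths row+column passes with one upload and one download in total.
// h_data is overwritten with the final result.
void smoothOnDevice(float *h_data, const float *h_kernel,
                    int w, int h, int r, int numSmooths)
{
    size_t dataBytes   = (size_t)w * h * sizeof(float);
    size_t kernelBytes = (size_t)(2 * r + 1) * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_kernel = NULL;

    // Sizes are only needed here, at run time, not at compile time.
    CHECK(cudaMalloc((void **)&d_a, dataBytes));
    CHECK(cudaMalloc((void **)&d_b, dataBytes));
    CHECK(cudaMalloc((void **)&d_kernel, kernelBytes));

    // Single host-to-device copy of the image and the filter taps.
    CHECK(cudaMemcpy(d_a, h_data, dataBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_kernel, h_kernel, kernelBytes, cudaMemcpyHostToDevice));

    dim3 threads(16, 16);
    dim3 blocks((w + threads.x - 1) / threads.x, (h + threads.y - 1) / threads.y);

    // Ping-pong between d_a and d_b; each iteration leaves its result in d_a.
    // Kernels launched into the same stream run in launch order.
    for (int i = 0; i < numSmooths; ++i) {
        rowPass<<<blocks, threads>>>(d_b, d_a, d_kernel, w, h, r);
        columnPass<<<blocks, threads>>>(d_a, d_b, d_kernel, w, h, r);
    }
    CHECK(cudaGetLastError());

    // Single device-to-host copy; a blocking cudaMemcpy on the default stream
    // waits for the kernels launched above to finish.
    CHECK(cudaMemcpy(h_data, d_a, dataBytes, cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    CHECK(cudaFree(d_kernel));
}
```

Calling smoothOnDevice(image, taps, width, height, radius, 600) would then perform 600 row and column passes with two image copies in total instead of 1200. If the host does need to inspect the image every so often, one option is to copy it back only on those iterations and leave it on the device otherwise.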