CUDA Convolution - best memory usage

Okay, I have written my own little separable convolution program that has two separate kernels (one for row convolution and one for column). It is very similar to how the C version in the SDK works, except written in parallel (the loop-unrolling version in the SDK did not fit what I needed exactly). Basically, the user calls the program and tells it how many times they want to run the convolution.

Currently what it does is copy the array (call it arrayA) over to the GPU, save the result in arrayB on the GPU, then pass arrayB to the second kernel and save the results back to arrayA. Then arrayA is copied back to the host and returned to the user. If the user wants more than one run, the whole thing starts over (copying the new arrayA to the GPU, etc.). This means that for 600 convolutions (the low end of what I am doing), there are 1200 copies (600 each way) between the GPU and CPU. Is there a better way to deal with the memory for this?

I don’t know how big the images are beforehand, so I couldn’t figure out a way to just make it all sit in global memory, but maybe I am overthinking it. I am very new to CUDA.

This is the loop:

int counter;

for (counter = 0; counter < numSmooths; counter++)
{
    // Copy the input image to the GPU
    cutilSafeCall( cudaMemcpy(d_DataA, h_DataA, DATA_SIZE, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaThreadSynchronize() );

    // Row pass: reads d_DataA, writes d_DataB
    convolutionRowGPU<<<blocks,threads>>>(
        d_DataB,
        d_DataA,
        d_Kernel,
        DATA_W,
        DATA_H,
        KERNEL_R
    );
    cutilCheckMsg("convolutionRowGPU() execution failed\n");

    // Column pass: reads d_DataB, writes the result back to d_DataA
    convolutionColumnGPU<<<blocks,threads>>>(
        d_DataA,
        d_DataB,
        d_Kernel,
        DATA_W,
        DATA_H,
        KERNEL_R
    );
    cutilCheckMsg("convolutionColumnGPU() execution failed\n");
    cutilSafeCall( cudaThreadSynchronize() );

    // Copy the smoothed image back to the host
    cutilSafeCall( cudaMemcpy(h_DataA, d_DataA, DATA_SIZE, cudaMemcpyDeviceToHost) );
}

Hi,
Since I’m not familiar with the type of computation you are doing, I might say something plainly stupid, so please forgive me if so.
That said, if the code snippet you gave actually reflects your algorithm, and if you don’t touch h_DataA inside your loop, why don’t you just move both cudaMemcpy calls (H2D and D2H) outside of the loop? The H2D copy would go before the loop and the D2H copy right after it. That way, no unnecessary data transfers would occur.
While writing this, it looks so obvious that I must be missing something here…
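
In terms of your snippet (same variable names, just restructured), it would look roughly like this. Since kernel launches on the same stream execute in order, and cudaMemcpy blocks until prior device work has finished, the explicit cudaThreadSynchronize() calls shouldn't be needed either:

// Copy the input once, before the loop
cutilSafeCall( cudaMemcpy(d_DataA, h_DataA, DATA_SIZE, cudaMemcpyHostToDevice) );

for (int counter = 0; counter < numSmooths; counter++)
{
    // Row pass: d_DataA -> d_DataB
    convolutionRowGPU<<<blocks,threads>>>(d_DataB, d_DataA, d_Kernel, DATA_W, DATA_H, KERNEL_R);
    cutilCheckMsg("convolutionRowGPU() execution failed\n");

    // Column pass: d_DataB -> d_DataA, so the next iteration starts from d_DataA again
    convolutionColumnGPU<<<blocks,threads>>>(d_DataA, d_DataB, d_Kernel, DATA_W, DATA_H, KERNEL_R);
    cutilCheckMsg("convolutionColumnGPU() execution failed\n");
}

// Copy the final result back once, after the loop
cutilSafeCall( cudaMemcpy(h_DataA, d_DataA, DATA_SIZE, cudaMemcpyDeviceToHost) );

That way you pay for exactly two transfers total, no matter how many smoothing passes you run.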

Wow. Okay, I see what you are saying. For some reason I had it in my head that to restart the loop with the resulting image from the previous pass, I had to copy it over again…

I am very new at this, thanks!

Take a look at the best practices document on the NVIDIA website. It gives good ideas about what to look out for in order to get good performance.