17x drop in CUDA performance when each thread operates on a subset of the kernel input data

I have two scenarios: the first produces great results and the second not so much. In the first scenario, 100 threads (launched as <<<1, 100>>>) operate on the same input data. All 100 threads process exactly the same data through the same input pointer, so if the input is 1MB, the total data processed is 100MB (from a single 1MB buffer). I record the time after the threads have completed and the numbers look promising. In the second scenario I take a 100MB input file, send it to the device, and process it over 100 threads (again <<<1, 100>>>), where each thread processes a 1/100th chunk of the original input. In total 100MB is processed, yet it takes 17 times longer than scenario one. In both cases 100MB is processed and each thread handles 1MB; the only difference is that in the first it is the same 1MB in every thread and in the second it is a different 1MB per thread. The copy times to and from the device are excluded. Different input sizes (e.g. 2KB vs 200KB) give marginally different slowdown ratios, but the difference is always there. Streaming multiple kernels gives roughly a 2x improvement, but the fundamental performance drop is illustrated by the two simple kernels below:

Scenario one looks something like this:

    // 1MB input x 100 threads: every thread copies the same 1MB
    __global__ void myKernel(char *input, char *output, size_t len)
    {
        for (size_t i = 0; i < len; ++i)
            output[i] = input[i];
    }

And scenario two:

    // 100MB input / 100 threads: each thread copies its own 1MB chunk
    __global__ void myKernel(char *input, char *output, size_t len)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nb_threads = blockDim.x * gridDim.x;
        size_t mylen = len / nb_threads;
        size_t mystart = tid * mylen;
        for (size_t i = mystart; i < (mystart + mylen); ++i)
            output[i] = input[i];
    }

Any advice or ideas would be appreciated.

Full code

    #include <stdio.h>
    #include <stdlib.h>

    // Each thread copies its own contiguous 1/nb_threads chunk of the input.
    __global__ void myKernel(char *input, char *output, size_t len)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nb_threads = blockDim.x * gridDim.x;
        size_t mylen = len / nb_threads;
        size_t mystart = tid * mylen;
        for (size_t i = mystart; i < (mystart + mylen); ++i)
            output[i] = input[i];
    }

    int main(int argc, char* argv[])
    {
        // Read the whole input file into host memory
        FILE *infile = fopen("infile", "rb");
        FILE *outfile = fopen("outfile", "wb");
        fseek(infile, 0, SEEK_END);
        size_t len = ftell(infile);
        fseek(infile, 0, SEEK_SET);

        char *src = (char*) malloc(len);
        char *dst = (char*) malloc(len);
        fread(src, 1, len, infile);

        // Allocate device buffers and copy the input across
        char *input, *output;
        cudaMalloc((void**) &input, len);
        cudaMalloc((void**) &output, len);
        cudaMemcpy(input, src, len, cudaMemcpyHostToDevice);

        // One block of 100 threads; each thread handles len/100 bytes
        myKernel<<< 1, 100 >>>(input, output, len);

        cudaMemcpy(dst, output, len, cudaMemcpyDeviceToHost);
        cudaFree(input);
        cudaFree(output);

        fwrite(dst, len, 1, outfile);
        fclose(infile);
        fclose(outfile);
        free(src);
        free(dst);
        return 0;
    }

Interestingly, on a basic comparison this yields a 2x improvement; while not the solution, it possibly hints at the underlying problem.
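
To be clear about what I mean by streaming multiple kernels (mentioned in the original post), it is along these lines; a rough sketch rather than my exact code, and the chunk/stream count of 4 is just illustrative:

    // Rough sketch: split the buffer into chunks and launch one kernel per
    // cudaStream so the launches can potentially overlap instead of running
    // strictly back to back.
    const int nstreams = 4;                 // illustrative chunk/stream count
    size_t chunk = len / nstreams;
    cudaStream_t streams[nstreams];

    for (int s = 0; s < nstreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nstreams; ++s) {
        size_t off = (size_t)s * chunk;
        myKernel<<< 1, 100, 0, streams[s] >>>(input + off, output + off, chunk);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nstreams; ++s)
        cudaStreamDestroy(streams[s]);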

What GPU are you using? Fermi GPUs added on-chip caches: an L1 cache of up to 48KB per SM plus a unified L2 cache. Therefore, if every thread reads the same 1MB, the cached data gets reused across threads, reducing the number of reads that go out to global memory. When each thread reads a different part of global memory (100 threads accessing 100MB), the cache doesn't help and you see a much larger number of global memory reads.

Thanks for the response chippies, that makes a lot of sense. Any ideas how I can optimize the memory access?

The card is a 550 Ti. Just a bit of additional useless info: I tested the scenarios with small input files (1-2KB) vs a larger file (200KB) and the difference is still there, although the ratio is slightly better.

I’ve also found this statement on the net:

Now that is pretty much what I did: I needed a 2D array, but flattened it into a 1D array and used pointer arithmetic to select the chunks. The only drawback is the performance loss, which contradicts other threads stating that the linear approach is faster than the double dereference needed for a true 2D array.
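
Roughly, the flattening I am describing looks like this (a simplified sketch with illustrative names, not my exact code):

    // Logically a rows x cols 2D array, stored flattened in a single 1D buffer.
    // Each thread picks out its own "row" (chunk) with pointer arithmetic.
    __global__ void rowCopy(const char *input, char *output, size_t rows, size_t cols)
    {
        size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < rows) {
            const char *src = input  + tid * cols;   // element [tid][0] in 2D terms
            char       *dst = output + tid * cols;
            for (size_t i = 0; i < cols; ++i)
                dst[i] = src[i];                     // same strided pattern as scenario two
        }
    }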

In a bandwidth limited scenario, processing 100MB in total vs 1MB in total would result in a 100x slowdown.

So I guess you got lucky by seeing only 17x

Christian

Thanks for the response Christian. They are both processing 100MB. In both scenarios each thread processes 1MB. In the first one it is the same 1MB in each thread and in the second a different 1MB input in each thread. The copy times to and from device are excluded. I’ll update the original post.

The first test suggests the GPU has enough processing power, while the second suggests the bottleneck is memory access. Any suggestions for additional tests to confirm this, and/or a method to optimize it?
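
For the timing side, something like the following is what I have in mind, replacing the plain kernel launch in main() above (a sketch using CUDA events; the resulting figure would be compared against the card's theoretical peak bandwidth):

    // Time only the kernel with CUDA events and report effective bandwidth.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<< 1, 100 >>>(input, output, len);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // The kernel reads len bytes and writes len bytes, i.e. 2*len bytes moved.
    double gbps = (2.0 * (double)len / (ms / 1000.0)) / 1e9;
    printf("kernel: %.3f ms, effective bandwidth: %.2f GB/s\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);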

The code may be processing 100MB in both cases, but the bandwidth requirements to external (off-chip, on-card) memory are vastly different. When all threads consume the same data, there is tons of data re-use and thus the code gets great mileage out of the on-chip caches. For optimal memory throughput you would want to use many thousands of threads, each touching one or very few data elements, in a fully coalesced base + tid access pattern.
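
A minimal sketch of that base + tid pattern applied to the copy above (the kernel name and the 256-thread block size are just illustrative; the launch configuration has to change from a single block of 100 threads to enough blocks to cover the buffer):

    // Coalesced copy: consecutive threads touch consecutive bytes, so each
    // memory transaction services a whole warp instead of a single thread.
    __global__ void copyCoalesced(const char *input, char *output, size_t len)
    {
        size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < len)
            output[idx] = input[idx];
    }

    // Launch with enough threads to cover all len bytes:
    //   int threads = 256;
    //   int blocks  = (int)((len + (size_t)threads - 1) / threads);
    //   copyCoalesced<<< blocks, threads >>>(input, output, len);

With that layout, thread 0 reads byte 0, thread 1 reads byte 1, and so on, so a warp's 32 loads fall into one contiguous segment instead of 32 widely separated chunks.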

Thanks njuffa. I will read up on fully coalesced memory access. I have seen the term many times but have not quite understood it; it seems I am not going to get past it easily.