17x drop in CUDA performance when each thread operates on a subset of the kernel input data
I have two scenarios: the first produces great results and the second not so much.

In the first scenario, 100 threads (a <<<1,100>>> launch) operate on the same input data. All 100 threads process the exact same data (same *input pointer), so if the input data is 1MB, the total data processed is 100MB (the same 1MB, 100 times over). I record the time after the threads have completed and things look promising.

In the second scenario I take a 100MB input file, send it to the device, and process it over 100 threads (again <<<1,100>>>). Each thread processes a 1/100th chunk of the original input. In total 100MB is processed, and it takes 17 times longer than in scenario 1.

In both cases 100MB is processed, and in both scenarios each thread processes 1MB. In the first it is the same 1MB in every thread; in the second it is a different 1MB in every thread. The copy times to and from the device are excluded. Different input sizes (e.g. 2KB vs 200KB) give marginally different performance-drop ratios, but the difference is still there. Streaming multiple kernels gives roughly a 2x improvement, but the fundamental performance drop is illustrated simply with the two kernels below:

Scenario one looks something like this:
[code]
// 1MB input x 100 threads: every thread copies the same 1MB
__global__ void myKernel(char *input, char *output, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        output[i] = input[i];
}
[/code]

And scenario two:
[code]
// 100MB input / 100 threads: every thread copies its own 1MB chunk
__global__ void myKernel(char *input, char *output, size_t len)
{
    int tid        = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_threads = blockDim.x * gridDim.x;
    size_t mylen   = len / nb_threads;
    size_t mystart = tid * mylen;

    for (size_t i = mystart; i < (mystart + mylen); ++i)
        output[i] = input[i];
}
[/code]

Any advice or ideas would be appreciated.

Full code:
[code]
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void myKernel(char *input, char *output, size_t len)
{
    int tid        = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_threads = blockDim.x * gridDim.x;
    size_t mylen   = len / nb_threads;
    size_t mystart = tid * mylen;

    for (size_t i = mystart; i < (mystart + mylen); ++i)
        output[i] = input[i];
}

int main(int argc, char* argv[])
{
    FILE *infile  = fopen("infile", "rb");
    FILE *outfile = fopen("outfile", "wb");

    // Determine the input size and read the whole file into host memory
    fseek(infile, 0, SEEK_END);
    size_t len = ftell(infile);
    fseek(infile, 0, SEEK_SET);

    char *src = (char*) malloc(len);
    char *dst = (char*) malloc(len);
    fread(src, 1, len, infile);

    // Allocate device buffers and copy the input to the device
    char *input, *output;
    cudaMalloc((void**) &input, len);
    cudaMalloc((void**) &output, len);
    cudaMemcpy(input, src, len, cudaMemcpyHostToDevice);

    // One block of 100 threads, each copying len/100 bytes
    myKernel<<< 1, 100 >>>(input, output, len);
    cudaMemcpy(dst, output, len, cudaMemcpyDeviceToHost);

    cudaFree(input);
    cudaFree(output);

    fwrite(dst, len, 1, outfile);
    fclose(infile);
    fclose(outfile);
    free(src);
    free(dst);

    return 0;
}
[/code]
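The timings mentioned above cover only the kernel, not the copies; roughly, the kernel-only time is measured with CUDA events like this (a minimal sketch, not part of the listing above):
[code]
// Sketch of the kernel-only timing (illustrative, not in the listing above):
// CUDA events bracket the launch so the memcpys are excluded.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<< 1, 100 >>>(input, output, len);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);        // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
[/code]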

#1
Posted 04/14/2012 09:45 PM   
[quote]Use a lot more threads and let the last 31 threads in each warp do nothing.[/quote]

Interestingly, on a basic comparison this yields a 2x improvement; although it is not the solution, it possibly alludes to the underlying problem.
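Roughly what I tested is one chunk per warp, with only the first lane doing the copying (a quick sketch of how I read the suggestion; the kernel name and launch configuration are just examples):
[code]
// Rough sketch of the suggestion: many more threads, but only lane 0 of
// each warp does the copying, so the last 31 threads in each warp do nothing.
__global__ void myKernelWarp(char *input, char *output, size_t len)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_warps = (blockDim.x * gridDim.x) / 32;
    int warp     = tid / 32;

    if (threadIdx.x % 32 != 0)     // idle lanes
        return;

    size_t mylen   = len / nb_warps;
    size_t mystart = (size_t)warp * mylen;
    for (size_t i = mystart; i < (mystart + mylen); ++i)
        output[i] = input[i];
}

// e.g. 100 warps of 32 threads instead of a single block of 100 threads:
// myKernelWarp<<< 100, 32 >>>(input, output, len);
[/code]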

#2
Posted 04/15/2012 01:22 AM   
What GPU are you using? Fermi GPUs added an L1 cache of up to 48KB on each SM (plus a shared L2 cache). Therefore, if every thread reads in the same 1MB, much of it is served from cache, reducing the number of reads to global memory. When each thread reads a different part of global memory (100 threads accessing 100MB), the cache doesn't help and you see a far larger number of global memory reads.

#3
Posted 04/15/2012 02:39 PM   
Thanks for the response chippies, that makes a lot of sense. Any ideas how I can optimize the memory access?

The card is a 550ti. Just a bit of additional useless info: I tested the scenarios with small input files (1-2KB) vs a larger file (200KB) and the difference is still there, although the ratio is slightly better.

I've also found this statement on the net:
[quote]Flatten out your 2D array into a 1D array and use pointer arithmetic to access the chunk of array you desire. This works but the issue is probable performance loss due to classical data structure alignment issues[/quote]

Now that is pretty much what I did: I needed a two-dimensional array, but flattened it into a 1D array and used pointer arithmetic to select the chunks. The only drawback is the performance loss, which contradicts other threads stating that the linear approach is faster than the more complex double memory accesses of a 2D array.
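For clarity, the flattening amounts to something like this (names are just illustrative):
[code]
// Illustrative sketch of the flattened 2D array: a conceptual
// array[nb_chunks][chunk_len] stored as one 1D allocation, with each
// thread using pointer arithmetic to reach its own chunk (row).
__global__ void chunkKernel(char *flat_in, char *flat_out, size_t chunk_len)
{
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;   // one chunk per thread

    char *in  = flat_in  + (size_t)chunk * chunk_len;    // start of this row
    char *out = flat_out + (size_t)chunk * chunk_len;

    for (size_t i = 0; i < chunk_len; ++i)
        out[i] = in[i];
}
[/code]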

#4
Posted 04/15/2012 06:50 PM   
[quote name='ultracuda' date='14 April 2012 - 11:45 PM' timestamp='1334439907' post='1396454']
In total 100MB was processed and it took 17 times longer to process the 100MB than in scenario 1.
[/quote]

In a bandwidth limited scenario, processing 100MB in total vs 1MB in total would result in a 100x slowdown ;)
So I guess you got lucky by seeing only 17x

Christian

#5
Posted 04/16/2012 09:33 AM   
Thanks for the response Christian. They are both processing 100MB. In both scenarios each thread processes 1MB. In the first one it is the same 1MB in each thread and in the second a different 1MB input in each thread. The copy times to and from device are excluded. I'll update the original post.

The first test suggests that the GPU has enough processing power, while the second suggests the bottleneck is memory access. Any suggestions for additional tests to confirm this, and/or a method to optimize it?

#6
Posted 04/16/2012 05:42 PM   
The code may be processing 100MB in both cases, but the bandwidth requirements to external (off-chip, on-card) memory are vastly different. When all threads consume the same data, there is tons of data re-use and thus the code gets great mileage out of the on-chip caches. For optimal memory throughput you would want to use many thousands of threads, each touching one or very few data elements, in a fully coalesced base + tid access pattern.
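For example, a copy in that style might look roughly like this (a grid-stride sketch; the kernel name and launch configuration are just examples, not tuned):
[code]
// Rough sketch of a coalesced base + tid copy: consecutive threads touch
// consecutive bytes, and a grid-stride loop covers the whole buffer.
__global__ void copyCoalesced(const char *input, char *output, size_t len)
{
    size_t tid    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;

    for (size_t i = tid; i < len; i += stride)
        output[i] = input[i];
}

// e.g. copyCoalesced<<< 1024, 256 >>>(input, output, len);
[/code]
With that pattern, adjacent threads of a warp read adjacent addresses, so each memory transaction serves a whole warp rather than a single thread.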

#7
Posted 04/16/2012 06:06 PM   
Thanks njuffa. I will read up on fully coalesced memory access. I have seen it mentioned many times but have not quite understood it; it seems I am not going to get past it easily.

#8
Posted 04/16/2012 07:01 PM   