I have two scenarios: the first produces great results, the second not so much.

In the first scenario, 100 threads (launched as <<<1, 100>>>) operate on the same input data. Every thread processes exactly the same data through the same *input pointer, so with a 1MB input the total data processed is 100MB (100 x 1MB). I record the time after the threads have completed, and things look promising.

In the second scenario, I take a 100MB input file and send it to the device to be processed by 100 threads (again <<<1, 100>>>). Each thread processes a 1/100th chunk of the original input. In total 100MB is processed, yet it takes 17 times longer than in scenario one.

In both scenarios 100MB is processed in total and each thread processes 1MB. The only difference is that in the first scenario every thread works on the same 1MB, while in the second each thread works on a different 1MB. The copy times to and from the device are excluded. The performance drop ratio varies only marginally with input size (e.g. 2KB vs 200KB), and the difference is still there. Streaming multiple kernels gives roughly a 2x improvement, but the fundamental performance drop is illustrated by the two simple kernels below:
Scenario one looks something like this:
// 1MB input x 100 threads: every thread copies the same len bytes
__global__ void myKernel(char *input, char *output, size_t len) {
    for (size_t i = 0; i < len; ++i)
        output[i] = input[i];
}
And scenario two:
// 100MB input / 100 threads: each thread copies its own contiguous chunk
__global__ void myKernel(char *input, char *output, size_t len) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_threads = blockDim.x * gridDim.x;
    size_t mylen = len / nb_threads;
    size_t mystart = tid * mylen;
    for (size_t i = mystart; i < (mystart + mylen); ++i)
        output[i] = input[i];
}
Any advice or ideas would be appreciated.
Full code
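#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread copies its own contiguous 1/nb_threads chunk of the input.
__global__ void myKernel(char *input, char *output, size_t len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_threads = blockDim.x * gridDim.x;
    size_t mylen = len / nb_threads;
    size_t mystart = tid * mylen;
    // The last thread also takes the remainder when len is not a
    // multiple of nb_threads, so no trailing bytes are left uncopied.
    size_t myend = (tid == nb_threads - 1) ? len : mystart + mylen;
    for (size_t i = mystart; i < myend; ++i)
        output[i] = input[i];
}

int main(int argc, char* argv[])
{
    FILE *infile = fopen("infile", "rb");
    FILE *outfile = fopen("outfile", "wb");
    if (!infile || !outfile) {
        fprintf(stderr, "could not open infile/outfile\n");
        return 1;
    }

    // Determine the input size and read the whole file into host memory.
    fseek(infile, 0, SEEK_END);
    size_t len = ftell(infile);
    fseek(infile, 0, SEEK_SET);
    char *src = (char*) malloc(len);
    char *dst = (char*) malloc(len);
    fread(src, 1, len, infile);

    // Allocate device buffers and copy the input to the device.
    char *input, *output;
    cudaMalloc((void**) &input, len);
    cudaMalloc((void**) &output, len);
    cudaMemcpy(input, src, len, cudaMemcpyHostToDevice);

    // One block of 100 threads; each thread handles len/100 bytes.
    myKernel<<< 1, 100 >>>(input, output, len);

    // Copy the result back (this call also waits for the kernel to finish).
    cudaMemcpy(dst, output, len, cudaMemcpyDeviceToHost);
    cudaFree(input);
    cudaFree(output);

    fwrite(dst, len, 1, outfile);
    fclose(infile);
    fclose(outfile);
    free(src);
    free(dst);
    return 0;
}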
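
The full code above does not show how the kernel time is measured. The following is a minimal sketch of one way to time only the kernel with CUDA events, so that the host/device copies stay excluded as described. The names `chunkedCopy` and `timeKernelMs` are hypothetical and chosen for illustration; this is not necessarily how the numbers quoted above were obtained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same per-thread chunked copy as scenario two (hypothetical name).
__global__ void chunkedCopy(const char *input, char *output, size_t len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_threads = blockDim.x * gridDim.x;
    size_t mylen = len / nb_threads;
    size_t mystart = tid * mylen;
    for (size_t i = mystart; i < mystart + mylen; ++i)
        output[i] = input[i];
}

// Hypothetical helper: returns the kernel's elapsed time in milliseconds.
// d_in and d_out must already be device pointers, so copy time is not measured.
float timeKernelMs(const char *d_in, char *d_out, size_t len,
                   int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // enqueue start marker on the default stream
    chunkedCopy<<<blocks, threads>>>(d_in, d_out, len);
    cudaEventRecord(stop);                   // enqueue stop marker after the kernel
    cudaEventSynchronize(stop);              // block the host until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // time between the two markers, in ms
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

With the buffers from `main`, a call such as `printf("%f ms\n", timeKernelMs(input, output, len, 1, 100));` placed between the two `cudaMemcpy` calls would report the time for scenario two only.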