After a couple of days of going in circles and not finding the solution via Google, I realize that I am missing a fundamental concept in C/C++/CUDA that I cannot define accurately, and that a guru could probably point out easily.
Here goes:
infile is an input file of, e.g., 1000 KB of data. It needs to be split into chunks of 500 KB each (eventually a larger file split into thousands of chunks). Each chunk of the file is processed by one thread. In the single-threaded version the input file is stored in a char*. The idea (and the problem) is to store each chunk in its own slot of an array, i.e. chararray[1] = chunk2, where chunk2 could be, e.g., 500 KB long. This is the confusing part, since the C-for-noobs book says a char stores only one character. I thought a two-dimensional array would fix that, with the first dimension for the chunk/thread number and the second for the actual data, e.g. for two chunks: double_array[1][?]. I tried that with a fixed length, and also with an array of structs, but in both cases I ended up with the same issue as with the single array, which leads me to think the problem is more basic. A simplified version of the code is provided below. The simplification may have introduced some new bugs, but overall there should be fewer.
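To make the chunk idea concrete, here is a minimal host-side sketch in plain C. A char* does not hold "one character"; it holds the address of the first byte of a buffer, so a chunk can be described by a pointer into the original buffer plus a length. The helper names chunk_count and chunk_at are made up for illustration and are not part of the code in this post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of chunks needed to cover `total` bytes
   when each chunk holds at most `chunk_size` bytes. */
size_t chunk_count(size_t total, size_t chunk_size)
{
    assert(chunk_size > 0);
    return (total + chunk_size - 1) / chunk_size;
}

/* Hypothetical helper: returns a pointer to the first byte of chunk i
   inside buf and stores the chunk's length in *out_len.
   The last chunk may be shorter than chunk_size. */
char *chunk_at(char *buf, size_t total, size_t chunk_size,
               size_t i, size_t *out_len)
{
    size_t offset = i * chunk_size;
    assert(offset < total);

    size_t remaining = total - offset;
    *out_len = remaining < chunk_size ? remaining : chunk_size;
    return buf + offset;
}
```

With this layout no data is copied: chunk i is simply the bytes starting at buf + i * chunk_size, and a thread only needs its index to find its chunk.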
Problem: the output in outfile is fine for idx 0 and host_array[0]. In the code below, this means outfile is exactly like the source file. Perhaps that is by chance, because adding a syncthreads after the copy back to the host changes this as well. When running the kernel with 1,2 (i.e. one block of two threads), host_array[1], which is written by idx 1, is empty, and that is the main issue. In principle, host_array[1] should hold another copy of infile in this example.
Related?: In the full version there is an added twist which might be related. When running the kernel with 1,1, i.e. one thread, the output is as expected. With 1,2 the output of host_array[0] becomes slightly altered (roughly every 50th character). The more threads, the more it's garbled. This sounds like a race condition, but it might be solved together with the problem described above.
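One thing that might explain the garbling: in the kernel, dst_array is a char*, so &dst_array[idx] is just dst_array + idx. The destination ranges of neighbouring threads then start one byte apart and overlap almost completely. The sketch below is a serial stand-in in plain C, not GPU code, and the names emulate_threads and slot_matches are invented for illustration. It compares a stride of 1 byte with a stride of len bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Serial stand-in for the kernel: "thread" idx copies len bytes of src
   to dst + idx * stride. On a real GPU the threads run concurrently, so
   overlapping writes would be a race; here they simply run in order. */
void emulate_threads(char *dst, const char *src, size_t len,
                     size_t n_threads, size_t stride)
{
    for (size_t idx = 0; idx < n_threads; idx++)
        memcpy(dst + idx * stride, src, len);
}

/* Returns 1 if slot idx, i.e. bytes [idx*len, (idx+1)*len) of dst,
   holds an exact copy of src, and 0 otherwise. */
int slot_matches(const char *dst, const char *src, size_t len, size_t idx)
{
    return memcmp(dst + idx * len, src, len) == 0;
}
```

With stride equal to len, every slot ends up with its own intact copy. With stride 1, the copies land on top of each other, so slot 0 is partly overwritten and slot 1 is mostly never written, which resembles the symptoms described above.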
Any help would be much appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include "cuda_runtime_api.h"
#include <string.h>
__device__ void myfunction(char *input, char *output, size_t len)
{
    memcpy(output, input, len);
}

__global__ void myKernel(char *d_src, char *dst_array, size_t len)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    myfunction(d_src, &dst_array[idx], len);
}

int main(int argc, char* argv[])
{
    FILE *infile, *outfile;
    char *src, *input;
    size_t len;

    infile = fopen("infile", "rb");
    outfile = fopen("outfile", "wb");

    fseek(infile, 0, SEEK_END);
    len = ftell(infile);
    fseek(infile, 0, SEEK_SET);

    src = (char*) malloc(len);
    fread(src, 1, len, infile);

    cudaMalloc( (void**) &input, len) ;
    cudaMemcpy( input, src, len, cudaMemcpyHostToDevice) ;

    int N = 2;
    char *host_array[2];
    char *dst_array[2];

    for (int i = 0; i < N; i++)
    {
        host_array[i] = (char *) malloc(len);
    }

    cudaMalloc((void **)&dst_array, N*(len));
    cudaMemcpy(*dst_array, *host_array, N*(len), cudaMemcpyHostToDevice);

    myKernel<<< 1, 2 >>> (input, *dst_array, len);

    cudaMemcpy(*host_array, *dst_array, N*(len), cudaMemcpyDeviceToHost);

    cudaFree(input);
    cudaFree(dst_array);

    //host_array[1]
    fwrite(host_array[0], len, 1, outfile);

    fclose(infile);
    fclose(outfile);
    return 0;
}
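A note on the host side of the code above: each host_array[i] is a separate allocation of len bytes, but the copy back transfers N*len bytes into *host_array, i.e. into host_array[0] alone. A minimal sketch of an alternative, assuming one contiguous block is acceptable, is shown below in plain C. The helper name alloc_slots is hypothetical, and this is only one possible layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: makes ONE allocation of n * len bytes and fills
   slots[i] with a pointer to the start of copy i inside that block.
   Returns the base pointer (to be passed to free) or NULL on failure. */
char *alloc_slots(char **slots, size_t n, size_t len)
{
    assert(n > 0 && len > 0);

    char *base = (char *) malloc(n * len);
    if (base == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++)
        slots[i] = base + i * len;

    return base;
}
```

A single transfer of n * len bytes into base then fits the allocation, and slots[1] refers to the second copy without any further copying.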