Working with arrays, pointers and a race condition

After a couple of days going in circles and not finding the solution via Google, I realize that I am missing a fundamental concept in C/C++/CUDA that I cannot pin down precisely, and that a guru could probably point out easily.

Here goes:

infile is an input file of, e.g., 1000 KB of data. infile needs to be split into chunks of 500 KB each (possibly a larger file into thousands of chunks eventually). Each chunk of the file is processed by a thread. In the single-threaded version the input file is stored in a char*. The idea (and problem) is to store each chunk in a char array, i.e. chararray[1] = chunk2, where chunk2 could be, e.g., 500 KB long. This is the confusing part, since the C-for-noobs book says a char stores only one character. I then thought a two-dimensional array would fix that, the first dimension for the chunk/thread number and the second for the actual data, e.g. for two chunks: double_array[1][?]. I tried that with a fixed length and also with an array of a struct, but in both cases I ended up with the same issue as with the single array, which leads me to think the problem is more basic. A simplified version of the code is provided below. The simplification may have introduced some new bugs, but overall there should be fewer.

Problem: the output written to outfile is fine for idx 0, i.e. host_array[0]. In the code below, this means outfile is exactly like the source file. Perhaps only by chance, because adding __syncthreads() after the copy back to the host changes this as well. When running the kernel as <<<1, 2>>>, i.e. including idx 1, host_array[1] is empty, which is the main issue. In principle, host_array[1] should hold another copy of infile in this example.

Possibly related: in the full version there is an added twist. When running the kernel with <<<1, 1>>>, i.e. one thread, the output is as expected. With <<<1, 2>>> the output of host_array[0] becomes slightly altered (roughly every 50th character or so). The more threads, the more it is garbled. This sounds like a race condition, but it might be solved together with the problem described above.

Any help would be much appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include "cuda_runtime_api.h"
#include <string.h>

__device__ void myfunction(char *input, char *output, size_t len)
{
    memcpy(output, input, len);
}

__global__ void myKernel(char *d_src, char *dst_array, size_t len)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    myfunction(d_src, &dst_array[idx], len);
}

int main(int argc, char* argv[])
{
    FILE *infile, *outfile;
    char *src, *input;
    size_t len;

    infile = fopen("infile", "rb");
    outfile = fopen("outfile", "wb");
    fseek(infile, 0, SEEK_END);
    len = ftell(infile);
    fseek(infile, 0, SEEK_SET);
    src = (char*) malloc(len);
    fread(src, 1, len, infile);

    cudaMalloc((void**) &input, len);
    cudaMemcpy(input, src, len, cudaMemcpyHostToDevice);

    int N = 2;
    char *host_array[2];
    char *dst_array[2];
    for (int i = 0; i < N; i++)
    {
        host_array[i] = (char *) malloc(len);
    }

    cudaMalloc((void **)&dst_array, N*(len));
    cudaMemcpy(*dst_array, *host_array, N*(len), cudaMemcpyHostToDevice);

    myKernel<<< 1, 2 >>> (input, *dst_array, len);

    cudaMemcpy(*host_array, *dst_array, N*(len), cudaMemcpyDeviceToHost);

    cudaFree(input);
    cudaFree(dst_array);

    //host_array[1]
    fwrite(host_array[0], len, 1, outfile);

    fclose(infile);
    fclose(outfile);
    return 0;
}

Hi,

In your code, you happily mix up char* and char**, which are not (at all) the same.

Without going into too much detail: you actually don't need a table of strings (i.e. a char**), since a single flat string (char*) holding the whole data to process will do the trick perfectly, with much less pain in regard to scalability (in the number of threads processing the data) and to data transfers (you cannot easily transfer a multidimensional array with cudaMemcpy).
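To make the point concrete before the full listing, here is a minimal sketch of my own (not your fixed code): with one flat device allocation, chunk i is just an offset into the buffer, so no table of device pointers is needed.

// Minimal sketch: N chunks of len bytes in ONE flat device buffer.
size_t len = 500 * 1024;   // example chunk size
int    N   = 2;            // example chunk count
char  *d_buf;
cudaMalloc((void**)&d_buf, (size_t)N * len);   // single allocation for all chunks

// Chunk i is a plain char* at an offset -- no char** table required:
for (int i = 0; i < N; ++i) {
    char *chunk_i = d_buf + (size_t)i * len;
    // ... pass chunk_i to a kernel, or compute the same offset inside the kernel
}
cudaFree(d_buf);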

Here is what your code might look like:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <string.h>

__device__ void myfunction(char *input, char *output, size_t len) {
    for (size_t i=0; i<len; ++i)
        output[i] = input[i];
}

__global__ void myKernel(char *src, char *dst, size_t len) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_threads = blockDim.x * gridDim.x;
    // Split len as evenly as possible: the first (len % nb_threads)
    // threads each take one extra byte.
    size_t mylen = len / nb_threads;
    int remain = len % nb_threads;
    size_t mystart = tid*mylen + min(tid, remain);
    if (remain > tid) mylen++;
    myfunction(src+mystart, dst+mystart, mylen);
}

int main(int argc, char* argv[]) {
    FILE *infile = fopen("infile", "rb");
    FILE *outfile = fopen("outfile", "wb");
    fseek(infile, 0, SEEK_END);
    size_t len = ftell(infile);
    fseek(infile, 0, SEEK_SET);
    char *src = (char*) malloc(len);
    char *dst = (char*) malloc(len);
    fread(src, 1, len, infile);

    char *input, *output;
    cudaMalloc((void**) &input, len);
    cudaMalloc((void**) &output, len);
    cudaMemcpy(input, src, len, cudaMemcpyHostToDevice);

    int NB_THREADS = 7; // or whatever; no need to divide len evenly
    int NB_BLOCKS = 2;  // or whatever

    myKernel<<< NB_BLOCKS, NB_THREADS >>>(input, output, len);

    cudaMemcpy(dst, output, len, cudaMemcpyDeviceToHost);
    cudaFree(input);
    cudaFree(output);

    fwrite(dst, len, 1, outfile);
    fclose(infile);
    fclose(outfile);
    free(src);
    free(dst);
    return 0;
}

Don't be mistaken, this code is stupid, and it is definitely not the way to copy an array in CUDA. However, I believe it achieves the type of processing you wanted to implement.

Of course, adding some error checking is a must-have, which I haven't included here.
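For instance, a typical pattern is a checking macro around every runtime call (a sketch of my own, not part of the code above):

// A common CUDA error-checking idiom (the macro name is my own).
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&input, len));
//   myKernel<<<NB_BLOCKS, NB_THREADS>>>(input, output, len);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches execution errors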

HTH.

Thanks Gilles_C. Your example has given me a different perspective on solving the problem.

This method works to get the threads to process each chunk, but I later discovered it also carries a roughly 17x performance penalty. Perhaps there is still something I am missing.

Performance hit post: The Official NVIDIA Forums | NVIDIA

Hi,

I'm not sure how you did your performance measurement, but what is for sure is that the code I gave you was only an illustration of a possible approach, by no means an example tuned for performance (hence my disclaimer at the end of the post). The reason for this expected performance penalty is the memory access pattern it uses:

- In a proper data-copy code, you would pass a single pointer to the whole set of threads, and they would process it all in one go, each thread accessing one single element at a time in a coalesced manner. This would optimise the memory bandwidth usage.

- In the one I gave you, each thread gets a consecutive share of the global workload to process individually. This makes it impossible to coalesce the memory accesses and dramatically slows down the whole job.

Now, the point of the sample code was not performance, but reproducing as closely as possible what you were trying to do, only correctly.

So the question that remains in my mind is: what are you actually trying to do? I'm sure having a flattened 2D array instead of the structure you tried to use at the beginning is not an issue. To convince yourself, just replace the kernel in the sample code with this one:

__global__ void myKernel(char *src, char *dst, size_t len) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nb_threads = blockDim.x * gridDim.x;
    // Grid-stride loop: consecutive threads touch consecutive
    // elements, so the accesses coalesce.
    for (int i=tid; i<len; i+=nb_threads)
        dst[i] = src[i];
}

This should give you the exact same result, but much, much faster.
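If you want to check the speed difference yourself, CUDA events are a simple way to time a kernel (a sketch, assuming the buffers and launch configuration from my earlier listing):

// Timing a kernel launch with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<< NB_BLOCKS, NB_THREADS >>>(input, output, len);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);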

Thanks once again Gilles_C, and once again you are right, no surprise.

That helps me understand the coalesced memory access concept better as well.

I am trying to process 1000 chunks of data on the GPU over 1000 threads. The chunks are small, at 1 KB (up to 16 KB if performance allows). The device code is a serial CPU script, processing each chunk in a serial manner. The limitation here is that the app is serial because the data in each chunk needs to be processed serially. But that is okay, since as a test, when all the threads process the same source input, the speed is 20x the CPU, which gives the impression that the processing power of the GPU is good enough to handle the serial script. As you have pointed out, the challenge is the memory access bandwidth. Coalesced access solved this.

Next, the data could be rearranged into the coalesced pattern before sending it to the processing kernel; once read by the threads, it will be in the original serial order again. I'll see at what rate the CPU can rearrange this versus many GPU threads in a pre-processing kernel.
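Something like this host-side pre-pass is what I have in mind (a sketch with my own names numChunks/chunkSize, untested): element j of chunk c moves to position j*numChunks + c, so thread c reading its j-th element sits right next to thread c+1 reading its j-th element.

// Hypothetical pre-pass: interleave the chunks so that per-thread
// reads in the kernel become coalesced.
// Element j of chunk c moves from src[c*chunkSize + j]
// to dst[j*numChunks + c].
void interleave(const char *src, char *dst, int numChunks, int chunkSize)
{
    for (int c = 0; c < numChunks; ++c)
        for (int j = 0; j < chunkSize; ++j)
            dst[(size_t)j * numChunks + c] = src[(size_t)c * chunkSize + j];
}

// In the kernel, thread `tid` then reads element j of its chunk as
//   data[j] = src[(size_t)j * numChunks + tid];
// so consecutive tids hit consecutive addresses.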

Since the script references the source pointer multiple times and has its own pointer arithmetic, I'll see if it is 'possible' to create a new source variable that is an array of char pointers in the coalesced order, without breaking things again:

int order = 0; // serial pointer arithmetic mapped to coalesced order
for (int i=tid; i<len; i+=nb_threads)
{
    newsrc[order] = &src[i]; // newsrc is an array of char pointers (char**)
    order++;
}
serialfunction(newsrc);

Haven't tried this yet.

Well, if what you want to do is really this, I see a possible solution that will both allow fully coalesced accesses and sequential processing of the individual chunks of memory: tiling in shared memory.

Let's assume for the sake of simplicity that you have 1024 data chunks of 1024 elements each, stored sequentially in a 1D array (each element being a char, a chunk is 1 KiB). So far this looks reasonably close to your problem's description. Let's also assume you're on a Fermi card with 512 cores organised in 16 SMs (like a GTX 580 or an M2090).

Then you can fit 32 chunks of data in shared memory (32 x 1 KiB = 32 KiB, which fits in Fermi's 48 KiB), to be processed individually by the 32 threads of a block later on. With 32 blocks, you cover your 1024 chunks of data, with a ratio of 2 blocks per SM. The shared memory usage won't allow more than one block per SM to be scheduled at a time, but this should be OK anyway since all your data accesses should be local at this point.

Here is what your kernel could look like:

#define BLOCKSIZE 32
#define GRIDSIZE 32
#define CHUNKSIZE 1024

__global__ void myKernel(char *src, char *dst, size_t len) {
    __shared__ char data[BLOCKSIZE][CHUNKSIZE+1]; // +1 to avoid bank conflicts when processing data

    // loading data chunks into shared memory in a coalesced manner
    int offset = BLOCKSIZE*CHUNKSIZE*blockIdx.x;
    for (int i=0; i<BLOCKSIZE; i++) {
        for (int j=threadIdx.x; j<CHUNKSIZE; j+=BLOCKSIZE)
            data[i][j] = src[offset+j];
        offset += CHUNKSIZE;
    }
    __syncthreads();

    // sequential (in-place) processing of the local data chunk
    processMyData(data[threadIdx.x], CHUNKSIZE);

    // copy back to global memory in a coalesced manner
    __syncthreads();
    offset = BLOCKSIZE*CHUNKSIZE*blockIdx.x;
    for (int i=0; i<BLOCKSIZE; i++) {
        for (int j=threadIdx.x; j<CHUNKSIZE; j+=BLOCKSIZE)
            dst[offset+j] = data[i][j];
        offset += CHUNKSIZE;
    }
}

If processing your local data in place is not possible, then you'll have to either slightly decrease your chunk size and allocate both an input and an output buffer in shared memory, or decrease your block size (which I don't recommend, since it is already down to warp size).
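For the two-buffer variant, here is a sketch (my own numbers: CHUNKSIZE halved to 512 so that both buffers fit in Fermi's 48 KiB of shared memory; processMyDataOutOfPlace is an assumed name for your routine, reading one buffer and writing the other):

#define BLOCKSIZE 32
#define CHUNKSIZE 512   // halved: 2 buffers x 32 x 513 bytes ~ 32 KiB total

__global__ void myKernelTwoBuffers(char *src, char *dst, size_t len) {
    __shared__ char in [BLOCKSIZE][CHUNKSIZE+1]; // +1 against bank conflicts
    __shared__ char out[BLOCKSIZE][CHUNKSIZE+1];

    // coalesced load, same pattern as the in-place version
    int offset = BLOCKSIZE*CHUNKSIZE*blockIdx.x;
    for (int i=0; i<BLOCKSIZE; i++) {
        for (int j=threadIdx.x; j<CHUNKSIZE; j+=BLOCKSIZE)
            in[i][j] = src[offset+j];
        offset += CHUNKSIZE;
    }
    __syncthreads();

    // out-of-place sequential processing of this thread's chunk
    processMyDataOutOfPlace(in[threadIdx.x], out[threadIdx.x], CHUNKSIZE);

    // coalesced store
    __syncthreads();
    offset = BLOCKSIZE*CHUNKSIZE*blockIdx.x;
    for (int i=0; i<BLOCKSIZE; i++) {
        for (int j=threadIdx.x; j<CHUNKSIZE; j+=BLOCKSIZE)
            dst[offset+j] = out[i][j];
        offset += CHUNKSIZE;
    }
}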

HTH