Binary Arithmetic

Hi, I’m having a problem with the following code. When I time it, it sometimes runs in 10 ms, sometimes in 0.1 ms, and sometimes it fails with an “invalid configuration argument” error. I launch this kernel with 256 threads per block and 1600*1200*sizeof(float)/256 blocks (30,000 blocks). The purpose of the kernel is to translate each byte of Src into a sequence of bits, so the cod array has 256*80 entries and length has 256. To encode bit by bit I have to use bit operations, so I set a mask and use OR and AND operations to set or clear the specific bit. The next step will be to copy the result into the Dst array, but first I need to understand why the kernel behaves so differently from run to run.

thank you all, A.

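__global__ void Encode(float *Src, unsigned char *Dst, bool *cod, unsigned int *length){
	const int tid = threadIdx.x;
	const int bid = blockIdx.x * blockDim.x;
	unsigned char mycod[10] = {0};	// 80 bits per code, zero-initialized
	unsigned char mask;
	unsigned char *c = (unsigned char*)Src;	// read Src byte by byte
	unsigned char curr = c[tid + bid];	// the byte this thread encodes
	unsigned int len = length[curr];	// code length in bits for this byte
	//#pragma unroll
	for(unsigned int i = 0; i < len; i++){
		mask = 1 << (i % 8);	// bit position within the byte
		if(cod[i + 80 * curr]){
			mycod[i / 8] |= mask;	// set the bit (byte index is i/8, not i%8)
		}
		else{
			mycod[i / 8] &= ~mask;	// clear the bit
		}
	}
}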
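The set-and-clear logic described in the question can be tried on the host before it goes into the kernel. This is only a minimal sketch in plain C++: writeBit and packBits are hypothetical helper names, not part of the original code, and it assumes that bit i of a code lives in byte i / 8 at bit position i % 8. In a .cu file the same functions could be marked __host__ __device__ and called from the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): set or clear bit `i`
// of a packed bit buffer. Byte index is i / 8, bit position is i % 8.
inline void writeBit(std::uint8_t *buf, std::size_t i, bool value) {
    assert(buf != nullptr);
    const std::uint8_t mask = static_cast<std::uint8_t>(1u << (i % 8));
    if (value) {
        buf[i / 8] |= mask;                              // OR sets the bit
    } else {
        buf[i / 8] &= static_cast<std::uint8_t>(~mask);  // AND with ~mask clears it
    }
}

// Pack `len` flags (one bool per bit, like the `cod` table) into `out`.
// `out` must hold at least (len + 7) / 8 bytes; those bytes are zeroed first.
inline void packBits(const bool *bits, std::size_t len, std::uint8_t *out) {
    const std::size_t nBytes = (len + 7) / 8;
    for (std::size_t b = 0; b < nBytes; ++b) {
        out[b] = 0;
    }
    for (std::size_t i = 0; i < len; ++i) {
        writeBit(out, i, bits[i]);
    }
}
```

Zeroing the output first means the clear branch is only needed when a bit is overwritten later; it is kept here so that both the OR and the AND path from the question are visible.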

What happens if you run your program under cuda-memcheck?

Thanks for the reply, but:

========= ERROR SUMMARY: 0 errors

Always, even when the run gives the “invalid configuration argument” error…

Hmm. How do you launch the kernel?

I launch the kernel in the usual way:

Encode<<<grid,threads>>>(da,dst,dcod,dlength);

grid and threads always have the values I gave in my first post, da is already in device memory, and dcod and dlength are copied to the device just before the kernel call…

up.
I found where the problem is… The grid variable changes at runtime, even though I set it as a constant and had already used it once to launch another kernel. Sometimes grid.x > 65,000, sometimes it is the correct 30,000, sometimes it is 257, and so on… What the hell is happening???
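Since the failing launches coincide with a wrong grid.x, it may help to recompute the block count and check it right before the launch. Below is a minimal host-side sketch in plain C++. It assumes the 65,535-block limit on grid.x that applies to devices of compute capability below 3.0 (newer devices allow more). blocksFor, gridXIsValid and kAssumedMaxGridX are hypothetical names, not from the original code.

```cpp
#include <cassert>
#include <cstddef>

// Assumed limit: 65535 blocks in grid.x (compute capability below 3.0).
// Adjust this for the actual device.
constexpr std::size_t kAssumedMaxGridX = 65535;

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= nBytes (ceiling division).
inline std::size_t blocksFor(std::size_t nBytes, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (nBytes + threadsPerBlock - 1) / threadsPerBlock;
}

// True if `blocks` is usable as grid.x under the assumed limit.
inline bool gridXIsValid(std::size_t blocks) {
    return blocks >= 1 && blocks <= kAssumedMaxGridX;
}
```

With the sizes from the first post, blocksFor(1600*1200*sizeof(float), 256) gives 30,000, which passes the check. A value above 65,535 would not, which on such devices would be consistent with the “invalid configuration argument” error. Calling gridXIsValid(grid.x) immediately before Encode<<<grid,threads>>>(...) would show at which point the value stops being 30,000.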