Binary Arithmetic
Hi, I'm having a problem with the following code. When I time it, it sometimes runs in 10 ms, sometimes in 0.1 ms, and sometimes it fails with an "invalid configuration argument" error. I run this kernel with 256 threads per block and (1600*1200*sizeof(float)/256) blocks (30,000 blocks). The purpose of the kernel is to translate each byte of Src into a sequence of bits. The cod array has 256*80 entries and length has 256. To encode bit by bit I must use bit operations, so I set a mask and use OR and AND operations to set the specific bit. The next step will be to copy the result to the array Dst, but first I need to know why the kernel behaves so differently from run to run.
Thank you all, A.
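[code]__global__ void Encode(float *Src, unsigned char *Dst, bool *cod, unsigned int *length){
    const int tid = threadIdx.x;
    const int bid = blockIdx.x*blockDim.x;
    unsigned char mycod[10] = {0};            // up to 80 code bits, packed 8 per byte
    unsigned char mask;
    unsigned char *c = (unsigned char*)Src;   // treat Src as raw bytes
    unsigned char curr = c[tid+bid];          // the byte this thread encodes
    unsigned int len = length[curr];          // code length in bits for that byte
    //#pragma unroll
    for(unsigned int i=0; i<len; i++){
        mask = 1;
        mask <<= i%8;                         // bit position inside the byte
        if(cod[i+80*curr]){
            mycod[i/8] |= mask;               // set the bit; i/8 selects the byte
        }
        else{
            mask = ~mask;
            mycod[i/8] &= mask;               // clear the bit
        }
    }
}
[/code]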
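The bit packing described in this post can be checked on the CPU before looking at launch problems. The sketch below is a hypothetical helper (`pack_bits` is not a name from the original program) and assumes that bit i of a code is meant to land in bit i % 8 of byte i / 8. It is marked `__host__ __device__` so the same function could be called from a kernel or from a host-side test.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper, not from the original post.
// Packs `len` code bits (one bool per bit) into bytes:
// bit i is stored in bit (i % 8) of byte (i / 8).
// `out` must hold at least (len + 7) / 8 bytes. Bits past `len`
// in the last byte are left untouched.
__host__ __device__ inline void pack_bits(const bool *code, unsigned int len,
                                          unsigned char *out)
{
    for (unsigned int i = 0; i < len; ++i) {
        unsigned char mask = (unsigned char)(1u << (i % 8));
        if (code[i]) {
            out[i / 8] |= mask;                  // OR sets the bit
        } else {
            out[i / 8] &= (unsigned char)~mask;  // AND with the inverted mask clears it
        }
    }
}
```

Calling it on the host with a short, known bit pattern and comparing the resulting bytes by hand is a quick way to confirm the layout before the result is copied to Dst.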

#1
Posted 04/10/2012 10:08 AM   
What happens if you run your program under cuda-memcheck?

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 04/10/2012 10:49 AM   
[quote name='tera' date='10 April 2012 - 12:49 PM' timestamp='1334054981' post='1394184']
What happens if you run your program under cuda-memcheck?
[/quote]
Thanks for the reply, but:
========= ERROR SUMMARY: 0 errors
always, even when the run gives the "invalid configuration argument" error...

#3
Posted 04/10/2012 10:59 AM   
Hmm. How do you launch the kernel?

#4
Posted 04/11/2012 12:14 PM   
I launch the kernel in the usual way: [code]Encode<<<grid,threads>>>(da,dst,dcod,dlength);[/code] grid and threads always have the values I gave in the first post, da is already in device memory, and dcod and dlength are copied just before the kernel call...
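A launch like this reports nothing by itself, so a failed launch can look like a very fast run. The following is a minimal sketch of a checked launch, in the spirit of the advice earlier in the thread to always check return codes. `CopyBytes`, `blocks_for` and `launch_checked` are hypothetical names, and the kernel is a stand-in rather than the Encode kernel from the first post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (not the Encode kernel from the thread): copies n bytes.
__global__ void CopyBytes(const unsigned char *src, unsigned char *dst, size_t n)
{
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dst[idx] = src[idx];
    }
}

// Number of blocks needed to cover n elements with `threads` threads per block.
unsigned int blocks_for(size_t n, unsigned int threads)
{
    return (unsigned int)((n + threads - 1) / threads);
}

// Hypothetical helper: launches CopyBytes on device pointers src and dst
// and returns the first error seen, printing it together with the launch shape.
cudaError_t launch_checked(const unsigned char *src, unsigned char *dst, size_t n)
{
    // Build the launch shape right before the launch, as local constants.
    const dim3 threads(256);
    const dim3 grid(blocks_for(n, threads.x));

    CopyBytes<<<grid, threads>>>(src, dst, n);

    // Launch errors such as "invalid configuration argument" are reported here.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Errors raised while the kernel runs are reported after a sync.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CopyBytes<<<%u,%u>>> failed: %s\n",
                grid.x, threads.x, cudaGetErrorString(err));
    }
    return err;
}
```

Printing grid.x and threads.x together with the error string shows at once whether the launch shape is the one that was intended. For the sizes in the first post, `blocks_for(1600 * 1200 * sizeof(float), 256)` gives 30000.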

#5
Posted 04/12/2012 07:33 AM   
up.
I found where the problem is... The grid variable changes at runtime even though I set it as a constant, and even though I had already used it once to call another kernel. Sometimes grid.x > 65,000, sometimes it is the correct 30,000, other times it is 257, and so on... What the hell is happening???
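On GPUs of compute capability 2.x and earlier, each grid dimension is limited to 65535 blocks, so a grid.x above that limit would be consistent with the "invalid configuration argument" error. Why the variable changes is not established in this thread. One possibility, stated only as an assumption, is that it is uninitialized or that host memory around it is being overwritten. Whatever the cause, the launch shape can be validated just before the launch. The sketch below uses hypothetical helper names (`config_fits`, `config_fits_device`) and checks only the grid dimensions and the total threads per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pure host check: does a launch shape fit the given limits?
// maxGrid holds the per-dimension grid limits (x, y, z).
bool config_fits(dim3 grid, dim3 threads, const int maxGrid[3], int maxThreadsPerBlock)
{
    if (grid.x == 0 || grid.y == 0 || grid.z == 0) {
        return false;
    }
    if (grid.x > (unsigned int)maxGrid[0] ||
        grid.y > (unsigned int)maxGrid[1] ||
        grid.z > (unsigned int)maxGrid[2]) {
        return false;
    }
    unsigned long long perBlock =
        (unsigned long long)threads.x * threads.y * threads.z;
    return perBlock >= 1 && perBlock <= (unsigned long long)maxThreadsPerBlock;
}

// Hypothetical helper: checks a launch shape against the limits reported
// by the runtime for `device`, and prints the shape if it does not fit.
bool config_fits_device(dim3 grid, dim3 threads, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    bool ok = config_fits(grid, threads, prop.maxGridSize, prop.maxThreadsPerBlock);
    if (!ok) {
        fprintf(stderr, "bad launch shape: grid=(%u,%u,%u) threads=(%u,%u,%u)\n",
                grid.x, grid.y, grid.z, threads.x, threads.y, threads.z);
    }
    return ok;
}
```

Calling `config_fits_device(grid, threads, 0)` immediately before each launch, and building grid as a local `const dim3` at that point, narrows down where the value goes wrong if it has been changed elsewhere in the program.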

#6
Posted 04/12/2012 09:21 AM   