As the title states, I’m having an issue where the kernel seems to hang for a few seconds when I access an array under certain conditions. The problem only appears on my Jetson TK1 board with CUDA 6.5; I’ve tried the same code on both Fermi and Kepler desktop GPUs, and they work fine.
Here’s the problem code:
#include <cstdio>

#define THREADS_PER_BLOCK 512

using namespace std;

__global__ void kernel(float* mask, int cols, int rows) {
    int item = 0;
    float c;
    int limi = 2500;
    int limk = 102;
    for (int i = 0; i < limi; i++) {
        for (int k = 0; k < limk; k++) {
            c = mask[i];
            item--;        // was "item = item--;", which is undefined behavior
            if (c == 0)
                item = i - 1;
        }
    }
    item = item;           // no-op; kept from the original repro
}

void kernel_caller(float* data_h, int rows, int cols) {
    float* data_d;
    cudaMalloc((void**)&data_d, cols * rows * sizeof(float));
    cudaDeviceSynchronize();
    cudaMemcpy(data_d, data_h, cols * rows * sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    dim3 gridDim((cols * rows) / THREADS_PER_BLOCK + 1, 1, 1);
    dim3 blockDim(THREADS_PER_BLOCK, 1, 1);

    printf("Running Kernel...\n");
    kernel<<<gridDim, blockDim>>>(data_d, cols, rows);
    cudaDeviceSynchronize();
    printf("Kernel complete!\n");

    cudaFree(data_d);
}

int main() {
    int rows = 2800;
    int cols = 5000;
    float* data = new float[rows * cols];
    for (int i = 0; i < rows * cols; i++)
        data[i] = 0;
    kernel_caller(data, rows, cols);
    delete[] data;
    return 0;
}
Yes, I know this code does nothing useful; it’s simply a simplified version of the code I’m working on that still exhibits the same issue. For some reason, once limk becomes sufficiently large, the kernel hangs for 4-18 seconds on execution. Has anyone encountered a similar issue, or can you see anything wrong with my code?