When I made my comment here:
https://devtalk.nvidia.com/default/topic/1028130/cuda-programming-and-performance/best-way-to-find-many-minimums/post/5229816/#5229816
I meant that you should do this:
__global__ void
kernel(
    const int i, const int turn, const int length,
    const int* fml_i, const int* fml_j, // In
    int* dml) {                         // Out
  const int x = blockIdx.x;
  const int j = x + (i + 2*(turn+1)) + 1;
  int y = threadIdx.x;
  int thread = j*(j-1)/2 + threadIdx.x + i + (turn+1) + 1;
  __shared__ int en[BLOCK_SIZE];
  en[threadIdx.x] = INF;
  for (; y <= x; thread += blockDim.x, y += blockDim.x) {
    //assert(x>=0 && x<=length);
    //assert(y>=0 && y<=length);
    //assert(y<=x);
    int temp = ((fml_i[y] != INF) && (fml_j[thread] != INF)) ? fml_i[y] + fml_j[thread] : INF;
    en[threadIdx.x] = MIN2(en[threadIdx.x], temp);
  } //endfor
  // Use reduction; requires a power-of-two block size
#if BLOCK_SIZE > 32
#define SYNC32 __syncthreads()
#else
#define SYNC32
  // this is a defective case
#endif
  int ix = threadIdx.x;
  const int ix_stop = MIN2(x-y+threadIdx.x, blockDim.x - 1);
  //assert(ix_stop >= 0 && ix_stop < blockDim.x);
  //assert(en[ix] > -INF && en[ix] <= INF);
  //assert(en[ix] != 0); //for testing only
#if BLOCK_SIZE >= 1024
  __syncthreads(); if (ix+512 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+512]);
#endif
#if BLOCK_SIZE >= 512
  __syncthreads(); if (ix+256 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+256]);
#endif
#if BLOCK_SIZE >= 256
  __syncthreads(); if (ix+128 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+128]);
#endif
#if BLOCK_SIZE >= 128
  __syncthreads(); if (ix+ 64 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 64]);
#endif
#if BLOCK_SIZE >= 64
  __syncthreads(); if (ix+ 32 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 32]);
#endif
  SYNC32; if (ix+ 16 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 16]);
  SYNC32; if (ix+  8 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  8]);
  SYNC32; if (ix+  4 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  4]);
  SYNC32; if (ix+  2 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  2]);
  SYNC32; if (ix+  1 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+  1]);
  SYNC32;
  if (threadIdx.x == 0) {
    dml[j] = en[0];
  }
}
Allow the for loop to stride through memory, performing your reduction along the way.
Only do the shared-memory sweep ONCE, at the end of the processing.
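To show that pattern in isolation (a hedged sketch, not OP's kernel: the input name `in`, length `n`, output `out`, and the `MIN2`/`INF`/`BLOCK_SIZE` macros from OP's code are assumed), each thread folds its strided slice into one shared-memory slot, and only then does the single tree sweep run:

```cuda
// Sketch only: one block reduces in[0..n-1] down to out[0].
// Assumes BLOCK_SIZE is a power of two and MIN2/INF are defined as in OP's code.
__global__ void min_reduce(const int* in, int n, int* out) {
  __shared__ int en[BLOCK_SIZE];
  int t = threadIdx.x;
  en[t] = INF;
  // Stride through memory, reducing along the way:
  for (int idx = t; idx < n; idx += blockDim.x)
    en[t] = MIN2(en[t], in[idx]);
  __syncthreads();
  // One shared-memory sweep at the end:
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (t < s) en[t] = MIN2(en[t], en[t + s]);
    __syncthreads();
  }
  if (t == 0) out[0] = en[0];
}
```

The key point is that the per-thread loop does O(n / blockDim.x) work with no synchronization at all; only the final log2(blockDim.x) sweep steps need `__syncthreads()`.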
By the way, I’m not suggesting this is an exhaustive treatment of “how to optimize this code”. I’m focusing on the one issue I previously mentioned, and the code modification above is only intended to clarify that one concept. There are likely many additional optimization possibilities.
The code I posted (adapted from OP’s) also contains a bug in the warp-synchronous reduction path for BLOCK_SIZE <= 32. In that case the code dispenses with __syncthreads(), which is only safe if the shared-memory pointer used at the warp level is marked volatile. The code as posted does not do that, so it may break in this case. I would not use the code as-is when BLOCK_SIZE is 32 or less.
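One way to repair that case (a sketch, untested by me, following the classic warp-synchronous reduction idiom) is to perform the warp-level steps through a volatile pointer, so the compiler cannot keep en[] values cached in registers between steps:

```cuda
// Hypothetical warp-level tail of the reduction (assumes BLOCK_SIZE >= 64;
// drop any step whose offset exceeds the block size).
// The volatile qualifier forces every read/write of en[] to actually go
// through shared memory, which the __syncthreads()-free steps depend on.
if (threadIdx.x < 32) {
  volatile int* ven = en;
  ven[threadIdx.x] = MIN2(ven[threadIdx.x], ven[threadIdx.x + 32]);
  ven[threadIdx.x] = MIN2(ven[threadIdx.x], ven[threadIdx.x + 16]);
  ven[threadIdx.x] = MIN2(ven[threadIdx.x], ven[threadIdx.x +  8]);
  ven[threadIdx.x] = MIN2(ven[threadIdx.x], ven[threadIdx.x +  4]);
  ven[threadIdx.x] = MIN2(ven[threadIdx.x], ven[threadIdx.x +  2]);
  ven[threadIdx.x] = MIN2(ven[threadIdx.x], ven[threadIdx.x +  1]);
}
```

On CUDA 9 and later, inserting __syncwarp() between the warp-level steps is the documented alternative to relying on volatile.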
Apart from that known issue, the code is entirely untested by me; it is merely a mechanical transformation of the code supplied by OP, intended to illustrate one particular concept. It may have any number of defects. Use it at your own risk.