I have simplified my kernel so that the data to be minimised are
sequential in a single global int array. The goal is to find the
minimum of each chunk and write it to a global output array.
The chunks are different sizes (although packed small to big).
Each block reads one block-sized run of ints into shared memory and uses a
shared-memory reduction to find the min of every chunk that falls within the block.
This is complicated because of the chunk boundaries but in most cases
the block will contain only one chunk. Finally one or more min int values
are written to global memory using atomicMin. The value returned by atomicMin
is not used and the block terminates. Many global outputs are updated just once,
but typically 20+ blocks contribute to one global output, so each such output
needs 20+ atomicMin updates.
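
To make the description above concrete, here is a minimal sketch of this kind of kernel. It is not the actual kernel: the per-element `chunkOf` array, the names `chunkMinKernel` and `chunkMins`, and the simple doubling-stride pass (used here instead of a tuned tree reduction) are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

#define BLOCK 128  // threads per block, matching the block size mentioned above

// chunkOf[i] is a hypothetical array giving the chunk index of data[i];
// chunks are assumed to be contiguous, so chunkOf is non-decreasing.
__global__ void chunkMinKernel(const int *data, const int *chunkOf, int n, int *out)
{
    __shared__ int val[BLOCK];
    __shared__ int seg[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // One int per thread; padding threads get INT_MAX and chunk id -1.
    val[tid] = (i < n) ? data[i] : INT_MAX;
    seg[tid] = (i < n) ? chunkOf[i] : -1;
    __syncthreads();

    // Doubling-stride pass: only combine with a neighbour in the same chunk.
    // Afterwards the first thread of each chunk holds that chunk's block-local min.
    for (int s = 1; s < blockDim.x; s <<= 1) {
        int other = INT_MAX;
        if (tid + s < blockDim.x && seg[tid + s] == seg[tid])
            other = val[tid + s];
        __syncthreads();              // all reads done before any write
        val[tid] = min(val[tid], other);
        __syncthreads();
    }

    // One atomicMin per chunk present in this block; the return value is ignored.
    bool head = (tid == 0) || (seg[tid] != seg[tid - 1]);
    if (head && seg[tid] >= 0)
        atomicMin(&out[seg[tid]], val[tid]);
}

// Host wrapper (hypothetical): returns one min per chunk.
std::vector<int> chunkMins(const std::vector<int> &data,
                           const std::vector<int> &chunkOf, int numChunks)
{
    std::vector<int> result(numChunks, INT_MAX);
    int n = (int)data.size();
    if (n == 0 || numChunks == 0) return result;

    int *dData = nullptr, *dChunk = nullptr, *dOut = nullptr;
    cudaMalloc((void **)&dData, n * sizeof(int));
    cudaMalloc((void **)&dChunk, n * sizeof(int));
    cudaMalloc((void **)&dOut, numChunks * sizeof(int));
    cudaMemcpy(dData, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dChunk, chunkOf.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    // Outputs start at INT_MAX so that atomicMin can only lower them.
    cudaMemcpy(dOut, result.data(), numChunks * sizeof(int), cudaMemcpyHostToDevice);

    chunkMinKernel<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dData, dChunk, n, dOut);

    cudaMemcpy(result.data(), dOut, numChunks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dChunk);
    cudaFree(dOut);
    return result;
}
```

Reading a second array of chunk ids doubles the global loads, so a real kernel would more likely derive the boundaries from a small table of chunk offsets.
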
I am using a block size of 128 threads with CUDA 7.0 on a GeForce GTX 745,
although I would of course like it to work on any recent GPU.
Since there is no data reuse on input, why does nvvp claim an L2 cache hit rate of 50%?
Would it be better to use zero-copy (mapped pinned) memory rather than an explicit
cudaMemcpy from pinned memory?
(The input array varies from 0 to 16 MBytes, 0 to 3000 chunks.
nvvp claims typical PCIe transfer rates of 12.7GB/s in and 2.9GB/s back to the host.)
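
For reference, the zero-copy alternative asked about above would look roughly like the sketch below. The names `minOfMapped` and `minViaZeroCopy` are hypothetical, chunk boundaries are ignored, and the one-atomicMin-per-element kernel is deliberately naive; the point is only the allocation pattern with `cudaHostAlloc` and `cudaHostGetDevicePointer`.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <cstring>
#include <vector>

// Naive kernel: every thread reads its element straight from mapped host
// memory (over PCIe) and folds it into a single output with atomicMin.
__global__ void minOfMapped(const int *in, int n, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMin(out, in[i]);
}

// Hypothetical host wrapper: min of src, with the input read via zero-copy.
int minViaZeroCopy(const std::vector<int> &src)
{
    int n = (int)src.size();
    int result = INT_MAX;
    if (n == 0) return result;

    int *hIn = nullptr, *dIn = nullptr, *dOut = nullptr;

    // Pinned host buffer that is also mapped into the device address space.
    // Documentation for older toolkits says cudaSetDeviceFlags(cudaDeviceMapHost)
    // must be called before any other CUDA call for this to work.
    cudaHostAlloc((void **)&hIn, n * sizeof(int), cudaHostAllocMapped);
    std::memcpy(hIn, src.data(), n * sizeof(int));

    // Device-side alias of the host buffer; no cudaMemcpy of the input is issued.
    cudaHostGetDevicePointer((void **)&dIn, hIn, 0);

    // The output stays in ordinary device memory.
    cudaMalloc((void **)&dOut, sizeof(int));
    cudaMemcpy(dOut, &result, sizeof(int), cudaMemcpyHostToDevice);

    minOfMapped<<<(n + 127) / 128, 128>>>(dIn, n, dOut);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dOut);
    cudaFreeHost(hIn);
    return result;
}
```

With this pattern each input int crosses PCIe once, during the kernel itself, instead of in a separate copy beforehand.
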
Is shared memory reduction the best way to go?
(I have seen mention of __threadfence() but am not sure how it synchronises
across blocks).
Also I have seen mention of doing reductions with Kepler shuffle rather
than shared memory but:
- it looks horrendous
- there is no shortage of shared memory (using one int per thread)
- the bottleneck seems to be reading global memory and/or atomicMin writes to global
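
For comparison, the shuffle approach mentioned above can be sketched as follows. It ignores chunk boundaries and reduces the whole array to one value, and the names `warpMin`, `shuffleMinKernel` and `arrayMin` are hypothetical. It uses `__shfl_down_sync`, which needs CUDA 9 or later; on CUDA 7.0 the equivalent call would be `__shfl_down(v, offset)` without the mask argument.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

// Min across the 32 lanes of a warp using register shuffles only.
// After the loop, lane 0 holds the warp minimum.
__inline__ __device__ int warpMin(int v)
{
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v = min(v, __shfl_down_sync(0xffffffffu, v, offset));
    return v;
}

// Assumes blockDim.x is a multiple of 32 so that every warp is full.
__global__ void shuffleMinKernel(const int *data, int n, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads still take part in the shuffle, with a neutral value.
    int v = (i < n) ? data[i] : INT_MAX;
    v = warpMin(v);
    // One atomicMin per warp; the return value is ignored.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicMin(out, v);
}

// Hypothetical host wrapper: min of the whole array.
int arrayMin(const std::vector<int> &data)
{
    int n = (int)data.size();
    int result = INT_MAX;
    if (n == 0) return result;

    int *dData = nullptr, *dOut = nullptr;
    cudaMalloc((void **)&dData, n * sizeof(int));
    cudaMalloc((void **)&dOut, sizeof(int));
    cudaMemcpy(dData, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, &result, sizeof(int), cudaMemcpyHostToDevice);

    shuffleMinKernel<<<(n + 127) / 128, 128>>>(dData, n, dOut);

    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dOut);
    return result;
}
```

This version uses no shared memory at all, at the cost of four atomicMin calls per 128-thread block rather than one.
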
nvvp complains of poor use of shared memory (but a typical kernel
has a “shared efficiency” of 87.8%)
nvvp complains of low “Global Load Efficiency” (a typical kernel has 40%)
I have ignored the output of atomicMin. Will the compiler ensure the hardware
does not try to return the unwanted original value? Would this make the
atomic operation more efficient?
I am not sure how to use either nvvp or nvprof to understand or improve
access to global memory, whether reads or atomic writes.
Any help or guidance you can give would be most welcome.
Bill