What does the "shared_efficiency" really mean?

I wrote a CUDA program to accelerate matrix multiplication.

Profiling it with nvprof gave the following results:

gst_efficiency: 100%
gld_efficiency: 100%
shared_efficiency: 30.77%
shared_st_bank_conflicts: 0
shared_ld_bank_conflicts: 0

I wonder why the shared efficiency is not 100% when there are no bank conflicts.

Is it caused by the broadcast when threads access the same address in SMEM?

Does anyone know how to achieve 100% shared_efficiency?

I am sure this is bank conflicts not back conflicts ;)

To be honest I have no idea how shared efficiency can be low without showing any bank conflicts.

Maybe the bank conflict performance counter was not available during the profiling run, or not enabled?

Do any of these performance counters show > 0 values?

shared_replay_overhead
shared_load_replay
shared_store_replay

Thanks for the correction.

But there are no such metrics on cc 6.x devices, as described below.

shared_efficiency = Ratio of requested shared memory throughput to required shared memory throughput expressed as percentage

The numerator is collected using a shader patch to determine number of bytes requested. This takes into account if threads are active (and I believe predicated true).

The denominator is the total number of cycles in which data is returned from shared memory, multiplied by the width of the interface.

On the Kepler architecture the shared memory return path is 256 B wide, and that full width can only be used if the kernel runs in 8 B bank mode and executes 64-bit or wider accesses. If the kernel runs in 4 B bank mode, the maximum efficiency may be limited to 50%. On the Maxwell through Turing architectures the bank mode is fixed, and all instruction widths should be able to achieve full efficiency.
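To make the Kepler 50% figure concrete, here is the definition written as a formula. This is only a sketch of the description above: the symbols $B_{\text{req}}$ (bytes requested by active threads), $N_{\text{cyc}}$ (cycles in which data is returned) and $W$ (interface width in bytes) are labels I am introducing, not profiler names, and the worked cases assume a conflict-free warp-wide load that returns in a single cycle.

$$
\text{shared\_efficiency} = \frac{B_{\text{req}}}{N_{\text{cyc}} \cdot W} \times 100\%
$$

$$
\text{Kepler, 4 B bank mode, 32-bit loads:}\quad \frac{32 \times 4\ \text{B}}{1 \times 256\ \text{B}} = 50\%
$$

$$
\text{Kepler, 8 B bank mode, 64-bit loads:}\quad \frac{32 \times 8\ \text{B}}{1 \times 256\ \text{B}} = 100\%
$$

Under this reading, anything that lowers the bytes requested per returned cycle (inactive threads, narrow accesses) lowers the metric even when no bank conflict is counted.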


I wonder if the topic starter came to a conclusion 5 years ago. I am profiling kernels on an old GP104 with nvprof and see exactly these symptoms: shared memory utilization is high (9-10 points), shared memory efficiency is low (30%), and there are no bank conflicts (shared_st_bank_conflicts=shared_ld_bank_conflicts=0). I do not understand why shared memory efficiency is not 100%, given what @Greg said. Is it because I have many broadcast/multicast accesses? Thank you.

Efficiency probably means something like bytes used/bytes requested.

One can imagine access patterns that have relatively low efficiency with no bank conflicts: for example, only one thread in a warp requests a value, or each thread in a warp requests only a byte.
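A minimal sketch of those conjectured patterns, in case it helps to experiment. The kernel names, the block size of 128 and the `(threadIdx.x + 1) % kThreads` read index are all my own illustrative choices, not taken from anyone's code in this thread. I have not profiled this on any particular GPU, so treat the expectation (lower shared_efficiency for the second and third kernels, with zero bank conflicts in all three) as a hypothesis to check rather than a measured result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // 4 warps per block (illustrative choice)

// Pattern A: every thread loads one 4-byte word from consecutive addresses.
__global__ void fullWarpWords(float* out) {
    __shared__ float tile[kThreads];
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    // Read a neighbour's element so the load really goes through shared memory.
    out[blockIdx.x * kThreads + threadIdx.x] = tile[(threadIdx.x + 1) % kThreads];
}

// Pattern B: only lane 0 of each warp loads from shared memory.
// No bank conflict is possible, but few bytes are requested per access.
__global__ void oneLanePerWarp(float* out) {
    __shared__ float tile[kThreads];
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    float v = 0.0f;
    if (threadIdx.x % 32 == 0) {
        v = tile[(threadIdx.x + 1) % kThreads];
    }
    out[blockIdx.x * kThreads + threadIdx.x] = v;
}

// Pattern C: every thread loads a single byte. Four neighbouring threads
// touch the same 32-bit word, which does not count as a bank conflict.
__global__ void bytePerThread(unsigned char* out) {
    __shared__ unsigned char tile[kThreads];
    tile[threadIdx.x] = static_cast<unsigned char>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * kThreads + threadIdx.x] = tile[(threadIdx.x + 1) % kThreads];
}

int main() {
    const int blocks = 64;
    const int n = blocks * kThreads;

    float* dFloat = nullptr;
    unsigned char* dByte = nullptr;
    if (cudaMalloc(&dFloat, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dByte, n * sizeof(unsigned char)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fullWarpWords<<<blocks, kThreads>>>(dFloat);
    oneLanePerWarp<<<blocks, kThreads>>>(dFloat);
    bytePerThread<<<blocks, kThreads>>>(dByte);

    // Wait for all three kernels and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(dFloat);
    cudaFree(dByte);
    std::printf("all kernels finished\n");
    return 0;
}
```

Built with nvcc and run under something like `nvprof --metrics shared_efficiency,shared_ld_bank_conflicts,shared_st_bank_conflicts ./a.out`, this gives one row per kernel to compare. Note that each kernel also performs a full-warp store to fill the tile, and I am assuming the metric combines loads and stores, so the reported value would be a blend of both.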

My suggestion would be to investigate the actual access pattern, and then the reason for low efficiency is likely to become clear. Also, for profiler-specific questions, we have dedicated forums for those.
