Dissecting instruction replay overhead in Kepler

According to @Greg Smith's answer, instruction replay overhead can be broken down into the following events and metrics:

events:
	global_ld_mem_divergence_replays
	global_st_mem_divergence_replays
	shared_load_replay
	shared_store_replay
metrics:
	atomic_replay_overhead
	global_cache_replay_overhead
	global_replay_overhead
	local_replay_overhead
	shared_replay_overhead
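For reference, these counters can be requested from nvprof explicitly (a sketch; `./vectorAdd` is a placeholder for the benchmark binary, and this of course needs a CUDA-capable GPU):

```
nvprof --events global_ld_mem_divergence_replays,global_st_mem_divergence_replays,shared_load_replay,shared_store_replay \
       --metrics atomic_replay_overhead,global_cache_replay_overhead,global_replay_overhead,local_replay_overhead,shared_replay_overhead \
       ./vectorAdd
```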

To my surprise, all of those events and metrics are zero in my nvprof results, even though the instruction replay overhead is quite high.
Below are two simple vector-addition benchmarks: the first uses global memory, while the second uses constant memory.
https://gist.github.com/StevenHuang4321/64652656209d643bebf2b60f7a7f25ea
https://gist.github.com/StevenHuang4321/7ecaed41d16ea48523f60eab987c7f68

The nvprof results can be found here:
https://docs.google.com/spreadsheets/d/1L8rjc9mXInabTn2UzpXN6CmR68mpPlPARR0guHo6_wc/edit?usp=sharing

Reference answer from Greg Smith:
http://stackoverflow.com/questions/35566178/instruction-replay-in-cuda/35593124#35593124

  1. Constant memory is single-banked and optimized for uniform access, where all threads in a warp read the same address; divergent accesses are serialized, which is the reason for the high replay count in the second (constant-memory) case.
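To illustrate the difference (a sketch, not the benchmark code; kernel names and the array size are made up here):

```cuda
__constant__ float c[256];

// All threads in the warp read c[0]: one broadcast, no replays expected.
__global__ void uniformRead(float *out) {
    out[threadIdx.x] = c[0];
}

// Each thread reads a different constant-memory address: the warp's
// access is serialized, which shows up as instruction replay overhead.
__global__ void divergentRead(float *out) {
    out[threadIdx.x] = c[threadIdx.x];
}
```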

  2. In the first case, the reason may be the small size of the kernel. Try adding a million numbers, processing at least a hundred elements in each thread.
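A common way to give each thread more work is a grid-stride loop (a sketch, assuming the same `c = a + b` layout as the benchmarks; kernel name and launch configuration are illustrative):

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // Grid-stride loop: each thread handles many elements instead of one,
    // so a fixed-size grid can cover an arbitrarily large array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        c[i] = a[i] + b[i];
    }
}
```

For example, launching `vecAdd<<<128, 256>>>(a, b, c, 1 << 20)` gives each of the 32768 threads 32 elements to process.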