atomicAdd memory access patterns on P100

This question is about the performance of atomicAdd vis-à-vis the memory access patterns such atomicAdds generate on a P100.

I have a kernel that guarantees coalesced atomicAdds of a certain width. Results show that 256-bit access patterns (single or double precision) give the best results on a P100. Why is that, and not 128 bits? I also noticed that groups of 256-bit atomicAdds scattered over a large area of memory are less performant than when the area in question is smaller. Is this some form of partition camping?

Regards
Daniel

It’s not clear to me what you mean by “256-bit access patterns”. Could you show a worked example? What exactly do you mean by “best results”? Why do you expect “best result” with 128-bit access patterns?

My understanding is that NVIDIA GPUs work best when warps access full 128-bit words, so I am making sure that a group of neighbouring threads in a warp do atomicAdds on consecutive memory addresses.
Now say the groups are made up of two threads each and the atomicAdds are double precision. Thread 0 and Thread 1 do atomicAdds on doubles next to each other in memory; Thread 2 and Thread 3 do the same somewhere else, etc. With this configuration I get a given performance. If my groups now have four members (so each group accesses a consecutive, 256-bit-aligned region) I get better performance. Why is that?

I can’t give you my code as it does not make sense on its own, but I’ll try to create a working example if it is still not clear.
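In the meantime, here is a minimal sketch of the access pattern I mean (not my actual kernel; all names are made up for illustration, and it needs -arch=sm_60 or later for double-precision atomicAdd):

```
// GROUP is the number of neighbouring threads that hit consecutive doubles.
// base[] holds one starting offset per group and is what scatters the groups
// over a larger or smaller area of memory.
template <int GROUP>
__global__ void grouped_atomic_add(double *out, const int *base, int nGroups)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int group = tid / GROUP;   // which group this thread belongs to
    int lane  = tid % GROUP;   // position within the group
    if (group >= nGroups) return;

    // The GROUP threads of one group add to GROUP consecutive doubles,
    // so with GROUP == 4 each group covers a 32-byte (256-bit) span.
    atomicAdd(&out[base[group] + lane], 1.0);
}
```

With GROUP == 2 I see one level of performance, with GROUP == 4 a better one.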

Thanks for the reply
D

I’m not really sure what you mean by “if my groups now have four members…”. However:

A DRAM memory segment is 32 bytes.

That’s all anybody gets to ask for, ever, under any circumstances. You want a byte? You get 32. You want to do atomics? You get (at least) 32. You want to write something to DRAM? Give me the full 32 bytes that belong in that segment. You want 1 byte at the end of segment 0 and another byte at the beginning of segment 1? That’s going to require reading 2 full segments, i.e. 64 bytes fetched from DRAM. (Of course, this discussion assumes the data is not already in the L2.)

The L2 manages all this for you, but it stands to reason that if the resolution of a set of atomics requires access to 2 separate segments in DRAM, that is going to be slower than a set of atomics that can all be resolved in a single DRAM segment.
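As a back-of-the-envelope illustration (host code, hypothetical addresses), you can count how many 32-byte segments a run of atomicAdd targets spans:

```
#include <stdio.h>

// Number of 32-byte DRAM segments covered by a run of bytes
// starting at first_byte and extending for n_bytes.
int segments_touched(size_t first_byte, size_t n_bytes)
{
    size_t first_seg = first_byte / 32;
    size_t last_seg  = (first_byte + n_bytes - 1) / 32;
    return (int)(last_seg - first_seg + 1);
}

int main(void)
{
    // 4 doubles (32 bytes) starting on a 32-byte boundary: 1 segment.
    printf("%d\n", segments_touched(0, 32));
    // 2 doubles (16 bytes) straddling a segment boundary: 2 segments.
    printf("%d\n", segments_touched(24, 16));
    return 0;
}
```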

Detailed descriptions of how exactly atomics are resolved in the L2 cache are unpublished, unspecified, and I think you’re unlikely to get a full treatment of the details by asking here. Microbenchmarking is the approach most folks take to discover these things, if necessary. Anything you learn today could change tomorrow, in the next CUDA version, or in the next GPU.
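If you do go the microbenchmarking route, a cudaEvent timing harness around a kernel like the one sketched earlier in the thread is usually enough to compare group widths (sketch only; error checking omitted, and grouped_atomic_add is the hypothetical kernel from the post above):

```
// Times one launch of the hypothetical grouped_atomic_add kernel, in milliseconds.
float time_grouped_adds(double *d_out, const int *d_base, int nGroups, int nThreads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    grouped_atomic_add<4><<<(nThreads + 255) / 256, 256>>>(d_out, d_base, nGroups);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Run it with the same total number of atomics but different group widths and scatter ranges, and average over several launches.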