I have seen that link; it doesn't clearly address write performance. GPU hardware could optimize the case where the addresses across threads are sequential, for example by sending just a base address and access type to the SLM unit and transferring all the data, saving bus bandwidth. If the addresses are jumbled across threads, it gets much trickier for the hardware to optimize.
afaik, shared memory has 32 banks, which are essentially just independent memory spaces. Each bank can perform 1 read or 1 write per cycle, independently of what the other banks are doing.
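Just to make the bank mapping concrete (on current NVIDIA parts the banks are 4 bytes wide, so bank = (byte address / 4) mod 32; the helper name here is mine):

```cuda
// Which bank does a shared-memory byte address land in?
// Assumes the default 4-byte bank width and 32 banks.
__host__ __device__ inline int bank_of(unsigned byte_addr) {
    return (byte_addr / 4) % 32;
}
// Consecutive 4-byte words from the 32 threads of a warp map to
// 32 distinct banks, so a warp-wide sequential store is conflict-free.
```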
It's not about SLM bank throughput; that is the same in both cases. For a write, the CUDA cores need to send both an address and data to shared memory, so I'm wondering whether any address optimization happens in the sequential case so that fewer addresses are sent. I just see the same STS instruction generated for both cases. Has anyone seen any variants of the STS instruction?
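For concreteness, here is the kind of minimal pair I mean (a sketch; kernel names and the `perm` permutation table are illustrative, not from any benchmark). You can inspect the SASS for both with `nvcc -cubin -arch=sm_70 test.cu && cuobjdump -sass test.cubin`; in my experience both compile to the same STS opcode, with the addressing pattern living entirely in the per-thread register operand:

```cuda
// Sequential case: thread i stores to word i, so addresses across
// the warp are consecutive (conflict-free, and in principle
// compressible to base + thread ID).
__global__ void store_sequential(float *out, float v) {
    __shared__ float buf[256];
    buf[threadIdx.x] = v;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];   // keep the store live
}

// Jumbled case: each thread stores through an arbitrary permutation,
// so the hardware sees 32 unrelated addresses per warp.
__global__ void store_scrambled(float *out, float v, const int *perm) {
    __shared__ float buf[256];
    buf[perm[threadIdx.x]] = v;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}
```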