I have seen that link; it doesn't clearly address write performance. GPU hardware could optimize the case where the addresses across threads are sequential, for example by sending just a base address and access type to the SLM unit and transferring all the data, saving bus bandwidth. If the addresses are jumbled across threads, it gets much trickier for the hardware to optimize.
afaik, shared memory has 32 banks, which are essentially just independent memory spaces. Each bank can perform 1 read or 1 write per cycle, independently of what the other banks are doing.
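Just to make the bank mapping concrete (on current NVIDIA parts the banks are 4 bytes wide, so bank = (byte address / 4) mod 32; the helper name here is mine):

```cuda
// Which bank does a shared-memory byte address land in?
// Assumes the default 4-byte bank width and 32 banks.
__host__ __device__ inline int bank_of(unsigned byte_addr) {
    return (byte_addr / 4) % 32;
}
// Consecutive 4-byte words from the 32 threads of a warp map to
// 32 distinct banks, so a warp-wide sequential store is conflict-free.
```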
It's not about SLM bank throughput; that is the same in both cases. For a write, the CUDA cores need to send both an address and data to shared memory, so I'm wondering whether any address optimization happens in the sequential case so that fewer addresses are sent. I just see the same STS instruction generated for both cases. Has anyone seen any variants of the STS instruction?
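For concreteness, here is the kind of minimal pair I mean (a sketch; kernel names and the `perm` permutation table are illustrative, not from any benchmark). You can inspect the SASS for both with `nvcc -cubin -arch=sm_70 test.cu && cuobjdump -sass test.cubin`; in my experience both compile to the same STS opcode, with the addressing pattern living entirely in the per-thread register operand:

```cuda
// Sequential case: thread i stores to word i, so addresses across
// the warp are consecutive (conflict-free, and in principle
// compressible to base + thread ID).
__global__ void store_sequential(float *out, float v) {
    __shared__ float buf[256];
    buf[threadIdx.x] = v;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];   // keep the store live
}

// Jumbled case: each thread stores through an arbitrary permutation,
// so the hardware sees 32 unrelated addresses per warp.
__global__ void store_scrambled(float *out, float v, const int *perm) {
    __shared__ float buf[256];
    buf[perm[threadIdx.x]] = v;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}
```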