Partition camping in GTX 680?

I was surprised to see that the partition camping avoidance kernel in the CUDA Samples transpose example is significantly faster than the optimized (coalesced-and-padded) kernel.

This is on a GTX 680 with the transpose kernels recompiled to use a 32x32 tile size and a 32x8 block size.
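For reference, the "optimized" kernel is essentially the coalesced transpose through a padded shared-memory tile from the SDK sample; a sketch with those sizes (TILE_DIM = 32, BLOCK_ROWS = 8; boundary checks and error handling omitted):

```
#define TILE_DIM   32
#define BLOCK_ROWS 8

// Coalesced transpose through a padded shared-memory tile. The +1 column of
// padding removes shared-memory bank conflicts. Launched with a
// (width/TILE_DIM, height/TILE_DIM) grid of (TILE_DIM, BLOCK_ROWS) blocks.
__global__ void transposeCoalesced(float *odata, const float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int xIndex   = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex   = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];

    __syncthreads();

    // Swap the block coordinates for the store so writes are also coalesced.
    xIndex        = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex        = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
}
```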

  • optimized: 109.2 GB/sec
  • diagonal: 123.5 GB/sec

    Note that on a K20c the optimized kernel is always faster than the partition camping avoidance kernel.
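    The "diagonal" (partition camping avoidance) variant has the same body; it only remaps the block coordinates so that blocks resident at the same time scatter their writes across partitions instead of stacking up on one (square-matrix form, as in the transpose whitepaper):

```
// Same transpose body as above, but with diagonal block-coordinate reordering
// so concurrently scheduled blocks write to different partitions.
__global__ void transposeDiagonal(float *odata, const float *idata,
                                  int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    // Diagonal reordering of block indices (square-matrix case).
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    int xIndex   = blockIdx_x * TILE_DIM + threadIdx.x;
    int yIndex   = blockIdx_y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];

    __syncthreads();

    xIndex        = blockIdx_y * TILE_DIM + threadIdx.x;
    yIndex        = blockIdx_x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
}
```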

    So is this just an example of an access pattern that foils the 680’s hashing scheme? Any ideas?

    Excuse my ignorance, but what is ‘partition camping’ in this context?

    Here are a couple excellent descriptions of partition camping:

  • Optimizing Matrix Transpose in CUDA
  • Bounding the Effect of Partition Camping in GPU Kernels
    Past discussions of partition camping focus on the GT200 because Fermi and Kepler (?) hash memory addresses. That’s why I was surprised to see the diagonal reordering of blocks provide a ~13% boost on the GTX 680.
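    To make the GT200-era effect concrete, here is a toy host-side calculation. Everything in it is assumed for illustration: 8 partitions of 256 bytes, simple linear interleaving with no address hashing, a 2048x2048 float matrix, and blocks issued with blockIdx.x varying fastest. It prints which partition the first store of each of the first eight blocks lands in, for row-major ("cartesian") vs. diagonal block ordering:

```
#include <cstdio>

// Toy illustration only -- all hardware numbers are assumptions, not Kepler's
// actual (hashed) mapping.
#define NUM_PARTITIONS   8      // assumed, GT200-like
#define PARTITION_BYTES  256    // assumed, GT200-like
#define N                2048   // matrix dimension in floats
#define TILE             32     // transpose tile dimension

static int partitionOf(long byteAddr)
{
    return (int)((byteAddr / PARTITION_BYTES) % NUM_PARTITIONS);
}

int main()
{
    // Blocks (0,0) .. (7,0) are assumed to be running at the same time.
    for (int bx = 0; bx < 8; ++bx) {
        int by = 0;

        // Cartesian ordering: block (bx,by) first writes output element
        // (row = bx*TILE, col = by*TILE).
        long cart = ((long)bx * TILE * N + (long)by * TILE) * sizeof(float);

        // Diagonal reordering: blockIdx_y = bx, blockIdx_x = (bx+by) % grid.
        int dx = (bx + by) % (N / TILE), dy = bx;
        long diag = ((long)dx * TILE * N + (long)dy * TILE) * sizeof(float);

        printf("block %d: cartesian -> partition %d, diagonal -> partition %d\n",
               bx, partitionOf(cart), partitionOf(diag));
    }
    return 0;   // cartesian column prints the same partition for every block
}
```

    Under the cartesian order every concurrent block lands in the same partition; the diagonal order spreads them over several, which is exactly what the reordering is for.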

    Interesting, thanks for the link.

    Another ignorant question: if most linear algebra library functions (such as Sgemm or Sgeam) take a parameter that lets you transpose either matrix, why would one need to create a separate copy of a matrix in the transposed state?

    Maybe if you needed to calculate A’ x A?

    I think the example is primarily there for educational purposes. As you note, it’s probably rarely done in an entirely standalone way in practice. Someone else in this forum could probably write a novel on when/where/how transpose is performed in CUBLAS, etc.
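    For what it’s worth, if all you need is a one-off transposed copy, cuBLAS can produce it with cublasSgeam by setting alpha = 1, beta = 0, and transa = CUBLAS_OP_T. A minimal sketch (column-major storage, error checking omitted):

```
#include <cublas_v2.h>

// Out-of-place transpose of a rows x cols column-major matrix dA into dAT
// (cols x rows) via C = alpha*op(A) + beta*op(B). Sketch only: the handle and
// device buffers are assumed to be set up by the caller.
void transposeWithGeam(cublasHandle_t handle,
                       const float *dA, float *dAT, int rows, int cols)
{
    const float alpha = 1.0f, beta = 0.0f;

    cublasSgeam(handle,
                CUBLAS_OP_T, CUBLAS_OP_N,
                cols, rows,           // dimensions of the result op(A) = A^T
                &alpha, dA, rows,     // A is rows x cols, lda = rows
                &beta,  dAT, cols,    // B is ignored since beta == 0
                dAT, cols);           // C is cols x rows, ldc = cols
}
```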

    My interest in the primitive is unrelated to linear algebra. I’ve spent a few days focused on the performance of transposing values from lane order to row order in kernels that have, for example, a 32x32 register working set per warp. The problem is the same and, in my case, performance really matters.

    I recently wrote a blog post on a “no shared, no sync” approach to transposing. Performance is pretty good, and being able to transpose without touching shared memory might be useful in some situations.
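    To give a flavor of what “no shared, no sync” can look like (a generic illustration, not necessarily what the blog post does): a 32x32 tile held one row per lane can be transposed with one shuffle per element plus two local rotations.

```
// Warp-level 32x32 transpose using only shuffles: lane t holds row t in
// r[0..31] on entry and column t on exit. Written for clarity -- the dynamic
// local-array indexing will spill to local memory unless the layout is
// specialized, so treat it as a sketch rather than tuned code. On pre-CUDA 9
// toolkits __shfl_sync(mask, ...) was plain __shfl(...).
__device__ void warpTranspose32(float r[32])
{
    const unsigned FULL_MASK = 0xffffffffu;
    const int lane = threadIdx.x & 31;

    float s[32], u[32];

    // 1) Skew: rotate each lane's row left by its lane id.
    #pragma unroll
    for (int j = 0; j < 32; ++j)
        s[j] = r[(j + lane) & 31];

    // 2) Shuffle: element j of lane t comes from lane (t - j) mod 32.
    #pragma unroll
    for (int j = 0; j < 32; ++j)
        u[j] = __shfl_sync(FULL_MASK, s[j], (lane - j) & 31);

    // 3) Unskew: gather back so r[j] ends up holding element (j, lane), i.e.
    //    this lane's column of the original tile.
    #pragma unroll
    for (int j = 0; j < 32; ++j)
        r[j] = u[(lane - j) & 31];
}
```

    The shuffle route is mainly attractive when the data already lives in registers, as in the 32x32-per-warp working set described above; otherwise the shared-memory tile does the same job.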

    For this reason, I’d like to know when partition camping might occur on Kepler.

    bump

    Any ideas on why the GTX 680 would see +13% throughput when using the partition camping-avoiding “diagonal” transpose kernel?

    I seem to remember Volkov showing that minor partition camping effects were still visible on Fermi; perhaps this is what you are seeing on Kepler?

    Volkov wrote a post on these forums about it (link?)…

    By the way, kudos on a creative approach to matrix transpose!

    I will look for that forum post.

    If this actually is partition camping (vs. something else) then one idea is to try to discover the actual hash (or XOR?) being used by running a carefully built memory-bound kernel and monitoring its performance.
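    Something along these lines, hypothetically (all names and parameters below are made up for the sketch): each block streams stores at a host-chosen stride from its neighbors, and the host sweeps the stride while watching bandwidth; strides the hash fails to spread across channels should show up as throughput dips.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each block streams `chunk` floats starting at blockIdx.x * strideElems.
// Without address hashing, strides that alias every block onto one partition
// collapse bandwidth; with hashing, dips (if any) hint at what the hash misses.
__global__ void streamWithStride(float *out, size_t strideElems, int chunk)
{
    float *base = out + (size_t)blockIdx.x * strideElems;
    for (int i = threadIdx.x; i < chunk; i += blockDim.x)
        base[i] = (float)i;   // fully coalesced streaming stores
}

int main()
{
    const int    blocks = 64, threads = 256, chunk = 1 << 16, reps = 200;
    const size_t maxStride = 1 << 19;   // up to 2 MB between block base addresses
    float *d_out;
    cudaMalloc(&d_out, (size_t)blocks * maxStride * sizeof(float));

    // Power-of-two sweep for brevity; a real probe would step in
    // partition-width-sized increments to resolve the pattern.
    for (size_t stride = chunk; stride <= maxStride; stride *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);

        cudaEventRecord(t0);
        for (int r = 0; r < reps; ++r)
            streamWithStride<<<blocks, threads>>>(d_out, stride, chunk);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double gbytes = (double)reps * blocks * chunk * sizeof(float) / 1e9;
        printf("stride %8zu floats: %6.1f GB/s\n", stride, gbytes / (ms / 1e3));

        cudaEventDestroy(t0); cudaEventDestroy(t1);
    }
    cudaFree(d_out);
    return 0;
}
```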

    Re: transpose – Thanks! I wrote a bunch of other variants too, including ones that simply use permuted but coalesced loads/stores to avoid shuffles entirely. Starting with values already in registers was the original use case, though, and that requires SHFLs. I will also try out a slightly more complex SHFL transpose that achieves 64-byte memory transactions for 32-bit words and 128-byte transactions for 64-bit words.

    I wrote some test codes several years ago aimed at determining the number of partitions on a certain card.

    If I remember correctly, the approach was to transpose large matrices while varying the matrix dimension so the rows mapped over e.g. N or N+1 partitions; the partition camping effect would then become apparent at multiples of the memory partition count… I think :-)
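    A sketch of that kind of harness (hypothetical; it reuses the transposeCoalesced kernel sketched earlier in the thread): step the matrix width one tile at a time and look for dips at row pitches that are a multiple of partition count × partition width.

```
#include <cstdio>
#include <cuda_runtime.h>

// Assumes TILE_DIM, BLOCK_ROWS and transposeCoalesced from the earlier sketch.
static void timeTransposeForWidth(int width)
{
    const int height = width, reps = 100;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  (size_t)width * height * sizeof(float));
    cudaMalloc(&d_out, (size_t)width * height * sizeof(float));

    dim3 grid(width / TILE_DIM, height / TILE_DIM), block(TILE_DIM, BLOCK_ROWS);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int r = 0; r < reps; ++r)
        transposeCoalesced<<<grid, block>>>(d_out, d_in, width, height);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double gbytes = 2.0 * reps * width * height * sizeof(float) / 1e9;  // read + write
    printf("width %5d: %6.1f GB/s\n", width, gbytes / (ms / 1e3));

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFree(d_in); cudaFree(d_out);
}

int main()
{
    // On hardware with K partitions of 256 bytes and no hashing, widths whose
    // row pitch is a multiple of K * 256 bytes should stand out.
    for (int width = 1024; width <= 4096; width += TILE_DIM)
        timeTransposeForWidth(width);
    return 0;
}
```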

    I observed the same thing this week while working on transpose on Kepler (GTX 680). The kernel is significantly faster with diagonal block reordering, despite global address hashing.