Is 3-address FFMA faster than 4-address FFMA?

On GK110 in single precision I can reach upwards of ~4 TFLOP/s in artificial benchmarking code where three of the four register arguments to FFMA are identical (a = a*b+a).

With two identical registers (c = a*b+c) I get to about 3 TFLOP/s.

In real-world code ptxas somehow decides to randomly shuffle data between registers, so I end up with four different registers (d = a*b+c) and only get to about 2 TFLOP/s.
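
For reference, this is roughly the kind of benchmark loop I mean (a stripped-down sketch; the accumulator count, unroll factor and launch configuration are arbitrary placeholders, not my actual code):

```
__global__ void ffma_bench(const float *in, float *out, int iters)
{
    // Several independent accumulators so the FFMAs can overlap.
    float a0 = in[threadIdx.x], a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;
    float b  = in[threadIdx.x + blockDim.x];

    #pragma unroll 32
    for (int i = 0; i < iters; ++i) {
        a0 = a0 * b + a0;   // the a = a*b+a case; swap in c = a*b+c or
        a1 = a1 * b + a1;   // d = a*b+c to get the 3- and 4-operand variants
        a2 = a2 * b + a2;
        a3 = a3 * b + a3;
    }

    // Keep the results live so the loop isn't optimized away.
    out[threadIdx.x] = a0 + a1 + a2 + a3;
}
```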

I remember that Nvidia GPUs have suffered from register bandwidth starvation before, so I wonder if this could be the case again (although I don't really see where a difference between 3-address and 4-address FFMA would come from).

Unfortunately I don't know how to force ptxas to generate either 3-address or 4-address FFMAs, so I can't test the hypothesis without going through the pain of patching the object binaries.

Has anyone observed > 2 TFLOP/s on 4-address FFMA? Alternatively, does anyone know how to tell ptxas to keep data in the same register?

I also notice in the profiler that the ratio between “instructions issued 2” and “instructions issued 1” is consistent with the observed arithmetic throughput.

I.e., for the 2-address FFMA (a = a*b+a) I get "instructions issued 2" to be roughly the same as "instructions issued 1", as required to fully load all 192 single-precision "cores". For 3-address FFMA I get a ratio of about 1 to 3.5, while for the 4-address FFMA there is only a tiny fraction of dual-issued instructions (< 1%).
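
To spell out the arithmetic behind "as required": an SMX has 192 FP32 lanes but only 4 warp schedulers, so feeding all lanes needs 6 warp-instructions per cycle, i.e. at least 2 of the 4 issues per cycle must be dual issues. A trivial back-of-the-envelope helper (the constants are the public GK110 SMX figures; everything else is just for illustration):

```
#include <cstdio>

int main()
{
    const int fp32_lanes = 192;  // single-precision "cores" per SMX
    const int warp_size  = 32;
    const int schedulers = 4;    // warp schedulers per SMX

    // Warps' worth of FFMAs needed per cycle to keep every lane busy:
    const int warps_needed = fp32_lanes / warp_size;       // 6

    // With d of the 4 schedulers dual-issuing, the issue rate is 4 + d,
    // so reaching 6 requires at least d = 2 schedulers to dual-issue:
    const int dual_needed = warps_needed - schedulers;     // 2

    printf("dual-issuing schedulers needed per cycle: %d of %d\n",
           dual_needed, schedulers);
    // => roughly as many dual issues as single issues, i.e. the ~1:1 ratio
    //    of "instructions issued 2" to "instructions issued 1" seen in the
    //    a = a*b+a case.
    return 0;
}
```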

Of course this correlation might be entirely coincidental. It would be nice to have more control over the code that ptxas generates.

And the answer is NO.

I found I can increase the ratio of 3-address FFMAs in my code from 16% to 74% by passing the --dont-merge-basicblocks option to ptxas. However, the fraction of dual-issued instructions increases only marginally.

Still I’d like to know what limits dual-issue of FFMA instructions in my case.

You may look at the register IDs in the disassembled SASS to check for register bank conflicts.
Although it is not documented, some microbenchmark results suggest that registers are split into banks by their register ID: https://www.petaqcd.org/IMG/pdf/Review_Junjie_LAI.pdf
It is possible that dual-issue only happens when all unique input operands fall into different banks.
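
For example, under a purely hypothetical mapping of four banks with bank = register ID % 4 (the real, undocumented scheme may well differ), a conflict check on one FFMA's source registers could look like this:

```
#include <cstdio>

// Purely illustrative mapping: 4 banks, bank = register ID % 4.
// The real banking scheme is undocumented and may well differ.
static int bank(int reg_id) { return reg_id % 4; }

int main()
{
    // e.g. FFMA Rd, Ra, Rb, Rc with source registers R1, R5, R9:
    const int ra = 1, rb = 5, rc = 9;

    const bool conflict = bank(ra) == bank(rb) ||
                          bank(ra) == bank(rc) ||
                          bank(rb) == bank(rc);

    printf("R%d/R%d/R%d -> banks %d/%d/%d: %s\n",
           ra, rb, rc, bank(ra), bank(rb), bank(rc),
           conflict ? "potential bank conflict" : "all different banks");
    return 0;
}
```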

Thank you Sylvain! For sure I have loads of these conflicts.

So I guess it is asfermi time then. I had stayed away from it so far as I thought it would be too immature on GK110. Maybe a good reason to lend a helping hand there.

Errm, immature might be a slight understatement. It appears work on sm_35 hasn’t even started yet.

I’d also be interested in assembler for sm30 and sm35 :)

I think we all are. :)
Bad news is that I won’t start working on it for now, as I figured I can control the placement of registers in banks by using float4 variables everywhere.

Controlling the placement of registers in banks by using float4 variables doesn't appear to work very well. If loops get unrolled, ptxas still happily shuffles variables around between banks. For my purposes I think I can live without unrolling those loops for now, though.
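
Roughly this kind of pattern (a simplified sketch, not my actual kernel; the idea being that the .x/.y/.z/.w components should end up in an aligned register quad):

```
__global__ void ffma_f4(const float4 *in, float4 *out, float b, int iters)
{
    float4 acc = in[threadIdx.x];        // LD.128: aligned register quad

    #pragma unroll 1                     // unrolling lets ptxas reshuffle again
    for (int i = 0; i < iters; ++i) {
        acc.x = acc.x * b + acc.x;
        acc.y = acc.y * b + acc.y;
        acc.z = acc.z * b + acc.z;
        acc.w = acc.w * b + acc.w;
    }

    out[threadIdx.x] = acc;              // ST.128: quad again at the end
}
```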

Did you try storing that float4 value to memory? That uses ST.128 (and LD.128 when you read it back), which forces register ordering - at least around the memory instruction. You can use an LD with a zero predicate to avoid stressing the memory system. Just make sure the compiler doesn't know the predicate is zero ;)

Yes I do that. And I load it from memory as well. So the float4 is correctly ordered at the start and at the end. But if I unroll loops in the middle, it starts shuffling things around.

Using a predicated-off LD.128/ST.128 is a neat idea though - I could use that to enforce ordering on each iteration as well. :)
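
Something along these lines, maybe (an untested sketch via inline PTX; the flag argument would have to come from somewhere the compiler can't prove to be zero, e.g. a kernel parameter that is always passed as 0):

```
// Untested sketch: a predicated-off 128-bit load that (hopefully) pins
// v.x/.y/.z/.w into an aligned register quad around this point.
__device__ void pin_quad(float4 &v, const float4 *ptr, int flag /* pass 0 */)
{
    asm volatile("{\n\t"
                 ".reg .pred p;\n\t"
                 "setp.ne.s32 p, %4, 0;\n\t"                    // false at run time
                 "@p ld.global.v4.f32 {%0,%1,%2,%3}, [%5];\n\t" // never executes
                 "}\n\t"
                 : "+f"(v.x), "+f"(v.y), "+f"(v.z), "+f"(v.w)
                 : "r"(flag), "l"(ptr));
}
```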

Yes, asfermi sort-of supports sm_30 (except scheduling hints which remain undeciphered for the most part). But nothing was done for sm_35 as far as I know.
From an instruction encoding perspective, sm_30 is very close to sm_2x, but sm_35 is completely different.

Envydis/envyas apparently claims basic support for sm_35, though (https://github.com/pathscale/envytools).

Thanks again Sylvain - I haven't looked at envytools for quite a while, but will do so again. Putting work in there might also make more sense because it is backed by commercial interest (I haven't checked recently how far PathScale has come - but having an open CUDA stack would definitely be very much appreciated).

The bank conflicts are not the whole story though. While my first tries with enforced banks achieved (roughly) equal numbers of single- and dual-issued instructions, in later tests the number of dual-issued instructions decreased again.

More importantly though, all of those kernels achieve only dismal performance: GFLOP/s drop from ~3200 to the low hundreds. It seems like this is not the optimal, but in fact the worst distribution of registers between banks.

I have never gotten down to the nitty-gritty of register banking or peak FMA throughput, but I did want to say that I’m following this thread since such details are really interesting (and important for that nitty-gritty!)

Sylvain, thanks especially for that link to the SGEMM optimization slides. Those are really interesting! And they directly back up some small undocumented implementation details about cuBLAS. To get peak throughput for SGEMM, NVidia does not use nvcc or even ptxas… the library team has its own hand-crafted procedural matrix-kernel building tool which outputs raw SASS. The reason was specifically to optimize register usage (and banking efficiency) to get full FMADD throughput. Hand-building and tuning the kernels was just too much effort, especially with SM 3.5 and using all of its 255 registers per thread to full potential. In practice, SM 3.5 SGEMM reaches over 90% of theoretical FLOP/s! The custom tool is written in Python by LSChien, whom some old-timers here on the forum may remember from his incredibly detailed dissections of Fermi microperformance a couple of years back. Apparently NVidia scooped him up… he's obviously a great asset!
I learned this from Philippe Vandermersch during a GPU meetup presentation about cuBLAS design.

Thanks, SPWorley, for this background info! I have suspected that for quite a while. But it's great to hear LSChien's work has been recognized and put to good use by Nvidia.
I think I need to attend some CUDA get-togethers as well.

Regarding the FFMA problem at hand, I now believe the right way is not to fight ptxas. It is now obvious that it has internal knowledge we don't, and that knowledge is essential for achieving good performance. So I'll try to learn something tomorrow from how it arranges registers.

And it is good to know this thread is followed by people. Where is Gregory Diamos? :) :)

Back in late 2009 I asked some questions on the forum about partition camping. LSChien responded with at least a 4-5 page answer based on his own microbenchmarks and investigations. People commented that his forum posts were good enough to put in a whitepaper :-)

I believe he is from Qinghua University, which is very "Ivy League" in China.