Using dual-issue in Fermi
Hello, I'm planning to benchmark a kernel. How do I dual-issue a MAD and a MUL? I tried issuing a MUL right after a MAD, but the GFLOP/s seems to decrease.
Is there a way I could somehow direct the MUL to the SFUs?
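
For reference, here's a minimal sketch of the kind of benchmark loop I mean (the kernel name, launch parameters, and constants are just illustrative; __fmul_rn() is there to keep the MUL from being contracted into another MAD):

[code]
// Illustrative microbenchmark: a dependent MAD chain interleaved with an
// independent MUL chain, in the hope that the hardware dual-issues them.
__global__ void madmul_bench(float *out, float a, float b, int iters)
{
    float x = a;  // accumulator for the MAD chain
    float y = b;  // accumulator for the MUL chain
    for (int i = 0; i < iters; ++i) {
        x = x * a + b;        // compiles to a MAD/FMA
        y = __fmul_rn(y, a);  // __fmul_rn keeps this a separate MUL
    }
    // Write the results out so the compiler cannot eliminate the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}
[/code]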

Thanks in advance

#1
Posted 03/09/2012 10:18 AM   
I don't think compute capability 2.0 devices are capable of issuing a MUL to the special function units in parallel with a MAD on the FPUs/cores; that was a property of 1.x devices. Compute capability 2.1 devices, of course, can issue a MUL to their extra set of cores in parallel with a MAD.

You might have to play a bit with context, alignment, and operands because of the limited register file bandwidth. I have to admit I never tried in earnest myself, as my algorithms usually have a MUL/ADD ratio of 1, not 2.
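
In case it's useful, here is a quick sketch of how to tell which case you are in at runtime with the standard device query (device 0 and the per-SM core counts in the comments are my assumptions about the parts discussed here):

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // 2.0 (GF100/GF110): 32 cores per SM, no extra issue port for a MUL.
    // 2.1 (GF104 etc.):  48 cores per SM, can take a MUL alongside a MAD.
    printf("Compute capability %d.%d (%s)\n", prop.major, prop.minor, prop.name);
    return 0;
}
[/code]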

#2
Posted 03/09/2012 05:05 PM   
[quote name='tera' date='09 March 2012 - 10:35 PM' timestamp='1331312744' post='1380614']
I don't think compute capability 2.0 devices are capable of issuing a MUL to the special function units in parallel with a MAD on the FPUs/cores; that was a property of 1.x devices. Compute capability 2.1 devices, of course, can issue a MUL to their extra set of cores in parallel with a MAD.

You might have to play a bit with context, alignment, and operands because of the limited register file bandwidth. I have to admit I never tried in earnest myself, as my algorithms usually have a MUL/ADD ratio of 1, not 2.
[/quote]

Yeah. According to [url="http://www.eecg.toronto.edu/~moshovos/CUDA08/arx/microbenchmark_report.pdf"]this[/url] paper it's possible on GT200 using math intrinsics. I was wondering if there is something like that for Fermi?
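
For context, this is roughly how I get the GFLOP/s figure, using event-based timing. A minimal sketch; the kernel body matches the one in my first post, and the launch configuration and iteration count are placeholders:

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Same shape as the kernel sketched in the first post.
__global__ void madmul_bench(float *out, float a, float b, int iters)
{
    float x = a, y = b;
    for (int i = 0; i < iters; ++i) {
        x = x * a + b;        // MAD: 2 flops
        y = __fmul_rn(y, a);  // MUL: 1 flop
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}

int main()
{
    const int blocks = 120, threads = 256, iters = 100000;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    madmul_bench<<<blocks, threads>>>(d_out, 1.0001f, 0.9999f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 3 flops (MAD + MUL) per thread per iteration.
    double flops = 3.0 * (double)iters * blocks * threads;
    printf("%.1f GFLOP/s over %.2f ms\n", flops / (ms * 1e-3) / 1e9, ms);

    cudaFree(d_out);
    return 0;
}
[/code]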

#3
Posted 03/09/2012 05:17 PM   
Isn't the dual issue to two different warps?
You seem to be thinking of dual-issuing like a superscalar CPU...

#4
Posted 03/10/2012 09:53 PM   
Yes, that is indeed what we are thinking. The two schedulers issuing instructions from two independent warps come on top of that, for (theoretically) up to four instructions issued in parallel per SM.

I'm not aware, though, that this has actually been demonstrated on GF100; both GT200 and GF104 are capable of this kind of dual-issue.


#5
Posted 03/10/2012 10:42 PM   