Hand-Tuned SGEMM on GT200 GPU: 10%~20% improvement of SGEMM
I would like to report our SGEMM routine on the GT200 GPU. This work builds on Volkov's code, which can be downloaded from
[url="http://forums.nvidia.com/index.php?showtopic=89084"]http://forums.nvidia.com/index.php?showtopic=89084[/url]

Figure 1 shows the performance (Gflop/s) of our method on the TeslaC1060, GTX285 and GTX295. The baseline is Volkov's code on the TeslaC1060 (black dashed line).
The core frequency of the GTX285 is 1.135x that of the TeslaC1060, so it is reasonable that the performance of the GTX285 is about 1.165x that of the TeslaC1060.
The maximum performance of our method on the TeslaC1060 in Figure 1 is 439.467 Gflop/s, which reaches 70% of peak performance without dual issue
(the single-precision peak without dual issue on the TeslaC1060 is 624 Gflop/s), whereas Volkov's code reaches 346 Gflop/s, i.e. 55.45% of peak.
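For reference, the percentages above follow from counting each MAD as two flops; a quick check, assuming the usual GT200 figures of 240 SPs at 1.3 GHz:
[code]/* Quick check of the efficiency figures quoted above (my arithmetic, assuming
   240 SPs at 1.3 GHz and 2 flops per MAD, i.e. no dual-issued MUL). */
#include <stdio.h>
int main(void)
{
    double peak = 624.0;   /* 240 SPs * 1.3 GHz * 2 flops per MAD = 624 Gflop/s */
    printf("ours   : %.1f%% of peak\n", 100 * 439.467 / peak);   /* ~70.4% */
    printf("Volkov : %.1f%% of peak\n", 100 * 346.0   / peak);   /* ~55.4% */
    return 0;
}[/code]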

[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/figure1.JPG[/img]
Figure 1: comparison between our method and Volkov's code

The technical report and source code can be downloaded from:
(1) technical report: [url="http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/HandTunedSgemm_2010_v1.1.pdf"]http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEM...m_2010_v1.1.pdf[/url]
(2) source code: [url="http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/lsc_sgemm.zip"]http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/lsc_sgemm.zip[/url]

I hope you can give me some comments and criticism.

Lung Sheng
Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#1
Posted 02/08/2010 04:49 AM   
Lung-Sheng,

I always love your detailed experiments with GPU performance. It takes patience and persistence to understand and exploit CUDA's subtleties like this. Thanks for sharing the results (and code!)

From my first skim of your report, you talk about using decuda and splicing in new PTX instructions and recompiling.
Have you tried using the inline assembly feature of CUDA? [b]It isn't officially supported,[/b] but Kaiyong Zhao (kyzhao here on the CUDA forum) mentioned it to me at GTC09. It relies on the fact that [url="http://wiki.open64.net/images/1/10/Nvopencc-tutorial.pdf"]nvcc uses a flavor of Open64[/url].
For example, you can write code in CUDA toolkit 2.3 like:
[code]__device__ uint float_to_u8(float value)
{
    uint result;
    asm("cvt.sat.rni.u8.f32 %0, %1;" : "=r" (result) : "f" (value));
    return result;
}[/code]
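
For context, here is a minimal (hypothetical) way such a helper might be called from a kernel; the kernel name and buffers below are made up, just to keep the snippet self-contained:
[code]// Hypothetical usage of the inline-asm helper above (kernel/buffer names made up).
__global__ void quantize(const float *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = float_to_u8(in[i]);   // saturating round-to-nearest float -> u8
}[/code]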

I'm sure the NV boys here will smile uneasily and say "Don't depend on this ASM feature in CUDA" but it's easier than splicing the cubin files!

#2
Posted 02/08/2010 06:06 AM   
SPWorley, thanks for this trick.

In fact, I don't modify PTX instructions; I modify the assembly code generated by decuda.
The PTX code realizes my idea, but the PTX code is not optimized; after optimization by nvcc,
the binary code no longer realizes my idea. That is why I modify the assembly code generated by decuda.

However, it would be very helpful if this trick worked, because modifying the decuda-generated
assembly and splicing it back into the binary file is time-consuming. On average, I need
nearly one hour to do this.

I tried this trick on Volkov's code but ran into a serious problem.

I replaced "b_reg = b_ptr[j];" in the following code
[code]float *b_ptr = (float*)b;
#pragma unroll
for( int i = 0; i < 16; i++ ){
    float A_reg = A[0]; A += lda;
    #pragma unroll
    for( int j = 0; j < 16; j++){
        b_reg = b_ptr[j];
        c[j] += A_reg * b_reg;
    }
    b_ptr += 17;   // b_ptr = &b[i][0]
}   // for each column index of sub-matrix of A[/code]

with asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr[j]) ), giving

[code]float *b_ptr = (float*)b;
#pragma unroll
for( int i = 0; i < 16; i++ ){
    float A_reg = A[0]; A += lda;
    #pragma unroll
    for( int j = 0; j < 16; j++){
        asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr[j]) );
        c[j] += A_reg * b_reg;
    }
    b_ptr += 17;   // b_ptr = &b[i][0]
}   // for each column index of sub-matrix of A[/code]

but a compilation error occurs:
[code]./volkov_v2.cu(187): Advisory: Loop was not unrolled, inline assembly
./volkov_v2.cu(183): Advisory: Loop was not unrolled, not an innermost loop
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx, line 1; error : Missing .version directive at start of file 'C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx'
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx, line 1; error : Missing .target directive at start of file 'C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx'
ptxas fatal : Ptx assembly aborted due to errors[/code]

I think this is because of the memory operand constraint (m); the instruction should
look like "ld.shared.f32 %f277, [__cuda_b64 + offset];".

I cannot find any useful material (an example) showing how to do "shared memory ---> register".

Any ideas?

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#3
Posted 02/09/2010 09:30 AM   
I think the loop unrolling warnings are a general problem with nvcc - if you try unrolling a loop with a texture read in it (which I presume also generates inline assembly), nvcc reports the same thing. For the inline assembly itself, try

[code]__asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+j) );[/code]

and see what happens.

#4
Posted 02/09/2010 09:39 AM   
I admit I am nowhere near as skilled as you with this kind of low-level hacking, but a promising idea might be to expand the loop yourself manually. I'm not sure you can use the [j] indirection in the asm string. So you could try:

[code] asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+0) );
c[0] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+1) );
c[1] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+2) );
c[2] += A_reg * b_reg;

...

asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+15) );
c[15] += A_reg * b_reg;[/code]

(maybe those pointer increments should go 0, 4, 8, 12, ..., 60?)

#5
Posted 02/09/2010 09:42 AM   
Hi LSChien,

That is a very clever idea! :)

I don't know if this is any help, but for my very simple example I've been able to use NVCC by itself to transfer smem data to regs and then do MAD instructions with all operands in regs.
[url="http://forums.nvidia.com/index.php?showtopic=158315&st=0&p=993085&#entry993085"]http://forums.nvidia.com/index.php?showtop...mp;#entry993085[/url]

Case 2 has the desired effect.

The trick is to put a __syncthreads() after the smem-to-reg transfer. But I've found that putting more than one __syncthreads() (case 1 in my example) leads to operands being taken from smem.
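
Here is a toy sketch of that pattern (illustrative only, not the actual test kernel); whether ptxas really emits "MOV reg, [smem]" followed by register MADs, rather than a MAD with a shared-memory operand, still has to be checked with decuda, as discussed in the following posts:
[code]// Toy illustration of "smem -> registers, one __syncthreads(), then MADs".
// Assumes a launch with a single block of 256 threads and arrays of 256 floats.
__global__ void smem_to_reg_demo(const float *in, float *out)
{
    __shared__ float s[256];
    int tx = threadIdx.x;

    s[tx] = in[tx];
    __syncthreads();                    // make the shared tile valid

    float a_reg = s[tx];                // smem -> register transfer
    float b_reg = s[(tx + 1) & 255];
    __syncthreads();                    // the barrier placed right after the transfer

    out[tx] = a_reg * b_reg + 1.0f;     // MAD with both multiplicands in registers
}[/code]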

#6
Posted 02/09/2010 08:47 PM   
Thank you, Nikolai.

I tried your suggestion (case 2) in my algorithm, method 1.

method 1 + __syncthreads() after "b_reg = b_ptr[j];":
[code]b_ptr = (float*)b;
#pragma unroll
for( int i = 0; i < BLOCK_SIZE_X; i++ ){
    float A0_reg = A[0];  A  += lda;
    float A1_reg = A1[0]; A1 += lda;
    #pragma unroll
    for( int j = 0; j < BLOCK_SIZE_Y; j++){
        b_reg = b_ptr[j];
        __syncthreads();
        c0[j] += A0_reg * b_reg;
        c1[j] += A1_reg * b_reg;
    }[/code]

However, nvcc reports an error:
[code]tmpxft_00000bb8_00000000-3_method1_thread.cudafe1.gpu
tmpxft_00000bb8_00000000-8_method1_thread.cudafe2.gpu
nvopencc ERROR: C:\CUDA\bin64/../open64/lib//be.exe returned non-zero status -1073741819[/code]

I disabled full unrolling of the outer loop and changed it to #pragma unroll 8
(BLOCK_SIZE_X = 16, BLOCK_SIZE_Y = 16). I also used the compiler option "-maxrregcount 48",
so that the number of active threads per SM is 320 (register count per thread = 46),
the same as method1_variant in the report.
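
As a sanity check on the 320-thread figure (assuming the 16384-register file of a GT200 SM and 64-thread blocks as in Volkov's original code):
[code]/* Back-of-envelope occupancy check (assumptions: 16384 registers per SM on GT200,
   64-thread blocks, and the -maxrregcount 48 cap). */
#include <stdio.h>
int main(void)
{
    int regfile = 16384, block = 64, regs = 48;
    int blocks_per_sm = regfile / (block * regs);                  /* = 5   */
    printf("active threads per SM = %d\n", blocks_per_sm * block); /* = 320 */
    return 0;
}[/code]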

The resulting code is called method1(synch):
[code]b_ptr = (float*)b;
#pragma unroll 8
for( int i = 0; i < BLOCK_SIZE_X; i++ ){
    float A0_reg = A[0];  A  += lda;
    float A1_reg = A1[0]; A1 += lda;
    #pragma unroll
    for( int j = 0; j < BLOCK_SIZE_Y; j++){
        b_reg = b_ptr[j];
        __syncthreads();
        c0[j] += A0_reg * b_reg;
        c1[j] += A1_reg * b_reg;
    }[/code]

I do indeed obtain "shared memory ---> register" in the decuda output.
However, the performance in Figure 2 is not good, only 250 Gflop/s.
I think this is because of too many __syncthreads() calls.

[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/figure2.JPG[/img]
Figure 2: performance comparison between method1(variant) and method1(synch).
method1(synch) only reaches 250 Gflop/s, 30% slower than method1(variant).

In order to achieve high performance, we need to remove the extra __syncthreads() operations.
However, after removing them we must also fix up the offsets of the jump instructions.
cudasm can set the offsets of jump instructions, but we would need to check the rest of the
binary code generated by cudasm.

Besides, there is one strange thing: if I unroll the inner loop explicitly, then
nvcc removes the extra __syncthreads() automatically and uses "MAD dest, [smem], src2, src3";
I don't know why.

By the way, could you share your example?

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#7
Posted 02/10/2010 06:31 AM   
Thanks to avidday and SPWorley for your suggestions.

[code]__asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+j) );[/code]

and

[code]asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+0) );
c[0] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+1) );
c[1] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+2) );
c[2] += A_reg * b_reg;
...
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+15) );
c[15] += A_reg * b_reg;[/code]

do not work when compiling to a binary file, because nvcc translates
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+0) );
into
ld.shared.f32 %f7, b_ptr;

but ptxas does not recognize the symbol "b_ptr"; the error message is
[code]./volkov_v2.cu(183): Advisory: Loop was not unrolled, inline assembly
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_00001c9c_00000000-4_volkov_v2.ptx, line 149; error : Address expected for argument 1 of instruction 'ld'
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_00001c9c_00000000-4_volkov_v2.ptx, line 149; error : Unknown symbol 'b_ptr'
ptxas fatal : Ptx assembly aborted due to errors[/code]

If one uses
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b+0) ); // b is shared memory
then nvcc translates it to
ld.shared.f32 %f7, asm.by.address.temp_0;

I think something is wrong in the translation of shared-memory operands.

Finally, I designed a workaround:

[code] float A_reg;
float *b64 = (float*)b;
A_reg = A[0]; A += lda;
asm("ld.shared.f32 %0, [__cuda_%1+0];" : "=f" (b_reg) : "m"(b64) );
c[0] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+4];" : "=f" (b_reg) : "m"(b64) );
c[1] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+8];" : "=f" (b_reg) : "m"(b64) );
c[2] += A_reg * b_reg;
....
c[12] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+1072];" : "=f" (b_reg) : "m"(b64) );
c[13] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+1076];" : "=f" (b_reg) : "m"(b64) );
c[14] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+1080];" : "=f" (b_reg) : "m"(b64) );
c[15] += A_reg * b_reg;[/code]

which is explicit loop-unrolling of

[code]float A_reg;
float *b64 = (float*)b;
#pragma unroll
for( int i = 0; i < 16; i++ ){
    A_reg = A[0]; A += lda;
    #pragma unroll
    for( int j = 0; j < 16; j++){
        int offset = 4*(j+17*i);
        asm("ld.shared.f32 %0, [__cuda_%1];" : "=f" (b_reg) : "m" (b64 + offset ) );
        c[j] += A_reg * b_reg;
    }[/code]
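
As a sanity check on those byte offsets (assuming 4-byte floats and the 17-float padded row stride used earlier): offset = 4*(j + 17*i), so the first load is at +0 and the last one (i = j = 15) is at +1080, matching the hand-unrolled version:
[code]/* Quick check of the byte offsets used above (4-byte floats, rows padded to
   17 floats in shared memory, as in "b_ptr += 17" earlier). */
#include <stdio.h>
int main(void)
{
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            if ((i == 0 && j < 3) || (i == 15 && j == 15))
                printf("i=%2d j=%2d -> byte offset %4d\n", i, j, 4 * (j + 17 * i));
    return 0;   /* prints 0, 4, 8 for the first row and 1080 for the last element */
}[/code]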

The PTX code shows the correct translation:
[code]ld.shared.f32 %f22,  [__cuda_b64+0];
ld.shared.f32 %f24,  [__cuda_b64+4];
...
ld.shared.f32 %f278, [__cuda_b64+1080];[/code]


However, the binary code still uses "MAD dest, [smem], src2, src3", not "MAD dest, src1, src2, src3"
(one can verify this via decuda). Even though we can use inline assembly to write PTX code,
nvcc does its own optimization, and the result may not match your design.

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#8
Posted 02/10/2010 08:19 AM   
30% slower?! That's unfortunate...
I think the __syncthreads() is causing some kind of latency... I'm guessing you're running multiple blocks per MP; maybe the __syncthreads() is preventing the other blocks from hiding the smem latency (but at 320 threads, I wouldn't think there would be much).

Because if you just count instructions, every time you issue a __syncthreads() you also issue 16 MAD instructions (right?) and 1 MOV instruction. Hence, if you include the __syncthreads(), it should run at 17/18 (or 94.4%) of your original method. Now that I think about it, even 5.6% is a lot, considering it cuts half of your 10% improvement. So the only benefit is that you can use NVCC... but unfortunately __syncthreads() isn't behaving as we would like.


Here is the kernel from my code:

[codebox]__global__ void min_ops( float *gA, float *gB )
{
    int i;
    const int tx = threadIdx.x;
    volatile float a[9], b[9];
    volatile float c[9] = {0,0,0,0,0,0,0,0,0};
    volatile __shared__ float sA[NUM_ELEMENTS], sB[NUM_ELEMENTS];

    #pragma unroll 20
    for ( i = 0 ; i < TEST_ITERATIONS ; i ++ )
    {
        smem2reg(a, b, sA, sB, tx);

        __syncthreads();

        mad_ops(a, b, c);

        reg2smem(c, sA, tx);
    }

    __syncthreads();

    reg2gmem(c, gA, tx);
}[/codebox]
I ran it using 1 block per MP w/ 192 threads.

smem2reg: moves 18 vectors
mad_ops: performs 27 mad operations
reg2smem: moves 9 vectors

At 192 threads, I've found I only achieve 88% of the smem bandwidth.
Incorporating that 88% into the clock-cycle counting, I can accurately account for every instruction issued in the loop, where the __syncthreads() instructions accounted for 4 cc per warp (which of course is the expected throughput).


However, in my example I have 1 __syncthreads() against (27 MAD instructions + 27 MOV instructions), so I only see 1/55 = 1.8% slower code due to __syncthreads().

Thanks for trying, and great job on the SGEMM! :thumbup:

#9
Posted 02/10/2010 08:32 AM   
Nikolai, thanks for the explanation.

You are right, the cost of __syncthreads() can be estimated.

In my code there are 2 MADs + 1 MOV per __syncthreads(), which means roughly 1/3 extra instructions,
and that degradation matches the experimental result.

So the key point is how many __syncthreads() we have to pay for.
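
For what it is worth, that estimate can be written out explicitly (a rough issue-slot model, assuming the barrier costs roughly one issue slot per warp like an ordinary instruction):
[code]/* Rough model of the slowdown discussed above (an estimate, not measured data):
   each inner iteration issues 1 MOV + 2 MAD, and the extra __syncthreads() adds
   roughly one more issue slot per iteration. */
#include <stdio.h>
int main(void)
{
    double mads_per_slot_no_sync   = 2.0 / 3.0;   /* 2 MAD out of 3 instructions */
    double mads_per_slot_with_sync = 2.0 / 4.0;   /* 2 MAD out of 4 instructions */
    printf("expected slowdown: %.0f%%\n",
           100 * (1.0 - mads_per_slot_with_sync / mads_per_slot_no_sync));
    return 0;
}
/* prints "expected slowdown: 25%", the same ballpark as the ~30% drop in Figure 2 */[/code]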

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#10
Posted 02/10/2010 08:52 AM   
Oh geez, you're only doing 2 MADs? Then using __syncthreads() is not viable (i.e., your 30% reduction in speed); sorry about that :">

#11
Posted 02/10/2010 09:03 AM   
Great performance!

Note that peak performance in multiply-and-add is 624 Gflop/s with or without dual issue. Comparing to the 50% higher peak, which implies another multiply done in parallel, would be misleading for matrix-matrix multiply performance analysis - you can't use that multiply anyway.
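
For reference, the "50% higher peak" works out as follows, using the usual GT200 figures of 240 SPs at 1.3 GHz (just arithmetic, not new data):
[code]/* The MAD-only peak vs the peak that also counts the dual-issued MUL. */
#include <stdio.h>
int main(void)
{
    double mad_only = 240 * 1.3 * 2;   /* MAD counts as 2 flops -> 624 Gflop/s */
    double with_mul = 240 * 1.3 * 3;   /* MAD + dual-issued MUL -> 936 Gflop/s */
    printf("%.0f vs %.0f Gflop/s (%.0f%% higher)\n",
           mad_only, with_mul, 100 * (with_mul / mad_only - 1));
    return 0;
}[/code]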

Then I wonder if your 70% is the limit. If you know how to get around the "multiply-and-add with operand in shared memory" bottleneck, then you are no longer bound by its 66% throughput. Why not push all the way up to, say, 90% of peak?

I like that you did it in the native asm. It's too bad that GPU companies are not serious about asm support. Many HPC problems would be so much simpler if asm were available. It would also encourage independent compiler development. I secretly hope that competition with Intel will change that ;)

#12
Posted 02/17/2010 04:45 PM   
[quote name='vvolkov' post='1003049' date='Feb 17 2010, 08:45 AM']Note that peak performance in multiply-and-add is 624 Gflop/s with or without dual issue.[/quote]

Thanks, you are right. In SGEMM we always count the number of MAD operations, so 624 Gflop/s is the peak performance.

[quote name='vvolkov' post='1003049' date='Feb 17 2010, 08:45 AM']Then I wonder, if your 70% is the limit. If you know how to find the way around the "multiply-and-add with operand in shared memory" bottleneck,
then you are no more bound by its 66% throughput. Why not push all the way up to, say, 90% of peak?[/quote]
I checked the assembly code of my method 1: it has a regular pattern in which one "MOV reg, [smem]" and two "MAD dest, src1, src2, src3" interleave with each other.
Such a pattern is good for dual issue, and I wonder whether I can improve it from 70% to 90%.
On the other hand, I also implemented volkov_variant, which changes "MAD dest, [smem], src2, src3" in Volkov's code into
"MOV reg, [smem]" plus "MAD dest, src1, src2, src3"; however, no speedup is seen in volkov_variant.
The reason is clear now: the pattern of one "MOV reg, [smem]" and one "MAD dest, src1, src2, src3" is not regular;
several "MOV reg, [smem]" get grouped together (I think this is a side effect of loop-unrolling).

It would be necessary to write the assembly manually in volkov_variant so that
one "MOV reg, [smem]" and one "MAD dest, src1, src2, src3" alternate regularly.
Then we could check the effect of dual issue on Volkov's code.

Volkov's original code is more competitive since it has 512 active threads per SM.

I will re-survey cudasm, try to write the assembly manually, and report the result later.

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#13
Posted 02/18/2010 02:01 PM   
[quote name='LSChien' post='1003614' date='Feb 18 2010, 07:01 AM']I check assembly code of my method 1, it has regular pattern, one "MOV reg, [smem]" and two "MAD dest, src1,src2,src3" interleave each other.
Such pattern is good for dual issue, I wonder if I can improve it, from 70% to 90%.
On the contrary, I also implment volkov_variant, which modifies "MAD dest, [smem], src2, src3" in volkov's code into
"MOV reg, [smem]" and "MAD dest, src1,src2,src3", however no speedup is seen in volkov_variant.
The reason is clear now since pattern of one "MOV reg, [smem]" and one "MAD dest, src1,src2,src3" is not regular,
several "MOV reg, [smem]" are grouped together (I think this is side effect of loop-unrolling).[/quote]

It may not be due to dual issue then. One smem-MOV per two register-MADs means that you are using a kind of register blocking, which reduces smem traffic per flop. The problem may not simply be in MADs-with-smem-operand, but in accessing smem in parallel with MAD no matter how you do it - using one or multiple instructions. A kind of bandwidth bottleneck. In that case you might still get further speedup if you use larger register blocks - such as one smem-MOV per four register-MADs.
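
A toy sketch of that larger register block, to make the instruction mix concrete (illustrative only, not anyone's actual SGEMM kernel; it assumes a launch with 64-thread blocks and input/output arrays of at least 256 floats):
[code]// One smem-MOV amortized over four register-register MADs per iteration.
__global__ void register_blocking_demo(const float *gA, float *gC)
{
    __shared__ float b[16];
    int tx = threadIdx.x;
    if (tx < 16) b[tx] = gA[tx];        // fill the shared tile
    __syncthreads();

    float A0 = gA[4*tx+0], A1 = gA[4*tx+1], A2 = gA[4*tx+2], A3 = gA[4*tx+3];
    float c0 = 0.f, c1 = 0.f, c2 = 0.f, c3 = 0.f;

    #pragma unroll
    for (int j = 0; j < 16; j++) {
        float b_reg = b[j];             // 1 x smem -> register move ...
        c0 += A0 * b_reg;               // ... reused by 4 register MADs
        c1 += A1 * b_reg;
        c2 += A2 * b_reg;
        c3 += A3 * b_reg;
    }
    gC[4*tx+0] = c0; gC[4*tx+1] = c1; gC[4*tx+2] = c2; gC[4*tx+3] = c3;
}[/code]
In a real SGEMM kernel the accumulators would be 16-element arrays rather than scalars, so the obvious price is register pressure.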

I also wonder why you get 55.45% on GT200 with my code. I get 60% on GTX280.

#14
Posted 02/18/2010 08:28 PM   
[quote name='vvolkov' post='1003814' date='Feb 18 2010, 12:28 PM']One smem-MOV per two register-MADs means that you are using a kind of register blocking, which reduces smem traffic per flop.
In that case you might still get further speedup if you use larger register blocks - such as one smem-MOV per four register-MADs.[/quote]

That was my original idea: reduce the number of "reg <-- shared memory" moves.
However, in the case of "one smem-MOV per four register-MADs", there are only 192 active threads per SM,
and the performance is not good.
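
A rough register-budget check of that 192-thread figure (assuming 64-thread blocks, the 16384-register file of a GT200 SM, and four 16-element accumulator arrays per thread as implied by the four-MAD blocking):
[code]/* Why 192 active threads is plausible for the 4-MAD register blocking
   (back-of-envelope only). */
#include <stdio.h>
int main(void)
{
    int regfile = 16384, block = 64, active = 192;
    printf("blocks per SM            = %d\n", active / block);   /* 3        */
    printf("register budget / thread = %d\n", regfile / active); /* about 85 */
    printf("regs for C tile alone    = %d\n", 4 * 16);           /* 64       */
    return 0;
}[/code]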

[quote name='vvolkov' post='1003814' date='Feb 18 2010, 12:28 PM']I also wonder why you get 55.45% on GT200 with my code. I get 60% on GTX280.[/quote]

I get about 55.4% on the TeslaC1060, GTX295 and GTX285.
For dimension n > 2046, I execute SGEMM twice and average the results.

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#15
Posted 02/19/2010 02:38 PM   