Performance degradation in CUDA 4.1 vs. 4.0
Hi all,
Since our application must not only be fast but also perform certain
operations within fixed deadlines (we analyze a continuous radio signal),
we benchmark all of our algorithms several times per day.

We are experiencing a clear performance degradation after adopting CUDA 4.1 in place of the old CUDA 4.0.

I have attached 4 images showing the historical performance data for 4 algorithms
(they are not the only affected ones, but they are the simplest for showing you the kernel code).

For all graphs the reported time is in milliseconds (y-axis).

All kernels are launched in this way:

#define BLOCK_SIZE (1<<9)

dim3 myThreads(BLOCK_SIZE);
dim3 myGrid( (aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);

Kernel<<< myGrid, myThreads>>>(.....);

We use Tesla C2050 cards with ECC off.
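The timings below are collected with CUDA events around each kernel launch; a minimal sketch of the idea (the function name and structure here are illustrative, not our actual harness):

```cuda
#define BLOCK_SIZE (1<<9)

// Illustrative timing helper: measures one launch of the sum kernel in
// milliseconds with CUDA events. Error checking omitted for brevity.
float TimeVectorSumMs(const float2* aIn1, const float2* aIn2,
                      float2* aOut, unsigned int aSize) {
    dim3 myThreads(BLOCK_SIZE);
    dim3 myGrid((aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    VectorVectorSumKernelCC_O<<<myGrid, myThreads>>>(aIn1, aIn2, aOut, aSize);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```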

============================================================================
Sum of two complex vectors (2^20 complex)

[attachment=24150:add_cc.png]
[code]
__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;
        aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;
    }
}
[/code]
============================================================================
Product of two complex vectors (2^20 complex)

[attachment=24151:mul_cc.png]

[code]
__global__ void
MulKernel_cv_cv_o(const float2* aIn1,
                  const float2* aIn2,
                  float2* aOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float myReal1 = aIn1[myPos].x;
        const float myReal2 = aIn2[myPos].x;
        const float myImag1 = aIn1[myPos].y;
        const float myImag2 = aIn2[myPos].y;
        aOut[myPos].x = myReal1 * myReal2 - myImag1 * myImag2;
        aOut[myPos].y = myReal1 * myImag2 + myImag1 * myReal2;
    }
}
[/code]
============================================================================
Product of two complex vectors (2^20 complex), in place

[attachment=24152:mul_cc_i.png]

[code]
__global__ void
MulKernel_cv_cv_i(const float2* aIn,
                  float2* aInOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float myTmp = aInOut[myPos].x;
        const float myInR = aIn[myPos].x;
        const float myInI = aIn[myPos].y;
        aInOut[myPos].x = myInR * aInOut[myPos].x - myInI * aInOut[myPos].y;
        aInOut[myPos].y = myInR * aInOut[myPos].y + myInI * myTmp;
    }
}
[/code]
============================================================================
Tone generation (2^20 vector long)

[attachment=24153:tone.png]

[code]
__global__ void
ComplexExpKernel(float2* aInOut,
                 const unsigned int aSize,
                 const float aMagnitude,
                 const float aNormalizedFrequency,
                 const float aInverseFrequency,
                 const float aPhase) {

    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;

    if (myPos < aSize) {
        const float myArgument = aNormalizedFrequency * fmodf((float)myPos, aInverseFrequency) + aPhase;
        aInOut[myPos].x = aMagnitude * __cosf(myArgument);
        aInOut[myPos].y = aMagnitude * __sinf(myArgument);
    }
}
[/code]
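Presumably the fmodf in the tone kernel keeps the phase argument small because the fast __cosf/__sinf intrinsics lose accuracy for large arguments. A host-side double-precision reference one might compare the kernel against (the function name and the idea of such a check are my own; only the formula comes from the kernel above):

```cuda
#include <math.h>

// Hypothetical host reference for the tone generator, computed in double
// precision. Useful for bounding the error of the fast device intrinsics.
void ComplexExpReference(float2* aOut,
                         unsigned int aSize,
                         float aMagnitude,
                         float aNormalizedFrequency,
                         float aInverseFrequency,
                         float aPhase) {
    for (unsigned int i = 0; i < aSize; ++i) {
        // Same wrapping as the kernel: fmod keeps the argument small, which
        // matters for the limited-range fast intrinsics on the device.
        const double arg = (double)aNormalizedFrequency
                         * fmod((double)i, (double)aInverseFrequency)
                         + (double)aPhase;
        aOut[i].x = (float)((double)aMagnitude * cos(arg));
        aOut[i].y = (float)((double)aMagnitude * sin(arg));
    }
}
```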
============================================================================
#1
Posted 02/08/2012 02:21 PM   
Which driver version are you using?

#2
Posted 02/08/2012 03:08 PM   
Indeed it seems to be a surprising regression that the new compiler issues separate load and store instructions for the [font="Courier New"].x[/font] and [font="Courier New"].y[/font] components of a float2.
It can be worked around:

[code]__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        float2 a1 = aIn1[myPos];
        float2 a2 = aIn2[myPos];
        float2 result;
        result.x = a1.x + a2.x;
        result.y = a1.y + a2.y;
        aOut[myPos] = result;
    }
}
[/code]
This generates basically the same object code (for sm_20) with CUDA 4.1 as your original code does with CUDA 4.0.
Still I think Nvidia should fix this rather quickly.
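The same pattern should carry over to the multiply kernel from the first post; a sketch along the same lines (same formula as the original, only the loads and stores are restructured, and not benchmarked here):

```cuda
// Workaround applied to the complex multiply: read and write whole float2
// values so the compiler can emit 64-bit vector loads and stores.
__global__ void
MulKernel_cv_cv_o(const float2* aIn1,
                  const float2* aIn2,
                  float2* aOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float2 a = aIn1[myPos];
        const float2 b = aIn2[myPos];
        float2 result;
        result.x = a.x * b.x - a.y * b.y;  // real part
        result.y = a.x * b.y + a.y * b.x;  // imaginary part
        aOut[myPos] = result;
    }
}
```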
Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#3
Posted 02/08/2012 03:08 PM   
[quote name='mfatica' date='08 February 2012 - 05:08 PM' timestamp='1328713714' post='1366899']
Which driver version are you using?
[/quote]

ver. 285.05.33


At the same time, in order to use NPP I had to remove the extra "," at the end of the enums in npp.h.
#4
Posted 02/08/2012 03:35 PM   
[quote name='tera' date='08 February 2012 - 05:08 PM' timestamp='1328713716' post='1366900']
Indeed it seems to be a surprising regression that the new compiler issues separate load and store instructions for the [font="Courier New"].x[/font] and [font="Courier New"].y[/font] components of a float2.
It can be worked around:

[code]__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        float2 a1 = aIn1[myPos];
        float2 a2 = aIn2[myPos];
        float2 result;
        result.x = a1.x + a2.x;
        result.y = a1.y + a2.y;
        aOut[myPos] = result;
    }
}
[/code]
This generates basically the same object code (for sm_20) with CUDA 4.1 as your original code does with CUDA 4.0.
Still I think Nvidia should fix this rather quickly.
[/quote]

Indeed that fixes the issue. The funny thing is that nppsAdd_32fc is affected by the same problem: it runs
as slowly as my original kernel. And yes, I agree that NVIDIA has to fix this. I have other kernels that are
not as simple as these, and I'm not going to apply that "trick" to all of them.
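One way to avoid repeating the trick in every kernel might be a small __device__ helper that takes and returns whole float2 values, so the call sites only ever touch full float2 loads and stores. A sketch (ComplexMul is an illustrative name I made up, not an existing API):

```cuda
// Hypothetical helper: all arithmetic happens on whole float2 values, so
// every kernel that uses it performs full 64-bit loads/stores and the
// per-component access pattern never appears at the call site.
__device__ __forceinline__ float2 ComplexMul(const float2 a, const float2 b) {
    float2 r;
    r.x = a.x * b.x - a.y * b.y;  // real part
    r.y = a.x * b.y + a.y * b.x;  // imaginary part
    return r;
}

// The in-place multiply rewritten on top of the helper.
__global__ void
MulKernel_cv_cv_i(const float2* aIn,
                  float2* aInOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        aInOut[myPos] = ComplexMul(aIn[myPos], aInOut[myPos]);
    }
}
```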
#5
Posted 02/08/2012 03:43 PM   
Please file a bug against the compiler. When you log into the registered developer website, there is a link to the bug reporting form. Thank you for your help.

#6
Posted 02/08/2012 04:49 PM   
[quote name='njuffa' date='08 February 2012 - 06:49 PM' timestamp='1328719798' post='1366942']
Please file a bug against the compiler. When you log into the registered developer website, there is a link to the bug reporting form. Thank you for your help.
[/quote]

I'm not able to find it on the new site.
#7
Posted 02/08/2012 05:37 PM   
The following works for me:

(1) Login to partners.nvidia.com
(2) Click through the legalese screen
(3) On the start page of the registered developer website, there should be a menu on the left edge of the screen:

Recent Downloads
Current Bugs
Bug Report <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Early Access Feedback

Please let me know if that does not work.
#8
Posted 02/08/2012 05:46 PM   
[quote name='njuffa' date='08 February 2012 - 07:46 PM' timestamp='1328723187' post='1366967']
The following works for me:

(1) Login to partners.nvidia.com
(2) Click through the legalese screen
(3) On the start page of the registered developer website, there should be a menu on the left edge of the screen:

Recent Downloads
Current Bugs
Bug Report <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Early Access Feedback

Please let me know if that does not work.
[/quote]

But that's the old one! I'm a bit puzzled: there are two forums, the old one (this one) and the new tag-based one,
and likewise a new and an old registered developer website.

That's different from the site linked on this page: http://developer.nvidia.com/ ("Registered Developers Website"),
which points to: http://developer.nvidia.com/user/me

I submitted the bug on the old site; while submitting it, CUDA Toolkit version 4.1 was not even listed!

Regards
Gaetano Mendola
#9
Posted 02/08/2012 06:03 PM   
Sorry for the confusion.
For CUDA related questions, this forum (the "old" one) is the best place to get answers.

Edit: What is the bug number?
#10
Posted 02/08/2012 06:33 PM   
[quote name='mfatica' date='08 February 2012 - 07:33 PM' timestamp='1328726016' post='1366982']
Sorry for the confusion.
For CUDA related questions, this forum (the "old" one) is the best place to get answers.
[/quote]

Good to know; then they should remove the "archive" label in front of it :D

Bug ID: 937479

The bug search function is not working; I was able to get that ID from the confirmation email I received.

G.

PS: I tried to send you a private message on the forum, but it says you cannot receive private messages.
#11
Posted 02/08/2012 06:52 PM   
[quote name='tera' date='08 February 2012 - 03:08 PM' timestamp='1328713716' post='1366900']
Indeed it seems to be a surprising regression that the new compiler issues separate load and store instructions for the [font="Courier New"].x[/font] and [font="Courier New"].y[/font] components of a float2.
It can be worked around:
[/quote]
With a big caveat: I've just come across this in my own code and noticed that on compute capability 1.x things are exactly reversed: the original code produces vector load and store instructions, while my "workaround" generates separate loads and stores of the components!

I've checked this with both CUDA 4.0 and 4.1. Quite irritating...
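If both architectures have to be served from one source file, a compile-time switch on __CUDA_ARCH__ is one possible (if ugly) way to give each its preferred form. A sketch, assuming the behavior described above holds (I have not verified the generated code for this exact variant):

```cuda
// Pick the access pattern that vectorizes on each architecture: per the
// observations above, sm_2x prefers whole-float2 accesses under CUDA 4.1,
// while sm_1x reportedly vectorizes the per-component form instead.
__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
#if __CUDA_ARCH__ >= 200
        const float2 a1 = aIn1[myPos];
        const float2 a2 = aIn2[myPos];
        float2 result;
        result.x = a1.x + a2.x;
        result.y = a1.y + a2.y;
        aOut[myPos] = result;
#else
        aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;
        aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;
#endif
    }
}
```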
#12
Posted 02/28/2012 12:46 PM   