Performance degradation in CUDA 4.1 vs. 4.0
Hi all,
Since our application must not only be fast but also perform certain
operations within fixed deadlines (we analyze a continuous radio signal),
we benchmark all of our algorithms several times per day.

We are experiencing a clear performance degradation after adopting CUDA 4.1 in place of the old CUDA 4.0.

I have attached 4 images showing the historical performance data for 4 algorithms
(they are not the only affected ones, but they are the simplest for showing you the kernel code).

For all graphs the reported time is in milliseconds (y-axis).

All kernels are launched in this way:

#define BLOCK_SIZE (1<<9)

dim3 myThreads(BLOCK_SIZE);
dim3 myGrid( (aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);

Kernel<<< myGrid, myThreads>>>(.....);

We use Tesla C2050 cards with ECC off.
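The timings below are collected with CUDA events around each kernel launch; a minimal sketch of the idea (the function name and structure here are illustrative, not our actual harness):

```cuda
#define BLOCK_SIZE (1<<9)

// Illustrative timing helper: measures one launch of the sum kernel in
// milliseconds with CUDA events. Error checking omitted for brevity.
float TimeVectorSumMs(const float2* aIn1, const float2* aIn2,
                      float2* aOut, unsigned int aSize) {
    dim3 myThreads(BLOCK_SIZE);
    dim3 myGrid((aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    VectorVectorSumKernelCC_O<<<myGrid, myThreads>>>(aIn1, aIn2, aOut, aSize);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```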

============================================================================
Sum of two complex vectors (2^20 complex)

[attachment=24150:add_cc.png]
[code]
__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;
        aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;
    }
}
[/code]
============================================================================
Product of two complex vectors (2^20 complex)

[attachment=24151:mul_cc.png]

[code]
__global__ void
MulKernel_cv_cv_o(const float2* aIn1,
                  const float2* aIn2,
                  float2* aOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float myReal1 = aIn1[myPos].x;
        const float myReal2 = aIn2[myPos].x;
        const float myImag1 = aIn1[myPos].y;
        const float myImag2 = aIn2[myPos].y;
        aOut[myPos].x = myReal1 * myReal2 - myImag1 * myImag2;
        aOut[myPos].y = myReal1 * myImag2 + myImag1 * myReal2;
    }
}
[/code]
============================================================================
Product of two complex vectors (2^20 complex), in place

[attachment=24152:mul_cc_i.png]

[code]
__global__ void
MulKernel_cv_cv_i(const float2* aIn,
                  float2* aInOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float myTmp = aInOut[myPos].x;
        const float myInR = aIn[myPos].x;
        const float myInI = aIn[myPos].y;
        aInOut[myPos].x = myInR * aInOut[myPos].x - myInI * aInOut[myPos].y;
        aInOut[myPos].y = myInR * aInOut[myPos].y + myInI * myTmp;
    }
}
[/code]
============================================================================
Tone generation (2^20 vector long)

[attachment=24153:tone.png]

[code]
__global__ void
ComplexExpKernel(float2* aInOut,
                 const unsigned int aSize,
                 const float aMagnitude,
                 const float aNormalizedFrequency,
                 const float aInverseFrequency,
                 const float aPhase) {

    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;

    if (myPos < aSize) {
        const float myArgument = aNormalizedFrequency * fmodf((float)myPos, aInverseFrequency) + aPhase;
        aInOut[myPos].x = aMagnitude * __cosf(myArgument);
        aInOut[myPos].y = aMagnitude * __sinf(myArgument);
    }
}
[/code]
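Presumably the fmodf in the tone kernel keeps the phase argument small because the fast __cosf/__sinf intrinsics lose accuracy for large arguments. A host-side double-precision reference one might compare the kernel against (the function name and the idea of such a check are my own; only the formula comes from the kernel above):

```cuda
#include <math.h>

// Hypothetical host reference for the tone generator, computed in double
// precision. Useful for bounding the error of the fast device intrinsics.
void ComplexExpReference(float2* aOut,
                         unsigned int aSize,
                         float aMagnitude,
                         float aNormalizedFrequency,
                         float aInverseFrequency,
                         float aPhase) {
    for (unsigned int i = 0; i < aSize; ++i) {
        // Same wrapping as the kernel: fmod keeps the argument small, which
        // matters for the limited-range fast intrinsics on the device.
        const double arg = (double)aNormalizedFrequency
                         * fmod((double)i, (double)aInverseFrequency)
                         + (double)aPhase;
        aOut[i].x = (float)((double)aMagnitude * cos(arg));
        aOut[i].y = (float)((double)aMagnitude * sin(arg));
    }
}
```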
============================================================================
#1
Posted 02/08/2012 02:21 PM   
Which driver version are you using?

#2
Posted 02/08/2012 03:08 PM   
Indeed it seems to be a surprising regression that the new compiler issues separate load and store instructions for the [font="Courier New"].x[/font] and [font="Courier New"].y[/font] components of a float2.
It can be worked around:

[code]__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        float2 a1 = aIn1[myPos];
        float2 a2 = aIn2[myPos];
        float2 result;
        result.x = a1.x + a2.x;
        result.y = a1.y + a2.y;
        aOut[myPos] = result;
    }
}
[/code]
This generates basically the same object code (for sm_20) with CUDA 4.1 as your original code does with CUDA 4.0.
Still I think Nvidia should fix this rather quickly.
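The same pattern should carry over to the multiply kernel from the first post; a sketch along the same lines (same formula as the original, only the loads and stores are restructured, and not benchmarked here):

```cuda
// Workaround applied to the complex multiply: read and write whole float2
// values so the compiler can emit 64-bit vector loads and stores.
__global__ void
MulKernel_cv_cv_o(const float2* aIn1,
                  const float2* aIn2,
                  float2* aOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float2 a = aIn1[myPos];
        const float2 b = aIn2[myPos];
        float2 result;
        result.x = a.x * b.x - a.y * b.y;  // real part
        result.y = a.x * b.y + a.y * b.x;  // imaginary part
        aOut[myPos] = result;
    }
}
```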
Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#3
Posted 02/08/2012 03:08 PM   
[quote name='mfatica' date='08 February 2012 - 05:08 PM' timestamp='1328713714' post='1366899']
Which driver version are you using?
[/quote]

ver. 285.05.33


At the same time, in order to use NPP I had to remove the extra "," at the end of the enums in npp.h.
#4
Posted 02/08/2012 03:35 PM   
[quote name='tera' date='08 February 2012 - 05:08 PM' timestamp='1328713716' post='1366900']
Indeed it seems to be a surprising regression that the new compiler issues separate load and store instructions for the [font="Courier New"].x[/font] and [font="Courier New"].y[/font] components of a float2.
It can be worked around:

[code]__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        float2 a1 = aIn1[myPos];
        float2 a2 = aIn2[myPos];
        float2 result;
        result.x = a1.x + a2.x;
        result.y = a1.y + a2.y;
        aOut[myPos] = result;
    }
}
[/code]
This generates basically the same object code (for sm_20) with CUDA 4.1 as your original code does with CUDA 4.0.
Still I think Nvidia should fix this rather quickly.
[/quote]

Indeed that fixes the issue. The funny thing is that nppsAdd_32fc is affected by the same problem: it runs
as slowly as my original kernel. And yes, I agree that NVIDIA has to fix this. I have other kernels that are
not as simple as these, and I'm not going to apply that "trick" to all of them.
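One way to avoid repeating the trick in every kernel might be a small __device__ helper that takes and returns whole float2 values, so the call sites only ever touch full float2 loads and stores. A sketch (ComplexMul is an illustrative name I made up, not an existing API):

```cuda
// Hypothetical helper: all arithmetic happens on whole float2 values, so
// every kernel that uses it performs full 64-bit loads/stores and the
// per-component access pattern never appears at the call site.
__device__ __forceinline__ float2 ComplexMul(const float2 a, const float2 b) {
    float2 r;
    r.x = a.x * b.x - a.y * b.y;  // real part
    r.y = a.x * b.y + a.y * b.x;  // imaginary part
    return r;
}

// The in-place multiply rewritten on top of the helper.
__global__ void
MulKernel_cv_cv_i(const float2* aIn,
                  float2* aInOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        aInOut[myPos] = ComplexMul(aIn[myPos], aInOut[myPos]);
    }
}
```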
#5
Posted 02/08/2012 03:43 PM   
Please file a bug against the compiler. When you log into the registered developer website, there is a link to the bug reporting form. Thank you for your help.

#6
Posted 02/08/2012 04:49 PM   
[quote name='njuffa' date='08 February 2012 - 06:49 PM' timestamp='1328719798' post='1366942']
Please file a bug against the compiler. When you log into the registered developer website, there is a link to the bug reporting form. Thank you for your help.
[/quote]

I'm not able to find it on the new site.
#7
Posted 02/08/2012 05:37 PM   
The following works for me:

(1) Login to partners.nvidia.com
(2) Click through the legalese screen
(3) On the start page of the registered developer website, there should be a menu on the left edge of the screen:

Recent Downloads
Current Bugs
Bug Report <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Early Access Feedback

Please let me know if that does not work.
#8
Posted 02/08/2012 05:46 PM   
[quote name='njuffa' date='08 February 2012 - 07:46 PM' timestamp='1328723187' post='1366967']
The following works for me:

(1) Login to partners.nvidia.com
(2) Click through the legalese screen
(3) On the start page of the registered developer website, there should be a menu on the left edge of the screen:

Recent Downloads
Current Bugs
Bug Report <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Early Access Feedback

Please let me know if that does not work.
[/quote]

But that's the old one! I'm a bit puzzled: there are two forums, the old one (this one) and the new tag-based one,
and likewise a new and an old registered developer website.

That's different from the site linked on this page: http://developer.nvidia.com/ ("Registered Developers Website"),
which points to: http://developer.nvidia.com/user/me

I submitted the bug on the old site; while submitting it, CUDA Toolkit version 4.1 was not even listed!

Regards
Gaetano Mendola
#9
Posted 02/08/2012 06:03 PM   
Sorry for the confusion.
For CUDA related questions, this forum (the "old" one) is the best place to get answers.

Edit: What is the bug number?
#10
Posted 02/08/2012 06:33 PM   
[quote name='mfatica' date='08 February 2012 - 07:33 PM' timestamp='1328726016' post='1366982']
Sorry for the confusion.
For CUDA related questions, this forum (the "old" one) is the best place to get answers.
[/quote]

Good to know; then they should remove the "archive" label in front of it :D

Bug ID: 937479

The bug search function is not working; I was able to get that ID from the confirmation email I received.

G.

PS: I tried to send you a private message on the forum, but it says you cannot receive private messages.
#11
Posted 02/08/2012 06:52 PM   
[quote name='tera' date='08 February 2012 - 03:08 PM' timestamp='1328713716' post='1366900']
Indeed it seems to be a surprising regression that the new compiler issues separate load and store instructions for the [font="Courier New"].x[/font] and [font="Courier New"].y[/font] components of a float2.
It can be worked around:
[/quote]
With a big caveat: I've just come across this in my own code and noticed that on compute capability 1.x things are exactly reversed: the original code produces vector load and store instructions, while my "workaround" generates separate loads and stores of the components!

I've checked this with both CUDA 4.0 and 4.1. Quite irritating...
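If both architectures have to be served from one source file, a compile-time switch on __CUDA_ARCH__ is one possible (if ugly) way to give each its preferred form. A sketch, assuming the behavior described above holds (I have not verified the generated code for this exact variant):

```cuda
// Pick the access pattern that vectorizes on each architecture: per the
// observations above, sm_2x prefers whole-float2 accesses under CUDA 4.1,
// while sm_1x reportedly vectorizes the per-component form instead.
__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
#if __CUDA_ARCH__ >= 200
        const float2 a1 = aIn1[myPos];
        const float2 a2 = aIn2[myPos];
        float2 result;
        result.x = a1.x + a2.x;
        result.y = a1.y + a2.y;
        aOut[myPos] = result;
#else
        aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;
        aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;
#endif
    }
}
```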
#12
Posted 02/28/2012 12:46 PM   