Performance degradation: CUDA 4.1 over 4.0

Hi all,

since our application not only has to be fast but must also complete certain operations within fixed deadlines (we analyze a continuous radio signal), we benchmark all of our algorithms several times per day.

We are seeing a clear performance degradation after moving from CUDA 4.0 to CUDA 4.1.

I have attached 4 images showing the historical performance data of 4 algorithms

(they are not the only affected ones, but they are the ones with the simplest kernel code to show you).

For all graphs the reported time is in milliseconds (y-axis).

All kernels are launched in this way:

#define BLOCK_SIZE (1<<9)

dim3 myThreads(BLOCK_SIZE);

dim3 myGrid( (aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);

Kernel<<< myGrid, myThreads>>>(…);

We have the C2050 cards with ECC off.

============================================================================

Sum of two complex vectors (2^20 complex)

__global__ void

VectorVectorSumKernelCC_O(const float2* aIn1,

                         const float2* aIn2,

                         float2* aOut,

                         const unsigned int aSize) {

  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;

  if (myPos < aSize) {

    aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;

    aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;

  }

}

============================================================================

Product of two complex vectors (2^20 complex)

__global__ void

MulKernel_cv_cv_o(const float2* aIn1,

                  const float2* aIn2,

                  float2* aOut,

                  const unsigned int aSize) {

  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;

  if (myPos < aSize) {

    const float myReal1 = aIn1[myPos].x;

    const float myReal2 = aIn2[myPos].x;

    const float myImag1 = aIn1[myPos].y;

    const float myImag2 = aIn2[myPos].y;

    aOut[myPos].x = myReal1 * myReal2 - myImag1 * myImag2;

    aOut[myPos].y = myReal1 * myImag2 + myImag1 * myReal2;

  }

}

============================================================================

Product of two complex vectors (2^20 complex), in place

__global__ void

MulKernel_cv_cv_i(const float2* aIn,

                  float2* aInOut,

                  const unsigned int aSize) {

  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;

  if (myPos < aSize) {

    const float myTmp = aInOut[myPos].x;

    const float myInR = aIn[myPos].x;

    const float myInI = aIn[myPos].y;

    aInOut[myPos].x = myInR * aInOut[myPos].x - myInI * aInOut[myPos].y;

    aInOut[myPos].y = myInR * aInOut[myPos].y + myInI * myTmp;

  }

}

============================================================================

Tone generation (2^20 vector long)

__global__ void

ComplexExpKernel(float2* aInOut,

                const unsigned int aSize,

                const float aMagnitude,

                const float aNormalizedFrequency,

                const float aInverseFrequency,

                const float aPhase) {

  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;

  if (myPos < aSize) {

    const float myArgument = aNormalizedFrequency * fmodf((float)myPos, aInverseFrequency) + aPhase;

    aInOut[myPos].x = aMagnitude * __cosf(myArgument);

    aInOut[myPos].y = aMagnitude * __sinf(myArgument);

  }

}
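As an aside, unrelated to the regression: since this kernel computes both __cosf and __sinf of the same argument, the two can be combined with the __sincosf intrinsic, and the result can be assembled locally before a single store of the float2. A hypothetical, untested variant (the name ComplexExpKernelSC is made up for illustration):

```cuda
// Hypothetical variant of ComplexExpKernel using the __sincosf
// intrinsic to get fast-math sine and cosine in one call.
__global__ void
ComplexExpKernelSC(float2* aInOut,
                   const unsigned int aSize,
                   const float aMagnitude,
                   const float aNormalizedFrequency,
                   const float aInverseFrequency,
                   const float aPhase) {
  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
  if (myPos < aSize) {
    const float myArgument = aNormalizedFrequency * fmodf((float)myPos, aInverseFrequency) + aPhase;
    float mySin, myCos;
    __sincosf(myArgument, &mySin, &myCos);  // sine and cosine together
    float2 myResult;                        // build the value locally so the
    myResult.x = aMagnitude * myCos;        // compiler has a chance to emit
    myResult.y = aMagnitude * mySin;        // a single vector store
    aInOut[myPos] = myResult;
  }
}
```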

============================================================================

Which driver version are you using?

Indeed it seems to be a surprising regression that the new compiler issues separate load and store instructions for the .x and .y components of a float2.

It can be worked around:

__global__ void

VectorVectorSumKernelCC_O(const float2* aIn1,

                         const float2* aIn2,

                         float2* aOut,

                         const unsigned int aSize) {

  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;

  if (myPos < aSize) {

    float2 a1 = aIn1[myPos];

    float2 a2 = aIn2[myPos];

    float2 result;

    result.x = a1.x + a2.x;

    result.y = a1.y + a2.y;

    aOut[myPos] = result;

  }

}

This generates basically the same object code (for sm_20) with CUDA 4.1 as your original code does with CUDA 4.0.
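The same pattern should carry over to the other kernels, e.g. the in-place multiply. A sketch (untested; the _wa suffix is just a made-up name for the reworked kernel):

```cuda
// Hypothetical reworking of MulKernel_cv_cv_i with the same idea:
// read each float2 exactly once, write the result with one store.
__global__ void
MulKernel_cv_cv_i_wa(const float2* aIn,
                     float2* aInOut,
                     const unsigned int aSize) {
  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
  if (myPos < aSize) {
    const float2 myA = aIn[myPos];     // one vector load
    const float2 myB = aInOut[myPos];  // one vector load
    float2 myResult;
    myResult.x = myA.x * myB.x - myA.y * myB.y;  // complex multiply
    myResult.y = myA.x * myB.y + myA.y * myB.x;
    aInOut[myPos] = myResult;          // one vector store
  }
}
```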

Still I think Nvidia should fix this rather quickly.

ver. 285.05.33

On a side note: in order to use NPP I have to remove the extra trailing “,” at the end of the enums in npp.h.

Indeed that fixes the issue. The funny thing is that nppsAdd_32fc is affected by the same problem: it is as slow

as my original kernel. And yes, I do agree that NVidia has to fix it; I have other kernels that are not as simple as these,

and I’m not going to apply that “trick” to all of them.

Please file a bug against the compiler. When you log into the registered developer website, there is a link to the bug reporting form. Thank you for your help.

I’m not able to find it in the new site.

The following works for me:

(1) Login to partners.nvidia.com
(2) Click through the legalese screen
(3) On the start page of the registered developer website, there should be a menu on the left edge of the screen:

Recent Downloads
Current Bugs
Bug Report <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Early Access Feedback

Please let me know if that does not work.

But that’s the old one! I’m a bit puzzled: two forums, one old (this one) and the tag-based one.

A new and an old registered developer website.

That’s different from the site linked on this page: http://developer.nvidia.com/ (Registered Developers Website),

which points to: http://developer.nvidia.com/user/me

I submitted it on the old site; while submitting it, CUDA Toolkit version 4.1 was not even listed!

Regards

Gaetano Mendola

Sorry for the confusion.
For CUDA related questions, this forum (the “old” one) is the best place to get answers.

Edit: What is the bug number?

Good to know, let them remove that “archive” in front of it :D

Bug ID: 937479

The bug search function is not working; I was able to get that ID from the confirmation email.

G.

PS: I tried to send you a private message in the forum, but it says you cannot receive private messages.

With a big caveat: I’ve just come across this in my own code and noticed that on compute capability 1.x things are exactly reversed: The original code produces vector load and store instructions, while my “workaround” generates loads and stores of the components!

I’ve checked this with both CUDA 4.0 and 4.1. Quite irritating…