Local register spilling causing poor performance?

Hello all,

I am writing a GPU kernel that contains a for-loop which accumulates into a local variable that is used later in the same kernel. The results are correct, but the performance leaves something to be desired. I am looking for any ideas to boost performance. Could the poor performance be caused by the local variable “spilling” to global memory?

The code is included below. Any help would be greatly appreciated.

__global__ void my_kernel(float4 *lparticle, float4 *nparticle, 
    float *f, int *nneigh, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i < n) {
        float4 pj;
        float4 pi = lparticle[i];
        int neigh = nneigh[i];

        float val = 0.0f;
        for(int k = 0; k < neigh; ++k) {
            pj = nparticle[neigh*i + k];

            // The following line is the HIGHEST COST - I have isolated it to
            // the 'val +=' operation and not the 'pj.w*K(pi, pj)' operation.
            val += pj.w*K(pi, pj);

        }

        for(int j = 0; j < i; ++j) {
            pi = lparticle[i];
            pj = lparticle[j];
            val += pj.w*K(pi, pj);
        }
        for(int j = i + 1; j < n; ++j) {
            pi = lparticle[i];
            pj = lparticle[j];
            val += pj.w*K(pi, pj);
        }
        f[i] += val;
    }

}

What is the nvcc command line used to build this code? If you add -Xptxas -v to the command line arguments, what is being reported by the compiler?

On first impression, your code looks entirely memory bound. This means you would want to pay close attention to access patterns to maximize memory throughput. The CUDA profiler can help in that endeavor. In particular, your code seems to employ an additional level of indirection through nneigh when accessing the data. This may result in essentially random accesses if there is not much locality. A “random” access pattern could easily reduce the memory throughput by about an order of magnitude. What GPU are you running on?

As a quick test, assuming the arrays passed to the kernel are not aliased, try changing the function signature to the following and check whether any performance differences result:

__global__ void my_kernel(const float4 * __restrict__ lparticle,
                          const float4 * __restrict__ nparticle,
                          float *f, const int * __restrict__ nneigh, int n);

Thank you for the reply.

I am using a Kepler K40 (compute capability 3.5), and the arrays passed to the kernel are not aliased. The indirection is unfortunately necessary for this particular algorithm, but I have been debating padding the neighbor lists to a multiple of the warp size to get coalesced accesses.

Some follow-up questions/ideas:

  1. What does the ‘restrict’ keyword applied to the argument arrays do with regard to the GPU kernel?
  2. I have read a bit about the Kepler __shfl_xor operation - could a warp-level reduction assist with performance? A rough sketch of what I have in mind is below.
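
Untested sketch (warp_reduce_sum is just a name I made up; on Kepler the pre-CUDA-9 __shfl_xor takes no mask argument, while newer toolkits use __shfl_xor_sync):

__device__ __forceinline__ float warp_reduce_sum(float val)
{
    // Butterfly reduction: after log2(32) = 5 steps every lane of the
    // warp holds the sum of the 32 per-lane values.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_xor(val, offset);
    return val;
}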

Thank you again.

For usage of restrict, see the CUDA Programming Guide and the CUDA Best Practices Guide.

In short, “restrict” is a keyword that was added to the ISO C99 standard but has not made its way into standard C++ yet. It is, however, supported as an extension by various C++ tool chains, including CUDA, with the double underscores marking it as tool-chain specific. It is basically a programmer’s promise to the compiler that the object reached through a pointer with this attribute is not accessed through any other pointer in the same scope as the pointer declaration. In other words, it tells the compiler there is no aliasing of the pointed-to data, which lets the compiler apply further optimizations, in particular load/store re-ordering.

In conjunction with the ‘const’ (= read-only) attribute, it may also allow the compiler to classify the data accessed through the pointer as read-only across the lifetime of the kernel, which allows the data to be read through the (non-coherent) texture path. This may result in slightly higher performance if there is some locality in the accesses that cannot be exploited when reading through regular cached loads.
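
As a side note (the little kernel below is just my illustration, not code from your application): on sm_35 the read-only data path can also be requested explicitly with the __ldg() intrinsic, independent of the const/__restrict__ qualifiers.

__global__ void gather_w(const float4 *particle, float *out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        // Load through the read-only (texture) cache; requires -arch=sm_35 or later.
        float4 p = __ldg(&particle[i]);
        out[i] = p.w;
    }
}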

I have no insight into __shfl_xor. I think the performance-limiting portion of the code is really the access pattern of those indirect loads, and I would suggest using the profiler to assess how significant an issue that is (without seeing the data in the index vector, it is impossible to estimate how poor the load performance is in this case).

These 2 reads will coalesce nicely:

float4 pi = lparticle[i];
int neigh = nneigh[i];

This read will not:

pj = nparticle[neigh*i + k];

Consider re-arranging the storage pattern of nparticle so that the read can be done equivalently like this:

pj = nparticle[neigh*k + i];
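
To make the layout idea concrete, here is a sketch (my own illustration, assuming the neighbour lists are padded to a common maxNeigh and stored “neighbour-major” with stride n, i.e. neighbour k of particle i at index k*n + i, so a warp reads 32 consecutive float4 elements per iteration):

__global__ void neighbour_sum(const float4 * __restrict__ nparticle,
                              float * __restrict__ f,
                              int maxNeigh, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float val = 0.0f;
        for (int k = 0; k < maxNeigh; ++k) {
            float4 pj = nparticle[k * n + i];   // coalesced: consecutive i within a warp
            val += pj.w;                        // stand-in for pj.w * K(pi, pj)
        }
        f[i] += val;
    }
}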

What does K() do? Does it modify pi (i.e., its first parameter)?

If it does not, this load seems highly redundant (the compiler would probably figure this out, though):

for(int j = 0; j < i; ++j) {
    pi = lparticle[i];

I wouldn’t trust it though :-)

From a quick glance, it would seem that lparticle could be suitable for placement in constant memory (it might be that the compiler has already figured this out and is emitting an LDU instruction).
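
A sketch of the constant-memory variant (my illustration, not code from this thread; it only applies if the whole particle array fits in the 64 KB constant bank, i.e. n <= 4096 for float4 data):

#include <cuda_runtime.h>

#define MAX_PARTICLES 4096

__constant__ float4 c_lparticle[MAX_PARTICLES];

// Host side: copy the particle data into constant memory before the launch.
void upload_particles(const float4 *h_lparticle, int n)
{
    cudaMemcpyToSymbol(c_lparticle, h_lparticle, n * sizeof(float4));
}

// In the kernel, lparticle[j] becomes c_lparticle[j]; since j is identical
// across the threads of a warp in the second and third loops, those reads
// are broadcast from the constant cache.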

Note that the LDU (load uniform) instruction only exists on older GPUs. If the pointers have ‘const’ and ‘restrict’ attributes, LDU should be generated for the architectures that support it (at least this worked the last time I checked, which was a couple of years ago).

i is the globally unique thread index, so I don’t think LDU applies - an access indexed by i would not constitute a uniform load. const and restrict are still good general suggestions, and may well give some speedup depending on the GPU.

Agreed, the index ‘i’ is not uniform (= identical across all threads in a warp) in this code, and so uniform load does not apply.

Thanks to all for the replies.

njuffa, I used the restrict keywords as suggested and the kernel now runs almost twice as fast as before :) Thank you.

txbob, you are correct, “pi = lparticle[i]” was definitely redundant - a result of late-night coding.

I got a further boost in performance by passing ‘-use_fast_math’. This lowered the accuracy, but not to a significant degree, and when I combined it with the aforementioned const and restrict, the GPU kernel ended up more than twice as fast as the CPU-only version.

Thanks again for everything, guys.

I find this observation quite puzzling. Nothing in the code you showed above should be affected by -use_fast_math. Are you, by any chance, referring to other kernels in your code that saw a speedup from -use_fast_math?

Totally missed that line, thanks :-)

The K() function calls a bunch of trigonometric functions.

That would explain the speed-up :-) If these are mostly sines and cosines, you may be able to use functions like sincos() and sincospi() to achieve a speedup without sacrificing any accuracy. Feel free to post the code for analysis. If you are happy with -use_fast_math, you can of course also just leave it at that :-)
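
A minimal sketch of the sincos idea (K_example is just an illustration; the real K() was not posted). If K() needs both the sine and cosine of the same angle, sincosf() computes the pair in one call, which is typically cheaper than separate sinf()/cosf() calls and loses no accuracy:

__device__ float K_example(float4 pi, float4 pj)
{
    float r = pi.x - pj.x;    // stand-in for whatever argument K() really uses
    float s, c;
    sincosf(r, &s, &c);       // one call instead of sinf(r) + cosf(r)
    return s * c;
}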