Understanding unrolling and concurrent memory operations

What exactly happens when a loop of independent memory operations is unrolled?

How are they made concurrent by the compiler? Is the number of instructions the same aside from the loop condition? If a large number of iterations are unrolled, then there must be some mechanism at runtime that understands the amount of memory available and thus limits the number of concurrent operations. What is this mechanism?

How does it relate to the number of instructions per cycle (IPC)?

Not sure if this is too much. It would be great if you could briefly answer these questions and direct me to a resource where I could read further.

To first order, partial unrolling of loops consists of replicating the loop body N times and adjusting the loop control. This means unrolling may save loop-control overhead, but in the case of partial unrolling we may now have additional startup and cleanup costs.
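As a rough source-level sketch (body(i) here is just a stand-in for whatever the loop body does; the factor 4 is arbitrary), the transformation looks like this:

// original loop
for (int i = 0; i < n; ++i)
    body(i);

// after partial unrolling by a factor of 4
int i = 0;
for (; i + 3 < n; i += 4)       // main loop: 4 copies of the body
{
    body(i);
    body(i + 1);
    body(i + 2);
    body(i + 3);
}
for (; i < n; ++i)              // cleanup loop: the n % 4 leftovers
    body(i);

The loop-control instructions (compare, branch, counter update) now execute once per 4 iterations instead of once per iteration, which is where the savings come from; the second loop is the cleanup cost mentioned above.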

If the original loop body was a single basic block, loop unrolling creates a larger basic block, which gives the compiler more freedom in scheduling the instructions inside it. Instructions belonging to different iterations can be mixed more freely, and loads can be pulled toward the front of the block for additional latency tolerance. The number of loads stays the same, but the improved scheduling may improve throughput. On the other hand, scheduling loads early may increase register pressure, possibly lowering occupancy, which can have a negative impact on performance.

Complete unrolling of a loop may convert variable indexing into compile-time constant indexing, allowing local arrays indexed inside the loop to be allocated to registers after unrolling; this is an instance of scalar replacement. Conversely, partially unrolling a loop could create a loop body that no longer fits into the ICache, reducing performance slightly due to ICache misses on loop closure (there is no branch prediction on GPUs).
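To illustrate the scalar-replacement point with a made-up fragment (dot4, in, and x are hypothetical names):

__device__ float dot4(const float *in, float x)
{
    float c[4];                      // small per-thread local array
    #pragma unroll                   // request complete unrolling
    for (int i = 0; i < 4; ++i)
        c[i] = in[i] * x;            // after unrolling the indices are
                                     // the constants 0, 1, 2, 3, so
                                     // c[] can live in four registers
    return c[0] + c[1] + c[2] + c[3];
}

Without the unrolling, the variable index i would typically force c[] into local memory.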

So loop unrolling is an optimization with complex trade-offs, which the compiler tries to gauge via various suitable heuristics. As far as I recall from my discussions with the CUDA compiler team about loop unrolling, there is nothing structurally different about loop unrolling for GPU code compared to making that transformation in CPU code, so any compiler book that treats loop unrolling in some detail should give you sufficient insight into what is happening.

I’m messing around with some unrolling right now and have the code handy, so I thought I’d illustrate njuffa’s first point with some actual code.

Here we have a very simple first pass reduction loop:

for (int i = tid; i < n; i += THREADS)   // each thread sums every THREADS-th element
{
    sum += __ldg(in);                    // load via the read-only data cache
    in  += THREADS;
}

And here is the SASS that gets generated:

TARGET4:
        MOV R12, R4;
        LDG.E.CI R6, [R4];
        IADD32I R12.CC, R12, 0x2000;
        LDG.E.CI R9, [R4+0x800];
        IADD32I R8, R8, 0x800;
        LDG.E.CI R10, [R4+0x1000];
        MOV32I R13, 0xfffffa00;
        LDG.E.CI R11, [R4+0x1800];
        MOV R14, param_3;
        FADD R6, R7, R6;
        MOV R7, R5;
        FADD R5, R6, R9;
        IADD.X R7, RZ, R7;
        VABSDIFF.ACC RZ.CC, R8, R14, R13;
        DEPBAR.LE SB5, 0x1;
        FADD R4, R5, R10;
        MOV R5, R7;
        FADD R6, R4, R11;
        MOV R4, R12;
        MOV R7, R6;
        BRA CC.GT, TARGET4;
        SYNC;
TARGET3:

        ISETP.LT.AND P0, PT, R8, param_n, PT;
   @!P0 SYNC;

TARGET2:
        SSY TARGET5;

TARGET6:
        IADD32I R8, R8, 0x200;
        LDG.E.CI R6, [R4];
        IADD32I R4.CC, R4, 0x800;
        ISETP.LT.AND P0, PT, R8, param_n, PT;
        IADD.X R5, RZ, R5;
        FADD R6, R7, R6;
        MOV R7, R6;
    @P0 BRA TARGET6;
        SYNC;
        
TARGET5:
        SYNC;

So the simple loop gets broken up into two loops. One that’s unrolled 4 times and one that’s not unrolled. The big loop handles most of the iterations, and the small one handles the ones at the end that don’t fit in a multiple of 4.

Looking at the big loop there is a clear advantage to having it unrolled 4 times: you can queue up 4 memory loads (LDG instructions) before issuing the first FADD instruction. So you’ve basically cut the exposed memory latency per element by a factor of 4; the additional 3 loads don’t add much to the total latency since the bulk of their latencies are covered by the first load. I’m not sure why the compiler chooses 4 as the unroll factor, but I’m assuming the CUDA team has benchmarked this kind of code heavily and figured out that 4 gives good performance (or maybe they’re just working from theoretical hardware knowledge).
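In source form, the unrolled main loop is roughly equivalent to this hand-unrolled sketch (my reconstruction from the SASS, reusing the variables from the original fragment; not the actual compiler output):

for (int i = tid; i + 3 * THREADS < n; i += 4 * THREADS)
{
    float a = __ldg(in);                 // all four loads issue back to
    float b = __ldg(in +     THREADS);   // back, so their latencies
    float c = __ldg(in + 2 * THREADS);   // overlap instead of
    float d = __ldg(in + 3 * THREADS);   // serializing
    sum += a;                            // the dependent adds then only
    sum += b;                            // pay the load latency once
    sum += c;
    sum += d;
    in  += 4 * THREADS;
}

The second, non-unrolled SASS loop then picks up whatever iterations are left over.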

A small number of additional registers are required for this unrolling, but in this case the trade-off is almost certainly well worth it.

One potential positive side-effect of complete loop unrolling I failed to mention is that it can enable additional optimizations, in particular constant propagation followed by dead code elimination, which may lead to a significant reduction in the number of instructions executed.
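A contrived sketch of that effect (sum_even and in are hypothetical names): once the loop is completely unrolled, the loop counter is a compile-time constant in each copy of the body, so the branch condition folds to true or false and the untaken copies are deleted as dead code:

__device__ float sum_even(const float *in)
{
    float acc = 0.0f;
    #pragma unroll                    // complete unrolling: i becomes a
    for (int i = 0; i < 8; ++i)       // constant in each body copy
    {
        if ((i & 1) == 0)             // folds per copy via constant
            acc += in[i];             // propagation; the four odd-i
    }                                 // copies are eliminated as dead
    return acc;
}

Only the four even-index loads and adds remain of the original eight-iteration loop.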

@scottgray: Loop unrolling is one of those optimizations that can occur both in the NVVM frontend and the PTXAS backend. I recall a case where the frontend unrolled a loop by a factor of four, then PTXAS unrolled the resulting code by a factor of two. The frontend transformations are largely machine-independent; at that stage little is known about register pressure, for example.

One of the most important pieces of information when deciding on unrolling in general, and the unroll factor in particular, is the loop’s trip count, which often cannot be known with certainty. I have certainly seen cases where the CUDA compiler’s heuristics applied incorrect assumptions in that regard, hurting performance by partially unrolling loops with short trip counts. Unfortunately the CUDA compiler does not support loop attributes that would let programmers hint at the expected trip count.

Compilers I have worked with (independent of CUDA) seem to routinely apply unroll factors of 4 or 8 for partial unrolling. It seems to be a common heuristic. It may make sense to dial in the “best” unroll factor by hand using a #pragma unroll. In practice I have never seen more than single-digit percentage gains from manually selecting the unroll factor.
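For reference, the pragma syntax looks like this (applied here to hypothetical loops reusing the names from the fragments above):

#pragma unroll 1                          // forbid unrolling, e.g. for
for (int i = tid; i < n; i += THREADS)    // loops with short trip counts
    sum += __ldg(in + i);

#pragma unroll 8                          // request an unroll factor of 8
for (int i = tid; i < n; i += THREADS)
    sum += __ldg(in + i);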