Hi,
I’m working on a fairly large kernel and have noticed weird behavior in the CUDA 9.0 nvcc compiler. The behavior causes a very significant slowdown of my code due to local memory load and store spills. This was not the case a few major CUDA versions ago, but I cannot tell exactly when it started. I managed to reduce the bug to a small test case, which I’m attaching below.
The logic of the code is as follows. The kernel maintains a small matrix of floats per thread. The kernel threads, of which I launch just one for this example, run a few iterations over that matrix; each iteration copies one row of the matrix into shared memory. The kernel produces no output besides the change to shared memory. My goal is to keep that small matrix in registers to achieve optimal performance, and that is what the compiler fails to do. Since the example launches just one block with one thread, there should be more than enough registers to hold a 6x6 matrix.
I compiled the code using the command line supplied with it, and I do see ld.local instructions in the generated PTX file (an excerpt is attached below the code). Clearly the code sample here is very small and the program as a whole is a no-op, yet the local memory accesses are not optimized out, as the PTX shows. I observe the same behavior in my large kernel.
Once I apply some very basic changes to the code (see the comments in it), the spills disappear: the compiler warning goes away, and the ld.local instructions vanish from the PTX file.
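For concreteness, here is the variant that compiles without the warning on my setup. The only change is the flipped comparison (keeping == but reducing DIM to 5 has the same effect); the copy semantics are of course different, which is fine for a no-op repro. The kernel name is made up for this post, and it assumes the same sharedMem and ITERATIONS declarations as the attached code:

```cuda
// Identical to the attached kernel except for the comparison operator.
// With != instead of ==, ptxas no longer prints "Local memory used for
// function", and the PTX contains no ld.local/st.local instructions.
__global__ void kernNoSpill(void) {
  const int DIM = 6;
  float arr[DIM][DIM];
  for (int iter = 0; iter < ITERATIONS; ++iter) {
#pragma unroll
    for (int i = 0; i < DIM; ++i)
#pragma unroll
      for (int j = 0; j < DIM; ++j)
        if (i != iter)  // was: i == iter
          sharedMem[j] = arr[i][j];
  }
}
```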
Would love to hear some opinions about this situation. Thanks!
Save this code as broken-unrolling.cu to use my command line.
// This program demonstrates what seems to be a loop unrolling bug. The bug
// causes loops to not be unrolled in a fairly straightforward code.
// To build and run (linux):
// $ nvcc -keep -O3 -g -lineinfo -gencode arch=compute_50,code=sm_50 -Xcompiler "-Wall -Wextra -Wno-unused -Werror -fno-strict-aliasing" -Xptxas "-v -warn-lmem-usage -warn-spills" -o broken-unrolling.bin broken-unrolling.cu && ./broken-unrolling.bin
// Output of the above command (Note warning message in the first line of the
// output):
// ptxas warning : Local memory used for function '_Z4kernv'
// ptxas info : 0 bytes gmem
// ptxas info : Compiling entry function '_Z4kernv' for 'sm_50'
// ptxas info : Function properties for _Z4kernv
// 144 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
// ptxas info : Used 9 registers, 320 bytes cmem[0], 24 bytes cmem[2]
// Kernel launch success, status = 0
// Thread sync success, status = 0
// Toolset version:
// $ nvcc --version
// nvcc: NVIDIA (R) Cuda compiler driver
// Copyright (c) 2005-2017 NVIDIA Corporation
// Built on Fri_Sep__1_21:08:03_CDT_2017
// Cuda compilation tools, release 9.0, V9.0.176
#include <stdio.h>
extern __shared__ int sharedMem[];
const int ITERATIONS = 10;
__global__ void kern(void) {
  // Constant 6 below results in local storage, while constant 5 doesn't.
  const int DIM = 6;
  float arr[DIM][DIM];
  for (int iter = 0; iter < ITERATIONS; ++iter) {
    // Copy row number iter of matrix arr into sharedMem.
#pragma unroll
    for (int i = 0; i < DIM; ++i)
#pragma unroll
      for (int j = 0; j < DIM; ++j)
        // Comparison operator == below results in local storage, while != doesn't.
        if (i == iter)
          sharedMem[j] = arr[i][j];
  }
}
int main(int argc, char** argv) {
  // Launch with some dynamic shared memory.
  kern<<<1, 1, 100>>>();
  cudaError_t status = cudaGetLastError();
  if (status == cudaSuccess)
    printf("Kernel launch success, status = %i\n", status);
  else
    printf("Kernel launch failure, status = %i, error: %s\n",
           status, cudaGetErrorString(status));
  status = cudaThreadSynchronize();
  if (status == cudaSuccess)
    printf("Thread sync success, status = %i\n", status);
  else
    printf("Thread sync failure, status = %i, error: %s\n",
           status, cudaGetErrorString(status));
  return 0;
}
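(Aside: the local memory usage can also be confirmed at run time, not just from the ptxas output. A minimal sketch using cudaFuncGetAttributes, which reports the local memory footprint of a compiled kernel; the helper name is made up for this post:)

```cuda
// Hypothetical helper, callable from main() after the kernel launch:
// queries how much local memory the compiled kernel actually uses.
void reportLocalMem(void) {
  cudaFuncAttributes attr;
  if (cudaFuncGetAttributes(&attr, kern) == cudaSuccess)
    // A non-zero value here corresponds to the ptxas warning
    // "Local memory used for function '_Z4kernv'".
    printf("kern local memory: %zu bytes\n", attr.localSizeBytes);
}
```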
Here’s an excerpt from the generated PTX file:
BB0_13:
.loc 1 47 4
ld.local.v2.f32 {%f61, %f62}, [%rd6];
cvt.rzi.s32.f32 %r34, %f61;
ld.local.v2.f32 {%f64, %f65}, [%rd6+8];
ld.local.v2.f32 {%f66, %f67}, [%rd6+16];
st.shared.u32 [sharedMem], %r34;
cvt.rzi.s32.f32 %r35, %f62;
st.shared.u32 [sharedMem+4], %r35;
cvt.rzi.s32.f32 %r36, %f64;
st.shared.u32 [sharedMem+8], %r36;
cvt.rzi.s32.f32 %r37, %f65;
st.shared.u32 [sharedMem+12], %r37;
cvt.rzi.s32.f32 %r38, %f66;
st.shared.u32 [sharedMem+16], %r38;
cvt.rzi.s32.f32 %r39, %f67;
st.shared.u32 [sharedMem+20], %r39;
bra.uni BB0_14;