Tensor WMMA INT8 vs FP16 processing speed

Hi all,

I recently got an RTX card and wanted to test out the speed when using the new INT8 mode of the Turing tensor cores vs. the regular FP16 mode.

I used the sample code from the “Programming Tensor Cores in CUDA 9” developer blog (code-samples/posts/tensor-cores at master · NVIDIA-developer-blog/code-samples · GitHub) and modified it slightly so I had one kernel that did FP16 WMMA and another kernel that did INT8 WMMA.

__global__ void wmma_example_f16(half *a, half *b)
{
   // Tile using a 2D grid
   int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, WmmaDim, WmmaDim, WmmaDim, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, WmmaDim, WmmaDim, WmmaDim, half, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, WmmaDim, WmmaDim, WmmaDim, float> acc_frag;

   wmma::fill_fragment(acc_frag, 0);

   // Loop over k
   for (int i = 0; i < MatDim; i += WmmaDim) {
      int aRow = warpM * WmmaDim;
      int aCol = i;

      int bRow = i;
      int bCol = warpN * WmmaDim;

      // Bounds checking
      if (aRow < MatDim && aCol < MatDim && bRow < MatDim && bCol < MatDim) {
         // Load the inputs
         wmma::load_matrix_sync(a_frag, a + aRow + aCol * MatDim, MatDim);
         wmma::load_matrix_sync(b_frag, b + bRow + bCol * MatDim, MatDim);
 
         // Perform the matrix multiplication
         wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
      }
   }
} // wmma_example_f16

__global__ void wmma_example_i8(signed char *a, signed char *b)
{
   // Tile using a 2D grid
   int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, WmmaDim, WmmaDim, WmmaDim, signed char, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, WmmaDim, WmmaDim, WmmaDim, signed char, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, WmmaDim, WmmaDim, WmmaDim, int> acc_frag;

   wmma::fill_fragment(acc_frag, 0);

   // Loop over k
   for (int i = 0; i < MatDim; i += WmmaDim) {
      int aRow = warpM * WmmaDim;
      int aCol = i;

      int bRow = i;
      int bCol = warpN * WmmaDim;

      // Bounds checking
      if (aRow < MatDim && aCol < MatDim && bRow < MatDim && bCol < MatDim) {
         // Load the inputs
         wmma::load_matrix_sync(a_frag, a + aRow + aCol * MatDim, MatDim);
         wmma::load_matrix_sync(b_frag, b + bRow + bCol * MatDim, MatDim);
 
         // Perform the matrix multiplication
         wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
      }
   }
} // wmma_example_i8

In both kernels, I’ve removed the code that stores the result fragment back out to global memory, just for brevity.
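For reference, the omitted store was along these lines (following the structure of the blog sample; `c` would be an additional output-matrix parameter on each kernel, and for the INT8 version the accumulator is `int` rather than `float`):

```cuda
   // Store the accumulated tile back to global memory (column-major).
   // 'c' is the output matrix pointer the full kernel takes as a parameter.
   int cRow = warpM * WmmaDim;
   int cCol = warpN * WmmaDim;
   if (cRow < MatDim && cCol < MatDim) {
      wmma::store_matrix_sync(c + cRow + cCol * MatDim, acc_frag,
                              MatDim, wmma::mem_col_major);
   }
```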

I verified that the compiler actually compiles the “loop over K” for-loops by generating PTX files and checking that the PTX contains the “wmma.load” and “wmma.mma.sync” instructions. I’ve also verified in the PTX that the INT8 version is using the “wmma.mma.sync.aligned.col.col.m16n16k16.s32.s8.s8.s32” instruction, while the FP16 version is using the “wmma.mma.sync.aligned.col.col.m16n16k16.f32.f32” instruction.
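In case it matters, the PTX was generated with something like this (the `.cu` file name is just a placeholder for my test file; `sm_75` is the Turing architecture):

```shell
nvcc -arch=sm_75 -ptx wmma_test.cu -o wmma_test.ptx
grep -o 'wmma\.mma\.sync[^ ;]*' wmma_test.ptx | sort -u
```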

The weird thing is that both kernels show almost the same execution time (measured with CUDA events). For example, with 2048x2048 matrices, both the INT8 kernel and the FP16 kernel run in around 0.11 ms on an RTX 2060.
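The timing is just the standard CUDA event pattern, roughly like this (the launch dimensions and device pointers are placeholders here):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
wmma_example_i8<<<gridDim, blockDim>>>(d_a_i8, d_b_i8);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
```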

Since INT8 mode is supposed to have double the throughput of FP16 mode, I was expecting the INT8 kernel to execute much faster than the FP16 kernel.

Anyone have any ideas why they both show almost the same execution times?

(I’ve also verified that the compiler is not “optimizing away” the for-K-loop, even with no write out to global memory in the kernel code, because the execution times change with the sizes of the arrays.)

Forgot to add: in the code above, MatDim is defined as 2048, and WmmaDim is defined as 16.

As a further test, to isolate the matmul operation from the memory reads, I also tried removing the two lines containing the “wmma::load_matrix_sync” calls, so that the for-loop executes only the “wmma::mma_sync” call.

With that change, I verified in the resulting PTX files that the loop still contains the “wmma.mma.sync.aligned” instructions. I also verified that the execution times still vary with the matrix size (i.e. the value of MatDim).

But even with this change, both the INT8 kernel and the FP16 kernel still execute in the same amount of time, so I’m still not getting the “double throughput” I was expecting from INT8 mode.

Any ideas?