cuDNN v6 INT8 convolution failing with CUDNN_STATUS_NOT_SUPPORTED

I’m trying to implement INT8 convolution on cuDNN 6, and I’m seeing errors that never occur with 32-bit float. I followed the instructions on page 64 of the User Manual, which state (copied directly):

For the datatype configurations INT8_CONFIG and INT8_EXT_CONFIG, the only algo supported is CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMPUTED_GEMM with the following conditions:

  • xDesc Format Support: CUDNN_TENSOR_NHWC
  • yDesc Format Support: CUDNN_TENSOR_NHWC
  • Input and output features maps must be multiple of 4
  • wDesc Format Support: CUDNN_TENSOR_NHWC
  • Dilation: 1 for all dimensions

I see a few problems here:

  • CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMPUTED_GEMM does NOT exist in cudnnConvolutionFwdAlgo_t. The closest alternative seems to be CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM ("PRECOMP" instead of "PRECOMPUTED")
  • The job fails at cudnnConvolutionForward() with a CUDNN_STATUS_NOT_SUPPORTED error, regardless of which algorithm I choose. I tested every algo type listed on page 16 of the manual.
  • The 32-bit float implementation (CUDNN_DATA_FLOAT) doesn't have this issue.

Can someone take a look? A code snippet is attached below. The same problem occurs on both a GTX 1070 and a Titan X (Pascal), running CUDA 8 and cuDNN 6.0.

checkCudaErrors(cudaSetDevice(gpuid));
checkCUDNN(cudnnCreate(&cudnnHandle));
checkCUDNN(cudnnCreateTensorDescriptor(&dataTensor));
checkCUDNN(cudnnCreateFilterDescriptor(&conv1filterDesc));
checkCUDNN(cudnnCreateConvolutionDescriptor(&conv1Desc));
checkCUDNN(cudnnCreateTensorDescriptor(&conv1Tensor));

checkCUDNN(cudnnSetTensor4dDescriptor(dataTensor, CUDNN_TENSOR_NHWC, 
  CUDNN_DATA_INT8, n, c, h, w));
checkCUDNN(cudnnSetFilter4dDescriptor(conv1filterDesc, CUDNN_DATA_INT8, 
  CUDNN_TENSOR_NHWC, out_channels, in_channels, conv.kernel_size, conv.kernel_size));
checkCUDNN(cudnnSetConvolution2dDescriptor(conv1Desc, pad_height, pad_width, 1, 1, 1, 1, 
  CUDNN_CONVOLUTION, CUDNN_DATA_INT32));
checkCUDNN(cudnnGetConvolution2dForwardOutputDim(conv1Desc, dataTensor, 
  conv1filterDesc, &n, &c, &h, &w));
checkCUDNN(cudnnSetTensor4dDescriptor(conv1Tensor, CUDNN_TENSOR_NHWC, 
  CUDNN_DATA_INT8, n, c, h, w));

// The documentation says CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMPUTED_GEMM, but that
// name is not in the list of supported algo types (see page 16 of the manual),
// so I am using the closest alternative here.
conv1algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;

// Get workspace size
size_t sizeInBytes = <some hardcoded number>;
// Calling checkCUDNN(cudnnGetConvolutionForwardWorkspaceSize(cudnnHandle, dataTensor, 
//                       conv1filterDesc, conv1Desc, conv1Tensor,  conv1algo, &sizeInBytes));
// fails with CUDNN_STATUS_NOT_SUPPORTED. I used a hardcoded number here.

// ... ALLOCATE WORKSPACE AND VARIABLES
// ...

// Fails here
checkCUDNN(cudnnConvolutionForward(cudnnHandle, &alpha, dataTensor, data,
  conv1filterDesc, pconv1,  conv1Desc, conv1algo, workspace, m_workspaceSize,
  &beta,  conv1Tensor, conv1));

What GPU are you running on?

GTX 1070
CUDA 8 and cuDNN 6.0

More information: Same problem on TITAN X (Pascal)

Any help would be appreciated. Thanks in advance!

Please file a bug at developer.nvidia.com

The bug should include a full compilable test case, not a snippet, and also include the exact command line you use to compile the code as well as the command line you use to run the code.

Once you have that done, please advise back here of the bug number that the system assigned, and I will take a look.

If you have difficulty filing a bug, then please file an empty bug with just the title/synopsis, and provide the remaining information here in this thread (full compilable test case, compile command line, etc.) and I will update the bug for you.

@txbob:

Thank you for your response. The bug number is 1920356.

You can get code samples from: https://github.com/jesryu/cudnn_conv_int8

This problem can be reproduced on GTX 1070, Titan X (Pascal), and Quadro 6000.

Thank you again for taking a look. Implementing INT8-based layers can dramatically improve the inference performance, and we are really excited about the prospect of experimenting with this new feature. We look forward to hearing from you soon!

Thanks for filing the bug. It is being looked at.

Please be reminded that this is not merely a documentation bug about algorithm names: I tried every algorithm listed in the documentation and still hit the same error. I hope the source code gives you clear context for the problem. I also provided the working FP32 version to help you nail down the cause of the failure on INT8. Please let me know if you have any questions. I look forward to hearing back from you soon!

Referring to your posted code here:

https://github.com/jesryu/cudnn_conv_int8/blob/master/src/cudnn_conv_int8.cc

some colleagues at NVIDIA looked at this and determined that you are setting your input and output channels incorrectly. The documentation states (you’ve already quoted this section of the doc in your original comment):

“Input and output features maps must be multiple of 4”

When I change this section of code:

int main() {
  // parameters
  int gpu = 0;
  int iterations = 10000;

  // input dimensions
  size_t width = 960;
  size_t height = 600;
#ifdef FIX
  size_t channels = 4;
#else
  size_t channels = 3;
#endif
  int batch_size = 1;

  // Create layer architecture
#ifdef FIX
  int out_channels = 4;
#else
  int out_channels = 1;
#endif
  int kernel_size = 3;

and compile with -DFIX, then the code runs with no errors on CUDNN v6, CUDA 8.0, Ubuntu 14.04, Titan X (Pascal):

$ g++ -std=c++11 -I../cuda/include -I/usr/local/cuda/include cudnn_conv_int8.cc -o test -L../cuda/lib64 -L/usr/local/cuda/lib64 -lcudnn -lcudart    
$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../cuda/lib64 ./test
Begin forwrad pass
CUDNN failure: CUDNN_STATUS_NOT_SUPPORTED
cudnn_conv_int8.cc:233
Aborting...
$ g++ -DFIX -std=c++11 -I../cuda/include -I/usr/local/cuda/include cudnn_conv_int8.cc -o test -L../cuda/lib64 -L/usr/local/cuda/lib64 -lcudnn -lcudart
$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../cuda/lib64 ./test
Begin forwrad pass
Iteration time: 1.288046 ms
$

Hello,

I just ran your code for both FP32 and INT8.
The forward-pass iteration time I get for INT8 is higher than for FP32. Why is that?

Does the INT8 convolution here use dp4a?

I am using an NVIDIA GTX 1080 Ti, which has INT8 support.

FP32:

Begin forward pass
Iteration time: 0.284869ms

INT8:

Begin forward pass
Iteration time: 1.451339ms

Hi,

I am also noticing this performance difference between INT8 and FP32. I am using GTX 1060.

Did you find out the reason?

Hi, any update on this issue? Has anyone verified that the INT8 convolution is faster than the FP32 version in the latest software?

Or maybe the hardware units for INT8 and FP32 computation are completely different ones, with fewer resources for INT8 and more for FP32. Those float processing units are specially designed for half/float/double data types and can’t handle int8/int32.