cublasGemmEx doesn't work with INT8 utilizing __dp4a instruction on NVIDIA 1080TI

adit_bhrgv · September 13, 2017, 5:05pm

Hi,

As per documentation from this link cuBLAS :: CUDA Toolkit Documentation,
cublasGemmEx() is not working for INT8 matrix multiplications.

It says:
“For CUDA_R_32I computation type the matrix types combinations supported by cublasGemmEx are listed below. This path is only supported with alpha, beta being either 1 or 0; A, B being 32-bit aligned; and lda, ldb being multiples of 4.” and ““the combination of the parameters Atype, Btype and Ctype and the algorithm type, algo is not supported””

I am getting below error:
CUBLAS_STATUS_NOT_SUPPORTED

Attached my code below: don’t know what’s wrong. It’S not working according to documentation.
I checked for all the algos, but it doesnt work.
My cublas compilation is also correct picking cublas from cuda-8.0 :

nvcc -arch=sm_61 -o cublas cublas.cu -L /usr/local/cuda-8.0/lib64/ -lcublas

==========================================================================================================
#undef _GLIBCXX_ATOMIC_BUILTINS
#undef _GLIBCXX_USE_INT128

#include <thrust/device_vector.h>
#include <cublas_v2.h>
#include

// C-style indexing
int ci(int row, int column, int nColumns) {
return row*nColumns+column;
}

int main(void)
{
cudaSetDevice(3);
int rowD = 40 ; // number of rows of D
int colD = 40; // number of columns of D
int rowE = colD; // number of rows of E
int colE = 40; // number of columns of E
int rowF = rowD;
int colF = colE;

// initialize data
thrust::device_vector D(rowD * colD);
thrust::device_vector E(rowE * colE);
thrust::device_vector F(rowF * colF);

for (size_t i = 0; i < rowD; i++){
for (size_t j = 0; j < colD; j++){
D[ci(i,j,colD)]=(i+j) ;
// std::cout << D[ci(i,j,colD)] << " ";
}
//std::cout << “\n”;
}

for (size_t i = 0; i < rowE; i++){
for (size_t j = 0; j < colE; j++){
E[ci(i,j,colE)]=(i+j);
//std::cout << E[ci(i,j,colE)] << " ";
}
//std::cout << “\n”;
}
for (size_t i = 0; i < rowF; i++)
for (size_t j = 0; j < colF; j++)
F[ci(i,j,colF)]=0;

cublasHandle_t handle;

/* Initialize CUBLAS */

cublasStatus_t status = cublasCreate(&handle);
if (status != CUBLAS_STATUS_SUCCESS) {
std::cerr << “!!! CUBLAS initialization error\n”;
}

float alpha = 1.0f;float beta = 0.0f;
#if 0
status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
colE, rowD, colD,
&alpha, thrust::raw_pointer_cast(&E[0]), colE,
thrust::raw_pointer_cast(&D[0]), colD,
&beta, thrust::raw_pointer_cast(&F[0]), colE);// colE x rowD
#endif
status = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
colE, rowD, colD,
&alpha, thrust::raw_pointer_cast(&E[0]), CUDA_R_8I ,colE,
thrust::raw_pointer_cast(&D[0]), CUDA_R_8I ,colD,
&beta, thrust::raw_pointer_cast(&F[0]), CUDA_R_32I ,colE, CUDA_R_32I,CUBLAS_GEMM_ALGO0);// colE x rowD
if (status != CUBLAS_STATUS_SUCCESS) {
std::cerr << “!!! 0 kernel execution error.\n” << status << std::endl;
}

#if 0
for (size_t i = 0; i < rowF; i++){
for (size_t j = 0; j < colF; j++){
std::cout << F[ci(i,j,colF)] << " ";
}
#endif
status = cublasDestroy(handle);
if (status != CUBLAS_STATUS_SUCCESS) {
std::cerr << “!!! shutdown error (A)\n”;
}

return 0;
}

njuffa · September 13, 2017, 5:55pm

Can you clarify whether you have installed CUDA 8.0.61 and have also installed:

Patch 2 (Released Jun 26, 2017)
cuBLAS Patch Update to CUDA 8: Includes performance enhancements and bug-fixes

Robert_Crovella · September 13, 2017, 6:02pm

When I compile and run your code using CUDA 8 on a Pascal Titan X, I get the following output:

$ nvcc -arch=sm_61 -o t414 t414.cu -lcublas
$ ./t414
 ** On entry to GEMM_EX  parameter number 6 had an illegal value
!!!! 0 kernel execution error.
15
$

If you re-read the cublas documentation, you will see that for CUDA_R_32I compute type, the use of:

float alpha = 1.0f;float beta = 0.0f;

is incorrect. When I fix that issue, your code compiles and runs without runtime error for me.

The CUBLAS_STATUS_NOT_SUPPORTED error you are getting is a different error, which suggests to me that you are attempting to run this code on a non cc6.1 device. Perhaps your

cudaSetDevice(3);

call is not selecting the device you think.

adit_bhrgv · September 14, 2017, 7:20am

Thanks
float alpha = 1.0f;float beta = 0.0f; were the culprits.
I have one more question:

cublasGemmEx() with Atype = CUDA_R_8I, Btype = CUDA_R_8I and computetype CUDA_32_8I, only does multiplication of signed 8-bit numbers(-128 to 127) . Does it also support UNSIGNED 8-bit multiplication(0-255), something like CUDA_R_8U for Atype, Btype and CUDA_R_32U for computetype.

How can I do multiplication of unsigned 8-bit multiplications ?

Thanks

adit_bhrgv · September 14, 2017, 8:03am

Does cublasGemmEx() internally work with threads? Because i am able to calculate 10001000 * 10001000 times multplications in ~100us on gpu .

What to do if my lda,ldb are not multiples of 4?padding with zeros?

does it use __dp4a instruction with CUDA_R_32I compute type?

How should i calculate Tflops?

used formula (2*n^3 - n^2)/execution_timeongpu/10^12 = 20 Tflops for above matrix.
I think it’s too much on gpu 1080ti. What i am doing wrong?

Robert_Crovella · September 15, 2017, 2:48pm

Not sure what you mean. CPU threads? I don’t know, maybe, but it’s not essential to the operation. The CUDA GPU driver will spin up CPU threads for various purposes.

GPU threads? Of course it uses GPU threads.

That should work.

Yes, it should, that is the whole point.

There are no Tflops. This is not floating-point arithmetic. You could compute Tops.

20 Tops seems plausible to me. The 1080Ti should be capable of a theoretical peak of over 40 Tops in this mode

https://devblogs.nvidia.com/parallelforall/new-pascal-gpus-accelerate-inference-in-the-data-center/

(1080Ti should be in the same range as P40 for this computation)

adit_bhrgv · September 15, 2017, 3:15pm

Hi txbob,

Thanks for your answers and clarifications very much !!!

Just another follow-up question,

cublasGemmEx() with Atype = CUDA_R_8I, Btype = CUDA_R_8I and computetype CUDA_32_8I, only does multiplication of signed 8-bit numbers(-128 to 127) .

Does it also support UNSIGNED 8-bit multiplication(0-255), something like CUDA_R_8U for Atype, Btype and CUDA_R_32I for computetype. The documentation doesn’t mention anything for this.

How can I do multiplication of unsigned 8-bit multiplications using cublasGemmEx() ?

Thanks!!!

BulatZiganshin · September 15, 2017, 3:33pm

large matrixes can be multiplied in O(n^2.7) time

Robert_Crovella · September 15, 2017, 3:57pm

The datatype is indeed CUDA_R_8U, which is mentioned here:

http://docs.nvidia.com/cuda/cublas/index.html#cuda_datatype_t

However I don’t see any indication that CUDA_R_8U is supported for any computations in CUBLAS. You could file an enhancement request (bug) at developer.nvidia.com

The dp4a instruction itself does appear to know how to handle unsigned integer types, but that is not integrated into CUBLAS AFAICT:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-dp4a

Honestly this doesn’t surprise me, as CUBLAS (roughly speaking) emulates a BLAS library implementation, and nearly every other BLAS-like function that I can find operates on signed types.

adit_bhrgv · September 25, 2017, 6:23pm

Hi txbob,

[HIDDEN API in CUDA8.0, no DOCUMENTATION available]
I found a API in cublas_api.h in CUDA 8.0 which seems to support UINT8 bit multiplications :

cublasUint8gemmBias (cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb, cublasOperation_t transc,
int m, int n, int k,
const unsigned char *A, int A_bias, int lda,
const unsigned char *B, int B_bias, int ldb,
unsigned char *C, int C_bias, int ldc,
int C_mult, int C_shift);

However, I cannot understand it’s arguements. Why is C is a unsigned int * pointer ? If it’S intended to hold the result of A.B matrix multiplication of UINT8 , it sshould be INT32.
What are other parameters : C_mult, C_shift ?

Please give your opinion.
I would like to multiply 2 UINT8 matrices and hope to see the result in C resulting matrix.

Correct me if I am wrong .
Thanks !!!

njuffa · September 25, 2017, 6:28pm

Why? If the inputs are of an unsigned integer type, it seems natural that the output would be of an unsigned integer type as well. Since DP4A accumulates the 16-bit products of 8-bit factors into a 32-bit sum, it seems very much appropriate that the output should be an ‘unsigned int’, which is essentially a uint32_t on all platforms supported by CUDA.

A more interesting question is why the API is “hidden”. Either it was inadvertently left out of the documentation, in which case NVIDIA needs to fix the docs, or it is not fully functional or used for internal purposes only, in which case you should not use it (anything undocumented has the tendency to change or go away without prior notice). You might consider filing a bug against the documentation to find out which of the two cases applies.

adit_bhrgv · September 25, 2017, 6:34pm

Hello,

I meant the resulting matrix C is given in this API as ‘unsigned char *C’ which means it can hold only 8-bit unsigned values and not uint32_t values. This is my concern. How will it accumulate ? It will rather overflow.

thx

njuffa · September 25, 2017, 6:43pm

No idea. Without documentation you would have to reverse engineer it. Maybe there was a potential use case and it was abandoned during development, but someone failed to remove the API prior to shipping. Maybe it is an experimental interface for internal or temporary use. However if it were the latter I would expect an accompanying comment warning off regular users (at least that is what I did when I had to expose a few functions in a CUDA library years ago that were only to be used internally until new infrastructure was put into place).