Inverse of a 3x3 matrix

Hi,

I would like to perform two steps:

  1. Find the inverse of a 3x3 matrix, A

  2. Compute “inv(A) * B”, where B is also a 3x3 matrix.

In order to perform the above two steps, what is the best method?

I wrote a kernel of my own for computing the matrix inverse, but it takes 0.5 ms with a kernel configuration of <<<1,441>>>. Is there a better method to compute the above two steps?

Thanks

If you have a single 3x3 matrix to invert, it’s probably not an interesting problem for the GPU.

If you have multiple matrices to do at the same time, you could try the cublas functions getrfBatched and getriBatched. There are many examples demonstrating their use for inversion such as here:

cuda - CUBLAS: Incorrect inversion for matrix with zero pivot - Stack Overflow

and a device-callable version is demonstrated here:

cuda - cublas matrix inversion from device - Stack Overflow

Also, there is some batched solver code available here:

https://developer.nvidia.com/rdp/assets/cuda-batched-solver-tgz

You’ll need to log in as a registered developer to access that link.
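In outline, for the original question (inverse followed by inv(A) * B), the batched calls chain together like this. This is just a rough sketch, not a drop-in implementation: handle creation and error checking are omitted, and the names d_A, d_Ainv, d_B, d_C are placeholders for device buffers of 9 floats each (column-major) that you allocate and fill yourself.

// Invert a single 3x3 matrix d_A (column-major, device memory) into d_Ainv,
// then form inv(A)*B with an ordinary gemm. Sketch only: no error checks.
const int n = 3, batchSize = 1;
float *h_Aptr[] = { d_A };            // host-side arrays of device pointers
float *h_Cptr[] = { d_Ainv };
float **d_Aptr, **d_Cptr;
int *d_pivot, *d_info;
cudaMalloc(&d_Aptr, sizeof(h_Aptr));
cudaMalloc(&d_Cptr, sizeof(h_Cptr));
cudaMalloc(&d_pivot, n * batchSize * sizeof(int));
cudaMalloc(&d_info, batchSize * sizeof(int));
cudaMemcpy(d_Aptr, h_Aptr, sizeof(h_Aptr), cudaMemcpyHostToDevice);
cudaMemcpy(d_Cptr, h_Cptr, sizeof(h_Cptr), cudaMemcpyHostToDevice);

cublasSgetrfBatched(handle, n, d_Aptr, n, d_pivot, d_info, batchSize);  // LU factorization, in place
cublasSgetriBatched(handle, n, (const float **)d_Aptr, n, d_pivot,
                    d_Cptr, n, d_info, batchSize);                      // inverse from the LU factors

// Step 2: inv(A) * B is then just a 3x3 matrix multiply
const float one = 1.0f, zero = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            &one, d_Ainv, n, d_B, n, &zero, d_C, n);

If all you ultimately need is inv(A) * B, note that getrsBatched can solve A*X = B directly from the LU factors, without explicitly forming the inverse.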

Hi txbob,

Thanks for providing the details.

  1. I tried the example given in the link:
    cuda - CUBLAS: Incorrect inversion for matrix with zero pivot - Stack Overflow

I use the Nsight Eclipse IDE to compile with CUDA 7.5 on Ubuntu 14.04 (64-bit OS). When I attempt to compile the code, I get the error below:

make all
Building file: …/matrix_invert_forum.cu
Invoking: NVCC Compiler

…/matrix_invert_forum.cu(55): error: argument of type “const float **” is incompatible with parameter of type “float **”

…/matrix_invert_forum.cu(72): warning: argument of type “float **” is incompatible with parameter of type “const float **”

1 error detected in the compilation of “/tmp/tmpxft_00004a6b_00000000-16_matrix_invert_forum.compute_35.cpp1.ii”.
make: *** [matrix_invert_forum.o] Error 2

Could you suggest the cause for this error?

  2. Next, I tried the device-callable version demonstrated here:

http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device/27132762#27132762

Here I get the error: ptxas fatal : Unresolved extern function ‘cublasCreate_v2’
make: *** [matrix_invert_forum_device.o] Error 255

Upon searching, I could only find options to set in Visual Studio. Is there a way to resolve this error using Nsight?

Thanks.

The first problem is because the CUBLAS API has changed slightly since that code was posted. Just cast the arguments to the necessary type.

The second problem is because you have not set up the VS project to correctly link to the CUBLAS library. Either follow one of the sample codes/projects that use the CUBLAS API, or follow these instructions:

visual studio 2010 - how to link library (e.g. CUBLAS, CUSPARSE) for CUDA on windows - Stack Overflow

That project will also require setup for the CUBLAS device API and RDC code generation. There is a CUBLAS sample project simpleDevLibCublas or something like that which will show all the necessary settings.
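For reference, outside of the IDE the equivalent nvcc command line is roughly this (my sketch, assuming CUDA 7.5, where the device-side library ships as libcublas_device; sm_35 is the minimum architecture for the CUBLAS device API, and the file name is taken from your build log):

nvcc -arch=sm_35 -rdc=true matrix_invert_forum_device.cu -o matrix_invert_forum_device \
     -lcublas_device -lcudadevrt -lcublas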

Hi,

I tried to cast the arguments in the code from cuda - CUBLAS: Incorrect inversion for matrix with zero pivot - Stack Overflow as below:

const float *A[] = { src_d };
//float **A_d;
const float *A_d;
cudacall(cudaMalloc<const float*>(&A_d,sizeof(A)));
cudacall(cudaMemcpy(A_d,A,sizeof(A),cudaMemcpyHostToDevice));

cublascall(cublasSgetrfBatched(handle,n,A_d,lda,P,INFO,batchSize));

But I still get the errors below:

error: no instance of overloaded function “cudaMalloc” matches the argument list
argument types are: (const float **, unsigned long)

error: argument of type “const float *” is incompatible with parameter of type “void *”

error: argument of type “const float *” is incompatible with parameter of type “float **”

error: argument of type “const float *” is incompatible with parameter of type “const float **”

Could you suggest how to fix these errors?

Thanks.

Cast the pointers immediately before using them in the cublas function.

Hi txbob,

I did the casting as below:

float *A[] = { src_d };
float **A_d;
cudacall(cudaMalloc<float*>(&A_d,sizeof(A)));
cudacall(cudaMemcpy(A_d,A,sizeof(A),cudaMemcpyHostToDevice));
cublascall(cublasSgetrfBatched(handle,n,A_d,lda,P,INFO,batchSize));

const float *A_d_new[] = {(const float *) A_d};
cublascall(cublasSgetriBatched(handle,n,A_d_new,lda,P,C_d,lda,INFO,batchSize));
//cublascall(cublasSgetriBatched(handle,n,A_d,lda,P,C_d,lda,INFO,batchSize));

There are no compilation errors. But when I try to run, I get an error as below:
Input:

0.000000 3.000000 4.000000
1.000000 3.000000 10.000000
4.000000 9.000000 16.000000

CUDA Error:
File = …/inversion3x3.cu
Line = 76
Reason = an illegal memory access was encountered

Could you please suggest how to resolve this?

Thanks.

For the code in the answer here:

https://stackoverflow.com/a/23045191/1695960

The only change you have to make to get it to compile (and run correctly) is to change this:

cublascall(cublasSgetriBatched(handle,n,A_d,lda,P,C_d,n,INFO,batchSize));

to this:

cublascall(cublasSgetriBatched(handle,n,(const float **)A_d,lda,P,C_d,n,INFO,batchSize));

Hi txbob,

Thank you. I could compile the code successfully and obtain the inverse of the matrix.

I have one more question. When I tried to find the inverse of the matrix using the device-callable version as in:
http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device/27132762#27132762

Here I get the error: ptxas fatal : Unresolved extern function ‘cublasCreate_v2’

I am working in the NVIDIA Nsight IDE on Ubuntu 14.04, 64-bit OS. In this case, I do not find the options you had mentioned for Visual Studio.

Could you please suggest how to resolve this error in this scenario?

Thanks.

Your project settings are not correct. You need to link against the cublas library and against the device cublas library, and you need to specify relocatable device code generation and linking. Open the simpleDevLibCUBLAS sample project:

CUDA Samples :: CUDA Toolkit Documentation

and study that. If you’re unable to do that, you could just drop the code into that sample project and compile it that way.

Hi txbob,

Thanks for sharing the link. I made the changes in the project settings and it works. I measured the time taken to compute the inverse of a 3x3 matrix using both approaches:

  1. cublas functions getrfBatched and getriBatched (0.027 ms)

  2. device-callable cublas functions (1.149 ms)

I observed that the time taken by the second method (i.e. dynamic parallelism) is more than by the first. Is this due to the overhead involved in launching kernels from the device?

Logically, if the data is already in device memory, then dynamic parallelism should take less time, right?

Could you please clarify why the device-callable function is taking more time?

No, I wouldn’t be able to explain the differences.

None of this is interesting from a performance standpoint. Inversion of a 3x3 matrix is not an appropriately sized problem for a GPU, and the cublas documentation specifically indicates that these functions are not high-performance.

You would use these functions when the data is already on the GPU, as part of a larger GPU application. In such a situation, the performance difference between 0.027 ms and 1.149 ms should be irrelevant.

If you’re interested in high performance large matrix inversion, these are not the methodologies to use, and that is already indicated in the cublas documentation.

Am I allowed to ask why you are performing a 3x3 matrix inversion on the GPU?

As txbob has already stated, there is no reason why you would ever perform a single 3x3 matrix inversion on the GPU, as the data transfer time is exorbitant in this case.

Do you plan to do a larger sized matrix inversion? If so, using this as an example to teach yourself also doesn’t make sense.

Are you trying to learn CUDA? If this is the case, there are plenty of other examples that are much better.

I’ve had use cases where I needed N 3x3 inverses, so that mapped to the GPU. But I couldn’t imagine spreading a single 3x3 inverse across more than one thread.

Hi,

Thanks for your responses. Basically, I need to compute the inverses of ‘N’ 3x3 matrices in parallel. The value of N will range from 2000 to 4000. These 3x3 matrices will be extracted from large images of 588x2048 resolution.

This is the reason why I was interested in measuring the time. Also, this matrix inversion is one step in a sequence of computations in a larger algorithm.

Thanks

A 3x3 inverse should actually be pretty easy to code by hand. You can likely just visit the wiki page and translate the formulas literally into code.

My thoughts exactly. A 3x3 matrix is small enough from a register usage standpoint to have each thread handle the computation of one inverse matrix. This then requires 4000 threads to handle the entire batch, which represents a light load for a GPU.

If you go to the CUDA registered developer website, you can download a package called “batched solver” that provides batched matrix inversion code (BSD license). It’s a bit dated but should be enough to get you started. Look for the functions [s|c|d|z]matinv_batch():

/* smatinv_batch() inverts one or many square, non-singular matrices of single-
   precision elements. Partial pivoting is employed in the inversion process 
   for increased numerical stability.

   A     pointer to an array of the single-precision matrices to be inverted, 
         where each matrix is stored in column-major order
   Ainv  pointer to an array of the single-precision matrices which receive
         the inverses of the corresponding matrices pointed to by A, where 
         each matrix is stored in column-major order
   n     number of rows and columns of the matrices in the arrays pointed to 
         by A and Ainv. n must be greater than, or equal to 2. On sm_13 GPUs,
         n must be less than, or equal to, 62. On sm_2x and sm_3x GPUs, n must
         be less than, or equal to, 109.
   batch the number of matrices to be inverted. It must be greater than zero.

   Returns:

    0    operation completed successfully
   -1    n is out of bounds, batch is out of bounds
   -2    a CUDA error occurred
*/

For your use case, this should drop down to this kernel in inverse.cu:

__global__ void matinv_3x3_matrix_per_thread (const T *A, T *Ainv, int batch)
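Based on the interface documented above, the host-side call for your batch reduces to something like this (a sketch; I am assuming d_A and d_Ainv are device buffers with the matrices stored back to back in column-major order, which as I recall matches how the package's test code drives it):

/* Sketch: invert `batch` 3x3 single-precision matrices stored contiguously
   (column-major) in device memory. Return codes are documented above. */
int rc = smatinv_batch(d_A, d_Ainv, 3, batch);
if (rc != 0) {
    /* -1: n or batch out of bounds, -2: a CUDA error occurred */
}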

You may use Cramer’s rule for the inversion, expand the computation explicitly, and factor out the repeated subexpressions. Storing each matrix as a linear vector of 9 numbers may help as well; a sketch follows below.
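For concreteness, here is a minimal one-thread-per-matrix sketch of that approach (my own illustration, not the batched-solver code): Cramer’s rule with the cofactors written out explicitly, no pivoting or singularity check, and each matrix stored as a 9-element column-major vector.

// Each thread inverts one 3x3 matrix via the adjugate / determinant.
// Layout: column-major, m[3*col + row]; `batch` matrices back to back.
__global__ void inv3x3PerThread(const float *A, float *Ainv, int batch)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= batch) return;
  const float *m = A + 9 * i;
  float *r = Ainv + 9 * i;

  // Cofactors of the first row, reused for both the determinant and the
  // first column of the inverse (the adjugate is the cofactor transpose)
  float c00 = m[4] * m[8] - m[7] * m[5];
  float c01 = m[7] * m[2] - m[1] * m[8];
  float c02 = m[1] * m[5] - m[4] * m[2];
  float det = m[0] * c00 + m[3] * c01 + m[6] * c02;  // expansion along row 0
  float s = 1.0f / det;                              // no singularity check here

  r[0] = c00 * s;
  r[1] = c01 * s;
  r[2] = c02 * s;
  r[3] = (m[6] * m[5] - m[3] * m[8]) * s;
  r[4] = (m[0] * m[8] - m[6] * m[2]) * s;
  r[5] = (m[3] * m[2] - m[0] * m[5]) * s;
  r[6] = (m[3] * m[7] - m[6] * m[4]) * s;
  r[7] = (m[6] * m[1] - m[0] * m[7]) * s;
  r[8] = (m[0] * m[4] - m[3] * m[1]) * s;
}

For N = 4000 this is a single launch like inv3x3PerThread<<<(4000 + 255) / 256, 256>>>(d_A, d_Ainv, 4000); — well under one full wave on any recent GPU.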

Very late to this discussion, but here is something I coded for computing a 3x3 inverse using the adjoint / determinant method, in the context of one warp per problem. The arrays submitted as arguments must be pre-allocated to (at least) nine elements, and the matrix / returned inverse are given in column-major format: element 0 is row 0, column 0; element 1 is row 1, column 0; element 2 is row 2, column 0; element 3 is row 0, column 1; and so on.

/// \brief Calculate the inverse of a rank 3 matrix using the adjoint and determinant.
///
/// \param m    Elements of the original matrix, which will be left unchanged
/// \param cof  Pre-allocated space for elements of the cofactor matrix
/// \param inv  Pre-allocated space to hold the inverse of the matrix
__device__ __forceinline__ void invertRankThreeMatrix(const double* m, double* cof, double* inv) {

  // Lane index within the warp (warp_bits_mask_int is assumed to be the warp
  // size minus one, i.e. 0x1f)
  const int lane_idx = (threadIdx.x & warp_bits_mask_int);
  const int my_col = lane_idx / 3;
  const int my_row = lane_idx - (3 * my_col);
  if (lane_idx < 9) {

    // Each of the first nine lanes computes one cofactor.  The 2x2 minor for
    // element (my_row, my_col) is formed from the other two columns (pcol_a,
    // pcol_b) and the other two rows (prow_a, prow_b).
    int pcol_a, pcol_b, prow_a, prow_b;
    if (my_col == 0) {
      pcol_a = 1;
      pcol_b = 2;
    }
    else if (my_col == 1) {
      pcol_a = 0;
      pcol_b = 2;
    }
    else if (my_col == 2) {
      pcol_a = 0;
      pcol_b = 1;
    }
    if (my_row == 0) {
      prow_a = 1;
      prow_b = 2;
    }
    else if (my_row == 1) {
      prow_a = 0;
      prow_b = 2;
    }
    else if (my_row == 2) {
      prow_a = 0;
      prow_b = 1;
    }

    // Determinant of the 2x2 minor
    const double p_aa = m[(3 * pcol_a) + prow_a];
    const double p_ab = m[(3 * pcol_b) + prow_a];
    const double p_ba = m[(3 * pcol_a) + prow_b];
    const double p_bb = m[(3 * pcol_b) + prow_b];
    cof[lane_idx] = (p_aa * p_bb) - (p_ba * p_ab);

    // Apply the checkerboard sign pattern (-1)^(row + col); in column-major
    // order for a 3x3 matrix this is equivalent to negating odd lane indices
    if (lane_idx & 0x1) {
      cof[lane_idx] = -cof[lane_idx];
    }
  }
  SYNCWARP;
  if (lane_idx < 9) {

    // Expand the determinant along the first column, then form the inverse as
    // the transpose of the cofactor matrix (the adjoint) over the determinant
    const double detr = (cof[0] * m[0]) + (cof[1] * m[1]) + (cof[2] * m[2]);
    inv[(3 * my_row) + my_col] = cof[lane_idx] / detr;
  }
  SYNCWARP;
}

The SYNCWARP function is as follows:

// Note: __syncwarp() only exists on Volta (compute capability 7.0) and later;
// earlier architectures execute each warp in lockstep, so no sync is emitted.
__device__ __forceinline__ void syncWarp() {
#if (__CUDA_ARCH__ >= 700)
  __syncwarp();
#else
  return;
#endif
}

One can imagine a float32_t variant of this, and even a templated form, which I will probably create.

This may also be of interest (and there is a 4x4 version that I did on SO as well). And of course you can do this with cublas, rather than writing your own code.
