compute “inv(A) * B”, where B is also a 3x3 matrix.
What is the best method to perform the above two steps?
I wrote my own kernel to compute the matrix inverse, but it takes 0.5 ms with a kernel configuration of <<<1,441>>>. Is there a better method to compute the above two steps?
If you have a single 3x3 matrix to invert, it’s probably not an interesting problem for the GPU.
If you have multiple matrices to do at the same time, you could try the cublas functions getrfBatched and getriBatched. There are many examples demonstrating their use for inversion such as here:
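In outline, a batched inversion with those two calls looks roughly like this (a sketch, not the code from the linked example; error checking and the pointer-array fill are elided):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: invert `batch` 3x3 matrices already resident on the device.
// d_A and d_Ainv each hold batch * 9 floats, column-major per matrix.
void invertBatched3x3(cublasHandle_t handle, float *d_A, float *d_Ainv,
                      int batch)
{
  const int n = 3;

  // Both routines take a device array of device pointers, one per matrix
  float **d_Aptrs, **d_Cptrs;
  int *d_pivot, *d_info;
  cudaMalloc(&d_Aptrs, batch * sizeof(float *));
  cudaMalloc(&d_Cptrs, batch * sizeof(float *));
  cudaMalloc(&d_pivot, batch * n * sizeof(int));
  cudaMalloc(&d_info, batch * sizeof(int));
  // ... fill d_Aptrs[i] = d_A + 9 * i and d_Cptrs[i] = d_Ainv + 9 * i,
  //     e.g. with a tiny kernel or a cudaMemcpy of a host-built array

  // LU-factorize in place, then back-solve to form the inverses
  cublasSgetrfBatched(handle, n, d_Aptrs, n, d_pivot, d_info, batch);
  cublasSgetriBatched(handle, n, (const float **)d_Aptrs, n, d_pivot,
                      d_Cptrs, n, d_info, batch);

  cudaFree(d_Aptrs);
  cudaFree(d_Cptrs);
  cudaFree(d_pivot);
  cudaFree(d_info);
}
```

Note that getrfBatched factorizes the input matrices in place, so keep a copy if you still need the originals.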
I use the Nsight Eclipse IDE to compile with CUDA 7.5 on Ubuntu 14.04 (64-bit OS). When I attempt to compile the code, I get the error below:
make all
Building file: …/matrix_invert_forum.cu
Invoking: NVCC Compiler
…/matrix_invert_forum.cu(55): error: argument of type “const float **” is incompatible with parameter of type “float **”
…/matrix_invert_forum.cu(72): warning: argument of type “float **” is incompatible with parameter of type “const float **”
1 error detected in the compilation of “/tmp/tmpxft_00004a6b_00000000-16_matrix_invert_forum.compute_35.cpp1.ii”.
make: *** [matrix_invert_forum.o] Error 2
Could you suggest the cause of this error?
Next, I tried the device-callable version demonstrated here:
The first problem is because the CUBLAS API has changed slightly since that code was posted. Just cast the arguments to the necessary type.
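For illustration only (the variable names here are hypothetical, not taken from your code): the newer prototypes want float ** for getrfBatched and const float ** for getriBatched, so casts along these lines at the two flagged call sites will satisfy the compiler:

```cuda
// Hypothetical names; cast at the call sites the compiler flagged
cublasSgetrfBatched(handle, n, (float **)d_Aptrs, n, d_pivot, d_info, batch);
cublasSgetriBatched(handle, n, (const float **)d_Aptrs, n, d_pivot,
                    d_Cptrs, n, d_info, batch);
```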
The second problem is because you have not set up the VS project to correctly link to the CUBLAS library. Either follow one of the sample codes/projects that use the CUBLAS API, or follow these instructions:
That project will also require setup for the CUBLAS device API and RDC code generation. There is a CUBLAS sample project simpleDevLibCublas or something like that which will show all the necessary settings.
Your project settings are not correct. You need to link against the cublas library. You also need to link against the device cublas library. You also need to specify relocatable device code generation and linking. Open the simpledevlibcublas sample project:
Thanks for sharing the link. In project settings I made changes and it works. I measured the time taken to compute the inverse of a 3x3 matrix using both the approaches:
cublas functions getrfBatched and getriBatched (0.027 ms)
device callable cublas functions (1.149 ms)
I observed that the second method (i.e., dynamic parallelism) takes more time than the first. Is this due to the overhead involved in launching kernels from the device?
Logically, if the data is already in device memory, shouldn't dynamic parallelism take less time?
Could you please clarify why the device-callable function takes more time?
None of this is interesting from a performance standpoint. Inversion of a 3x3 matrix is not an appropriately sized problem for a GPU, and in particular the cublas documentation specifically indicates that these functions are not high-performance.
You would use these functions when the data is already on the GPU, as part of a larger GPU application. In such a situation, the perf difference between 0.027ms and 1.149ms should be irrelevant.
If you’re interested in high performance large matrix inversion, these are not the methodologies to use, and that is already indicated in the cublas documentation.
Am I allowed to ask why you are performing a 3x3 matrix inversion on the GPU?
As txbob has already stated, there is no reason why you would ever perform a 3x3 matrix inversion on the GPU, as the data transfer time is exorbitant in this case.
Do you plan to do a larger matrix inversion? If so, using this as an example to teach yourself also doesn't make sense.
Are you trying to learn CUDA? If that is the case, there are plenty of other examples that are much better.
I’ve had use cases where I needed N 3x3 inverses, so that mapped to the GPU. But I couldn’t imagine performing a single 3x3 inverse on more than one thread.
Thanks for your responses. Basically, I need to compute the inverses of ‘N’ 3x3 matrices in parallel, where N ranges from 2000 to 4000. These 3x3 matrices are extracted from large images of 588x2048 resolution.
This is why I was interested in measuring the time. This matrix inversion is also one step in a sequence of computations in a larger algorithm.
A 3x3 inverse should actually be pretty easy to code by hand. You can likely just visit the wiki page and translate those formulas literally into code.
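For instance, a literal translation of the cofactor / adjugate formula might look like this (my own quick sketch in plain C, row-major storage assumed, nothing fancier than a zero-determinant check):

```c
#include <assert.h>

/* Invert a 3x3 matrix stored row-major in m[9]; write the result to inv[9].
   Returns 0 on success, -1 if the matrix is singular (zero determinant). */
static int invert3x3(const float m[9], float inv[9])
{
    /* Cofactors of the first row double as terms of the determinant */
    const float c0 = m[4] * m[8] - m[5] * m[7];
    const float c1 = m[5] * m[6] - m[3] * m[8];
    const float c2 = m[3] * m[7] - m[4] * m[6];
    const float det = m[0] * c0 + m[1] * c1 + m[2] * c2;
    if (det == 0.0f) {
        return -1;
    }
    const float r = 1.0f / det;
    /* Inverse = transpose of the cofactor matrix, scaled by 1/det */
    inv[0] = c0 * r;
    inv[1] = (m[2] * m[7] - m[1] * m[8]) * r;
    inv[2] = (m[1] * m[5] - m[2] * m[4]) * r;
    inv[3] = c1 * r;
    inv[4] = (m[0] * m[8] - m[2] * m[6]) * r;
    inv[5] = (m[2] * m[3] - m[0] * m[5]) * r;
    inv[6] = c2 * r;
    inv[7] = (m[1] * m[6] - m[0] * m[7]) * r;
    inv[8] = (m[0] * m[4] - m[1] * m[3]) * r;
    return 0;
}
```

Since it is branch-free apart from the singularity check, the same body drops straight into a __device__ function.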
My thoughts exactly. 3x3 is small enough from a register usage aspect to have each thread handle the computation of one inverse matrix. This then requires 4000 threads to handle the entire batch, which represents a light load for a GPU.
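A sketch of that mapping (my own illustration, not code from any package): one thread pulls its nine elements, forms the cofactors, and writes the adjugate over the determinant. Because inv(A)ᵀ = inv(Aᵀ), the identical code serves row- or column-major storage as long as input and output use the same convention. Singularity checking omitted for brevity.

```cuda
// One thread inverts one 3x3 matrix; A and Ainv hold `batch` matrices
// stored back to back, 9 floats each.
__global__ void inv3x3PerThread(const float *A, float *Ainv, int batch)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= batch) {
    return;
  }
  const float *m = &A[9 * i];
  float *v = &Ainv[9 * i];

  // Cofactors of the first three elements also yield the determinant
  const float c0 = m[4] * m[8] - m[5] * m[7];
  const float c1 = m[5] * m[6] - m[3] * m[8];
  const float c2 = m[3] * m[7] - m[4] * m[6];
  const float r  = 1.0f / (m[0] * c0 + m[1] * c1 + m[2] * c2);

  // Adjugate (transposed cofactor matrix) over the determinant
  v[0] = c0 * r;
  v[1] = (m[2] * m[7] - m[1] * m[8]) * r;
  v[2] = (m[1] * m[5] - m[2] * m[4]) * r;
  v[3] = c1 * r;
  v[4] = (m[0] * m[8] - m[2] * m[6]) * r;
  v[5] = (m[2] * m[3] - m[0] * m[5]) * r;
  v[6] = c2 * r;
  v[7] = (m[1] * m[6] - m[0] * m[7]) * r;
  v[8] = (m[0] * m[4] - m[1] * m[3]) * r;
}

// For a batch of 4000:
//   inv3x3PerThread<<<(4000 + 127) / 128, 128>>>(d_A, d_Ainv, 4000);
```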
If you go to the CUDA registered developer website, you can download a package called “batched solver” that provides batched matrix inversion code (BSD license). It’s a bit dated but should be enough to get you started. Look for the functions [s|c|d|z]matinv_batch():
/* smatinv_batch() inverts one or many square, non-singular matrices of single-
   precision elements. Partial pivoting is employed in the inversion process
   for increased numerical stability.

   A      pointer to an array of the single-precision matrices to be inverted,
          where each matrix is stored in column-major order
   Ainv   pointer to an array of the single-precision matrices which receive
          the inverses of the corresponding matrices pointed to by A, where
          each matrix is stored in column-major order
   n      number of rows and columns of the matrices in the arrays pointed to
          by A and Ainv. n must be greater than, or equal to, 2. On sm_13 GPUs,
          n must be less than, or equal to, 62. On sm_2x and sm_3x GPUs, n must
          be less than, or equal to, 109.
   batch  the number of matrices to be inverted. It must be greater than zero.

   Returns:

    0     operation completed successfully
   -1     n is out of bounds, batch is out of bounds
   -2     a CUDA error occurred
*/
For your use case, this should drop down to this kernel in inverse.cu:
__global__ void matinv_3x3_matrix_per_thread (const T *A, T *Ainv, int batch)
You may use Cramer’s Rule for the inversion, expand the matrix multiplication explicitly, and factor out the repeated computations. Storing each matrix as a linear vector of 9 numbers may help as well.
Very late to this discussion, but here is something I coded for computing a 3x3 inverse using the adjoint / determinant method in the context of one warp per problem. The arrays submitted as arguments must be pre-allocated to (at least) nine elements, and the matrix / returned inverse are given in column-major format. Element 0 is row 0, column 0; element 1 is row 1, column 0, element 2 is row 2, column 0, element 3 is row 0, column 1, …
/// \brief Calculate the inverse of a rank 3 matrix using the adjoint and determinant.
///
/// \param m Elements of the original matrix, which will be left unchanged
/// \param cof Pre-allocated space for elements of the cofactor matrix
/// \param inv Pre-allocated space to hold in the inverse of the matrix
__device__ __forceinline__ void invertRankThreeMatrix(const double* m, double* cof, double* inv) {

  // Lane index within the warp (warp_bits_mask_int is warpSize - 1, i.e. 0x1f)
  const int lane_idx = (threadIdx.x & warp_bits_mask_int);
  const int my_col = lane_idx / 3;
  const int my_row = lane_idx - (3 * my_col);
  if (lane_idx < 9) {

    // Each of the first nine lanes selects the rows and columns of the 2x2
    // minor obtained by deleting its own row and column.
    int pcol_a, pcol_b, prow_a, prow_b;
    if (my_col == 0) {
      pcol_a = 1;
      pcol_b = 2;
    }
    else if (my_col == 1) {
      pcol_a = 0;
      pcol_b = 2;
    }
    else {
      pcol_a = 0;
      pcol_b = 1;
    }
    if (my_row == 0) {
      prow_a = 1;
      prow_b = 2;
    }
    else if (my_row == 1) {
      prow_a = 0;
      prow_b = 2;
    }
    else {
      prow_a = 0;
      prow_b = 1;
    }

    // The signed determinant of the minor is this lane's cofactor
    const double p_aa = m[(3 * pcol_a) + prow_a];
    const double p_ab = m[(3 * pcol_b) + prow_a];
    const double p_ba = m[(3 * pcol_a) + prow_b];
    const double p_bb = m[(3 * pcol_b) + prow_b];
    cof[lane_idx] = (p_aa * p_bb) - (p_ba * p_ab);
    if (lane_idx & 0x1) {
      cof[lane_idx] = -cof[lane_idx];
    }
  }
  SYNCWARP;  // warp-wide synchronization (a macro around __syncwarp())
  if (lane_idx < 9) {

    // Expand along the first column for the determinant, then write the
    // transposed cofactor matrix (the adjoint) over the determinant.
    const double detr = (cof[0] * m[0]) + (cof[1] * m[1]) + (cof[2] * m[2]);
    inv[(3 * my_row) + my_col] = cof[lane_idx] / detr;
  }
  SYNCWARP;
}
This may also be of interest (and there is a 4x4 version that I did on SO as well). And of course you can do this with cublas rather than writing your own code.