cublasSetMatrix() and cublasGetMatrix() for dealing with matrices stored in row major order

Hello:

I must work with matrices stored in row major order and I want to use CUBLAS and CULA (and possibly cuSOLVER). The original row major order matrices are stored in host memory. As CUBLAS et al. work only with the column major order scheme, I need to copy my data to the GPU in column major order. But the function

cublasStatus_t
cublasSetMatrix(int rows, int cols, int elemSize,
                const void *A, int lda, void *B, int ldb)

works only with matrices in column major order.

Is there any way to copy a host matrix stored in row major order to a device matrix in column major order? Probably it could be copied row by row (or column by column) using

cublasStatus_t
cublasSetVector(int n, int elemSize,
                const void *x, int incx, void *y, int incy)

But I’m not so sure about the way to make the copy. Has anyone had the same problem?

My question also concerns recovering the results from device to host, in this case from column major order to row major order.

Maybe the generic CUDA functions cudaMemcpy* could be used, but I’m a bit confused about the parameters. Does anyone have an example?
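For concreteness, here is a sketch of what I believe the cudaMemcpy2D equivalent of cublasSetMatrix() looks like for double data (hypothetical names, no error checking). As far as I can tell it is just a strided 2D copy, so neither call transposes anything, which is exactly my problem:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Copy a rows x cols tile from host matrix A (column major, leading
// dimension lda) to device matrix B (column major, leading dimension ldb).
// The two calls below move exactly the same bytes; in real code you would
// use one or the other.
void copy_tile(const double *A, int lda, double *B, int ldb,
               int rows, int cols)
{
    cublasSetMatrix(rows, cols, sizeof(double), A, lda, B, ldb);

    // cols contiguous runs of rows elements each, with source pitch
    // lda*sizeof(double) and destination pitch ldb*sizeof(double)
    cudaMemcpy2D(B, (size_t)ldb * sizeof(double),
                 A, (size_t)lda * sizeof(double),
                 (size_t)rows * sizeof(double), cols,
                 cudaMemcpyHostToDevice);
}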

Thanks

This is a common issue, and cuBLAS gives you the ability to transpose the ‘working’ matrices in the library calls (you also need to flip (m,n) to (n,m) to match the transpose):

[url]http://docs.nvidia.com/cuda/cublas/#axzz3SbAMvO5B[/url]

Under topic heading 1.1 you can see a paragraph which shows examples.
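For instance, here is a minimal sketch of that trick for a matrix-vector product (hypothetical names, error checking omitted). The row major data is copied to the device unchanged; the transpose is requested in the cuBLAS call itself and the dimensions are flipped:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// y = M*x, where M is an R x C matrix stored in row major order on the host
void rowmajor_dgemv(cublasHandle_t handle, const double *hM,
                    const double *hx, double *hy, int R, int C)
{
    double *dM, *dx, *dy;
    cudaMalloc((void **)&dM, sizeof(double) * R * C);
    cudaMalloc((void **)&dx, sizeof(double) * C);
    cudaMalloc((void **)&dy, sizeof(double) * R);

    // plain contiguous copies: the layout is not changed here
    cudaMemcpy(dM, hM, sizeof(double) * R * C, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, sizeof(double) * C, cudaMemcpyHostToDevice);

    // read as column major, dM is the C x R matrix M^T, so ask for the
    // transpose and flip (R,C) to (C,R); the leading dimension is C
    const double one = 1.0, zero = 0.0;
    cublasDgemv(handle, CUBLAS_OP_T, C, R, &one, dM, C, dx, 1, &zero, dy, 1);

    cudaMemcpy(hy, dy, sizeof(double) * R, cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dx); cudaFree(dy);
}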

Thank you for your answer. Yes, the BLAS case is easy. It’s the same as for the CBLAS interface, which internally uses the generic Fortran BLAS implementation, changing parameters such as Trans → NoTrans, the dimensions, etc. I can use cublasSetMatrix() and cublasGetMatrix() to copy and recover the matrices.

But this approach doesn’t work for LAPACK routines. For example, the DGETRF routine in cuSOLVER always requires column major order, as it doesn’t have a transpose option. Then, we have two options:

  1. Create an auxiliary matrix and copy the original matrix from row major order to column major order. Then call cublasSetMatrix() using the auxiliary one. Call DGETRF, use cublasGetMatrix() to recover the result into the auxiliary object, and finally transform the result from column major order back into the original row major order object (see the sketch after this list). This way we need to allocate more memory than necessary, because we need the auxiliary matrix, but we can use the fast cublasSetMatrix() and cublasGetMatrix().
  2. Copy the matrix into the GPU by columns from the original row major order matrix (and back from GPU to CPU). This can be done as follows (for a FILxCOL matrix with LDA_RMO columns in the CPU storage and exactly FIL*COL elements on the GPU):
    //CPU to GPU
    for(j=0;j<COL;j++)
        cublasSetVector(FIL,sizeof(double),&MCPU[j],LDA_RMO,&MGPU[j*FIL],1);
    //GPU to CPU
    for(j=0;j<COL;j++)
        cublasGetVector(FIL,sizeof(double),&MGPU[j*FIL],1,&MCPU[j],LDA_RMO);
    

    I’ve done some tests and this approach is roughly an order of magnitude slower than cublasSetMatrix()+cublasGetMatrix(). Another question: is my approach correct? Can I compute the initial GPU column address as &MGPU[j*FIL]?
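For reference, this is a minimal sketch of the copy-in direction of option 1 (hypothetical names, error checks omitted): transpose on the host into an auxiliary column major buffer, then use the fast block transfer. MCPU is FILxCOL in row major order with LDA_RMO columns per row; MGPU holds FIL*COL doubles in column major order.

#include <stdlib.h>
#include <cublas_v2.h>

void copy_in_via_aux(const double *MCPU, int LDA_RMO,
                     double *MGPU, int FIL, int COL)
{
    double *AUX = (double *)malloc(sizeof(double) * FIL * COL);
    // host-side transpose: row major -> column major (lda = FIL)
    for (int i = 0; i < FIL; i++)
        for (int j = 0; j < COL; j++)
            AUX[i + j * FIL] = MCPU[i * LDA_RMO + j];
    cublasSetMatrix(FIL, COL, sizeof(double), AUX, FIL, MGPU, FIL);
    free(AUX);
}

The recovery direction is symmetric: cublasGetMatrix() into AUX and then the same double loop with the assignment reversed.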

EDIT:

I’ve also tried to copy the matrix by rows as

//CPU to GPU
for(i=0;i<FIL;i++)
    cublasSetVector(COL,sizeof(double),&MCPU1[i*LDA_RMO],1,&MGPU[i],FIL);
//GPU to CPU
for(i=0;i<FIL;i++)
    cublasGetVector(COL,sizeof(double),&MGPU[i],FIL,&MCPU2[i*LDA_RMO],1);

This code is slightly faster than the column-wise version. On my laptop (Intel Core i7-4800MQ + NVIDIA Quadro K2100M), using matrices of dimensions 10000x10000, the column-wise copy takes 5.7 seconds, the row-wise version 4.75 seconds, and option 1 in the previous list (including the memory allocation and transposing) 3.25 seconds (cublasSetMatrix()+cublasGetMatrix() take 0.5 seconds).

Thanks

If the matrix you want to convert from row major to column major is square, then you can transpose it in-place very quickly without allocating any more global memory. That transpose will result in it being in ‘column major’, again assuming it is square. For a rectangular transpose you will need some temporary allocation space, which might be an issue for large dense matrices.

Search around for that CUDA code, as it exists on this board and works well (txbob directed me to it a while back, and if you cannot find it I can dig it up).
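If you would rather not write the kernel yourself and can spare a temporary device buffer, another option (a minimal sketch, hypothetical names, error checks omitted) is to copy the row major host data to the device unchanged and let cublasDgeam produce the column major copy on the device:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// hA is R x C in row major order on the host; dColMajor must hold R*C
// doubles and ends up as the R x C matrix in column major order (ldc = R).
void rowmajor_to_colmajor_on_device(cublasHandle_t handle, const double *hA,
                                    double *dColMajor, int R, int C)
{
    double *dRowMajor;    // temporary device buffer holding the raw row major data
    cudaMalloc((void **)&dRowMajor, sizeof(double) * R * C);
    cudaMemcpy(dRowMajor, hA, sizeof(double) * R * C, cudaMemcpyHostToDevice);

    // read as column major, dRowMajor is the C x R matrix A^T (lda = C);
    // transposing it with geam gives A in column major order.  beta is zero,
    // so the B operand (aliased to the output here) contributes nothing.
    const double one = 1.0, zero = 0.0;
    cublasDgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, R, C,
                &one, dRowMajor, C,
                &zero, dColMajor, R,
                dColMajor, R);

    cudaFree(dRowMajor);
}

The device-to-host direction works the same way: transpose on the device first, then one contiguous copy back.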

Having worked quite a bit with the CUDA linear algebra subroutine libraries, my personal opinion is that one is best off sticking with either cuBLAS or MAGMA. Both are fast and frequently used by the best.

You will notice a bigger GPU-over-CPU speedup with a desktop system, but for a laptop I recommend the GTX 980M.

Since I believe it always helps to look at working examples, look at the cuBLAS samples in the CUDA SDK or even at some specialized dense-sparse code.

For example, this is some of the code I wrote for the ‘Glass Brain’ project with UCSD/UCSF, which has an example of how to use and mix cuBLAS and cuSPARSE in a MATLAB MEX DLL:

[url]https://github.com/OlegKonings/BCI_EEG_blk_diag_admm_multi_lambda/blob/master/GroupMextest/GroupMextest/GLmex.cpp[/url]

Hello:
Even I am having the same problem, where the original matrix is stored in row major order in host memory, but cuBLAS routines like cublasDger() and cublasDgemv() work only with the column major order scheme. So how do I copy the matrix to the GPU, converting from row major to column major?

Thanks.