How to increase the GPU<->CPU transfer speed of matrices for matrix multiplication (it is the limiting factor)?

I am trying to parallelise a Python code that basically does a matrix multiplication.
I am using PyCUDA for this, but I think my question is not really specific to PyCUDA; it is about the way CUDA works in general.

The computation takes the form C=A*B.

The matrices A and B are initialized in Python and are global inputs of my program (their allocation time is not taken into consideration here). I send them to the GPU and compute the result C=A*B there.

The computation of the matrix product itself is much faster on the GPU than the Python multiplication using numpy.

But the problem is the transfer time of the variables to the GPU. In practice, transferring A and B to the GPU and getting the result C back from the GPU to Python takes longer than the whole Python execution.

In fact the matrix transfers to and from the GPU account for approximately the total run time of my PyCUDA program; the matrix computation itself is negligible.

My question is: how can I improve the speed of this program?
I think it is a very common problem, because I know that PyCUDA is sometimes used to speed up matrix multiplication. But here I see that the transfer times between GPU and CPU memory are the limiting factor, and because of them there is no benefit for me in using CUDA… which I find strange, because I have read that CUDA is often used for such computations.

Here is my code :

#!python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# In this code we try to measure the computation times and the memory transfer rate in bytes per second
import time
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import math
import matplotlib.pyplot as plt
from sys import getsizeof

# -- initialize the device
import pycuda.autoinit

# -----------------------------------------------------
# CUDA parameters
kernel_code_template = """
__global__ void MatProd(float* C, float* A, float* B, int dimAx, int dimBx, int dimCx, int dimCy)
{
  int row = blockDim.y*blockIdx.y+threadIdx.y;
  int col = blockDim.x*blockIdx.x+threadIdx.x;

  double Result = 0;

  if (row<=dimCy-1 && col<=dimCx-1)
  {
    for (int k = 0; k < dimAx; k++)
    {
      Result += A[k + dimAx*row] * B[col + dimBx*k];
    }

    C[col + row*dimCx] = Result;
  }
}
"""

# get the kernel code from the template 
kernel_code=kernel_code_template
# compile the kernel code 
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
MatProd = mod.get_function("MatProd")
warp_size=32 # Warp size on the GPU.
# --------------------------------------------------------------------
# --------------------BEGIN of INITIALISATION-------------------------
# --------------------------------------------------------------------

# We create the python matrices for the computation C=A*B
# This part is supposed as an input, so we don't take in account any computation
# time here.

nb_columnsA=100
nb_linesA=pow(10,6)

nb_columnsB=2
nb_linesB=nb_columnsA

a_cpu=np.random.rand(nb_linesA,nb_columnsA).astype(np.float32)
b_cpu=np.random.rand(nb_linesB,nb_columnsB).astype(np.float32)

# --------------------------------------------------------------------
# --------------------End of INITIALISATION---------------------------
# --------------------------------------------------------------------

# --------------------------------------------------------------------
# --------------------CUDA PART---------------------------------------
# --------------------------------------------------------------------
# We send the data to the GPU
total_CUDA_time_Begin=time.clock()
time_memory_alloc_GPU_Begin=time.clock()
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu=gpuarray.to_gpu(b_cpu)
# We allocate the memory on the GPU for the result C=A*B
c_gpu = gpuarray.empty((nb_linesA, nb_columnsB), np.float32)
time_memory_alloc_GPU_End=time.clock()
# ----------------------------------------------------------
# Starting of the CUDA computation :
# We reserve the number of threads per block on the memory
threadPerBlockx=warp_size 
threadPerBlocky=warp_size

# We reserve a number of block on the memory.
BlockPerGridx = (int) (1 + (nb_columnsB - 1) // threadPerBlockx);
BlockPerGridy = (int) (1 + (nb_linesA - 1) // threadPerBlockx);

time_computation_CUDA_Begin=time.clock()
MatProd(
    # output
    c_gpu, 
    # inputs
    a_gpu, b_gpu,
    np.int32(nb_columnsA),np.int32(nb_columnsB),np.int32(nb_columnsB),np.int32(nb_linesA),
    # launch configuration: a grid of BlockPerGridx x BlockPerGridy blocks, each of threadPerBlockx x threadPerBlocky threads
    block = (threadPerBlockx, threadPerBlocky, 1), grid=(BlockPerGridx,BlockPerGridy)
    )
time_computation_CUDA_End=time.clock()

time_memory_get_result_GPU_Begin=time.clock()
c_gpu_result=c_gpu.get() # We get the result
time_memory_get_result_GPU_End=time.clock()

total_CUDA_time_End=time.clock()
# --------------------------------------------------------------------
# --------------------END OF CUDA PART--------------------------------
# --------------------------------------------------------------------

# --------------------------------------------------------------------
# --------------------PYTHON PART-------------------------------------
# --------------------------------------------------------------------

# We compute in python :
total_Python_time_Begin=time.clock()
c_cpu=np.empty([nb_linesA,nb_columnsB]).astype(np.float32)
time_computation_Python_Begin=time.clock()
c_cpu=np.dot(a_cpu,b_cpu)
time_computation_Python_End=time.clock()
total_Python_time_End=time.clock()

# --------------------------------------------------------------------
# --------------------END OF PYTHON PART------------------------------
# --------------------------------------------------------------------

#------------------------------------------------------------
# We display the execution times :
# Computation times :
time_computation_CUDA=time_computation_CUDA_End-time_computation_CUDA_Begin
time_computation_Python=time_computation_Python_End-time_computation_Python_Begin
print("CUDA pure computation time : ", time_computation_CUDA)
print("Python pure computation time : ", time_computation_Python)
print(" ")
# Memory allocation times :
time_memory_alloc_GPU=time_memory_alloc_GPU_End-time_memory_alloc_GPU_Begin
time_memory_get_result_GPU=time_memory_get_result_GPU_End-time_memory_get_result_GPU_Begin
print("CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU):", time_memory_alloc_GPU)
print("CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) :", time_memory_get_result_GPU)
# Total time (computation + memory allocation)
print(" ")
total_CUDA_time=total_CUDA_time_End-total_CUDA_time_Begin
total_Python_time=total_Python_time_End-total_Python_time_Begin
print("CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) :", total_CUDA_time)
print("Python total time (comput + alloc C) :", total_Python_time)

Here are the outputs :

CUDA pure computation time :  0.0002852345678547863
Python pure computation time :  0.032470123456732836

CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU): 0.17344750617121463
CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) : 0.16262162962812

CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) : 0.3363879506177909
Python total time (comput + alloc C) : 0.04350814814824844

We thus see that the computation is indeed much faster in CUDA than in Python, but because of the memory transfers from CPU to GPU and from GPU to CPU the CUDA code is still slower overall than the Python one.

What's more, the CUDA computation time is negligible compared to these allocation and transfer times.

In this example C=A*B, I worked with an A matrix of 100 columns and 10^6 lines and a B matrix of 100 lines and 2 columns, but I also wrote a code that plots the computation and allocation times as a function of k, where the number of lines of A = 10^k.

This k is the x-axis of those graphs (so the matrix goes from 1 line to 10^6 lines).

The first graph shows the total time.

For the GPU it means: transferring A and B from CPU to GPU, allocating C on the GPU, computing C=A*B, and finally transferring C from GPU to CPU.

For Python it means: allocating C and computing A*B (I remind you that A and B are global inputs and are not taken into consideration when measuring the Python computation time).

The second graph shows the allocation and data transfer time.
For the GPU it means: transferring A and B from CPU to GPU, allocating C on the GPU, and finally transferring C from GPU to CPU.

For Python it means: allocating C.

So again: how can I increase the transfer speed of matrices between Python and CUDA? It seems the parallelised code will never be faster than the non-parallelised one, simply because of those memory transfers. I find this strange because CUDA is often used to speed up such operations.

Remark: I am on an MSI GT72 with a GTX 970M graphics card.

  1. Use CUBLAS for matrix-matrix multiplication.
  2. If the only thing you intend to do on the GPU is a matrix-matrix multiply, then only do the operation on the GPU for matrix sizes that will benefit. For example, multiplying A(4000,4000) by B(4000,4000) should be faster on the GPU, even including the transfer times. A(100000, 100) by B(100,2) is apparently not a big enough problem, or possibly not the right “shapes” to get full benefit from CUBLAS/GPU (a rough way to estimate this is sketched after this list).

Steps 1 and 2 are essentially exactly what the NVBLAS library does for you.
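To make item 2 concrete, here is a rough back-of-the-envelope sketch (my own illustration, not part of the original answer) of the arithmetic intensity mentioned later in this thread: how many floating-point operations each transferred byte buys. A product C=A*B with A of shape (M,K) and B of shape (K,N) costs about 2*M*K*N flops, while moving A, B and C as float32 transfers 4*(M*K + K*N + M*N) bytes:

def flops_per_byte(M, K, N):
    # Rough flops-per-transferred-byte estimate for C = A*B in float32.
    flops = 2.0 * M * K * N                      # multiply-adds in the matrix product
    bytes_moved = 4.0 * (M * K + K * N + M * N)  # A, B and C, 4 bytes per element
    return flops / bytes_moved

print(flops_per_byte(10**6, 100, 2))     # ~1 flop per byte    -> transfer-bound
print(flops_per_byte(4000, 4000, 4000))  # ~667 flops per byte -> compute-bound

The first call uses the shapes from the original question; the second uses the square case suggested above.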

By the way, I don't think your Python timing of the CUDA kernel launch is giving you the number you think it is. CUDA kernel launches are asynchronous (even in PyCUDA), which means control is returned to the CPU thread before the kernel has finished executing. You are actually timing the launch overhead, not the kernel duration.
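For reference, here is a hedged sketch (not part of the original reply) of how the kernel itself could be timed, reusing the kernel and arrays from the code above: either call driver.Context.synchronize() before stopping the host timer, or use CUDA events, which measure elapsed time on the GPU side:

# Hedged sketch: timing the kernel with CUDA events (assumes MatProd, the gpuarrays
# and the launch configuration defined in the code above).
start = driver.Event()
end = driver.Event()

start.record()                      # marker enqueued before the kernel
MatProd(c_gpu, a_gpu, b_gpu,
        np.int32(nb_columnsA), np.int32(nb_columnsB),
        np.int32(nb_columnsB), np.int32(nb_linesA),
        block=(threadPerBlockx, threadPerBlocky, 1),
        grid=(BlockPerGridx, BlockPerGridy))
end.record()                        # marker enqueued after the kernel
end.synchronize()                   # wait until the kernel and the end marker complete

print("kernel time (ms):", start.time_till(end))  # GPU-side elapsed time in milliseconds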

Thank you for your fast answer.

Question 1 :

About the transfer from CPU to GPU, do you agree with me if I say that it doesn't depend at all on the size of the numpy arrays?

I feel like CUDA "flattens" the numpy matrix when it receives it (so the shape doesn't really play a role in this operation)? Indeed, C is read as a 1D array on the CUDA side (C[col + row*dimCx] in my kernel).

You seemed to be talking about the actual computation in CUBLAS when you mentioned the shape problem? I would like to be sure, because it is important in my case. Also, here it is not the computation time that is the problem but really only the CPU <-> GPU transfers, so do you think CUBLAS is really part of the solution to my problem? CUBLAS is only there to compute, right? It doesn't play any role in the transfers?

Question 2 :

If I understand correctly, you think that if I take big enough matrices, the Python computation time will become greater than the CUDA one. In particular because, at some point, the Python computation time will exceed the CUDA allocation and transfer time (which is by far the limiting factor here).

I will try to check this, but 10^6 x 100 matrices are already quite big, so are you sure of it? I can try with bigger matrices, but I don't think I can go much further than 10^6 or 10^7 lines; at some point I will not have enough memory on the GPU (a 10^7 x 100 float32 matrix already takes 10^7 * 100 * 4 bytes = 4 GB)…?

I just saw your second reply:

To be sure I understand you: the PyCUDA lines where I do the computation (or copy matrices to the GPU) are in fact executing the CUDA kernel "in parallel", so the GPU has probably not finished working when my Python code has moved on to the next line.

So you think that the matrix multiplication probably takes longer than the number displayed here? Is this what you meant?

The transfer time should certainly depend on matrix size. Isn’t that what your charts show?

The shape affects the arithmetic intensity of the matrix dot product.

Try the exact square matrix sizes I suggested first: 4000x4000 matrices. After that you can explore other shapes.

Yes

Your Python code appears to be broken.

size_Cx and size_Cy are undefined
(and the indentation is messed up in your posting).

I edited my code; now it works (I had copy-pasted the wrong code).

What I meant about the transfer is: does the CPU->GPU transfer time depend on the shape of the matrices (sorry for the word confusion)? I think it shouldn't, but I would like to check with you.

When I run your code with only the necessary changes to get it to work (covered in my comment above), I get a PyCUDA time of about 0.47 and a Python (numpy) time of about 0.41, on Linux, CUDA 9, GTX 960 (which should be in the same performance ballpark as your GTX 970M), although I'm running on a slow CPU (Pentium Dual-Core).

When I change it to square 1024x1024 arrays for all 3 matrices, and also fix the items I mentioned above, the PyCUDA time is 0.07 whereas the Python time is 0.30, so for this particular square array size, on my setup, PyCUDA is a lot faster than Python (numpy). Here's my test case:

$ cat t1.py
#!python
#!/usr/bin/env python
# In this code we try to measure the computation times and the memory transfer rate in bytes per second
import time
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import math
# import matplotlib.pyplot as plt
from sys import getsizeof

# -- initialize the device
import pycuda.autoinit

# -----------------------------------------------------
# CUDA parameters
kernel_code_template = """
    __global__  void MatProd(float* C, float* A, float* B, int dimAx, int dimBx, int dimCx, int dimCy)
    {
      int row = blockDim.y*blockIdx.y+threadIdx.y;
      int col = blockDim.x*blockIdx.x+threadIdx.x;

        double Result = 0;

        if (row<=dimCy-1 && col<=dimCx-1)
        {
                for (int k = 0; k < dimAx; k++)
                {
                        Result += A[k + dimAx*row] * B[col + dimBx*k];
                }

                C[col + row*dimCx] = Result;
        }
    }
    """

# get the kernel code from the template
kernel_code=kernel_code_template
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
MatProd = mod.get_function("MatProd")
warp_size=32 # Warp size on the GPU.
    # --------------------------------------------------------------------
    # --------------------BEGIN of INITIALISATION-------------------------
    # --------------------------------------------------------------------

    # We create the python matrices for the computation C=A*B
    # This part is supposed as an input, so we don't take in account any computation
    # time here.

nb_columnsA=1024
nb_linesA=1024

nb_columnsB=1024
nb_linesB=nb_columnsA

a_cpu=np.random.rand(nb_linesA,nb_columnsA).astype(np.float32)
b_cpu=np.random.rand(nb_linesB,nb_columnsB).astype(np.float32)

    # --------------------------------------------------------------------
    # --------------------End of INITIALISATION---------------------------
    # --------------------------------------------------------------------

# --------------------------------------------------------------------
    # --------------------CUDA PART---------------------------------------
    # --------------------------------------------------------------------
    # We send the data to the GPU
total_CUDA_time_Begin=time.clock()
time_memory_alloc_GPU_Begin=time.clock()
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu=gpuarray.to_gpu(b_cpu)
    # We allocate the memory on the GPU for the result C=A*B
c_gpu = gpuarray.empty((nb_linesA, nb_columnsB), np.float32)
time_memory_alloc_GPU_End=time.clock()
    # ----------------------------------------------------------
    # Starting of the CUDA computation :
    # We reserve the number of threads per block on the memory
threadPerBlockx=warp_size
threadPerBlocky=warp_size

    # We reserve a number of block on the memory.
size_Cx = nb_columnsB
size_Cy = nb_linesA
BlockPerGridx = (int) (1 + (size_Cx - 1) // threadPerBlockx);
BlockPerGridy = (int) (1 + (size_Cy - 1) // threadPerBlockx);

time_computation_CUDA_Begin=time.clock()
MatProd(
        # output
        c_gpu,
        # inputs
        a_gpu, b_gpu,
        np.int32(nb_columnsA),np.int32(nb_columnsB),np.int32(nb_columnsB),np.int32(nb_linesA),
        # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
        block = (threadPerBlockx, threadPerBlocky, 1), grid=(BlockPerGridx,BlockPerGridy)
        )
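# Block until the asynchronously launched kernel has finished, so the timer below
# measures the actual kernel duration rather than just the launch overhead.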
driver.Context.synchronize()
time_computation_CUDA_End=time.clock()

time_memory_get_result_GPU_Begin=time.clock()
c_gpu_result=c_gpu.get() # We get the result
time_memory_get_result_GPU_End=time.clock()

total_CUDA_time_End=time.clock()
    # --------------------------------------------------------------------
    # --------------------END OF CUDA PART--------------------------------
    # --------------------------------------------------------------------

    # --------------------------------------------------------------------
    # --------------------PYTHON PART-------------------------------------
    # --------------------------------------------------------------------

    # We compute in python :
total_Python_time_Begin=time.clock()
c_cpu=np.empty([nb_linesA,nb_columnsB]).astype(np.float32)
time_computation_Python_Begin=time.clock()
c_cpu=np.dot(a_cpu,b_cpu)
time_computation_Python_End=time.clock()
total_Python_time_End=time.clock()

    # --------------------------------------------------------------------
    # --------------------END OF PYTHON PART------------------------------
    # --------------------------------------------------------------------

    #------------------------------------------------------------
    # We display the execution times :
    # Computation times :
time_computation_CUDA=time_computation_CUDA_End-time_computation_CUDA_Begin
time_computation_Python=time_computation_Python_End-time_computation_Python_Begin
print("CUDA pure computation time : ", time_computation_CUDA)
print("Python pure computation time : ", time_computation_Python)
print(" ")
    # Memory allocation times :
time_memory_alloc_GPU=time_memory_alloc_GPU_End-time_memory_alloc_GPU_Begin
time_memory_get_result_GPU=time_memory_get_result_GPU_End-time_memory_get_result_GPU_Begin
print("CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU):", time_memory_alloc_GPU)
print("CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) :", time_memory_get_result_GPU)
    # Total time (computation + memory allocation)
print(" ")
total_CUDA_time=total_CUDA_time_End-total_CUDA_time_Begin
total_Python_time=total_Python_time_End-total_Python_time_Begin
print("CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) :", total_CUDA_time)
print("Python total time (comput + alloc C) :", total_Python_time)
$ python t1.py
('CUDA pure computation time : ', 0.05827899999999997)
('Python pure computation time : ', 0.2995509999999999)

('CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU):', 0.009533000000000014)
('CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) :', 0.00377100000000008)

('CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) :', 0.07162099999999993)
('Python total time (comput + alloc C) :', 0.3052790000000001)
$

No, it should not depend on the shape, only on the total number of bytes.
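A quick way to check this (a hedged sketch of my own, not from the original reply) is to time gpuarray.to_gpu for arrays with different shapes but the same number of float32 elements, i.e. the same number of bytes:

import time
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray, driver

# Each shape below holds 10**8 float32 values, i.e. about 400 MB.
for shape in [(10**6, 100), (100, 10**6), (10**4, 10**4)]:
    host = np.random.rand(*shape).astype(np.float32)
    t0 = time.time()
    dev = gpuarray.to_gpu(host)      # host-to-device copy
    driver.Context.synchronize()     # make sure the copy has really finished
    print(shape, "->", time.time() - t0, "s")

The measured times should be close to one another, because the same number of bytes crosses the PCIe bus in each case.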

For 4096x4096 arrays on my GTX960, I get 18 seconds for python/numpy and 3.4 seconds for pycuda. Almost all of that 3.4 seconds is kernel computation time (the data transfer time is now “irrelevant”, <10% of the overall time), but if you want to see the kernel computation time drop a lot more, try CUBLAS.

Here’s a shared-memory-optimized version of the pycuda kernel. It gets almost a 10x speedup (over the naive kernel) on the kernel computation time for the pycuda case. This particular version requires square matrices, and it requires the matrix dimension to be whole-number divisible by the tile size (16):

$ cat t2.py
#!python
#!/usr/bin/env python
# In this code we try to measure the computation times and the memory transfer rate in bytes per second
import time
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import math
# import matplotlib.pyplot as plt
from sys import getsizeof

# -- initialize the device
import pycuda.autoinit

# -----------------------------------------------------
# CUDA parameters
kernel_code_template = """
__global__ void MatProd(float* C, float* A, float* B, int width) {
  __shared__ float Ashare[16][16];
  __shared__ float Bshare[16][16];
  int bx = blockIdx.x, by = blockIdx.y;
  int tx = threadIdx.x, ty = threadIdx.y;
  int row = by * 16 + ty;
  int col = bx * 16 + tx;
  float result = 0;
  for (int m = 0; m < width / 16; m++) {
    // collectively load the A and B tiles into shared memory
    Ashare[ty][tx] = A[(row * width) + (m * 16) + tx];
    Bshare[ty][tx] = B[(((m * 16) + ty) * width) + col];
    __syncthreads(); // wait for all the shared memory to be loaded

    for (int k = 0; k < 16; k++) {
      result += Ashare[ty][k] * Bshare[k][tx];
    }
    __syncthreads();
  }
  C[(row * width) + col] = result;
}
"""

# get the kernel code from the template
kernel_code=kernel_code_template
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
MatProd = mod.get_function("MatProd")
warp_size=32 # Warp size on the GPU.
    # --------------------------------------------------------------------
    # --------------------BEGIN of INITIALISATION-------------------------
    # --------------------------------------------------------------------

    # We create the python matrices for the computation C=A*B
    # This part is supposed as an input, so we don't take in account any computation
    # time here.
mat_width = 4096
nb_columnsA=mat_width
nb_linesA=mat_width

nb_columnsB=mat_width
nb_linesB=nb_columnsA

a_cpu=np.random.rand(nb_linesA,nb_columnsA).astype(np.float32)
b_cpu=np.random.rand(nb_linesB,nb_columnsB).astype(np.float32)

    # --------------------------------------------------------------------
    # --------------------End of INITIALISATION---------------------------
    # --------------------------------------------------------------------

# --------------------------------------------------------------------
    # --------------------CUDA PART---------------------------------------
    # --------------------------------------------------------------------
    # We send the data to the GPU
total_CUDA_time_Begin=time.clock()
time_memory_alloc_GPU_Begin=time.clock()
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu=gpuarray.to_gpu(b_cpu)
    # We allocate the memory on the GPU for the result C=A*B
c_gpu = gpuarray.empty((nb_linesA, nb_columnsB), np.float32)
time_memory_alloc_GPU_End=time.clock()
    # ----------------------------------------------------------
    # Starting of the CUDA computation :
    # We reserve the number of threads per block on the memory
threadPerBlockx=16
threadPerBlocky=16

    # We reserve a number of block on the memory.
size_Cx = nb_columnsB
size_Cy = nb_linesA
BlockPerGridx = (int) (size_Cx // threadPerBlockx);
BlockPerGridy = (int) (size_Cy // threadPerBlockx);

time_computation_CUDA_Begin=time.clock()
MatProd(
        # output
        c_gpu,
        # inputs
        a_gpu, b_gpu,
        np.int32(mat_width),
        # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
        block = (threadPerBlockx, threadPerBlocky, 1), grid=(BlockPerGridx,BlockPerGridy)
        )
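# Block until the asynchronously launched kernel has finished, so the timer below
# measures the actual kernel duration rather than just the launch overhead.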
driver.Context.synchronize()
time_computation_CUDA_End=time.clock()

time_memory_get_result_GPU_Begin=time.clock()
c_gpu_result=c_gpu.get() # We get the result
time_memory_get_result_GPU_End=time.clock()

total_CUDA_time_End=time.clock()
    # --------------------------------------------------------------------
    # --------------------END OF CUDA PART--------------------------------
    # --------------------------------------------------------------------

    # --------------------------------------------------------------------
    # --------------------PYTHON PART-------------------------------------
    # --------------------------------------------------------------------

    # We compute in python :
total_Python_time_Begin=time.clock()
c_cpu=np.empty([nb_linesA,nb_columnsB]).astype(np.float32)
time_computation_Python_Begin=time.clock()
c_cpu=np.dot(a_cpu,b_cpu)
time_computation_Python_End=time.clock()
total_Python_time_End=time.clock()

    # --------------------------------------------------------------------
    # --------------------END OF PYTHON PART------------------------------
    # --------------------------------------------------------------------

    #------------------------------------------------------------
    # We display the execution times :
    # Computation times :
time_computation_CUDA=time_computation_CUDA_End-time_computation_CUDA_Begin
time_computation_Python=time_computation_Python_End-time_computation_Python_Begin
print("CUDA pure computation time : ", time_computation_CUDA)
print("Python pure computation time : ", time_computation_Python)
print(" ")
    # Memory allocation times :
time_memory_alloc_GPU=time_memory_alloc_GPU_End-time_memory_alloc_GPU_Begin
time_memory_get_result_GPU=time_memory_get_result_GPU_End-time_memory_get_result_GPU_Begin
print("CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU):", time_memory_alloc_GPU)
print("CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) :", time_memory_get_result_GPU)
    # Total time (computation + memory allocation)
print(" ")
total_CUDA_time=total_CUDA_time_End-total_CUDA_time_Begin
total_Python_time=total_Python_time_End-total_Python_time_Begin
print("CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) :", total_CUDA_time)
print("Python total time (comput + alloc C) :", total_Python_time)
$ python t2.py
('CUDA pure computation time : ', 0.40282600000000013)
('Python pure computation time : ', 17.788588999999998)

('CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU):', 0.102495)
('CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) :', 0.08509900000000004)

('CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) :', 0.5904419999999999)
('Python total time (comput + alloc C) :', 17.886638)
$

Your CPU is faster than mine, but I expect that this particular example would show that pycuda is faster than numpy even on your setup.

The point is not to suggest that this is optimal code, but to show that there is a lot of additional speedup possible over your naive matrix-multiply kernel. If you want more performance, I suggest CUBLAS. CUBLAS won't have the square-matrix or size restrictions of the above code.

Even with well-optimized BLAS libraries, however, matrices with a low aspect ratio (“squarish”) often achieve higher throughput than matrices with high aspect ratios, and matrices whose dimensions are exact multiples of the blocking size (there may be multiple levels of blocking!) usually see noticeably higher performance. Performance can also differ between transpose modes.

This generally applies to all BLAS implementations, whether for CPUs or GPUs, and is simply a consequence of the necessary overhead and memory access patterns. The important take-home message is: use existing vendor-provided performance libraries where available. Rolling your own only makes sense if you are an expert in the field covered by the library, or if you (provably) hit a corner case not handled well by a general-purpose library.

Thank you very much for all your answers.

This is what I would like to do then.

As PyCUDA doesn't expose CUBLAS, I think I will use Cython to be able to call C++ code from Python.
Then I will be able to use this optimised library to do the linear algebra.

Given that my input variables are numpy arrays (I am constrained to this, I can't take them directly in C++), and that I can't use CUBLAS directly in PyCUDA, do you think this would indeed be a good approach to get the fastest code possible?

I am not really a programmer, I code for scientific purposes, so your opinion is important to me (maybe I shouldn't do this for a reason I haven't thought about).

Thanks a lot.

Despite its apparent popularity, I don’t use Python. The limitation (I would say: shortcoming) that PyCUDA doesn’t provide access to CUBLAS is documented: Frequently Asked Questions about PyCUDA - Andreas Klöckner's Former Wiki.

A rudimentary Google search suggests the existence of multiple Python packages that provide bindings for CUBLAS. CUBLAS was designed with C bindings, so anything that can bind to C code should be able to bind to CUBLAS.
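For example, here is a hedged sketch of what this could look like with the third-party scikit-cuda package (an assumption on my part, not a recommendation from the original reply), which provides CUBLAS bindings and works with PyCUDA gpuarrays, so no Cython wrapper would be needed:

# Hedged sketch: CUBLAS matrix multiply via scikit-cuda (a separate package, not part of PyCUDA).
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
import skcuda.linalg as culinalg

culinalg.init()  # set up the CUBLAS handle used by scikit-cuda

a_cpu = np.random.rand(4096, 4096).astype(np.float32)
b_cpu = np.random.rand(4096, 4096).astype(np.float32)

a_gpu = gpuarray.to_gpu(a_cpu)      # CPU -> GPU transfer
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = culinalg.dot(a_gpu, b_gpu)  # calls CUBLAS SGEMM for float32 inputs
c_cpu = c_gpu.get()                 # GPU -> CPU transfer of the result

Whether this wins overall still depends on the transfer-versus-compute balance discussed earlier in the thread.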

I would expect the avid Python users in this forum to be able to point you to a suitable solution.