Hello,
For my bachelor's thesis I want to compare the performance of cuBLAS SGEMM against my own implementation for several input sizes. To do that, I need to test a matrix-vector multiplication (input matrix sizes: A = (M, N), B = (N, 1); in the code below the shared dimension is called K and the vector width is N = 1) computed by the SGEMM routine, NOT by SGEMV.
My code is:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#include "cublas_v2.h"
#define M 4096
#define N 1
#define K 4096
void callCuBLASKernel(const float* A, const float* B, float* C) {
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, M * K * sizeof(float));
    cudaMalloc((void**)&d_B, N * K * sizeof(float));
    cudaMalloc((void**)&d_C, M * N * sizeof(float));
    cudaMemcpy(d_A, A, M * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * K * sizeof(float), cudaMemcpyHostToDevice);
    const float alpha = 1.0f, beta = 0.0f;
    cublasHandle_t handle;
    cublasCreate_v2(&handle);
    // execute cuBLAS
    cublasSgemm_v2(handle, CUBLAS_OP_T, CUBLAS_OP_T, M, N, K, &alpha, d_A, K, d_B, M, &beta, d_C, M); // T for transpose
    cudaDeviceSynchronize();
    cublasDestroy_v2(handle);
    cudaMemcpy(C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
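For checking what ends up in C, a minimal host-side reference can help. The sketch below is plain C++ and makes one assumption the post does not state: that the host array A is stored row-major as M x K. The function name referenceMatVec is hypothetical and not part of cuBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host reference for y = A * x.
// Assumption: A has `rows` x `cols` entries stored row-major, x has `cols` entries.
std::vector<float> referenceMatVec(const std::vector<float>& A,
                                   const std::vector<float>& x,
                                   std::size_t rows, std::size_t cols) {
    assert(A.size() == rows * cols && x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t k = 0; k < cols; ++k) {
            y[i] += A[i * cols + k] * x[k];  // dot product of row i with x
        }
    }
    return y;
}
```

Comparing this result element by element (with a small tolerance for float rounding) against the C copied back from the device would show whether the GPU call wrote anything at all.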
When I test this with N < 32, the SGEMM routine seems to do nothing (with N > 32 it works fine). I ran the binary under nvprof, and the kernel isn't listed.
So my question:
Am I doing something wrong here, or is there a lower limit in the SGEMM routine that makes it impossible to emulate SGEMV with SGEMM?
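One thing that may be worth checking for this question is whether the leading dimensions passed to the call match the allocated buffers. cuBLAS documents column-major storage, and with CUBLAS_OP_T for B the stored matrix has N rows and K columns with leading dimension ldb. The following host-only sketch computes how many floats such a matrix spans; the helper names requiredFloats and fitsInBuffer are hypothetical and not part of cuBLAS.

```cpp
#include <cassert>
#include <cstddef>

// Floats spanned by a column-major matrix with `rows` x `cols` entries
// and leading dimension `ld` (distance between consecutive columns).
std::size_t requiredFloats(std::size_t rows, std::size_t cols, std::size_t ld) {
    assert(ld >= rows && "leading dimension must be at least the row count");
    return cols == 0 ? 0 : (cols - 1) * ld + rows;
}

// True if a buffer of `allocated` floats can hold the stored matrix.
bool fitsInBuffer(std::size_t rows, std::size_t cols, std::size_t ld,
                  std::size_t allocated) {
    return ld >= rows && requiredFloats(rows, cols, ld) <= allocated;
}
```

With the values from the post (M = 4096, N = 1, K = 4096), the stored B is 1 x 4096. Using ldb = M = 4096 it would span 4095 * 4096 + 1 = 16773121 floats, while d_B holds only N * K = 4096; using ldb = N it would span exactly 4096. Whether this is what keeps the kernel from appearing in nvprof is not confirmed here; checking the cublasStatus_t returned by cublasSgemm_v2 would be the direct way to find out.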
Greetings,
Jan