Incorrect results when using cublas matrix multiplication

EDIT:
///////////////////////////////
Never mind, I found the type error: it was an inconsistency when transferring from the GPU to the CPU, plus one or two others.
///////////////////////////////

Probably a very simple mistake but I’ve just started with cublas so can’t see it. I’m getting the wrong results when I run this code:

#include <iostream>
#include <cstdlib>
#include "cublas_v2.h"
#include <cuda_runtime.h>

using namespace std;

// Print a matrix stored in column-major order (element (i,j) is at j*rowCount+i)
template <typename T> void printMatrix(int rowCount, int colCount, const T* matrix) {
    for (int i = 0; i < rowCount; i++) {
        for (int j = 0; j < colCount; j++) {
            cout << matrix[j * rowCount + i] << "\t";
        }
        cout << endl;
    }
}

void gpu_blas_mmul(const float *A, const float *B, float *C, const int m, const int k, const int n) {
    int lda = m, ldb = k, ldc = m;
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta = &bet;

    cublasHandle_t handle;
    cublasCreate(&handle);

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

    cublasDestroy(handle);
}

void gpu_blas_mmul2(const double *A, const double *B, double *C, const int m, const int k, const int n) {
    int lda = m, ldb = k, ldc = m;
    const double alf = 1, bet = 0;
    const double *beta = &bet, *alpha = &alf;

    cublasHandle_t handle;
    cublasCreate(&handle);

    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

    cublasDestroy(handle);
}

int main() {
    // Allocate 3 arrays on CPU
    int nr_rows_A, nr_cols_A, nr_rows_B, nr_cols_B, nr_rows_C, nr_cols_C;

    // Using 2x2 square arrays
    nr_rows_A = nr_cols_A = nr_rows_B = nr_cols_B = nr_rows_C = nr_cols_C = 2;
    float h_C[4];

    // Allocate 3 arrays on GPU
    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, nr_rows_A * nr_cols_A * sizeof(float));
    cudaMalloc(&d_B, nr_rows_B * nr_cols_B * sizeof(float));
    cudaMalloc(&d_C, nr_rows_C * nr_cols_C * sizeof(float));

    // Fill host arrays and move to device
    //float h_A[] = {3,1,2,4};
    //float h_B[] = {1,0,0,1};

    double h_A[] = {3,1,2,4};
    double h_B[] = {1,0,0,1};

    cout << "Matrix A:\n";
    printMatrix(2, 2, h_A);
    cout << "Matrix B:\n";
    printMatrix(2, 2, h_B);

    cudaMemcpy(d_A, h_A, 4 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, 4 * sizeof(float), cudaMemcpyHostToDevice);

    // Multiply A and B on GPU
    gpu_blas_mmul2(d_A, d_B, d_C, nr_rows_A, nr_cols_A, nr_cols_B);

    // Copy (and print) the result on host memory
    cudaMemcpy(h_C, d_C, nr_rows_C * nr_cols_C * sizeof(float), cudaMemcpyDeviceToHost);

    cout << "Matrix C:\n";
    printMatrix(nr_rows_C, nr_cols_C, h_C);

    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return 0;
}

I compile it with
nvcc code.cu -lcublas
and it compiles without error. When I run it, it prints out:

CUBLAS initialization success…
Matrix A:
3 2
1 4
Matrix B:
1 0
0 1
Matrix C:
-nan -nan
-nan -nan
Application completed, cleaning up and exiting…

which wasn’t what I was expecting. Can anyone see what is wrong? It should just be a basic matrix multiplication. If I swap the comments so the matrices are floats instead and call gpu_blas_mmul instead of gpu_blas_mmul2, I get the correct output, so it must be something to do with a type error somewhere…

Any help appreciated. Thanks

First of all, check the return value of every CUDA API call and every CUBLAS call.
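A lightweight way to do that is to wrap every call in a checking macro. This is just a sketch (the macro names are my own, not anything from the posted code):

#include <cstdio>
#include <cuda_runtime.h>
#include "cublas_v2.h"

#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
    } \
} while (0)

#define CUBLAS_CHECK(call) do { \
    cublasStatus_t stat_ = (call); \
    if (stat_ != CUBLAS_STATUS_SUCCESS) { \
        fprintf(stderr, "cuBLAS error: status %d at %s:%d\n", (int)stat_, __FILE__, __LINE__); \
    } \
} while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_A, 4 * sizeof(double)));
//   CUBLAS_CHECK(cublasCreate(&handle));

That way a failing cudaMalloc, cudaMemcpy, or gemm call reports itself immediately instead of silently producing garbage.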

It’s also a good idea to run codes with cuda-memcheck.
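For example, assuming nvcc produced a.out:

cuda-memcheck ./a.out

An undersized allocation like the one caused by the typo below will typically show up there as invalid device reads or writes.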

Anyway, you have an insidious typo here:

cudaMalloc((void**)&dev_C,4&sizeof(double));
                           ^

That second ampersand should be an asterisk, like so:

cudaMalloc((void**)&dev_C,4*sizeof(double));

You should also change every instance of sizeof(double) to sizeof(float) since you are using float variables here.
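Whichever precision you settle on, the important thing is that the element type, every sizeof in the cudaMalloc/cudaMemcpy calls, and the gemm variant all agree. Purely as an illustration (my reconstruction, not the poster's final code), the all-double path through main would look something like this:

double h_A[] = {3, 1, 2, 4};
double h_B[] = {1, 0, 0, 1};
double h_C[4];                                    // host result buffer must be double as well

double *d_A, *d_B, *d_C;
cudaMalloc(&d_A, 4 * sizeof(double));             // sizes in bytes of double, not float
cudaMalloc(&d_B, 4 * sizeof(double));
cudaMalloc(&d_C, 4 * sizeof(double));

cudaMemcpy(d_A, h_A, 4 * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, 4 * sizeof(double), cudaMemcpyHostToDevice);

gpu_blas_mmul2(d_A, d_B, d_C, 2, 2, 2);           // the cublasDgemm wrapper

cudaMemcpy(h_C, d_C, 4 * sizeof(double), cudaMemcpyDeviceToHost);

The float path is the mirror image: float arrays, sizeof(float) everywhere, and the cublasSgemm wrapper.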