cublas<S,D,C,Z>GEMM crash on multi GPUs

mh11 · November 26, 2014, 2:24pm

Hello,

I have an application that has worked perfectly using cublas<S,D,C,Z>GEMM with CUDA 4.2 and CUDA 5.0. I recently attempted an upgrade to CUDA 6.0 and CUDA 6.5 and discovered certain size matrices are failing with cublas<S,D,C,Z>GEMM when using multi-GPUs. I have managed to pull out the pieces from my application into a stand-alone application that reproduces the problem and I will post here.

A list of coding platforms in which I have demonstrated failure :

gcc and lpthread for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 690
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 690
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 780
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 4xK10
Intel 14.0.1 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 331.67, NVIDIA 690, 580

The same application works fine on the following environments :

gcc and lpthread for multi-GPU control, Ubuntu 12.04, CUDA 5.0, Driver 340.58, NVIDIA 690
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 5.0, Driver 340.58, NVIDIA 690

Description of Code :

The first failure platform is what I will post since it does not complicate the issue w/ Intel icpc and uses only gcc. The logic is related to roughly a quasi-block LU computational module. The snippets I am posting are only intended to demonstrate the failure in a simple way and do not do anything of value. In fact, the matrices are all zero matrices exercising the CUDA / CUBLAS plumbing. The multi-GPU approach is with a typical pthread context. Essentially, the application attempts to multiply a string of matrices together of size (m-by-k) x (k-by-n). Failure occurs when k << m and k << n with multiple GPUs. The same code runs fine if only one GPU is used, or if k is not so small. I am using pitched memory but I have verified this too is not the issue - and as stated before - this same pitched memory code has worked flawlessly w/ CUDA 4.2 and CUDA 5.0 and has been thoroughly tested on those versions of CUDA.

I control late-binding w/ CUBLAS version in the cudaWrap.cpp file here :

//#define CUBLAS_SHARED_OBJECT "libcublas.so.5.0"
#define CUBLAS_SHARED_OBJECT "libcublas.so.6.0"
//#define CUBLAS_SHARED_OBJECT "libcublas.so.6.5"

I control number of GPUs and m,k,n in main.cpp here :

int nGPUs = 2;
int mRows0 = 2352;
int nCols0 = 2352;
int nBlockSets = 5;
float compression = 0.99;

The settings above will cause failure on platforms described. If I set nPGUs = 1 all works fine. If I set compression < 0.80 all works fine (this causes a much larger k). Here is a dump my application makes when it senses GEMM failure and running under memcheck :

========= CUDA-MEMCHECK
Building L blocks ... 
Building U blocks ... 
Building GPU list ... 
Running ... 
iBlockSet = 0
  _multiplyCompressedBlocks : 0 : 0 : 2352 24 2352 : 2352 24 2352 : 
  _multiplyCompressedBlocks : 1 : 1 : 2352 24 2352 : 2352 24 2352 : 
  _multiplyCompressedBlocks : 2 : 0 : 2352 24 2352 : 2352 24 2352 : 
  _multiplyCompressedBlocks : 3 : 0 : 2352 24 2352 : 2352 24 2352 : 
========= Program hit error 11 on CUDA API call to cudaMemsetAsync 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x2ef773]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x23aa39]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x1186b5]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x119512]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0xf52e0]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 (cublasCgemm_v2 + 0x23d) [0xf576d]
=========     Host Frame:./main [0x278f]
=========     Host Frame:./main [0x3373]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x7e9a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0xf43fd]
=========
========= Program hit error 33 on CUDA API call to cudaEventRecord 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x2ef773]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x23c802]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x2165f]
!! kernel execution error (irs_cublasCgemm) !!=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x1186cb]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x119512]
=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0xf52e0]

=========     Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 (cublasCgemm_v2 + 0x23d) [0xf576d]
errNo = =========     Host Frame:./main [0x278f]
=========     Host Frame:./main [0x3373]
14=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x7e9a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0xf43fd]
=========

transa = 0
transb = 0
pointer A = 0x130f400000
pointer B = 0x1311e80000
pointer C = 0x1314900000
m = 24
n = 24
k = 2352
lda = 2368
ldb = 2368
ldc = 2368
alpha = 1 0
beta = 0 0
 
--- Dynamic Device Properties ---
idDevice = 1
Run time limit on kernels: Yes
--- Dynamic Device Properties ---
 
Error : _gemmDevice : irs_cublasCgemm : propogate ...
pitchA = 18944
pitchB = 18944
pitchC = 18944
========= ERROR SUMMARY: 2 errors

I believe the “Program hit error 11 on CUDA API call to cudaMemsetAsync” is incorrect. There should be no error here. The fact that it runs fine w/ one GPU means at the very minimum CUDA is inconsistent but I believe it is actually a bug in CUDA 6.0 and CUDA 6.5.

Here is the application in various files :

cudaWrap.cpp

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <dlfcn.h>
#include <sys/time.h>
#include "cudaWrap.hpp"


#include <map>
#include <stack>
#include <algorithm>
#include <limits.h>
#include <iomanip>
#include <sys/time.h>
#include <malloc.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <ctime>


using namespace std;

static void _dumpCurrentDeviceProperties()
{
  cerr << " " << endl;
  cerr << "--- Dynamic Device Properties ---" << endl;

  int idDevice = -1;

  if (irs_cudaGetDevice(&idDevice) == CUDA_SUCCESS)
  {
    cerr << "idDevice = " << idDevice << endl;

    cudaDeviceProp deviceProp;

    if (irs_cudaGetDeviceProperties(&deviceProp, idDevice) == CUDA_SUCCESS)
    {
      cerr << "Run time limit on kernels: " << (deviceProp.kernelExecTimeoutEnabled ? "Yes" : "No") << endl;
    }
    else
    {
      cerr << "Error : _dumpCurrentDeviceProperties : irs_cudaGetDeviceProperties" << endl;
    }
  }
  else
  {
    cerr << "Error : _dumpCurrentDeviceProperties : irs_cudaGetDevice" << endl;
  }

  cerr << "--- Dynamic Device Properties ---" << endl;
  cerr << " " << endl;
}

int irs_cudaGetDeviceProperties(struct cudaDeviceProp* prop, int idDevice)
{
  cudaError_t error = cudaGetDeviceProperties(prop, idDevice);

  if (error != cudaSuccess)
  {
    cerr << "!! device access error (irs_cudaGetDeviceProperties) !!" << endl;
    cerr << "idDevice = " << idDevice << endl;
    cerr << "errNo = " << error << endl;

    return CUDA_FAILURE;
  }

  return CUDA_SUCCESS;
}

int irs_cudaGetDevice(int* idDevice)
{
  cudaError_t error = cudaGetDevice(idDevice);

  if (error != cudaSuccess)
  {
    cerr << "!! acctive device access error (irs_cudaGetDevice) !!" << endl;
    cerr << "errNo = " << error << endl;

    return CUDA_FAILURE;
  }

  return CUDA_SUCCESS;
}

int irs_cudaSetDevice(int idDevice)
{
  cudaError_t error = cudaSetDevice(idDevice);

  if (error != cudaSuccess)
  {
    cerr << "!! device access error (irs_cudaSetDevice) !!" << endl;
    cerr << "idDevice = " << idDevice << endl;
    cerr << "errNo = " << error << endl;

    _dumpCurrentDeviceProperties();

    return CUDA_FAILURE;
  }

  return CUDA_SUCCESS;
}

int irs_cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind)
{
  cudaError_t error = cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind);

  if (error != cudaSuccess)
  {
    cerr << "!! device memory copy error (irs_cudaMemcpy2D) !!" << endl;

    cerr << "errNo = " << error << endl;
    cerr << "dpitch = " << dpitch << endl;
    cerr << "spitch = " << spitch << endl;
    cerr << "width = " << width << endl;
    cerr << "height = " << height << endl;
    cerr << "kind = " << kind << endl;

    _dumpCurrentDeviceProperties();

    return CUDA_FAILURE;
  }

  return CUDA_SUCCESS;
}

void irs_cudaMallocPitch(void** devPtr, size_t* pitch, size_t width, size_t height)
{
  cudaError_t sts = cudaMallocPitch(devPtr, pitch, width, height);

  if (sts != cudaSuccess)
  {
    cerr << "!! device memory allocation error (_cudaMallocPitch) !!" << endl;
    cerr << "errNo = " << sts << endl;
    cerr << "pitch = " << *pitch << endl;
    cerr << "width = " << width << endl;
    cerr << "height = " << height << endl;
    exit(1);
  }
}

void irs_cudaFree(cuComplex** d_A)
{
  cudaError_t sts = cudaFree(*d_A);

  if (sts != cudaSuccess)
  {
    cerr << "!! device memory free error (irs_cudaFree) !!" << endl;
    cerr << "errNo = " << sts << endl;
    exit(1);
  }
}

/*%-----------------------------------------------------------------------------
cublasCgemm
 ----------------------------------------------------------------------------%*/

#define CUBLAS_CGEMM "cublasCgemm_v2"

static cublasStatus_t (*_pcublasCgemm)(cublasHandle_t, cublasOperation_t, cublasOperation_t, 
  int, int, int, const cuComplex *, const cuComplex *, int,
    const cuComplex *, int, const cuComplex *, cuComplex *, int) = NULL;

static void _bind_cublasCgemm(void* pCublasSharedObjectHandle)
{
  if (_pcublasCgemm == NULL)
  {
    _pcublasCgemm = 
      (cublasStatus_t (*)(cublasHandle_t, cublasOperation_t, cublasOperation_t,
        int, int, int, const cuComplex *, const cuComplex *, int, const cuComplex *,
          int, const cuComplex *, cuComplex *, int))dlsym(pCublasSharedObjectHandle, CUBLAS_CGEMM);

    char* strError = dlerror();

    if (strError != NULL)
    {
      cerr << "_bind_cublasCgemm : _pcublasCgemm : signature not found : " <<
        strError << " : exiting ..." << endl;

      exit(1);
    }
  }
}

int irs_cublasCgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
  int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda,
    const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
{
  if (_pcublasCgemm != NULL)
  {
    struct timeval start, end;
    gettimeofday(&start, NULL); 

    cublasStatus_t status = _pcublasCgemm(handle, transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

    if (status != CUBLAS_STATUS_SUCCESS)
    {
      cerr << "!! kernel execution error (irs_cublasCgemm) !!" << endl;
      cerr << "errNo = " << status << endl;
      cerr << "transa = " << transa << endl;
      cerr << "transb = " << transb << endl;
      cerr << "pointer A = " << A << endl;
      cerr << "pointer B = " << B << endl;
      cerr << "pointer C = " << C << endl;
      cerr << "m = " << m << endl;
      cerr << "n = " << n << endl;
      cerr << "k = " << k << endl;
      cerr << "lda = " << lda << endl;
      cerr << "ldb = " << ldb << endl;
      cerr << "ldc = " << ldc << endl;
      cerr << "alpha = " << alpha->x << " " << alpha->y << endl;
      cerr << "beta = " << beta->x << " " << beta->y << endl;

      _dumpCurrentDeviceProperties();

      return CUDA_FAILURE;
    }
  }
  else
  {
    cerr << "irs_cublasCgemm : _pcublasCgemm : signature not found : exiting ..." << endl;

    exit(1);
  }

  return CUDA_SUCCESS;
}

/*%-----------------------------------------------------------------------------
cublasCreate
 ----------------------------------------------------------------------------%*/

#define CUBLAS_CREATE "cublasCreate_v2"

static cublasStatus_t (*_pcublasCreate)(cublasHandle_t*) = NULL;

static void _bind_cublasCreate(void* pCublasSharedObjectHandle)
{
  if (_pcublasCreate == NULL)
  {
    _pcublasCreate =
      (cublasStatus_t (*)(cublasHandle_t*))dlsym(pCublasSharedObjectHandle, CUBLAS_CREATE);

    char* strError = dlerror();

    if (strError != NULL)
    {
      cerr << "_bind_cublasCreate : _pcublasCreate : signature not found : " <<
        strError << " : exiting ..." << endl;

      exit(1);
    }
  }
}

int irs_cublasCreate(cublasHandle_t* pHandle)
{
  if (_pcublasCreate != NULL)
  {
    if (_pcublasCreate(pHandle) != CUBLAS_STATUS_SUCCESS)
    {
      cerr << "irs_cublasCreate: _pcublasCreate : Cublas handle not established : exiting ..." << endl;

      exit(1);
    }
  }
  else
  {
    cerr << "irs_cublasCreate : _pcublasCreate : signature not found : exiting ..." << endl;

    exit(1);
  }

  return CUBLAS_STATUS_SUCCESS;
}

/*%-----------------------------------------------------------------------------
cublasDestroy
 ----------------------------------------------------------------------------%*/

#define CUBLAS_DESTROY "cublasDestroy_v2"

static cublasStatus_t (*_pcublasDestroy)(cublasHandle_t) = NULL;

static void _bind_cublasDestroy(void* pCublasSharedObjectHandle)
{
  if (_pcublasDestroy == NULL)
  {
    _pcublasDestroy =
      (cublasStatus_t (*)(cublasHandle_t))dlsym(pCublasSharedObjectHandle, CUBLAS_DESTROY);

    char* strError = dlerror();

    if (strError != NULL)
    {
      cerr << "_bind_cublasDestroy : _pcublasDestroy : signature not found : " <<
        strError << " : exiting ..." << endl;

      exit(1);
    }
  }
}

int irs_cublasDestroy(cublasHandle_t handle)
{
  if (_pcublasDestroy != NULL)
  {
    if (_pcublasDestroy(handle) != CUBLAS_STATUS_SUCCESS)
    {
      cerr << "irs_cublasDestroy: _pcublasDestroy : Cublas handle destroy not completed : exiting ..." << endl;

      exit(1);
    }
  }
  else
  {
    cerr << "irs_cublasDestroy : _pcublasDestroy : signature not found : exiting ..." << endl;

    exit(1);
  }

  return CUBLAS_STATUS_SUCCESS;
}

//#define CUBLAS_SHARED_OBJECT "libcublas.so.5.0"
#define CUBLAS_SHARED_OBJECT "libcublas.so.6.0"
//#define CUBLAS_SHARED_OBJECT "libcublas.so.6.5"

void irs_bindCublas()
{
  void* pCublasSharedObjectHandle = dlopen(CUBLAS_SHARED_OBJECT, RTLD_NOW);

  if (pCublasSharedObjectHandle == NULL)
  {
    char* strError = dlerror();

    cerr << "irs_bindCublas : Cublas " << CUBLAS_SHARED_OBJECT << " not bound : " <<
      strError << " : exiting ..." << endl;

    exit(1);
  }

  _bind_cublasCreate(pCublasSharedObjectHandle);
  _bind_cublasDestroy(pCublasSharedObjectHandle);
  _bind_cublasCgemm(pCublasSharedObjectHandle);
}

cudaWrap.hpp

#pragma once

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <iostream>
#include <pthread.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>

#include <cuda_runtime.h>
#include <cublas.h>
#include <cublas_v2.h>
#include <pthread.h>

#define CUDA_SUCCESS 0
#define CUDA_FAILURE 1

int irs_cudaGetDevice(int* idDevice);

int irs_cudaSetDevice(int idDevice);

int irs_cudaGetDeviceProperties(struct cudaDeviceProp* prop, int device);

int irs_cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind);

int irs_cudaMemset(void* devPtr, int value, size_t count);

int irs_cudaMemset2D(void* dst, size_t dpitch, int value, size_t width, size_t height);

void irs_cudaMallocPitch(void** devPtr, size_t* pitch, size_t width, size_t height);

int irs_cudaMalloc(void** pVoid, size_t size);

void irs_cudaFree(cuComplex** d_A);

int irs_cudaSetVector(int n, int elemSize, const void *x, int incx, void *devicePtr, int incy);

int irs_cudaGetVector(int n, int elemSize, const void *devicePtr, int incx, void *y, int incy);

int irs_cudaSetMatrix(int m, int n, int elemSize, const void *A, int lda, void *B, int ldb);

int irs_cudaGetMatrix(int m, int n, int elemSize, const void *A, int lda, void *B, int ldb);

int irs_cublasCgemv(cublasHandle_t handle, cublasOperation_t mode,
  int m, int n, const cuComplex *alpha, const cuComplex *A, int lda,
    const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy);

int irs_cublasCgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
  int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda,
    const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc);

int irs_cublasCreate(cublasHandle_t* pHandle);

int irs_cublasDestroy(cublasHandle_t handle);

void irs_bindCublas();

main.cpp

#include <stdlib.h>
#include <stdio.h>
#include <vector>
#include <complex>
#include <iostream>
#include <sys/time.h>
#include "cudaWrap.hpp"
#include <pthread.h>

using namespace std;

cublasHandle_t handle;

typedef size_t Index;

template<typename T>
class Matrix
{
  public:
    Matrix() { numRows = 0; numColumns = 0; numElements = 0; leadingDimension = 0; data = NULL; };
    ~Matrix() { free(); };

    void allocate(Index nr, Index nc) { data = new T[(Index)nr * (Index)nc]; numRows = nr; numColumns = nc; 
      numElements = nr*nc; leadingDimension = nr; };
    void free(void) { if(data!=NULL) { delete[] data; data = NULL; } 
      numRows = 0; numColumns = 0; numElements = 0; leadingDimension = 0; };

  public:
    T *data;
    Index numRows, numColumns;
    Index numElements;
    Index leadingDimension;
};

static void _copyHostToDevice(void* dst, size_t dpitch, Matrix<complex<float> >& src)
{
  int sts = irs_cudaMemcpy2D(dst, dpitch, src.data, src.numRows * sizeof(complex<float>),
    src.numRows * sizeof(complex<float>), src.numColumns, cudaMemcpyHostToDevice);

  if (sts != CUDA_SUCCESS)
  {
    cerr << "Error : _copyHostToDevice : irs_cudaMemcpy2D : propogate ..." << endl;
    exit(1);
  }
}

static void _copyDeviceToHost(void* src, size_t spitch, Matrix<complex<float> >& dst)
{
  int sts = irs_cudaMemcpy2D(dst.data, dst.numRows * sizeof(complex<float>), src, spitch,
    dst.numRows * sizeof(complex<float>), dst.numColumns, cudaMemcpyDeviceToHost);

  if (sts != CUDA_SUCCESS)
  {
    cerr << "Error : _copyDeviceToHost : irs_cudaMemcpy2D : propogate ..." << endl;
    exit(1);
  }
}

static void _gemmDevice(cublasHandle_t handle, int m, int n, int k,
  const complex<float> *alpha, const cuComplex *d_A, size_t pitchA,
  const cuComplex *d_B, size_t pitchB, const complex<float> *beta,
  cuComplex *d_C, size_t pitchC)
{
  int sts = irs_cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
    (cuComplex *)alpha, d_A, pitchA / sizeof(complex<float>),
    d_B, pitchB / sizeof(complex<float>), (cuComplex *)beta,
    d_C, pitchC / sizeof(complex<float>));

  if (sts != CUDA_SUCCESS)
  {
    cerr << "Error : _gemmDevice : irs_cublasCgemm : propogate ..." << endl;
    cerr << "pitchA = " << pitchA << endl;
    cerr << "pitchB = " << pitchB << endl;
    cerr << "pitchC = " << pitchC << endl;
    exit(1);
  }
}

typedef struct GPUWorkMatrixPitched
{
  GPUWorkMatrixPitched(int _mRowsL, int _nColsL)
  {
    mRows = _mRowsL;
    nCols = _nColsL;
    pitchWidthDevice = nCols * sizeof(complex<float>);
    pitchHeightDevice = mRows;
    irs_cudaMallocPitch((void**)&d_W, &pitchDevice, mRows * sizeof(complex<float>), nCols);
  }

  cuComplex* d_W;
  int mRows;
  int nCols;
  size_t pitchDevice;
  size_t pitchWidthDevice;
  size_t pitchHeightDevice;
} GPUWorkMatrixPitched;

typedef struct GPUlu
{
  GPUlu(int _mRowsL, int _nColsL, int idDeviceThis)
  {
    idDevice = idDeviceThis;

    if (irs_cudaSetDevice(idDevice) != CUDA_SUCCESS)
    {
      cerr << "Error : GPUlu : irs_cudaSetDevice : exiting ..." << endl;
      exit(1);
    }

    for (int iWork = 0; iWork < 4; iWork++)
    {
      GPUWorkMatrixPitched* pGPUWorkMatrixPitched = new GPUWorkMatrixPitched(_mRowsL, _nColsL);
      workMatrices.push_back(pGPUWorkMatrixPitched);
    }
  }

  ~GPUlu()
  {
    if (irs_cudaSetDevice(idDevice) != CUDA_SUCCESS)
    {
      cerr << "Error : GPUlu : irs_cudaSetDevice : exiting ..." << endl;
      exit(1);
    }

    for (int iWork = 0; iWork < 4; iWork++)
      irs_cudaFree(&workMatrices[iWork]->d_W);
  }

  int idDevice;
  vector<GPUWorkMatrixPitched *> workMatrices;
} GPUlu;

class Block
{
public:
  int mRows;
  int nCols;
  int nRank;
  Matrix<complex<float> > U;
  Matrix<complex<float> > V;
};

static void _addCompressedBlock(Block* pB, float compression, vector<Block*>& Blocks)
{
  int nMax = (pB->mRows > pB->nCols) ? pB->mRows : pB->nCols;
  pB->nRank = nMax - (int)(compression * nMax);
  pB->U.allocate(pB->mRows, pB->nRank);
  pB->V.allocate(pB->nRank, pB->nCols);
  Blocks.push_back(pB);
}

static void _multiplyCompressedBlocks(Block* pBL, Block* pBU, GPUlu* pGPUlu)
{
  // Goal : -pBL->U % ((pBL->V % pBU->U) % pBU->V) -> W3

  complex<float> alpha = complex<float>(1.0f, 0.0f);
  complex<float> beta = complex<float>(0.0f, 0.0f);

  // (pBL->V % pBU->U) -> W2
  _copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBL->V);
  _copyHostToDevice(pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, pBU->U);
  _gemmDevice(handle, pBL->V.numRows, pBU->U.numColumns, pBL->V.numColumns,
    &alpha, pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice,
    pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, &beta,
    pGPUlu->workMatrices[2]->d_W, pGPUlu->workMatrices[2]->pitchDevice);

  // ((pBL->V % pBU->U) % pBU->V) -> W1
  _copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBU->V);
  _gemmDevice(handle, pBL->V.numRows, pBU->V.numColumns, pBU->U.numColumns,
    &alpha, pGPUlu->workMatrices[2]->d_W, pGPUlu->workMatrices[2]->pitchDevice,
    pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, &beta,
    pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice);

  alpha = complex<float>(-1.0f, 0.0f);
  beta = complex<float>(1.0f, 0.0f);

  // -pBL->U % ((pBL->V % pBU->U) % pBU->V) -> W3
  _copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBL->U);
  _gemmDevice(handle, pBL->U.numRows, pBU->V.numColumns, pBL->U.numColumns,
    &alpha, pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice,
    pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, &beta,
    pGPUlu->workMatrices[3]->d_W, pGPUlu->workMatrices[3]->pitchDevice);
}

pthread_mutex_t lock;
pthread_mutex_t print1;

int iBlockCount = -1;
int nBlocks = 4;
vector<Block*> LBlocks;
vector<Block*> UBlocks;

void *pThreadWorker(void* pData)
{
  GPUlu* pGPUlu = (GPUlu *)pData;

  irs_cudaSetDevice(pGPUlu->idDevice);

  while (1)
  {
    int iBlockCountPinned = -1;

    pthread_mutex_lock(&lock);

      iBlockCount++;
      iBlockCountPinned = iBlockCount;

    pthread_mutex_unlock(&lock);

    if (iBlockCountPinned >= nBlocks)
      break;
    
    Block* pBL = LBlocks[iBlockCountPinned];
    Block* pBU = UBlocks[iBlockCountPinned];
    
    pthread_mutex_lock(&print1);

      cout << "  _multiplyCompressedBlocks : " << iBlockCountPinned << " : " << pGPUlu->idDevice << " : " << 
      pBL->mRows << " " << pBL->nRank << " " << pBL->nCols << " : " <<
      pBU->mRows << " " << pBU->nRank << " " << pBU->nCols << " : " << endl;

    pthread_mutex_unlock(&print1);

    _multiplyCompressedBlocks(pBL, pBU, pGPUlu);
  }
}

static void _test()
{
  int nGPUs = 2;
  int mRows0 = 2352;
  int nCols0 = 2352;
  int nBlockSets = 5;
  float compression = 0.99;

  cout << "Building L blocks ... " << endl;

  for (int iBlock = 0; iBlock < nBlocks; iBlock++)
  {
    Block* pB = new Block();
    pB->mRows = mRows0;
    pB->nCols = nCols0;
    _addCompressedBlock(pB, compression, LBlocks);
  }

  cout << "Building U blocks ... " << endl;

  for (int iBlock = 0; iBlock < nBlocks; iBlock++)
  {
    Block* pB = new Block();
    pB->mRows = mRows0;
    pB->nCols = nCols0;
    _addCompressedBlock(pB, compression, UBlocks);
  }

  cout << "Building GPU list ... " << endl;

  vector<GPUlu* > GPUluList;
  for (int iDevice = 0; iDevice < nGPUs; iDevice++)
  {
    GPUlu* pGPUlu = new GPUlu(mRows0, nCols0, iDevice);
    GPUluList.push_back(pGPUlu);
  }

  cout << "Running ... " << endl;

  for (int iBlockSet = 0; iBlockSet < nBlockSets; iBlockSet++)
  {
    cout << "iBlockSet = " << iBlockSet << endl;

    iBlockCount = -1;

    pthread_t threads[nGPUs];

    for (int iDevice = 0; iDevice < nGPUs; iDevice++)
    {
      int rc;
      if ((rc = pthread_create(&threads[iDevice], NULL, &pThreadWorker, (void *)GPUluList[iDevice])))
        printf("Thread creation failed: %d\n", rc);
    }

    for (int iDevice = 0; iDevice < nGPUs; iDevice++)
      pthread_join(threads[iDevice], NULL);
  }
}

int main(int argc, char **argv)
{
  cublasStatus_t status;

  if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)
  {
    fprintf (stderr, "!!!! CUBLAS initialization error\n");
    exit(0);
  }

  irs_bindCublas();

  _test();

  if (cublasDestroy(handle) != CUBLAS_STATUS_SUCCESS)
  {
    fprintf (stderr, "!!!! shutdown error (A)\n");
    exit(0);
  }

  return 0;
}

To run the code I use buildNRun.sh

make clean
make print
make all

#LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib64
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.0/lib64
#LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.5/lib64

nohup ./main > main.out 2> main.err &

which invokes the following makefile :

CC=gcc
OPT=-O3

CUDAROOT=/usr/local/cuda

INC=-I$(CUDAROOT)/include
LIBS=-lpthread -L$(CUDAROOT)/lib64 -lcudart -lcublas
CFLAGS=$(OPT) $(INC)

print:
	@echo $(CUDA)
	@echo $(LIBS)
	@echo $(INC)

all: main

main: main.o cudaWrap.o
	@$(CC) $(CFLAGS) -o main *.o $(LIBS)

main.o: main.cpp 
	@$(CC) $(CFLAGS) -c -o main.o main.cpp

cudaWrap.o: cudaWrap.cpp cudaWrap.hpp
	@$(CC) $(CFLAGS) -c -o cudaWrap.o cudaWrap.cpp

clean:
	rm *.o main

I can zip the files up if that is easier but I do not know the best way to post the application. More than happy to post zip files just tell me where.

This issue is a very serious problem for our application. We cannot migrate to CUDA >= 6.0. It is holding up customer deliveries at this time. Please help me understand what is going on.

njuffa · November 26, 2014, 2:39pm

Based on your description, you would want to file a bug report right away, attaching your self-contained repro code that demonstrates the issue. The bug reporting form is linked from the registered developer website [url]https://developer.nvidia.com/[/url].

In case you are not yet a registered CUDA developer, the registration process is straightforward and turn-around for approval is typically one business day. Please note there is a US holiday (Thanksgiving) on Thursday 11/27, this could delay response.

mh11 · November 26, 2014, 3:45pm

Thank you for your advice. I have registered and currently waiting. In the mean time, does this mean I should not expect any activity of my issue here in the forum?

njuffa · November 26, 2014, 3:57pm

These developer forums serve as a community platform for users talking to users, and are not designed as an official bug reporting channel. NVIDIA employees may peek in occasionally and may or may not take up issues reported here. The one sure way of getting any issue, whether bug report or feature request, tracked by NVIDIA is to submit it via the bug reporting form.

Robert_Crovella · November 26, 2014, 4:04pm

The forums are not intended to be a replacement for the bug submission process. However I will probably look at the issue, and if I can convince myself that to the best of my knowledge it looks like a bug, I would go ahead and file a bug, whether you did or not. The rate of that process will depend on my ability to spend cycles reproducing and reviewing the problem.

So that may take some time.

One question I have is whether you have tested with the production release version of CUDA 6.5 (i.e. not CUDA 6.5 RC). I assume you would have. The reason for my question is that there were some matrix-shape dependent bugs up through CUDA 6.5 RC that got fixed in CUDA 6.5. These bugs manifested when k was small (less than 8) and you were running the code on a cc3.5 GPU (e.g. Tesla K20) (cublas library has architecture-dependent code paths). An example report is here:

[url]c - CUDA program gives cudaErrorIllegalAddress on sm_35 Kepler GPUs, but runs on fine on other GPUs - Stack Overflow

I will also mention that the complexity of your repro case will probably slow down the analysis as well. If you can reduce it (e.g. especially useful if you can eliminate the dependency on pthreads or any CPU threading model) I’m sure it would accelerate analysis and resolution.

mh11 · November 26, 2014, 5:18pm

I assume production release is what I obtained from the Cuda Toolkit download pages. If so, then that is what I have.

Yes, I am aware of similar problems others reported. There does seem to be yet further issues related to small k that have not been fixed.

Unfortunately, this was a very difficult problem to chase and then to extract out a workable stand-alone application. Many days on it thus far. I can do multi-GPU w/ Intel and OpenMP or gcc pthreads. As stated in the log, one GPU is not an issue - I need two controlled by CPU threads to do the trick (which I understand is a rather common technique in mixed CPU / GPU computing.) I am open to suggestion about how to reduce it further but the problems just don’t arise on simple single GPU cases or simple mat-mat calls. It needs the subtleties as far as I can tell.

mh11 · November 26, 2014, 5:23pm

mh1:

I assume production release is what I obtained from the Cuda Toolkit download pages. If so, then that is what I have.

Yes, I am aware of similar problems others reported. There does seem to be yet further issues related to small k that have not been fixed. However, in my case, the issue may be more related to an inner multiply in which k >> m or k >> n. That isn’t clear from what I wrote for m,k,n take on different contexts depending on order of multiplies. For example m = n = 24, k = 2352 so that an attempt is made on (24 x 2352) x (2352 x 24), but it doesn’t fail on first attempt. It takes repeated attempts in a multi-GPU arrangement for failure.

Unfortunately, this was a very difficult problem to chase and then to extract out a workable stand-alone application. Many days on it thus far. I can do multi-GPU w/ Intel and OpenMP or gcc pthreads. As stated in the log, one GPU is not an issue - I need two controlled by CPU threads to do the trick (which I understand is a rather common technique in mixed CPU / GPU computing.) This is also the way in which my real application is structured too so I do want to stay as close in spirit as possible. I am open to suggestions about how to reduce it further but the problems just don’t arise on simple single GPU cases or simple mat-mat calls. It needs the subtleties as far as I can tell.

Robert_Crovella · November 26, 2014, 10:44pm

I spent some time playing with your code and was able to reproduce your observations (apparent failure) on CUDA 6.5. However I don’t think there is an underlying bug (in the CUBLAS library). I acknowledge that your code seems to work as-is on CUDA 5.5, but it’s pretty clear to me that your code is not correct per the current CUBLAS documentation.

You might want to read the various sections of the CUBLAS docs that pertain to multi-threading. In particular here:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-context

“The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestory() calls”

“So it is not recommended that multiple thread share the same CUBLAS handle.”

However this is exactly what your code is doing. You call cublasCreate exactly once, and use a global handle that is passed to every thread’s usage of the cublas API, regardless of the device that thread is targetting. This is explicitly not recommended in the documentation.

When I modify your code (only necessary mods were to main.cpp) to instantiate a separate, unique handle per thread, and ensure that handle is created when that thread’s intended device is selected, then I can no longer reproduce the issue, and the output is identical whether I use CUDA 5.5 or CUDA 6.5

Here’s my modified main.cpp:

#include <stdlib.h>
#include <stdio.h>
#include <vector>
#include <complex>
#include <iostream>
#include <sys/time.h>
#include "cudaWrap.hpp"
#include <pthread.h>

using namespace std;

cublasHandle_t handle;

typedef size_t Index;

template<typename T>
class Matrix
{
  public:
    Matrix() { numRows = 0; numColumns = 0; numElements = 0; leadingDimension = 0; data = NULL; };
    ~Matrix() { free(); };

    void allocate(Index nr, Index nc) { data = new T[(Index)nr * (Index)nc]; numRows = nr; numColumns = nc; 
      numElements = nr*nc; leadingDimension = nr; };
    void free(void) { if(data!=NULL) { delete[] data; data = NULL; } 
      numRows = 0; numColumns = 0; numElements = 0; leadingDimension = 0; };

  public:
    T *data;
    Index numRows, numColumns;
    Index numElements;
    Index leadingDimension;
};

static void _copyHostToDevice(void* dst, size_t dpitch, Matrix<complex<float> >& src)
{
  int sts = irs_cudaMemcpy2D(dst, dpitch, src.data, src.numRows * sizeof(complex<float>),
    src.numRows * sizeof(complex<float>), src.numColumns, cudaMemcpyHostToDevice);

  if (sts != CUDA_SUCCESS)
  {
    cerr << "Error : _copyHostToDevice : irs_cudaMemcpy2D : propogate ..." << endl;
    exit(1);
  }
}

static void _copyDeviceToHost(void* src, size_t spitch, Matrix<complex<float> >& dst)
{
  int sts = irs_cudaMemcpy2D(dst.data, dst.numRows * sizeof(complex<float>), src, spitch,
    dst.numRows * sizeof(complex<float>), dst.numColumns, cudaMemcpyDeviceToHost);

  if (sts != CUDA_SUCCESS)
  {
    cerr << "Error : _copyDeviceToHost : irs_cudaMemcpy2D : propogate ..." << endl;
    exit(1);
  }
}

static void _gemmDevice(cublasHandle_t lhandle, int m, int n, int k,
  const complex<float> *alpha, const cuComplex *d_A, size_t pitchA,
  const cuComplex *d_B, size_t pitchB, const complex<float> *beta,
  cuComplex *d_C, size_t pitchC)
{
  int sts = irs_cublasCgemm(lhandle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
    (cuComplex *)alpha, d_A, pitchA / sizeof(complex<float>),
    d_B, pitchB / sizeof(complex<float>), (cuComplex *)beta,
    d_C, pitchC / sizeof(complex<float>));

  if (sts != CUDA_SUCCESS)
  {
    cerr << "Error : _gemmDevice : irs_cublasCgemm : propogate ..." << endl;
    cerr << "pitchA = " << pitchA << endl;
    cerr << "pitchB = " << pitchB << endl;
    cerr << "pitchC = " << pitchC << endl;
    exit(1);
  }
}

typedef struct GPUWorkMatrixPitched
{
  GPUWorkMatrixPitched(int _mRowsL, int _nColsL)
  {
    mRows = _mRowsL;
    nCols = _nColsL;
    pitchWidthDevice = nCols * sizeof(complex<float>);
    pitchHeightDevice = mRows;
    irs_cudaMallocPitch((void**)&d_W, &pitchDevice, mRows * sizeof(complex<float>), nCols);
  }

  cuComplex* d_W;
  int mRows;
  int nCols;
  size_t pitchDevice;
  size_t pitchWidthDevice;
  size_t pitchHeightDevice;
} GPUWorkMatrixPitched;

typedef struct GPUlu
{
  GPUlu(int _mRowsL, int _nColsL, int idDeviceThis)
  {
    idDevice = idDeviceThis;

    if (irs_cudaSetDevice(idDevice) != CUDA_SUCCESS)
    {
      cerr << "Error : GPUlu : irs_cudaSetDevice : exiting ..." << endl;
      exit(1);
    }

    if (cublasCreate(&my_handle) != CUBLAS_STATUS_SUCCESS)
    {
      fprintf (stderr, "!!!! CUBLAS initialization error\n");
      exit(1);
    }

    for (int iWork = 0; iWork < 4; iWork++)
    {
      GPUWorkMatrixPitched* pGPUWorkMatrixPitched = new GPUWorkMatrixPitched(_mRowsL, _nColsL);
      workMatrices.push_back(pGPUWorkMatrixPitched);
    }
  }

  ~GPUlu()
  {
    if (irs_cudaSetDevice(idDevice) != CUDA_SUCCESS)
    {
      cerr << "Error : GPUlu : irs_cudaSetDevice : exiting ..." << endl;
      exit(1);
    }

    for (int iWork = 0; iWork < 4; iWork++)
      irs_cudaFree(&workMatrices[iWork]->d_W);
  }

  int idDevice;
  cublasHandle_t my_handle;
  vector<GPUWorkMatrixPitched *> workMatrices;
} GPUlu;

class Block
{
public:
  int mRows;
  int nCols;
  int nRank;
  Matrix<complex<float> > U;
  Matrix<complex<float> > V;
};

static void _addCompressedBlock(Block* pB, float compression, vector<Block*>& Blocks)
{
  int nMax = (pB->mRows > pB->nCols) ? pB->mRows : pB->nCols;
  pB->nRank = nMax - (int)(compression * nMax);
  pB->U.allocate(pB->mRows, pB->nRank);
  pB->V.allocate(pB->nRank, pB->nCols);
  Blocks.push_back(pB);
}

static void _multiplyCompressedBlocks(Block* pBL, Block* pBU, GPUlu* pGPUlu)
{
  // Goal : -pBL->U % ((pBL->V % pBU->U) % pBU->V) -> W3

  complex<float> alpha = complex<float>(1.0f, 0.0f);
  complex<float> beta = complex<float>(0.0f, 0.0f);

  // (pBL->V % pBU->U) -> W2
  _copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBL->V);
  _copyHostToDevice(pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, pBU->U);
  _gemmDevice(pGPUlu->my_handle, pBL->V.numRows, pBU->U.numColumns, pBL->V.numColumns,
    &alpha, pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice,
    pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, &beta,
    pGPUlu->workMatrices[2]->d_W, pGPUlu->workMatrices[2]->pitchDevice);

  // ((pBL->V % pBU->U) % pBU->V) -> W1
  _copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBU->V);
  _gemmDevice(pGPUlu->my_handle, pBL->V.numRows, pBU->V.numColumns, pBU->U.numColumns,
    &alpha, pGPUlu->workMatrices[2]->d_W, pGPUlu->workMatrices[2]->pitchDevice,
    pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, &beta,
    pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice);

  alpha = complex<float>(-1.0f, 0.0f);
  beta = complex<float>(1.0f, 0.0f);

  // -pBL->U % ((pBL->V % pBU->U) % pBU->V) -> W3
  _copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBL->U);
  _gemmDevice(pGPUlu->my_handle, pBL->U.numRows, pBU->V.numColumns, pBL->U.numColumns,
    &alpha, pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice,
    pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, &beta,
    pGPUlu->workMatrices[3]->d_W, pGPUlu->workMatrices[3]->pitchDevice);
}

pthread_mutex_t lock;
pthread_mutex_t print1;

int iBlockCount = -1;
int nBlocks = 4;
vector<Block*> LBlocks;
vector<Block*> UBlocks;

void *pThreadWorker(void* pData)
{
  GPUlu* pGPUlu = (GPUlu *)pData;

  irs_cudaSetDevice(pGPUlu->idDevice);

  while (1)
  {
    int iBlockCountPinned = -1;

    pthread_mutex_lock(&lock);

      iBlockCount++;
      iBlockCountPinned = iBlockCount;

    pthread_mutex_unlock(&lock);

    if (iBlockCountPinned >= nBlocks)
      break;
    
    Block* pBL = LBlocks[iBlockCountPinned];
    Block* pBU = UBlocks[iBlockCountPinned];
    
    pthread_mutex_lock(&print1);

      cout << "  _multiplyCompressedBlocks : " << iBlockCountPinned << " : " << pGPUlu->idDevice << " : " << 
      pBL->mRows << " " << pBL->nRank << " " << pBL->nCols << " : " <<
      pBU->mRows << " " << pBU->nRank << " " << pBU->nCols << " : " << endl;

    pthread_mutex_unlock(&print1);

    _multiplyCompressedBlocks(pBL, pBU, pGPUlu);
  }
}

static void _test()
{
  int nGPUs = 2;
  int mRows0 = 2352;
  int nCols0 = 2352;
  int nBlockSets = 5;
  float compression = 0.99;

  cout << "Building L blocks ... " << endl;

  for (int iBlock = 0; iBlock < nBlocks; iBlock++)
  {
    Block* pB = new Block();
    pB->mRows = mRows0;
    pB->nCols = nCols0;
    _addCompressedBlock(pB, compression, LBlocks);
  }

  cout << "Building U blocks ... " << endl;

  for (int iBlock = 0; iBlock < nBlocks; iBlock++)
  {
    Block* pB = new Block();
    pB->mRows = mRows0;
    pB->nCols = nCols0;
    _addCompressedBlock(pB, compression, UBlocks);
  }

  cout << "Building GPU list ... " << endl;

  vector<GPUlu* > GPUluList;
  for (int iDevice = 0; iDevice < nGPUs; iDevice++)
  {
    GPUlu* pGPUlu = new GPUlu(mRows0, nCols0, iDevice);
    GPUluList.push_back(pGPUlu);
  }

  cout << "Running ... " << endl;

  for (int iBlockSet = 0; iBlockSet < nBlockSets; iBlockSet++)
  {
    cout << "iBlockSet = " << iBlockSet << endl;

    iBlockCount = -1;

    pthread_t threads[nGPUs];

    for (int iDevice = 0; iDevice < nGPUs; iDevice++)
    {
      int rc;
      if ((rc = pthread_create(&threads[iDevice], NULL, &pThreadWorker, (void *)GPUluList[iDevice])))
        printf("Thread creation failed: %d\n", rc);
    }

    for (int iDevice = 0; iDevice < nGPUs; iDevice++)
      pthread_join(threads[iDevice], NULL);
  }
}

int main(int argc, char **argv)
{
  cublasStatus_t status;

  if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)
  {
    fprintf (stderr, "!!!! CUBLAS initialization error\n");
    exit(0);
  }

  irs_bindCublas();

  _test();

  if (cublasDestroy(handle) != CUBLAS_STATUS_SUCCESS)
  {
    fprintf (stderr, "!!!! shutdown error (A)\n");
    exit(0);
  }

  return 0;
}

You’re welcome to file a bug if you wish. Like I said, I can’t explain the successful behavior on CUDA 5.5 However I’m pretty sure your application has a defect. Note that the passing scenario with a single device is easily explained in light of the above. There are no issues with handle usage if the device ID never changes, and matches the device ID in effect when the handle was created.

Disclaimer: I make no warranties about the modifications I made to your code. They are for demonstrational purposes, showing that the issue can be fixed in your application. I do not claim that my modifications are defect free or otherwise useful. I made the fewest possible changes, according to my understanding of your code, to achieve the stated demonstration. I did not take care to create production worthy code.

In retrospect, the discussion around small k was completely misguided. Your own initial posting indicates, in fact, that k is not small, k is not much much smaller than m or n, and and in fact k >> m and k >> n, in your failing case.

As a further note, since you mention CUDA 4.2 and CUDA 5.0, I took a look at CUBLAS 4.1 documentation and there is similar wording about the necessary relationship between cuda device and cublas handle:

http://hpc.oit.uci.edu/nvidia-doc/sdk-cuda-doc/CUDALibraries/doc/CUBLAS_Library.pdf

ref. section 2.2

mh11 · November 27, 2014, 3:34am

I do believe you are correct on all accounts. Yes, the m,n,k failure was observed on both k << m k << n and vice-versa, but I agree it is not the real issue here at all. I have made the handle modifications and early tests in my actual application suggest this is a glaring oversight on my part.

I am most grateful you spent the time to examine this posting (and so close to the holidays at that) and make the key observation. At this time I consider the matter closed, no need to file a bug (I consider myself only lucky in 4.2 and 5.0), and I extend my apologies for improper usage of the CUBLAS context.

Kind regards and a big thanks to you.