Hello,
I have an application that has worked perfectly using cublas<S,D,C,Z>GEMM with CUDA 4.2 and CUDA 5.0. I recently attempted an upgrade to CUDA 6.0 and CUDA 6.5 and discovered certain size matrices are failing with cublas<S,D,C,Z>GEMM when using multi-GPUs. I have managed to pull out the pieces from my application into a stand-alone application that reproduces the problem and I will post here.
A list of coding platforms in which I have demonstrated failure :
gcc and lpthread for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 690
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 690
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 780
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 340.58, NVIDIA 4xK10
Intel 14.0.1 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 6.0 / 6.5, Driver 331.67, NVIDIA 690, 580
The same application works fine on the following environments :
gcc and lpthread for multi-GPU control, Ubuntu 12.04, CUDA 5.0, Driver 340.58, NVIDIA 690
Intel 15.0 icpc with OpenMPI for multi-GPU control, Ubuntu 12.04, CUDA 5.0, Driver 340.58, NVIDIA 690
Description of Code :
The first failure platform is what I will post since it does not complicate the issue w/ Intel icpc and uses only gcc. The logic is related to roughly a quasi-block LU computational module. The snippets I am posting are only intended to demonstrate the failure in a simple way and do not do anything of value. In fact, the matrices are all zero matrices exercising the CUDA / CUBLAS plumbing. The multi-GPU approach is with a typical pthread context. Essentially, the application attempts to multiply a string of matrices together of size (m-by-k) x (k-by-n). Failure occurs when k << m and k << n with multiple GPUs. The same code runs fine if only one GPU is used, or if k is not so small. I am using pitched memory but I have verified this too is not the issue - and as stated before - this same pitched memory code has worked flawlessly w/ CUDA 4.2 and CUDA 5.0 and has been thoroughly tested on those versions of CUDA.
I control late-binding w/ CUBLAS version in the cudaWrap.cpp file here :
//#define CUBLAS_SHARED_OBJECT "libcublas.so.5.0"
#define CUBLAS_SHARED_OBJECT "libcublas.so.6.0"
//#define CUBLAS_SHARED_OBJECT "libcublas.so.6.5"
I control number of GPUs and m,k,n in main.cpp here :
int nGPUs = 2;
int mRows0 = 2352;
int nCols0 = 2352;
int nBlockSets = 5;
float compression = 0.99;
The settings above will cause failure on platforms described. If I set nPGUs = 1 all works fine. If I set compression < 0.80 all works fine (this causes a much larger k). Here is a dump my application makes when it senses GEMM failure and running under memcheck :
========= CUDA-MEMCHECK
Building L blocks ...
Building U blocks ...
Building GPU list ...
Running ...
iBlockSet = 0
_multiplyCompressedBlocks : 0 : 0 : 2352 24 2352 : 2352 24 2352 :
_multiplyCompressedBlocks : 1 : 1 : 2352 24 2352 : 2352 24 2352 :
_multiplyCompressedBlocks : 2 : 0 : 2352 24 2352 : 2352 24 2352 :
_multiplyCompressedBlocks : 3 : 0 : 2352 24 2352 : 2352 24 2352 :
========= Program hit error 11 on CUDA API call to cudaMemsetAsync
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x2ef773]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x23aa39]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x1186b5]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x119512]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0xf52e0]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 (cublasCgemm_v2 + 0x23d) [0xf576d]
========= Host Frame:./main [0x278f]
========= Host Frame:./main [0x3373]
========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x7e9a]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0xf43fd]
=========
========= Program hit error 33 on CUDA API call to cudaEventRecord
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x2ef773]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x23c802]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x2165f]
!! kernel execution error (irs_cublasCgemm) !!========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x1186cb]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0x119512]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 [0xf52e0]
========= Host Frame:/usr/local/cuda-6.0/lib64/libcublas.so.6.0 (cublasCgemm_v2 + 0x23d) [0xf576d]
errNo = ========= Host Frame:./main [0x278f]
========= Host Frame:./main [0x3373]
14========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x7e9a]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0xf43fd]
=========
transa = 0
transb = 0
pointer A = 0x130f400000
pointer B = 0x1311e80000
pointer C = 0x1314900000
m = 24
n = 24
k = 2352
lda = 2368
ldb = 2368
ldc = 2368
alpha = 1 0
beta = 0 0
--- Dynamic Device Properties ---
idDevice = 1
Run time limit on kernels: Yes
--- Dynamic Device Properties ---
Error : _gemmDevice : irs_cublasCgemm : propogate ...
pitchA = 18944
pitchB = 18944
pitchC = 18944
========= ERROR SUMMARY: 2 errors
I believe the “Program hit error 11 on CUDA API call to cudaMemsetAsync” is incorrect. There should be no error here. The fact that it runs fine w/ one GPU means at the very minimum CUDA is inconsistent but I believe it is actually a bug in CUDA 6.0 and CUDA 6.5.
Here is the application in various files :
cudaWrap.cpp
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <dlfcn.h>
#include <sys/time.h>
#include "cudaWrap.hpp"
#include <map>
#include <stack>
#include <algorithm>
#include <limits.h>
#include <iomanip>
#include <sys/time.h>
#include <malloc.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <ctime>
using namespace std;
static void _dumpCurrentDeviceProperties()
{
cerr << " " << endl;
cerr << "--- Dynamic Device Properties ---" << endl;
int idDevice = -1;
if (irs_cudaGetDevice(&idDevice) == CUDA_SUCCESS)
{
cerr << "idDevice = " << idDevice << endl;
cudaDeviceProp deviceProp;
if (irs_cudaGetDeviceProperties(&deviceProp, idDevice) == CUDA_SUCCESS)
{
cerr << "Run time limit on kernels: " << (deviceProp.kernelExecTimeoutEnabled ? "Yes" : "No") << endl;
}
else
{
cerr << "Error : _dumpCurrentDeviceProperties : irs_cudaGetDeviceProperties" << endl;
}
}
else
{
cerr << "Error : _dumpCurrentDeviceProperties : irs_cudaGetDevice" << endl;
}
cerr << "--- Dynamic Device Properties ---" << endl;
cerr << " " << endl;
}
int irs_cudaGetDeviceProperties(struct cudaDeviceProp* prop, int idDevice)
{
cudaError_t error = cudaGetDeviceProperties(prop, idDevice);
if (error != cudaSuccess)
{
cerr << "!! device access error (irs_cudaGetDeviceProperties) !!" << endl;
cerr << "idDevice = " << idDevice << endl;
cerr << "errNo = " << error << endl;
return CUDA_FAILURE;
}
return CUDA_SUCCESS;
}
int irs_cudaGetDevice(int* idDevice)
{
cudaError_t error = cudaGetDevice(idDevice);
if (error != cudaSuccess)
{
cerr << "!! acctive device access error (irs_cudaGetDevice) !!" << endl;
cerr << "errNo = " << error << endl;
return CUDA_FAILURE;
}
return CUDA_SUCCESS;
}
int irs_cudaSetDevice(int idDevice)
{
cudaError_t error = cudaSetDevice(idDevice);
if (error != cudaSuccess)
{
cerr << "!! device access error (irs_cudaSetDevice) !!" << endl;
cerr << "idDevice = " << idDevice << endl;
cerr << "errNo = " << error << endl;
_dumpCurrentDeviceProperties();
return CUDA_FAILURE;
}
return CUDA_SUCCESS;
}
int irs_cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind)
{
cudaError_t error = cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind);
if (error != cudaSuccess)
{
cerr << "!! device memory copy error (irs_cudaMemcpy2D) !!" << endl;
cerr << "errNo = " << error << endl;
cerr << "dpitch = " << dpitch << endl;
cerr << "spitch = " << spitch << endl;
cerr << "width = " << width << endl;
cerr << "height = " << height << endl;
cerr << "kind = " << kind << endl;
_dumpCurrentDeviceProperties();
return CUDA_FAILURE;
}
return CUDA_SUCCESS;
}
void irs_cudaMallocPitch(void** devPtr, size_t* pitch, size_t width, size_t height)
{
cudaError_t sts = cudaMallocPitch(devPtr, pitch, width, height);
if (sts != cudaSuccess)
{
cerr << "!! device memory allocation error (_cudaMallocPitch) !!" << endl;
cerr << "errNo = " << sts << endl;
cerr << "pitch = " << *pitch << endl;
cerr << "width = " << width << endl;
cerr << "height = " << height << endl;
exit(1);
}
}
void irs_cudaFree(cuComplex** d_A)
{
cudaError_t sts = cudaFree(*d_A);
if (sts != cudaSuccess)
{
cerr << "!! device memory free error (irs_cudaFree) !!" << endl;
cerr << "errNo = " << sts << endl;
exit(1);
}
}
/*%-----------------------------------------------------------------------------
cublasCgemm
----------------------------------------------------------------------------%*/
#define CUBLAS_CGEMM "cublasCgemm_v2"
static cublasStatus_t (*_pcublasCgemm)(cublasHandle_t, cublasOperation_t, cublasOperation_t,
int, int, int, const cuComplex *, const cuComplex *, int,
const cuComplex *, int, const cuComplex *, cuComplex *, int) = NULL;
static void _bind_cublasCgemm(void* pCublasSharedObjectHandle)
{
if (_pcublasCgemm == NULL)
{
_pcublasCgemm =
(cublasStatus_t (*)(cublasHandle_t, cublasOperation_t, cublasOperation_t,
int, int, int, const cuComplex *, const cuComplex *, int, const cuComplex *,
int, const cuComplex *, cuComplex *, int))dlsym(pCublasSharedObjectHandle, CUBLAS_CGEMM);
char* strError = dlerror();
if (strError != NULL)
{
cerr << "_bind_cublasCgemm : _pcublasCgemm : signature not found : " <<
strError << " : exiting ..." << endl;
exit(1);
}
}
}
int irs_cublasCgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda,
const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
{
if (_pcublasCgemm != NULL)
{
struct timeval start, end;
gettimeofday(&start, NULL);
cublasStatus_t status = _pcublasCgemm(handle, transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
if (status != CUBLAS_STATUS_SUCCESS)
{
cerr << "!! kernel execution error (irs_cublasCgemm) !!" << endl;
cerr << "errNo = " << status << endl;
cerr << "transa = " << transa << endl;
cerr << "transb = " << transb << endl;
cerr << "pointer A = " << A << endl;
cerr << "pointer B = " << B << endl;
cerr << "pointer C = " << C << endl;
cerr << "m = " << m << endl;
cerr << "n = " << n << endl;
cerr << "k = " << k << endl;
cerr << "lda = " << lda << endl;
cerr << "ldb = " << ldb << endl;
cerr << "ldc = " << ldc << endl;
cerr << "alpha = " << alpha->x << " " << alpha->y << endl;
cerr << "beta = " << beta->x << " " << beta->y << endl;
_dumpCurrentDeviceProperties();
return CUDA_FAILURE;
}
}
else
{
cerr << "irs_cublasCgemm : _pcublasCgemm : signature not found : exiting ..." << endl;
exit(1);
}
return CUDA_SUCCESS;
}
/*%-----------------------------------------------------------------------------
cublasCreate
----------------------------------------------------------------------------%*/
#define CUBLAS_CREATE "cublasCreate_v2"
static cublasStatus_t (*_pcublasCreate)(cublasHandle_t*) = NULL;
static void _bind_cublasCreate(void* pCublasSharedObjectHandle)
{
if (_pcublasCreate == NULL)
{
_pcublasCreate =
(cublasStatus_t (*)(cublasHandle_t*))dlsym(pCublasSharedObjectHandle, CUBLAS_CREATE);
char* strError = dlerror();
if (strError != NULL)
{
cerr << "_bind_cublasCreate : _pcublasCreate : signature not found : " <<
strError << " : exiting ..." << endl;
exit(1);
}
}
}
int irs_cublasCreate(cublasHandle_t* pHandle)
{
if (_pcublasCreate != NULL)
{
if (_pcublasCreate(pHandle) != CUBLAS_STATUS_SUCCESS)
{
cerr << "irs_cublasCreate: _pcublasCreate : Cublas handle not established : exiting ..." << endl;
exit(1);
}
}
else
{
cerr << "irs_cublasCreate : _pcublasCreate : signature not found : exiting ..." << endl;
exit(1);
}
return CUBLAS_STATUS_SUCCESS;
}
/*%-----------------------------------------------------------------------------
cublasDestroy
----------------------------------------------------------------------------%*/
#define CUBLAS_DESTROY "cublasDestroy_v2"
static cublasStatus_t (*_pcublasDestroy)(cublasHandle_t) = NULL;
static void _bind_cublasDestroy(void* pCublasSharedObjectHandle)
{
if (_pcublasDestroy == NULL)
{
_pcublasDestroy =
(cublasStatus_t (*)(cublasHandle_t))dlsym(pCublasSharedObjectHandle, CUBLAS_DESTROY);
char* strError = dlerror();
if (strError != NULL)
{
cerr << "_bind_cublasDestroy : _pcublasDestroy : signature not found : " <<
strError << " : exiting ..." << endl;
exit(1);
}
}
}
int irs_cublasDestroy(cublasHandle_t handle)
{
if (_pcublasDestroy != NULL)
{
if (_pcublasDestroy(handle) != CUBLAS_STATUS_SUCCESS)
{
cerr << "irs_cublasDestroy: _pcublasDestroy : Cublas handle destroy not completed : exiting ..." << endl;
exit(1);
}
}
else
{
cerr << "irs_cublasDestroy : _pcublasDestroy : signature not found : exiting ..." << endl;
exit(1);
}
return CUBLAS_STATUS_SUCCESS;
}
//#define CUBLAS_SHARED_OBJECT "libcublas.so.5.0"
#define CUBLAS_SHARED_OBJECT "libcublas.so.6.0"
//#define CUBLAS_SHARED_OBJECT "libcublas.so.6.5"
void irs_bindCublas()
{
void* pCublasSharedObjectHandle = dlopen(CUBLAS_SHARED_OBJECT, RTLD_NOW);
if (pCublasSharedObjectHandle == NULL)
{
char* strError = dlerror();
cerr << "irs_bindCublas : Cublas " << CUBLAS_SHARED_OBJECT << " not bound : " <<
strError << " : exiting ..." << endl;
exit(1);
}
_bind_cublasCreate(pCublasSharedObjectHandle);
_bind_cublasDestroy(pCublasSharedObjectHandle);
_bind_cublasCgemm(pCublasSharedObjectHandle);
}
cudaWrap.hpp
#pragma once
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <iostream>
#include <pthread.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <cuda_runtime.h>
#include <cublas.h>
#include <cublas_v2.h>
#include <pthread.h>
#define CUDA_SUCCESS 0
#define CUDA_FAILURE 1
int irs_cudaGetDevice(int* idDevice);
int irs_cudaSetDevice(int idDevice);
int irs_cudaGetDeviceProperties(struct cudaDeviceProp* prop, int device);
int irs_cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind);
int irs_cudaMemset(void* devPtr, int value, size_t count);
int irs_cudaMemset2D(void* dst, size_t dpitch, int value, size_t width, size_t height);
void irs_cudaMallocPitch(void** devPtr, size_t* pitch, size_t width, size_t height);
int irs_cudaMalloc(void** pVoid, size_t size);
void irs_cudaFree(cuComplex** d_A);
int irs_cudaSetVector(int n, int elemSize, const void *x, int incx, void *devicePtr, int incy);
int irs_cudaGetVector(int n, int elemSize, const void *devicePtr, int incx, void *y, int incy);
int irs_cudaSetMatrix(int m, int n, int elemSize, const void *A, int lda, void *B, int ldb);
int irs_cudaGetMatrix(int m, int n, int elemSize, const void *A, int lda, void *B, int ldb);
int irs_cublasCgemv(cublasHandle_t handle, cublasOperation_t mode,
int m, int n, const cuComplex *alpha, const cuComplex *A, int lda,
const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy);
int irs_cublasCgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda,
const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc);
int irs_cublasCreate(cublasHandle_t* pHandle);
int irs_cublasDestroy(cublasHandle_t handle);
void irs_bindCublas();
main.cpp
#include <stdlib.h>
#include <stdio.h>
#include <vector>
#include <complex>
#include <iostream>
#include <sys/time.h>
#include "cudaWrap.hpp"
#include <pthread.h>
using namespace std;
cublasHandle_t handle;
typedef size_t Index;
template<typename T>
class Matrix
{
public:
Matrix() { numRows = 0; numColumns = 0; numElements = 0; leadingDimension = 0; data = NULL; };
~Matrix() { free(); };
void allocate(Index nr, Index nc) { data = new T[(Index)nr * (Index)nc]; numRows = nr; numColumns = nc;
numElements = nr*nc; leadingDimension = nr; };
void free(void) { if(data!=NULL) { delete[] data; data = NULL; }
numRows = 0; numColumns = 0; numElements = 0; leadingDimension = 0; };
public:
T *data;
Index numRows, numColumns;
Index numElements;
Index leadingDimension;
};
static void _copyHostToDevice(void* dst, size_t dpitch, Matrix<complex<float> >& src)
{
int sts = irs_cudaMemcpy2D(dst, dpitch, src.data, src.numRows * sizeof(complex<float>),
src.numRows * sizeof(complex<float>), src.numColumns, cudaMemcpyHostToDevice);
if (sts != CUDA_SUCCESS)
{
cerr << "Error : _copyHostToDevice : irs_cudaMemcpy2D : propogate ..." << endl;
exit(1);
}
}
static void _copyDeviceToHost(void* src, size_t spitch, Matrix<complex<float> >& dst)
{
int sts = irs_cudaMemcpy2D(dst.data, dst.numRows * sizeof(complex<float>), src, spitch,
dst.numRows * sizeof(complex<float>), dst.numColumns, cudaMemcpyDeviceToHost);
if (sts != CUDA_SUCCESS)
{
cerr << "Error : _copyDeviceToHost : irs_cudaMemcpy2D : propogate ..." << endl;
exit(1);
}
}
static void _gemmDevice(cublasHandle_t handle, int m, int n, int k,
const complex<float> *alpha, const cuComplex *d_A, size_t pitchA,
const cuComplex *d_B, size_t pitchB, const complex<float> *beta,
cuComplex *d_C, size_t pitchC)
{
int sts = irs_cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
(cuComplex *)alpha, d_A, pitchA / sizeof(complex<float>),
d_B, pitchB / sizeof(complex<float>), (cuComplex *)beta,
d_C, pitchC / sizeof(complex<float>));
if (sts != CUDA_SUCCESS)
{
cerr << "Error : _gemmDevice : irs_cublasCgemm : propogate ..." << endl;
cerr << "pitchA = " << pitchA << endl;
cerr << "pitchB = " << pitchB << endl;
cerr << "pitchC = " << pitchC << endl;
exit(1);
}
}
typedef struct GPUWorkMatrixPitched
{
GPUWorkMatrixPitched(int _mRowsL, int _nColsL)
{
mRows = _mRowsL;
nCols = _nColsL;
pitchWidthDevice = nCols * sizeof(complex<float>);
pitchHeightDevice = mRows;
irs_cudaMallocPitch((void**)&d_W, &pitchDevice, mRows * sizeof(complex<float>), nCols);
}
cuComplex* d_W;
int mRows;
int nCols;
size_t pitchDevice;
size_t pitchWidthDevice;
size_t pitchHeightDevice;
} GPUWorkMatrixPitched;
typedef struct GPUlu
{
GPUlu(int _mRowsL, int _nColsL, int idDeviceThis)
{
idDevice = idDeviceThis;
if (irs_cudaSetDevice(idDevice) != CUDA_SUCCESS)
{
cerr << "Error : GPUlu : irs_cudaSetDevice : exiting ..." << endl;
exit(1);
}
for (int iWork = 0; iWork < 4; iWork++)
{
GPUWorkMatrixPitched* pGPUWorkMatrixPitched = new GPUWorkMatrixPitched(_mRowsL, _nColsL);
workMatrices.push_back(pGPUWorkMatrixPitched);
}
}
~GPUlu()
{
if (irs_cudaSetDevice(idDevice) != CUDA_SUCCESS)
{
cerr << "Error : GPUlu : irs_cudaSetDevice : exiting ..." << endl;
exit(1);
}
for (int iWork = 0; iWork < 4; iWork++)
irs_cudaFree(&workMatrices[iWork]->d_W);
}
int idDevice;
vector<GPUWorkMatrixPitched *> workMatrices;
} GPUlu;
class Block
{
public:
int mRows;
int nCols;
int nRank;
Matrix<complex<float> > U;
Matrix<complex<float> > V;
};
static void _addCompressedBlock(Block* pB, float compression, vector<Block*>& Blocks)
{
int nMax = (pB->mRows > pB->nCols) ? pB->mRows : pB->nCols;
pB->nRank = nMax - (int)(compression * nMax);
pB->U.allocate(pB->mRows, pB->nRank);
pB->V.allocate(pB->nRank, pB->nCols);
Blocks.push_back(pB);
}
static void _multiplyCompressedBlocks(Block* pBL, Block* pBU, GPUlu* pGPUlu)
{
// Goal : -pBL->U % ((pBL->V % pBU->U) % pBU->V) -> W3
complex<float> alpha = complex<float>(1.0f, 0.0f);
complex<float> beta = complex<float>(0.0f, 0.0f);
// (pBL->V % pBU->U) -> W2
_copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBL->V);
_copyHostToDevice(pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, pBU->U);
_gemmDevice(handle, pBL->V.numRows, pBU->U.numColumns, pBL->V.numColumns,
&alpha, pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice,
pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, &beta,
pGPUlu->workMatrices[2]->d_W, pGPUlu->workMatrices[2]->pitchDevice);
// ((pBL->V % pBU->U) % pBU->V) -> W1
_copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBU->V);
_gemmDevice(handle, pBL->V.numRows, pBU->V.numColumns, pBU->U.numColumns,
&alpha, pGPUlu->workMatrices[2]->d_W, pGPUlu->workMatrices[2]->pitchDevice,
pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, &beta,
pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice);
alpha = complex<float>(-1.0f, 0.0f);
beta = complex<float>(1.0f, 0.0f);
// -pBL->U % ((pBL->V % pBU->U) % pBU->V) -> W3
_copyHostToDevice(pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice, pBL->U);
_gemmDevice(handle, pBL->U.numRows, pBU->V.numColumns, pBL->U.numColumns,
&alpha, pGPUlu->workMatrices[0]->d_W, pGPUlu->workMatrices[0]->pitchDevice,
pGPUlu->workMatrices[1]->d_W, pGPUlu->workMatrices[1]->pitchDevice, &beta,
pGPUlu->workMatrices[3]->d_W, pGPUlu->workMatrices[3]->pitchDevice);
}
pthread_mutex_t lock;
pthread_mutex_t print1;
int iBlockCount = -1;
int nBlocks = 4;
vector<Block*> LBlocks;
vector<Block*> UBlocks;
void *pThreadWorker(void* pData)
{
GPUlu* pGPUlu = (GPUlu *)pData;
irs_cudaSetDevice(pGPUlu->idDevice);
while (1)
{
int iBlockCountPinned = -1;
pthread_mutex_lock(&lock);
iBlockCount++;
iBlockCountPinned = iBlockCount;
pthread_mutex_unlock(&lock);
if (iBlockCountPinned >= nBlocks)
break;
Block* pBL = LBlocks[iBlockCountPinned];
Block* pBU = UBlocks[iBlockCountPinned];
pthread_mutex_lock(&print1);
cout << " _multiplyCompressedBlocks : " << iBlockCountPinned << " : " << pGPUlu->idDevice << " : " <<
pBL->mRows << " " << pBL->nRank << " " << pBL->nCols << " : " <<
pBU->mRows << " " << pBU->nRank << " " << pBU->nCols << " : " << endl;
pthread_mutex_unlock(&print1);
_multiplyCompressedBlocks(pBL, pBU, pGPUlu);
}
}
static void _test()
{
int nGPUs = 2;
int mRows0 = 2352;
int nCols0 = 2352;
int nBlockSets = 5;
float compression = 0.99;
cout << "Building L blocks ... " << endl;
for (int iBlock = 0; iBlock < nBlocks; iBlock++)
{
Block* pB = new Block();
pB->mRows = mRows0;
pB->nCols = nCols0;
_addCompressedBlock(pB, compression, LBlocks);
}
cout << "Building U blocks ... " << endl;
for (int iBlock = 0; iBlock < nBlocks; iBlock++)
{
Block* pB = new Block();
pB->mRows = mRows0;
pB->nCols = nCols0;
_addCompressedBlock(pB, compression, UBlocks);
}
cout << "Building GPU list ... " << endl;
vector<GPUlu* > GPUluList;
for (int iDevice = 0; iDevice < nGPUs; iDevice++)
{
GPUlu* pGPUlu = new GPUlu(mRows0, nCols0, iDevice);
GPUluList.push_back(pGPUlu);
}
cout << "Running ... " << endl;
for (int iBlockSet = 0; iBlockSet < nBlockSets; iBlockSet++)
{
cout << "iBlockSet = " << iBlockSet << endl;
iBlockCount = -1;
pthread_t threads[nGPUs];
for (int iDevice = 0; iDevice < nGPUs; iDevice++)
{
int rc;
if ((rc = pthread_create(&threads[iDevice], NULL, &pThreadWorker, (void *)GPUluList[iDevice])))
printf("Thread creation failed: %d\n", rc);
}
for (int iDevice = 0; iDevice < nGPUs; iDevice++)
pthread_join(threads[iDevice], NULL);
}
}
int main(int argc, char **argv)
{
cublasStatus_t status;
if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)
{
fprintf (stderr, "!!!! CUBLAS initialization error\n");
exit(0);
}
irs_bindCublas();
_test();
if (cublasDestroy(handle) != CUBLAS_STATUS_SUCCESS)
{
fprintf (stderr, "!!!! shutdown error (A)\n");
exit(0);
}
return 0;
}
To run the code I use buildNRun.sh
make clean
make print
make all
#LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib64
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.0/lib64
#LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.5/lib64
nohup ./main > main.out 2> main.err &
which invokes the following makefile :
CC=gcc
OPT=-O3
CUDAROOT=/usr/local/cuda
INC=-I$(CUDAROOT)/include
LIBS=-lpthread -L$(CUDAROOT)/lib64 -lcudart -lcublas
CFLAGS=$(OPT) $(INC)
print:
@echo $(CUDA)
@echo $(LIBS)
@echo $(INC)
all: main
main: main.o cudaWrap.o
@$(CC) $(CFLAGS) -o main *.o $(LIBS)
main.o: main.cpp
@$(CC) $(CFLAGS) -c -o main.o main.cpp
cudaWrap.o: cudaWrap.cpp cudaWrap.hpp
@$(CC) $(CFLAGS) -c -o cudaWrap.o cudaWrap.cpp
clean:
rm *.o main
I can zip the files up if that is easier but I do not know the best way to post the application. More than happy to post zip files just tell me where.
This issue is a very serious problem for our application. We cannot migrate to CUDA >= 6.0. It is holding up customer deliveries at this time. Please help me understand what is going on.