cuSolver LU factorization inside a for loop problem

Hi guys. I have an application that needs to solve many linear systems, so naturally I wrote a for loop that calls cusolverDnSgetrf repeatedly. The problem is that, at a random iteration, CUDA just hangs, the screen goes black, and all subsequent calls to cuSolver are ignored. I’ve made the following minimal example to demonstrate the issue:

	Eigen::MatrixXf A;
	Eigen::read_binary("mymatrix.bin", A);

	cusolverDnHandle_t cusolverhandle;
	cusolverDnCreate(&cusolverhandle);

	cudaStream_t stream1 = 0;
	cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
	cusolverDnSetStream(cusolverhandle, stream1);

	float * devPtrgcmgtcd = NULL, *d_work = NULL;

	cudaMalloc((void**)&devPtrgcmgtcd, sizeof(float)*A.size());

	int lwork = 0;
	int m = A.rows();
	cusolverDnSgetrf_bufferSize(cusolverhandle, m, m, devPtrgcmgtcd, m, &lwork);
	cudaDeviceSynchronize();

	cudaMalloc((void**)&d_work, sizeof(float)*lwork);

	int *devInfo = NULL;
	cudaMalloc((void**)&devInfo, sizeof(int));

	int *devIpiv = NULL;
	cudaMalloc((void**)&devIpiv, m*sizeof(int));
	int *hostipv = new int[m];
	int *hostinfo = new int;

	for (int i = 0; i < 1000000; i++){

		cudaMemcpy(devPtrgcmgtcd, A.data(), A.size()*sizeof(float), cudaMemcpyHostToDevice);
		cusolverStatus_t stat = cusolverDnSgetrf(cusolverhandle, m, m, devPtrgcmgtcd, m, d_work, devIpiv, devInfo);
		cudaError_t error = cudaDeviceSynchronize();

		std::cout << i << std::endl;
	}

Do I need to create the handles for cusolver inside the for loop?
Please, tell me what I am doing wrong.
The results for the first few iterations are good! The real application changes the A matrix on every loop iteration.
Thanks, any help is appreciated.
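Edit: for anyone trying to reproduce this, note that the loop above throws away both stat and error. A variant that checks them, plus the devInfo result, on every iteration should at least pinpoint the failing call (a sketch using the same variable names as the snippet above):

```cpp
// Same loop as above, but failing fast on any error.
// devInfo semantics for getrf: 0 = success, i > 0 = U(i,i) is zero,
// i < 0 = the i-th argument was invalid.
for (int i = 0; i < 1000000; i++) {
	cudaMemcpy(devPtrgcmgtcd, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);

	cusolverStatus_t stat = cusolverDnSgetrf(cusolverhandle, m, m,
	                                         devPtrgcmgtcd, m, d_work, devIpiv, devInfo);
	cudaError_t error = cudaDeviceSynchronize();

	int info = 0;
	cudaMemcpy(&info, devInfo, sizeof(int), cudaMemcpyDeviceToHost);

	if (stat != CUSOLVER_STATUS_SUCCESS || error != cudaSuccess || info != 0) {
		std::cerr << "iteration " << i << ": stat=" << stat
		          << " cuda=" << cudaGetErrorString(error)
		          << " devInfo=" << info << std::endl;
		break;
	}
}
```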

My A matrix is 5400x5400. I’ve tried allocating all of the device pointers inside the for loop and it still crashes. It runs for 2000 iterations at most. I’ve noticed that the lwork size is at most 27; does cuSolver have a maximum matrix size for this function?
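One sanity check worth adding (a sketch, using the same variables as the first snippet): verify that the workspace query itself succeeded, since 27 elements for a 5400x5400 getrf looks implausibly small:

```cpp
// Check the status of the workspace query itself, not just its output.
int lwork = 0;
cusolverStatus_t qstat = cusolverDnSgetrf_bufferSize(cusolverhandle, m, m,
                                                     devPtrgcmgtcd, m, &lwork);
std::cout << "bufferSize status = " << qstat
          << ", lwork = " << lwork << " floats" << std::endl;
```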

This is most likely a bug in cusolverDnSgetrf_bufferSize. I am trying the QR factorization instead of LU, and the cusolverDnSgeqrf_bufferSize function for QR reports a workspace of about 29M elements, versus 27 from cusolverDnSgetrf_bufferSize.

I’m going to loop the hell out of the QR factorization code and report here if it crashes.

I’ve replaced the LU factorization with QR and it ran all night inside that loop without a problem: 140,000 iterations without crashing.
There is something wrong with the LU routines, how can we file a bug report?
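For anyone who wants to reproduce the workaround, here is roughly how I swapped the LU call for QR inside the loop (a sketch; d_tau is an extra device array of length m holding the Householder scalars that geqrf needs):

```cpp
// Workspace query for QR instead of LU; note the much larger lwork.
int lwork = 0;
cusolverDnSgeqrf_bufferSize(cusolverhandle, m, m, devPtrA, m, &lwork);

float *d_work = NULL, *d_tau = NULL;
cudaMalloc((void**)&d_work, sizeof(float) * lwork);
cudaMalloc((void**)&d_tau, sizeof(float) * m);  // Householder scalars

for (int i = 0; i < 1000000; i++) {
	cudaMemcpy(devPtrA, A, m * m * sizeof(float), cudaMemcpyHostToDevice);
	// Unlike getrf, geqrf takes the workspace size explicitly.
	cusolverDnSgeqrf(cusolverhandle, m, m, devPtrA, m, d_tau, d_work, lwork, devInfo);
	cudaDeviceSynchronize();
}
```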

I’ve tried using the QR buffer-size function to allocate the d_work memory on the GPU and then calling the LU function, with no success.

You can file a bug report at developer.nvidia.com.

You will need to be a registered developer and log in with your registered developer credentials.

You will likely be asked to provide a complete example that someone else could compile, run, and see the issue. What you have shown here is not that.

Ok, thanks for your reply. Sorry for not providing a full working code. I’m submitting the following code for the bug report.

#include "stdafx.h"
#include <cuda_runtime.h>
#include <cusolver_common.h>
#include <cusolverDn.h>
#include <iostream>

	int _tmain(int argc, _TCHAR* argv[])
	{
		int m = 5400;
		float * A;
		A = new float[m*m];

		for (int i = 0; i < m*m; i++){
			A[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
		}

		cusolverDnHandle_t cusolverhandle;
		cusolverStatus_t err = cusolverDnCreate(&cusolverhandle);

		float * devPtrA = NULL;
		cudaMalloc((void**)&devPtrA, sizeof(float)*m*m);

		int lwork = 0;
		cusolverDnSgetrf_bufferSize(cusolverhandle, m, m, devPtrA, m, &lwork);
		cudaDeviceSynchronize();

		float *d_work = NULL;
		cudaMalloc((void**)&d_work, sizeof(float)*lwork);

		int *devInfo = NULL;
		cudaMalloc((void**)&devInfo, sizeof(int));

		int *devIpiv = NULL;
		cudaMalloc((void**)&devIpiv, m*sizeof(int));

		for (int i = 0; i < 1000000; i++){

			cudaMemcpy(devPtrA, A, m*m*sizeof(float), cudaMemcpyHostToDevice);
			cusolverStatus_t stat = cusolverDnSgetrf(cusolverhandle, m, m, devPtrA, m, d_work, devIpiv, devInfo);
			cudaError_t error = cudaDeviceSynchronize();

			if (error){
				cudaFree(devIpiv);
				cudaFree(devInfo);
				cudaFree(d_work);
				cudaFree(devPtrA);
				delete[] A;

				cusolverDnDestroy(cusolverhandle);

				return -error;
			}

			std::cout << i << std::endl;
		}

		cudaFree(devIpiv);
		cudaFree(devInfo);
		cudaFree(d_work);
		cudaFree(devPtrA);
		delete[] A;

		cusolverDnDestroy(cusolverhandle);
		return 0;

	}

I’ve run your posted code on CUDA 9 on linux (9.0.176, driver 384.90, Ubuntu 14.04, Pascal Titan X) and it has run for over 6000 iterations so far with no hangs or issues.

I am using CUDA 9.0.176 on Windows 10 64-bit, a GTX 1080 Ti 11 GB, driver 385.54.

I’ve tested the exact same compiled binary on another machine with a GTX 780, CUDA 9, driver 385.54, and it worked for over 13000 iterations with no problems.

I’m updating the drivers to 388 now; let’s see if the issue remains.

Anyway it would be nice to find someone with the same device and OS as mine to test this. Thank you for your help.

I’ve run this on Ubuntu 17.10 with CUDA 9 and driver 384 and it runs fine!
At least I know that my card is working fine.

Hi, I’m dealing with a very similar problem here: my cusolverDnDgeqrf_bufferSize is returning a very large buffer size.

I’m using CUDA 10 on a GTX 960M under Arch Linux.

I’m trying to solve very small systems of 20x20, and I get a buffer size of 49152 elements for the QR decomposition.

I also get the very same number running the example “C.1. QR Factorization Dense Linear Solver” from the NVIDIA documentation: https://docs.nvidia.com/cuda/cusolver/index.html#ormqr-example1

Is there something I can do about it? What would be a safe estimate for my buffer size without using this function?

Thank you very much in advance.

There isn’t anything you can do about it. There is no alternative method to estimate buffer size.
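To put the number in perspective, here is a sketch of the query for the 20x20 double-precision case described above (handle is assumed to be an already-created cusolverDnHandle_t):

```cpp
// Workspace query for a 20x20 double-precision QR factorization.
int n = 20, lwork = 0;
double *d_A = NULL;
cudaMalloc((void**)&d_A, sizeof(double) * n * n);
cusolverDnDgeqrf_bufferSize(handle, n, n, d_A, n, &lwork);
// lwork is counted in elements of double, so ~49152 elements is only
// ~384 KB of device memory: large relative to the 20x20 matrix,
// but tiny in absolute terms.
```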