cudaMalloc error on Windows 10

Description:
cudaMalloc crashes the process or produces the following error message:
all CUDA-capable devices are busy or unavailable

The bug is reproducible on Windows 10 with driver versions between 355.60 and 361.91, using CUDA toolkit 7.0 and 7.5. I was unable to reproduce it on Windows 8.1.

The result (crash or error message) depends on the GPU model. We tested several single- and multi-GPU configurations, including GTX 640, 750, TITAN BLACK, TITAN Z, 690, 980 and 980 Ti cards.

TDR was disabled.

Duplication steps:
Compile a test app with Visual Studio 2012 that performs a simple device query and a cudaMalloc (small amount of memory). Run the program a few hundred times with a batch file.
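
The repeated runs could also be driven from a small host-side helper instead of a batch file. The following is only a sketch: `countFailedRuns` is a hypothetical name, and it assumes the test app exits with a non-zero status on failure (the sample code below does so through `exit( code )`, and a crash normally yields a non-zero status as well).

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of the original test app.
// Runs `command` through the system shell `runs` times and returns
// how many of those runs reported a non-zero status.
int countFailedRuns(const std::string &command, int runs)
{
    assert(runs >= 0);
    int failures = 0;
    for (int i = 0; i < runs; ++i)
    {
        // std::system returns an implementation-defined status;
        // 0 conventionally means the command succeeded.
        int status = std::system(command.c_str());
        if (status != 0)
        {
            std::fprintf(stderr, "run %d failed with status %d\n", i, status);
            ++failures;
        }
    }
    return failures;
}
```

Calling it with the path of the compiled test app and a run count of a few hundred mirrors the batch-file loop described above; the executable name is whatever your build produces.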

Product:
GTX 690, TITAN, TITAN Z, GTX 980, GTX 980TI

Toolkit versions:
7.5
7.0

Operating system:
The problem occurs on Windows 10. It works fine on Windows 8.1.

Note: I was unable to report this as a bug because the report page crashes every time.

Sample code:

#pragma once

#include <stdio.h>
#include <stdlib.h>

#include <cuda.h>
#include <cuda_runtime_api.h>

// Prints the CUDA error string with file and line, and optionally exits the process.
inline void gpuAssert( cudaError_t code, const char *file, int line, bool abort=true )
{
	if ( code != cudaSuccess ) 
	{
		fprintf( stderr, "GPUassert: %s %s %d\n", cudaGetErrorString( code ), file, line );
		if ( abort ) 
		{
			exit( code );
		}
	}
}

#define CUDA_SAFE_CALL( ans ) { gpuAssert( ( ans ), __FILE__, __LINE__ ); }

int main( int argc, char *argv[] )
{
	int numberOfDevices = 0;
	size_t sizeInBytes = 16*1024*1024;
	
	CUDA_SAFE_CALL( cudaGetDeviceCount( &numberOfDevices ) );
	for ( int i = 0; i < numberOfDevices; i++ )
	{
		CUDA_SAFE_CALL( cudaSetDevice( i ) );

		{
			int device;
			CUDA_SAFE_CALL( cudaGetDevice( &device ) );

			size_t mem_free, mem_tot;
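			CUDA_SAFE_CALL( cudaMemGetInfo( &mem_free, &mem_tot ) );

			// Cast size_t to unsigned long long so that %llu is correct on 32-bit and 64-bit builds.
			fprintf( stdout, "before cudaMalloc #%d: %llu bytes from %llu / %llu\n", device, (unsigned long long)sizeInBytes, (unsigned long long)mem_free, (unsigned long long)mem_tot );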
		}

		float *devPtr = nullptr;
		CUDA_SAFE_CALL( cudaMalloc( (void**)&devPtr, sizeInBytes ) );
		{
			int device;
			CUDA_SAFE_CALL( cudaGetDevice( &device ) );

			size_t mem_free, mem_tot;
			CUDA_SAFE_CALL( cudaMemGetInfo( &mem_free, &mem_tot ) );

			// Same cast as above to match the %llu format specifier.
			fprintf( stdout, "after cudaMalloc #%d: %llu bytes free\n", device, (unsigned long long)mem_free );
		}
	}
	return 0;
}

I’ve seen this sporadic error as well on a Win10x64/7.5 system with the latest drivers.

I am highlighting this here by way of quotation to increase the likelihood that someone from NVIDIA will see it. Has anybody else been successful in reporting this issue as a bug (via the bug reporting form linked from CUDA registered developer website)?

The issue was reproduced and a bug was filed internal to NVIDIA. I don’t have any further details at this time.

Regarding the bug report page failing on submission:

The web bug submission process includes a mechanism to attempt to detect threats in the submitted bug report. If a “threat” is detected, you will get an obscure web error when you attempt to submit the bug, and no report will be filed.

The specifics of the threat detection algorithm are not published for obvious reasons, and unfortunately it does generate false positives, somewhat frequently. For example, although I haven’t tried it, I think something in the form of a Linux rm command that also specified sudo and other things that would wipe out your disk drive would be flagged by the threat detection system if it appeared anywhere in the text. The bug report would then be rejected (again, not via a clear direct message but via an obscure web error). If you want to, try creating a posting here with such a text sequence; it will likely be detected.

If such a situation occurs, there are two suggestions which may be of interest:

  1. Remove elements from your bug report in a trial-and-error fashion until the bug report submission is successful. This can admittedly be tedious; however, you can try a binary divide-and-conquer approach to speed it up a bit. Once the bug report is submitted successfully, you can edit it later to add further content.

  2. Post here. Although there is no guarantee that items posted on this forum will be automatically handled or automatically converted to a bug for you, in some instances I or other NVIDIA employees may see your posting and take action. Speaking for myself, I generally will not file a bug report on anyone’s behalf, unless I can reproduce the issue myself. My willingness and ability to do so depends on the nature of the actual report/issue, and other factors such as my available cycles.
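
The divide-and-conquer search from suggestion 1 can be sketched as a bisection over the pieces of a report. This is only an illustration under stated assumptions: the report has been split into chunks (for example paragraphs), at least one chunk triggers the rejection on its own, and `isRejected` is a hypothetical stand-in for a manual submission attempt, since the form is not known to offer any programmatic interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper for illustration only.
// `isRejected(first, last)` should report whether a submission containing
// only chunks [first, last) is rejected. Returns the index of one
// offending chunk, assuming the full report [0, chunkCount) is rejected
// and a single chunk is sufficient to trigger the rejection.
std::size_t findOffendingChunk(
    std::size_t chunkCount,
    const std::function<bool(std::size_t, std::size_t)> &isRejected)
{
    assert(chunkCount > 0);
    std::size_t lo = 0;
    std::size_t hi = chunkCount; // invariant: [lo, hi) contains an offending chunk
    while (hi - lo > 1)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        if (isRejected(lo, mid))
            hi = mid; // the lower half alone is rejected
        else
            lo = mid; // otherwise the offender must be in the upper half
    }
    return lo;
}
```

Each loop iteration corresponds to one submission attempt, so a report of N chunks needs roughly log2(N) attempts rather than N.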

We’ve tried to report this bug several times (with and without sample code), but all we got was that obscure error message. This should be more user friendly!

Also, the form data should be saved after the error… I had to re-type everything after my first attempt.

The issue seems to be related to CUDA context creation, and not only to cudaMalloc. I was able to reproduce the problem with other CUDA functions as well. The crash randomly happens after the first CUDA call with context initialization.

I don’t see any mention of this issue in the release notes of the newest driver (neither as fixed nor as open):
http://us.download.nvidia.com/Windows/364.47/364.47-win10-win8-win7-winvista-desktop-release-notes.pdf

Does anyone have more details about the state of this bug?

Generally speaking, the status of a bug is accessible to its filer, and the NVIDIA engineers working on resolving it. That is the main reason I always recommend that a CUDA user who encounters an issue file their own bug report, so they have visibility into the status.

Thanks a lot for reporting this issue as a bug to NVIDIA.
I was unable to log in to the forum (the password reminder didn’t work), but I always read the new posts.

njuffa: I have tried several times to report the issue…anyway, generally I would prefer a working bug report page :)

Could anyone give me an update on the state of the bug?

Thanks again

A bug has been filed, and the issue has been reproduced internally.

A developer has taken an initial look at it and has identified an internal issue. It’s not the root cause, but it is something to investigate. There is no root cause yet and no timeline for a fix.

If you desire personalized communication, I suggest filing a bug. I’ve indicated how to do so in this thread.

I don’t expect to provide any further communication on this topic until the bug has been root caused, and a fix is identified, and a driver is released with the fix.

Until then, I won’t be responding to further requests for information of any kind.

Again, if you desire personalized communication, file a bug, and reference internal bug ID 1736037.

To be clear, I’m not just trying to be a jerk. There are a couple of reasons for this attitude.

  1. According to NVIDIA’s definition, these forums are not a “scalable” process for providing developer feedback for the bugs they have “filed”. The bug reporting system is.

  2. Developers work on issues at NVIDIA according to certain priorities. Not all issues get worked on with the same level of priority. If I or someone else at NVIDIA filed all the bugs that may be reported in these forums, there would be little differentiation or sense of priority. However, if a developer files a bug (perhaps in addition to one I or someone else has filed), then the priority is somewhat clearer. If you are then active in your bug, asking for updates, that also communicates a kind of priority.

Thanks again for your help and the information.

I’ve reported a bug with the internal bug ID in the description to be informed about the state of the issue.

Any updates on this? Are these sporadic errors on Windows 10 systems already resolved with a newer driver version (we use CUDA Toolkit 7.0)? I will get a new PC soon and have to decide whether to install Windows 7 (64-bit) or Windows 10 on it …

Short update: On a notebook with a GeForce 960M (CC 5.0) we get the same error (all CUDA-capable devices are busy or unavailable) after only ~30 seconds to 1 minute when running the test program in a loop in a batch file, even with the latest drivers. Definitely not good …

@BKoszegi: What is the bug ID?

Hi HannesF99,

The internal bug ID is 1736037. (It was mentioned in the 11th comment.) If you create a bug report referring to this ID, you will get a link to partners.nvidia.com. If you are lucky, you can log in and see the newest information about the bug. I couldn’t log in (it seems you need a different account), and I wrote several emails to NVIDIA support asking how to log in or register at partners.nvidia.com, but I never got any information.

My last information about the bug:
“And now, it has been assigned to the appropriate developer team for further investigation, we’ll keep you posted here once we have a fix.
Sorry for any inconvenience brought by this issue.” - by Kevin Kang - 03/21/2016 9:07 AM

Sorry for the late response. I gave up on getting any information or solution from NVIDIA, so I stopped checking this topic.

Update: Got a new workstation with Windows 10 (64-bit). On this machine, the error mentioned in this thread does not occur, so it seems that it got fixed.

It seems that installing 372.54 solves the problem (Windows Version 1511, OS Build 10586.420). Could someone verify this?

I am also unable to reproduce this issue with the latest 372.70 driver on Windows 10 64-bit (version 1607, OS build 14393) with a GTX TITAN X.