CUDA_ERROR_NO_BINARY_FOR_GPU loading PTX ? Can't load PTX 'image' no matter what I do..

Billy_Rubin · February 3, 2012, 10:11pm

Hey y’all,

Just to give everyone out there a heads-up, this issue is very specific to the CUDA Driver API…

I can’t seem to get either cuModuleLoadData() or cuModuleLoadDataEx() to load a simple PTX module.

The docs specifically say that I should be able to do this. To wit, it describes one of the valid ways to submit a CUDA module as:

The error that keeps coming back from either Driver API function is always the same: CUDA_ERROR_NO_BINARY_FOR_GPU…

The docs say this about that error (and no, I couldn’t find anything else about it anywhere):

Well, I had to read that quote many times before I could even begin to wrap my head around what they were trying to say, and I still don’t quite get it…

However, in terms of the options I’m passing in, they are as follows:

cuModuleLoadData(): none (it doesn’t take any)…

cuModuleLoadDataEx():

CU_JIT_INFO_LOG_BUFFER:			new char [1024]
CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES:	1024
CU_JIT_ERROR_LOG_BUFFER:		new char [1024]
CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES:	1024
CU_JIT_TARGET:				CU_TARGET_COMPUTE_21
CU_JIT_FALLBACK_STRATEGY:		CU_PREFER_PTX
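
A detail that is easy to get wrong with option lists like the one above is how the values are packed. As far as the Driver API documentation and NVIDIA's samples show, cuModuleLoadDataEx() takes two parallel arrays (option keys and `void*` values): log buffers are passed as pointers, while the buffer sizes are passed as integers cast to `void*`, not as pointers to integers. The following is a minimal sketch of that packing only. `JitOpt`, `JitOptionArrays`, and `packLogOptions` are hypothetical names, and the enum is a stand-in for the real CU_JIT_* enumerators so that the sketch compiles without cuda.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the CU_JIT_* option keys listed in the post.
// Real code would include cuda.h and use CUjit_option values instead.
enum class JitOpt {
    InfoLogBuffer,       // stands in for CU_JIT_INFO_LOG_BUFFER
    InfoLogBufferSize,   // stands in for CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES
    ErrorLogBuffer,      // stands in for CU_JIT_ERROR_LOG_BUFFER
    ErrorLogBufferSize   // stands in for CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES
};

// Parallel arrays in the shape the Driver API expects: keys[i] pairs with values[i].
struct JitOptionArrays {
    std::vector<JitOpt> keys;
    std::vector<void*>  values;
};

// Buffers go in as pointers; sizes go in as integers cast to void*.
JitOptionArrays packLogOptions(char* infoLog, std::size_t infoSize,
                               char* errorLog, std::size_t errorSize) {
    JitOptionArrays o;
    o.keys = { JitOpt::InfoLogBuffer, JitOpt::InfoLogBufferSize,
               JitOpt::ErrorLogBuffer, JitOpt::ErrorLogBufferSize };
    o.values = { infoLog,
                 reinterpret_cast<void*>(static_cast<std::uintptr_t>(infoSize)),
                 errorLog,
                 reinterpret_cast<void*>(static_cast<std::uintptr_t>(errorSize)) };
    return o;
}

// With the real API the arrays would then be handed over roughly as:
//   cuModuleLoadDataEx(&module, ptxText, numOptions, keys, values);
// If the load fails, the error log buffer is the first place to look,
// since it should hold any message the PTX JIT compiler produced.
```

Printing the error log after a failed load is usually more informative than the CUDA_ERROR_NO_BINARY_FOR_GPU code on its own.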

The computer that the program is being run on only has one NVIDIA GPU: a GeForce GT 525M with 96 CUDA cores, 2 streaming multiprocessors, and compute capability 2.1…

The CUDA Driver Version is 8.17.12.6830. Unfortunately, I haven’t yet figured out how this version number is supposed to relate to the “r270” or “r260” versioning terminology that is so frequently used in the CUDA docs…

The PTX code for the CUDA Module I’m attempting to load is as follows, which I’m passing in as a NULL-terminated string pointer (and using hard tabs):

.version 2.3
.target sm_20
.address_size 32

.global .u32 Frst[512];
.global .u32 Scnd[512];
.global .u32 USum[512];

.entry AddVec
{
	.reg .u32 ndx;
	.reg .u32 Fptr;
	.reg .u32 Sptr;
	.reg .u32 Uptr;
	.reg .u32 Fval;
	.reg .u32 Sval;
	.reg .u32 Uval;

	cvta.global.u32 Fptr, Frst;
	cvta.global.u32 Sptr, Scnd;
	cvta.global.u32 Uptr, USum;

	mad.lo.u32 ndx, %ctaid.x, %ntid.x, %tid.x;
	shl.b32 ndx, ndx, 2;

	add.u32 Fptr, Fptr, ndx;
	add.u32 Sptr, Sptr, ndx;
	add.u32 Uptr, Uptr, ndx;

	ld.global.u32 Fval, [Fptr];
	ld.global.u32 Sval, [Sptr];
	add.u32 Uval, Fval, Sval;
	st.global.u32 [Uptr], Uval;

	ret.uni;
}

Basically, all the above code does is create a kernel that adds two 32-bit integer array elements, and puts the result in another 32-bit integer array element, the subscripts for which are all the same, and are solely determined by the thread index… You might say that it adds one 512-element vector to another of the same size, and puts the result in a ‘result’ vector (all defined in the global address space)…
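
For readers who find the PTX hard to follow, here is a host-side C++ sketch of what the kernel is described as computing. It is an emulation for illustration, not the poster's code: `addVecThread`, `addVecLaunch`, and `kElems` are hypothetical names, and the bounds check is an addition that the PTX does not have.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kElems = 512;  // matches the .u32 X[512] declarations
using Vec = std::array<std::uint32_t, kElems>;

// One "thread" of AddVec. The element index is ctaid.x * ntid.x + tid.x,
// which is what the mad.lo.u32 instruction computes. The PTX then shifts the
// index left by 2 to get a byte offset; array indexing does that implicitly.
void addVecThread(const Vec& frst, const Vec& scnd, Vec& usum,
                  unsigned ctaid, unsigned ntid, unsigned tid) {
    const std::size_t ndx = static_cast<std::size_t>(ctaid) * ntid + tid;
    if (ndx < kElems) {  // guard added for this sketch; the PTX has none
        usum[ndx] = frst[ndx] + scnd[ndx];
    }
}

// Emulates launching gridDim blocks of blockDim threads, one after another.
void addVecLaunch(const Vec& frst, const Vec& scnd, Vec& usum,
                  unsigned gridDim, unsigned blockDim) {
    for (unsigned ctaid = 0; ctaid < gridDim; ++ctaid) {
        for (unsigned tid = 0; tid < blockDim; ++tid) {
            addVecThread(frst, scnd, usum, ctaid, blockDim, tid);
        }
    }
}
```

Calling `addVecLaunch(a, b, c, 4, 128)` covers all 512 elements, which corresponds to a launch of 4 blocks of 128 threads on the device.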

And, oh yeah, I’m calling into the nvcuda.dll that’s in the \Windows\SysWOW64 folder (4,936,808 bytes), because I’m calling it from a 32-bit executable running under a 64-bit version of Windows 7… That’s probably more than you needed or wanted to know, but whatever…

I tried replacing “.version 2.3” with “.version 2.2” - no difference…

I tried replacing “.version 2.3” with “.version 1.4” - no difference…

I tried removing the “ret.uni” - no difference…

I tried following the kernel name with empty parentheses - no difference…

Can anyone out there offer any help with this, or perhaps some sage advice? Any of either would be much appreciated.

Thanks in advance…

orthopteroid · April 5, 2012, 9:32pm

I’m using Windows. My main() is a .c file and my kernel is in a .cu file. My .cu is compiled to a PTX file, which my project then links in as a resource. To load the PTX module I use something like:

// Locate the embedded PTX text (resource type "PTX") in the executable.
HRSRC rhPTX = ::FindResource( NULL, MAKEINTRESOURCE( dwResource ), "PTX" );
DWORD dwPTX = ::SizeofResource( NULL, rhPTX );
HGLOBAL ghPTX = ::LoadResource( NULL, rhPTX );
const char* szPTXOriginal = (const char*)::LockResource( ghPTX );

// The resource data is not NUL-terminated, so copy it and append a terminator.
char* szPTXResource = (char*)malloc( dwPTX + 1 );
memcpy( szPTXResource, szPTXOriginal, dwPTX );
szPTXResource[ dwPTX ] = '\0';

I then hand szPTXResource directly to cuModuleLoadData. I’ve never had the CUDA_ERROR_NO_BINARY_FOR_GPU error that you seem to be getting. Can’t help there, sorry. I have a GTX285/compute1.3 build target.

Although I had this working, I’ve since broken it as I now want my kernel to share a file with my main() - which seems to necessitate some other way of finding the kernel resource in my exe. What option do I build with? What resource do I then go fishing for?

I’ve noticed some goofiness in the version numbering as well, but on Linux - I think a Windows driver 8.17.12.6830 corresponds to Linux driver version 268.30. But I’ve never quite figured it out. Perhaps I had the wrong glasses on.
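
The one data point above (8.17.12.6830 and 268.30) fits a commonly described convention: strip the dots from the Windows version, take the last five digits, and put a decimal point after the third. The sketch below assumes that convention holds. The thread itself does not confirm it, and `nvidiaReleaseFromWindowsVersion` is a hypothetical helper name.

```cpp
#include <string>

// Assumed convention: the NVIDIA release number is the last five digits of
// the Windows driver version, written as XXX.YY.
// Returns an empty string if the input has fewer than five digits.
std::string nvidiaReleaseFromWindowsVersion(const std::string& windowsVersion) {
    std::string digits;
    for (char c : windowsVersion) {
        if (c >= '0' && c <= '9') {
            digits += c;  // keep digits only, dropping the dots
        }
    }
    if (digits.size() < 5) {
        return "";
    }
    const std::string lastFive = digits.substr(digits.size() - 5);
    return lastFive.substr(0, 3) + "." + lastFive.substr(3);
}
```

Under this assumption, the driver version quoted in the first post maps to release 268.30, matching the guess above.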

See you around again sometime!

orthopteroid · April 5, 2012, 10:30pm

Ahh, silly me. I only had to use main.ptx as my linked resource and then put an ‘extern "C" __global__’ decoration on my file-scope kernel to prevent C++ name mangling, so that cuModuleGetFunction could find it by name. Cheers.