Hey y’all,
Just to give everyone out there a heads-up, this issue is very specific to the CUDA Driver API…
I can’t seem to get either cuModuleLoadData() or cuModuleLoadDataEx() to load a simple PTX module.
The docs specifically say that I should be able to do this. To wit, it describes one of the valid ways to submit a CUDA module as:
The error that keeps coming back from either Driver API function is always the same: CUDA_ERROR_NO_BINARY_FOR_GPU…
The docs say this about that error (and no, I couldn’t find anything else about it anywhere):
Well, I had to read that quote many times before I could even begin to wrap my head around what they were trying to say, and I still don’t quite get it…
However, in terms of the options I’m passing in, they are as follows:
cuModuleLoadData(): none (it doesn’t take any)…
cuModuleLoadDataEx():
CU_JIT_INFO_LOG_BUFFER: new char [1024]
CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES: 1024
CU_JIT_ERROR_LOG_BUFFER: new char [1024]
CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES: 1024
CU_JIT_TARGET: CU_TARGET_COMPUTE_21
CU_JIT_FALLBACK_STRATEGY: CU_PREFER_PTX
The computer that the program is being run on has only one NVIDIA GPU, a GeForce GT 525M with 96 CUDA cores, 2 streaming multiprocessors, and a Compute Capability of 2.1…
The CUDA Driver Version is 8.17.12.6830. Unfortunately, I haven’t yet figured out how this version number is supposed to relate to the “r270” or “r260” versioning terminology that is so frequently used in the CUDA docs…
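On the version-number question: a commonly cited convention for NVIDIA's Windows drivers is that the last five digits of the file version, read as XXX.YY, give the release number. Below is a minimal sketch of that mapping, assuming the convention holds for this driver. The function name is hypothetical and is not part of any NVIDIA API.

```cpp
#include <cctype>
#include <string>

// Convert a Windows driver file version such as "8.17.12.6830" into an
// NVIDIA release number, ASSUMING the commonly cited convention that the
// last five digits, read as XXX.YY, form the release number.
// Returns an empty string if the input contains fewer than five digits.
// (Hypothetical helper; not part of the CUDA Driver API.)
std::string nvidiaReleaseFromWindowsVersion(const std::string& windowsVersion) {
    // Collect only the digits, dropping the dots between the fields.
    std::string digits;
    for (char c : windowsVersion) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            digits.push_back(c);
        }
    }
    if (digits.size() < 5) {
        return "";
    }
    // Keep the last five digits and insert a dot after the third one.
    const std::string lastFive = digits.substr(digits.size() - 5);
    return lastFive.substr(0, 3) + "." + lastFive.substr(3);
}
```

Under that assumption, 8.17.12.6830 would read as release 268.30.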
The PTX code for the CUDA Module I’m attempting to load is as follows, which I’m passing in as a NULL-terminated string pointer (and using hard tabs):
.version 2.3
.target sm_20
.address_size 32
.global .u32 Frst[512];
.global .u32 Scnd[512];
.global .u32 USum[512];
.entry AddVec
{
.reg .u32 ndx;
.reg .u32 Fptr;
.reg .u32 Sptr;
.reg .u32 Uptr;
.reg .u32 Fval;
.reg .u32 Sval;
.reg .u32 Uval;
cvta.global.u32 Fptr, Frst;
cvta.global.u32 Sptr, Scnd;
cvta.global.u32 Uptr, USum;
mad.lo.u32 ndx, %ctaid.x, %ntid.x, %tid.x;
shl.b32 ndx, ndx, 2;
add.u32 Fptr, Fptr, ndx;
add.u32 Sptr, Sptr, ndx;
add.u32 Uptr, Uptr, ndx;
ld.global.u32 Fval, [Fptr];
ld.global.u32 Sval, [Sptr];
add.u32 Uval, Fval, Sval;
st.global.u32 [Uptr], Uval;
ret.uni;
}
Basically, all the above code does is create a kernel that adds two 32-bit integer array elements, and puts the result in another 32-bit integer array element, the subscripts for which are all the same, and are solely determined by the thread index… You might say that it adds one 512-element vector to another of the same size, and puts the result in a ‘result’ vector (all defined in the global address space)…
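For reference, the intended behavior can be written as a plain CPU loop. This is only a sketch of what the PTX above is meant to compute, not a replacement for it. The function name and the gridDim/blockDim parameters are hypothetical stand-ins for the launch configuration, and the bounds check is an addition that the PTX does not have.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for what the AddVec kernel is meant to compute:
//   USum[ndx] = Frst[ndx] + Scnd[ndx], with ndx = ctaid.x * ntid.x + tid.x.
// The PTX then shifts ndx left by 2 to turn the element index into a byte
// offset; here std::vector indexing does that scaling implicitly.
// gridDim and blockDim are hypothetical launch dimensions.
std::vector<std::uint32_t> addVecReference(const std::vector<std::uint32_t>& frst,
                                           const std::vector<std::uint32_t>& scnd,
                                           unsigned gridDim, unsigned blockDim) {
    std::vector<std::uint32_t> usum(frst.size(), 0u);
    for (unsigned ctaid = 0; ctaid < gridDim; ++ctaid) {
        for (unsigned tid = 0; tid < blockDim; ++tid) {
            const std::size_t ndx = static_cast<std::size_t>(ctaid) * blockDim + tid;
            // Guard added for the CPU sketch only; the PTX has no such check.
            if (ndx >= frst.size() || ndx >= scnd.size()) {
                continue;
            }
            usum[ndx] = frst[ndx] + scnd[ndx];  // unsigned add wraps on overflow
        }
    }
    return usum;
}
```

With 512-element inputs, any launch shape whose grid and block sizes multiply to 512 (for example 4 blocks of 128 threads) would cover every element exactly once.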
And, oh yeah, I’m calling into the nvcuda.dll that’s in the \Windows\SysWOW64 folder (4,936,808 bytes), because I’m calling it from a 32-bit executable running under a 64-bit version of Windows 7… That’s probably more than you needed or wanted to know, but whatever…
I tried replacing “.version 2.3” with “.version 2.2” - no difference…
I tried replacing “.version 2.3” with “.version 1.4” - no difference…
I tried removing the “ret.uni” - no difference…
I tried following the kernel name with empty parentheses - no difference…
Can anyone out there offer any help with this, or perhaps some sage advice? Either would be much appreciated.
Thanks in advance…