cuMemAlloc_v2 return address out of range

Hi, please tell me how it is possible that cuMemAlloc_v2 returns an address out of range?
I use cuMemAlloc_v2 instead of cuMemAlloc because of the 32-bit limitation in cuMemAlloc.

So I tried cuMemAlloc_v2(@DeviceReturnNumberUnAlign,384)
and got the pointer 38849740800, but my GPU has only 11811160064 bytes of memory.

I tried to check the pointer with cuMemGetAddressRange_v2(@pbase, @bytesizepbase, DeviceReturnNumberUnAlign)
and got pbase: 38849740800, bytesizepbase: 384. The result matches, but it is out of range!

cuMemAlloc_v2 doesn’t appear to be a documented part of the CUDA driver API:

[url]https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM[/url]

I’m not aware of any 32-bit limitation on cuMemAlloc:

[url]https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467[/url]

size_t is 64 bits on a 64-bit platform.

Not sure why you have @ signs in your code. Are you intending those to be ampersands (&)?

The address space is a virtual address space. There is no reason to assume that, just because your device has 12 GB of memory, any pointer to an allocation will fall within the first 12 GB of the address space.

Ignore the @ symbol; it takes the address of a variable in the programming language I am writing in.
I am using a 64-bit platform, a 64-bit compiler, 64-bit integer variables, and the CUDA driver API.

Ok, let's start from the beginning and look at all the limitations in the CUDA API:

cuDeviceTotalMem(@totmemoty,CudaDevice) gives me an invalid result: totmemoty = 4294967295 = 0xFFFFFFFF
cuDeviceTotalMem_v2(@totmemoty,CudaDevice) returns totmemoty = 11811160064 = 0x2C0000000
As you can see, only the _v2 version gives the correct result.
Next…

cuMemGetInfo(@freebytes,@totalbytes)
freebytes = 3713007616 = 3541 MB
totalbytes = 3757047808 = 3583 MB
So this is an invalid result!

Trying _v2:
cuMemGetInfo_v2(@freebytes,@totalbytes)
freebytes = 9652142080 = 9205 MB
totalbytes = 11811160064 = 11264 MB
Correct!
That is why I switched every possible function to _v2.

cuMemAlloc_v2(@DeviceReturnNumberUnAlign,384) returns DeviceReturnNumberUnAlign = 38849740800, which is out of range.
So if you say this is a virtual address space, I can do this:
DeviceReturnNumberUnAlign = DeviceReturnNumberUnAlign % totmemoty
38849740800 % 11811160064 = 3416260608, but this does not help, because on the next step I get error 1 (CUDA_ERROR_INVALID_VALUE):
cuMemcpyHtoD_v2(DeviceReturnNumberUnAlign, *hostmemoryarray, 50)

This code works with the old API functions (without _v2), but then only with the 32-bit limitation.
That is why I am trying to rebuild my code around the new _v2 functions…

When I do:

cuDeviceTotalMem(&totmem, CudaDevice)

I get a correct result. So I would say you are doing something wrong.

And of course you cannot do this with pointers:

DeviceReturnNumberUnAlign = DeviceReturnNumberUnAlign % totmemoty

Ok, maybe I am doing something wrong, but previously all of the functions (without the _v2 suffix) worked fine,
and it did not bother me because cards with more than 4 GB had not appeared yet.
Since then I have seen this limitation in every release of the CUDA driver API.

And only a few days ago I found out about the _v2 suffix.
So I tried to change my program so that it works correctly with more than 4 GB of GPU memory.

So if I am doing something wrong, why does cuDeviceTotalMem_v2 give me the correct result while cuDeviceTotalMem does not? Or at least, please explain why the results are different.

The same goes for cuMemGetInfo_v2 (correct result) versus cuMemGetInfo (invalid result).
And vice versa: cuMemAlloc gives a correct pointer within the GPU memory range (with the 32-bit limitation), but cuMemAlloc_v2 gives an incredible 38 GB pointer…
I did not change anything other than simply adding the _v2 suffix at the end…
Thanks!

Here is an example on CUDA 10.0, RHEL 7, Tesla P100 (16GB memory):

$ cat t487.cu
#include <cuda.h>
#include <iostream>
int main(){

  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  size_t tot;
  cuDeviceTotalMem(&tot,dev);
  std::cout << "total memory: " << tot << std::endl;
}
$ nvcc -o t487 t487.cu -lcuda
$ ./t487
total memory: 17071734784
$

Here is my example:

ImportC "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\lib\x64\cuda.lib"
  cuInit(Flags.i)
  cuDeviceTotalMem(bytes.i,dev.i)
  cuDeviceComputeCapability(major.i,minor.i,dev.i) 	
  cuDeviceGetCount(count.i)
  cuDeviceGetName(name.s,len.i,dev.i)  
  cuDeviceGetAttribute(pi.i,attrib.i,dev.i)
  cuDeviceGet(device.i, ordinal.i)
EndImport

Procedure exit()
  Delay(2000)
  CloseConsole()
  End
  
EndProcedure

OpenConsole()
count.i = 0
namedev.s=Space(128)
CudaDevice.i = 0
major.i = 0
minor.i = 0
piattrib.i = 0
sizebytes.i = 0

cuInit(0)
cuDeviceGetCount(@count)
Debug count
If count>0
  PrintN("Found "+count+" cuda device.")
  PrintN("----------------------------")
Else
  PrintN("No cuda device found.")
  exit()
EndIf

For i=0 To count-1 
  cuDeviceGet(@CudaDevice, i)                 
  cuDeviceGetName(namedev,128,CudaDevice)
  cuDeviceTotalMem(@sizebytes,CudaDevice)  
  PrintN("Cuda device["+Str(i)+"]:"+namedev+"("+Str(sizebytes/1048576)+"Mb)")  
Next i
Input()
exit()

And result:
Found 1 cuda device.

Cuda device[0]:GeForce RTX 2080 Ti(4095Mb)

As you can see from this link, I'm not the only one having this problem: cuDeviceTotalMem for Total Global Memory 32 bits limited? - NI Community

I have no idea what your example is; it's clearly not C or C++ code.

The article you linked refers to a problem with the NI implementation, not the CUDA toolkit itself.

I’m not going to try to explain undocumented API functions. The documented functions are the ones that I can explain. And when used correctly, they work fine. I’ve already given you an example.

You’re welcome to do whatever you wish, of course. Good Luck!