OpenCL 6GB memory problem: error message at 4.2GB of memory

We use OpenCL for a geophysical program and want to use the whole 6GB of memory on our Tesla C2070 GPUs. Our software development team received a “clEnqueueWriteBuffer: Memory object allocation failure” at 4.3GB, so we can only use about 4.2GB.

We tested a CUDA 4 program and it works fine with 6GB.

Is there anybody here who could help us? Where is the error?

Thanks for your help!

What values does the OpenCL runtime return for CL_DEVICE_GLOBAL_MEM_SIZE and CL_DEVICE_MAX_MEM_ALLOC_SIZE on your Tesla device (also see this and this thread)? It might be that you’re allocating more than CL_DEVICE_MAX_MEM_ALLOC_SIZE but less than CL_DEVICE_GLOBAL_MEM_SIZE. Currently, it seems there’s no NVIDIA device that supports CL_DEVICE_MAX_MEM_ALLOC_SIZE == CL_DEVICE_GLOBAL_MEM_SIZE, so the only way to access the full CL_DEVICE_GLOBAL_MEM_SIZE amount of memory is to use more than one buffer, each with <= CL_DEVICE_MAX_MEM_ALLOC_SIZE in size.
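
For reference, a minimal sketch of how both limits can be read with clGetDeviceInfo; "device" is assumed to already hold a valid cl_device_id for your Tesla card:

```c
/* Minimal sketch: query the total global memory and the per-allocation
 * limit; "device" is assumed to be a valid cl_device_id obtained from
 * clGetDeviceIDs for the Tesla card. */
#include <CL/cl.h>
#include <stdio.h>

static void print_mem_limits(cl_device_id device)
{
    cl_ulong global_mem = 0, max_alloc = 0;

    /* Total global memory reported by the device */
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    /* Largest single buffer the implementation will allocate */
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    printf("CL_DEVICE_GLOBAL_MEM_SIZE    = %llu MiB\n",
           (unsigned long long)(global_mem >> 20));
    printf("CL_DEVICE_MAX_MEM_ALLOC_SIZE = %llu MiB\n",
           (unsigned long long)(max_alloc >> 20));
}
```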

Thanks for your answer.

CL_DEVICE_GLOBAL_MEM_SIZE = 6143MB

CL_DEVICE_MAX_MEM_ALLOC_SIZE = 1535MB

We want to allocate 10 x 500MB, and we are puzzled by the 4.2GB limit… So the problem is still there…


Is there nobody who can help us???

Hello,
You can use 6GB of memory on a 64-bit OS and if your program is compiled as a 64-bit binary; otherwise you will see 6GB in GLOBAL_MEM_SIZE but you can only allocate 4GB due to the 32-bit limitation.
Thanks
Jonathan

Hi Jonathan,

thanks for your answer.

We use a 64 bit OS (openSUSE 11.3 (x86_64)).

output DEVICE_QUERY:

Platform Name : NVIDIA CUDA
Platform Version : OpenCL 1.0 CUDA 4.0.1
NAME : Tesla C2070
VENDOR : NVIDIA Corporation
VERSION : 275.09.07
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.0 CUDA
GLOBAL_MEM_SIZE : 6143 MiB
GLOBAL_MEM_CACHE_SIZE : 229376 B
MAX_COMPUTE_UNITS : 14
MAX_WORK_GROUP_SIZE : 1024
MAX_CLOCK_FREQUENCY : 1147 MHz
MAX_MEM_ALLOC_SIZE : 1535 MiB
CL_DEVICE_ADDRESS_BITS: 32

I am wondering about “CL_DEVICE_ADDRESS_BITS: 32”. Shouldn’t this be 64 bits, or is this correct?

Regards

Michael

In this case the only thing I can suggest is data reduction, when it is possible of course!

Hi, thanks for the reply.

We have no chance to reduce the data. We bought the 6GB C2070 and want to use the whole memory. Is there any way to set CL_DEVICE_ADDRESS_BITS to 64 (BIOS update, special driver, another Linux distribution, …)?

Did you try the new 280.19 beta driver with OpenCL 1.1?

yes

Hello,
Just write a very simple test case where you try to allocate 6 buffers of 1GB each (clCreateBuffer + clEnqueueWriteBuffer), e.g. something like the sketch below.
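
A minimal sketch of such a test (it assumes a context and command queue have already been created for the Tesla device, and it deliberately keeps all buffers alive so you can see at which allocation the failure occurs):

```c
/* Simple allocation test: create 6 buffers of 1 GiB each and write to
 * every one of them so the allocations are actually backed on the device.
 * "context" and "queue" are assumed to be valid for the Tesla C2070. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_BUFFERS 6
#define BUF_SIZE    ((size_t)1 << 30)   /* 1 GiB per buffer */

static void test_six_buffers(cl_context context, cl_command_queue queue)
{
    cl_mem bufs[NUM_BUFFERS];
    char *host = malloc(BUF_SIZE);      /* host-side source data */
    cl_int err;

    if (!host) return;

    for (int i = 0; i < NUM_BUFFERS; ++i) {
        bufs[i] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                 BUF_SIZE, NULL, &err);
        if (err != CL_SUCCESS) {
            printf("clCreateBuffer #%d failed: %d\n", i, err);
            break;
        }
        /* Blocking write forces the device-side allocation to happen now */
        err = clEnqueueWriteBuffer(queue, bufs[i], CL_TRUE, 0,
                                   BUF_SIZE, host, 0, NULL, NULL);
        if (err != CL_SUCCESS) {
            printf("clEnqueueWriteBuffer #%d failed: %d\n", i, err);
            break;
        }
        printf("buffer %d OK (%zu bytes written)\n", i, BUF_SIZE);
    }
    free(host);
}
```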
Thanks
jonathan

Hello, I have the same problem with a radar processing application: I can allocate the full 6GB from two different processes running simultaneously, but a single process cannot allocate the full 6GB (it is limited to 4GB).

Same as here, I am on 64-bit CentOS or 64-bit Ubuntu (two different machines, same problem).

I will check CL_DEVICE_ADDRESS_BITS

(…)
OpenCL platform version = OpenCL 1.1 CUDA 4.2.1
GPU #1:
name = Tesla C2075
32-bit addressing
(…)

Is there some way to change that?

CL_DEVICE_ADDRESS_BITS is defined as the DEFAULT address space in the specs. Does that mean there is some way to select 64 or 32 bits?

By the way, since a single memory object cannot exceed 1/4 of the GPU memory, 32 bits could suffice when kernels use only one or two of the 4 memory objects.
It is surprising that the restriction to 4GB is at the PROCESS level, because when two processes allocate the full 6GB as 4 buffers of 1.5GB (2 each), there is no reason that the 2 buffers of one process are not interleaved with the 2 buffers of the other process (i.e. it would fail if there were something like a “base” address for the process and an “offset” for each buffer, in the case where the order in memory is (buffer 1 of proc 1)-(buffer 1 of proc 2)-(buffer 2 of proc 1)-(buffer 2 of proc 2)).
The reason I do not believe there is such a “base” for each process is that I can start two processes which progressively allocate 1+1GB and 1.33+1.33+1.33GB, or 1.5+1.5GB and 1.5+1.5GB, or 1.33+1.33+1.33GB and 1+1GB, and there is no problem (I guess the interface has no way of predicting how much RAM process #1 will eventually use, so it could not set the base for process #2 at 2GB, 3GB, or 4GB respectively in my examples…).
The 32 bits of addressing are more probably a restriction for some housekeeping in the OpenCL interface than at the kernel level (as I guess the 1/4 allocation restriction is due to some segmentation of the graphics RAM into 4 banks to parallelize access).

Hi,

We are also using a quite large buffer in our system and noticed some quite weird things. Until reading this thread I was unaware of this limitation of buffer size, usually to 1/4 of the VRAM size. Somehow our kernels worked beyond this limit. Our system works flawlessly (only slowly, due to a weak GPU) with a 220MB buffer on a small Quadro NVS 3100 with 512MB VRAM (max alloc size: 128MB). On a Quadro 2000M (2GB VRAM, max alloc size: 512MB) it worked up to about 1.8GB with older drivers. We only noticed that there might be a problem when, after a driver update to version 300+, all attempts to request a buffer larger than about 1270MB failed (on the 2GB 2000M), even though there were enough unoccupied resources. We found similar behaviour on a GeForce GTX 580 (3GB VRAM) with proportionally larger limits.

We are still using buffer sizes beyond the specified allocation size and haven’t experienced any data corruption. I assume the reported limit is not always the real maximum, but if you go beyond it and are lucky enough that it still works, you are relying entirely on luck with the next device or driver version.

Did someone else encounter similar behaviour? Could it even be considered a bug (a missing or wrong condition) in the NVIDIA OpenCL implementation that the API lets you proceed without returning an error when using a buffer larger than the specified maximum size?
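
As an illustration, a rough sketch of how the check could be done on the application side, assuming the device handle is available (per the spec, clCreateBuffer should return CL_INVALID_BUFFER_SIZE for requests above CL_DEVICE_MAX_MEM_ALLOC_SIZE, but as described above the driver does not seem to enforce that here):

```c
/* Rough sketch: check CL_DEVICE_MAX_MEM_ALLOC_SIZE on the application
 * side before calling clCreateBuffer, instead of relying on the driver
 * to reject oversized requests with CL_INVALID_BUFFER_SIZE. */
#include <CL/cl.h>
#include <stdio.h>

static cl_mem create_checked_buffer(cl_context ctx, cl_device_id dev,
                                    size_t size, cl_int *err)
{
    cl_ulong max_alloc = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    if ((cl_ulong)size > max_alloc) {
        fprintf(stderr, "requested %zu bytes, device max allocation is %llu\n",
                size, (unsigned long long)max_alloc);
        *err = CL_INVALID_BUFFER_SIZE;
        return NULL;
    }
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, err);
}
```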

Hi,
I am having the exact same problem with a Tesla K20c, which has 5GB of global memory. I can allocate up to 2.5GB in two separate processes. However, I cannot allocate more than 4GB in one process. I believe this has to do with CL_DEVICE_ADDRESS_BITS. Does anybody have a workaround?

Hello,

the same problem here with the NVIDIA OpenCL driver. CL_DEVICE_ADDRESS_BITS is hardcoded to 32 and it is not possible to allocate more than 4GB.
Here is the answer from NVIDIA support:

'we do not support >4GB memory using OpenCL.  We recommend the customer 
uses CUDA to access the full 6GB of memory.'

It looks as if NVIDIA will push CUDA by way of this driver limitation.

Does anyone know what the current situation is? Is CL_DEVICE_ADDRESS_BITS still hardcoded to 32? If not, how do I enable 64-bit mode? I have an NVIDIA Quadro K6000 with 12GB of memory but I can only access about 3GB through OpenCL. The computer/server is running Ubuntu.

The current NVIDIA OpenCL implementation is limited to a 32-bit address space, ~4GB total addressable/allocatable.