why is CL_DEVICE_MAX_MEM_ALLOC_SIZE never larger than 25% of CL_DEVICE_GLOBAL_MEM_SIZE only on NVIDIA?

TychoTithonus · February 7, 2017, 3:37pm

Summary:

Why is NVIDIA OpenCL CL_DEVICE_MAX_MEM_ALLOC_SIZE (allocatable memory) never more than 25% of CL_DEVICE_GLOBAL_MEM_SIZE, when other platforms are sometimes 50%, 70% or even 100%? If it is because of a mistaken interpretation of the OpenCL 1.2 standard, there may be an opportunity for NVIDIA to increase the memory available to applications. If it is instead an NVIDIA-specific constraint that is not arbitrary, what is the root cause?

Details:

In the OpenCL specification, CL_DEVICE_MAX_MEM_ALLOC_SIZE controls how much GPU memory is available for allocation. Quoting the spec:

“CL_DEVICE_MAX_MEM_ALLOC_SIZE - Max size of memory object allocation in bytes. The minimum value is max (1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024) for devices that are not of type CL_DEVICE_TYPE_CUSTOM.”

This phrasing is potentially confusing. Paraphrased, I interpret it as follows:

“When implementing OpenCL, CL_DEVICE_MAX_MEM_ALLOC_SIZE must be set in order to inform applications about maximum allocatable memory. To be compliant with this specification, this maximum must be at least one fourth of the device’s physical memory (CL_DEVICE_GLOBAL_MEM_SIZE) or 128 binary megabytes, whichever is greater.”

This defines the minimum for CL_DEVICE_MAX_MEM_ALLOC_SIZE. Notably, the specification does not define how to calculate a maximum for CL_DEVICE_MAX_MEM_ALLOC_SIZE.
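To make that reading concrete, here is a minimal sketch of the arithmetic in C. The helper names `spec_min_max_alloc` and `alloc_percent` are hypothetical and are not part of the OpenCL API; the sizes would normally come from a device query, but plain 64-bit integers are used here so the sketch does not depend on OpenCL headers.

```c
#include <stdint.h>

/* Hypothetical helper: the smallest CL_DEVICE_MAX_MEM_ALLOC_SIZE that a
 * non-custom device may report under the wording quoted above, i.e.
 * max(global_mem_size / 4, 128 * 1024 * 1024). Implementations are free
 * to report more than this. */
static uint64_t spec_min_max_alloc(uint64_t global_mem_size)
{
    const uint64_t quarter = global_mem_size / 4;
    const uint64_t floor_bytes = UINT64_C(128) * 1024 * 1024;
    return quarter > floor_bytes ? quarter : floor_bytes;
}

/* Hypothetical helper: a reported limit as a percentage of global memory,
 * which is the figure compared across platforms in this post. */
static double alloc_percent(uint64_t max_alloc, uint64_t global_mem_size)
{
    if (global_mem_size == 0) {
        return 0.0;
    }
    return 100.0 * (double)max_alloc / (double)global_mem_size;
}
```

Under this reading, a card with 8 GiB of global memory must report at least 2 GiB, and a device with only 256 MiB must still report at least 128 MiB.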

Unless there is an undocumented additional NVIDIA-only constraint, for all known NVIDIA OpenCL-compatible cards that I surveyed [1], allocatable memory appears to be artificially constrained to only 25% of physical memory. It is sometimes less, but it appears to never be more.

NVIDIA appears to be alone in this. Intel, AMD, and pocl implementations sometimes exceed the 25% mark. Some AMD implementations appear to work from a higher limit, sometimes 50% or 70% of memory. Some pocl implementations appear to make the maximum physical amount fully available for OpenCL.

Other messages in this forum have asked this question, and people point to the spec as the reason, but this is based on a flawed interpretation of the spec.

[1] See hashcat-opencl-memory-allocatable.txt · GitHub for a survey of discovered values for many hardware platforms.

SPWorley · February 7, 2017, 6:49pm

As you probably saw, a 6-year-old thread is relevant. The single line in the document causes confusion in Intel’s OpenCL forum as well.

Despite the name, CL_DEVICE_MAX_MEM_ALLOC_SIZE is not the limit of the total amount of memory possible to allocate (over multiple allocations), but instead I read it as reporting the declared upper size limit for any single specific allocation. Probably this value was relevant back when 32 bit memory addresses were common and the GPU memory management was simplistic and limited, and users needed to know “I will never get a single 1GB memory chunk allocated at once, so I won’t bother trying.”

In theory today’s GPU hardware could even let you allocate more memory (even in a single allocation) than your device holds and just page it in silently as needed. That’s not what happens (certainly not in NVidia’s 1.2 OpenCL, where your gist link shows that NVidia’s OpenCL allocator clearly limits max alloc size) but it wouldn’t be in disagreement with the clGetDeviceInfo query.

Robert_Crovella · February 7, 2017, 7:23pm

Probably the right thing to do is to file an enhancement request - go to developer.nvidia.com and file a bug with RFE in the synopsis.

State your case there.

I’m not an OpenCL lawyer, but NVIDIA certainly has representatives with Khronos. Your interpretation is plausible, but I don’t know for sure that it is correct.

It seems like quite a few Intel and some AMD platforms also reflect the 1/4 limit, rightly or wrongly, so even though you use the words “only on NVIDIA” (cleverly conditioned by “never”) it would seem that a great many platforms reflect this concept. The issue doesn’t seem to be unique to NVIDIA, although it may be the case that there are no counterexamples with NVIDIA platforms, whereas you have found some counterexamples on Intel and AMD.

TychoTithonus · February 7, 2017, 7:59pm

SPWorley - thanks for your reply. I am not an OpenCL developer, but I volunteer with projects that are, and did this footwork on their behalf. I will pass your information on to them, and I will post any relevant information that I get back from them.

txbob - thanks for your reply as well.

I did this first, and they sent me here.

That being said, I did not put “RFE” in the synopsis. Do you think it’s worth trying again?

By referring to legal issues, it sounds like you may be ascribing an agenda to me that I do not have. I have no undisclosed skin in this game. I have no understanding of legality relative to following or not following the OpenCL standard. I have no understanding of the benefits of CUDA vs OpenCL. All I know is that projects that I am interested in are A) using OpenCL and B) currently constrained in using more memory - either because of a hard cap that they cannot control, or because they need more information on the right way to allocate memory. I want to understand the technical root cause of this apparent cap.

You are right, of course, that there is an overall tendency across platforms towards 25%, especially in the GPU-only area. If there is a technical reason for this, I would like to understand better what that is.

I’m not alone in being confused; this issue appears to have cropped up in various forums before, with no solid answer that I could find. Once I understand this issue, I want to capture the results publicly so that all NVIDIA OpenCL developers understand how to use the maximum amount of memory available on the card.

I used the word “never” to be precise, not to be clever. A single NVIDIA counterexample would refute my theory that the spec might be being misinterpreted, but I am in no way trying to hint at potential motivations. All I know is that, so far, with a solid hour’s worth of searching, I can find no counterexamples for NVIDIA GPUs, and was able to easily locate exceptions for other platforms, especially AMD.

All that being said, if I’m reading SPWorley’s answer correctly, then it’s just a matter of making additional allocations. I do not know how difficult it is to make multiple allocations, or whether or not there would be a performance impact.

If it’s not hard and there’s no performance impact, then this is a non-issue. But if code gets more complex or performance is impacted by having to work within that cap, and if that cap is either a mistake or a legacy constraint, then it could be raised, and all OpenCL developers on NVIDIA would benefit.
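As a rough illustration of what “making additional allocations” involves, the sketch below plans how a large data set could be split into buffers that each stay within the per-allocation limit. It is only the sizing arithmetic, written in plain C with no OpenCL calls; the helper names `chunk_count` and `chunk_size` are hypothetical, and each planned chunk would still need its own buffer creation call and error check.

```c
#include <stdint.h>

/* Hypothetical helper: number of buffers needed to hold total_bytes when
 * no single buffer may exceed max_alloc bytes. Returns 0 if max_alloc is 0. */
static uint64_t chunk_count(uint64_t total_bytes, uint64_t max_alloc)
{
    if (max_alloc == 0) {
        return 0;
    }
    /* Ceiling division, written so that it cannot overflow. */
    return total_bytes / max_alloc + (total_bytes % max_alloc != 0);
}

/* Hypothetical helper: size in bytes of chunk i (0-based). Every chunk is
 * max_alloc bytes except possibly the last one; out-of-range i yields 0. */
static uint64_t chunk_size(uint64_t total_bytes, uint64_t max_alloc, uint64_t i)
{
    const uint64_t n = chunk_count(total_bytes, max_alloc);
    if (i >= n) {
        return 0;
    }
    if (i + 1 < n) {
        return max_alloc;
    }
    return total_bytes - i * max_alloc;
}
```

For example, 5 GiB of data with a 2 GiB per-allocation limit would need three buffers of 2 GiB, 2 GiB and 1 GiB.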

Robert_Crovella · February 7, 2017, 8:09pm

Could you give me the bug number? I’d like to take a look at it.

Sorry, it was unwise and inappropriate of me to use the word lawyer. I meant to say that I am not an OpenCL expert (although I have some familiarity with it) and therefore have no well-reasoned or well-formed opinion about what is or should be correct here. I meant “lawyer” in a colloquial sense similar to “language lawyer” i.e. someone who is well versed on the specification of a programming language as well as the correct interpretation of it – I am certainly not that for OpenCL.

Apologies for legal reference or interpretation. Your question is reasonable.

In my view, however, making multiple allocations is not a completely sufficient answer. There is obvious utility in having available a single allocation of desired size, if the platform can support it.

TychoTithonus · February 7, 2017, 11:02pm

txbob, thanks - bug number forwarded under separate cover.

For the general thread, I bounced this off of a developer who said that for some workloads, separate memory allocations would require branching that would slow down processing for high-performance compute. So if feasible, there would definitely be value in increasing CL_DEVICE_MAX_MEM_ALLOC_SIZE for HPC.
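A small sketch of where that overhead could come from, assuming the data is split into equally sized chunks: every access to a logical position first has to be translated into a chunk number and an offset inside that chunk, and the code then has to select the matching buffer. The `chunk_pos` type and `locate` function are hypothetical names used only for illustration, and this is host-side C rather than kernel code.

```c
#include <stdint.h>

/* Hypothetical type: where a logical byte lives after splitting. */
typedef struct {
    uint64_t chunk;   /* index of the buffer that holds the byte */
    uint64_t offset;  /* byte offset inside that buffer */
} chunk_pos;

/* Hypothetical helper: translate an offset into the logical (unsplit) data
 * into a chunk index and a local offset. chunk_bytes must be nonzero.
 * With a single large allocation this translation would not be needed. */
static chunk_pos locate(uint64_t flat_offset, uint64_t chunk_bytes)
{
    chunk_pos p;
    p.chunk = flat_offset / chunk_bytes;
    p.offset = flat_offset % chunk_bytes;
    return p;
}
```

With 2 GiB chunks, logical offset 3 GiB lands in chunk 1 at local offset 1 GiB. Whether the extra division, remainder and buffer selection matter in practice depends on the workload.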

TychoTithonus · February 8, 2017, 12:12am

For future subject-matter experts who find this thread:

txbob’s observation that there is a 25% tendency across platforms is correct. Any additional information that leads to a solid explanation for this - and how/why some platform combinations decide to only sometimes make it larger - would be very informative.

TychoTithonus · February 8, 2017, 12:15am

And in full disclosure, what I’d filed previously was a support request, not a bug report. txbob guided me through the process for filing a true bug report - much appreciated!

PolarNick · February 8, 2017, 4:20pm

+1

I will be very glad to see CL_DEVICE_MAX_MEM_ALLOC_SIZE increased. Some facts:

CUDA has no restrictions on single allocation size. Are there any major differences in memory handling between CUDA and OpenCL? If so, I am very interested to know about them.
In practice, the Nvidia driver seems to successfully allocate single memory chunks (for OpenCL) far beyond the value reported by CL_DEVICE_MAX_MEM_ALLOC_SIZE, and everything works well (but of course such allocations are inappropriate for production code).

reirab · October 26, 2017, 10:01pm

Hi, I was wondering what came of this? Is there a link to the bug report?

Robert_Crovella · October 26, 2017, 10:33pm

The NVIDIA bug number is 1872623, but even if I gave you a link you wouldn’t be able to look at it. Bug reports are private, in this case only accessible by the person who filed it (apart from NVIDIA personnel).

The response from the development team at NVIDIA is as follows:

The value is intentional. The limit is set based on our understanding of the specification, including participation in the OpenCL forum standards board and our experience with compliance testing results. A detailed justification beyond that probably won’t be forthcoming. As already indicated, this behavior is not unique to NVIDIA platforms.
Developers can try to allocate more memory than CL_DEVICE_MAX_MEM_ALLOC_SIZE, but a successful allocation is not guaranteed (this is the same for any allocation call). Developers should check the error returned by clCreateBuffer and use the allocation only if the call returns CL_SUCCESS.

It’s already been indicated in this thread that developers can successfully allocate more than this “limit”, in practice, in some situations.

reirab · October 27, 2017, 7:16pm

Ok, thanks for sharing the response from the dev team.