Tesla K40 1TB RAM problem

Hello, I have a custom server that I built with 3 Tesla K40s for some high-intensity simulation processing. The system blue-screens any time I have the full 1TB of RAM installed, and I have determined this to be the fault of the K40 driver.

The problem is that the system is a quad-channel system. If I remove a stick of RAM and bring the system down to 960GB, memory falls back to a single-channel configuration, which hinders performance noticeably (about 20%). I can bring the system down to 512GB and everything runs optimally, but then I can’t run the larger simulations that need the full 1TB of memory space, which is what I built the system for. 20% doesn’t seem like much, but in this instance it is a matter of days, and some of these simulations are time-critical.

I was wondering if there is an environment variable (or something like it) for the K40 that I could set manually to give it access to only 960GB of the system RAM. That way I could have the full 1TB plugged into the system, the motherboard/processors would still operate in the quad-channel configuration, and the system wouldn’t blue-screen due to the NVIDIA driver limitation.

Any assistance would be appreciated, thanks in advance!

Have you thought about trying Linux instead?

[s]If you have not done so yet, I would highly recommend reporting this as a bug to NVIDIA. Even companies the size of NVIDIA do not routinely have Windows systems with 1 TB of system memory sitting around in their QA departments (or anywhere for that matter).

I am curious how you determined conclusively that the GPU driver is at fault. It seems possible for a TCC driver to cause a system panic, but I would have considered an OS component a more likely source of that, or instability of the hardware.[/s]

It’s not (primarily) a Windows or Linux issue. It’s a limitation of the K40 and all pre-Pascal GPUs, which have a 40-bit TLB map (and it is, to some extent, system BIOS dependent).

You’ll be limited to 1TB of memory (or 512GB for Fermi GPUs, not in view here), and that is only achievable in special situations. In most typical situations, given the way most server BIOSes work, you are limited to less than 1TB. Here is some documentation of this issue:

[url]https://us.download.nvidia.com/XFree86/Linux-x86/331.20/README/addressingcapabilities.html[/url]

AFAIK there are no environment variables that can work around this or modify it. It’s a function of where the system assigns the resources that the K40 needs: above or below the 1TB barrier. You may be able to find some system BIOS entries that affect mapping, and/or OS config parameters, and it may be worth a try, but I’ve not personally worked through the process.
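
To make the "above or below the 1TB barrier" point concrete, here is a minimal Python sketch. It assumes the 40-bit limit described above (2^40 bytes). The memory layout is entirely hypothetical: the 2 GiB hole below 4 GiB reserved for device mappings, and the idea that displaced RAM is remapped above it, are illustrative assumptions and are not read from any real BIOS or from the poster's system.

```python
# Illustrative only: the layout values used below are hypothetical.
GIB = 1024 ** 3
ADDRESS_BITS = 40            # pre-Pascal limit discussed in this thread
LIMIT = 1 << ADDRESS_BITS    # 2**40 bytes


def reachable(base: int, size: int, limit: int = LIMIT) -> bool:
    """Return True if the whole range [base, base + size) lies below the limit."""
    return base + size <= limit


def top_of_ram(installed: int, hole_start: int, hole_size: int) -> int:
    """Highest physical address + 1, assuming RAM displaced by a hole is remapped above it."""
    if installed <= hole_start:
        return installed
    return installed + hole_size


# Hypothetical hole: 2 GiB reserved for devices, starting at the 2 GiB mark.
for installed_gib in (512, 960, 1024):
    top = top_of_ram(installed_gib * GIB, hole_start=2 * GIB, hole_size=2 * GIB)
    print(installed_gib, "GiB installed ->", "reachable" if reachable(0, top) else "exceeds limit")
```

Under these assumptions, 512 GiB and 960 GiB stay below the limit, while a full 1024 GiB pushes the top of RAM past it. That would be consistent with the behavior reported in the first post, but it is a sketch of one possible layout, not a diagnosis.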

Pascal (and future) GPUs should not have this limitation. They have something like 49 bits of TLB map range:

[url]https://devblogs.nvidia.com/parallelforall/inside-pascal/[/url]

"GP100 extends GPU addressing capabilities to enable 49-bit (512 TB) virtual memory addressing (note that GP100 also supports 47-bit (128 TB) physical memory addressing). "

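The sizes quoted in this thread follow directly from the address widths: n bits address 2^n bytes. A quick Python check, where the helper name `addressable` is just an illustrative choice:

```python
def addressable(bits: int) -> str:
    """Format 2**bits bytes using binary units (1 TiB = 1024**4 bytes)."""
    size = 1 << bits
    units = ("B", "KiB", "MiB", "GiB", "TiB", "PiB")
    for unit in units:
        # Stop at the first unit where the value drops below 1024, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            return f"{size} {unit}"
        size //= 1024
    return f"{size} {units[-1]}"  # not reached; kept so every path returns a string


for bits in (40, 47, 49):
    print(f"{bits}-bit: {addressable(bits)}")
# 40-bit: 1 TiB
# 47-bit: 128 TiB
# 49-bit: 512 TiB
```

So the 40-bit map on the K40 tops out at exactly the 1TB discussed above, while the GP100 figures in the quote leave a lot of headroom.
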
Thanks for the input, everyone. Linux isn’t an option, but it’s not a Windows issue anyway. The K40 driver limitation is known to NVIDIA.

Thx TxBob, yes, the P100s are the end solution to my issue, but I have a great number of sims that need to run between now and when I get funding for a new $60,000 server, so I was hoping there was a stop-gap measure I could put in place in the meantime. The open-source community is usually better at these things than the company itself, so here I am. lol.

Are there steps you can take in the configuration of your simulation software to reduce its memory footprint? Most sophisticated simulation environments I have seen come with myriad configuration switches covering all kinds of potential tradeoffs, memory footprint often being one of them (not many people have a 1TB system at their disposal).