[980 Ti, Windows 10, CUDA 7.5] Out of memory after allocating 4.5 out of 6 GB

I ran into a strange issue after upgrading my graphics card from a TITAN to a GTX 980 Ti and my OS from Windows 7 to Windows 10.

The same application that was able to utilize 6 GB of memory on the TITAN cannot allocate more than 4.5 GB on the GTX 980 Ti.
I am compiling with the 64-bit switch on.

deviceQuery shows 6 GB of memory as well:

deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 980 Ti"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 6144 MBytes (6442450944 bytes)
  (22) Multiprocessors, (128) CUDA Cores/MP:     2816 CUDA Cores
  GPU Max Clock rate:                            1291 MHz (1.29 GHz)
  Memory Clock rate:                             3600 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 980 Ti
Result = PASS

I even have a unit test that reproduces it:

TEST(GpuAllocator_AllocateDeallocate) {
  static const frames_size_type SIZE = 1024 * 1024 * 512;  // 512 MB per block
  static const frames_size_type ALLOCATION_COUNT = 11;     // 11 * 512 MB = 5.5 GB total

  void* pointers[ALLOCATION_COUNT];
  for(frames_size_type i = 0; i < ALLOCATION_COUNT; ++i) {
    pointers[i] = FramesLib::Memory::GpuAllocator::Allocate(SIZE);
  }

  for(frames_size_type i = 0; i < ALLOCATION_COUNT; ++i) {
    FramesLib::Memory::GpuAllocator::Deallocate(pointers[i]);
  }
}

This fails while trying to do the allocation for i = 10.
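
In case someone wants to try the same thing without FramesLib, a plain-CUDA version of the test would look roughly like this (Allocate is replaced here by a bare cudaMalloc, so this is a sketch rather than my exact code):

// Plain-CUDA sketch of the test above; assumes the allocation
// boils down to a cudaMalloc per block.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t SIZE = 512ull * 1024 * 1024;  // 512 MB per block
  const int ALLOCATION_COUNT = 11;           // 11 * 512 MB = 5.5 GB total
  void* pointers[ALLOCATION_COUNT] = {};

  for (int i = 0; i < ALLOCATION_COUNT; ++i) {
    cudaError_t err = cudaMalloc(&pointers[i], SIZE);
    if (err != cudaSuccess) {
      std::printf("cudaMalloc failed at i = %d: %s\n", i, cudaGetErrorString(err));
      break;
    }
  }

  for (int i = 0; i < ALLOCATION_COUNT; ++i) {
    if (pointers[i] != nullptr) {
      cudaFree(pointers[i]);
    }
  }
  return 0;
}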

Has anyone seen this before? What could be the issue?

You may want to try nvidia-smi to see what processes are using GPU memory besides your CUDA program. I do not use Windows 10, but I have seen anecdotal reports that it has higher GPU memory usage than Windows 7, which may be connected to the fact that Windows 10 uses a different driver model than Windows 7 (WDDM 2.0 instead of WDDM 1.x).

BTW, with an allocation granularity of 0.5 GB you should see at most 5.5 GB (out of 6 GB) being allocated even on Windows 7, as a CUDA context requires roughly 100 MB of GPU memory by itself, and there are likely other uses of GPU memory through the GUI.
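
In round numbers:

6 GB total − ~0.1 GB CUDA context − GUI allocations ≈ 5.8 GB usable
12 × 0.5 GB = 6.0 GB does not fit, so 11 × 0.5 GB = 5.5 GB is the ceiling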

Hi,

Thanks for the response. I agree that I should only get up to 5.5 GB with the way the unit test is structured. My application needs about 5.3 GB, so what the unit test does is fine, albeit a bit coarse.

I followed up on the nvidia-smi suggestion. Bottom line: there is free memory, but I cannot allocate it.

This is what I get when running nvidia-smi on my machine:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Sun Dec 06 13:04:53 2015
+------------------------------------------------------+
| NVIDIA-SMI 359.06     Driver Version: 359.06         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti WDDM  | 0000:01:00.0      On |                  N/A |
| 20%   33C    P8    19W / 260W |    182MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       484  C+G   Insufficient Permissions                     N/A      |
|    0       564  C+G   Insufficient Permissions                     N/A      |
|    0       968  C+G   Insufficient Permissions                     N/A      |
|    0      4252  C+G   C:\Windows\explorer.exe                      N/A      |
|    0      4708  C+G   ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0      4904  C+G   ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
+-----------------------------------------------------------------------------+

And this is what I get when I run my app and it breaks with the out-of-memory error from CUDA. You can see there is still about 1 GB left.

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Sun Dec 06 13:07:05 2015
+------------------------------------------------------+
| NVIDIA-SMI 359.06     Driver Version: 359.06         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti WDDM  | 0000:01:00.0      On |                  N/A |
| 20%   43C    P2    95W / 260W |   4945MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       484  C+G   Insufficient Permissions                     N/A      |
|    0       564  C+G   Insufficient Permissions                     N/A      |
|    0       968  C+G   Insufficient Permissions                     N/A      |
|    0      4252  C+G   C:\Windows\explorer.exe                      N/A      |
|    0      4708  C+G   ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0      4904  C+G   ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
|    0      6828    C   ...CUDA\Builds\x64\Debug\FramesLib.Tests.exe N/A      |
|    0      7892  C+G   ...Visual Studio 12.0\Common7\IDE\devenv.exe N/A      |
+-----------------------------------------------------------------------------+

Additionally, I can use my integrated graphics card for the display, bringing the OS memory consumption down to 95 MiB:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Sun Dec 06 12:55:54 2015
+------------------------------------------------------+
| NVIDIA-SMI 359.06     Driver Version: 359.06         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti WDDM  | 0000:01:00.0     Off |                  N/A |
| 20%   36C    P8    15W / 260W |     95MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       484  C+G   Insufficient Permissions                     N/A      |
+-----------------------------------------------------------------------------+

Could be a memory fragmentation problem. What happens when you test with blocks smaller than 512 MB?

I tried different allocation sizes. The unit test is now:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static const frames_size_type GB = 1024 * 1024 * 1024;
static const frames_size_type MB = 1024 * 1024;
static const frames_size_type KB = 1024;

// Formats a byte count for printing, e.g. "5GB 512MB 0B".
std::string GetHumanSize(size_t size)
{
  auto gb = size / GB;
  size = size % GB;

  auto mb = size / MB;
  size = size % MB;

  auto kb = size / KB;
  size = size % KB;

  std::stringstream ss;
  if (gb > 0)
  {
    ss << gb << "GB ";
  }
  
  if (mb > 0)
  {
    ss << mb << "MB ";
  }
  
  if (kb > 0)
  {
    ss << kb << "KB ";
  }

  ss << size << "B";

  return ss.str();
}

TEST(GpuAllocator_AllocateDeallocate) {
  static const frames_size_type TO_ALLOCATE_TOTAL = 5 * GB + 512 * MB;
  static const frames_size_type TO_ALLOCATE_UNIT = MB;

  std::cout << "Allocation target " << GetHumanSize(TO_ALLOCATE_TOTAL) << std::endl;
  std::cout << "Allocation unit " << GetHumanSize(TO_ALLOCATE_UNIT) << std::endl;

  std::vector<void*> pointers;

  frames_size_type total = 0;
  while (total < TO_ALLOCATE_TOTAL) {
    auto ptr = FramesLib::Memory::GpuAllocator::Allocate(TO_ALLOCATE_UNIT);
    pointers.push_back(ptr);

    total += TO_ALLOCATE_UNIT;
    std::cout << "\r" << "Allocated " << GetHumanSize(total);
  }
  std::cout << std::endl;

  for (auto ptr : pointers) {
    FramesLib::Memory::GpuAllocator::Deallocate(ptr);
  }
}

Results for 512MB allocations:

Allocation target 5GB 512MB 0B
Allocation unit 512MB 0B
Allocated 4GB 512MB 0B

Results for 1MB allocations:

Allocation target 5GB 512MB 0B
Allocation unit 1MB 0B
Allocated 4GB 944MB 0B

Results for 256KB allocations:

Allocation target 5GB 512MB 0B
Allocation unit 256KB 0B
Allocated 4GB 944MB 0B

Results for 4KB allocations:

Allocation target 5GB 512MB 0B
Allocation unit 4KB 0B
Allocated 4GB 944MB 0B

Trying 1 B allocations takes a long time and crashes the driver :/

Overall, it does not seem like a fragmentation issue; it looks more like some limit that I am consistently hitting.

So it looks like you can allocate about 4.9 GB out of 6 GB under Windows 10, whereas under Windows 7 you were able to allocate about 5.7 GB? Meaning you are “missing” about 800 MB under Windows 10?

nvidia-smi clearly shows that there are other processes that allocated GPU memory. Strangely, it does not show the amount of memory taken, just N/A. So a working hypothesis would be that the “missing” GPU memory that your CUDA application cannot allocate is already occupied by these processes. You might want to go through the relevant PIDs and check whether any of the processes listed by nvidia-smi can safely be disabled.
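
One more data point worth collecting: ask the CUDA runtime itself how much memory it considers free, and what the largest single allocation is that still succeeds. A rough sketch using the standard runtime API (untested):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  std::printf("runtime reports %llu MiB free of %llu MiB total\n",
              (unsigned long long)(free_bytes / (1024 * 1024)),
              (unsigned long long)(total_bytes / (1024 * 1024)));

  // Binary-search the largest single cudaMalloc that succeeds,
  // at 1 MiB resolution.
  size_t lo = 0, hi = free_bytes;
  while (lo + 1024 * 1024 < hi) {
    size_t mid = lo + (hi - lo) / 2;
    void* p = nullptr;
    if (cudaMalloc(&p, mid) == cudaSuccess) {
      cudaFree(p);
      lo = mid;
    } else {
      cudaGetLastError();  // clear the error state before retrying
      hi = mid;
    }
  }
  std::printf("largest single allocation: %llu MiB\n",
              (unsigned long long)(lo / (1024 * 1024)));
  return 0;
}

If the runtime reports much less free memory than nvidia-smi implies, the “missing” memory is being reserved below the level that nvidia-smi can see.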

On various operating systems I have found in the past that fancy 3D-enabled desktops can take up significant amounts of GPU memory. If there is an option for that in Windows 10, try turning off the 3D desktop functionality.

Yes, it seems I am missing about 800 MB.

While nvidia-smi does not list the allocated memory per process, it does list the total allocated by all applications, unless I misunderstood the Memory-Usage part of the header it prints.

So it seems that when I am using my 980 Ti as the primary display card, all the listed processes allocate memory from it, totaling 182 MiB. When I switch to the built-in Intel card as the primary display, I only get one process allocating 95 MiB.

Lastly, when my app breaks because of insufficient memory (when allocating 512 MB at a time), I see 4945 MiB in use, which seems right.

Any more ideas?

In practical terms, a reasonable conclusion could be to avoid switching to Windows 10 unless absolutely necessary. Personally, I am steadfastly ignoring GWX’s constant nagging, because I am happy with Windows 7 Professional and see no reason to change my setup. “Don’t fix what ain’t broke”.