Huge RAM usage with Keras, TensorFlow, CUDA 9.1, and cuDNN 7 under Linux

I’m playing around with small neural networks on my GTX 1070 card, and I have experienced very large RAM usage (not GPU memory usage) when using CUDA through Keras (and PyTorch). Consider the following Python Keras program:

import keras
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

input_shape=(28,28,1)
num_classes = 10
model = Sequential()
model.add(Conv2D(10, kernel_size=(5, 5), input_shape=input_shape, activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(20, kernel_size=(5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

batch_size=64
epochs = 10
img_rows, img_cols = 28,28

(x_train, y_train), (x_test, y_test) = mnist.load_data()
#x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
#x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
#input_shape = (1, img_rows, img_cols)
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

I run it while monitoring the host virtual memory usage. First, running it without CUDA:

CUDA_VISIBLE_DEVICES="" python3 mnist.py

Virtual memory usage peaks at about 3.1 GB, well below the 16 GB of physical memory on my box. But when running it with CUDA enabled:

python3 mnist.py

virtual memory usage hits 23.9 GB.
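For reference, this is roughly how I read those numbers (a small helper script, assuming a Linux /proc filesystem; VmPeak is the peak virtual address space and VmHWM the peak resident memory of the process):

# Print the peak virtual and peak resident memory of a running process.
# Assumes Linux, which exposes these fields in /proc/<pid>/status.
import sys

def peak_memory(pid):
    """Return the VmPeak and VmHWM values for the given process id."""
    fields = {}
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            key, _, value = line.partition(':')
            fields[key] = value.strip()
    return fields.get('VmPeak'), fields.get('VmHWM')

if __name__ == '__main__':
    vm_peak, vm_hwm = peak_memory(int(sys.argv[1]))
    print('VmPeak (peak virtual size):  ', vm_peak)
    print('VmHWM  (peak resident size): ', vm_hwm)

Saved as, say, peakmem.py (a name I just made up), it is called with the PID of the training process as its only argument.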

I saw similar behavior when running under PyTorch, so it is unlikely to be a Keras issue.

When I tried to run ResNet50, virtual memory usage went above 50 GB and then my box froze. Is this normal? Is there any way to reduce the amount of RAM used? Should I just accept that this is what is needed and buy another 16 GB of RAM?

Here is the output of nvidia-smi during training:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0  On |                  N/A |
|  0%   39C    P2    70W / 180W |   7892MiB /  8119MiB |     50%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3640      G   /usr/libexec/Xorg                             41MiB |
|    0     10928      C   python3                                     7839MiB |
+-----------------------------------------------------------------------------+

Any help and/or explanations would be very much appreciated!

My system is running Fedora 27:

Linux groovy 4.15.6-300.fc27.x86_64 #1 SMP Mon Feb 26 18:43:03 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Virtual memory usage reflects the reservation of address space; it says nothing about actual physical memory usage. As far as I know, CUDA maps all of physical host memory plus all of GPU memory into a single large address map to provide a unified address space. In your case that sum is 16 + 8 = 24 GB, and you should expect to see virtual memory usage of roughly that size whenever you use CUDA.
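You can observe this directly: the virtual size jumps by roughly host RAM plus GPU memory the moment a CUDA context is created, while the resident set stays small. A minimal sketch of such a check, assuming PyTorch is installed (TensorFlow shows the same effect):

# Compare virtual vs. resident memory before and after CUDA context creation.
# Linux only: reads /proc/self/status.
import torch

def vm_sizes():
    """Return (VmSize, VmRSS) of the current process, in kB."""
    sizes = {}
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith(('VmSize', 'VmRSS')):
                key, value = line.split(':', 1)
                sizes[key] = int(value.split()[0])
    return sizes['VmSize'], sizes['VmRSS']

print('before CUDA init: VmSize=%d kB  VmRSS=%d kB' % vm_sizes())
x = torch.zeros(1, device='cuda')   # forces creation of the CUDA context
print('after CUDA init:  VmSize=%d kB  VmRSS=%d kB' % vm_sizes())

The VmSize figure should grow by tens of gigabytes while VmRSS grows far less, which is exactly the pattern you are seeing.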

I know nothing about ResNet memory usage. Does the name ResNet50 possibly imply that it needs 50 GB of memory to run? In any event, it seems plausible that it is requesting more memory than your machine provides, leading to heavy swapping between disk and host memory and giving the impression that the system “froze”.

The general rule of thumb for an optimally configured GPU-accelerated system is that the host’s system memory should be four times the size of the total GPU memory; for a single 8 GB GTX 1070 that works out to 32 GB of host RAM.

The 50 in ResNet50 refers to the fact that it is a specific neural network design, and that design has 50 layers.

Virtual memory is not ordinarily something to worry about.

Investigation of the system freeze should follow some other diagnostic path, and should not start out assuming that a large virtual memory reservation is the issue.
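One simple check along those lines is whether the machine actually runs out of physical memory and starts swapping heavily during training. A quick sketch that polls /proc/meminfo (Linux only; assumes a kernel recent enough to report MemAvailable, which yours is):

# Poll free physical memory and swap while the training job runs;
# if MemAvailable collapses and SwapFree keeps shrinking, the box is thrashing.
import time

def meminfo(*keys):
    """Return the requested /proc/meminfo fields, values in kB."""
    values = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, _, rest = line.partition(':')
            if key in keys:
                values[key] = int(rest.split()[0])
    return values

while True:
    info = meminfo('MemAvailable', 'SwapFree')
    print('MemAvailable: %6d MB   SwapFree: %6d MB'
          % (info['MemAvailable'] // 1024, info['SwapFree'] // 1024))
    time.sleep(5)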

Thanks for the help and the explanations! After adding 16 GB of swap on my SSD and reducing the power limit of my GTX 1070 to 90 W with sudo nvidia-smi -pl 90, the PyTorch ResNet50 training finally runs without crashing the computer. The system is heavily throttled by all the swapping, but at least it is not crashing. So now I can finally differentiate between cats and dogs. :-)