Out of memory message trying to run cnn network benchmark

Hi,

I am running the TensorFlow CNN benchmark using the command:

root@7e8e9113fa85:/workspace/nvidia-examples/cnn# python3 nvcnn.py --model=vgg19 --batch_size=256 --num_gpus=3

I have increased the shared-memory (SHMEM) allocation limit, but the system still throws the message below:
“2018-04-07 18:46:39.083434: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.45GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.”

Also, I need to know how to monitor GPU performance while running this benchmark. I have tried nvidia-smi, but I get the message “Invalid combination of input arguments”.

Could you please help with both issues?

Thanks,

VS

Could you please answer the following questions to help us debug this:

  1. Does the training run fine with batch size 128?
  2. Can you please share your docker run command?
  3. Can you please share the nvidia-smi command you are using, complete with the arguments?
  4. Also, just curious, but why did you decide to use 3 GPUs?

Here is the requested info:

  1. Does the training run fine with batch size 128?
    For 128 batch size / 1 GPU, the training runs fine.

For 128 batch size / 2 or 3 GPUs, it throws the message:
Initializing variables
Unexpected end of /proc/mounts line overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/QMKMRAUCQ5JIVHP5GZQ3O7PCNS:/var/lib/docker/overlay2/l/Q5QRVQX2RIASGB3ON73STZWNSK:/var/lib/docker/overlay2/l/IAW4NQQBQZR3TMI6CCN5ZPILUJ:/var/lib/docker/overlay2/l/MEJKNJMDXSN2R25Q5W52QE45NW:/var/lib/docker/overlay2/l/V6MYOSGX4HD65LUB6S7PGTNGSR:/var/lib/docker/overlay2/l/UYHWSUNQ3D4UPO7UNXJHSRJXOA:/var/lib/docker/overlay2/l/FSMRQZIRD3QBN5T3TSPUXKAXQG:/var/lib/docker/overlay2/l/OW7H4HIAG66BSVO7LLGQ4MHQAM:/var/lib/docker/overlay2/l/P7R6YUXGD3O2A' Unexpected end of /proc/mounts line DTO7NF5OKQ76B:/var/lib/docker/overlay2/l/EDPAZPDQQGCI4A4NNDTNHBAQ2G:/var/lib/docker/overlay2/l/JNMAGLPUR2TK6QMNOZZXPH6C7B:/var/lib/docker/overlay2/l/N2TY3YVRIWD4EY4Z2PY3FGRFVQ:/var/lib/docker/overlay2/l/GJHVR3Q2VUZ7AAYZSMHTLR34HV:/var/lib/docker/overlay2/l/SMJOSBRISKTVURIT6SISVSDRXH:/var/lib/docker/overlay2/l/ZFJJ4777GN4XN7W6TSOOMOQFOZ:/var/lib/docker/overlay2/l/TCRRKFLD623SVIYVRYAHER7QKQ:/var/lib/docker/overlay2/l/DNTV366AGJ3C7OMR7WUE2WKQIN:/var/lib/docker/overlay2/l/V4MDBKGPSABUTYIHEFGSGQBINO:/var/lib/do’
Unexpected end of /proc/mounts line cker/overlay2/l/YN24UX4ROKGCWZS5QWPC4XGOVF:/var/lib/docker/overlay2/l/4GT7YCEOSLRQTKBNDUF4R6XCBE:/var/lib/docker/overlay2/l/6YUC5Z5NFMK4MMMLJ6BTLWEHII:/var/lib/docker/overlay2/l/2F4Y53MVPEAZBLUOU42YYO23VW:/var/lib/docker/overlay2/l/2VU36ALVHULZCQYR3NPTTPZG46:/var/lib/docker/overlay2/l/GZCTL2S2PBAE5WNHMBNREGA2TL:/var/lib/docker/overlay2/l/L2FNH3FRBSZD3KBVKZ7BV546IJ:/var/lib/docker/overlay2/l/2PFOSJ3DCMADGLMDNKPVMS2RBS:/var/lib/docker/overlay2/l/C2YR3XE2I2HF4GZG3G6B4V3VDK:/var/lib/docker/overlay2/l/JFUC65Q3A' Unexpected end of /proc/mounts line XVYU73NL7OJA3LLYV:/var/lib/docker/overlay2/l/KNHA4RWXLTKWZORPD724J5EECA:/var/lib/docker/overlay2/l/NS7TDD3SL4MCUPTWXPEHPCWNPI:/var/lib/docker/overlay2/l/XON36A36EXXGDHZAINQY7ULQ7P:/var/lib/docker/overlay2/l/PHB6AFJAFONRSRAEHPAUMDRJIJ:/var/lib/docker/overlay2/l/B4AELY72WPCNDSFDJUG5FN3ZFY:/var/lib/docker/overlay2/l/NJMK23DUY77D563JSYADNDZSNC:/var/lib/docker/overlay2/l/3QLD57BN2XZXAUP3DW6Q4JNP7D:/var/lib/docker/overlay2/l/HQYUONMSPBWLBFDE47OZ3UGGJ5,upperdir=/var/lib/docker/overlay2/176879c32e21d4c3d9b46e3e196’

For 256 batch size / 1, 2, or 3 GPUs, it throws the message:
2018-04-07 21:14:15.848048: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.45GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

For 512 batch size / 1 GPU, it throws the message:
No results, did not get past burn-in phase (20 steps)
Out of memory error detected, exiting

  2. Can you please share your docker run command?
    Here is an example:
    root@c06440ad6425:/workspace/nvidia-examples/cnn# python3 nvcnn.py --model=vgg19 --batch_size=256 --num_gpus=1

  3. Can you please share the nvidia-smi command you are using, complete with the arguments?
    nvidia-smi python3 nvcnn.py --model=vgg19 --batch_size=256 --num_gpus=1

  4. Also, just curious, but why did you decide to use 3 GPUs?
    Sure, I am benchmarking servers that have more than one GPU. Please let me know if there are other scripts more suitable for benchmarking these kinds of servers.

Warnings that start with “Unexpected end of /proc/mounts line overlay / overlay …” are harmless and can be ignored. We have a fix coming that will prevent those messages from being printed.

What type of GPUs are you using? P3 Instance on AWS or something else?

I did not see an ‘nvidia-docker run …’ command in your answers above. Are you using NGC containers? This forum is for support of NGC users running NGC containers on supported platforms.
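
For reference, an NGC TensorFlow container is usually launched with something along these lines (a sketch only; the shared-memory and ulimit values shown are the commonly recommended ones, so adjust them to your setup):

    nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:18.03-py3

On the monitoring question: nvidia-smi does not take another command as an argument, which is why “nvidia-smi python3 nvcnn.py …” reports an invalid combination of arguments. It is normally run in a second terminal while the benchmark is going, for example (illustrative options, not the only ones):

    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,power.draw --format=csv -l 1

or

    nvidia-smi dmon -s pucm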

What about these messages that mention a possible performance gain:
“W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.45GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.”

Or this one, which did not give any result at all for 512 batch size / 1 GPU:
“No results, did not get past burn-in phase (20 steps)
Out of memory error detected, exiting”

I am using private servers with Tesla P40 GPUs, and I am using NGC containers (the nvidia-docker TensorFlow image). Here is the run command: nvidia-docker run -it nvcr.io/nvidia/tensorflow:18.03-py3

The script used was nvcnn.py v1.4. Please let me know if there are other scripts/tools more suitable for benchmarking these kinds of servers. These are the metrics I need to benchmark: images/sec, images/watt, latency, accuracy.

Thanks,

VS