NVCaffe training crash on Titan V (memory related?)
I am trying to train a convolutional neural network on NVCaffe and I am getting what seems to be a memory-related crash.

I am running the NVCaffe 17.12 Docker container (which I pulled from NGC) on Ubuntu 16.04. I am launching the container with the command:

sudo nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -ti nvcr.io/nvidia/caffe:17.12

The Docker version is 'Docker version 17.09.1-ce, build 19e2cf6'. In terms of hardware, I am training on a Titan V GPU with 12 GB of video RAM and NVIDIA driver 387.34. The training prototxt specifies the FLOAT32 type for both the forward and backward passes.

I followed the instructions at:

http://docs.nvidia.com/ngc/ngc-titan-setup-guide/index.html

The only difference I observed is that nvidia-smi reports the device name as 'Graphics Device' rather than 'Titan V':

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 00000000:01:00.0  On |                  N/A |
| 31%   45C    P2    37W / 250W |    689MiB / 12055MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 00000000:02:00.0 Off |                  N/A |
| 29%   43C    P8    26W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Please note that I have a 2nd Titan V in the same system, but I am only training on a single GPU for now. This is the command I use for training:

caffe train --solver=/data/TrainingParameters/solver.prototxt --gpu=0

The training crashes with the following message:

I1221 17:57:15.745223 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's5_conv1_7_joint_vec' with space 6.25G 160/1 0 0 0 (avail 3.92G, req 0.08G) t: 0 0.04 0.03
I1221 17:57:15.789032 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's6_conv1_1_joint_vec' with space 6.25G 191/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.12 0.16
I1221 17:57:15.831817 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's6_conv1_2_joint_vec' with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.13 0.12
I1221 17:57:15.857851 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's6_conv1_3_joint_vec' with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.13 0.16
I1221 17:57:15.890970 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's6_conv1_4_joint_vec' with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.12 0.16
I1221 17:57:15.925066 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's6_conv1_5_joint_vec' with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.14 0.13
I1221 17:57:15.932271 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's6_conv1_6_joint_vec' with space 6.25G 160/1 0 0 0 (avail 3.92G, req 0.08G) t: 0 0.05 0.04
I1221 17:57:15.938091 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): 's6_conv1_7_joint_vec' with space 6.25G 160/1 0 0 0 (avail 3.92G, req 0.08G) t: 0 0.04 0.04
*** Aborted at 1513879035 (unix time) try "date -d @1513879035" if you are using GNU date ***
PC: @ 0x7f1eed75bb60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 4876 (TID 0x7f1eeeb000c0) from PID 0; stack trace: ***
@ 0x7f1eeb3ed4b0 (unknown)
@ 0x7f1eed75bb60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
@ 0x7f1eed771ce1 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x7f1eed560f0a caffe::Layer<>::Forward()
@ 0x7f1eed8da0fb caffe::Net::ForwardFromTo()
@ 0x7f1eed8da267 caffe::Net::Forward()
@ 0x7f1eed8dda45 caffe::Net::ForwardBackward()
@ 0x7f1eed8baf65 caffe::Solver::Step()
@ 0x7f1eed8bcbc0 caffe::Solver::Solve()
@ 0x40f85d train()
@ 0x40c198 main
@ 0x7f1eeb3d8830 __libc_start_main
@ 0x40ca09 _start
@ 0x0 (unknown)
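
In case it helps with diagnosis, here is a minimal Python sketch for pulling the memory figures out of the 'Conv Algos' lines above so they can be compared layer by layer. It only captures the fields the log itself labels (the layer name, "space", "avail" and "req"); the function name and dictionary keys are my own, and I am not assuming anything about what the unlabelled numbers in between mean.

```python
import re

# Matches the 'Conv Algos' lines printed by cudnn_conv_layer.cpp in the log above.
# Only fields that the log labels explicitly are captured; the unlabelled
# numbers between "space" and "(avail ...)" are skipped.
CONV_ALGO_RE = re.compile(
    r"Conv Algos \(F,BD,BF\): '(?P<layer>[^']+)' with space (?P<space>[\d.]+)G"
    r".*?\(avail (?P<avail>[\d.]+)G, req (?P<req>[\d.]+)G\)"
)


def parse_conv_algo_line(line):
    """Return the layer name and labelled memory figures (in 'G' as logged), or None."""
    match = CONV_ALGO_RE.search(line)
    if match is None:
        return None
    return {
        "layer": match.group("layer"),
        "space_g": float(match.group("space")),
        "avail_g": float(match.group("avail")),
        "req_g": float(match.group("req")),
    }


# Last 'Conv Algos' line logged before the crash, copied from the output above.
sample = (
    "I1221 17:57:15.938091 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): "
    "'s6_conv1_7_joint_vec' with space 6.25G 160/1 0 0 0 "
    "(avail 3.92G, req 0.08G) t: 0 0.04 0.04"
)
info = parse_conv_algo_line(sample)
print(info)
```

Running this over the full training log (one call per line, skipping the None results) gives a per-layer table of the reported figures up to the point of the crash.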

According to nvidia-smi, peak memory usage was 87% right before the crash.
I had previously trained the same neural network successfully, with exactly the same parameters and data set, using (vanilla) Caffe. The memory footprint was nowhere near that high in vanilla Caffe. What could be causing this issue on NVCaffe?

#1
Posted 12/21/2017 07:50 PM   