NVCaffe docker memory leak when using pycaffe

We are using NVCaffe to train one of our networks because it has much better support for grouped convolutions and depthwise/pointwise convolutions. Thanks to the NVIDIA team for making this possible.

Sadly, when using pycaffe inside a container based on the latest nvcr.io/nvidia/caffe:18.04-py2 image, we are experiencing a memory leak.

We are using a server with three GTX 1080 cards; the nvidia-smi output is shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   28C    P0    39W / 180W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 27%   30C    P0    39W / 180W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   33C    P0    38W / 180W |      0MiB /  8119MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

We are running our training scripts directly inside the Docker container by first opening a bash shell and then launching the training manually at the prompt. Every few iterations we can clearly see the memory usage increase.
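For reference, the check we run between iterations boils down to something like the sketch below. This is our own illustration, not part of the image: the query_gpu_memory helper is a made-up name, and it simply parses the CSV output of nvidia-smi.

# Minimal sketch (not part of job.py): poll per-GPU memory between
# iterations by parsing `nvidia-smi --query-gpu=...` CSV output.
import subprocess

def query_gpu_memory():
    """Return a list of (gpu_index, used_MiB, total_MiB) tuples."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ])
    rows = []
    for line in out.decode("utf-8").strip().splitlines():
        index, used, total = [int(v) for v in line.split(",")]
        rows.append((index, used, total))
    return rows

if __name__ == "__main__":
    for index, used, total in query_gpu_memory():
        print("GPU %d: %d MiB / %d MiB used" % (index, used, total))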

We have put together a basic training example using pycaffe to replicate the memory leak.

In order to run this example using nvidia-docker you can do the following:

docker pull camerai/nvcr.io-nvidia-caffe-18.04-py2:mem_leak
nvidia-docker run -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 camerai/nvcr.io-nvidia-caffe-18.04-py2:mem_leak /bin/bash
cd mem_leak_test
python job.py <gpu_num>

This initialises a very simple CNN and enters a training loop in which a Python data layer loads a single image and label every time net.forward() is called.
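In essence, job.py does something like the sketch below. The layer class, prototxt name and blob shapes are placeholders to illustrate the structure; the real script lives inside the docker image, and its prototxt declares the data layer with type: "Python" pointing at a class like this one.

# Rough sketch of the reproduction script; file names, shapes and the
# layer class are placeholders, the actual job.py is in the image.
import sys
import numpy as np
import caffe

class SingleImageDataLayer(caffe.Layer):
    """Python data layer producing one image and one label per forward pass."""

    def setup(self, bottom, top):
        self.idx = 0  # position in the (placeholder) image list

    def reshape(self, bottom, top):
        top[0].reshape(1, 3, 224, 224)  # one image per call (placeholder shape)
        top[1].reshape(1, 1)            # one label per call

    def forward(self, bottom, top):
        # the real script reads an image from disk; random data stands in here
        top[0].data[...] = np.random.rand(1, 3, 224, 224).astype(np.float32)
        top[1].data[...] = np.random.randint(0, 2, size=(1, 1))
        self.idx += 1

    def backward(self, top, propagate_down, bottom):
        pass  # a data layer has nothing to back-propagate

if __name__ == "__main__":
    gpu_num = int(sys.argv[1])
    caffe.set_device(gpu_num)
    caffe.set_mode_gpu()

    # train.prototxt (placeholder name) declares the layer above via type: "Python"
    net = caffe.Net("train.prototxt", caffe.TRAIN)
    for it in range(1, 100001):
        net.forward()          # each call invokes the Python data layer once
        if it % 500 == 0:
            print("iteration: %d" % it)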

Example output:

iteration: 500
| ID | GPU | MEM |
------------------
|  0 | 33% |  7% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
iteration: 1000
| ID | GPU | MEM |
------------------
|  0 | 28% |  8% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
iteration: 1500
| ID | GPU | MEM |
------------------
|  0 | 27% |  9% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |

We have tried deleting the pycaffe solver object and then reloading the network from the last snapshot (which should free the net and blob data), but unfortunately this does not release any memory; usage keeps increasing until the job finally crashes with: Check failed: error == cudaSuccess (2 vs. 0) out of memory.
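Concretely, the workaround we attempted looks roughly like the sketch below. The solver prototxt and snapshot file names are placeholders standing in for our own prototxt and snapshot prefix.

# Sketch of the attempted workaround: periodically drop the solver and
# rebuild it from the most recent snapshot, hoping the old net and blobs
# are released. File names are placeholders. GPU memory still grows.
import caffe

caffe.set_device(0)
caffe.set_mode_gpu()

solver = caffe.SGDSolver("solver.prototxt")

for restart in range(10):
    for _ in range(500):
        solver.step(1)
    solver.snapshot()                      # writes <prefix>_iter_<N>.solverstate
    last_iter = solver.iter
    del solver                             # should free the net and blob data...
    solver = caffe.SGDSolver("solver.prototxt")
    solver.restore("snapshot_iter_%d.solverstate" % last_iter)  # placeholder prefix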

Any help or advice on how to debug or fix this would be greatly appreciated.