cuDNN6 example with/without bidirectional LSTM and memory use

Hello,

I am just beginning to poke at LSTMs and cuDNN, and I would be grateful for your advice on the following problem:

I’m using cuDNN 6 on a GTX 1080, installed from the Ubuntu 16.04 deb packages (CUDA 8.0.61, driver 375.26, cuDNN 6.0.20).

I took RNN_example.cu and modified it in the following way:

  • I added an "epoch" loop around the section from the comment "We now need to pass through the RNN" down to "int numMats" (see the sketch after this list).
  • I commented out the declarations and all uses of timeForward, timeBackward1 and timeBackward2.
  • I made the bidirectional bool controlled by an additional command line argument (0 = unidirectional, any other number = bidirectional).
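
To make the first change concrete, the modified section looks roughly like the sketch below. This is a simplified reconstruction rather than my exact code: numEpochs is a name I introduce here, and the descriptor/buffer names only approximate those in the sample. All descriptors, device buffers, the workspace and the reserveSpace are still allocated once before the loop, exactly as in the unmodified RNN_example.cu.

  const int numEpochs = 1000;  // 1000 for the top/RSS runs, 100 for the valgrind runs

  for (int epoch = 0; epoch < numEpochs; ++epoch) {
     // "We now need to pass through the RNN": forward pass
     cudnnErrCheck(cudnnRNNForwardTraining(cudnnHandle, rnnDesc, seqLength,
                                           xDesc, x, hxDesc, hx, cxDesc, cx,
                                           wDesc, w,
                                           yDesc, y, hyDesc, hy, cyDesc, cy,
                                           workspace, workSize,
                                           reserveSpace, reserveSize));

     // ...followed, unchanged from the sample, by cudnnRNNBackwardData(...),
     // cudnnRNNBackwardWeights(...) and the cudaDeviceSynchronize() calls...
  }
  // the loop closes again just before the "int numMats" line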

When I run this for a reasonable number of epochs (e.g. 1000) with

  • ./RNN 100 4 512 64 2 0 (i.e. the same args as make run, unidirectional), the CPU memory used (RSS in top) stays approximately constant
  • ./RNN 100 4 512 64 2 1 (i.e. the same args as make run, but with bidirectional=True), the CPU memory used (RSS in top) keeps increasing (a small per-epoch RSS check is sketched below)
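
In case it helps, this is roughly how the growth can also be logged from inside the program instead of watching top. It is my own addition, not part of the sample, and getCurrentRSSKb is just a name I made up; it reads the Linux-specific /proc/self/status:

  #include <cstdio>
  #include <cstring>

  // Return the current resident set size of this process in kB, or -1 on failure.
  static long getCurrentRSSKb() {
     FILE *f = fopen("/proc/self/status", "r");
     if (!f) return -1;
     char line[256];
     long rssKb = -1;
     while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
           sscanf(line + 6, "%ld", &rssKb);  // the kernel reports this value in kB
           break;
        }
     }
     fclose(f);
     return rssKb;
  }

  // inside the epoch loop:
  //   printf("epoch %d: RSS %ld kB\n", epoch, getCurrentRSSKb());

Printing this once per epoch makes it easy to see how much memory is retained per iteration.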

I then turned to valgrind (using “valgrind --leak-check=full --num-callers=20 …” to be precise).
For the valgrind runs I used 100 iterations of the "epoch" loop.
Sure enough, there appear to be ~40 MB of definitely lost memory with the bidirectional LSTM, but not with the unidirectional one:

grep '= *defini' *.log
with-bidir.log:==16389==    definitely lost: 40,978,136 bytes in 160,068 blocks
without-bidir.log:==16450==    definitely lost: 4,568 bytes in 17 blocks

According to this snippet of the valgrind output, the bulk of the lost memory is allocated inside cudnnRNNForwardTraining:

==16730== 40,117,504 bytes in 156,707 blocks are definitely lost in loss record 1,643 of 1,646
==16730==    at 0x4C2DBC5: calloc (vg_replace_malloc.c:711)
...
==16730==    by 0x128FF02F: cuLaunchKernel (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.26)
...
==16730==    by 0x804F461: cudnnRNNForwardTraining (in /usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.20)
==16730==    by 0x402EBE: main (in /usr/src/cudnn_samples_v6/RNN/RNN)

As this adds up over more elaborate training runs, I would highly appreciate a hint on how one might either avoid this allocation or get the memory released.

If I may, I would also be curious whether such a continuous increase in memory use is to be expected for bidirectional (as opposed to unidirectional) LSTM RNNs.

My initial motivation came from an observation with PyTorch (https://discuss.pytorch.org/t/tracking-down-a-suspected-memory-leak/1130, where cuDNN 5 seemed to show better RNN memory behaviour), but I suspect the RNN example is closer to best-practice code.

My apologies if this is a beginner’s question; I am only doing this as a hobby and know practically nothing about CUDA.

Best regards

Thomas

P.S.: I’d be happy to post the exact code I used, but I’m not sure whether simply uploading the (modified) sample code as a gist or similar is permitted.