Torch 7 CUDA tests "Killed" on the TX1

Hi - is anyone else noticing that some of the torch7 tests get killed on the TX1?

Here is the cudnn.torch benchmark (in the test directory from cudnn.torch):

@tegra-ubuntu:~/code/cudnn.torch/test$ th benchmark.lua
CUDNN Version:  5005
cudnn.SpatialConvolution
Forward AutoTuned            :  14      13      15      9       48      18      29      0.0085029602050781
Forward implicit gemm        :  14      13      15      9       48      18      29      0.015181064605713
Forward implicit precomp gemm:  14      13      15      9       48      18      29      0.004382848739624
Forward gemm                 :  14      13      15      9       48      18      29      0.018018007278442
Forward FFT                  :  14      13      15      9       48      18      29      0.0088889598846436
Forward FFT tiling           :  14      13      15      9       48      18      29      0.0045740604400635
cudnn.VolumetricConvolution
Killed
tegra-ubuntu:~/code/cudnn.torch/test$ luajit -l cutorch -e 'cutorch.test()'
seed:   1472255993
Running 157 tests
...
125/157 cdiv3 ........................................................... [PASS]
126/157 add ............................................................. [PASS]
127/157 log1 ............................................................ [PASS]
128/157 cpow ............................................................ [PASS]
129/157 sort ............................................................ [WAIT]Killed

I thought it might be because of some kind of watchdog timer in X, as described here: CUDA Visual Profiler 'Interactive' X config option? - Stack Overflow

I changed xorg.conf to have a line with Option “Interactive” “0” and restarted the TX1, but I don’t see any difference. Is the GPU on the TX1 just running out of memory? Or is something explicitly killing the processes?
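
For reference, the stanza I ended up with looked roughly like this (I'm assuming the stock L4T Device section with Identifier "Tegra0"; match whatever identifier your xorg.conf already uses):

Section "Device"
    Identifier  "Tegra0"
    Driver      "nvidia"
    Option      "Interactive" "0"
EndSection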

thanks!

You could run something like htop or xosview, remotely displayed on another computer, while the test runs to see what memory is doing. On R24.1 there is also a memory leak which should be fixed in R24.2 (“soon” to be released…nobody has given an official release date yet). Linux is set to kill off user-space processes as needed when it runs out of memory, so that might be what's happening. Otherwise you might be able to check dmesg or “/var/log/Xorg.0.log” and see something. Note that the GPU and video do not have their own memory on a Jetson…they use main system memory via a direct connection to the MMU.
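
For example, from another machine (the username and hostname here are just placeholders for whatever your board uses):

ssh ubuntu@tegra-ubuntu
htop                               # watch the Mem row while the test runs
watch -n 1 free -m                 # alternative if htop isn't installed
dmesg | grep -i "out of memory"    # after a kill, look for OOM-killer messages
tail -n 50 /var/log/Xorg.0.log     # check for any X-side errors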

Thanks for the tip! I verified that the memory maxes out in htop for both of those commands just before they are killed. You're right that the OS is killing the process:

[252105.880915] Out of memory: Kill process 24292 (luajit) score 530 or sacrifice child

this was in dmesg.
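
For anyone else hitting this, you can also print what cutorch itself sees before launching the heavy tests (a quick sketch, assuming your cutorch build exposes getMemoryUsage; on a Jetson this reflects the shared system memory, since there is no dedicated GPU memory):

require 'cutorch'
local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
print(string.format('GPU-visible memory: %.1f MB free of %.1f MB', free / 2^20, total / 2^20))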

Thanks again!

I just saw this thread after posting my similar problem here: https://devtalk.nvidia.com/default/topic/959945/?comment=4960292

So this has explained the “killed” part of the problem.

@cpadwick11, did you also see some failures/errors before the test was killed?

Hi - I think I saw some failures prior to the “kill”.

The good news regarding the memory leak is that R24.2 was just released.
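
If you're not sure which release a board is running, the first line of /etc/nv_tegra_release should show the R## release and revision:

head -n 1 /etc/nv_tegra_release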