OOM on Jetson TX1

Hi,

I am running a program on a Jetson TX1. After the program has been running for a while, an OOM occurs. The OOM killer then kills the process, but the program restarts. After several restarts, the Ubuntu system hangs. Our hardware partner analyzed the kernel log and told us that the system hangs because of a memory leak. However, my experiments show that all the memory allocated to the program is released.

So, here are my questions:

  1. After a process is killed by the OOM killer, is all of its memory released?
  2. Does the system hang because of a memory leak?
  3. What are the possible causes of the OOM?

hello jasonontheway8840,

In general, if a process continuously allocates buffers without releasing them, the out-of-memory killer is triggered and may kill “any” service to ensure there is enough memory for the OS to keep running.

Could you please confirm which application is not handling memory correctly?
You can observe per-application memory usage with the command below:

$ ps aux
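
`ps aux` gives a single snapshot, while a leak shows up as resident memory (RSS) that keeps growing over time. A minimal sketch for sampling one process repeatedly, assuming a Linux `/proc` filesystem; the helper names `rss_kib` and `watch_rss` are hypothetical and not part of any NVIDIA tool:

```shell
# rss_kib PID: print the resident set size of PID in KiB,
# taken from the VmRSS line of /proc/PID/status.
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# watch_rss PID INTERVAL COUNT: print "epoch-seconds rss_kib" COUNT times,
# sleeping INTERVAL seconds between samples.
watch_rss() {
    i=0
    while [ "$i" -lt "$3" ]; do
        printf '%s %s\n' "$(date +%s)" "$(rss_kib "$1")"
        i=$((i + 1))
        if [ "$i" -lt "$3" ]; then
            sleep "$2"
        fi
    done
}
```

For example, `watch_rss 1023 60 1440` would sample PID 1023 once a minute for a day; a steadily rising second column points to a leak in that process.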

Also, may I know what use case you are running, and how long it takes for the OOM to be triggered?
Thanks

Hi Jerry,

This log file was downloaded from a device running our program. The system hung at around 11:00, and we restarted it at around 15:00 (power off, then power on). The log shows which applications did not handle memory correctly, along with timing information about the OOM events.

Please download the attachment.
Thanks.

kernal-log-backup-b1.log (1.33 MB)

hello jasonontheway8840,

It seems the leak is coming from the two applications below.
Could you please analyze the memory usage of these two applications?

Dec 10 10:55:33 tegra-ubuntu kernel: [306738.554836] [ 1023]     0  1023  1292545   291587    1261       8        0             0 stream_main
Dec 10 10:55:33 tegra-ubuntu kernel: [306738.554888] [ 7494]     0  7494  2786395   280961     830       7        0             0 ubox_main
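
In the OOM killer's task dump, the `rss` column (the sixth field from the end in the lines above) is counted in pages, not bytes. A rough sketch for converting it to MiB, assuming 4 KiB pages (check with `getconf PAGESIZE`) and the column layout shown above; the function name `oom_rss_mib` is hypothetical:

```shell
# oom_rss_mib: read kernel log lines on stdin and print "name rss_MiB" for each
# OOM task-dump entry, largest first.
# Assumes: "[ pid] uid tgid total_vm rss ... name" layout and 4 KiB pages.
oom_rss_mib() {
    awk '/\[ *[0-9]+\] +[0-9]+ +[0-9]+ +[0-9]+ +[0-9]+/ {
        # $NF is the process name; rss (in pages) sits five fields before it
        printf "%s %d\n", $NF, int($(NF-5) * 4 / 1024)
    }' | sort -k2,2nr
}
```

Fed the two lines above, this reports roughly 1139 MiB resident for stream_main and 1097 MiB for ubox_main at the time of the dump.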

BTW,

  1. May I know what use case you are running?
  2. Do you still trigger the OOM without running these services?

Hi Jerry,

We know that ubox_main and stream_main trigger the OOM. Our goal is to solve the problem of the Ubuntu system hanging after several OOM events.
Any ideas?
Thanks.

Dear Jerry,

The two processes ubox_main and stream_main trigger the OOM, but from the log files we cannot determine how they trigger it.
During the OOM, the watchdog is not triggered, in either kernel space or user space. What do you suggest we do to locate the root cause of the OOM and of the WDT behavior?

hi all,

I would suggest starting with memory usage, to confirm that the processes actually consume the available memory.
You could use a script to collect memory information for your processes with the commands below. Thanks

cat /proc/meminfo
cat /sys/kernel/debug/nvmap/iovmm/procrank
cat /sys/kernel/debug/nvmap/iovmm/clients
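
One way to script this collection is to append a timestamped copy of each file to a log at a fixed interval. A minimal sketch; the helper name `snapshot_files` and the log path in the usage note are hypothetical, and the debugfs files are assumed to be readable only by root:

```shell
# snapshot_files OUT FILE...: append a timestamped copy of every readable FILE
# to OUT. Files that are missing or unreadable are skipped silently.
snapshot_files() {
    out="$1"
    shift
    for f in "$@"; do
        if [ -r "$f" ]; then
            printf '=== %s %s ===\n' "$(date '+%F %T')" "$f" >> "$out"
            cat "$f" >> "$out"
        fi
    done
}
```

Run as root, a loop such as `while true; do snapshot_files /var/log/memtrack.log /proc/meminfo /sys/kernel/debug/nvmap/iovmm/procrank /sys/kernel/debug/nvmap/iovmm/clients; sleep 60; done` would record one snapshot per minute, so the trend leading up to an OOM can be read back after a reboot.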

Hi Jerry,

Attached are three log files collected with your commands.

I think we should focus on solving the system hang rather than the OOM, and find a way to make the watchdog work when the system hangs, because we can be sure that memory leaks will happen in our applications.

Also, is there a way to restart the system when it hangs without enabling the watchdog?

clients.log (295 Bytes)
meminfo.log (1.04 KB)
procrank.log (355 Bytes)

hello Jasonychen,

Please try the command below to enable the watchdog manually; it will reboot the system after 120 seconds.

# The redirection is done by the calling shell, so run the write itself as root:
echo 1 | sudo tee /dev/watchdog0

Hi Jerry,

We enabled the watchdog manually. However, the watchdog did not work when the system hung.

See the attachment above, “kernal-log-backup-b1.log”. The system hung at “Dec 10 11:42:46”, but the watchdog did not fire.
Please search for the keyword “Dec 10 11:42:46” in the log file “kernal-log-backup-b1.log”.
We would also like to know why the watchdog did not work.

Thanks.

hello Jasonychen,

BTW,
here’s Watchdog Timer configuration for your reference.
thanks

Dear Jerry,

I have sent the Watchdog Timer configuration to them, and I confirm the kernel configuration is right.

  1. After echo 1 > /dev/watchdog0, the WDT reboots the system after 120 s.
  2. The following steps also confirm that the watchdog works:
sudo su
cd /proc/sys/kernel
echo 0 > panic
echo c > /proc/sysrq-trigger

Now the problem is that they wrote an OOM monitor process: once an OOM occurs, this monitor process is supposed to trigger the watchdog from user space, but the watchdog was not actually triggered.

Hello Jerry,

Can you give some possible reasons why the watchdog wasn’t triggered?

Dear Jasonychen,

Can you post your WDT trigger source code here? I can review the code too.

Thanks.

Hello all,

This is the cron job.

* * * * * echo 1 > /dev/watchdog
* * * * * sleep 30;echo 1 > /dev/watchdog
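
Each cron `echo` opens the device, writes once, and closes it again. Under the generic Linux watchdog interface, opening the device starts the timer and every write resets it, so a common alternative is a small daemon that holds the device open and writes at a fixed interval. A sketch under those assumptions (the function name `pet_watchdog` is hypothetical, and the behavior on close depends on the driver and its nowayout setting):

```shell
# pet_watchdog DEV INTERVAL [COUNT]: hold DEV open and write one byte every
# INTERVAL seconds. COUNT limits the number of writes (0 or unset = forever)
# and exists mainly so the loop can be tried against a regular file.
pet_watchdog() {
    dev="$1"
    interval="$2"
    count="${3:-0}"
    n=0
    # The redirection on the loop keeps a single descriptor open for its lifetime.
    while :; do
        printf '1'            # any write resets the watchdog countdown
        n=$((n + 1))
        if [ "$count" -gt 0 ] && [ "$n" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done > "$dev"
}
```

Started as root with `pet_watchdog /dev/watchdog0 30`, the loop stops writing if user space hangs, and the timer would then be expected to expire after the 120 s timeout mentioned above.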

Hello Jerry,

BTW,

Can I change the MAC address of our devices?

Thanks.

There are steps for configuring vendor-specified MAC addresses.
Please see the documentation, [Jetson TX1-TX2 Module EEPROM Layout], from the Jetson Download Center.
Thanks

Just something to check: Be sure this is running as root (or sudo).

This cron job is running as root.

BTW,
the Ubuntu system hangs after several watchdog reboots.
I tested the watchdog reboot with the cron job below, since the system reboots 120 s after the write.

*/3 * * * * echo 1 > /dev/watchdog

I ran the above test on other devices, including ones from LEETOP and other partners using the Jetson TX1. All of the systems hang after several watchdog reboots.

Please replace the file at …/Linux_for_Tegra/bootloader/tos.img with the attached one, then flash the device and try again.
tos.img.txt (44.6 KB)