We have a project that uses hundreds of TX1s, and there is a stability problem.

1. Equipment: the devices perform video/multimedia encoding and decoding.
2. Problem: after running for a period of time, the device still responds to ping, but the display reports "no signal" and the serial port neither prints output nor accepts input.
3. Finding: kernel.log indicates that the kernel has crashed.
Note: the system uses the official JetPack 3.0 kernel, file system, and U-Boot, without any modification.
kernellog.rar (1.1 MB)

One thing I see implies that even if the original problem itself is gone there could also be a file system issue:

EXT4-fs (mmcblk0p1): recovery complete

…recovery might imply you lost something…something which may or may not be important. Typically the file system will need to run a recovery after a crash, and although the file system won’t be corrupt, some part of it may now be missing. I see a lot of this…which tends to imply something was lost:

Dec 26 09:38:17 tegra-ubuntu kernel: [    7.233309] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1578138
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.233931] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1578150
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.238683] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586146
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.240640] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586320
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.246166] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 270989
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.246580] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 270980
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.246615] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586235
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.246647] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586243
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.246681] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586287
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.246712] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586322
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.252827] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1049750
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.252869] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586285
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.252900] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586237
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.252931] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1586195
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.258866] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1048690
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.258890] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1048670
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.258909] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1048667
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.258925] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1048666
Dec 26 09:38:17 tegra-ubuntu kernel: [    7.258948] EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1048621
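
With hundreds of units it may help to tally these messages per device rather than read them by eye. Below is a minimal Python sketch, assuming the log lines look like the ones quoted above; the function name count_orphan_cleanups is just illustrative and not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... EXT4-fs (sda): ext4_orphan_cleanup: deleting unreferenced inode 1578138
ORPHAN_RE = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): ext4_orphan_cleanup: "
    r"deleting unreferenced inode (?P<inode>\d+)"
)


def count_orphan_cleanups(lines):
    """Return a Counter mapping device name -> number of orphan inodes deleted."""
    counts = Counter()
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

You could feed it an open log file, e.g. count_orphan_cleanups(open("kernel.log", errors="replace")), and compare the totals between units. Note that in the excerpt above the orphan cleanup is on sda while the "recovery complete" message is for mmcblk0p1, so counting per device keeps the two apart.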

I don’t know what “VAG.bin” is, but it keeps showing up. Perhaps VAG.bin is an issue if it uses corrupt or missing file system content…I don’t know.

The first kernel dump is from:

Dec 27 12:22:35 tegra-ubuntu kernel: [96267.028122] CPU: 2 PID: 3884 Comm: avstream Not tainted 3.10.96-tegra #1
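
To see whether the same process shows up in the dumps on other units, the header line can be pulled out of each log. This is only a sketch, assuming the header has the same layout as the line quoted above; find_dump_headers is a made-up name, and the pattern assumes the command name contains no spaces.

```python
import re

# Matches the header the kernel prints with a dump, e.g.:
#   [96267.028122] CPU: 2 PID: 3884 Comm: avstream Not tainted 3.10.96-tegra #1
DUMP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) "
    r"Comm: (?P<comm>\S+)"
)


def find_dump_headers(lines):
    """Return one dict per dump header: uptime in seconds, CPU, PID, command."""
    headers = []
    for line in lines:
        match = DUMP_RE.search(line)
        if match:
            headers.append({
                "uptime_s": float(match.group("uptime")),
                "cpu": int(match.group("cpu")),
                "pid": int(match.group("pid")),
                "comm": match.group("comm"),
            })
    return headers
```

The first entry returned for a log is the first dump in it; the uptime value shows how long the unit had been running when it happened.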

There are notes about DMA32 sprinkled throughout…perhaps DMA is not set up correctly.

First, thank you. The kernel, file system, and U-Boot are the official versions, without any modification.
How can this be solved?

Most of what follows is written without really knowing what original condition caused things to go bad. If you actually manage to track down a specific bug, life is easier.

Here’s the thing to know about file systems…in the “old days” (without journals), when an improper unmount or crash corrupted the file system, fixing it involved cutting out the damaged pieces and putting them in the “lost+found” directory (ever wondered why that was there? It’s where pieces of unknown origin go after a file system repair). On a journaled file system, if a write is interrupted, then instead of going through the whole structure and forcibly making it correct through various cuts and splices, the system can replay the journal and discard incomplete changes to get back to a non-corrupt state…you lose recent changes, but you don’t get orphans in “lost+found” (journal replay does not create lost orphans for you to find). If the damage exceeds what the journal covers, then you will still get “lost+found” orphans.

If an orphan is put in “lost+found”, then you have no idea what was lost (in fact this is why it is put in “lost+found”…you can open it with an editor or run “strings” on it to try to identify what it came from)…it could be a piece from the middle of a file which had not been touched in a very long time. If the change is instead rolled back via a journal replay, only recent changes are affected, and that is less of a worry for the operating system (whatever program depended on that data may still fail each time it reads the missing chunk…but there won’t be a “lost+found” entry). Either way you don’t know what was lost. If there is nothing in the lost+found of any of your partitions (“sudo find / -type d -name 'lost+found'” will locate those directories), it is likely a journal was replayed and only recent data is bad. You have something crashing and burning…if you can’t identify what it is, then re-installing whatever that application or driver uses will probably be your only way to be sure.
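
Checking lost+found by hand on every unit gets tedious, so here is a rough Python sketch of the same check. It assumes you pass in the mount points you care about (for example "/" and wherever sda is mounted); the function name lost_and_found_entries is hypothetical, and it normally needs to run as root because lost+found is usually readable only by root.

```python
import os


def lost_and_found_entries(mount_points):
    """Map each existing <mount_point>/lost+found directory to its entries.

    A value of None means the directory exists but could not be read
    (typically because the script was not run as root).
    """
    report = {}
    for mount_point in mount_points:
        path = os.path.join(mount_point, "lost+found")
        if not os.path.isdir(path):
            continue  # no lost+found at this location
        try:
            report[path] = sorted(os.listdir(path))
        except PermissionError:
            report[path] = None
    return report
```

An empty list for every partition is the "nothing in lost+found" case described above. Any names that do show up are the pieces to inspect with an editor or "strings".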

What if this is just a side-effect? What if re-installing is only a temporary help? Is some bug causing the problem? Or is corruption on the file system causing the crash now? Perhaps both? You might want to re-install and carefully watch those units. First re-install just the application and the data it might depend on…its environment. If this were something like a web browser, I’d say be sure all cache and temporary data and any database are removed and re-installed. If the problem goes away, then it is likely the operating system was never harmed. If the problem is still there, then you probably have to re-install the entire operating system. When a unit fails like this again, you may want to clone the rootfs and preserve a copy of it prior to ever rebooting. A clone can be used for research…but you want to see what is on that clone as it was right after the failure, before the system runs more and tools try to repair things.

If power was ever lost without the system being properly shut down, this can cause it, e.g., a power failure even if it was shorter than the blink of an eye. Sometimes a driver in the kernel can fail and make umount impossible…shutdown then wouldn’t correctly umount the partition. If you can examine each of the failed units and determine whether something went wrong with a previous shutdown, then you could have at least some confidence the problem won’t show up again.

But…what is AVR.bin? Where is it? “sudo find / -name AVR.bin”. What hardware is involved with this (it wouldn’t have notes about DMA32 if hardware were not involved)? This might offer clues as to what files to examine.

Hi zbit,
Is there a way we can reproduce the issue?

Also, are you able to try JetPack 3.1?