Reboot failure.

I have a Tegra K1 chip that seems to bug out on reboot after it has been playing video.

I use this command to play a video:

$ gst-launch-0.10 filesrc location=<filename.mp4> ! qtdemux name=demux
demux.video_00 ! queue ! nv_omx_h264dec ! nv_omx_hdmi_videosink -e

Then I let the chip warm up and simply reboot:

$ reboot
[ 67.911308] Restarting system.
[ 67.914399] Restarting Linux version 3.10.40-ged4f697 (enewnham@arch) (gcc ve
rsion 4.8.3 20140401 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2014.04 - Lina
ro GCC 4.8-2014.04) ) #1 SMP PREEMPT Wed Mar 11 14:25:40 EDT 2015
[ 67.914399]

Then this error crops up:

No valid FDT found - please append one to U-Boot binary, use u-boot-dtb.bin or define CONFIG_OF_EMBED. For sandbox, use -d <file.dtb>
initcall sequence 83de1db0 failed at call 83dc388c
### ERROR ### Please RESET the board ###

We have seen this issue before on R21.2, but there U-Boot hung without printing the error message. On R21.3 the error message now appears. To exacerbate the issue, you can unplug the fan so the chip warms up more.

Thanks!

Just as a test, do you have the same result with “shutdown -r now” as with “reboot”? I doubt it differs but it helps to have baseline data.

In your /boot/extlinux/extlinux.conf file, what is the FDT line? An example would be:

FDT /boot/tegra124-jetson_tk1-pm375-000-c00-00.dtb

Also, does this file actually exist at the location named by the FDT entry?
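
Both checks can be scripted. Here is a minimal sketch in Python, assuming the extlinux.conf layout shown above (an FDT keyword followed by a path). The helper names fdt_paths and report are made up for illustration, and the keyword is matched case-insensitively only to be lenient:

```python
import os


def fdt_paths(conf_text):
    """Return the path from every FDT line in extlinux.conf-style text."""
    paths = []
    for line in conf_text.splitlines():
        fields = line.split()
        # A commented-out entry starts with '#', so its first field is not FDT.
        if len(fields) >= 2 and fields[0].upper() == "FDT":
            paths.append(fields[1])
    return paths


def report(conf_path="/boot/extlinux/extlinux.conf"):
    """Print each FDT entry and whether the named file exists."""
    with open(conf_path) as f:
        for path in fdt_paths(f.read()):
            state = "exists" if os.path.isfile(path) else "MISSING"
            print(f"{path}: {state}")
```

Calling report() on the board prints one line per FDT entry, which makes a missing or misnamed .dtb easy to spot.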

Hey linuxdev,

yup, same result with shutdown -r now. Tried sync; reboot -f as well.

Yes, my FDT line does look like that, and the file does exist at that location. However, I believe U-Boot is failing before that: this error appears on the very first line after U-Boot prints its version information.

EDIT: I just tried changing my FDT file name and rebooted: same problem. I do believe this error occurs before the ext4 file system is accessed, so it may be the FDT that is compiled into U-Boot?

But that isn’t ‘corruptible’ because it’s baked into the U-Boot binary, so it must be a CPU runtime issue that causes memory to corrupt. Possibly heat related?

The best way to recreate this issue is to unplug the fan, play a video, and reboot. Repeat until reboot fails. This can also happen with the fan on; it just happens more often with it off.
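
To tie each failed reboot to a temperature, the readings could be recorded just before rebooting. This is a rough sketch, assuming the kernel exposes the standard thermal sysfs nodes (/sys/class/thermal/thermal_zone*/temp, in millidegrees Celsius). I have not confirmed which zones this L4T kernel provides, and the function names are only illustrative:

```python
import glob
import os


def parse_millidegrees(raw):
    """Convert a sysfs thermal reading such as '70500' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: temperature_in_C} for every readable thermal zone."""
    temps = {}
    pattern = os.path.join(base, "thermal_zone*", "temp")
    for temp_file in sorted(glob.glob(pattern)):
        zone = os.path.basename(os.path.dirname(temp_file))
        try:
            with open(temp_file) as f:
                temps[zone] = parse_millidegrees(f.read())
        except (OSError, ValueError):
            # Some zones may be unreadable or report non-numeric data; skip them.
            continue
    return temps


if __name__ == "__main__":
    for zone, temp_c in read_zone_temps().items():
        print(f"{zone}: {temp_c:.1f} C")
```

Running it right before each reboot and noting the output next to the pass/fail result would show how strongly failures track temperature.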

With a serial console attached you should be able to log the entire shutdown and the reboot up to the point of failure. Would it be possible to see those logs?

FYI, memory sits next to the tegra124 chip…the memory I’m looking at is probably the same on all Jetsons, but may not be…the ones I see are Samsung. As a test I wonder if there would be a way to disconnect the fan but carefully add some sort of cooling capacity to the memory itself next to the tegra124. The goal being to test tegra124 heating separately from memory heating.

I believe it is more likely the memory cooling would be at issue than the tegra124 heat itself. If you don’t have a way to cool the memory chips with some sort of improvised heat sink, there are spray bottles for fast cooling which could “very carefully” be used on the memory chips. For stress testing you would completely and quickly freeze those chips while monitoring operation…but this is NOT stress testing, all you would want to do is keep memory cool but not frozen while allowing the tegra124 to heat. Or the reverse. Results with memory cooled but not tegra124, or vice versa, would help for knowing if there is a marginal component.

My board uses the HYNIX H5TC4G63AFR-RDA.

The reason I bring this up is that the change log for R21.3 mentions a fix related to reboot stress testing:

[200072946] Improved system stability during extended reboot stress testing

Do you know what is included in the fix? Was this a kernel fix or a U-Boot fix?

Looking more closely at two of my boards, I see it does actually say “hynix”…it’s just hard to read.

I don’t know what the actual fix was, but it’s almost always some sort of timing adjustment and/or voltage adjustment. Initial values are always set up in the boot loader; it is likely that a reboot failure does not allow booting to an extent that the kernel controls this. nVidia will probably be interested in this, but they are still likely to want to see the logs from serial console.

Reboot failure seems sporadic, but mostly related to temperature. If the CPU temperature is above 70C when it reboots, it is more likely to fail, but this is not guaranteed. Some more insight:

I then turned on U-Boot debug and arrived at this log.

Broadcast message from root@localhost.localdomain
        (unknown) at 1:04 ...

The system is going down for reboot NOW!
[   67.067735] Restarting system.
[   67.070802] Restarting Linux version 3.10.40-ged4f697 (enewnham@arch) (gcc ve
rsion 4.8.3 20140401 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2014.04 - Lina
ro GCC 4.8-2014.04) ) #2 SMP PREEMPT Mon Mar 16 10:35:39 EDT 2015
[   67.070802] 

U-Boot - Fri Mar 13 11:13:00 EDT 2015boot device - 0
mkimage signature not found - ih_magic = ffffffff
Jumping to U-Boot
image entry point: 0x83D8E000
start_cpu entry, reset_vector = 83d8e000
tegra124_init_clocks entry
Setting up PLLX
init_pllx entry
tegra_get_chip: CHIPID is 0x40
 init_pllx: SoC = 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: SKU info byte = 0x87
tegra_get_chip: CHIPID is 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: Chip SKU = 4
 init_pllx: osc = 2
tegra_get_chip: CHIPID is 0x40
 pllx_set_rate entry
pllx_set_iddq: IDDQ: PLLX IDDQ = 0x00000000
pllx_set_rate: base = 0x00107401
pllx_set_rate: misc = 0x00040000
pllx_set_rate: base final = 0x40107401
Enabling clocks
Taking periphs out of reset
tegra124_init_clocks exit
enable_cpu_power_rail entry
pmic_enable_cpu_vdd entry
pmic_enable_cpu_vdd: Setting VDD_CORE to 1.0V via AS3722 reg 1/4D, 0x2801
pmic_enable_cpu_vdd: Setting VDD_CPU to 1.0V via AS3722 reg 0/4D, 0x3c00
pmic_enable_cpu_vdd: Setting VDD_GPU to 1.0V via AS3722 reg 6/4D, 0x2806
pmic_enable_cpu_vdd: Set VPP_FUSE to 1.2V via AS3722 reg 0x12/4E
pmic_enable_cpu_vdd: Set VDD_SDMMC to 3.3V via AS3722 reg 0x16/4E
enable_cpu_clocks entry
enable_cpu_clocks: PLLX base = 0x48107401
enable_cpu_clocks: PLLX locked, delay for stable clocks
enable_cpu_clocks: Setting CCLK_BURST and DIVIDER
enable_cpu_clocks: Enabling clock to all CPUs
enable_cpu_clocks: Enabling main CPU complex clocks
enable_cpu_clocks: Done
clock_enable_coresight entry
remove_cpu_resets entry
powerup_cpus entry
powerup_cpus entry: G cluster
powerup_cpus: CRAIL
power_partition: part ID = 00000000
power_partition, toggling state
powerup_cpus: C0NC
power_partition: part ID = 0000000F
powerup_cpus: CE0
power_partition: part ID = 0000000E
power_partition, toggling state
tegra_get_chipopw:e CruHpIP_IcpDu iss:  d0xon4e0
start_cpu exit, should continue @ reset_vector

.... this is where the CPU hangs. It does not continue to the reset_vector.

A successful reboot looks like this:

Broadcast message from root@localhost.localdomain
        (unknown) at 1:03 ...

The system is going down for reboot NOW!
[   80.677904] Restarting system.
[   80.681061] Restarting Linux version 3.10.40-ged4f697 (enewnham@arch) (gcc ve
rsion 4.8.3 20140401 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2014.04 - Lina
ro GCC 4.8-2014.04) ) #2 SMP PREEMPT Mon Mar 16 10:35:39 EDT 2015
[   80.681061] 

U-Boot - Fri Mar 13 11:13:00 EDT 2015boot device - 0
mkimage signature not found - ih_magic = ffffffff
Jumping to U-Boot
image entry point: 0x83D8E000
start_cpu entry, reset_vector = 83d8e000
tegra124_init_clocks entry
Setting up PLLX
init_pllx entry
tegra_get_chip: CHIPID is 0x40
 init_pllx: SoC = 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: SKU info byte = 0x87
tegra_get_chip: CHIPID is 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: Chip SKU = 4
 init_pllx: osc = 2
tegra_get_chip: CHIPID is 0x40
 pllx_set_rate entry
pllx_set_iddq: IDDQ: PLLX IDDQ = 0x00000000
pllx_set_rate: base = 0x00107401
pllx_set_rate: misc = 0x00040000
pllx_set_rate: base final = 0x40107401
Enabling clocks
Taking periphs out of reset
tegra124_init_clocks exit
enable_cpu_power_rail entry
pmic_enable_cpu_vdd entry
pmic_enable_cpu_vdd: Setting VDD_CORE to 1.0V via AS3722 reg 1/4D, 0x2801
pmic_enable_cpu_vdd: Setting VDD_CPU to 1.0V via AS3722 reg 0/4D, 0x3c00
pmic_enable_cpu_vdd: Setting VDD_GPU to 1.0V via AS3722 reg 6/4D, 0x2806
pmic_enable_cpu_vdd: Set VPP_FUSE to 1.2V via AS3722 reg 0x12/4E
pmic_enable_cpu_vdd: Set VDD_SDMMC to 3.3V via AS3722 reg 0x16/4E
enable_cpu_clocks entry
enable_cpu_clocks: PLLX base = 0x48107401
enable_cpu_clocks: PLLX locked, delay for stable clocks
enable_cpu_clocks: Setting CCLK_BURST and DIVIDER
enable_cpu_clocks: Enabling clock to all CPUs
enable_cpu_clocks: Enabling main CPU complex clocks
enable_cpu_clocks: Done
clock_enable_coresight entry
remove_cpu_resets entry
powerup_cpus entry
powerup_cpus entry: G cluster
powerup_cpus: CRAIL
power_partition: part ID = 00000000
power_partition, toggling state
powerup_cpus: C0NC
power_partition: part ID = 0000000F
powerup_cpus: CE0
power_partition: part ID = 0000000E
power_partition, toggling state
tegra_get_chipiopw: erCuHpI_PcIpDu si:s  0doxn40e
opw: erCuHpI_PcIpDu si:s  done

e tegarrat__cgeptu_ echxiipt,:  CshHIoPulIdD  icson 0tixn4u0
  @ res�initcall: 83dc71c0
initcall: 83dc9008


U-Boot 2014.10-rc2-svn7540 (Mar 18 2015 - 10:15:23)

initcall: 83d96b10
U-Boot code: 83D8E000 -> 83DEDDA4  BSS: -> 83F43FBC
initcall: 83d904c0
TEGRA124
initcall: 83d8f70c
Board: NVIDIA Jetson TK1
initcall: 83d96b58

............ then it continues on into linux.

What is strange is that the last legible debug message is “power_partition, toggling state”; after that the output is garbled. I believe this garbling is a result of the CPU beginning a reset cycle.

All this fun stuff is occurring in

src/arch/arm/cpu/arm720t/tegra124/cpu.c
277:void start_cpu(u32 reset_vector)

If the u-boot code logic was the cause of failure, then the odds are high that the failure would be consistent and not change just with increased heat. A marginal voltage or clock setting could do this, but then I would expect some instability under a wider set of circumstances outside of boot loader execution. The log is kind of a “smoking gun” that the failure begins during the boot loader and never reaches the kernel.

What I find interesting is that the first success/fail difference seems to occur at line 62 “tegra_get_chipopw”. I see several “tegra_get_chip…” calls in R21.3 u-boot source, but I do not see “iopw” anywhere in any of the source files. I’m not sure what to make of that.
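
For anyone wanting to locate that first difference mechanically, here is a small sketch. It assumes both serial captures are available as text and that comparison should start at the first line beginning with "U-Boot - ", since the kernel lines before it differ anyway (timestamps, broadcast time). The function name and the marker default are my own choices, and the returned offset counts from the marker line rather than from the top of the posted log:

```python
def first_divergence(log_a, log_b, start_marker="U-Boot - "):
    """Compare two console logs from the first line starting with start_marker.

    Returns (line_offset, line_a, line_b) for the first mismatch, where the
    marker line is offset 0, or None if the compared parts are identical.
    """
    def from_marker(text):
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if line.startswith(start_marker):
                return lines[i:]
        return lines  # marker absent: compare the whole log

    a, b = from_marker(log_a), from_marker(log_b)
    for offset in range(max(len(a), len(b))):
        # A log that ends early yields None for its missing lines.
        line_a = a[offset] if offset < len(a) else None
        line_b = b[offset] if offset < len(b) else None
        if line_a != line_b:
            return offset, line_a, line_b
    return None
```

Feeding it the failing and the successful capture returns the first pair of lines that differ after the boot loader banner.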

The “power_partition, toggling state” is identical between success and failure. I’m not all that familiar with this code, but it seems that this function is activating eMMC in some way. The code which fails once past this function tends to make me believe the issue is not the cpu reset cycle, but instead a memory access failure (reset would be a side-effect of the failure, but not the cause of the failure). The last part of this function is:

/* Give I/O signals time to stabilize */
udelay(IO_STABILIZATION_DELAY);

I have not examined earlier versions of u-boot, so I don’t know if anything here has changed recently, but this final setting makes me very very suspicious that this “stabilization” delay is for the very purpose of preventing the issue you are running into.

In “arch/arm/cpu/arm720t/tegra-common/cpu.h” “IO_STABILIZATION_DELAY” is defined as:

/* Stabilization delays, in usec */
...
#define IO_STABILIZATION_DELAY  (1000)

I don’t know if you are feeling adventurous, but I wonder if arbitrarily increasing IO_STABILIZATION_DELAY to something like 1250 or 1500 would increase reboot success under stress based on heat.

Line 62 is actually garbled, and is a combination of a few print statements.

tegra_get_chipopw:e CruHpIP_IcpDu iss:  d0xon4e0
tegra_get_chip: CHIPID is 0x40
powerup_cpus: done

which is funky to say the least. I will try increasing the delays.
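
The claim that the garbled line is just two messages written to the UART at the same time can be checked character by character. A minimal sketch: the classic interleaving check via dynamic programming, applied to the strings quoted above (the function name is mine, not anything from U-Boot):

```python
def is_interleaving(merged, a, b):
    """Return True if merged can be formed by interleaving a and b,
    keeping the characters of each string in their original order."""
    if len(merged) != len(a) + len(b):
        return False
    # reachable[j]: the first i chars of a and the first j chars of b can be
    # interleaved into the first i + j chars of merged (i is the outer loop).
    reachable = [False] * (len(b) + 1)
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                reachable[j] = True
                continue
            target = merged[i + j - 1]
            from_a = i > 0 and reachable[j] and a[i - 1] == target
            from_b = j > 0 and reachable[j - 1] and b[j - 1] == target
            reachable[j] = from_a or from_b
    return reachable[len(b)]


garbled = "tegra_get_chipopw:e CruHpIP_IcpDu iss:  d0xon4e0"
msg_a = "tegra_get_chip: CHIPID is 0x40"
msg_b = "powerup_cpus: done"
print(is_interleaving(garbled, msg_a, msg_b))  # True
```

So line 62 contains exactly those two messages and nothing else, which fits two writers sharing the serial port rather than corrupted string data.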

The reboot issue addressed by R21.3 is listed in the release notes:
http://developer.download.nvidia.com/embedded/L4T/r21_Release_v3.0/Tegra_Linux_Driver_Package_Release_Notes_R21.3.pdf

There is one entry about reboot stress testing.

Let me ask, from a H/W perspective, whether this is a memory corruption issue caused by a difference in memory layout.

  1. Are you using PM375_Hynix_2GB_H5TC4G63AFR_RDA_924MHz.cfg in \bootloader\ardbeg\BCT?

  2. Is this a board of your own design, rather than a Jetson?

  3. If yes for 2: are the memory-related layout, PCB stackup and PCB material all exactly the same as on the Jetson?

  4. If no for 3: have you followed the memory layout requirements in the Tegra K1 Embedded Platform Design Guide on the Jetson portal?

  5. If yes for 2: have you gone through the memory characterization process to generate an optimal .cfg (memory controller settings) file for your memory layout?
    Please refer to https://developer.nvidia.com/rdp/assets/tegra-k1-memory-characterization

You may also want to try nvflash with the lower-clocked PM375_Hynix_2GB_H5TC4G63AFR_RDA_792MHz.cfg in \bootloader\ardbeg\BCT to see whether the issue becomes less reproducible.