TK1 demo board hang

Hi all,
We are developing an application on the NVIDIA TK1 demo board (test samples: 20 pcs).
Kernel source: r21.6.0
The TK1 hangs after running for about 20 hours and outputs the warning message: possible elpg refcnt mismatch. elpg refcnt=2

We want to know how to locate the issue: is it hardware, software, or a kernel module?

Best regards!

2017/11/02

Don’t know, but is there some specific program running when this happens? If you run htop is there anything noticeable sticking out (X shows virtual memory so the huge allocation is correct…this isn’t actual usage, it’s just what X is capable of using)?

Hi all,
I checked the VDD of the CPU and GPU. When the error occurs, VDD is 0.82 V. Does VDD influence the error, or does the error influence VDD?

I would guess that heat has an influence on this…which comes first of heat causing error or error causing heat I don’t know. What is the environment like? Certainly the system load would influence this, and if a bug causes an increased load then heat would go up.

I am wondering if you might be able to try this same thing with an external fan keeping the entire board a bit cooler. Or the reverse…put the Jetson in a box which will allow heat build up a bit sooner to see if the error hits faster.

Also, are you running in a performance mode, or is this a default for performance?

The TK1 runs under normal conditions (about 20 degrees). We built the r21.6 kernel as-is and flashed it to the board.
How do we switch between default and performance mode in the kernel source? We are considering setting it to full-performance mode.

You will find a list of performance topics here:
https://elinux.org/Jetson_TK1#Performance_and_Power_Topics

I use this script to list some performance information (name “performance_ls.sh”):

#!/bin/bash

echo -n "CPUquiet: ";
cat /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
echo -n "Scaling Governor: ";
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo -n "CPU 0 online: ";
cat /sys/devices/system/cpu/cpu0/online
echo -n "CPU 1 online: ";
cat /sys/devices/system/cpu/cpu1/online
echo -n "CPU 2 online: ";
cat /sys/devices/system/cpu/cpu2/online
echo -n "CPU 3 online: ";
cat /sys/devices/system/cpu/cpu3/online

This will set for increased performance (name “performance_set_max.sh”):

#!/bin/bash

echo '0' > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable

g_ONLINE="$(cat /sys/devices/system/cpu/cpu0/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
   echo '1' > /sys/devices/system/cpu/cpu0/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu1/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
   echo '1' > /sys/devices/system/cpu/cpu1/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu2/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
   echo '1' > /sys/devices/system/cpu/cpu2/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu3/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
   echo '1' > /sys/devices/system/cpu/cpu3/online
fi

echo 'performance' > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

And this to set back to default (name “performance_set_default.sh”):

#!/bin/bash

echo '1' > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable

g_ONLINE="$(cat /sys/devices/system/cpu/cpu0/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
   echo '1' > /sys/devices/system/cpu/cpu0/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu1/online)";
if [[ "${g_ONLINE}" != '0' ]]; then
   echo '0' > /sys/devices/system/cpu/cpu1/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu2/online)";
if [[ "${g_ONLINE}" != '0' ]]; then
   echo '0' > /sys/devices/system/cpu/cpu2/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu3/online)";
if [[ "${g_ONLINE}" != '0' ]]; then
   echo '0' > /sys/devices/system/cpu/cpu3/online
fi

echo 'interactive' > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

If you want to, you could create an edited version of one of these to purposely force a lower performance mode, e.g., take the third and fourth cores offline and run only on the first two cores. I don’t believe any direct kernel edit would be required since these settings are available through “/sys” (you must use root, so “sudo”).
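
A reduced-performance variant along those lines might look like the sketch below. The function name set_two_cores and the CPU_SYSFS override are hypothetical (the override only exists so the function can be tried against a scratch directory); the sysfs files are the same ones the scripts above use. cpuquiet is disabled, as in performance_set_max.sh, on the assumption that it would otherwise change core state on its own. The scaling governor is left untouched.

```bash
#!/bin/bash

# Sketch: keep cpu0 and cpu1 online, take cpu2 and cpu3 offline.
# CPU_SYSFS is a hypothetical override for testing; on the board the
# default path below is used. Must be run as root on the real /sys.
set_two_cores() {
    local root="${CPU_SYSFS:-/sys/devices/system/cpu}"
    local n

    # Assumption: with cpuquiet enabled the manual settings may not stick.
    echo '0' > "${root}/cpuquiet/tegra_cpuquiet/enable"

    # First two cores online.
    for n in 0 1; do
        if [[ "$(cat "${root}/cpu${n}/online")" != '1' ]]; then
            echo '1' > "${root}/cpu${n}/online"
        fi
    done

    # Third and fourth cores offline.
    for n in 2 3; do
        if [[ "$(cat "${root}/cpu${n}/online")" != '0' ]]; then
            echo '0' > "${root}/cpu${n}/online"
        fi
    done
}
```

To use it, call set_two_cores at the end of the script and run the script with sudo. Running performance_set_default.sh afterwards restores the default state.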

I really like monitoring memory and other usage on command line via htop while waiting for some sort of failure…if the system freezes you will literally get a nice snapshot of the last moment.

I mention performance mostly because it has influence on heat, and heat is the most likely point of failure if the system is actually working as expected, but failing in some particular circumstance such as cutting back on a power rail.

Thanks for your reply. The following messages are output by the kernel:

[ 25.422028] ------------[ cut here ]------------
[ 25.422031] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3156 gk20a_pmu_enable_elpg+0x1d8/0x2b8()
[ 25.422037] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.422040] CPU: 1 PID: 1363 Comm: gnome-session-c Tainted: G W 3.10.40-ga7da876 #8
[ 25.422046] [] (unwind_backtrace+0x0/0x13c) from [] (show_stack+0x18/0x1c)
[ 25.422050] [] (show_stack+0x18/0x1c) from [] (warn_slowpath_common+0x5c/0x74)
[ 25.422054] [] (warn_slowpath_common+0x5c/0x74) from [] (warn_slowpath_null+0x24/0x2c)
[ 25.422058] [] (warn_slowpath_null+0x24/0x2c) from [] (gk20a_pmu_enable_elpg+0x1d8/0x2b8)
[ 25.422062] [] (gk20a_pmu_enable_elpg+0x1d8/0x2b8) from [] (gk20a_alloc_obj_ctx+0x8e8/0xbd4)
[ 25.422066] [] (gk20a_alloc_obj_ctx+0x8e8/0xbd4) from [] (gk20a_channel_ioctl+0x6a8/0x10cc)
[ 25.422071] [] (gk20a_channel_ioctl+0x6a8/0x10cc) from [] (do_vfs_ioctl+0x3f8/0x5b8)
[ 25.422075] [] (do_vfs_ioctl+0x3f8/0x5b8) from [] (SyS_ioctl+0x58/0x168)
[ 25.422080] [] (SyS_ioctl+0x58/0x168) from [] (ret_fast_syscall+0x0/0x30)
[ 25.422082] ---[ end trace e0184c4e90ebc46e ]---
[ 25.422086] gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1
[ 25.422087] ------------[ cut here ]------------
[ 25.422089] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3192 gk20a_pmu_disable_elpg+0x80/0x2e0()
[ 25.422096] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.422099] CPU: 1 PID: 1363 Comm: gnome-session-c Tainted: G W 3.10.40-ga7da876 #8
[ 25.422104] [] (unwind_backtrace+0x0/0x13c) from [] (show_stack+0x18/0x1c)
[ 25.422109] [] (show_stack+0x18/0x1c) from [] (warn_slowpath_common+0x5c/0x74)
[ 25.422113] [] (warn_slowpath_common+0x5c/0x74) from [] (warn_slowpath_null+0x24/0x2c)
[ 25.422279] [] (warn_slowpath_null+0x24/0x2c) from [] (gk20a_pmu_disable_elpg+0x80/0x2e0)
[ 25.422285] [] (gk20a_pmu_disable_elpg+0x80/0x2e0) from [] (gk20a_alloc_obj_ctx+0x568/0xbd4)
[ 25.422289] [] (gk20a_alloc_obj_ctx+0x568/0xbd4) from [] (gk20a_channel_ioctl+0x6a8/0x10cc)
[ 25.422296] [] (gk20a_channel_ioctl+0x6a8/0x10cc) from [] (do_vfs_ioctl+0x3f8/0x5b8)
[ 25.422301] [] (do_vfs_ioctl+0x3f8/0x5b8) from [] (SyS_ioctl+0x58/0x168)
[ 25.422306] [] (SyS_ioctl+0x58/0x168) from [] (ret_fast_syscall+0x0/0x30)
[ 25.422309] ---[ end trace e0184c4e90ebc46f ]---
[ 25.422490] gk20a gk20a.0: gk20a_pmu_enable_elpg: gk20a_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2
[ 25.422520] ------------[ cut here ]------------
[ 25.422525] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3156 gk20a_pmu_enable_elpg+0x1d8/0x2b8()
[ 25.422533] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.422536] CPU: 1 PID: 1363 Comm: gnome-session-c Tainted: G W 3.10.40-ga7da876 #8
[ 25.422543] [] (unwind_backtrace+0x0/0x13c) from [] (show_stack+0x18/0x1c)
[ 25.422547] [] (show_stack+0x18/0x1c) from [] (warn_slowpath_common+0x5c/0x74)
[ 25.422550] [] (warn_slowpath_common+0x5c/0x74) from [] (warn_slowpath_null+0x24/0x2c)
[ 25.422554] [] (warn_slowpath_null+0x24/0x2c) from [] (gk20a_pmu_enable_elpg+0x1d8/0x2b8)
[ 25.422559] [] (gk20a_pmu_enable_elpg+0x1d8/0x2b8) from [] (gk20a_alloc_obj_ctx+0x66c/0xbd4)
[ 25.422563] [] (gk20a_alloc_obj_ctx+0x66c/0xbd4) from [] (gk20a_channel_ioctl+0x6a8/0x10cc)
[ 25.422567] [] (gk20a_channel_ioctl+0x6a8/0x10cc) from [] (do_vfs_ioctl+0x3f8/0x5b8)
[ 25.422571] [] (do_vfs_ioctl+0x3f8/0x5b8) from [] (SyS_ioctl+0x58/0x168)
[ 25.422576] [] (SyS_ioctl+0x58/0x168) from [] (ret_fast_syscall+0x0/0x30)
[ 25.422577] ---[ end trace e0184c4e90ebc470 ]---
[ 25.429992] gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1
[ 25.429994] ------------[ cut here ]------------
[ 25.430001] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3192 gk20a_pmu_disable_elpg+0x80/0x2e0()
[ 25.430010] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.430014] CPU: 1 PID: 78 Comm: irq/189-gk20a_s Tainted: G W 3.10.40-ga7da876 #8
[ 25.430025] [] (unwind_backtrace+0x0/0x13c) from [] (show_stack+0x18/0x1c)
[ 25.430031] [] (show_stack+0x18/0x1c) from [] (warn_slowpath_common+0x5c/0x74)
[ 25.430035] [] (warn_slowpath_common+0x5c/0x74) from [] (warn_slowpath_null+0x24/0x2c)
[ 25.430039] [] (warn_slowpath_null+0x24/0x2c) from [] (gk20a_pmu_disable_elpg+0x80/0x2e0)
[ 25.430045] [] (gk20a_pmu_disable_elpg+0x80/0x2e0) from [] (gk20a_intr_thread_stall+0xd0/0x5f8)
[ 25.430050] [] (gk20a_intr_thread_stall+0xd0/0x5f8) from [] (irq_thread+0x114/0x168)
[ 25.430055] [] (irq_thread+0x114/0x168) from [] (kthread+0xd4/0xd8)
[ 25.430060] [] (kthread+0xd4/0xd8) from [] (ret_from_fork+0x14/0x20)
[ 25.430062] ---[ end trace e0184c4e90ebc471 ]---
[ 25.430082] gk20a gk20a.0: gk20a_pmu_enable_elpg: gk20a_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2
[ 25.430084] ------------[ cut here ]------------
[ 25.430087] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3156 gk20a_pmu_enable_elpg+0x1d8/0x2b8()
[ 25.430094] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.430097] CPU: 1 PID: 78 Comm: irq/189-gk20a_s Tainted: G W 3.10.40-ga7da876 #8
[ 25.430103] [] (unwind_backtrace+0x0/0x13c) from [] (show_stack+0x18/0x1c)
[ 25.430107] [] (show_stack+0x18/0x1c) from [] (warn_slowpath_common+0x5c/0x74)
[ 25.430112] [] (warn_slowpath_common+0x5c/0x74) from [] (warn_slowpath_null+0x24/0x2c)
[ 25.430116] [] (warn_slowpath_null+0x24/0x2c) from [] (gk20a_pmu_enable_elpg+0x1d8/0x2b8)
[ 25.430121] [] (gk20a_pmu_enable_elpg+0x1d8/0x2b8) from [] (gk20a_intr_thread_stall+0xe8/0x5f8)
[ 25.430126] [] (gk20a_intr_thread_stall+0xe8/0x5f8) from [] (irq_thread+0x114/0x168)
[ 25.430129] [] (irq_thread+0x114/0x168) from [] (kthread+0xd4/0xd8)
[ 25.430134] [] (kthread+0xd4/0xd8) from [] (ret_from_fork+0x14/0x20)
[ 25.430135] ---[ end trace e0184c4e90ebc472 ]---

I don’t know how to locate the issue.

Thanks & best regards!

It seems to be the video driver complaining, but it doesn’t say anything specific beyond:

gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1

Did performance mode cause this to occur sooner? Can you compare how long this takes with and without a second fan cooling the whole board? Perhaps someone with access to the video driver source code could look at what might cause the reference count mismatch. Any information on what is running when it fails would be useful.
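
To compare runs (default vs. performance mode, with and without extra cooling), it may help to pull the kernel timestamp of the first mismatch message out of a saved log. The sketch below is only an illustration: the function name first_elpg_time is hypothetical, and it assumes log lines in the same format as the output quoted above, read from standard input.

```bash
#!/bin/bash

# Sketch: print the kernel timestamp (seconds since boot) of the first
# "possible elpg refcnt mismatch" line found on standard input.
# Prints nothing if no such line is present.
first_elpg_time() {
    sed -n 's/^\[ *\([0-9.]*\)\].*possible elpg refcnt mismatch.*/\1/p' | head -n 1
}
```

For example, "dmesg | first_elpg_time" on a running board, or "first_elpg_time < console.log" on a captured serial console log (the file name here is just a placeholder).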

Thank you so much,
The TK1 runs in default mode. When the above messages are output, the system hangs.
The time to failure differs from board to board: in general 10 to 20 hours, though a few boards can run for up to several days.

Without some sort of way of knowing what influences the failure it won’t be possible to debug. Would you install package “lm-sensors” (“sudo apt-get install lm-sensors”) on a failed unit, and then run this command in a terminal until it fails (preferably an ssh connection or a serial console because we don’t want screen blanking interfering after it fails):

watch -n 1 sensors

When the unit fails this should be showing the final value of some temperatures.
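
If nobody is watching the terminal when the hang happens, the same readings can be appended to a file instead. The sketch below is one possible way to do that; the function name log_samples and the log path in the example are hypothetical. It runs whatever command it is given (for instance sensors), prefixes each sample with a timestamp, and calls sync after every sample so the last reading has a better chance of reaching the disk before the freeze.

```bash
#!/bin/bash

# Sketch: log_samples LOGFILE INTERVAL COUNT COMMAND [ARGS...]
# Appends timestamped output of COMMAND to LOGFILE every INTERVAL seconds.
# COUNT=0 means "run until interrupted".
log_samples() {
    local logfile="$1" interval="$2" count="$3"
    shift 3
    local i=0

    while [[ "${count}" -eq 0 || "${i}" -lt "${count}" ]]; do
        {
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
            "$@"
        } >> "${logfile}" 2>&1
        # Flush to storage so the newest sample survives a hang.
        sync
        i=$((i + 1))
        sleep "${interval}"
    done
}

# Example (hypothetical path): log_samples /home/ubuntu/sensors.log 1 0 sensors
```

Syncing once per second adds some write load, so a longer interval may be preferable for runs that last several days.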

Also, verify this shows everything ok:

sha1sum -c /etc/nv_tegra_release

Thanks,

By the way, how do I set the GPU to a fixed frequency, such as 756000000 Hz?

I use the following commands:

echo 756000000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state

but I cannot find the file. I only find:

/sys/kernel/tegra_gpu/gpu_rate

but it cannot be written!

Thanks.

The rates are in kHz, not Hz, so your rate is probably being rejected as out of range. The information I am using is from here, but this too uses Hz, and I am guessing it is from an older L4T:
https://elinux.org/Jetson/Performance#Controlling_GPU_performance

Notice the test to run prior to changing the clock:

cat /sys/kernel/debug/clock/gbus/possible_rates

On my R21.6 system I get:

72000 108000 180000 252000 324000 396000 468000 540000 612000 648000 684000 708000 756000 804000 852000 (kHz)
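
A quick way to check a value against that list before writing it is sketched below. The function name rate_supported is hypothetical. It only compares the requested number with the entries of the list as printed (assumed here to be kHz, as the output above indicates); it does not settle which unit the override file itself expects.

```bash
#!/bin/bash

# Sketch: rate_supported WANTED "LIST"
# Returns 0 if WANTED appears as an entry in the whitespace-separated LIST,
# 1 otherwise. Non-numeric tokens such as "(kHz)" simply never match.
rate_supported() {
    local wanted="$1" rates="$2" r
    for r in ${rates}; do
        if [[ "${r}" == "${wanted}" ]]; then
            return 0
        fi
    done
    return 1
}

# Example (as root):
#   rate_supported 756000 "$(cat /sys/kernel/debug/clock/gbus/possible_rates)" && echo ok
```

With the list shown above, 756000 is found while 756000000 is not, which matches the unit mismatch suspected earlier.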

Thanks,
But I cannot find /sys/kernel/debug/clock/gbus/possible_rates on our TK1 system.
Could it be that the kernel is not configured for it?

We wonder whether dynamic rate changes could cause the system hang. Also, how can we change the rate in the kernel source?

Thanks!

About the missing “/sys/kernel/debug/clock/gbus/possible_rates”: Normally I would suggest this depends on which L4T version you use because kernel features will change depending on version, but it appears we are both using R21.6.

That file is a reflection of a driver or kernel feature…this tells me you are missing a driver or kernel feature. Is this fully R21.6 L4T? Is the kernel custom compiled? If it is custom compiled, the “/sys” content may be missing because a configuration option was left out of the build.

I use R21.6 L4T, and the kernel configuration is tegra12_defconfig. Could the filesystem affect it?

Hi,

May I ask what is “tk1 demo board”? Is that a tk1 dev kit?

Thanks!

I found that the file is visible when logged in as root; when logged in as ubuntu, the file is not visible.
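
That behavior is what one would expect if “/sys/kernel/debug” is readable by root only, which is typically the case for debugfs. From the ubuntu account the files can be reached through sudo, with one catch: in “sudo echo VALUE > FILE” the redirection is performed by the calling, non-root shell, so the write still fails. Piping into “sudo tee” avoids this. The helper below is a sketch; the name write_as_root and the SUDO override (only there so it can be tried on ordinary files without sudo) are hypothetical.

```bash
#!/bin/bash

# Sketch: write_as_root VALUE FILE
# Writes VALUE to FILE with root privileges via "sudo tee".
# Setting SUDO to an empty string (hypothetical override) skips sudo.
write_as_root() {
    printf '%s\n' "$1" | ${SUDO-sudo} tee "$2" > /dev/null
}

# Example, using the values and paths quoted earlier in this thread
# (the correct unit for the rate is still an open question here):
#   write_as_root 756000000 /sys/kernel/debug/clock/override.gbus/rate
#   write_as_root 1 /sys/kernel/debug/clock/override.gbus/state
```

Reading works the same way, e.g. “sudo cat /sys/kernel/debug/clock/gbus/possible_rates”.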

Yes, it is a TK1 dev kit. After our application runs for a few hours, the TK1 hangs. Could you help us solve the problem?

Thanks!

Could you share your application? If you cannot reveal it, could you describe what it is for?