TK1 demo board hang
Hi all,
We are developing an application on the NVIDIA TK1 demo board (test samples: 20 pcs).
Kernel source: r21.6.0
The TK1 hangs after running for about 20 hours and outputs this warning message: possible elpg refcnt mismatch. elpg refcnt=2
...

We want to know how to locate the issue: is it hardware, software, or a kernel module?

Best regards!

2017/11/02

#1
Posted 11/02/2017 07:07 AM   
Don't know, but is there some specific program running when this happens? If you run htop is there anything noticeable sticking out (X shows virtual memory so the huge allocation is correct...this isn't actual usage, it's just what X is capable of using)?

#2
Posted 11/02/2017 11:50 PM   
Hi all,
I checked VDD of the CPU and GPU. When the error occurs, VDD is 0.82 V. I want to know: does VDD cause the error, or does the error affect VDD?

#3
Posted 11/03/2017 05:03 AM   
I would guess that heat has an influence on this, but whether heat causes the error or the error causes heat I don't know. What is the environment like? Certainly the system load would influence this, and if a bug causes an increased load then heat would go up.

I am wondering if you might be able to try this same thing with an external fan keeping the entire board a bit cooler. Or the reverse: put the Jetson in a box which will allow heat to build up a bit sooner, to see if the error hits faster.

Also, are you running in a performance mode, or is this a default for performance?

#4
Posted 11/03/2017 05:14 PM   
The TK1 runs under normal conditions (about 20 degrees). We built kernel r21.6 directly and flashed it onto the board.
As for default or performance mode: how do we change this in the kernel source? We are considering setting it to full-performance mode.

#5
Posted 11/04/2017 03:07 AM   
You will find a list of performance topics here:
https://elinux.org/Jetson_TK1#Performance_and_Power_Topics

I use this script to list some performance information (name "performance_ls.sh"):
#!/bin/bash

echo -n "CPUquiet: ";
cat /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
echo -n "Scaling Governor: ";
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo -n "CPU 0 online: ";
cat /sys/devices/system/cpu/cpu0/online
echo -n "CPU 1 online: ";
cat /sys/devices/system/cpu/cpu1/online
echo -n "CPU 2 online: ";
cat /sys/devices/system/cpu/cpu2/online
echo -n "CPU 3 online: ";
cat /sys/devices/system/cpu/cpu3/online


This will set for increased performance (name "performance_set_max.sh"):
#!/bin/bash

echo '0' > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable

g_ONLINE="$(cat /sys/devices/system/cpu/cpu0/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
    echo '1' > /sys/devices/system/cpu/cpu0/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu1/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
    echo '1' > /sys/devices/system/cpu/cpu1/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu2/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
    echo '1' > /sys/devices/system/cpu/cpu2/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu3/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
    echo '1' > /sys/devices/system/cpu/cpu3/online
fi

echo 'performance' > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor


And this to set back to default (name "performance_set_default.sh"):
#!/bin/bash

echo '1' > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable

g_ONLINE="$(cat /sys/devices/system/cpu/cpu0/online)";
if [[ "${g_ONLINE}" != '1' ]]; then
    echo '1' > /sys/devices/system/cpu/cpu0/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu1/online)";
if [[ "${g_ONLINE}" != '0' ]]; then
    echo '0' > /sys/devices/system/cpu/cpu1/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu2/online)";
if [[ "${g_ONLINE}" != '0' ]]; then
    echo '0' > /sys/devices/system/cpu/cpu2/online
fi

g_ONLINE="$(cat /sys/devices/system/cpu/cpu3/online)";
if [[ "${g_ONLINE}" != '0' ]]; then
    echo '0' > /sys/devices/system/cpu/cpu3/online
fi

echo 'interactive' > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor


If you want to you could create an edited version of one of these to purposely force into a lower performance mode, e.g., take the third and fourth cores offline and run only on the first two cores. I don't believe any direct kernel edit would be required since these settings are available through "/sys" (you must use root, so "sudo").
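
As a rough illustration of such an edited version, here is a minimal sketch that keeps the first two cores online and takes the third and fourth offline. It reuses the "/sys" paths from the scripts above. The CPU_ROOT variable, the function names, and the "--apply" switch are my own additions for illustration, and disabling cpuquiet first is an assumption (so that it does not bring cores back online on its own):

```shell
#!/bin/bash
# Sketch only: run on cores 0-1, with cores 2-3 offline.
# CPU_ROOT is a hypothetical override so the logic can be tried on a scratch directory.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Write a value to a sysfs-style file only if it differs from the current one.
set_if_changed() {
    local file="$1" want="$2"
    if [[ "$(cat "${file}")" != "${want}" ]]; then
        echo "${want}" > "${file}"
    fi
}

two_core_mode() {
    # Assumption: cpuquiet is disabled so it does not change core state afterwards.
    set_if_changed "${CPU_ROOT}/cpuquiet/tegra_cpuquiet/enable" 0
    set_if_changed "${CPU_ROOT}/cpu0/online" 1
    set_if_changed "${CPU_ROOT}/cpu1/online" 1
    set_if_changed "${CPU_ROOT}/cpu2/online" 0
    set_if_changed "${CPU_ROOT}/cpu3/online" 0
}

# Only touch the system when explicitly asked to (hypothetical "--apply" switch).
if [[ "${1:-}" == "--apply" ]]; then
    two_core_mode
fi
```

Saved under a name of your choice, it would be run as root with the "--apply" argument; without the argument it only defines the functions.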

I really like monitoring memory and other usage on command line via htop while waiting for some sort of failure...if the system freezes you will literally get a nice snapshot of the last moment.
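
If nobody is watching the terminal when it freezes, a similar "last moment" record could be kept on disk. This is only a sketch of that idea, not something I have run on a TK1: it appends load average and free memory to a log file and calls sync after each line so the last entry has a chance of surviving a hard hang. The PROC_ROOT variable and the function names are hypothetical:

```shell
#!/bin/bash
# Sketch only: periodic load/memory snapshots flushed to disk.
# PROC_ROOT is a hypothetical override for trying this against sample files.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print one timestamped line with the load averages and MemFree (kB).
snapshot_line() {
    local load mem_free
    load="$(cut -d' ' -f1-3 "${PROC_ROOT}/loadavg")"
    mem_free="$(awk '/^MemFree:/ { print $2 }' "${PROC_ROOT}/meminfo")"
    echo "$(date '+%F %T') load=${load} MemFree_kB=${mem_free}"
}

# Append COUNT snapshots to LOGFILE, one every INTERVAL seconds.
log_snapshots() {
    local logfile="$1" interval="$2" count="$3" i
    for (( i = 0; i < count; i++ )); do
        snapshot_line >> "${logfile}"
        sync    # flush so the most recent line is on disk if the board hangs
        sleep "${interval}"
    done
}
```

For example, `log_snapshots /var/log/tk1_snapshots.log 5 100000` would log every 5 seconds; the log path here is just a placeholder.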

I mention performance mostly because it has influence on heat, and heat is the most likely point of failure if the system is actually working as expected, but failing in some particular circumstance such as cutting back on a power rail.

#6
Posted 11/04/2017 08:42 PM   
Thanks for your reply. The following messages are output from the kernel:
--------------------------------------------------------------------
[ 25.422028] ------------[ cut here ]------------
[ 25.422031] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3156 gk20a_pmu_enable_elpg+0x1d8/0x2b8()
[ 25.422037] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.422040] CPU: 1 PID: 1363 Comm: gnome-session-c Tainted: G W 3.10.40-ga7da876 #8
[ 25.422046] [<c0016370>] (unwind_backtrace+0x0/0x13c) from [<c0012c0c>] (show_stack+0x18/0x1c)
[ 25.422050] [<c0012c0c>] (show_stack+0x18/0x1c) from [<c006297c>] (warn_slowpath_common+0x5c/0x74)
[ 25.422054] [<c006297c>] (warn_slowpath_common+0x5c/0x74) from [<c0062a48>] (warn_slowpath_null+0x24/0x2c)
[ 25.422058] [<c0062a48>] (warn_slowpath_null+0x24/0x2c) from [<c03f9718>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8)
[ 25.422062] [<c03f9718>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8) from [<c03df940>] (gk20a_alloc_obj_ctx+0x8e8/0xbd4)
[ 25.422066] [<c03df940>] (gk20a_alloc_obj_ctx+0x8e8/0xbd4) from [<c03cf1c8>] (gk20a_channel_ioctl+0x6a8/0x10cc)
[ 25.422071] [<c03cf1c8>] (gk20a_channel_ioctl+0x6a8/0x10cc) from [<c0158b20>] (do_vfs_ioctl+0x3f8/0x5b8)
[ 25.422075] [<c0158b20>] (do_vfs_ioctl+0x3f8/0x5b8) from [<c0158d38>] (SyS_ioctl+0x58/0x168)
[ 25.422080] [<c0158d38>] (SyS_ioctl+0x58/0x168) from [<c000ed00>] (ret_fast_syscall+0x0/0x30)
[ 25.422082] ---[ end trace e0184c4e90ebc46e ]---
[ 25.422086] gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1
[ 25.422087] ------------[ cut here ]------------
[ 25.422089] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3192 gk20a_pmu_disable_elpg+0x80/0x2e0()
[ 25.422096] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.422099] CPU: 1 PID: 1363 Comm: gnome-session-c Tainted: G W 3.10.40-ga7da876 #8
[ 25.422104] [<c0016370>] (unwind_backtrace+0x0/0x13c) from [<c0012c0c>] (show_stack+0x18/0x1c)
[ 25.422109] [<c0012c0c>] (show_stack+0x18/0x1c) from [<c006297c>] (warn_slowpath_common+0x5c/0x74)
[ 25.422113] [<c006297c>] (warn_slowpath_common+0x5c/0x74) from [<c0062a48>] (warn_slowpath_null+0x24/0x2c)
[ 25.422279] [<c0062a48>] (warn_slowpath_null+0x24/0x2c) from [<c03fa108>] (gk20a_pmu_disable_elpg+0x80/0x2e0)
[ 25.422285] [<c03fa108>] (gk20a_pmu_disable_elpg+0x80/0x2e0) from [<c03df5c0>] (gk20a_alloc_obj_ctx+0x568/0xbd4)
[ 25.422289] [<c03df5c0>] (gk20a_alloc_obj_ctx+0x568/0xbd4) from [<c03cf1c8>] (gk20a_channel_ioctl+0x6a8/0x10cc)
[ 25.422296] [<c03cf1c8>] (gk20a_channel_ioctl+0x6a8/0x10cc) from [<c0158b20>] (do_vfs_ioctl+0x3f8/0x5b8)
[ 25.422301] [<c0158b20>] (do_vfs_ioctl+0x3f8/0x5b8) from [<c0158d38>] (SyS_ioctl+0x58/0x168)
[ 25.422306] [<c0158d38>] (SyS_ioctl+0x58/0x168) from [<c000ed00>] (ret_fast_syscall+0x0/0x30)
[ 25.422309] ---[ end trace e0184c4e90ebc46f ]---
[ 25.422490] gk20a gk20a.0: gk20a_pmu_enable_elpg: gk20a_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2
[ 25.422520] ------------[ cut here ]------------
[ 25.422525] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3156 gk20a_pmu_enable_elpg+0x1d8/0x2b8()
[ 25.422533] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.422536] CPU: 1 PID: 1363 Comm: gnome-session-c Tainted: G W 3.10.40-ga7da876 #8
[ 25.422543] [<c0016370>] (unwind_backtrace+0x0/0x13c) from [<c0012c0c>] (show_stack+0x18/0x1c)
[ 25.422547] [<c0012c0c>] (show_stack+0x18/0x1c) from [<c006297c>] (warn_slowpath_common+0x5c/0x74)
[ 25.422550] [<c006297c>] (warn_slowpath_common+0x5c/0x74) from [<c0062a48>] (warn_slowpath_null+0x24/0x2c)
[ 25.422554] [<c0062a48>] (warn_slowpath_null+0x24/0x2c) from [<c03f9718>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8)
[ 25.422559] [<c03f9718>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8) from [<c03df6c4>] (gk20a_alloc_obj_ctx+0x66c/0xbd4)
[ 25.422563] [<c03df6c4>] (gk20a_alloc_obj_ctx+0x66c/0xbd4) from [<c03cf1c8>] (gk20a_channel_ioctl+0x6a8/0x10cc)
[ 25.422567] [<c03cf1c8>] (gk20a_channel_ioctl+0x6a8/0x10cc) from [<c0158b20>] (do_vfs_ioctl+0x3f8/0x5b8)
[ 25.422571] [<c0158b20>] (do_vfs_ioctl+0x3f8/0x5b8) from [<c0158d38>] (SyS_ioctl+0x58/0x168)
[ 25.422576] [<c0158d38>] (SyS_ioctl+0x58/0x168) from [<c000ed00>] (ret_fast_syscall+0x0/0x30)
[ 25.422577] ---[ end trace e0184c4e90ebc470 ]---
[ 25.429992] gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1
[ 25.429994] ------------[ cut here ]------------
[ 25.429992] gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1
[ 25.429994] ------------[ cut here ]------------
[ 25.430001] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3192 gk20a_pmu_disable_elpg+0x80/0x2e0()
[ 25.430010] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.430014] CPU: 1 PID: 78 Comm: irq/189-gk20a_s Tainted: G W 3.10.40-ga7da876 #8
[ 25.430025] [<c0016370>] (unwind_backtrace+0x0/0x13c) from [<c0012c0c>] (show_stack+0x18/0x1c)
[ 25.430031] [<c0012c0c>] (show_stack+0x18/0x1c) from [<c006297c>] (warn_slowpath_common+0x5c/0x74)
[ 25.430035] [<c006297c>] (warn_slowpath_common+0x5c/0x74) from [<c0062a48>] (warn_slowpath_null+0x24/0x2c)
[ 25.430039] [<c0062a48>] (warn_slowpath_null+0x24/0x2c) from [<c03fa108>] (gk20a_pmu_disable_elpg+0x80/0x2e0)
[ 25.430045] [<c03fa108>] (gk20a_pmu_disable_elpg+0x80/0x2e0) from [<c03c2430>] (gk20a_intr_thread_stall+0xd0/0x5f8)
[ 25.430050] [<c03c2430>] (gk20a_intr_thread_stall+0xd0/0x5f8) from [<c00d2cb0>] (irq_thread+0x114/0x168)
[ 25.430055] [<c00d2cb0>] (irq_thread+0x114/0x168) from [<c0089148>] (kthread+0xd4/0xd8)
[ 25.430060] [<c0089148>] (kthread+0xd4/0xd8) from [<c000ed98>] (ret_from_fork+0x14/0x20)
[ 25.430062] ---[ end trace e0184c4e90ebc471 ]---
[ 25.430082] gk20a gk20a.0: gk20a_pmu_enable_elpg: gk20a_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2
[ 25.430084] ------------[ cut here ]------------
[ 25.430087] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3156 gk20a_pmu_enable_elpg+0x1d8/0x2b8()
[ 25.430094] Modules linked in: dm_crypt dm_mod rfcomm bnep bluetooth rfkill
[ 25.430097] CPU: 1 PID: 78 Comm: irq/189-gk20a_s Tainted: G W 3.10.40-ga7da876 #8
[ 25.430103] [<c0016370>] (unwind_backtrace+0x0/0x13c) from [<c0012c0c>] (show_stack+0x18/0x1c)
[ 25.430107] [<c0012c0c>] (show_stack+0x18/0x1c) from [<c006297c>] (warn_slowpath_common+0x5c/0x74)
[ 25.430112] [<c006297c>] (warn_slowpath_common+0x5c/0x74) from [<c0062a48>] (warn_slowpath_null+0x24/0x2c)
[ 25.430116] [<c0062a48>] (warn_slowpath_null+0x24/0x2c) from [<c03f9718>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8)
[ 25.430121] [<c03f9718>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8) from [<c03c2448>] (gk20a_intr_thread_stall+0xe8/0x5f8)
[ 25.430126] [<c03c2448>] (gk20a_intr_thread_stall+0xe8/0x5f8) from [<c00d2cb0>] (irq_thread+0x114/0x168)
[ 25.430129] [<c00d2cb0>] (irq_thread+0x114/0x168) from [<c0089148>] (kthread+0xd4/0xd8)
[ 25.430134] [<c0089148>] (kthread+0xd4/0xd8) from [<c000ed98>] (ret_from_fork+0x14/0x20)
[ 25.430135] ---[ end trace e0184c4e90ebc472 ]---
--------------------------------------------------------------------
I don't know how to locate the issue.

Thanks & best regards!

#7
Posted 11/06/2017 02:25 AM   
It seems to be the video driver complaining, but it doesn't say anything specific beyond:
gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1


Did performance mode cause this to occur sooner? Can you compare how long this takes with and without a second fan cooling the whole board? Perhaps someone with access to the video driver source code could look at what might cause the reference count mismatch. Any information on what is running when it fails would be useful.
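
To compare boards or runs, it may help to count how often each variant of the message shows up. The following is just a sketch of a filter for that, written against the message format in the log posted above; the function name is made up:

```shell
#!/bin/bash
# Sketch only: tally "possible elpg refcnt mismatch" messages from kernel log text on stdin.
# Output lines look like "<count> <enable|disable> refcnt=<n>".
elpg_summary() {
    grep -o 'gk20a_pmu_[a-z]*_elpg(): possible elpg refcnt mismatch. elpg refcnt=[0-9]*' \
        | sed -E 's/^gk20a_pmu_([a-z]+)_elpg\(\).*refcnt=([0-9]+)$/\1 refcnt=\2/' \
        | sort | uniq -c
}
```

Typical use would be `dmesg | elpg_summary`, or feeding it a saved serial console log.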

#8
Posted 11/06/2017 03:54 AM   
Thank you so much,
The TK1 runs in default mode. When the above message is output, the system hangs.
The time until this happens differs from board to board: generally 10~20 hours, though a few boards can run for up to several days.

#9
Posted 11/06/2017 09:07 AM   
Without some sort of way of knowing what influences the failure it won't be possible to debug. Would you install package "lm-sensors" ("sudo apt-get install lm-sensors") on a failed unit, and then run this command in a terminal until it fails (preferably an ssh connection or a serial console because we don't want screen blanking interfering after it fails):
watch -n 1 sensors


When the unit fails this should be showing the final value of some temperatures.

Also, verify this shows everything ok:
sha1sum -c /etc/nv_tegra_release
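
If lm-sensors is not convenient, temperatures can usually also be read straight from the kernel's thermal zones, where each "temp" file holds millidegrees Celsius. This is a sketch under the assumption that your kernel exposes zones under /sys/class/thermal; THERMAL_ROOT and the function name are hypothetical:

```shell
#!/bin/bash
# Sketch only: print one "name=<temp>C" line per thermal zone.
# THERMAL_ROOT is a hypothetical override for trying this against sample files.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

read_temps() {
    local zone name milli
    for zone in "${THERMAL_ROOT}"/thermal_zone*; do
        # Skip zones without a readable temperature (also covers "no zones found").
        [[ -r "${zone}/temp" ]] || continue
        name="$(cat "${zone}/type" 2>/dev/null || basename "${zone}")"
        milli="$(cat "${zone}/temp")"
        echo "${name}=$(( milli / 1000 ))C"   # millidegrees C to whole degrees C
    done
}
```

Calling `read_temps` once per second over ssh or a serial console, and appending the output to a file, would leave the last readings available after the hang.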

#10
Posted 11/06/2017 08:01 PM   
Thanks,

Anyway, how do I set the GPU to a fixed frequency, such as 756000000 Hz?

I tried the following commands:
echo 756000000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state

but I cannot find the file. I can only find:
/sys/kernel/tegra_gpu/gpu_rate

but it cannot be written!

Thanks.

#11
Posted 11/08/2017 11:04 AM   
The rates are in kHz, not Hz, so your rate is probably being rejected as out of range. The information I am using is from here, but this too uses Hz, and I am guessing it is from an older L4T:
https://elinux.org/Jetson/Performance#Controlling_GPU_performance

Notice the test to run prior to changing the clock:
cat /sys/kernel/debug/clock/gbus/possible_rates


On my R21.6 system I get:
72000 108000 180000 252000 324000 396000 468000 540000 612000 648000 684000 708000 756000 804000 852000 (kHz)
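
A quick way to check a value against that list before trying to set it could look like the sketch below. It only tests membership in the possible_rates output; whether override.gbus/rate expects the same unit is something I have not confirmed, so the sketch deliberately does not write anything. The function name is made up:

```shell
#!/bin/bash
# Sketch only: succeed if WANT appears as a whole entry in a possible_rates file.
rate_is_supported() {
    local rates_file="$1" want="$2"
    # Split the list into one token per line, then look for an exact match.
    tr -s '[:space:]' '\n' < "${rates_file}" | grep -qxF -- "${want}"
}
```

For example, `rate_is_supported /sys/kernel/debug/clock/gbus/possible_rates 756000 && echo listed` prints "listed" only if 756000 is one of the entries.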

#12
Posted 11/08/2017 03:39 PM   
Thanks,
But I cannot find /sys/kernel/debug/clock/gbus/possible_rates on our TK1 system.
Could it be that the kernel is not configured for it?

#13
Posted 11/09/2017 01:35 AM   
We wonder whether dynamically changing the rate could cause the system hang. Also, how can the rate be changed in the kernel source?

Thanks!

#14
Posted 11/09/2017 12:37 PM   
About the missing "/sys/kernel/debug/clock/gbus/possible_rates": Normally I would suggest this depends on which L4T version you use because kernel features will change depending on version, but it appears we are both using R21.6.

That file is a reflection of a driver or kernel feature, so its absence tells me you are missing a driver or kernel feature. Is this fully R21.6 L4T? Is the kernel custom compiled? If it is custom compiled, the "/sys" content may be missing because the relevant configuration option was not enabled at compile time.
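
Since everything under /sys/kernel/debug comes from debugfs, one thing worth checking is whether debugfs is mounted at all and whether the kernel was built with CONFIG_DEBUG_FS. A sketch of both checks follows; the function names are made up, and reading the running kernel's config from /proc/config.gz only works if the kernel provides that file:

```shell
#!/bin/bash
# Sketch only: two quick checks for a missing /sys/kernel/debug tree.

# Succeed if a debugfs filesystem appears in the mounts table (default /proc/mounts).
debugfs_mounted() {
    local mounts="${1:-/proc/mounts}"
    awk '$3 == "debugfs" { found = 1 } END { exit !found }' "${mounts}"
}

# Succeed if kernel config text on stdin has CONFIG_DEBUG_FS built in.
config_has_debugfs() {
    grep -q '^CONFIG_DEBUG_FS=y'
}
```

For example: `debugfs_mounted || echo "debugfs not mounted"` and `zcat /proc/config.gz | config_has_debugfs || echo "CONFIG_DEBUG_FS not set"`. The same grep can be run against the ".config" used for the custom kernel build.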

#15
Posted 11/09/2017 09:21 PM   