TX2 JetPack 3.2: WiFi drops connection

After upgrading to JetPack 3.2, we noticed that the WiFi connection is dropped at random. In the GUI, NetworkManager then pops up a password dialog (with the correct password already filled in). If I click Connect, the WiFi is restored.

I found the following email thread, which describes a problem similar to ours, but I have not tried the patches yet.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843387

I will report back after I try.

We did not have a similar problem with JetPack 3.1 (or at least we did not notice one).

This may also be related to our AP. In our warehouse, we have to disable power_save (otherwise ssh takes about 1 s per character, although strangely iperf is fine at a solid 50+ Mbit/s).
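For reference, a minimal sketch of checking and disabling WiFi power save with the `iw` tool. It assumes the interface is named wlan0 (as in the kernel log below) and that `iw` is installed. `power_save_is_on` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: reads the output of `iw dev <ifname> get power_save`
# on stdin and succeeds (exit 0) only if it reports "Power save: on".
power_save_is_on() {
    grep -q '^Power save: on'
}

# Assumed usage on the device (wlan0 is an assumption; changing the
# setting needs root):
#   if iw dev wlan0 get power_save | power_save_is_on; then
#       sudo iw dev wlan0 set power_save off
#   fi
```

A setting made with `iw` is not persistent, so it has to be reapplied after a reboot.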

More detail about the kernel warning:

[1175168.236504] CFG80211-ERROR) wl_is_linkdown : Link down Reason : WLC_E_LINK
[1175168.243614] CFG80211-ERROR) wl_notify_connect_status : link down if wlan0 may call cfg80211_disconnected. event : 16, reason=2 from 1c:b9:c4:a9:77:b8
[1175168.270888] CFG80211-ERROR) wl_cfg80211_connect : Connectting with1c:b9:c4:a9:65:d8 channel (1) ssid "gd", len (2)

[1175168.324286] CFG80211-ERROR) wl_notify_connect_status : wl_bss_connect_done succeeded with 1c:b9:c4:a9:65:d8
[1175168.335832] cfg80211: World regulatory domain updated:
[1175168.337710] SCV_DEBUG, wifi power_set, wldev_ioctl 67, set:0
[1175168.347151] cfg80211:  DFS Master region: unset
[1175168.351777] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[1175168.358619] CFG80211-ERROR) wl_bss_connect_done : Report connect result - connection succeeded
[1175168.370575] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.378794] cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.387030] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.395221] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (N/A)
[1175168.403599] CFG80211-ERROR) wl_notify_connect_status : wl_bss_connect_done succeeded with 1c:b9:c4:a9:65:d8
[1175168.414793] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (0 s)
[1175168.424449] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
[1175168.432729] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.440911] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
[1175168.449359] ------------[ cut here ]------------
[1175168.454153] WARNING: at ffffffc000b3c3dc [verbose debug info unavailable]
[1175168.461104] Modules linked in: fuse bcmdhd pci_tegra bluedroid_pm

[1175168.469078] CPU: 5 PID: 22116 Comm: kworker/u12:1 Tainted: G        W       4.4.38+ #2
[1175168.477154] Hardware name: quill (DT)
[1175168.480993] Workqueue: cfg80211 cfg80211_event_work
[1175168.486046] task: ffffffc11f6fd780 ti: ffffffc04f298000 task.ti: ffffffc04f298000
[1175168.493697] PC is at __cfg80211_connect_result+0x220/0x254
[1175168.499353] LR is at __cfg80211_connect_result+0xb8/0x254
[1175168.504920] pc : [<ffffffc000b3c3dc>] lr : [<ffffffc000b3c274>] pstate: 40000045
[1175168.512481] sp : ffffffc04f29bc90
[1175168.515968] x29: ffffffc04f29bca0 x28: 0000000000000000 
[1175168.521478] x27: 0000000000000000 x26: ffffffc0013ce6f8 
[1175168.526989] x25: ffffffc09dac0218 x24: ffffffc000d40854 
[1175168.532505] x23: ffffffc1e1db3eb0 x22: ffffffc09dac0218 
[1175168.538014] x21: 0000000000000000 x20: 0000000000000000 
[1175168.543523] x19: ffffffc1e1db3e00 x18: 0000000000152f07 
[1175168.549031] x17: 0000007fb1a21f68 x16: ffffffc000b67a60 
[1175168.554542] x15: 00000000fa83b2da x14: 0000000000000000 
[1175168.560054] x13: 00000001163d000d x12: e746060400000000 
[1175168.565568] x11: 0000000000000000 x10: 0000000000000001 
[1175168.571082] x9 : 0000000000000010 x8 : ffffffbffc0b0c38 
[1175168.576597] x7 : 0000000000000001 x6 : 0000000000000002 
[1175168.582178] x5 : 00000000fffffffe x4 : 0000000000000000 
[1175168.587803] x3 : ffffffc001409700 x2 : 0000000000000000 
[1175168.593428] x1 : 0000000000000000 x0 : 0000000000000000 

[1175168.608525] ---[ end trace 8177347c0d39a7ab ]---
[1175168.613400] Call trace:
[1175168.616133] [<ffffffc000b3c3dc>] __cfg80211_connect_result+0x220/0x254
[1175168.622939] [<ffffffc000b16634>] cfg80211_process_wdev_events+0x148/0x1a8
[1175168.629961] [<ffffffc000b166c4>] cfg80211_process_rdev_events+0x30/0x6c
[1175168.636818] [<ffffffc000b11238>] cfg80211_event_work+0x1c/0x28
[1175168.642935] [<ffffffc0000bc2d0>] process_one_work+0x154/0x434
[1175168.648901] [<ffffffc0000bc6e4>] worker_thread+0x134/0x40c
[1175168.654639] [<ffffffc0000c1f30>] kthread+0xe0/0xf4
[1175168.659708] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[1175168.686022] cfg80211: World regulatory domain updated:
[1175168.691689] cfg80211:  DFS Master region: unset
[1175168.696413] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[1175168.706754] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.715172] cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.723488] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.731803] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (N/A)
[1175168.741606] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (0 s)
[1175168.751455] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
[1175168.759786] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
[1175168.768020] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
[1175171.157640] IPVS: Creating netns size=1424 id=15
[1175194.450785] dhd_ndo_remove_ip: ndo ip addr remove failed, retcode = -23
[1175194.458987] dhd_inet6_work_handler: Removing host ip for NDO failed -23

The warning is from

$ addr2line ffffffc000b3c3dc -e sources/kernel/kernel-4.4/output/vmlinux
jetpack/64_TX2/Linux_for_Tegra/sources/kernel/kernel-4.4/net/wireless/sme.c:714

        if (status != WLAN_STATUS_SUCCESS) {
                kzfree(wdev->connect_keys);
                wdev->connect_keys = NULL;
                wdev->ssid_len = 0;
                if (bss) {
                        cfg80211_unhold_bss(bss_from_pub(bss));
                        cfg80211_put_bss(wdev->wiphy, bss);
                }
                cfg80211_sme_free(wdev);
                return;
        }
        // the warning comes from the following line, line 714
        if (WARN_ON(!bss))
                return;
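
To see how often the warning shows up next to link-down and reconnect events, the kernel log can be summarized with a small filter. This is only a sketch: `summarize_wifi_events` is a hypothetical helper, and the patterns are copied from the bcmdhd/cfg80211 messages in the log above (including the driver's own "Connectting" spelling), so they may not match other driver versions.

```shell
# Hypothetical helper: reads kernel log text on stdin and counts three kinds
# of lines taken from the log above: link-down notifications, connect
# attempts, and __cfg80211_connect_result warnings.
summarize_wifi_events() {
    awk '
        /wl_notify_connect_status : link down/   { down++ }
        /wl_cfg80211_connect : Connectting with/ { attempts++ }
        /PC is at __cfg80211_connect_result/     { warnings++ }
        END { printf "link_down=%d connect_attempts=%d connect_result_warnings=%d\n", down, attempts, warnings }
    '
}

# Assumed usage:
#   dmesg | summarize_wifi_events
```

If the warning count tracks the number of connect attempts rather than the number of lost connections, the warning is more likely a side effect of roaming than the cause of the drop.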

wjzhou,

Please describe your use case in more detail. Your scenario is not clear to me.

It sounds like you are using the TX2 to connect to an AP, and sometimes it disconnects and NM asks you to enter the password again. Is that correct?

If so, it sounds like an error we have not seen before. Is the AP signal so weak that the TX2 keeps losing the connection?

I tried to reproduce your issue by leaving my device idle while connected to an AP,
but I cannot see a kernel panic when NM shows the password prompt.

Do you also leave the TX2 idle?

"If I click the connect, the wifi will be restored." -> It sounds like the WiFi is still working, isn't it?

We use the TX2 on a moving cart. As it moves around our warehouse, it connects to different access points (they share the same SSID and password).

What we found is that the cart loses WiFi at random. If we reconnect manually, it connects to the WiFi successfully. It just doesn't reconnect on its own.

Also, these disconnections happen infrequently and randomly (about once every several days).

What I pasted is a kernel warning, not a panic. To be honest, we don't know for certain whether it is related to the lost WiFi connection, but after the device loses WiFi, we always see these messages in dmesg.

From my understanding of the kernel, this WARN_ON should never trigger.

  • Sounds like you are using TX2 to connect to a AP and sometimes it goes into disconnection and NM requests you to enter the password again. Is it correct?
    Yes. The WiFi is suddenly lost and NM shows the password dialog (with the correct password already filled in).

  • Is the signal of AP not good enough so that tx2 keeps lost?
    Maybe. Our warehouse's WiFi is complicated; we have APs all over the place. Normally, the TX2 switches to a new AP without problems, but sometimes it stops trying and shows the password dialog.

  • Do you also put tx2 in idle?
    I don't know about the idle part. Is that a sleep or hibernate thing? We don't put the device into a low-power mode. It is always on.

  • If I click the connect, the wifi will be restored.-> It sounds wifi is still working, isn’t it?
    Yes. The WiFi is still working; it just doesn't reconnect on its own.
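
Since the cart roams between APs that share one SSID, it can help to log which AP (BSSID) the TX2 is associated with over time, so that a drop can be matched to a specific handover. A minimal sketch, assuming `iw` is installed and the interface is wlan0; `current_bssid` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the output of `iw dev <ifname> link` on stdin and
# prints the BSSID of the associated AP, or nothing if not connected.
current_bssid() {
    awk '$1 == "Connected" && $2 == "to" { print $3; exit }'
}

# Assumed usage: log the associated AP once a minute
#   while true; do
#       echo "$(date '+%F %T') $(iw dev wlan0 link | current_bssid)"
#       sleep 60
#   done
```

An empty BSSID in such a log marks the time window in which the connection was lost.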

wjzhou,

Thanks for the clarification.
To summarize, there are two problems here:

  1. The warning message during WiFi handover.
  2. How to automatically connect to another WiFi AP.

Also, JetPack 3.1 does not have this problem.

Have you tried updating nm?

sudo apt-get --only-upgrade install network-manager
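
Before and after the upgrade, it is worth recording which NetworkManager version is installed and whether apt has a newer candidate. The following is a sketch only: `nm_upgrade_status` is a hypothetical helper that parses the `Installed:` and `Candidate:` lines printed by `apt-cache policy`.

```shell
# Hypothetical helper: reads `apt-cache policy network-manager` output on stdin
# and reports whether the installed version differs from the candidate.
nm_upgrade_status() {
    awk '
        $1 == "Installed:" { installed = $2 }
        $1 == "Candidate:" { candidate = $2 }
        END {
            if (installed == "" || installed == "(none)")
                print "not-installed"
            else if (installed == candidate)
                print "up-to-date " installed
            else
                print "upgradable " installed " -> " candidate
        }
    '
}

# Assumed usage:
#   apt-cache policy network-manager | nm_upgrade_status
```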

Any update?

I updated NetworkManager and have not seen any dropped connections for 4 days. Thank you.

(Our JetPack 3.1 boards already had the updated NetworkManager, so the difference we saw may be due to NetworkManager rather than the kernel. But I don't think I ever saw these kernel warnings on JetPack 3.1.)

I think we can close this thread for now.

Hi guys,

I also have a similar problem. My board is a Jetson TX2; it stays in a fixed position all the time and has access to a high-quality wireless link. But several times per hour it disconnects from the wireless AP and I need to reconnect. It's a real bummer to reconnect every 15 minutes.

Note that I applied the update and it did not help in my case; the problem still exists.

Which BSP are you using? Could you fall back to rel-28.1 and try?

NVIDIA Jetson TX2 L4T 28.2.1 [JetPack 3.2.1] Board: t186ref Ubuntu 16.04 LTS Kernel Version: 4.4.38-tegr. Actually, I would prefer not to fall back. Is there any other suggestion for solving this problem?

mohsen.m.razlighi,

Please download the same version of NetworkManager that ships with rel-28.1 and install it on your 28.2.1 system.

These problems appear to come from NetworkManager rather than the WiFi driver, and NM is not provided by NVIDIA directly, so the only thing we can suggest here is to try the latest NM or a previous version.
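
One possible way to try a specific NM version is to install it by version string and then hold the package so a later upgrade does not replace it. This is a sketch under assumptions: the thread does not state which version rel-28.1 ships, so the version is left as a placeholder, and it only works if apt can still see that version. `nm_pin_commands` is a hypothetical helper that prints the commands rather than running them.

```shell
# Hypothetical helper: prints (does not run) the commands that would install a
# specific network-manager package version and keep apt from upgrading it.
nm_pin_commands() {
    if [ -z "$1" ]; then
        echo "usage: nm_pin_commands <package-version>" >&2
        return 1
    fi
    echo "sudo apt-get install network-manager=$1"
    echo "sudo apt-mark hold network-manager"
}

# Assumed usage: list the versions apt knows about, then print the commands
#   apt-cache madison network-manager
#   nm_pin_commands <version-from-the-list>
```

When the requested version is older than the installed one, apt treats it as a downgrade and may ask for confirmation. `sudo apt-mark unhold network-manager` removes the hold again.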

I am seeing the same behavior after flashing JetPack 3.3, which I did not see on 28.1. I will try installing NM from 28.1 and report back.

I still see the error in my environment after updating NetworkManager and trying different versions of it.

It is probably environment related. We only see the problem in one of our warehouses.

Since most of the time the problem was just that NetworkManager did not reconnect, I wrote a script to check the connection and reconnect as needed.

The following workaround seems to be working in our environment.

#!/bin/bash
# /usr/local/bin/checkwifi.sh
# Poll wlan0 once a minute and ask NetworkManager to reconnect if it is down.
echo "checkwifi.sh started, it will check the wifi every 60 secs"

while true
do
    # grep -q exits 0 when nmcli reports the device state as "disconnected"
    if /usr/bin/nmcli device show wlan0 | /bin/grep -q disconnected; then
        echo "The wlan0 seems to be disconnected, reconnecting..."
        /usr/bin/nmcli device connect wlan0
    fi

    /bin/sleep 60
done

# /etc/systemd/system/checkwifi.service
[Unit]
Description=checkwifi connection

[Service]
Type=simple
Restart=always
RestartSec=3
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/checkwifi.sh

[Install]
WantedBy=multi-user.target
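
The script above greps the human-readable nmcli output for the word "disconnected". A slightly stricter variant could read the terse, colon-separated output of `nmcli -t -f DEVICE,STATE device status` and compare the state of one interface exactly. This is a sketch, not part of the tested workaround; `device_state` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `nmcli -t -f DEVICE,STATE device status` output on
# stdin and prints the state of the interface named in $1.
device_state() {
    awk -F: -v dev="$1" '$1 == dev { print $2; exit }'
}

# Assumed usage inside the polling loop:
#   state="$(nmcli -t -f DEVICE,STATE device status | device_state wlan0)"
#   if [ "$state" = "disconnected" ]; then
#       nmcli device connect wlan0
#   fi
```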

Then enable the checkwifi service.
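
Enabling the service could look like the sketch below, assuming the two files are saved at the paths given in their header comments. `enable_checkwifi` and the `RUN` dry-run switch are hypothetical conveniences, not part of the original workaround.

```shell
# Set RUN=echo to print the commands instead of executing them (dry run).
RUN="${RUN:-}"

# Hypothetical helper: make the script executable, reload unit files, then
# enable the service at boot and start it now.
enable_checkwifi() {
    $RUN sudo chmod +x /usr/local/bin/checkwifi.sh &&
    $RUN sudo systemctl daemon-reload &&
    $RUN sudo systemctl enable checkwifi.service &&
    $RUN sudo systemctl start checkwifi.service
}

# Assumed usage:
#   RUN=echo enable_checkwifi   # preview the commands
#   enable_checkwifi            # apply them
```

Afterwards, `systemctl status checkwifi.service` shows whether the loop is running, and `journalctl -u checkwifi.service` shows the reconnect messages it printed.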

Nice! Thanks for sharing!

How is this an acceptable solution in Nvidia’s eyes? Dropping connection even for a second or two is horrible for many use-cases.

Hi Undertow10,

This issue seems to come from NetworkManager, which is a third-party network application.
That means that if you use hostapd or the latest version of NetworkManager, the issue may be gone.

Hi Wayne,

What makes you think it’s a NetworkManager issue? It may very well be a NetworkManager issue, but how can we be sure? We have other non-Nvidia machines running Ubuntu 16.04 with the same version of NetworkManager on our same network and have never seen the issue.

Hi Undertow10,

We started to receive WiFi issue reports after rel-28.2.1, but we did not see such issues on rel-28.1. After some tests, we concluded that the issue comes from NetworkManager.

Updating NM or falling back to an older version has each resolved the problem for some users.

If you think you have hit a new issue, please file a new topic on the forum and we can check.
It would be best to reproduce it on an NVIDIA devkit with our BSP. Results from multiple devices and environments would also be helpful.

Please also tell us how to reproduce your issue. Some WiFi errors only show up during long-running tests.