Wifi Disconnect problem on JetPack 3.3

We are experiencing the same problem other users have experienced with JetPack 3.3 disconnecting randomly from Wifi and prompting for network password. The problem happens when hopping access points on the same SSID by physically moving the TX1 device around our office. I’ve attached the dmesg log, which includes a kernel warning.

Here’s what we know:

  • We can easily reproduce the problem with JetPack 3.3 and it does not happen on JetPack 3.1.
  • Both JetPacks come with NetworkManager 1.2.0. If we upgrade NetworkManager using apt-get, we can install 1.2.6. The problem occurs on both versions of NetworkManager when using JetPack 3.3, but not when using JetPack 3.1.
  • Similar to https://devtalk.nvidia.com/default/topic/1036987/jetson-tx1/r28-2-wifi-in-tx1-disconnect-sometimes/post/5268036/#5268036, we thought we had narrowed the problem down to the moment the TX1 hops: it joins the new AP on a 2.4 GHz channel and then automatically switches to a 5 GHz channel on the same AP. However, we then created a 5 GHz-only SSID and could still reproduce the problem just by hopping APs, while verifying that the TX1 connected on 5 GHz channels each time.

    We have full control to try various settings on our network to debug the issue, but we use Aruba APs with settings that would likely be standard in any office environment, so we shouldn’t have to change anything.

    We cannot downgrade to an earlier JetPack because our entire product and all its drivers have been built around JetPack 3.3. We’re about to deploy this product into thousands of locations around the world, and we’re seriously concerned about Nvidia’s lackluster response to this problem. It has been referenced in at least the following 4 posts:

    https://devtalk.nvidia.com/default/topic/1036350/jetson-tx2/tx2-jetpack3-2-wifi-drop-connection/1

    https://devtalk.nvidia.com/default/topic/1036960/jetson-tx1/28-2-1-wifi-error/post/5267896/#5267896

    https://devtalk.nvidia.com/default/topic/1036987/jetson-tx1/r28-2-wifi-in-tx1-disconnect-sometimes/post/5268036/#5268036

    https://devtalk.nvidia.com/default/topic/1045760/jetson-tx2/wifi-disconnections-with-jetpack-3-3-/post/5306447/#5306447

    We are happy to work with anyone to help us solve this. Thanks in advance.

    dmesg_log.txt (5.69 KB)

    I am curious if any of your access points have a hidden SSID? If so, then can you try without the SSID being hidden?

    We have no hidden SSIDs.

    We also tried your suggestion (from the linked post) to run jetson_clocks.sh and can still reproduce the problem.

    https://devtalk.nvidia.com/default/topic/1045760/jetson-tx2/wifi-disconnections-with-jetpack-3-3-/post/5306447/#5306447

    I can confirm the same problem.

    If I remember correctly from the source code, this warning is triggered when the hardware reports success but the SSID is empty.

    We noticed another problem related to this.
    When the WiFi disconnects, we sometimes lose our USB cameras (Logitech C920). The symptom is that V4L2 returns “select timeout”.
    The cameras only recover when the affected process is restarted; simply closing and re-opening the device doesn’t help. We have three cameras, and all of them hit the select timeout in this case.

    This problem is a high-priority bug on our end now, so if there is anything we can help with, please let us know.

    The reason I asked is that the error seems to relate indirectly to the Broadcom driver. The struct which is NULL (and should not be) would be filled in by a function that has some notes on hidden-SSID handling problems (handling which differs from some of the drivers for other WiFi chipsets). I don’t have a way of narrowing it down, but if for some reason the SSID cannot be found, this error could occur. A hidden SSID was just one possibility, but there may be other reasons why that struct came back NULL. I’m not really in a position to debug further without an environment to reproduce it.

    If you can add some sort of verbose logging to the access points involved it might provide a clue as to whether there was an interaction and the nature of the interaction.

    I’m attaching our AP log here in case anyone else wants to compare. Nothing stands out to us, but we’re still investigating.

    The MAC address of the TX1 device is 00:04:4b:a1:cf:57.
    ap-all-log-2-8-2019.txt (101 KB)

    Keep in mind that I am making wild guesses from this point on. Hopefully someone knowing more about the particular errors from the Jetson logs might comment.

    I did see elsewhere a comment about “band steering” or “spectrum load balancing”. Would you happen to know if this is enabled on your APs? One reason this caught my eye is that the bands and channels supported on the TX1 may not include everything the AP supports, and you’ve mentioned changes between 2.4GHz and 5GHz as related. If for some reason the AP is trying to migrate to a channel not allowed, then the migration could result in a disconnect.

    In terms of the actual address used I am assuming that there is a VLAN set up in common to all of the APs. Trying to keep an IP address during migration would be important, and the VLAN would be in charge of keeping the address assignment. Is there separate logging on VLAN and APs? Note that DHCP could renew a lease to keep an address, but DHCP without renew won’t guarantee keeping the preferred address. Static address would need to be bound to the MAC if not using DHCP. A lack of the address or a reset instead of a renew might cause that struct to be NULL.

    Hi Linuxdev,

    Thanks for your help investigating this. However, I don’t think either of those issues is in play here. We have specifically limited our network to use only the Nvidia-approved channels (which is also frustrating, but a different topic). We also happen to have our network set to provide a static IP to this TX1 device based on its MAC address (for other debugging reasons).

    We have spent a little time trying to debug the kernel warning we get when the error occurs. We’re only assuming this warning is related to the disconnect, but it may not be.

    The warning gets thrown from net/wireless/sme.c in __cfg80211_connect_result() when status = WLAN_STATUS_SUCCESS but there is no bss. It calls cfg80211_get_bss() in net/wireless/scan.c but that function returns without a valid bss. That scan.c function cfg80211_get_bss() appears to loop through a list of bss objects (available on our network?) to find one that matches the one provided in the input parameters.

    We were able to recreate the problem once with our debugging in place, and the channel passed to this function was “center frequency 2462”, and the list of bss objects contains four objects with that channel, but none of those four objects matched all of the requirements to get returned as a valid bss. We added more debugging printouts to see which requirement didn’t match, but then we were not able to recreate the problem. We will continue this tomorrow, but it feels like a pretty deep rabbit hole.

    These are the if statements in scan.c where it bails out without returning the correct bss.

    if (!cfg80211_bss_type_match(bss->pub.capability,
    			     bss->pub.channel->band, bss_type))
    	continue;
    
    bss_privacy = (bss->pub.capability & WLAN_CAPABILITY_PRIVACY);
    if ((privacy == IEEE80211_PRIVACY_ON && !bss_privacy) ||
        (privacy == IEEE80211_PRIVACY_OFF && bss_privacy))
    	continue;
    if (channel && bss->pub.channel != channel)
    	continue;
    if (!is_valid_ether_addr(bss->pub.bssid))
    	continue;
    /* Don't get expired BSS structs */
    if (time_after(now, bss->ts + IEEE80211_SCAN_RESULT_EXPIRE) &&
        !atomic_read(&bss->hold))
    	continue;
    
    // DOES NOT GET HERE FOR ANY OBJECT, EVEN ONE WITH THE CORRECT CHANNEL...
    
    if (is_bss(&bss->pub, bssid, ssid, ssid_len)) {
    	res = bss;
    	bss_ref_get(rdev, res);
    	break;
    }
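
    To make the “which requirement didn’t match” step easier to reason about, here is a minimal user-space sketch in Python that mirrors the order of the checks quoted above and reports the first one that rejects a candidate. It is only a model, not kernel code: the `Bss` fields, the `first_failed_check` helper and the 30-second expiry are assumptions for illustration (check `IEEE80211_SCAN_RESULT_EXPIRE` in your own tree).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed expiry window in seconds; the real value is whatever
# IEEE80211_SCAN_RESULT_EXPIRE is defined as in your kernel tree.
SCAN_RESULT_EXPIRE_S = 30.0


@dataclass
class Bss:
    """Hypothetical stand-in for one entry in the cfg80211 BSS list."""
    bssid: str            # "aa:bb:cc:dd:ee:ff"
    freq: int             # center frequency in MHz
    privacy: bool         # models the WLAN_CAPABILITY_PRIVACY bit
    age_s: float          # seconds since the entry was last updated
    held: bool = False    # models a non-zero bss->hold count
    type_ok: bool = True  # models the result of cfg80211_bss_type_match()


def is_valid_ether_addr(mac: str) -> bool:
    """Valid means: not all zeros and not a multicast address."""
    octets = [int(part, 16) for part in mac.split(":")]
    return len(octets) == 6 and any(octets) and not (octets[0] & 1)


def first_failed_check(bss: Bss, freq: Optional[int], privacy: str) -> Optional[str]:
    """Return the name of the first check that rejects `bss`, or None.

    `privacy` is "on", "off" or "any"; `freq` of None means any channel.
    The order follows the if-statements quoted from scan.c.
    """
    if not bss.type_ok:
        return "bss_type"
    if (privacy == "on" and not bss.privacy) or (privacy == "off" and bss.privacy):
        return "privacy"
    if freq is not None and bss.freq != freq:
        return "channel"
    if not is_valid_ether_addr(bss.bssid):
        return "bssid"
    if bss.age_s > SCAN_RESULT_EXPIRE_S and not bss.held:
        return "expired"
    return None  # candidate would reach the is_bss() comparison
```

    For example, `first_failed_check(Bss("92:00:00:18:00:01", 2462, True, age_s=45.0), 2462, "any")` returns `"expired"` in this model (the MAC is a placeholder). That would be one way for an entry on the right channel to be skipped; whether it is what actually happens on the TX1 is not established here.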
    

    Hi,

    Moving the discussion to this thread.
    I saw this description “disconnecting randomly from Wifi and prompting for network password.” in your first comment.

    The “prompt for network password” behavior seems to be a new feature of NM in rel-28.2/rel-28.2.1.
    I wonder why both of your JetPack installs show the same NM revision. My TX2i reports the following NM version, which is not 1.2.0:

    nvidia@tegra-ubuntu:~$ NetworkManager --version
    1.2.6 
    nvidia@tegra-ubuntu:~$ cat /etc/nv_tegra_release 
    # R28 (release), REVISION: 2.1, GCID: 11272647, BOARD: t186ref, EABI: aarch64, DATE: Thu May 17 07:29:06 UTC 2018
    

    That is why I suggested that some forum users fall back to NM 1.2.0 if they don’t want the password prompt.
    Maybe it is because you are on a TX1. I will find a device and check later.

    Could you elaborate more about this description
    “The problem happens when hopping access points on the same SSID by physically moving the TX1 device around our office”?

    Does it mean if you don’t move around the TX1 device, there would be no disconnection?

    We believe the problem only happens when a hop occurs. The hop could be triggered by physically moving the device around the office, but the device will also hop while sitting in one location, presumably just due to varying network signal strength.

    For example, one day we will try purposely moving the device around many times in a row to try to trigger the problem, but it won’t happen. The next day, it’ll happen almost immediately when moving the device around. Either way, it will inevitably happen once every few hours without moving the device.

    We’ve seen the problem occur on both NetworkManager 1.2.0 and 1.2.6.

    Just to confirm: with NetworkManager 1.2.0 there is no password prompt, but the connection is still lost, right?

    Please share a new dmesg log captured with the settings below.

    sudo -s
     echo 0x10801 > /sys/module/bcmdhd/parameters/dhd_msg_level
     echo 120 > /sys/module/bcmdhd/parameters/dhd_console_ms
    

    This is a log with the debug setting.

    For some reason, when the WiFi has a problem, our camera may return a select timeout (OpenCV V4L2 backend).
    In our environment, a temporary network loss is inconvenient but tolerable; this camera problem is not.

    In the log, “restoring control 00000000-0000-0000-0000-000000000001/1/2” is the reinitialization of the camera. Our app level log returns select timeout.

    [ 3362.959327] Avoid pkt processing if credit is low (<3)
    [ 3362.965389] Avoid pkt processing if credit is low (<3)
    [ 3362.972307] Avoid pkt processing if credit is low (<3)
    [ 3362.978821] Avoid pkt processing if credit is low (<3)
    [ 3362.985352] Avoid pkt processing if credit is low (<3)
    [ 3362.991525] Avoid pkt processing if credit is low (<3)
    [ 3362.998431] Avoid pkt processing if credit is low (<3)
    [ 3363.004747] Avoid pkt processing if credit is low (<3)
    [ 3363.011632] Avoid pkt processing if credit is low (<3)
    [ 3392.695478] wl_host_event: Link event 11, flags 0, status 0
    [ 3392.701763] MACEVENT: WLC_E_DISASSOC, MAC 56:xx:xx:92:xx:02
    [ 3392.707884] wl_host_event: Link event 16, flags 0, status 0
    [ 3392.714264] MACEVENT: WLC_E_LINK DOWN
    [ 3392.718431] CFG80211-ERROR) wl_is_linkdown : Link down Reason : WLC_E_LINK
    [ 3392.725386] CFG80211-ERROR) wl_notify_connect_status : link down if wlan0 may call cfg80211_disconnected. event : 16, reason=2 from 56:xx:xx:92:xx:02
    [ 3392.753687] CFG80211-ERROR) wl_cfg80211_connect : Connectting with92:xx:xx:18:xx:01 channel (44) ssid "my-ssid", len (14)
    
    [ 3392.820804] MACEVENT: WLC_E_ASSOC_REQ_IE
    [ 3392.825438] MACEVENT: WLC_E_AUTH, MAC 92:xx:xx:18:xx:01, Open System, SUCCESS
    [ 3392.833959] MACEVENT: WLC_E_ASSOC_RESP_IE
    [ 3392.838818] MACEVENT: WLC_E_ASSOC, MAC 92:xx:xx:18:xx:01, SUCCESS
    [ 3392.846159] wl_host_event: Link event 16, flags 1, status 0
    [ 3392.852849] MACEVENT: WLC_E_LINK UP
    [ 3392.856986] CFG80211-ERROR) wl_notify_connect_status : wl_bss_connect_done succeeded with 92:xx:xx:18:xx:01
    [ 3392.875155] SCV_DEBUG, wifi power_set, wldev_ioctl 67, set:0
    [ 3392.901218] CFG80211-ERROR) wl_bss_connect_done : 
    [ 3392.901273] ------------[ cut here ]------------
    [ 3392.901276] WARNING: at ffffffc000b3705c [verbose debug info unavailable]
    [ 3392.901277] Modules linked in:
    [ 3392.901279]  fuse
    [ 3392.901281]  ipt_MASQUERADE
    [ 3392.901282]  nf_nat_masquerade_ipv4
    [ 3392.901282]  iptable_nat
    [ 3392.901283]  nf_nat_ipv4
    [ 3392.901284]  xt_addrtype
    [ 3392.901285]  iptable_filter
    [ 3392.901286]  ip_tables
    [ 3392.901287]  xt_conntrack
    [ 3392.901288]  nf_nat
    [ 3392.901288]  br_netfilter
    [ 3392.901289]  overlay
    [ 3392.901290]  leaf(O)
    [ 3392.901291]  kvcommon(O)
    [ 3392.901292]  snd_usb_audio
    [ 3392.901293]  snd_hwdep
    [ 3392.901294]  uvcvideo
    [ 3392.901295]  snd_usbmidi_lib
    [ 3392.901296]  xpad
    [ 3392.901297]  videobuf2_vmalloc
    [ 3392.901297]  xsens_mt
    [ 3392.901298]  bcmdhd
    [ 3392.901299]  pci_tegra
    [ 3392.901300]  bluedroid_pm
    
    [ 3392.901305] CPU: 3 PID: 491 Comm: kworker/u12:2 Tainted: G        W  O    4.4.38+ #2
    [ 3392.901306] Hardware name: quill (DT)
    [ 3392.901314] Workqueue: cfg80211 cfg80211_event_work
    
    [ 3392.901316] task: ffffffc1b9c73200 ti: ffffffc1868d0000 task.ti: ffffffc1868d0000
    [ 3392.901319] PC is at __cfg80211_connect_result+0x220/0x254
    [ 3392.901321] LR is at __cfg80211_connect_result+0xb8/0x254
    [ 3392.901323] pc : [<ffffffc000b3705c>] lr : [<ffffffc000b36ef4>] pstate: 40000045
    [ 3392.901323] sp : ffffffc1868d3c90
    [ 3392.901324] x29: ffffffc1868d3ca0 
    [ 3392.901325] x28: 0000000000000000 
    
    [ 3392.901327] x27: 0000000000000000 
    [ 3392.901327] x26: ffffffc0013c66f8 
    
    [ 3392.901329] x25: ffffffc06ffdc218 
    [ 3392.901329] x24: ffffffc000d3ab54 
    
    [ 3392.901330] x23: ffffffc07a5ec6b0 
    [ 3392.901331] x22: ffffffc06ffdc218 
    
    [ 3392.901332] x21: 0000000000000000 
    [ 3392.901333] x20: 0000000000000000 
    
    [ 3392.901334] x19: ffffffc07a5ec600 
    [ 3392.901335] x18: 0000000000000000 
    
    [ 3392.901336] x17: ffffffc000b62a60 
    [ 3392.901336] x16: ffffffc000b62a60 
    
    [ 3392.901337] x15: ffffffc000b62a60 
    [ 3392.901338] x14: 0000000000000001 
    
    [ 3392.901339] x13: 0000667bc55858f7 
    [ 3392.901340] x12: 0000000000359c94 
    
    [ 3392.901341] x11: 0000000000359c94 
    [ 3392.901342] x10: 0000000000000001 
    
    [ 3392.901343] x9 : 0000000000000010 
    [ 3392.901343] x8 : ffffffbffc0b0bf8 
    
    [ 3392.901344] x7 : ffffffc1e0203a48 
    [ 3392.901345] x6 : 0000000000000002 
    
    [ 3392.901346] x5 : 00000000fffffffe 
    [ 3392.901347] x4 : 0000000000000000 
    
    [ 3392.901348] x3 : ffffffc001401700 
    [ 3392.901348] x2 : 0000000000000000 
    
    [ 3392.901349] x1 : 0000000000000000 
    [ 3392.901350] x0 : 0000000000000000 
    
    [ 3392.901564] ---[ end trace f6fc6841626edf6e ]---
    [ 3392.901565] Call trace:
    [ 3392.901570] [<ffffffc000b3705c>] __cfg80211_connect_result+0x220/0x254
    [ 3392.901573] [<ffffffc000b112b4>] cfg80211_process_wdev_events+0x148/0x1a8
    [ 3392.901576] [<ffffffc000b11344>] cfg80211_process_rdev_events+0x30/0x6c
    [ 3392.901578] [<ffffffc000b0beb8>] cfg80211_event_work+0x1c/0x28
    [ 3392.901582] [<ffffffc0000bc2d0>] process_one_work+0x154/0x434
    [ 3392.901583] [<ffffffc0000bc6e4>] worker_thread+0x134/0x40c
    [ 3392.901586] [<ffffffc0000c1f30>] kthread+0xe0/0xf4
    [ 3392.901589] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
    [ 3392.904604] cfg80211: World regulatory domain updated:
    [ 3392.904614] cfg80211:  DFS Master region: unset
    [ 3392.904620] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
    [ 3392.904637] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
    [ 3392.904650] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz, 92000 KHz AUTO), (N/A, 2000 mBm), (N/A)
    [ 3392.904658] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
    [ 3392.904668] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (N/A)
    [ 3392.904679] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (0 s)
    [ 3392.904688] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
    [ 3392.904696] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
    [ 3392.904704] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
    [ 3392.948433] cfg80211: World regulatory domain updated:
    [ 3392.948441] cfg80211:  DFS Master region: unset
    [ 3392.948448] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
    [ 3392.948459] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
    [ 3392.948468] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz, 92000 KHz AUTO), (N/A, 2000 mBm), (N/A)
    [ 3392.948474] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
    [ 3392.948484] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (N/A)
    [ 3392.948492] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (0 s)
    [ 3392.948551] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
    [ 3392.948557] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
    [ 3392.948563] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
    [ 3393.078921] dhd_ndo_remove_ip: ndo ip addr remove failed, retcode = -23
    [ 3393.078924] dhd_inet6_work_handler: Removing host ip for NDO failed -23
    [ 3393.121339] MACEVENT: WLC_E_JOIN, MAC 92:xx:xx:18:xx:01
    [ 3393.121375] MACEVENT: WLC_E_SET_SSID, MAC 92:xx:xx:18:xx:01
    [ 3393.326734] Report connect result - connection succeeded
    [ 3393.350477] CFG80211-ERROR) wl_notify_connect_status : wl_bss_connect_done succeeded with 92:xx:xx:18:xx:01
    [ 3395.138819] CONSOLE: 003383.159 wl0.0: wlc_wsec_recvdata_enc_toss unsupported encrypted unicast frame from 92:xx:xx:18:xx:01
    [ 3395.150206] CONSOLE: 003383.460 wl0.0: wlc_wsec_recvdata_enc_toss unsupported encrypted unicast frame from 92:xx:xx:18:xx:01
    [ 3398.922810] CONSOLE: 003384.062 wl0.0: wlc_wsec_recvdata_enc_toss unsupported encrypted unicast frame from 92:xx:xx:18:xx:01
    [ 3398.934189] CONSOLE: 003384.175 wl0.0: wlc_wsec_recvdata_enc_toss unsupported encrypted unicast frame from 92:xx:xx:18:xx:01
    [ 3398.945503] CONSOLE: 003384.498 wl0.0: wlc_wsec_recvdata_enc_toss unsupported encrypted unicast frame from 92:xx:xx:18:xx:01
    [ 3398.956778] CONSOLE: 003384.680 wl0.0: wlc_wsec_recvdata_enc_toss unsupported encrypted unicast frame from 92:xx:xx:18:xx:01
    [ 3398.968187] CONSOLE: 003385.263 wl0.0: wlc_wsec_recvdata_enc_toss unsupported encrypted unicast frame from 92:xx:xx:18:xx:01
    [ 3402.903281] SCV_DEBUG, wifi power_set, wldev_ioctl 67, set:2
    [ 3403.819100] xhci-tegra 3530000.xhci: tegra_xhci_mbox_work mailbox command 6
    [ 3403.878777] xhci-tegra 3530000.xhci: tegra_xhci_mbox_work mailbox command 6
    [ 3403.983905] xhci-tegra 3530000.xhci: tegra_xhci_mbox_work mailbox command 6
    [ 3407.872085] CONSOLE: e from 92:xx:xx:18:xx:01
    [ 3409.088216] serial-tegra c280000.serial: configured rate out of supported range by -0.2 %
    [ 3411.662839] xhci-tegra 3530000.xhci: tegra_xhci_mbox_work mailbox command 6
    [ 3411.950095] xhci-tegra 3530000.xhci: tegra_xhci_mbox_work mailbox command 6
    [ 3411.950489] restoring control 00000000-0000-0000-0000-000000000001/1/2
    [ 3411.950498] restoring control 00000000-0000-0000-0000-000000000001/3/4
    [ 3411.950505] restoring control 00000000-0000-0000-0000-000000000001/5/6
    [ 3411.950513] restoring control 00000000-0000-0000-0000-000000000001/17/8
    [ 3411.964860] restoring control 00000000-0000-0000-0000-000000000101/9/4
    [ 3412.245034] restoring control 00000000-0000-0000-0000-000000000001/1/2
    [ 3412.251764] restoring control 00000000-0000-0000-0000-000000000001/3/4
    [ 3412.258524] restoring control 00000000-0000-0000-0000-000000000001/5/6
    [ 3412.265153] restoring control 00000000-0000-0000-0000-000000000001/17/8
    [ 3412.272668] restoring control 00000000-0000-0000-0000-000000000101/9/4
    [ 3413.165054] xhci-tegra 3530000.xhci: tegra_xhci_mbox_work mailbox command 6
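
    For anyone comparing their own logs against this one, a minimal Python sketch that scans dmesg text for the pattern visible above (WLC_E_LINK DOWN, a connect attempt, then the kernel warning) may help. The function name, regular expressions and returned fields are illustrative choices, not part of any NVIDIA or Broadcom tool, and they only match the message formats shown in this log.

```python
import re

# Kernel timestamp such as "[ 3392.714264]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")
# Matches the driver's own (misspelled) "Connectting with<bssid> channel (<n>)".
CONNECT = re.compile(r"wl_cfg80211_connect : Connectting with\s*(\S+) channel \((\d+)\)")


def find_roam_warnings(lines):
    """Return one dict per kernel warning that follows a link-down event."""
    results = []
    link_down_at = None
    target = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue
        ts = float(match.group(1))
        if "WLC_E_LINK DOWN" in line:
            # Start tracking a new disconnect; forget any older connect target.
            link_down_at = ts
            target = None
        connect = CONNECT.search(line)
        if connect:
            target = (connect.group(1), int(connect.group(2)))
        if "cut here" in line and link_down_at is not None:
            results.append({
                "warning_at": ts,
                "seconds_since_link_down": round(ts - link_down_at, 3),
                "target": target,
            })
            link_down_at = None
    return results
```

    Passing an open log file works, for example `find_roam_warnings(open("dmesg_with_bcmdhd_debug.txt"))`. A warning that shows up a fraction of a second after each link-down would support, but not prove, the assumption that the warning and the disconnect are related.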
    

    As far as we can tell, NetworkManager 1.2.0 and 1.2.6 perform exactly the same with this error.

    Attached is a dmesg log when the error occurs with your debugging enabled.

    dmesg_with_bcmdhd_debug.txt (25.7 KB)

    Also, for some reason, the BSSID in the log is not the same as the AP’s. Maybe it is just the logging code that changes it, but I didn’t find anything in the source code that would do so.

    For example,

    in the log we have:

    56:xx:xx:92:xx:02
    

    but the real bssid is

    44:xx:xx:90:xx:02
    

    in the log we have:

    92:xx:xx:18:xx:01
    

    but the real bssid is

    80:xx:xx:16:xx:01
    

    Thanks for this info. I’ll look into it.

    Please try this patch and see if the error still occurs.

    --- a/drivers/net/wireless/bcmdhd/wl_cfg80211.c
    +++ b/drivers/net/wireless/bcmdhd/wl_cfg80211.c
    @@ -9874,9 +9874,8 @@ wl_bss_roaming_done(struct bcm_cfg80211 *cfg, struct net_device *ndev,
            if ((*channel == cur_channel) && ((memcmp(curbssid, &e->addr,
                    ETHER_ADDR_LEN) == 0) || (memcmp(&cfg->last_roamed_addr,
                    &e->addr, ETHER_ADDR_LEN) == 0))) {
    -               WL_ERR(("BSS already present, Skipping roamed event to"
    +               WL_ERR(("BSS already present, but donot skip roamed event to"
                    " upper layer\n"));
    -               return  err;
            }
    

    Correction: please use the patch below instead of the one above.

    diff --git a/drivers/net/wireless/bcmdhd/wl_cfg80211.c b/drivers/net/wireless/bcmdhd/wl_cfg80211.c
    index 1975c77..8412a5d 100644
    --- a/drivers/net/wireless/bcmdhd/wl_cfg80211.c
    +++ b/drivers/net/wireless/bcmdhd/wl_cfg80211.c
    @@ -9938,6 +9938,7 @@
     	do {
     		bss = CFG80211_GET_BSS(wiphy, NULL, curbssid,
     			ssid->SSID, ssid->SSID_len);
    +		cfg->wdev->ssid_len = ssid->SSID_len;
     		if (bss || (count > 5)) {
     			break;
     		}
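
    As a rough illustration of why a stale `ssid_len` could matter, the Python sketch below models a lookup that matches on the BSSID and on the SSID bytes together with their length, assuming that is how `is_bss()` compares entries. The function and argument names are hypothetical, and the model ignores everything else the kernel does.

```python
from typing import Optional


def is_bss_model(entry_bssid: str, entry_ssid: bytes,
                 bssid: Optional[str], ssid: Optional[bytes], ssid_len: int) -> bool:
    """Toy model of a BSSID + SSID match; not the kernel implementation."""
    if bssid is not None and entry_bssid.lower() != bssid.lower():
        return False
    if ssid is None:
        return True  # no SSID requested: a BSSID match is enough
    if len(entry_ssid) != ssid_len:
        return False  # a length mismatch alone rejects the entry
    return entry_ssid == ssid[:ssid_len]
```

    With a placeholder entry `("92:00:00:18:00:01", b"example-ssid")`, a query for the same BSSID and SSID with `ssid_len=12` matches in this model, while the same query with `ssid_len=0` does not. If the length used for the later lookup was never updated after a roam, a lookup could come back empty even though the BSS is in the list.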
    

    @Undertow10 could you try the patch?

    In our environment, it only happens in production, which is remote, and the devices don’t have a monitor or Ethernet.
    I can’t even reboot the device remotely without help from a colleague.

    I can try the patch next time I’m on-site, but it will be a while.

    BTW, @WayneWWW
    I saw

    [ 3263.923618] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
    [ 3263.923626] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
    [ 3272.544014] usb 1-3.3: new low-speed USB device number 8 using xhci-tegra
    [ 3272.570419] usb 1-3.3: New USB device found, idVendor=045e, idProduct=0053
    [ 3272.570588] usb 1-3.3: New USB device strings: Mfr=1, Product=3, SerialNumber=0
    [ 3272.570693] usb 1-3.3: Product: Microsoft 3-Button Mouse with IntelliEye(TM)
    [ 3272.570791] usb 1-3.3: Manufacturer: Microsoft
    

    in Undertow10’s log, which means their USB was probably reset too. Do you know why the WiFi problem would affect USB?

    Is it possible that it was the other way round: a bus error occurred and every device got reset?

    It seems the SSID was wrong in the query.

    I enabled the cfg80211 trace:

    sudo -i
    echo 1 > /sys/kernel/debug/tracing/events/cfg80211/enable
    cat /sys/kernel/debug/tracing/trace
    

    The following is from the trace

    wl_event_handle-848   [003] ...1  6348.779401: cfg80211_get_bss: phy0, band: 0, freq: 2437, xx:xx:xx:xx:05:8f, buf: 0x62, bss_type: 0, privacy: 2
     wl_event_handle-848   [003] ...1  6348.779409: cfg80211_return_bss: 46:d9:e7:f7:05:8f, band: 0, freq: 2437
     wl_event_handle-848   [003] ...1  6348.794747: cfg80211_get_bss: phy0, band: 0, freq: 0, xx:xx:xx:xx:05:8f, buf: 0x62, bss_type: 0, privacy: 2
     wl_event_handle-848   [003] ...1  6348.794754: cfg80211_return_bss: 46:d9:e7:f7:05:8f, band: 0, freq: 2437
       kworker/u12:0-12002 [003] ...1  6348.900694: cfg80211_get_bss: phy0, band: 0, freq: 0, xx:xx:xx:xx:05:8f, buf: 0x1e, bss_type: 0, privacy: 2
      wpa_supplicant-1303  [004] ...1  6349.124876: rdev_scan: phy0
    

    In the first two lookups, buf 0x62 is the first character (‘b’) of our SSID (boxed-xxx), but in the third, 0x1e is the ASCII record separator, a control character.
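
    To check a longer trace for the same pattern, a minimal Python sketch like the one below can list every `cfg80211_get_bss` event that is not followed by a `cfg80211_return_bss` event and show the `buf` byte it was called with. It treats `buf` as the first SSID byte, as in the reading above, and it assumes a return event is only emitted when a BSS is found; the function names and the regular expression are illustrative.

```python
import re

# Captures the trace timestamp and the "buf" value of a get_bss event.
GET_BSS = re.compile(r"(\d+\.\d+): cfg80211_get_bss: .*buf: (0x[0-9a-fA-F]+)")


def describe_byte(value: int) -> str:
    """Render a byte as a printable ASCII character or flag it as non-printable."""
    if 0x20 <= value < 0x7F:
        return f"0x{value:02x} ({chr(value)!r})"
    return f"0x{value:02x} (non-printable)"


def unanswered_lookups(trace_lines):
    """Return (timestamp, first SSID byte) for get_bss events with no return_bss."""
    unanswered = []
    pending = None
    for line in trace_lines:
        match = GET_BSS.search(line)
        if match:
            if pending is not None:
                # The previous lookup never saw a return event.
                unanswered.append(pending)
            pending = (float(match.group(1)), int(match.group(2), 16))
        elif ": cfg80211_return_bss:" in line:
            pending = None
    if pending is not None:
        unanswered.append(pending)
    return unanswered
```

    Run over the six trace lines quoted above, this reports only the lookup at 6348.900694, whose byte `describe_byte` renders as `0x1e (non-printable)`, while 0x62 renders as `0x62 ('b')`.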

    Cross-referencing with our dmesg, it seems the third lookup may be what triggers the WARN_ON.

    Feb 13 19:18:42 magnolia kernel: [ 6348.794765] CFG80211-ERROR) wl_bss_connect_done : Report connect result - connection succeeded
    Feb 13 19:18:42 magnolia kernel: [ 6348.807599] cfg80211:  DFS Master region: unset
    Feb 13 19:18:42 magnolia kernel: [ 6348.812011] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
    Feb 13 19:18:42 magnolia kernel: [ 6348.821819] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
    Feb 13 19:18:42 magnolia kernel: [ 6348.829911] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz, 92000 KHz AUTO), (N/A, 2000 mBm), (N/A)
    Feb 13 19:18:42 magnolia kernel: [ 6348.839371] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
    Feb 13 19:18:42 magnolia kernel: [ 6348.847436] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (N/A)
    Feb 13 19:18:42 magnolia kernel: [ 6348.856962] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (0 s)
    Feb 13 19:18:42 magnolia kernel: [ 6348.866500] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
    Feb 13 19:18:42 magnolia kernel: [ 6348.868039] CFG80211-ERROR) wl_notify_connect_status : wl_bss_connect_done succeeded with 46:d9:e7:f7:05:8f
    Feb 13 19:18:42 magnolia kernel: [ 6348.884407] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
    Feb 13 19:18:42 magnolia kernel: [ 6348.892460] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
    Feb 13 19:18:42 magnolia kernel: [ 6348.900732] ------------[ cut here ]------------
    Feb 13 19:18:42 magnolia kernel: [ 6348.905348] WARNING: at ffffffc000b3705c [verbose debug info unavailable]
    Feb 13 19:18:42 magnolia kernel: [ 6348.912123] Modules linked in: fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat leaf(O) br_netfilter kvcommon(O) overlay uvcvideo snd_usb_audio snd_hwdep snd_usbmidi_lib videobuf2_vmalloc xsens_mt bcmdhd pci_tegra bluedroid_pm
    

    @Undertow10, when you are testing, could you record the SSID?