Hard CPU lockup when using PCIe USB Hub

Hi All,

I’m currently running Linux for Tegra R24.2.1 and am experiencing periodic restarts.
Some specifics of the situation:

  • I'm using a custom developed carrier board that contains an MCS9990 USB2 root hub connected to the TX1 PCIe bus
  • The restarts occur only when an Atheros 9K WiFi module is connected to one of the MCS9990 USB lanes
  • Things are stable when the Atheros 9K is connected directly to the TX1 USB3 lane
  • I've confirmed that the issue relates to USB data lines not power lines (have isolated power lines and symptoms remain)
  • Restarts occur following kernel rcu_preempt detecting a CPU stall and the kernel watchdog timer expiring
  • I have applied the PCIe IOMMU patch: https://devtalk.nvidia.com/default/topic/967515/l4t-r24-2-pcie-iommu-not-working-when-switches-bridges-are-present/ to correct for IOMMU not working when PCIe bridges are present
  • The restarts don't occur when rfkill is used to disable WiFi
  • When idling the system will reboot roughly half hourly however will reboot withing a few minites of starting the transfer of a large file over WiFi to /dev/null

I’ve noticed that there are a references to PCIe IOMMU and DMA issues in L4T 24.x as experienced by several people, are there fundamental issues with PCIe support in L4T 24.x?
Should I try using L4T 23.x in the meantime?
If anyone has any insights or further steps I should take to troubleshoot this issue it’d be appreciated.

Following are snips of the kernel debug output prior to restart due to watchdog restart:

[  336.280001] ------------[ cut here ]------------
[  336.284830] WARNING: at /dvs/git/dirty/git-master_linux/kernel/kernel/watchdog.c:269 watchdog_check_hardlockup_other_cpu+0x10c/0x158()
[  336.297287] Watchdog detected hard LOCKUP on cpu 0
[  336.302058] Modules linked in: rfcomm bnep hci_uart ath9k_htc(O) ath9k_common(O) ath9k_hw(O) ath(O) mac80211 bcmdhd cfg80211 mcp251x(O) can_dev(O) spid
ev(O) ohci_hcd(O) bluedroid_pm
[  336.319390] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G           O 3.10.96-tegra #1
[  336.327108] Call trace:
[  336.329650] [<ffffffc000089df4>] dump_backtrace+0x0/0xf4
[  336.335138] [<ffffffc00008a0f4>] show_stack+0x14/0x1c
[  336.340359] [<ffffffc00032a59c>] dump_stack+0x20/0x28
[  336.345582] [<ffffffc0000a72e4>] warn_slowpath_common+0x78/0x9c
[  336.351698] [<ffffffc0000a7358>] warn_slowpath_fmt+0x50/0x58
[  336.357544] [<ffffffc000122614>] watchdog_check_hardlockup_other_cpu+0x10c/0x158
[  336.365176] [<ffffffc000122748>] watchdog_timer_fn+0x54/0x204
[  336.371111] [<ffffffc0000d5e90>] __run_hrtimer+0x178/0x294
[  336.376775] [<ffffffc0000d6874>] hrtimer_interrupt+0x15c/0x278
[  336.386946] [<ffffffc0007ecda4>] tegra210_timer_isr+0x3c/0x4c
[  336.397027] [<ffffffc00012377c>] handle_irq_event_percpu+0xd8/0x22c
[  336.407675] [<ffffffc000123918>] handle_irq_event+0x48/0x74
[  336.417537] [<ffffffc000126aec>] handle_fasteoi_irq+0xc4/0x10c
[  336.427596] [<ffffffc000122c34>] generic_handle_irq+0x28/0x40
[  336.437493] [<ffffffc0000859dc>] handle_IRQ+0x94/0xc8
[  336.446647] [<ffffffc0000813d4>] gic_handle_irq+0x74/0x194
[  336.456140] Exception stack(0xffffffc0fe477d80 to 0xffffffc0fe477ea0)
[  336.466587] 7d80: 1fef9400 ffffffc0 00000003 00000000 fe477ee0 ffffffc0 007b06bc ffffffc0
[  336.482632] 7da0: 80000345 00000000 00000001 00000000 8007b000 00000000 8007d000 00000000
[  336.498768] 7dc0: 00000000 00000000 3411c141 00000000 93115800 00000049 2110888e 00000000
[  336.515023] 7de0: 00000018 00000000 038ff5a1 0026547b 3411c141 00000000 ffffffff 00ffffff
[  336.531278] 7e00: 7666670f 00000001 01000000 00000000 37186e19 ddfa6921 bade7435 df6c2675
[  336.547608] 7e20: 00000003 00000000 3c2c5eda c3d3403a 411d2383 2811f006 012cc42a 2fed3407
[  336.564295] 7e40: 32e74f37 640166b7 6760258a 49430234 e6a82c5b ab24e6cb 1fef9400 ffffffc0
[  336.581430] 7e60: 00000003 00000000 48dfae3c 0000004e 4847e5ae 0000004e 1fef9980 ffffffc0
[  336.599011] 7e80: 00000003 00000000 8007b000 00000000 8007d000 00000000 00080200 ffffffc0
[  336.616833] [<ffffffc000084e04>] el1_irq+0x84/0xf0
[  336.626445] [<ffffffc0007b08c0>] cpuidle_idle_call+0x188/0x294
[  336.637134] [<ffffffc000086724>] arch_cpu_idle+0xc/0x24
[  336.647206] [<ffffffc0000f9a08>] cpu_idle_loop+0x21c/0x284
[  336.657536] [<ffffffc0000f9a90>] freezing_slow_path+0x0/0x84
[  336.668021] [<ffffffc000b3a134>] secondary_start_kernel+0x148/0x154
[  336.679162] ---[ end trace 94bda14e5254cd5c ]---
[21417.951206] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 3, t=2200 jiffies, g=37074, c=37073, q=107)
[21417.962271] Task dump for CPU 0:
[21417.965491] swapper/0       R  running task        0     0      0 0x00000002
[21417.972545] Call trace:
[21417.974992] [<ffffffc000086b24>] __switch_to+0x3c/0x48
[21417.980121] [<ffffffc0007a5084>] cpuidle_enter_state+0x60/0xcc
[21417.985943] [<ffffffc0007a526c>] cpuidle_idle_call+0x17c/0x284
[21417.991765] [<ffffffc000086740>] arch_cpu_idle+0xc/0x24
[21417.996981] [<ffffffc0000f8574>] cpu_idle_loop+0x1b4/0x21c
[21418.002454] [<ffffffc0000f85fc>] freezing_slow_path+0x0/0x80
[21418.008104] [<ffffffc000b262f8>] rest_init+0x88/0x94
[21418.013060] [<ffffffc0011338a8>] start_kernel+0x28c/0x298

Please try with L4T 23.x
Also, Can you please try connecting any other USB device(s) (like a thumb drive/mouse/keyboard) and see if things are fine?

Thanks for the response.

After flashing with L4T 23.2 no usb devices connected via the MC9990 are being identified (using lsusb) and I’m now seeing the following error messages:

[  297.290123] usb 3-1: new high-speed USB device number 2 using ehci-pci
[  307.743727] usb 3-1: device not accepting address 2, error -110
[  307.863817] usb 3-1: new high-speed USB device number 3 using ehci-pci
[  318.317096] usb 3-1: device not accepting address 3, error -110
[  318.437141] usb 3-1: new high-speed USB device number 4 using ehci-pci

Hi atmurray,

Sorry for the late response.
Have you clarified and fixed this issue?

Thanks

@atmurray,
Can you please give the output of ‘lspci’. Would like to see Vendor_ID and device_ID of the device you are using.

Problem still persists unfortunately.
Vendor and device id is 9710:9990

Following is the output as requested:

00:02.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:0faf] (rev a1)
01:00.0 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]
01:00.1 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]
01:00.2 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]
01:00.3 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]
01:00.4 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]
01:00.5 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]
01:00.6 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]
01:00.7 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990]

Verbose output is as follows:

00:02.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:0faf] (rev a1) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 130
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000f000-00000fff
        Memory behind bridge: 13000000-130fffff
        Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Subsystem: NVIDIA Corporation Device [10de:0000]
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
                Mapping Address Base: 00000000fee00000
        Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Off, PwrInd On, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet+ LinkState+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=30us PortTPowerOnTime=70us
        Kernel driver in use: pcieport

01:00.0 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 10 [OHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 130
        Region 0: Memory at 13000000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [800 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ohci_hcd
        Kernel modules: ohci_hcd

01:00.1 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 20 [EHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 130
        Region 0: Memory at 13001000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ehci-pci

01:00.2 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 10 [OHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 130
        Region 0: Memory at 13002000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ohci_hcd
        Kernel modules: ohci_hcd

01:00.3 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 20 [EHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 130
        Region 0: Memory at 13003000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ehci-pci

01:00.4 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 10 [OHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin C routed to IRQ 130
        Region 0: Memory at 13004000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ohci_hcd
        Kernel modules: ohci_hcd

01:00.5 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 20 [EHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin C routed to IRQ 130
        Region 0: Memory at 13005000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ehci-pci

01:00.6 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 10 [OHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin D routed to IRQ 130
        Region 0: Memory at 13006000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ohci_hcd
        Kernel modules: ohci_hcd

01:00.7 USB controller [0c03]: MosChip Semiconductor Technology Ltd. MCS9990 PCIe to 4‐Port USB 2.0 Host Controller [9710:9990] (prog-if 20 [EHCI])
        Subsystem: Device [a000:4000]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin D routed to IRQ 130
        Region 0: Memory at 13007000 (32-bit, non-prefetchable) 
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM not supported, Exit Latency L0s <64ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: ehci-pci

Could this issue be related to work enabling IOMMU for PCIe as outlined in the following thread?
https://devtalk.nvidia.com/default/topic/967515/l4t-r24-2-pcie-iommu-not-working-when-switches-bridges-are-present/

I’ve applied that patch to my custom kernel but the issue still persists.

Can someone confirm my reading of the kernel dumps:

The RCU CPU Stall Detector (documentation here: https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt) is detecting a stall when the CPU is calling swapper to change the running task:

[21417.951206] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 3, t=2200 jiffies, g=37074, c=37073, q=107)
[21417.962271] Task dump for CPU 0:
[21417.965491] swapper/0       R  running task        0     0      0 0x00000002
[21417.972545] Call trace:
[21417.974992] [<ffffffc000086b24>] __switch_to+0x3c/0x48

If this is correct, would it be fair to say that the swapper task is somehow stalling the CPU?
If so, is the swapper task a generic (linux), architecture (arm64) or manufacturer (nvidia) specific implementation?

I happen to get my hands on a x1 card with the same chipset ( http://www.iocrest.com/en/product_details294.html )
This card has support for both ASPM-L0s and L1 states and they are getting enabled by default which is causing issues. We are not sure of card’s functionality with ASPM states enabled. So, I disabled ASPM completely by adding ‘pcie_aspm=off’ to kernel command line (added it to ‘APPEND’ line in /boot/extlinux/extlinux.conf file). With this, card is working flawlessly with USB thumb drive. But, surprisingly, in your case, I see that ASPM is not even supported by the card yet you are facing issues. Let me see if I can get same WiFi module.

That’s brilliant that you’ve been able to track down a card with the same chipset.
I’ve disabled ASPM in my setup and initial tests are promising but I’ll have to let it run overnight.
Curiously, I’m now getting periodic flooding of the kernel message ‘rx buffer full’ from the ath9k kernel driver when in verbose output mode and performing large file transfers (however things then recover).

Once I’ve tested this more I’ll report back on the results.
Thanks for the help thus far, sincerely appreciated.

Update: after running tests overnight, the system still crashed with the same kernel dump.

Hi Vidyas, were you able to track down the same WiFi module?
For reference, we experience the same issue when using a USB thumb drive.
I’ve noticed that there are several other threads with people with PCIe related issues, is there anything you would suggest I could look at being related to this issue?

Hi atmurray,

We’re running some experiments on similar device, now there are some observations, but not sure if they are related.
The status will be updated once clarified.

Thanks

Any update?

Hello, is there any update on this?

Hi rklucik,

Which BSP you’re using? Issue should be fixed already, please try with the latest L4T 28.2 release.

Thanks