Tegra TK1 PCIe failure

We have a custom FPGA card connected to the TK1 platform. Sometimes (about one time in 10), PCIe reports the errors shown below and is no longer accessible. We need to reboot the device to recover.

[ 1.169182] msgmni has been set to 1430
[ 1.171852] io scheduler noop registered (default)
[ 1.172412] of_get_named_gpio_flags: can't parse gpios property
[ 1.172445] of_get_named_gpio_flags: can't parse gpios property
[ 1.172474] of_get_named_gpio_flags: can't parse gpios property
[ 1.214012] PCI host bridge to bus 0000:00
[ 1.214047] pci_bus 0000:00: root bus resource [mem 0x32100000-0x3fffffff]
[ 1.214082] pci_bus 0000:00: root bus resource [mem 0x12100000-0x320fffff pref]
[ 1.214119] pci_bus 0000:00: root bus resource [io 0x1000-0xffff]
[ 1.214155] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
[ 1.214239] pci 0000:00:00.0: [10de:0e12] type 01 class 0x060400
[ 1.214435] pci 0000:00:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 1.214996] PCI: bus0: Fast back to back transfers disabled
[ 1.215033] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 1.215402] pci 0000:01:00.0: [1172:e001] type 00 class 0xff0000
[ 1.215484] pci 0000:01:00.0: reg 10: [mem 0x00000000-0x0fffffff 64bit pref]
[ 1.215572] pci 0000:01:00.0: reg 18: [mem 0x00000000-0x00007fff]
[ 1.217633] PCI: bus1: Fast back to back transfers disabled
[ 1.217670] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[ 1.217710] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 01
[ 1.218252] pcieport 0000:00:00.0: Signaling PME through PCIe PME interrupt
[ 1.218290] pci 0000:01:00.0: Signaling PME through PCIe PME interrupt
[ 1.218327] pcie_pme 0000:00:00.0:pcie01: service driver pcie_pme loaded
[ 1.218606] aer 0000:00:00.0:pcie02: service driver aer loaded
[ 1.218971] PCI host bridge to bus 0000:02
[ 1.219002] pci_bus 0000:02: root bus resource [mem 0x32100000-0x3fffffff]
[ 1.219036] pci_bus 0000:02: root bus resource [mem 0x12100000-0x320fffff pref]
[ 1.219072] pci_bus 0000:02: root bus resource [io 0x10000-0x1ffff]
[ 1.219107] pci_bus 0000:02: No busn resource found for root bus, will use [bus 02-ff]
[ 1.219195] pci 0000:02:00.0: [10de:0e13] type 01 class 0x060400
[ 1.219384] pci 0000:02:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 1.219938] PCI: bus2: Fast back to back transfers disabled
[ 1.219974] pci 0000:02:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 1.220340] pci 0000:03:00.0: [8086:1533] type 00 class 0x020000
[ 1.220410] pci 0000:03:00.0: reg 10: [mem 0x00000000-0x000fffff]
[ 1.220487] pci 0000:03:00.0: reg 18: [io 0x0000-0x001f]
[ 1.220570] pci 0000:03:00.0: reg 1c: [mem 0x00000000-0x00003fff]
[ 1.220672] pci 0000:03:00.0: reg 30: [mem 0x00000000-0x000fffff pref]
[ 1.220844] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold
[ 1.222597] PCI: bus3: Fast back to back transfers disabled
[ 1.222634] pci_bus 0000:03: busn_res: [bus 03-ff] end is updated to 03
[ 1.222673] pci_bus 0000:02: busn_res: [bus 02-ff] end is updated to 03
[ 1.223177] pcieport 0000:02:00.0: Signaling PME through PCIe PME interrupt
[ 1.223215] pci 0000:03:00.0: Signaling PME through PCIe PME interrupt
[ 1.223253] pcie_pme 0000:02:00.0:pcie01: service driver pcie_pme loaded
[ 1.223539] aer 0000:02:00.0:pcie02: service driver aer loaded
[ 1.223753] pcieport 0000:02:00.0: BAR 8: assigned [mem 0x32100000-0x322fffff]
[ 1.223793] pcieport 0000:02:00.0: BAR 9: assigned [mem 0x12100000-0x121fffff pref]
[ 1.223832] pcieport 0000:02:00.0: BAR 7: assigned [io 0x10000-0x10fff]
[ 1.223871] pci 0000:03:00.0: BAR 0: assigned [mem 0x32100000-0x321fffff]
[ 1.223914] pci 0000:03:00.0: BAR 6: assigned [mem 0x12100000-0x121fffff pref]
[ 1.223952] pci 0000:03:00.0: BAR 3: assigned [mem 0x32200000-0x32203fff]
[ 1.223993] pci 0000:03:00.0: BAR 2: assigned [io 0x10000-0x1001f]
[ 1.224032] pcieport 0000:02:00.0: PCI bridge to [bus 03]
[ 1.224064] pcieport 0000:02:00.0: bridge window [io 0x10000-0x10fff]
[ 1.224101] pcieport 0000:02:00.0: bridge window [mem 0x32100000-0x322fffff]
[ 1.224140] pcieport 0000:02:00.0: bridge window [mem 0x12100000-0x121fffff pref]
[ 1.224202] pcieport 0000:00:00.0: BAR 9: assigned [mem 0x20000000-0x2fffffff 64bit pref]
[ 1.224242] pcieport 0000:00:00.0: BAR 8: assigned [mem 0x32300000-0x323fffff]
[ 1.224282] pci 0000:01:00.0: BAR 0: assigned [mem 0x20000000-0x2fffffff 64bit pref]
[ 1.224346] pci 0000:01:00.0: BAR 2: assigned [mem 0x32300000-0x32307fff]
[ 1.224386] pcieport 0000:00:00.0: PCI bridge to [bus 01]
[ 1.224421] pcieport 0000:00:00.0: bridge window [mem 0x32300000-0x323fffff]
[ 1.224459] pcieport 0000:00:00.0: bridge window [mem 0x20000000-0x2fffffff 64bit pref]
[ 1.249454] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[ 1.249593] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[ 1.249642] pcieport 0000:00:00.0: device [10de:0e12] error status/mask=00004000/00000000
[ 1.249681] pcieport 0000:00:00.0: [14] Completion Timeout (First)
[ 1.249722] pcieport 0000:00:00.0: broadcast error_detected message
[ 1.249759] pcieport 0000:00:00.0: AER: Device recovery failed
[ 1.249793] pcieport 0000:00:00.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0010
[ 1.249853] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[ 1.249861] pcieport 0000:00:00.0: device [10de:0e12] error status/mask=00004000/00000000
[ 1.249868] pcieport 0000:00:00.0: [14] Completion Timeout (First)
[ 1.249882] pcieport 0000:00:00.0: broadcast error_detected message
[ 1.249894] pcieport 0000:00:00.0: AER: Device recovery failed
[ 1.250929] pwm-backlight pwm-backlight: unable to request PWM, trying legacy API
[ 1.251128] sysedp_create_consumer: unable to create pwm-backlight, no consumer_data for pwm-backlight found
[ 1.252688] tsec tsec: initialized

What could be the reason?

We are using the latest kernel, version 21.5. Any suggestions would be very helpful.

Thanks in advance.

Under “lspci” you will see a slot assignment for each device, e.g., it looks something like “01:00.0”. Using this with the “-s” option of lspci limits the response to just that slot. Commands below use an example slot, adjust for your device…

More information would help. If possible, use sudo and show the output of this both before and after an error (verbose listing won’t show full information if not using sudo):

lspci
sudo lspci -s 01:00.0 -vvv

If lspci fails completely after the issue hits, that too would be useful information.
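If it helps with the before/after comparison, here is a minimal Python sketch that diffs two saved captures. It assumes you have redirected each `sudo lspci -s 01:00.0 -vvv` run into a text file yourself; the function name `diff_captures` and the file names are only illustrative, not part of any tool.

```python
import difflib
from typing import List


def diff_captures(before: str, after: str) -> List[str]:
    """Return only the lines that differ between two lspci captures.

    `before` and `after` are the full text of two `lspci -vvv` runs,
    e.g. one taken on a good boot and one taken after the failure.
    """
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
            n=0,  # no context lines, show changed lines only
        )
    )


if __name__ == "__main__":
    # Hypothetical file names; adjust to wherever the captures were saved.
    with open("before.txt") as f_before, open("after.txt") as f_after:
        for line in diff_captures(f_before.read(), f_after.read()):
            print(line)
```

Lines starting with "-" come from the first capture and lines starting with "+" from the second, so a change in a status line such as LnkSta or DevSta stands out immediately.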

Capture below:

ubuntu@tegra-ubuntu:~$ sudo lspci -s 01:00.0 -vvv
[sudo] password for ubuntu:
01:00.0 Unassigned class [ff00]: Altera Corporation Device e001 (rev 04)
        Subsystem: Altera Corporation Device e001
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 130
        Region 0: Memory at 20000000 (64-bit, prefetchable)
        Region 2: [virtual] Memory at 32300000 (32-bit, non-prefetchable)
        Capabilities: [50] MSI: Enable- Count=1/4 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s, Exit Latency L0s <4us, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [200 v1] Vendor Specific Information: ID=a000 Rev=0 Len=044 <?>
        Kernel driver in use: egc_mfd

Was the lspci listing using sudo? I don’t see AER there, but I do see AER in the earlier dmesg. Also, was the lspci shown above from before a failure? Using “sudo” with the lspci after a failure would give the most detailed information. About all I see in the above is that the device is capable of PCIe v2 speeds, but throttled back to v1 speeds.
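For anyone wanting to check the speed throttling automatically, the following is a rough Python sketch that compares the speed advertised in LnkCap with the speed negotiated in LnkSta. It only assumes the `lspci -vvv` text layout shown in the capture above; the helper names `link_speeds` and `is_downtrained` are made up for illustration.

```python
import re
from typing import Optional, Tuple


def link_speeds(lspci_text: str) -> Optional[Tuple[float, float]]:
    """Return (capable, negotiated) link speed in GT/s, or None if not found.

    "LnkCap:" holds what the device can do, "LnkSta:" what the link
    actually trained to. The trailing colon keeps LnkSta2 from matching.
    """
    cap = re.search(r"LnkCap:.*?Speed ([\d.]+)GT/s", lspci_text)
    sta = re.search(r"LnkSta:.*?Speed ([\d.]+)GT/s", lspci_text)
    if cap is None or sta is None:
        return None
    return float(cap.group(1)), float(sta.group(1))


def is_downtrained(lspci_text: str) -> bool:
    """True when the negotiated speed is below the advertised capability."""
    speeds = link_speeds(lspci_text)
    return speeds is not None and speeds[1] < speeds[0]


if __name__ == "__main__":
    # Two lines copied from the capture in this thread.
    sample = (
        "LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s\n"
        "LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+\n"
    )
    print(link_speeds(sample), is_downtrained(sample))
```

With the capture posted here this reports a 5GT/s (v2) capability and a 2.5GT/s (v1) negotiated speed, which is the throttling described above.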

Btw, you can make the logs you posted more readable by highlighting them and clicking the “code block” icon in the upper right (looks like “</>”); indent will be preserved and a scrollbar added. To do this on an existing forum post, first click the “pencil” icon in the upper right to edit it.

Hi,

Have updated the above message.

Yes, the lspci is done with sudo.

In your original message I see AER mentioned (Advanced Error Reporting, an option on PCI devices), but not in the lspci. This has me wondering if post #1 and #3 have some hardware difference. I could see AER not being visible if sudo were not used, but with sudo lspci the existence of optional AER should always show. Knowing if there was an AER “First Error Pointer” to NULL or to an address would be useful information. If you use “lspci” by itself does anything other than your device and PCI bridge devices show up?

Does the specification on your PCIe device suggest it has AER? If so, is there anything you need to do with the FPGA to enable AER?
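As a side note on reading the AER lines in the original dmesg: the "error status/mask=00004000/00000000" value is a bit field, and a small Python sketch like the one below can decode it. The bit names follow the Uncorrectable Error Status register layout of the PCIe AER capability as I understand it; the function name `decode_uncorrectable` is hypothetical and this is not how the kernel itself prints the message.

```python
from typing import List, Tuple

# Bit positions in the AER Uncorrectable Error Status register
# (assumed from the PCIe AER capability definition).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_uncorrectable(status_mask: str) -> List[Tuple[int, str, bool]]:
    """Decode a "status/mask" pair of hex strings as printed in dmesg.

    Returns (bit, name, masked) for every bit set in the status word.
    """
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = UNCORRECTABLE_BITS.get(bit, "unknown/reserved")
            decoded.append((bit, name, bool(mask & (1 << bit))))
    return decoded


if __name__ == "__main__":
    # Value taken from the dmesg in the first post.
    print(decode_uncorrectable("00004000/00000000"))
    # prints [(14, 'Completion Timeout', False)]
```

This agrees with the "[14] Completion Timeout" line in the log: the root port sent a request and did not get a completion back from the endpoint in time, and the error is not masked.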

What L4T version is this (e.g., R21.5 is the most recent)? See:

head -n 1 /etc/nv_tegra_release

About all I can tell from the above is that the x1 lane is capable of PCIe v2 speeds but has throttled back to PCIe v1. This indicates imperfect signal quality. Normally throttling back to v1 would mean the link still works, but perhaps a signal quality issue is making even v1 questionable. Is there anything unusual about the wiring between the FPGA and the Jetson, e.g., is it directly connected, or does it have some sort of extension cable?