PCIe Bus Error message

Hello,
After booting, our system keeps printing messages like these at random times:
ubuntu@tegra-ubuntu:~$ [ 113.058887] pcieport 0000:00:00.0: PCIe Bus Error: se)
[ 113.074892] pcieport 0000:00:00.0: device [10de:0e12] error status/mask=000
[ 113.088362] pcieport 0000:00:00.0: [ 0] Receiver Error (First)
[ 113.833824] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=)
[ 113.888024] pcieport 0000:00:00.0: device [10de:0e12] error status/mask=000
[ 113.944518] pcieport 0000:00:00.0: [ 0] Receiver Error (First)
[ 114.089597] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=)
[ 114.108394] pcieport 0000:00:00.0: device [10de:0e12] error status/mask=000
[ 114.129385] pcieport 0000:00:00.0: [ 0] Receiver Error (First)

How can I fix this?
Thanks!

Can you please confirm whether ASPM is disabled on your system? ‘sudo lspci -vvvv’ will show this.
If not, please add ‘pcie_aspm=off’ to the kernel command line to disable it. Let us know if you still see the issue.
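A minimal sketch of how the LnkCtl lines from lspci could be checked for ASPM. The helper names are hypothetical; the code only parses text passed to it and assumes the "LnkCtl: ASPM <state>;" layout that lspci prints.

```python
import re

# Matches the ASPM field of an lspci "LnkCtl:" line, e.g.
# "LnkCtl: ASPM L1 Enabled; RCB 64 bytes ..." or "LnkCtl: ASPM Disabled; ..."
_LNKCTL_ASPM = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")

def aspm_states(lspci_text):
    """Return the ASPM state string of every LnkCtl line (hypothetical helper)."""
    return [m.group(1).strip() for m in _LNKCTL_ASPM.finditer(lspci_text)]

def aspm_fully_disabled(lspci_text):
    """True only if at least one link was found and every link reports 'Disabled'."""
    states = aspm_states(lspci_text)
    return bool(states) and all(s == "Disabled" for s in states)
```

The text to parse would be the captured output of ‘sudo lspci -vv’; LnkCap lines are ignored because they only list what the link supports, not what is enabled.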

Hello,
I added ‘pcie_aspm=off’ to the kernel command line and ASPM is now disabled:
ubuntu@tegra-ubuntu:~$ sudo lspci -vvvv | grep ASPM
[sudo] password for ubuntu:
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latens
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Lats
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-

But I still see these PCIe error messages.
Thanks!

Then there could be some electrical issues. Are these errors also seen when the link is operating in Gen-1 mode?

Hello,
What does Gen-1 mode mean? How do I set the link to operate in Gen-1 mode, and how can I tell which mode the link is in now?

Thanks!

‘sudo lspci -vv’ shows the current link operating speed. Gen-1 is 2.5GT/s and Gen-2 is 5GT/s.
If the link is currently operating in Gen-2, you can program Target_Link_Speed to Gen-1 in the Link_Control_2 register and retrain the link from your client driver.
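As a rough sketch of the register arithmetic behind that suggestion: in the PCI Express capability, Target Link Speed is bits 3:0 of Link Control 2 (capability offset 0x30) and Retrain Link is bit 5 of Link Control (capability offset 0x10). The function names below are hypothetical, and the code only computes the values to write; the actual config-space access would be done by the driver.

```python
# Target Link Speed encodings (Link Control 2, bits 3:0)
GEN1 = 0x1  # 2.5GT/s
GEN2 = 0x2  # 5GT/s

TARGET_SPEED_MASK = 0x000F   # Link Control 2, bits 3:0
RETRAIN_LINK = 1 << 5        # Link Control, bit 5

def with_target_speed(lnkctl2, gen):
    """Return the 16-bit Link Control 2 value with Target Link Speed replaced."""
    return (lnkctl2 & ~TARGET_SPEED_MASK & 0xFFFF) | (gen & TARGET_SPEED_MASK)

def with_retrain(lnkctl):
    """Return the 16-bit Link Control value with the Retrain Link bit set."""
    return (lnkctl | RETRAIN_LINK) & 0xFFFF
```

The order would be to write the new Link Control 2 value first and then set Retrain Link, so that the retraining picks up the new target speed.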

Hello,
I think the link was already operating in Gen-1 mode:

ubuntu@tegra-ubuntu:~$ sudo lspci -vv
[sudo] password for ubuntu:
00:00.0 PCI bridge: NVIDIA Corporation Device 0e13 (rev a1) (prog-if 00 [Normal)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Ste+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort–
Latency: 0, Cache Line Size: 64 bytes
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 00001000-00001fff
Memory behind bridge: 32100000-321fffff
Prefetchable memory behind bridge: 0000000012100000-00000000121fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort–
BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Subsystem: NVIDIA Corporation Device 0000
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3ho)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+
Address: 00000000a9b68000 Data: 0000
Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
Mapping Address Base: 00000000fee00000
Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupport+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransP-
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latens
ClockPM- Surprise- LLActRep+ BwNot+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActiv-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpri-
Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- -
Control: AttnInd Off, PwrInd On, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Inter-
Changed: MRL- PresDet+ LinkState+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRS-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF -
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, O-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModified-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplet-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualiza-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel driver in use: pcieport

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 )
Subsystem: Realtek Semiconductor Co., Ltd. Device 0123
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Ste+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort–
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 641
Region 0: I/O ports at 1000
Region 2: Memory at 32100000 (64-bit, non-prefetchable)
Region 4: Memory at 12100000 (64-bit, prefetchable)
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000a9b68000 Data: 0001
Capabilities: [70] Express (v2) Endpoint, MSI 01
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, s
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupport-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransP-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Lats
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActiv-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBF#
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, Od
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModified-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete–
EqualizationPhase2-, EqualizationPhase3-, LinkEqualiza-
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00000800
Capabilities: [d0] Vital Product Data
Unknown small resource type 00, will not decode more.
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
CESta: RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Kernel driver in use: r8169

ubuntu@tegra-ubuntu:~$ sudo lspci -vv | grep Speed
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latens
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActiv-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Lats
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActiv-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-

Thanks!

Ok.
Then you may have to get the carrier board tested to see whether its electricals are good.
BTW, are these errors seen with any PCIe end point or only with a specific end point?

Hello,
There is one on-board PCIe Ethernet chip, “Realtek RTL8111GS-CG”.

Thanks!

Ok.
I thought you were using a different carrier board. It looks like you are using the default Jetson-TK1 board.
Coming back to the errors, it looks like your board’s electricals are not good.
If you are annoyed by these prints, you can disable the AER mechanism in the kernel config, which just suppresses the prints popping up in the console. I can’t think of any solution for this at this point in time.

Hello,
We have some boards that keep showing the PCIe error messages (only with the on-board Realtek RTL8111GS-CG).
(Most other boards don’t have this problem.)
These boards seem to work normally, but we don’t know whether there will be side effects. If we swap the RTL8111GS-CG on such a board for another RTL8111GS-CG IC, the PCIe error messages go away.
So I am wondering whether it is appropriate to just suppress printing of the PCIe error messages.

Thanks!

Hello,
About this issue, I have also run these tests:
(1).

Please try the commands below to see whether you still see the AER messages:
#echo performance > /sys/module/pcie_aspm/parameters/policy
#setpci -s 00:00.0 90.l=70410042 (root port)
#setpci -s 01:00.0 80.l=10110042 (end point)

<— doesn’t work

(2).

Please try the steps below:
1. Add param 'pcie_aspm=force' in kernel command line parameters.
2. echo powersave > /sys/module/pcie_aspm/parameters/policy
3. echo performance > /sys/module/pcie_aspm/parameters/policy
4. setpci -s 00:00.0 90.l -- check that bits 0:1 are 0 when ASPM is disabled.

<— doesn’t work (please refer to the attached file: pcie_error.png)
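
For reference, a small sketch of decoding the ASPM Control field that step 4 refers to. It assumes that offset 90 on the root port and offset 80 on the end point are the Link Control registers (the Express capability offsets in the lspci output suggest this), so a 32-bit access covers Link Control in the low 16 bits and Link Status in the high 16 bits. The encoding is the standard PCI Express one; the helper names are hypothetical.

```python
# ASPM Control encodings (Link Control, bits 1:0)
ASPM_CONTROL = {
    0b00: "Disabled",
    0b01: "L0s",
    0b10: "L1",
    0b11: "L0s and L1",
}

def decode_aspm(dword):
    """Decode ASPM Control from a 32-bit Link Control/Link Status value."""
    lnkctl = dword & 0xFFFF          # low half is Link Control
    return ASPM_CONTROL[lnkctl & 0x3]

def common_clock(dword):
    """True if Common Clock Configuration (Link Control bit 6) is set."""
    return bool(dword & (1 << 6))
```

Under this encoding, a low word of 0x0042 decodes as L1 with Common Clock Configuration set, while a low word with bits 1:0 clear, such as 0x0040, decodes as ASPM disabled.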

(3).

For the RTL8111GS-CG, you can also try the Linux driver from the Realtek website.

RTL8111B/RTL8168B/RTL8111/RTL8168
RTL8111C/RTL8111CP/RTL8111D(L)
RTL8168C/RTL8111DP/RTL8111E
RTL8168E/RTL8111F/RTL8411
RTL8111G/RTL8111GUS/RTL8411B(N)
RTL8118AS
http://www.realtek.com.tw/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false

<— doesn’t work
I used “LINUX driver for kernel up to 4.7” to build r8168.ko and used r8168.ko instead of r8169.ko as the RTL8111GS-CG’s driver module.
The system still printed a lot of PCIe error messages.
Then I removed r8168.ko from /lib/modules/… (so I don’t use any RTL8111GS-CG driver now), and the system still prints a lot of PCIe error messages.

Thanks!

Are you using upstream kernel on your board?

We use the kernel source code from the Nvidia git (Git tag: tegra-l4t-r21.5) with some modifications for our board.

“Then I remove r8168.ko under /lib/modules/… (I don’t use any RTL8111GS-CG driver now), the system also still prints a lot of pcie error messages”
You mean that even if you don’t load any driver for the end point, you still get these error messages? If so, then there is certainly something wrong with the electricals on that board.

Hello,
About this issue, can we adjust the PCIe drive strength in software?
If so, how?

Thanks!

In this case, it looks like the board has gone bad. I’m afraid there is not much that can be done in software to improve the situation.

Hello,
We also see the message “pcieport 0000:00:00.0: can’t find device of ID0018”
What is the device of ID0018?

Thanks!

PCIE error 3.txt (26.8 KB)

Hi vidyas,
This error doesn’t happen on every custom board we made. We swapped the RTL chips between a working board and a non-working board.
After the swap, both of them worked well, with no PCIe error messages printed.
Is there any tool or method to test the compatibility of the RTL chips with the TK1 SoC?

This somehow doesn’t look correct.
Assuming it is complaining about the Realtek NIC missing from the bus, it should say either 0x100 or 0x200, as the NIC would have been enumerated on bus-1 or bus-2 (as Device-0 and Function-0). But 0x18 is B=0, D=3, F=0, which doesn’t correspond to any device as such.
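
The B/D/F split above can be sketched as follows. A 16-bit PCIe ID packs the bus number in bits 15:8, the device number in bits 7:3 and the function number in bits 2:0; the helper names are hypothetical.

```python
def decode_bdf(requester_id):
    """Split a 16-bit PCIe ID into (bus, device, function)."""
    bus = (requester_id >> 8) & 0xFF
    device = (requester_id >> 3) & 0x1F
    function = requester_id & 0x7
    return bus, device, function

def format_bdf(requester_id):
    """Format the ID the way lspci prints addresses, e.g. '01:00.0'."""
    bus, device, function = decode_bdf(requester_id)
    return f"{bus:02x}:{device:02x}.{function}"
```

With this layout, 0x0018 decodes to bus 0, device 3, function 0, while the NIC at 01:00.0 in the lspci output corresponds to 0x0100.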