PCIe silicon bug causing denial of service on Tegra X1 and X2?

THIS IS A SEPARATE ISSUE FROM THE "PCIe HANGS THE SoC" BUG described here:

We have found another strange behavior of the PCIe root complex on the Tegra platform.

Let's connect an FPGA with a simple debug core that can issue MRd32 requests and receive completions.

Reading address 0x80000000 initially works on both TX1 and TX2 for an indefinite time. Now fire a read request to the trigger address, 0x08000000 (TX2) or 0x00800000 (TX1): this read never returns and a completion timeout occurs (we use a 30 µs timeout). Subsequent reads of the original working address (0x80000000) still work, but only up to a point: every 128th read (TX2) or 64th read (TX1) is not completed by the Tegra. Furthermore, after about 12000 reads (TX2) or 2700 reads (TX1) the PCIe host ends up in a blocked state in which no further MRd32 requests can be issued. This was tested on R28.1.
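
The sequence above can be written down as a small test harness. This is only a sketch: `ToyHost` is a hypothetical stand-in that replays the reported numbers (it is not a model of the real root complex, and it simplifies the "blocked" state to "every read times out"), and `run_repro` and `read32` are illustrative names, not part of any NVIDIA or FPGA vendor API.

```python
from dataclasses import dataclass
from typing import Callable, List

GOOD_ADDR = 0x80000000  # address that reads fine before the trigger


@dataclass
class Profile:
    trigger_addr: int   # read that never completes
    drop_every: int     # every N-th later read gets no completion
    blocked_after: int  # approximate read count until the host blocks


# Numbers taken from the report above (R28.1).
TX2 = Profile(0x08000000, 128, 12000)
TX1 = Profile(0x00800000, 64, 2700)


class ToyHost:
    """Replays the reported symptoms; hypothetical, not real hardware."""

    def __init__(self, profile: Profile) -> None:
        self.profile = profile
        self.triggered = False
        self.count = 0  # reads issued since the trigger

    def read32(self, addr: int) -> bool:
        """Return True if a completion arrives, False on timeout."""
        if addr == self.profile.trigger_addr:
            self.triggered = True
            return False
        if not self.triggered:
            return True
        self.count += 1
        if self.count > self.profile.blocked_after:
            return False  # simplification of the blocked state
        return self.count % self.profile.drop_every != 0


def run_repro(read32: Callable[[int], bool],
              trigger_addr: int, n_reads: int) -> List[int]:
    """Fire the trigger read, then n_reads reads of GOOD_ADDR.

    Returns the 1-based indices of the reads that timed out.
    """
    read32(trigger_addr)
    return [i for i in range(1, n_reads + 1) if not read32(GOOD_ADDR)]


if __name__ == "__main__":
    host = ToyHost(TX2)
    print(run_repro(host.read32, TX2.trigger_addr, 512))
```

Against real hardware, `host.read32` would be replaced by whatever function drives the debug core and reports whether the completion arrived within the timeout.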

The address that triggers the issue is not listed in /proc/iomem.
A completion timeout on such an unassigned region would be fine with me, but it is unexpected that it breaks the whole PCIe subsystem.
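
For reference, a check like "is this address listed in /proc/iomem" can be scripted. The sketch below parses text in the usual `start-end : name` format; the sample text is made up for illustration and is not a dump from a TX1 or TX2, and `parse_iomem` and `owners` are hypothetical helper names. On newer kernels the file may show zeroed addresses to unprivileged users, so reading it as root is the safer assumption.

```python
import re
from typing import List, Tuple

# Matches lines such as "  80000000-efffffff : System RAM".
_RANGE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def parse_iomem(text: str) -> List[Tuple[int, int, str]]:
    """Turn /proc/iomem text into (start, end, name) tuples."""
    ranges = []
    for line in text.splitlines():
        m = _RANGE.match(line)
        if m:
            start = int(m.group(1), 16)
            end = int(m.group(2), 16)
            ranges.append((start, end, m.group(3).strip()))
    return ranges


def owners(addr: int, ranges: List[Tuple[int, int, str]]) -> List[str]:
    """Names of all listed regions that contain addr (end is inclusive)."""
    return [name for start, end, name in ranges if start <= addr <= end]


# Made-up sample in /proc/iomem format, for illustration only.
SAMPLE = """\
03100000-0310003f : serial
80000000-efffffff : System RAM
  80080000-810fffff : Kernel code
"""

if __name__ == "__main__":
    ranges = parse_iomem(SAMPLE)
    # On a target one would read the real file instead:
    #   ranges = parse_iomem(open("/proc/iomem").read())
    for addr in (0x80000000, 0x08000000):
        print(hex(addr), owners(addr, ranges) or "not listed")
```

An empty result only means that no resource is registered at that address; it says nothing about how the root complex will react to a read of it.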

Is there an errata list for the Tegra SoCs anywhere, so that we can check whether this is a duplicate or a new bug?

Enabling the SMMU for PCIe (R28.2 has it) would result in an SMMU error whenever PCIe accesses an unassigned address. We will check your observations and get back to you. Real endpoints (by which I mean off-the-shelf devices available on the market) do not keep issuing reads when the completion for a previous read has not been received, which is why this issue has not been seen before.

Since the intermediate addresses (i.e. 0x08000000 on TX2 or 0x00800000 on TX1) are illegal, the endpoint observes a completion timeout, which is an uncorrectable error, so correct operation of the root port is not guaranteed in the long run. In this particular case, the endpoint should either not have sent illegal addresses to the root port or not have sent further reads after receiving an uncorrectable error.
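
The second option (no further reads after an uncorrectable error) can be sketched as a latch on the requester side. This is an illustration of the policy only: `GuardedReader`, `LinkFaulted` and the `read32` callable are hypothetical names, and whether a reset actually restores correct behavior on Tegra is not established in this thread.

```python
from typing import Callable


class LinkFaulted(Exception):
    """Raised when a read is attempted after an unhandled timeout."""


class GuardedReader:
    """Stops issuing reads once a completion timeout has been seen."""

    def __init__(self, read32: Callable[[int], bool]) -> None:
        # read32(addr) is assumed to return True when the completion
        # arrived and False when the completion timeout expired.
        self._read32 = read32
        self.faulted = False

    def read(self, addr: int) -> bool:
        if self.faulted:
            raise LinkFaulted("completion timeout seen; reset required")
        ok = self._read32(addr)
        if not ok:
            self.faulted = True  # latch the uncorrectable error
        return ok

    def reset(self) -> None:
        """Clear the latch after the link has been reinitialized (not shown)."""
        self.faulted = False


if __name__ == "__main__":
    # Stand-in requester: only the TX2 trigger address times out.
    reader = GuardedReader(lambda addr: addr != 0x08000000)
    print(reader.read(0x80000000))
    print(reader.read(0x08000000))
    try:
        reader.read(0x80000000)
    except LinkFaulted as exc:
        print("blocked:", exc)
```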

Fair enough.

But devices can often be reset after a completion timeout, either as part of the AER handler callback or through driver logic, as the NVMe driver does: when it sees its FATAL ERROR FLAG, it will try to reset the device even without AER.

We will check whether a root port reset/reinitialization solves the missing reads (after we solve some other issues; this is now more of a feature than a bug, but it should be documented anyway, so let's keep it here).

Thanks vidyas!