PCIe not being recognized - TX2

personnongrata · June 7, 2019, 4:36pm

I am trying to get a PCIe card to be recognized by the TX2 but so far have had no luck. I am running the latest version of Jetpack 4.2, with a modified rootfs that removes the OEM configuration so the board can run headless (I removed the oem-* services from systemd).

The device is a PCIe x4 Gen 2 device.

Kernel: 4.9.140
L4T: 32.1
Rootfs: 18.04.2 LTS (Bionic Beaver)

I have made a couple of changes to the device tree:

tegra186-quill-p3310-1000-c03-00-base.dtb

Added in boot delay. Note I am decompiling the full dtb then making changes to this rather than the dtsi files.

pcie-controller@10003000 {
                compatible = "nvidia,tegra186-pcie";
                power-domains = <0x1f 0x9>;
                device_type = "pci";
                ....
                phandle = <0x110>;
                nvidia,boot-detect-delay = <20000>;

                pci@1,0 {
                        device_type = "pci";
                        assigned-addresses = <0x82000800 0x0 0x10000000 0x0 0x1000>;
                        reg = <0x800 0x0 0x0 0x0 0x0>;
                        status = "okay";
                        #address-cells = <0x3>;
                        #size-cells = <0x2>;
                        ranges;
                        nvidia,num-lanes = <0x4>;
                        nvidia,afi-ctl-offset = <0x110>;
                        nvidia,disable-aspm-states = <0xf>;
                        nvidia,disable-clock-request;
                };

tegra186-a02-bpmp-quill-p3310-1000-c04-00-te770d-ucm2.dtb

Added in clock@plle to disable SSC

clock@sdmmc2 {
                        clk-id = <0x35>;
                        allow_fractional_divider = <0x1>;
                };

                clock@plle {
                        clk-id = <0x200>;
                        pll_freq_table = <0x249f000 0x5f5e100 0x2 0x7d 0x18 0xffffffff 0xffffffff 0xffffffff 0xffffffff>;
                };
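As a quick sanity check on the table above, here is a minimal Python sketch that converts the hexadecimal cells from the decompiled DTB back to decimal. It only does arithmetic; what each cell means to BPMP-FW is not stated in this post, and the name `to_signed32` is made up for illustration.

```python
# Cells copied from the decompiled BPMP DTB fragment above.
DTB_CELLS = [0x249f000, 0x5f5e100, 0x2, 0x7d, 0x18,
             0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff]


def to_signed32(cell):
    """Interpret a 32-bit device-tree cell as a signed integer."""
    return cell - (1 << 32) if cell & 0x80000000 else cell


decoded = [to_signed32(c) for c in DTB_CELLS]
print(decoded)
# [38400000, 100000000, 2, 125, 24, -1, -1, -1, -1]
```

The decimal form is easier to compare against a table written in .dtsi source, where 0xffffffff would normally appear as (-1).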

Boot log shows it trying 10 seconds after boot:

[   21.087259] tegra-pcie 10003000.pcie-controller: PCIE: Enable power rails
[   21.092805] tegra-pcie 10003000.pcie-controller: probing port 0, using 4 lanes
[   21.096158] tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
[   21.527196] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   21.939004] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   22.349838] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   22.351986] tegra-pcie 10003000.pcie-controller: link 0 down, ignoring
[   22.766871] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   23.182838] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   23.598477] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   23.600636] tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
[   23.600753] tegra-pcie 10003000.pcie-controller: PCIE: no end points detected
[   23.603858] tegra-pcie 10003000.pcie-controller: PCIE: Disable power rails

Does anyone have any suggestions about where to go next? Note the HDMI is also disabled in the device tree to make sure it definitely runs headless.

Thanks

personnongrata · June 11, 2019, 6:17am

respectful bump

personnongrata · June 12, 2019, 6:32am

After further research I have come across several people reporting that FPGAs fail to enumerate because they take time to come up after power is supplied.

Looking at the logs, the interval between the PCIe power rails being enabled and the probing starting is likely not long enough, and this could be causing the issue.
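To put a number on that interval, here is a minimal Python sketch that pulls timestamps out of dmesg-style lines. The three lines are copied from the boot log in the first post; the helper name `first_stamp` is made up for illustration, and "first link-down report" is used only as a rough stand-in for when the driver gives up waiting on the link.

```python
import re

LOG = """\
[   21.087259] tegra-pcie 10003000.pcie-controller: PCIE: Enable power rails
[   21.092805] tegra-pcie 10003000.pcie-controller: probing port 0, using 4 lanes
[   21.527196] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
"""

# Matches "[   21.087259] message" and captures the time and the message.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def first_stamp(log, needle):
    """Return the timestamp in seconds of the first line containing needle."""
    for line in log.splitlines():
        m = STAMP.match(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


rails_on = first_stamp(LOG, "Enable power rails")
first_retry = first_stamp(LOG, "link 0 down")
gap = first_retry - rails_on
print(f"{gap:.3f} s between rails on and first link-down report")
```

For this log the gap comes out at under half a second, which is short if the endpoint needs seconds to configure itself.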

To help investigate:

Is it possible to always have the power to the PCIe enabled? Using the device tree for example?
Is it possible to add a longer delay between the power on and the checking (for example 20 seconds) through the recompilation of the kernel?

Thanks,

linuxdev · June 12, 2019, 7:10pm

Sorry, I don’t know the details, but there are ways to edit the kernel to allow a late probe via echo into “/sys” after boot. The PCIe, as it is, is not hot plug and normally only enumerates upon initial startup (you are correct that an FPGA takes more time than other PCIe devices and needs late enumeration). What the patch does (and to emphasize, I don’t remember where I saw the patch) is let PCI enumeration be triggered by an echo into “/sys” once boot is complete. Someone else will likely see this and know the patch.

personnongrata · June 13, 2019, 6:31am

Firstly, thank you for the response linuxdev.

I think that the patch that you are referring to likely comes from this post here: https://devtalk.nvidia.com/default/topic/997845/jetson-tx1/force-rescan-of-pcie-bus-/post/5101082/#5101082 with the patch code in this post: https://devtalk.nvidia.com/default/topic/997845/jetson-tx1/force-rescan-of-pcie-bus-/post/5102006/#5102006

To copy that post:

There are a couple of ways to handle this situation (i.e. the end point being an FPGA that comes up slowly):
1. There is an entry in device tree "nvidia,boot-detect-delay" which can be used to delay enumeration by a specified time
2. Make PCIe host controller driver as a loadable module (pci-tegra.ko) and insmod it only after FPGA is ready
3. If you can tell me the version being used here (L4T 23 / L4T 24.2 Etc...), I can provide a patch with which we can have only root ports getting enumerated (in case end point is not ready) and later 'rescan' can be used to rescan the bus and add newly found devices to hierarchy.

In response to those three points

1 - I added this to the device tree and can see it coming up much later, but still does not work.
2 - I will try this, but can’t see how it is different to number 1 as the device tree simply loads the driver later
3 - I shall try this, with the caveat being that the patch is for 24.2 and there is a report in that thread that it does not work with the 32.1 version of L4T.
4 - Spread spectrum clocking could still be an issue, can a Nvidia rep confirm how to disable SSC in linux 32.1?

Thanks,

linuxdev · June 13, 2019, 2:10pm

After boot did you run “sudo echo 1 > /sys/bus/pci/rescan”?
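One caveat with the command as written: the `>` redirection is performed by the calling shell, not by sudo, so from a non-root shell the write can fail with a permission error even though echo itself runs as root. Piping into `sudo tee /sys/bus/pci/rescan` avoids that. The same write can be sketched in Python, to be run as root; the function name is made up for illustration and the `path` parameter exists only so the sketch can be tried against an ordinary file.

```python
def request_pci_rescan(path="/sys/bus/pci/rescan"):
    """Write "1" to the PCI rescan attribute; needs root on a real system."""
    with open(path, "w") as f:
        f.write("1\n")
```

Whether the rescan then finds anything still depends on the root port having been enumerated and the link being up.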

personnongrata · June 13, 2019, 3:06pm

Yes I did. There was no change, and nothing in dmesg that indicated anything had happened.

personnongrata · June 14, 2019, 12:01pm

I have done some of the above points.

Added a longer delay between enabling the PCIE ports and checking for a link (12 seconds)
Added the ability to force rescan (the patch in the previous post)
Disabled SSC by changing the clk-plle.c to disable it using this post https://devtalk.nvidia.com/default/topic/1036587/jetson-tx1/what-are-best-way-for-disable-jetson-tx1-pcie-ssc-/post/5266788/#5266788
Added the nvidia pcie delay to the device tree to delay the pcie loading of the controller

The results are that now I can indeed force a rescan of the PCIe bus (kind of, I see that it now does something in dmesg) but still no luck. It also appears that I have disabled the USB (a side effect of doing the SSC change). See the boot dump here, first is from the boot then several minutes later the rescan

[   11.091362] tegra-xusb 3530000.xhci: entering ELPG
[   11.098859] tegra-xusb 3530000.xhci: entering ELPG done
[   21.093261] tegra-pcie 10003000.pcie-controller: PCIE: Enable power rails
[   21.100126] tegra-pcie 10003000.pcie-controller: probing port 0, using 4 lanes
[   21.102579] tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
[   36.585747] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   36.993650] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   37.407221] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   37.409422] tegra-pcie 10003000.pcie-controller: Skipping for later 0
[   37.817211] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   38.228124] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   38.637093] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   38.639205] tegra-pcie 10003000.pcie-controller: Skipping for later 2
[   38.646302] tegra-pcie 10003000.pcie-controller: PCI host bridge to bus 0000:00
[   38.646360] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
[   38.646385] pci_bus 0000:00: root bus resource [mem 0x40100000-0x47ffffff]
[   38.646398] pci_bus 0000:00: root bus resource [mem 0x48000000-0x7fffffff pref]
[   38.646424] pci_bus 0000:00: root bus resource [bus 00-ff]
[   38.646648] pci 0000:00:01.0: [10de:10e5] type 01 class 0x060400
[   38.647188] pci 0000:00:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[   38.648343] iommu: Adding device 0000:00:01.0 to group 54
[   38.648364] arm-smmu: forcing sodev map for 0000:00:01.0
[   38.648555] pci 0000:00:03.0: [10de:10e6] type 01 class 0x060400
[   38.648670] pci 0000:00:03.0: PME# supported from D0 D1 D2 D3hot D3cold
[   38.648958] iommu: Adding device 0000:00:03.0 to group 55
[   38.648966] arm-smmu: forcing sodev map for 0000:00:03.0
[   38.649081] pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[   38.649095] pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[   38.649268] tegra-pcie 10003000.pcie-controller: PCIE: Response decoding error, signature: 10010001
[   38.658366] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[   38.658530] pci_bus 0000:02: busn_res: [bus 02-ff] end is updated to 02
[   38.658568] pci 0000:00:01.0: PCI bridge to [bus 01]
[   38.658583] pci 0000:00:03.0: PCI bridge to [bus 02]
[   38.660176] pcieport 0000:00:01.0: Signaling PME through PCIe PME interrupt
[   38.660184] pcie_pme 0000:00:01.0:pcie001: service driver pcie_pme loaded
[   38.660281] aer 0000:00:01.0:pcie002: service driver aer loaded
[   38.660533] pcieport 0000:00:03.0: Signaling PME through PCIe PME interrupt
[   38.660539] pcie_pme 0000:00:03.0:pcie001: service driver pcie_pme loaded
[   38.660757] aer 0000:00:03.0:pcie002: service driver aer loaded
[  542.425670] tegra-pcie 10003000.pcie-controller: PCIE: Response decoding error, signature: 10010001
[  542.434779] pci_bus 0000:01: busn_res: [bus 01] end is updated to 01
[  542.434837] pci_bus 0000:02: busn_res: [bus 02] end is updated to 02
[  542.435004] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[  542.435017] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[  542.446914] pcieport 0000:00:03.0:   device [10de:10e6] error status/mask=00004000/00000000
[  542.455388] pcieport 0000:00:03.0:    [14] Completion Timeout     (First)
[  542.462321] pcieport 0000:00:03.0: broadcast error_detected message
[  542.462326] pcieport 0000:00:03.0: broadcast mmio_enabled message
[  542.462330] pcieport 0000:00:03.0: broadcast resume message
[  542.462337] pcieport 0000:00:03.0: AER: Device recovery successful

Interestingly, when doing the rescan I get the following output:

[  580.105671] tegra-pcie 10003000.pcie-controller: PCIE: Response decoding error, signature: 10010001
[  580.115092] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[  580.126884] pcieport 0000:00:03.0:   device [10de:10e6] error status/mask=00004000/00000000
[  580.135361] pcieport 0000:00:03.0:    [14] Completion Timeout     (First)

Thoughts? I think the bus that it keeps referring to is not the PCIe slot on the carrier card but rather the USB parts.
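One way to check which device the error refers to is to decode the fields in the log. The sketch below is a minimal illustration, not a tool from this thread, and the helper names are made up. It decodes the status word 00004000 using the standard AER Uncorrectable Error Status bit positions and splits the requester ID 0018 into bus, device and function.

```python
# Bit positions in the AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def decode_status(status):
    """Return the names of all error bits set in the status word."""
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if status & (1 << bit)]


def decode_requester_id(rid):
    """Split a 16-bit requester ID into (bus, device, function)."""
    return rid >> 8, (rid >> 3) & 0x1F, rid & 0x7


print(decode_status(0x00004000))
bus, dev, fn = decode_requester_id(0x0018)
print(f"{bus:02x}:{dev:02x}.{fn}")
```

The ID decodes to 00:03.0, which the boot log lists as a class 0x060400 (PCI-to-PCI bridge) device, so the error appears to come from one of the PCIe root ports rather than from the USB controller.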

linuxdev · June 14, 2019, 4:17pm

Just to emphasize, I do not know the details required for delayed PCIe enumeration. However, I can tell you the TX1 itself has a different device tree. The TX1 change might have the right idea, but it is likely the actual required changes need editing. To make this more difficult, the TX1 kernel in use for the other forum threads is significantly older in version compared to R32.1. It is possible that some of the device tree edits are affecting parts of the system you wouldn’t expect due to being for the wrong kernel and carrier.

So I believe you are on the right track, but someone from NVIDIA will have to add information on how to delay PCIe enumeration under R32.1.

personnongrata · June 14, 2019, 4:38pm

Thanks for the input - yes I think so too. I was hoping someone from Nvidia might have replied by now with the following info (hint hint):

Disable SSC for TX2 in 32.1
Delay enumeration of the PCIe for TX2 in 32.1

vidyas · June 20, 2019, 6:19am

Hi,
Is there any specific reason to doubt that having SSC enabled would be causing issues to PCIe link up? Is FPGA endpoint here not using REFCLK from TX2 but using its own internal clock as REFCLK?

Following patch can be used to delay the enumeration by 10 secs. Please adjust it to increase/decrease the delay further

diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 7b6fbd5d90a8..d189348fddcd 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -2513,6 +2513,7 @@ static void tegra_pcie_check_ports(struct tegra_pcie *pcie)
        }

        /* Wait for clock to latch (min of 100us) */
+       msleep(10000);  /* currently it is 10 sec. Adjust here to increase/decrease it */
        udelay(100);
        reset_control_deassert(pcie->pciex_rst);
        /* at this point in time, there is no end point which would

Let me see if I can come up with a patch to disable SSC as well.

vidyas · June 20, 2019, 7:37am

It is a bit tricky to disable SSC. Just to give some background, Tegra’s PLLs/clocks are controlled by an entity called BPMP-FW and instructions to disable SSC need to be given to BPMP-FW through its device-tree (Please note that BPMP-FW’s device-tree is different from kernel’s device-tree)
Please apply the following patch in $TOP/platform/bpmp/ folder and reflash the target with new BPMP-FW file.

diff --git a/tegra186-common.dtsi b/tegra186-common.dtsi
index a5580ad5c0d6..e3ff4230bc32 100644
--- a/tegra186-common.dtsi
+++ b/tegra186-common.dtsi
@@ -5,4 +5,12 @@
                edition = <1>;
                dbs = <(TEGRA186_DB_CPU_S | TEGRA186_DB_CPU_NS | TEGRA186_DB_DMCE)>;
        };
+
+       clocks {
+               clock@plle {
+                       clk-id = <TEGRA186_CLK_PLLE>;
+                       /* disable ssc on PLLE */
+                       pll_freq_table = <38400000 100000000 2 125 24 (-1) (-1) (-1) (-1)>;
+               };
+       };
 };

One way to check if SSC is enabled/disabled in a platform is to dump the value @ address 0x05043000. If bit-12 is
0 → SSC enabled (with default build, we would get a value like 0x20010025 where bit-12 is ‘0’)
1 → SSC disabled (after applying above patch, the value would be 0x20011C25 where bit-12 is ‘1’)
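The bit test described above can be sketched in a few lines of Python. The bit position and both example values come from this post; the function name is made up for illustration, and the value is assumed to have been read beforehand with a tool such as busybox devmem.

```python
SSC_DISABLE_BIT = 12  # per the post above: bit 12 set means SSC is disabled


def ssc_disabled(reg_value):
    """Return True if bit 12 of the value read at 0x05043000 is set."""
    return bool(reg_value & (1 << SSC_DISABLE_BIT))


for value in (0x20010025, 0x20011C25):
    state = "disabled" if ssc_disabled(value) else "enabled"
    print(f"0x{value:08X}: SSC {state}")
```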

personnongrata · June 21, 2019, 11:29am

Hi vidyas,

I have tried both of your patches; unfortunately, neither has resulted in the PCIe working.

The SSC patch, I can confirm has worked by reading the memory location:

nvidia@tegra-ubuntu:~$ sudo busybox devmem 0x05043000
0x20011C25

And the delay for the PCIe (12 seconds):

[    0.898157] tegra-pcie 10003000.pcie-controller: 4x1, 1x1 configuration
[    0.899271] tegra-pcie 10003000.pcie-controller: PCIE: Enable power rails
[    0.899733] tegra-pcie 10003000.pcie-controller: probing port 0, using 4 lanes
[    0.903181] tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
[   13.539812] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   13.943641] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   14.348243] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[   14.350281] tegra-pcie 10003000.pcie-controller: link 0 down, ignoring
[   14.763644] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   15.173706] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   15.579663] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[   15.581708] tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
[   15.581719] tegra-pcie 10003000.pcie-controller: PCIE: no end points detected
[   15.582017] tegra-pcie 10003000.pcie-controller: PCIE: Disable power rails

Do you have any suggestions about where to check/debug next?

vidyas · June 24, 2019, 9:53am

What about the questions asked in #11?
Repeating them here…
Is there any specific reason to doubt that having SSC enabled would be causing issues to PCIe link up?
Is FPGA endpoint here not using REFCLK from TX2 but using its own internal clock as REFCLK?
Also, does this device come up on any other host (i.e. x86 for example)?
Do we have CLKREQ pin routing from the endpoint to host? (Although this may not have any impact on getting the link up, just asking to get more info)

personnongrata · June 27, 2019, 8:44am

Hi Vidyas,

Apologies for the delay - I was waiting for confirmation of answers from the manufacturer.

Is there any specific reason to doubt that having SSC enabled would be causing issues to PCIe link up?

No - but it has been suggested previously on this forum that not all FPGAs end-points support SSC. However, the manufacturer has confirmed that the card is SSC compatible.

Is FPGA endpoint here not using REFCLK from TX2 but using its own internal clock as REFCLK?

The FPGA endpoint is using REFCLK from the root controller.

Also, does this device come up on any other host (i.e. x86 for example)?

We will retest this but before we started to test against the TX2 it was working correctly.

Do we have CLKREQ pin routing from the endpoint to host? (Although this may not have any impact on getting the link up, just asking to get more info)

We are waiting for confirmation on this, I will confirm once I know.

Thoughts and next steps?

vidyas · June 27, 2019, 10:46am

Can you please try the following patch, which emulates the "no CLKREQ routing from endpoint to host" scenario?

diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 7b6fbd5d90a8..2648af82df56 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -3516,6 +3516,7 @@ static int tegra_pcie_parse_dt(struct tegra_pcie *pcie)
                        return -EADDRNOTAVAIL;
                rp->disable_clock_request = of_property_read_bool(port,
                        "nvidia,disable-clock-request");
+               rp->disable_clock_request = 1;

                rp->rst_gpio = of_get_named_gpio(port, "nvidia,rst-gpio", 0);
                if (gpio_is_valid(rp->rst_gpio)) {

personnongrata · July 4, 2019, 7:55am

Hi vidyas,

This has not worked. I have been looking for a way to confirm that the kernel change has taken effect. Is it possible to read a memory location to confirm?

The manufacturer has confirmed the CLKREQ is not connected on the PCIe card.

vidyas · July 4, 2019, 8:38am

Can you please remove “nvidia,enable-power-down” from the PCIe node’s device-tree entry? This should get at least the root port listed in lspci output even when no endpoint is connected (by default, the root port is not listed either if the PCIe link to the endpoint is not up).
Once only root port is listed, you can dump the value at location 0x141a0000 and update us.

shrinathchoudhary · November 6, 2019, 10:58am

vidyas:

It is a bit tricky to disable SSC. Just to give some background, Tegra’s PLLs/clocks are controlled by an entity called BPMP-FW and instructions to disable SSC need to be given to BPMP-FW through its device-tree [...]

Even after applying the above patch, SSC is not disabled:

nvidia@tegra-ubuntu:~$ sudo busybox devmem 0x05043000
0x20010025

Am I missing anything?

shrinathchoudhary · November 11, 2019, 5:28am

Any updates?