[SOLVED] Problem with Intel 600p NVMe SSD

Hi all,

I recently got a Jetson TX1 module and an Auvidea J120 rev. 3 carrier board. I plugged an Intel 600p SSD into the M.2 PCIe slot of the J120, but I cannot get the SSD to work.

I am using L4T R24.2, whose kernel does not have the NVMe driver built in, so I recompiled the kernel with the NVMe driver enabled and flashed the new kernel to the board. But it still does not work.
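
In case the exact steps matter, here is roughly what I did (a sketch only; the defconfig name and toolchain prefix are from my setup and may differ on yours):

cd <L4T-R24.2-kernel-source>                   # path placeholder
export ARCH=arm64
export CROSS_COMPILE=aarch64-linux-gnu-        # assumption: adjust to your cross toolchain
make tegra21_defconfig                         # assumption: the TX1 defconfig in this tree
scripts/config --enable CONFIG_BLK_DEV_NVME    # build the NVMe block driver into the kernel
make olddefconfig                              # re-resolve dependencies after the config change
make -j4 Image dtbs modules
# after flashing and booting, check that the option stuck and watch the probe:
zcat /proc/config.gz | grep -i nvme            # needs CONFIG_IKCONFIG_PROC
dmesg | grep -i nvme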

Here is the output of lsblk:

ubuntu@tegra-ubuntu:~$ lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0rpmb  179:16   0    4M  0 disk 
mmcblk0      179:0    0 14.7G  0 disk 
|-mmcblk0p1  179:1    0   14G  0 part /
|-mmcblk0p2  179:2    0    2M  0 part 
|-mmcblk0p3  179:3    0    4M  0 part 
|-mmcblk0p4  179:4    0    2M  0 part 
|-mmcblk0p5  179:5    0    6M  0 part 
|-mmcblk0p6  179:6    0    4M  0 part 
|-mmcblk0p7  179:7    0    6M  0 part 
|-mmcblk0p8  179:8    0    2M  0 part 
|-mmcblk0p9  179:9    0    2M  0 part 
|-mmcblk0p10 179:10   0   20M  0 part 
|-mmcblk0p11 179:11   0   64M  0 part 
|-mmcblk0p12 179:12   0   64M  0 part 
|-mmcblk0p13 179:13   0    4M  0 part 
|-mmcblk0p14 179:14   0    2M  0 part 
|-mmcblk0p15 179:15   0    6M  0 part 
|-mmcblk0p16 259:0    0    6M  0 part 
|-mmcblk0p17 259:1    0    2M  0 part 
`-mmcblk0p18 259:2    0  496M  0 part

It seems only the internal eMMC is found.

The output of lspci:

00:01.0 PCI bridge: NVIDIA Corporation Device 0fae (rev a1)
01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5 (rev 03)

It seems the SSD is connected.

I checked dmesg and found this:

...
[    2.565807] brd: module loaded
[    2.572587] loop: module loaded
[    2.577479] PCI: enabling device 0000:01:00.0 (0140 -> 0142)
[    2.742637] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[    2.753924] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
[    2.769019] pcieport 0000:00:01.0:   device [10de:0fae] error status/mask=00004000/00000000
[    2.780667] pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
[    2.789130] pcieport 0000:00:01.0: broadcast error_detected message
[   62.752384] nvme: probe of 0000:01:00.0 failed with error -5
[   62.759704] pcieport 0000:00:01.0: AER: Device recovery failed
[   62.760089] zram: Created 1 device(s) ...
...

I googled around but found nothing useful. Does anyone have any idea what the problem is?
Thx!

This isn’t much help, but about all I can suggest is that PCIe sees the device and that the PCIe bus survived (this doesn’t mean the particular device will work, though non-fatal implies it has a chance). The nvme probe then suggests you did get the driver installed, but that the driver can’t live with the error. I have no idea if this is a software issue or a hardware issue. If you have another Linux machine with an M.2 slot, you might add the SSD to that and check what dmesg shows there as a comparison. Also, if you have that other Linux machine, use “lspci -vvv” on the device for more PCIe information (lspci -vvv can probably be run on the Auvidea board as well).

Hi linuxdev,

Thanks for the quick response! Here is the lspci -vvv output from the J120:

ubuntu@tegra-ubuntu:~$ sudo lspci -vvvv -s 1:0
01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5 (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation Device 390a
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 130
	Region 0: Memory at 13000000 (64-bit, non-prefetchable) [disabled] 
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via message
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [158 v1] #19
	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [180 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us

Actually I have 2 sets of this combination (TX1 + J120 + 600p SSD), and both have the same issue.
Maybe the NVMe driver in Linux 3.10 is too old? Or is there some hardware incompatibility between the J120 and the 600p?
Has anyone successfully built a newer kernel for the TX1?

Looks like the drive is capable of PCIe V3 (8GT/s), but Jetson only runs at max V2 (5GT/s). Even so, the drive has throttled all the way back to V1 speeds (2.5GT/s). Throttling back would imply there is a signal quality issue. There are a lot of places signal quality could be a problem, including cables (and cables are the easiest to check…unless it happens to be on an M.2 slot without cables…unfortunately you can’t actually swap out circuit board traces).

One non-conclusive test would be to look at the “lspci -vvv” listing on both a known working system and the Jetson, side by side. The goal is to see whether the M.2 device actually runs at PCIe V2 speeds on the other system (or perhaps V3, if that board supports it). Although PCIe V3 could actually be harder to support because of its higher speeds, one of the features PCIe V3 adds over PCIe V2 is fine tuning of de-emphasis: V1 has only -3.5dB available, V2 adds -6dB as a possible de-emphasis, whereas V3 has 11 predefined levels (with active feedback between TX and RX, I think the 11 predefined levels end up equivalent to something like 32 levels). All of that is to say V3 might actually work on a system where both bus and device are capable of V3 even when V2 on that same system would fail (a case of active de-emphasis tuning between RX and TX outsmarting the signal problems).

To make a long story short, use “lspci -vvv” on a different system and see what signal rate that system achieves. If the other system throttles back as well, the odds go up that the signal quality problem is in the M.2 SSD itself; if the other system runs at V2 or V3 rates, the odds go up that the issue is related to the Jetson host (which is a mix of module and carrier board).
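
If it helps, a quick way to compare just the link fields on both machines (a small sketch; 01:00.0 is the SSD’s address on the Jetson and may differ on another system):

sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'
# LnkCap = what the device advertises (8GT/s x4 here), LnkSta = what was actually negotiated (2.5GT/s x4 on the Jetson)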

As to whether older kernels have driver issues, this may be the case. However, there has been a lot of attention to fixing SSD driver problems on L4T, so the driver may not be as far out of date as the 3.10.xx kernel version might otherwise imply. Someone may have more knowledge if you can give more specific information about the Intel 600P SSD (like exact model number and any identifying information). If it turns out you need newer driver code, then rather than using a newer kernel you are better off back porting from mainline into the L4T kernel. Going to mainline kernel actually opens up a whole new set of problems.

Hi linuxdev,

Thank you! This is really helpful!
I will check the M.2 slot to find out whether a connection issue is preventing PCIe V2 speed.
Unfortunately I currently do not have another machine to test the SSD in (I have to say the TX1 is the newest platform I have for now).
Is there any command or something I can run to verify whether a machine supports PCIe V3/V2/V1?

I am downloading the Linux driver backports and trying to compile a kernel module to give it another try.
The SSD is an Intel 600p 256GB: https://www-ssl.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-600p-series.html

The root complex or PCI bridge (the part determining what version of PCIe is possible) is typically shown as a PCIe device under lspci (if you don’t have a device connected, then the root complex may not show up). If that bridge or root complex is used by your end device, then “lspci -t” will show a tree view where your device can be matched to a given bridge or root complex. Within the bridge or root complex information (sudo lspci -vvv) will be the link capability, “LnkCap”, and will list one of 2.5GT/s, 5GT/s, or 8GT/s. These correspond to a capability of PCIe V1, V2, or V3, respectively. The link status, “LnkSta”, offers information about the speed a lane of the link is actually running at.
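
A minimal sketch of that, using the addresses from the listings earlier in this thread (00:01.0 is the NVIDIA bridge, 01:00.0 the SSD):

lspci -t                                              # tree view: match the SSD (01:00.0) to its upstream bridge
sudo lspci -vvv -s 00:01.0 | grep -E 'LnkCap|LnkSta'  # 2.5GT/s, 5GT/s, or 8GT/s in LnkCap => PCIe V1, V2, or V3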

I do not know what form of support Intel has for Linux drivers on that drive, but do keep in mind that if the driver was intended for a 4.x kernel version, then you would need source code to back port (binary drivers wouldn’t help). Should you choose to back port, it does not mean the changes from the 3.10.x version are an improvement…the changes could be just to make mainline work. Plus, you still don’t know if the particular problem is driver related or not.

I just ordered an M.2 to PCIe x4 conversion board, and I am planning to try these SSDs on my desktop machine. If the SSD works there, the problem should be with the J120's M.2 slot or the driver.

And then I am going to get another M.2 SATA3 SSD to test the slot. (So much money, oh my god!)

It seems Intel does not provide a Linux driver for this drive. I tried the Linux backports, but there is no NVMe driver there.
http://drvbp1.linux-foundation.org/~mcgrof/rel-html/backports/

I am trying to figure out how to backport the Linux NVMe driver module to 3.10, but from what I can find, too much of the NVMe driver code has changed in Linux. So…

I can confirm that I have the same error with a different 600P.

Does the Intel 600p M.2 2280 SSD work now? If not, what’s a workaround?

Is there any solution/workaround for this topic around?

Not yet for me

The M.2 key E slot does NOT support SATA SSDs, is that right?

I think whether or not the SSD works on M.2 depends on the drive. Here’s an M.2 description…note that the slot supports multiple ways of using it, and the device in the slot can pick among them (and the pick will change driver requirements):
[url]https://en.wikipedia.org/wiki/M.2[/url]

Generally, this M.2 key E slot does not accept an SSD card. The explanation referenced on https://en.wikipedia.org/wiki/M.2 is as below:
What type of applications use key E?
TE’s M.2 key E connectors are found in applications that use wireless connectivity including Wi-Fi, Bluetooth, NFC, or GNSS. Module card types used include 1630, 2230 and 3030.

Also, could someone try disabling ASPM completely at the defconfig level (‘CONFIG_PCIEASPM=n’) and see if that helps?
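
For anyone who wants to try it, a rough sketch of two ways (the defconfig path/name is an assumption for the TX1 R24.2 source tree, and pcie_aspm=off is only a similar runtime check, not exactly the same as compiling ASPM support out):

# 1) defconfig level, then rebuild and reflash the kernel:
#    edit arch/arm64/configs/tegra21_defconfig (assumed name) and add the line
#      # CONFIG_PCIEASPM is not set
#    (the usual defconfig spelling of CONFIG_PCIEASPM=n)
# 2) quicker check without rebuilding, on setups that boot via extlinux.conf:
#    append pcie_aspm=off to the APPEND line in /boot/extlinux/extlinux.conf and reboot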

WOW! Disabling ASPM really works! Thank you for the help.