Auvidea J120 / PCIe NVMe / r24.2.1 booting from SSD

I am having a hard time (a two-week ordeal already) getting the TX1 to boot from SSD. As the title says, my setup is:

  • Auvidea carrier board J120 with IMU
  • TX1 module
  • M.2 PCIe NVMe SSD (Samsung 960 EVO 256GB or Intel 600p 128GB)

First, out of the box the r24.2.1 kernel does not support NVMe devices. I rebuilt it with “CONFIG_BLK_DEV_NVME=y”, after which the Samsung drive is recognized. For the Intel SSD to be recognized I also had to disable ASPM (“# CONFIG_PCIEASPM is not set”).
I am using the Linaro 5.3-2016.02 compilers to cross-compile the kernel, and then copy Image, zImage and the content of the lib dir over to the target.
I tried to partition the SSD using gdisk and gparted, with the same results. To get the file system onto the SSD I tried extracting the sample root file system from the NVIDIA site and then running apply_binaries.sh. I also tried copying the content of the root MMC partition using the guide from JetsonHacks (“cp -ax / /mnt/…”).
The result is always similar: once I change to “root=/dev/nvme0n1p1” in extlinux.conf, the boot process gets stuck. In the console I can see “Root device found: n”, or worse, the log is full of “Buffer I/O error on device nvme0n1, logical block xxxxxx” errors, followed by core dumps…

I am running out of ideas… any help will be greatly appreciated.

As far as I know we can boot from USB, but booting from PCIe still needs to be checked.

It is necessary to add nvme to the init script in /boot/initrd.

rootdev=`cat /proc/cmdline | sed -e 's/.*root=\/dev\/\([abcdefklmnpsv012]*\).*/\1/'`;

elif [[ "${rootdev}" == nvme0n1p* ]]; then

Thanks hummer_x!
This is exactly what I figured out today.
For those who, like me, did not know what initrd is: it is a gzip-compressed filesystem image used as a stepping stone between U-Boot and the full root file system. Inside there is an ‘init’ shell script that parses the kernel arguments passed by U-Boot. The problem is that it assumes the root partition is either /dev/sdxxxx or /dev/mmcblk0px, plus some special cases such as network boot. So /dev/nvme0n1p1 will not be parsed and the boot will stop.
To fix it, two things are needed:

  1. As hummer_x pointed out above, the letter ‘v’ has to be added to the regular expression in the sed command:
rootdev=`cat /proc/cmdline | sed -e 's/.*root=\/dev\/\([abcdefklmnpsv012]*\).*/\1/'`;
  2. We need to handle the /dev/nvme0n1p1 case. You can add an elif section as hummer_x suggests, but I do not see the need to do anything different from the mmcblk0px sections, so I just added another condition:
    if [[ "${rootdev}" == mmcblk0p* || "${rootdev}" == mmcblk1p* || "${rootdev}" == nvme0n1p* ]]; then
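To see why the extra letter matters, here is a small sketch that runs the sed expression against a sample command line string instead of /proc/cmdline. The function names `parse_rootdev` and `parse_rootdev_stock` and the sample command line are made up for illustration; the stock character class is assumed to be the patched one minus the ‘v’.

```shell
#!/bin/bash
# Hypothetical helper: extract the root device name from a kernel command
# line string, using the character class from the patched init script.
parse_rootdev() {
	echo "$1" | sed -e 's/.*root=\/dev\/\([abcdefklmnpsv012]*\).*/\1/'
}

# Same expression without 'v' in the class (assumed stock version), for comparison.
parse_rootdev_stock() {
	echo "$1" | sed -e 's/.*root=\/dev\/\([abcdefklmnps012]*\).*/\1/'
}

# Made-up command line, only to show the effect.
cmdline="console=ttyS0,115200n8 root=/dev/nvme0n1p1 rw rootwait"
echo "patched: $(parse_rootdev "$cmdline")"
echo "stock:   $(parse_rootdev_stock "$cmdline")"
```

Without the ‘v’, the captured group stops at the first character outside the class, so only `n` is captured. That matches the “Root device found: n” message seen on the console.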

There are different methods of repackaging the initrd if you google for it. I used the one described in the kernel source documentation: tegra-linux-3.10/Documentation/initrd.txt

# running as root, be careful
sudo -s

cd $TEMPDIR
cp /path/to/original/initrd ./
mkdir extracted
cd extracted
# extract the image
gzip -cd ../initrd | cpio -imd --quiet
# now in the extracted folder is content of the initrd ramdisk

Now make the changes to the init script
and pack it back:

find . | cpio --quiet -H newc -o | gzip -9 -n > ../initrd_new
# to prevent disaster exit the root shell
exit

Now you can copy initrd_new to the /boot folder on the target device and adjust /boot/extlinux/extlinux.conf to point to initrd_new.
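A hedged sketch of that extlinux.conf adjustment, done with sed on text from stdin rather than on the file in place. The helper name `retarget_entry` and the sample entry are hypothetical; the real extlinux.conf on the device has a longer APPEND line, which should stay as it is apart from root=.

```shell
#!/bin/bash
# Hypothetical helper: read extlinux.conf text on stdin and print it with
# the INITRD path and the root= argument replaced.
#   $1 = new initrd path, $2 = new root device
# Assumes the keyword is written in upper case (INITRD).
retarget_entry() {
	sed -e "s#^\([[:space:]]*INITRD[[:space:]]\{1,\}\).*#\1$1#" \
	    -e "s#root=[^[:space:]]*#root=$2#"
}

# Made-up sample entry, only to show the effect.
printf '%s\n' \
	'LABEL primary' \
	'      LINUX /boot/Image' \
	'      INITRD /boot/initrd' \
	'      APPEND console=ttyS0,115200n8 root=/dev/mmcblk0p1 rw rootwait' |
	retarget_entry /boot/initrd_new /dev/nvme0n1p1
```

It is probably worth keeping the original entry under a second LABEL, so there is still a way to boot from the eMMC if the SSD root does not come up.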

See this post on how to prepare the file system on the new disk: https://devtalk.nvidia.com/default/topic/974861/jetson-tx1/exact-steps-to-extend-move-file-system-root-on-tx1-to-use-vnand-m-2-on-j120/post/5012309/#5012309

I found that “cp -ax / /mount/point/of/to/be/root/partition” works well; I only removed the content of the /tmp folder. (http://www.jetsonhacks.com/2017/01/28/install-samsung-ssd-on-nvidia-jetson-tx1/)

Now you should be able to boot r24.2.1 from NVMe…

The next thing is to port the dts and config changes for the J120… I think I have it figured out. I will run experiments on Monday and post complete instructions. If someone has already done it, especially the fan control, I would be glad to hear about it.

OK, I think I am almost there.
I have the TX1 on the J120 booting from SSD and I can access the IMU on the J120 via SPI, but I am still suffering from “Buffer I/O error on device nvme0n1, logical block xxxxxx”. It works OK for a while, and then after the n-th restart those errors appear. I can usually boot into the extlinux configuration where the NVMe drive is mounted but not booted from, and run e2fsck. Usually orphaned inodes are found and fixed, but sometimes it is so bad that there are core dumps and it ends up in the “emergency shell”. I have the impression that it is more likely to happen when I work via the serial-port system console and a GUI session at the same time.

Here are the details of the build I am using:
config_override:

# CONFIG_NET_EMATCH_CANID is not set

CONFIG_CAN=y
CONFIG_CAN_RAW=y
CONFIG_CAN_BCM=y
CONFIG_CAN_GW=y

#
# CAN Device Drivers
#
# CONFIG_CAN_VCAN is not set
# CONFIG_CAN_SLCAN is not set
CONFIG_CAN_DEV=y
CONFIG_CAN_CALC_BITTIMING=y
# CONFIG_CAN_LEDS is not set
CONFIG_CAN_MCP251X=m
# CONFIG_PCH_CAN is not set
# CONFIG_CAN_GRCAN is not set
# CONFIG_CAN_SJA1000 is not set
# CONFIG_CAN_C_CAN is not set
# CONFIG_CAN_CC770 is not set

#
# CAN USB interfaces
#
# CONFIG_CAN_EMS_USB is not set
# CONFIG_CAN_ESD_USB2 is not set
# CONFIG_CAN_KVASER_USB is not set
# CONFIG_CAN_PEAK_USB is not set
# CONFIG_CAN_8DEV_USB is not set
# CONFIG_CAN_SOFTING is not set
# CONFIG_CAN_DEBUG_DEVICES is not set
#

#enable support for NVME drives like M2
CONFIG_BLK_DEV_NVME=y

#Why the eeprom support??
CONFIG_EEPROM_93CX6=m

CONFIG_RTL8187=m
CONFIG_RTL8187_LEDS=y

CONFIG_ATH_COMMON=m
CONFIG_ATH_CARDS=m
# CONFIG_ATH_DEBUG is not set
# CONFIG_ATH5K is not set
# CONFIG_ATH5K_PCI is not set
# CONFIG_ATH9K is not set
# CONFIG_ATH9K_HTC is not set
CONFIG_CARL9170=m
CONFIG_CARL9170_LEDS=y
CONFIG_CARL9170_WPC=y
# CONFIG_ATH6KL is not set
# CONFIG_AR5523 is not set
# CONFIG_WIL6210 is not set

CONFIG_RT2X00=m
# CONFIG_RT2400PCI is not set
# CONFIG_RT2500PCI is not set
# CONFIG_RT61PCI is not set
# CONFIG_RT2800PCI is not set
CONFIG_RT2500USB=m
CONFIG_RT73USB=m
CONFIG_RT2800USB=m
CONFIG_RT2800USB_RT33XX=y
CONFIG_RT2800USB_RT35XX=y
CONFIG_RT2800USB_RT53XX=y
CONFIG_RT2800USB_RT55XX=y
CONFIG_RT2800USB_UNKNOWN=y
CONFIG_RT2800_LIB=m
CONFIG_RT2X00_LIB_USB=m
CONFIG_RT2X00_LIB=m
CONFIG_RT2X00_LIB_FIRMWARE=y
CONFIG_RT2X00_LIB_CRYPTO=y
CONFIG_RT2X00_LIB_LEDS=y
# CONFIG_RT2X00_DEBUG is not set
CONFIG_RTLWIFI=m
CONFIG_RTLWIFI_DEBUG=y
# CONFIG_RTL8192CE is not set
# CONFIG_RTL8192SE is not set
# CONFIG_RTL8192DE is not set
# CONFIG_RTL8723AE is not set
# CONFIG_RTL8188EE is not set
CONFIG_RTL8192CU=m
CONFIG_RTL8192C_COMMON=m

CONFIG_SPI_SPIDEV=m

# CONFIG_SOC_CAMERA_PLATFORM is not set

CONFIG_MEDIA_ATTACH=y

CONFIG_CRC_ITU_T=m

I use the merge_config script to merge it with the config generated by the tegra21_default target:

./scripts/kconfig/merge_config.sh -n -r -O $AUV_KERNEL_OUT $AUV_KERNEL_OUT/.config ./auvidea/config_override
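Since merge_config.sh may leave a requested option out of the final .config when its dependencies are not met, a quick check after merging can save a rebuild. This is only a sketch: the helper name `config_enabled` is hypothetical, and the option list is just a sample taken from the override file above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if option $2 is set to y or m in config file $1.
config_enabled() {
	grep -Eq "^$2=(y|m)\$" "$1"
}

# Example call; the path reuses $AUV_KERNEL_OUT from the merge command
# and the check is skipped if that .config does not exist.
cfg="${AUV_KERNEL_OUT:-.}/.config"
if [ -f "$cfg" ]; then
	for opt in CONFIG_BLK_DEV_NVME CONFIG_SPI_SPIDEV CONFIG_CAN_MCP251X; do
		if config_enabled "$cfg" "$opt"; then
			echo "$opt: enabled"
		else
			echo "$opt: NOT enabled"
		fi
	done
fi
```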

Modified dts; I only added the section after “AUVIDEA J120 overrides”.
arch/arm64/boot/dts/tegra210-jetson-tx1-p2597-2180-a01-devkit.dts:

/*
 * arch/arm64/boot/dts/tegra210-jetson-tx1-p2597-2180-a01-devkit.dts
 *
 * Copyright (c) 2014-2015, NVIDIA CORPORATION.  All rights reserved.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; version 2 of the License.
 *
 * This program is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
 * more details.
 *
 */

#include "tegra210-jetson-cv-base-p2597-2180-a00.dts"

/ {
	model = "jetson_tx1";
	compatible = "nvidia,jetson-cv", "nvidia,tegra210";
	nvidia,dtsfilename = __FILE__;

	#address-cells = <2>;
	#size-cells = <2>;

	chosen {
		bootloader {
			nvidia,skip-display-init;
		};
	};

	host1x {
		dc@54200000 {
			status = "disabled";
		};

		dc@54240000 {
			nvidia,dc-or-node = "/host1x/sor1";
		};

		dsi {
			status = "disabled";
			panel-a-wuxga-8-0 {
				status = "disabled";
			};
			panel-s-wqxga-10-1 {
				status = "disabled";
			};
		};
	};

	i2c@7000c400 {
		lp8557-backlight-a-wuxga-8-0@2c {
			status = "disabled";
		};
	};

	backlight {
		status = "disabled";
	};
	
/****************************
 * AUVIDEA J120 overrides
 ****************************/
	camera-pcl {
		/delete-property/compatible;
		/delete-property/configuration;
		/delete-node/modules;
		/delete-node/profiles;
	};
	

	gpio@6000d000 {
		camera-control {
			/* reset and powerdown gpios for ov5693_c */
			gpio-output-low = <0x94 0x97>;
		};

		default {
			gpio-to-sfio = <0x10 0x11 0x12 0x13 0x14>;
		};
	};
	
	/*
	 * Omnivision camera 
	 */
	ov5693_c@36 {
		avdd-reg = "vana";
		compatible = "nvidia,ov5693";
		iovdd-reg = "vif";
		mclk = "cam_mclk1";
		physical_h = "2.738";
		physical_w = "3.674";
		pwdn-gpios = <&gpio 0x97 0x0>;
		reset-gpios = <&gpio 0x94 0x0>;
		vana-supply = <&en_vdd_cam_hv_2v8>;
		vif-supply = <&en_vdd_cam>;

		mode0 {
			active_h = "1944";
			active_w = "2592";
			cil_settletime = [30 00];
			discontinuous_clk = "no";
			dpcm_enable = "false";
			inherent_gain = [31 00];
			line_length = "2688";
			max_exp_time = "550385";
			max_framerate = "30";
			max_gain_val = "16";
			max_hdr_ratio = "64";
			mclk_khz = "24000";
			mclk_multiplier = "6.67";
			min_exp_time = "34";
			min_framerate = "1.816577";
			min_gain_val = "1.0";
			min_hdr_ratio = [31 00];
			num_lanes = [32 00];
			pix_clk_hz = "160000000";
			pixel_t = "bayer_bggr";
			readout_orientation = "90";
			tegra_sinterface = "serial_c";
		};

		mode1 {
			active_h = "1458";
			active_w = "2592";
			cil_settletime = [30 00];
			discontinuous_clk = "no";
			dpcm_enable = "false";
			inherent_gain = [31 00];
			line_length = "2688";
			max_exp_time = "550385";
			max_framerate = "30";
			max_gain_val = "16";
			max_hdr_ratio = "64";
			mclk_khz = "24000";
			mclk_multiplier = "6.67";
			min_exp_time = "34";
			min_framerate = "1.816577";
			min_gain_val = "1.0";
			min_hdr_ratio = [31 00];
			num_lanes = [32 00];
			pix_clk_hz = "160000000";
			pixel_t = "bayer_bggr";
			readout_orientation = "90";
			tegra_sinterface = "serial_c";
		};

		mode2 {
			active_h = "720";
			active_w = "1280";
			cil_settletime = [30 00];
			discontinuous_clk = "no";
			dpcm_enable = "false";
			inherent_gain = [31 00];
			line_length = "1752";
			max_exp_time = "358733";
			max_framerate = "120";
			max_gain_val = "16";
			max_hdr_ratio = "64";
			mclk_khz = "24000";
			mclk_multiplier = "6.67";
			min_exp_time = "22";
			min_framerate = "2.787078";
			min_gain_val = "1.0";
			min_hdr_ratio = [31 00];
			num_lanes = [32 00];
			pix_clk_hz = "160000000";
			pixel_t = "bayer_bggr";
			readout_orientation = "90";
			tegra_sinterface = "serial_c";
		};

		mode3 {
			active_h = "1944";
			active_w = "2592";
			cil_settletime = [30 00];
			discontinuous_clk = "no";
			dpcm_enable = "false";
			inherent_gain = [31 00];
			line_length = "3696";
			max_exp_time = "687981";
			max_framerate = "24";
			max_gain_val = "16";
			max_hdr_ratio = "64";
			mclk_khz = "24000";
			mclk_multiplier = "7.33";
			min_exp_time = "42";
			min_framerate = "1.453262";
			min_gain_val = "1.0";
			min_hdr_ratio = [31 00];
			num_lanes = [32 00];
			pix_clk_hz = "176000000";
			pixel_t = "hdr_bggr";
			readout_orientation = "90";
			tegra_sinterface = "serial_c";
		};
	};
	
	tegra-camera-platform {
		
		modules {

			module0 {
				badge = "e3326_front_P5V27C";
				orientation = [31 00];
				position = "rear";

				drivernode0 {
					pcl_id = "v4l2_soc_sensor";
					proc-device-tree = "/proc/device-tree/ov5693_c@36";
				};

				drivernode1 {
					pcl_id = "v4l2_focuser_stub";
				};
			};
		};
	};
	
		
	sound {
		/delete-property/compatible;
		/delete-property/nvidia,audio-routing;
		/delete-property/nvidia,dummy-codec-dai;
		/delete-property/nvidia,dummy-codec-dai-name;
		/delete-property/nvidia,hp-det-gpios;
		/delete-property/nvidia,model;
		/delete-property/nvidia,num-codec-link;
		/delete-property/nvidia,xbar;
		
		/delete-node/nvidia,dai-link-1;
		/delete-node/nvidia,dai-link-2;
		/delete-node/nvidia,dai-link-3;
		/delete-node/nvidia,dai-link-4; 
		
	};
	
    /*****************
	 * SPIs
	 *****************/
	
	clocks {
	
		/** clocks for can0 can1 */
		mcp2515_clk: mcp2515 {
			#clock-cells = <0x0>;
			clock-frequency = <16000000>;
			compatible = "fixed-clock";
		};
	};
	spi0: spi@7000d400 {
		spi-max-frequency = <25000000>;
		status = "okay";

		can0@0 {
			clocks = <&mcp2515_clk>;
			compatible = "microchip,mcp2515";
			gpio-controller;
			interrupt-parent = <&gpio>;
			interrupts = <0x41 0x0>;
			mcp251x,enable-clkout = <0x1>;
			mcp251x,oscillator-frequency = <16000000>;
			mcp251x,stay-awake = <0x1>;
			reg = <0x1>;
			spi-max-frequency = <10000000>;
			status = "okay";
		};

		can1@0 {
			clocks = <&mcp2515_clk>;
			compatible = "microchip,mcp2515";
			gpio-controller;
			interrupt-parent = <&gpio>;
			interrupts = <0x26 0x0>;
			mcp251x,enable-clkout = <0x1>;
			mcp251x,oscillator-frequency = <16000000>;
			mcp251x,stay-awake = <0x1>;
			reg = <0x1>;
			spi-max-frequency = <10000000>;
			status = "okay";
		};
	};
	spi1: spi@7000d600 {
		spi-max-frequency = <25000000>;
		status = "okay";
		spi1_0 {
			#address-cells = <0x1>;
			#size-cells = <0x0>;
			compatible = "linux,spidev", "spidev";
			nvidia,chipselect-gpio = <&gpio 0x10 0x0>;
			nvidia,cs-hold-clk-count = <0x1e>;
			nvidia,cs-setup-clk-count = <0x1e>;
			nvidia,enable-hw-based-cs;
			nvidia,rx-clk-tap-delay = <0x1f>;
			nvidia,tx-clk-tap-delay = <0x0>;
			reg = <0x0>;
			spi-max-frequency = <25000000>;
		};

		spi1_1 {
			#address-cells = <0x1>;
			#size-cells = <0x0>;
			compatible = "linux,spidev", "spidev";
			nvidia,chipselect-gpio = <&gpio 0x11 0x0>;
			nvidia,cs-hold-clk-count = <0x1e>;
			nvidia,cs-setup-clk-count = <0x1e>;
			nvidia,enable-hw-based-cs;
			nvidia,rx-clk-tap-delay = <0x1f>;
			nvidia,tx-clk-tap-delay = <0x0>;
			reg = <0x1>;
			spi-max-frequency = <25000000>;
		};
	};
	
	spi2: spi@7000d800 {
		status = "okay";
	};
	

	spi3: spi@7000da00 {
		spi-max-frequency = <25000000>;
		status = "okay";
		
		spi3_0 {
			#address-cells = <0x1>;
			#size-cells = <0x0>;
			compatible = "linux,spidev", "spidev";
			nvidia,cs-hold-clk-count = <0x1e>;
			nvidia,cs-setup-clk-count = <0x1e>;
			nvidia,enable-hw-based-cs;
			nvidia,rx-clk-tap-delay = <0x1f>;
			nvidia,tx-clk-tap-delay = <0x0>;
			reg = <0x0>;
			spi-max-frequency = <25000000>;
		};
	};
	
};

and custom initrd init script:

#!/bin/bash

# Copyright (c) 2014-2015, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
DATE="`date +\"%T\"`"
build_date=`date`;
initrd_dir=/mnt/initrd;
dhclient_flag="true";
count=0;
echo "[ $DATE ] L4T initrd starts from here.....";
#Mount procfs, devfs and sysfs
mount -t proc proc /proc
if [ $? -ne 0 ]; then
	echo "ERROR: mounting proc fail...";
	exec /bin/bash;
fi;
mount -t devtmpfs none /dev
if [ $? -ne 0 ]; then
	echo "ERROR: mounting dev fail...";
	exec /bin/bash;
fi;
mount -t sysfs sysfs /sys
if [ $? -ne 0 ]; then
	echo "ERROR: mounting sys fail...";
	exec /bin/bash;
fi;
echo "[ $DATE ] L4T-INITRD Build DATE: $build_date";

rootdev=`cat /proc/cmdline | sed -e 's/.*root=\/dev\/\([abcdefklmnpsv012]*\).*/\1/'`;

echo "[ $DATE ] L4T-INITRD Root device found in /proc/cmdline: $rootdev";
if [[ "${rootdev}" == mmcblk0p* || "${rootdev}" == mmcblk1p* || "${rootdev}" == nvme* ]]; then
	if [ ! -e "/dev/${rootdev}" ]; then
		count=0;
		while [ ${count} -lt 50 ]
		do
			sleep 0.2;
			count=`expr $count + 1`;
			if [ -e "/dev/${rootdev}" ]; then
				break;
			fi
		done
	fi
	if [ -e "/dev/${rootdev}" ]; then
			echo "[ $DATE ] L4T-INITRD Rootdevice found: /dev/${rootdev}";
	else
		echo "[ $DATE ] L4T-INITRD ERROR: ${rootdev} not found";
		exec /bin/bash;
	fi
	mount /dev/${rootdev} /mnt/;
	if [ $? -ne 0 ]; then
		echo "[ $DATE ] L4T-INITRD ERROR: ${rootdev} mount fail...";
		exec /bin/bash;
	fi;
elif [[ "${rootdev}" == sd* ]]; then
	if [ ! -e "/dev/${rootdev}" ]; then
		while [ ${count} -lt 50 ]
		do
			sleep 0.2;
			count=`expr $count + 1`;
			if [ -e "/dev/${rootdev}" ]; then
				break;
			fi
		done
	fi
	if [ -e "/dev/${rootdev}" ]; then
			echo "Rootdevice found: /dev/${rootdev}";
	else
		echo "ERROR: ${rootdev} not found";
		exec /bin/bash;
	fi
	mount /dev/${rootdev} /mnt/;
	if [ $? -ne 0 ]; then
		echo "ERROR: ${rootdev} mount fail...";
		exec /bin/bash;
	fi;
elif [[ "${rootdev}" == "nfs" ]]; then
	eth_interface=`ifconfig -a | grep eth | awk '{print $1}'`;
	echo "[ $DATE ] Ethernet interfaces: $eth_interface";
	for eth in ${eth_interface}
	do
		ipaddr=`ifconfig "$eth" | grep -A1 "$eth" | grep "inet addr" | sed 's/.*addr:\([0-9\.]*\) .*/\1/'`;
		echo "[ $DATE ] eth_interface is pointing to $eth and ipaddr: $ipaddr"
		if [[ "$ipaddr" =~ [0-9]*.[0-9]*.[0-9]*.[0-9]* ]]; then
			echo "[ $DATE ]: static ipaddr found: $ipaddr";
			dhclient_flag="false";
			break;
		fi
		while [ ${count} -lt 50 ]
		do
			sleep 0.2;
			ipaddr=`ifconfig "$eth" | grep -A1 "$eth" | grep "inet addr" | sed 's/.*addr:\([0-9\.]*\) .*/\1/'`;
			echo "[ $DATE ] eth_interface is pointing to $eth and ipaddr: $ipaddr"
			if [[ "$ipaddr" =~ [0-9]*.[0-9]*.[0-9]*.[0-9]* ]]; then
				echo "[ $DATE ]: static ipaddr found: $ipaddr";
				dhclient_flag="false";
				break;
			fi
			count=`expr $count + 1`;
		done
		if [ "$dhclient_flag" == "false" ]; then
			break;
		fi
	done
	if [ "$dhclient_flag" == "true" ]; then
		for eth in ${eth_interface}
		do
			echo "[ $DATE ] eth_interface is pointing to $eth";
			timeout 8s /sbin/dhclient $eth;
			if [ $? -eq 0 ]; then
				break;
			fi;
		done
		if [ $? -ne 0 ]; then
			echo "[ $DATE ] ERROR: dhclient fail...";
			exec /bin/bash;
		fi;
	fi;
	nfsroot_path=`cat /proc/cmdline | sed -e 's/.*nfsroot=\([^ ]*\) .*/\1 /'`;
	# JCK: mount -t nfs -o nolock $nfsroot_path /mnt/;
	mount -t nfs -v -o nolock $nfsroot_path /mnt/;
	if [ $? -ne 0 ]; then
		echo "ERROR: NFS mount fail...";
		exec /bin/bash;
	fi;
else
	echo "[ $DATE ] No root-device: Mount failed"
	exec /bin/bash;
fi

echo "[ $DATE ] L4T-INITRD Rootfs mounted over ${rootdev}";
mount -o bind /proc /mnt/proc;
mount -o bind /sys /mnt/sys;
mount -o bind /dev/ /mnt/dev;
cd /mnt;
cp /etc/resolv.conf etc/resolv.conf

echo "[ $DATE ] L4T-INITRD Switching from initrd to actual rootfs";
exec chroot . /sbin/init 2;
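The init script repeats the same polling loop in the mmcblk/nvme and sd branches. As a sketch only, the loop could be factored into one function; the name `wait_for_dev` and its arguments are hypothetical and not part of the NVIDIA script.

```shell
#!/bin/bash
# Hypothetical refactoring of the polling loop from the init script:
# wait until path $1 exists, checking up to $2 times (default 50) every 0.2 s.
wait_for_dev() {
	local dev="$1" tries="${2:-50}" count=0
	while [ ! -e "$dev" ] && [ "$count" -lt "$tries" ]; do
		sleep 0.2
		count=$((count + 1))
	done
	# Return status reflects whether the device node finally exists.
	[ -e "$dev" ]
}

# Example: /dev/null always exists, so this returns immediately.
if wait_for_dev /dev/null; then
	echo "device present"
fi
```

With the defaults this waits about 10 seconds at most, the same as the 50 x 0.2 s loop in the script.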

The config override, dts changes and init script above are the full change set I was able to figure out, but the Buffer I/O errors show up with just “CONFIG_BLK_DEV_NVME=y” added to the config.
I am looking for ideas on how to narrow down or isolate the errors. Stress tests on the SSD?
Check for hardware issues?
I think the problem only appears after a reboot; if it works, then all is OK until the next reboot.
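One possible shape for such a stress test, as a rough sketch rather than a proven diagnostic: write random data to the SSD, read it back and compare. The helper name `verify_roundtrip`, the `DROP_CACHES` switch and the mount point in the commented example are all assumptions.

```shell
#!/bin/bash
# Hypothetical helper: write $2 MiB (default 64) of random data into
# directory $1, copy it, and compare the two files byte by byte.
# Returns non-zero on a mismatch or I/O error.
verify_roundtrip() {
	local dir="$1" size_mb="${2:-64}" rc=0
	local src="$dir/stress_src.bin" copy="$dir/stress_copy.bin"
	dd if=/dev/urandom of="$src" bs=1M count="$size_mb" 2>/dev/null || return 1
	cp "$src" "$copy" || return 1
	sync
	# Optional: drop the page cache so the comparison reads from the drive.
	# Needs root; enabled only when DROP_CACHES=1 is set in the environment.
	if [ "${DROP_CACHES:-0}" = 1 ]; then
		{ echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null ||
			echo "could not drop caches" >&2
	fi
	cmp -s "$src" "$copy" || rc=1
	rm -f "$src" "$copy"
	return "$rc"
}

# Example (mount point is a placeholder): 20 passes of 64 MiB each.
# for i in $(seq 1 20); do
#	DROP_CACHES=1 verify_roundtrip /mnt/ssd 64 || { echo "pass $i failed"; break; }
# done
```

Without dropping the page cache the comparison may be served from RAM and say little about the drive, so the root-only option is the more meaningful one.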

Maciej

OK, I need some help with this.
I do not know where the “Buffer I/O error on device nvme0n1, logical block xxxxxx” is coming from.
To simplify the setup I did the following:
- flashed L4T 24.2.1, just the OS and nothing else
- rebuilt the kernel with CONFIG_BLK_DEV_NVME=y (I compared the configs and the only changes are CONFIG_BLK_DEV_NVME and CONFIG_LOCALVERSION)

So I have NVMe support, but the drive is used as a side partition and not booted from. If all goes OK I can access the partition with no problems, but more often than not I get errors.
See the attached logs; I removed the timestamps to make them easy to compare. The most complete and error-free one is system_console_nvme_lspci_clean_10.txt. It is clean from boot to shutdown.
Comparing it to system_console_nvme_isomgr_coredump_01.txt, I think there are indications of PCIe problems:

LPDDR4 Training: Can’t find emc-table node

WARNING: at /data/git/jetson_kernel/linux-3.10/drivers/platform/tegra/mc/isomgr.c:565 - followed by stack dump

Buffer I/O error on device nvme0n1, logical block

I compiled the nvme-cli tools and checked nvme smart-log /dev/nvme0n1: it is error free, with no spare blocks used. The only suspicious thing there is that unsafe_shutdowns (116) is very close to power_cycles (118), which I think suggests the PCIe/NVMe drivers do not handle shutdown correctly?

root@tegra-ubuntu:~# nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 23 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 142,663
data_units_written                  : 92,064
host_read_commands                  : 1,690,361
host_write_commands                 : 2,205,435
controller_busy_time                : 1
power_cycles                        : 118
power_on_hours                      : 14
unsafe_shutdowns                    : 116
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 23 C
Temperature Sensor 2                : 42 C
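To track those two counters across reboots, the values can be pulled out of the smart-log text with a few lines of awk. This is a sketch; the helper name `smart_field` is hypothetical and it only assumes the `name : value` layout shown in the log. For context, the NVMe specification counts a shutdown as unsafe when power is lost without a prior shutdown notification to the controller.

```shell
#!/bin/bash
# Hypothetical helper: print the value of field $1 from `nvme smart-log`
# text on stdin, with spaces and thousands separators removed.
smart_field() {
	awk -F':' -v key="$1" '
		{
			name = $1
			gsub(/[ \t]+$/, "", name)
			if (name == key) {
				value = $2
				gsub(/[ \t,]/, "", value)
				print value
			}
		}'
}

# Values copied from the smart-log output in this post.
log='power_cycles                        : 118
unsafe_shutdowns                    : 116'
cycles=$(echo "$log" | smart_field power_cycles)
unsafe=$(echo "$log" | smart_field unsafe_shutdowns)
echo "unsafe shutdowns: $unsafe of $cycles power cycles"
```

On the device the same function could be fed directly, for example `nvme smart-log /dev/nvme0n1 | smart_field unsafe_shutdowns`, before and after a clean shutdown to see whether the counter still increments.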

Here is the output of lspci -vvv, if that helps:

root@tegra-ubuntu:~# lspci -vvv
00:01.0 PCI bridge: NVIDIA Corporation Device 0fae (rev a1) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 130
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: 13000000-130fffff
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Subsystem: NVIDIA Corporation Device 0000
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
		Mapping Address Base: 00000000fee00000
	Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Off, PwrInd On, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet+ LinkState+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [140 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=30us PortTPowerOnTime=70us
	Kernel driver in use: pcieport

01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a804 (prog-if 02 [NVM Express])
	Subsystem: Samsung Electronics Co Ltd Device a801
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 130
	Region 0: Memory at 13000000 (64-bit, non-prefetchable) 
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
		Vector table: BAR=0 offset=00003000
		PBA: BAR=0 offset=00002000
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [158 v1] Power Budgeting <?>
	Capabilities: [168 v1] #19
	Capabilities: [188 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [190 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
	Kernel driver in use: nvme
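Because ASPM comes up repeatedly in this thread, it can help to reduce the long lspci dump to just the ASPM state of each link when comparing kernels or drives. A minimal sketch, assuming the `LnkCtl:` line format shown in the output; the helper name `aspm_state` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the ASPM state from every LnkCtl line of
# `lspci -vvv` text on stdin (one line per PCIe link).
aspm_state() {
	sed -n 's/.*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Sample line copied from the lspci output in this post.
printf '\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+\n' | aspm_state
```

On the target this would be used as `lspci -vvv | aspm_state`, run as root so that the capability sections are included.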

Other logs are attached. In one, badblocks -v completes without errors; in another it spills the Buffer I/O errors…
The system_console_primary_… logs were booted with the original r24.2.1 image.
The system_console_nvme_… logs were booted with the recompiled image.

Please help!!



system_console_nvme_badblocksFailed_09.txt (96.8 KB)
system_console_nvme_isomgr_02.txt (101 KB)
system_console_nvme_isomgr_03.txt (153 KB)
system_console_nvme_isomgr_coredump_01.txt (101 KB)
system_console_nvme_lspci_clean_10.txt (101 KB)
system_console_nvme_orphants_badblocksOK_clean_07.txt (94.3 KB)
system_console_nvme_short_04.txt (5 KB)
system_console_nvme_shutdown08.txt (91 KB)
system_console_primary_05.txt (91 KB)
system_console_primary_clean_06.txt (90.5 KB)

Please try the following.

diff --git a/arch/arm64/boot/dts/tegra210-jetson-cv-base-p2597-2180-a00.dts b/arch/arm64/boot/dts/tegra210-jetson-cv-bas
index dc24ce8..2bd86c7 100644
--- a/arch/arm64/boot/dts/tegra210-jetson-cv-base-p2597-2180-a00.dts
+++ b/arch/arm64/boot/dts/tegra210-jetson-cv-base-p2597-2180-a00.dts
@@ -231,7 +231,7 @@
                vddio-pex-ctl-supply = <&max77620_sd3>;
                status = "okay";
 
-               iommus = <&smmu TEGRA_SWGROUP_AFI>;
+//             iommus = <&smmu TEGRA_SWGROUP_AFI>;
 
                pci@1,0 {
                        status = "okay";

Brilliant! You nailed it again :)
It works with the Samsung 960 EVO drives.
Could you also have a look at a slightly different drive, the Intel 600p series 128GB?
The boot process stops and hangs for 60 seconds on this error:

[    3.520274] gk20a gpu.0: failed to allocate secure buffer -12
[    3.534752] brd: module loaded
[    3.543771] loop: module loaded
[    3.550717] PCI: enabling device 0000:01:00.0 (0140 -> 0142)
[    3.714372] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[    3.725935] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
[    3.741228] pcieport 0000:00:01.0:   device [10de:0fae] error status/mask=00004000/00000000
[    3.753109] pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
[   63.714098] nvme: probe of 0000:01:00.0 failed with error -5
[   63.723245] pcieport 0000:00:01.0: AER: Device recovery failed
[   63.723705] zram: Created 1 device(s) ...
[   63.723794] nct1008_nct72 0-004c: find device tree node, parsing dt
[   63.723799] nct1008_nct72 0-004c: starting parse dt
[   63.724080] nct1008_nct72 0-004c: success parsing dt
[   63.724325] nct1008_nct72 0-004c: success in enabling rail vdd_nct72
[   63.775162] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[   63.786481] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
[   63.801623] pcieport 0000:00:01.0:   device [10de:0fae] error status/mask=00004000/00000000
[   63.813391] pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
[   63.823625] pcieport 0000:00:01.0: AER: Device recovery failed
[   63.895207] NCT72: Enabled overheat logging at 104.00C
[   63.903916] gpio wake33 for gpio=188

I know from previous experiments and reports that it works when CONFIG_PCIEASPM is disabled, but it is kind of important to have power management working.
I also noticed the “NCT72: Enabled overheat logging at 104.00C” message after the 60-second hang. I do not have the fan working yet, but I have an external fan providing cooling. The module is cold to the touch, not even warm.
Full log attached.

system_console_nvme_intel600p.txt (135 KB)

After the initial success I am still having problems and suspect the NVMe drivers.
I am getting file system corruption, mainly related to /tmp/… files. My impression is that it mostly happens under heavy load, when I run the ZED camera and ORB_SLAM frameworks together.
I have another NVMe partition where I keep data and code and compile from; no corruption has happened there yet.
Can someone please point me to an explanation of why disabling the iommus section made a difference? Could it only reduce the occurrence of the problem and not solve it?

Maciej

Disabling the IOMMU/SMMU for PCIe would improve the situation only if the NVMe stack has issues working with IOVA addresses. Otherwise, I don’t see how it could help in any other way.
As for the NVMe driver in K-3.10, it is in a very basic shape, and that is probably why it can’t handle heavy loads. If possible, try backporting the driver from K-3.18 or K-4.4, where it has gone through a lot of modifications and is quite stable compared to K-3.10.