Backup and restore on multiple Jetsons

Hi Everyone,

We have about 15 Jetson TX2 units. We needed to install some packages and do some configuration on all of them. Doing so on every Jetson takes a lot of time, so we decided to back up one Jetson that has everything and flash that image onto the rest of them.

So here is what we did:

  1. Flash one Jetson (say Jetson1) using JetPack 3.0, install and configure everything on it.
  2. On the Linux host machine, connect Jetson1 in recovery mode and take a backup using L4T with the following command:
    sudo ./flash.sh -r -k APP -G system.img jetson-tx2 mmcblk0p1
    
  3. Move system.img and system.img.raw to L4T/bootloaders folder.
  4. From some research I learned that L4T can only back up one partition at a time, so we first flash each new Jetson (say Jetson2) using JetPack 3.0 with only the "Flash OS Image to Target" option active, to fill all the other partitions.
  5. Restart Jetson2, connect it in recovery mode again, then flash it using the same L4T with the following command:
    sudo ./flash.sh -r -k APP jetson-tx2 mmcblk0p1
    

So here is the problem: this method worked on 10 of the Jetsons but fails on the other 5.

The ones that didn’t work show the following message while booting: “The system is running in low-graphics mode”. I haven’t been able to fix this so far.

Please note that I use the same process/equipment with all of them. I can’t figure out why it works with some of them and doesn’t with the others!

I hope you can help me figure out this problem or let me know if there is a better/easier way to accomplish the same.

Thanks for your help in advance,
Ayman.

Is there any historic difference between the failed and working clone installs as to which L4T version was on the Jetson just prior to the restore? There are many hidden partitions, and if one of those partitions is from a different version of L4T, or if one of those others had different install options, then I could see this happening.

As an alternative approach, you could clone all of the partitions on one Jetson which works, and attempt to clone restore each of the hidden partitions one at a time into the failing system which already has the rootfs clone restored. See if one of those partitions fixes it.

A note about partitions and cloning: There is still some “bookkeeping” type information at the start of the eMMC which is not part of the hidden partitions, nor part of the rootfs, but which is needed for the system to identify where things are. If this boot record information were from a system cloned with a different partition size or layout versus what was cloned in as a restore, then it would probably imply something would fail when offsets do not match metadata. So consider that if anything were historically different not only in partition content, but also in partition size (especially rootfs since it is the first partition after the metadata), this would break boot.

On a TX2 I have here with R28.1 (using sudo) I can see output from “gdisk -l /dev/mmcblk0” as this:

GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/mmcblk0: 61071360 sectors, 29.1 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 00000000-0000-0000-0000-000000000000
Partition table holds up to 17 entries
First usable sector is 4097, last usable sector is 61071327
Partitions will be aligned on 1-sector boundaries
Total free space is 1 sectors (512 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            4097        41947136   20.0 GiB    0700  APP
   2        41947137        41955328   4.0 MiB     0700  mts-bootpack
   3        41955329        41955840   256.0 KiB   0700  cpu-bootloader
   4        41955841        41956864   512.0 KiB   0700  bootloader-dtb
   5        41956865        41963008   3.0 MiB     0700  secure-os
   6        41963009        41963012   2.0 KiB     0700  eks
   7        41963013        41964220   604.0 KiB   0700  bpmp-fw
   8        41964221        41965220   500.0 KiB   0700  bpmp-fw-dtb
   9        41965221        41969316   2.0 MiB     0700  sce-fw
  10        41969317        41981604   6.0 MiB     0700  sc7
  11        41981605        41985700   2.0 MiB     0700  FBNAME
  12        41985701        42247844   128.0 MiB   0700  BMP
  13        42247845        42313380   32.0 MiB    0700  SOS
  14        42313381        42444452   64.0 MiB    0700  kernel
  15        42444453        42445476   512.0 KiB   0700  kernel-dtb
  16        42445477        42969764   256.0 MiB   0700  CAC
  17        42969765        61071326   8.6 GiB     0700  UDA

APP is the rootfs and is the first partition. It begins at sector 4097; with the 512-byte logical sectors reported above, that is a byte offset of 4097 × 512 = 2,097,664, so sectors 0 through 4096 sit ahead of it. An older BIOS style partition would reserve 512 bytes for MBR, plus backup MBR…UEFI moves some of the firmware into the start of the disk and out of the BIOS. If this were a working and bootable system you could use dd to write into the initial disk metadata using a copy of the metadata from dd reading that leading region of a working system, but since you can’t boot you can’t use dd for write. I don’t know if the clone software is capable of cloning this unlabeled raw byte offset (you could on the TK1, but it has different flasher options).
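
As a rough illustration of the arithmetic, going by the gdisk listing above (512-byte logical sectors, APP starting at sector 4097), the sketch below converts a sector number to a byte offset and copies everything ahead of the first partition into a file with dd. The helper names sector_to_bytes and save_pre_partition_region are made up for this example, and it covers only the read side on a unit that boots; writing the region back to a unit in recovery mode is still the open question.

```shell
#!/bin/sh
# Convert a gdisk sector number to a byte offset.
# $1 = sector number, $2 = logical sector size in bytes (default 512).
sector_to_bytes() {
    echo $(( $1 * ${2:-512} ))
}

# Copy everything ahead of the first partition into a file.
# $1 = source device or disk image (read only, never written to)
# $2 = output file
# $3 = start sector of the first partition (4097 in the listing above)
# Hypothetical helper: assumes 512-byte sectors, as gdisk reported.
save_pre_partition_region() {
    dd if="$1" of="$2" bs=512 count="$3" 2>/dev/null
}
```

On a booted Jetson this might be called as "sudo sh -c '. ./helpers.sh; save_pre_partition_region /dev/mmcblk0 gpt_head.bin 4097'" (helpers.sh being whatever file holds the functions). Note that GPT also keeps a backup header at the end of the disk, so this copy alone is probably not a complete record of the layout.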

Can anyone from NVIDIA suggest if the R28.1 driver package clone can copy via exact byte offset (perhaps with a patch)? Is it mandatory to clone only by partition label? Is there a way to clone the region from byte zero up to the start of the first partition, and then to write this by offset into one of the failed units? This would provide a true backup and restore mechanism even on customized installs.

Thanks linuxdev for your reply.

There is no historic difference between the two. In the same session I was able to flash one Jetson that works and another that doesn’t. Reflashing both gives the same results.

Cloning and restoring each partition separately is lengthy but I can try it. Is it the same process? For example, for the “mts-bootpack” partition, do I need to run “sudo ./flash.sh -r -k mts-bootpack -G system.img jetson-tx2 mmcblk0p1”, move the image to the bootloaders folder, then run “sudo ./flash.sh -r -k mts-bootpack jetson-tx2 mmcblk0p1”?

I checked both working and failing Jetsons and the partition sizes and layout are identical. You can check the result I got here, as it is a bit different from yours: https://imgur.com/a/ULVby

The weird part is that when I flash a new Jetson using JetPack with only the “Flash OS Image to Target” option, it boots normally. Then I flash the APP partition with the image I took from the other Jetson using “sudo ./flash.sh -r -k APP jetson-tx2 mmcblk0p1”, and that is when I get the “The system is running in low-graphics mode” message!

So far as I know cloning is the same for every partition other than having different partition names. Notice that you can list partitions on a Jetson with “sudo gdisk -l /dev/mmcblk0”. One is “APP”, which is the rootfs, and this is why the clone or restore would name “APP”. “mts-bootpack” should be valid for that partition.

One of the things I wanted to emphasize about “being the same” when flashed is that they also be the same rootfs partition size, not just the same version. But given this I would think all Jetsons should function with clones of rootfs. I suppose there is a possibility of a board revision being an issue of requiring some change, but I don’t know of any specific example of this.

I do hope we can find a way to clone via byte offset as well since partition names do not allow clone of the entire eMMC…the region ahead of the first partition really needs to be cloned and written too if complete control over partition content is to be available during production runs based on clone of a reference unit.

I tried cloning the other partitions. A lot of them are not supported, and it seems the process for cloning other partitions is not the same as for the APP partition.

The partitions are exactly the same sizes and layout. I also thought it might be the board revision but what I have tried today didn’t make any sense to me. I flashed the original Jetson with the same image I took from it and it did show the same problem of “The system is running in low-graphics mode”!

I don’t know what the problem could be. Again, this is how I do it:

  1. Backup APP partition
  2. Use Jetpack on new Jetson to flash the OS (to fill other partitions)
  3. Restore APP partition on the new Jetson

Note that I restored the image on the same Jetson I did the backup from using the same Jetpack version I originally flashed it.

Can anyone tell me if I’m doing something wrong, and whether there is a different way to achieve the same?

Thanks.

hello Elkfrawy,

We tried to reproduce your issue but have not been able to hit the same failure so far.

according to your comments,
>> this method worked with 10 of the Jetsons while fails the other 5 Jetsons.
May I know whether you hit this issue consistently on the same 5 Jetson boards?
How about flashing again? Do you still run into this failure?
thanks

I’d just like to add that if we can get a modified flash.sh which allows cloning and restore of raw byte offsets (or raw sector numbers) there would be a lot more we could do in terms of backup/restore/production/testing.

JerryChang,

Thanks for your reply. I tried reflashing multiple times and the same problem of “The system is running in low-graphics mode” still exists. The weird thing is that everything got flashed correctly and all files are there; we can still build and ssh into it. I don’t know what the problem with the X server is, or why it happens with some units and not with the others!

Does “sha1sum -c /etc/nv_tegra_release” show all files valid after the clone restore?

It seems not all files are valid. Here is the result I got:

/usr/lib/aarch64-linux-gnu/tegra/libnvrm_graphics.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvll.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvcamerautils.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvcolorutil.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnveglstreamproducer.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libscf.so: FAILED
/usr/lib/aarch64-linux-gnu/tegra/libnvexif.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmmlite.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvrm.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmm_contentpipe.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvcameratools.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvos.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmm_parser.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvcam_imageencoder.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvapputil.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvwinsys.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvtestresults.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvomxilclient.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmm_utils.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libglx.so: FAILED
/usr/lib/aarch64-linux-gnu/tegra/libnvcamlog.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvosd.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvomx.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libargus_socketserver.so: FAILED
/usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_utils.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnveglstream_camconsumer.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvavp.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvtnr.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmedia.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libargus_socketclient.so: FAILED
/usr/lib/aarch64-linux-gnu/tegra/libnvparser.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libargus.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvimp.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libtegrav4l2.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so: FAILED
/usr/lib/aarch64-linux-gnu/tegra/libnvddk_vic.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmm.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvtvmr.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_image.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_video.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvodm_imager.so: OK
/usr/lib/aarch64-linux-gnu/tegra/libnvdc.so: OK
/usr/lib/aarch64-linux-gnu/libv4l/plugins/libv4l2_nvvidconv.so: OK
/usr/lib/aarch64-linux-gnu/libv4l/plugins/libv4l2_nvvideocodec.so: OK
/usr/lib/xorg/modules/drivers/nvidia_drv.so: OK
/usr/lib/xorg/modules/extensions/libglx.so: FAILED
sha1sum: WARNING: 6 computed checksums did NOT match

What does that mean? I’m not able to think about a reason for that!

It appears something has updated the system such that NVIDIA-specific versions have been replaced with another version. This will cause all kinds of failures, and I’d be surprised if anything using libglx.so has any kind of success with a monitor at all.

In the driver package the “sudo ./apply_binaries.sh” step installs these files. You can run this with the “-r /some/where/else” option to apply the binaries to a different directory than the rootfs subdirectory, and you can also put the “nv_tegra/nvidia_drivers.tbz2” file in the “/” directory of the Jetson and then do this to put them back in place (this isn’t all files, just the core drivers):

cd /
sudo tar -xvjf nvidia_drivers.tbz2 --overwrite

Watch for failures while extracting the drivers, but after this the sha1sum should be ok. However, this does not account for why any of the systems would work at all…I’m curious, do any of the working systems pass the sha1sum test?
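
To make any remaining failures easier to spot after re-extracting the drivers, one option is to filter the OK lines out of the check. This is a minimal sketch, assuming GNU sha1sum and grep; the function name list_failed_checksums is hypothetical.

```shell
#!/bin/sh
# Print only the entries that do not verify against a checksum list.
# $1 = checksum list (on a Jetson this would be /etc/nv_tegra_release).
# LC_ALL=C keeps the "OK"/"FAILED" status words untranslated.
# stderr is discarded so the summary warning and any complaints about
# non-checksum lines in the list do not clutter the result.
list_failed_checksums() {
    LC_ALL=C sha1sum -c "$1" 2>/dev/null | grep -v ': OK$'
}
```

Calling "list_failed_checksums /etc/nv_tegra_release" on the Jetson should print one line per mismatching or unreadable file; no output at all means every listed file verified.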

The working units have sha1sum passing. I copied nvidia_drivers.tbz2 to the failing Jetson and extracted it into the / directory, after which sha1sum passed. But after restarting and connecting it to a monitor, it showed the boot sequence, then the mouse pointer appeared briefly, and after that I got a black screen (no signal to the monitor). I can still ssh to the Jetson.

The part which really sticks out is that if you have cloned from a Jetson with a passing sha1sum, then everything receiving the clone should also pass. Does the sha1sum pass on the Jetson which was the source of the clone? If so, then either the clone was bad, or the restore.

My concern would be that if something modified those files, then other parts of the system were probably also modified…and without knowing exactly what happened you can’t trust that those other parts will work interchangeably with the “corrected” unpack of files.

That said, I would recommend watching what happens via serial console as the system boots. This could provide a very good insight into what remains failing with far less effort than figuring it out one step at a time (serial console might just tell you directly what’s failing…ssh only shows after networking is up…though you can dig through “dmesg” and “/var/log/Xorg.0.log” and the answer might be there).
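
If only ssh is available, a quick first pass over the X log is to pull out the lines the X server marks as errors "(EE)" or warnings "(WW)". A minimal sketch, assuming the default log location mentioned above; the function name xorg_problems is made up for this example.

```shell
#!/bin/sh
# Print error "(EE)" and warning "(WW)" lines from an X server log.
# $1 = log path (defaults to /var/log/Xorg.0.log).
# The marker legend near the top of a real log also mentions both
# markers, so expect that legend line in the output as well.
xorg_problems() {
    grep -E '\((EE|WW)\)' "${1:-/var/log/Xorg.0.log}"
}
```

Running "xorg_problems" over ssh right after a failed boot, and again on a working unit, gives two short lists that can be compared side by side.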