DIGITS can't use TX2 from Host PC

Peter3D · July 14, 2017, 11:38pm

When I start the DIGITS server on the host PC (Ubuntu 16.04), it fails with CUDA error 35 (incorrect CUDA version). When I tried to follow the install instructions for the NVIDIA drivers on the host, it almost required a reformat, because it installed the graphics driver on the laptop (which has Intel graphics).

I installed using JetPack 3.0, so CUDA is installed on both host and target. The target works fine, and DIGITS on the host works as long as it doesn’t need a GPU; it fails if it does. How do I test the connection from host to target? SSH login to the target works. Thanks.

-peter
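
Editor's sketch: in the CUDA runtime API, error 35 is normally cudaErrorInsufficientDriver, meaning the installed driver is missing or older than the CUDA toolkit expects. One hedged way to see what the host actually has is to load the driver library and ask it how many devices it sees. The function name `probe_cuda_driver` and the library names tried are assumptions for illustration; this is not part of DIGITS or JetPack.

```python
import ctypes


def probe_cuda_driver(lib_names=("libcuda.so.1", "libcuda.so")):
    """Return (ok, message) describing whether a usable CUDA driver is present.

    The library names are assumptions about a typical Linux install.
    """
    for name in lib_names:
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue  # library not installed under this name, try the next one

        # cuInit(0) has to succeed before any other driver API call.
        status = lib.cuInit(0)
        if status != 0:
            return False, "%s loaded, but cuInit failed with CUresult %d" % (name, status)

        count = ctypes.c_int(0)
        status = lib.cuDeviceGetCount(ctypes.byref(count))
        if status != 0:
            return False, "cuDeviceGetCount failed with CUresult %d" % status
        if count.value == 0:
            return False, "CUDA driver loaded, but no CUDA devices are visible"
        return True, "%d CUDA device(s) visible" % count.value

    return False, "no CUDA driver library found (tried: %s)" % ", ".join(lib_names)
```

On a laptop with only Intel graphics, a result of `False` here would be expected, which matches DIGITS working until it needs a GPU.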

linuxdev · July 15, 2017, 12:15am

I can’t answer your question, but there is a common subtle point to remote connections and X11. People think of X as a graphics system, and it is, but it is also an interface to a GPU even if you don’t use it for graphics.

X has the ability to forward to another X server, and when it does that, it forwards events, not resulting computations. The GPU on the server doing the display is the one which does the processing regardless of which machine the program runs on. If you run and display entirely on the Jetson, then you can never go wrong because CUDA and driver on its GPU will always match. If you run on PC and display on PC, you are also in good shape.

If you run on Jetson and display on PC, then you found the quirk. PC’s CUDA/GPU/drivers are what the program running on Jetson will require, but most likely the program was compiled for the Jetson’s CUDA version. You could compile the program to make available both CUDA versions with some advanced effort (I haven’t done it, but there is an “advanced” setup section in the Nsight Eclipse Edition which allows this). You’d still be using the PC’s GPU though, which may not be what you want. If you use “ssh -X” or “ssh -Y”, then you can be certain X11 events are being forwarded and there is an attempt to offload GPU work from Jetson to PC.

If you run a virtual desktop on the Jetson, and display to a virtual desktop client on the PC, then you get all you wanted…it runs on the Jetson, uses the Jetson’s GPU, can work even when the Jetson doesn’t have a monitor, and can be viewed in real time on the PC. There have been several threads about virtual desktop setup, and there may even be a suitable virtual desktop server setup from the Jetson’s Ubuntu tool box…I haven’t set this up myself though. This even works to display from Windows.

Peter3D · July 15, 2017, 12:41am

Not sure that helps. The DIGITS server has to run on the host PC because it’s working with a 100 GB training image dataset. It should offload the CUDA functions to the GPU on the Jetson TX2 (my host PC doesn’t have an NVIDIA GPU), but it doesn’t know there is one connected. There is no option to select the GPU, as shown in the demo I’ve been following (the two-day demo). My understanding of how this works is that the host PC creates an SSH connection to the Jetson and transfers data over that connection, but I don’t think it’s working.

AastaLLL · July 17, 2017, 3:02am

Hi,

The steps should be:

Train your model with DIGITS on the host.
Run inference on your model with TensorRT on the device.

DIGITS can’t run inference on the device remotely.
Please copy your model to the device and run it with TensorRT.
We have sample code to demonstrate how to do this; please check:

Thanks.

snarky · July 17, 2017, 4:38am

No, it cannot do that; DIGITS tells Caffe to train, and Caffe only trains on local GPUs on the host it runs on.
(DIGITS itself is just a nice web UI to manage the input and output files and show you pictures in web pages instead of just raw data on disk.)

If you’re going to be doing this for real, I highly recommend getting a GeForce GTX 1080 Ti graphics card for your desktop computer. The GPU in the Jetson is nice for running pre-trained models, but it’s quite slow on actual training tasks. (Still faster than a bare CPU, but much slower than “real” hardware.)

If you currently can’t put a GPU into your host computer, then another option is to transfer the data to a GPU instance on Amazon Elastic Compute Cloud. You can lease instances with 1, 8, or 16 GPUs, for a buck or two per hour of training. (Beware though that Amazon will also charge you for storage and network transfer costs, and if you forget to turn off the instance once you’re done training, charges can accumulate drastically over time.)

Once you have the trained model, you can transfer it (the parameters files) to the Jetson, and use the runtime libraries/tools on the Jetson to actually run inference.
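
Editor's sketch: a minimal way to collect the parameter files into one archive before copying them to the Jetson (with scp, for example). The file patterns are assumptions about what a Caffe model trained through DIGITS usually contains, and `bundle_model` is a hypothetical helper, not a DIGITS tool; adjust both to your own job directory.

```python
import tarfile
from pathlib import Path

# Assumed file patterns for a Caffe model trained through DIGITS.
MODEL_PATTERNS = ("deploy.prototxt", "*.caffemodel", "labels.txt", "mean.binaryproto")


def bundle_model(job_dir, archive_path, patterns=MODEL_PATTERNS):
    """Pack the matching model files from job_dir into a .tar.gz archive.

    Returns the sorted list of file names that were added.
    """
    job_dir = Path(job_dir)
    # A set avoids adding the same file twice if patterns overlap.
    files = sorted({p for pat in patterns for p in job_dir.glob(pat) if p.is_file()})
    if not files:
        raise FileNotFoundError("no model files found in %s" % job_dir)

    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in files:
            # Store files flat, without the host's directory layout.
            tar.add(str(path), arcname=path.name)
    return [p.name for p in files]
```

The resulting archive can then be unpacked on the Jetson and the files handed to whatever inference runtime you use there.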

Peter3D · July 17, 2017, 4:54am

OK, now I see the training PC needs a GPU. I don’t have one and can’t get one, and I also can’t pay for Amazon cloud (corporate limits). What does it take to use the Jetson for the training? As long as it doesn’t take a few weeks, I’m good. Can I continue the two-day demo without it?

snarky · July 17, 2017, 5:15am

You can presumably download and install DIGITS on the Jetson module itself.

If the data set is too big, then you can export the data set from your host computer using NFS, and import/mount it on the Jetson as an NFS remote. There are tutorials for using NFS under Linux if you just search for them.
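
Editor's sketch: once the export is mounted, it can be worth confirming that the dataset directory really sits on the NFS mount before starting a long training job. The helper below takes the text of /proc/mounts and reports the filesystem type for a path. `filesystem_type` is a hypothetical name, the mount point in the usage lines is an assumption, and mount points containing spaces (escaped in /proc/mounts) are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains path, or None.

    mounts_text is the content of /proc/mounts, where each line reads:
    device mount_point fs_type options dump pass
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Match whole path components only, and prefer the longest mount point.
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Usage on the Jetson (the dataset path is an assumption):
# with open("/proc/mounts") as f:
#     print(filesystem_type("/mnt/data/train", f.read()))
```

A result of `nfs` or `nfs4` suggests the data is being read over the network as intended.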

Don’t expect wonders, but it’ll probably work.

AastaLLL · July 17, 2017, 5:30am

Hi,

DIGITS uses the NVCaffe backend.
NVCaffe requires cuDNN v6, which is not available for Jetson yet.
Please wait for our JetPack 3.1 release. It will be out soon.

Also, we have put some pre-trained models on GitHub: dusty-nv/jetson-inference.
You can continue with the inference part using our released models first.

AastaLLL · July 25, 2017, 1:31am

Hi,

cuDNN v6 is available now.
Please check the JetPack SDK page on NVIDIA Developer.

Peter3D · July 25, 2017, 5:03pm

OK, so now I can use DIGITS to train on the Jetson, but large datasets are going to be a problem (64 GB limit on the SD card?). We were told the Jetson provides a total image processing package that both trains and deploys image recognition, segmentation and classification. Assuming we have a dataset that can fit on the Jetson, why do you not recommend it for training?

snarky · July 25, 2017, 6:01pm

You can totally make it work, assuming you are patient.
You can plug a SATA hard drive into the Jetson TX2 developer motherboard. SATA drives can be several terabytes in size.
You can also mount data sets on NFS and use Ethernet to get it to the board. NFS servers can be approximately infinite in size.

If, at some point, you feel that your time is worth spending some money on, renting GPU instances from Amazon for training is a pretty significant value-for-money lift for those who are less patient.

If, at some point, you find that you do so much training that you pay thousands of dollars to Amazon every month, because of production training or whatever, you may find that building your own training machines from rack-mounted servers and high-end graphics cards ends up being the most cost-effective option.

It entirely depends on your personal money value of time.

Peter3D · July 28, 2017, 4:18pm

OK, I installed JetPack 3.1, got the new cuDNN, and NVCaffe is installed there too. However, DIGITS won’t run: since I don’t load an NVIDIA PCI driver, I don’t get the libnvidia-ml.so file. What now?
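
Editor's sketch: libnvidia-ml.so is the NVML library that ships with the desktop driver. A quick, hedged way to confirm whether it can be loaded at all is to try it with ctypes and treat a missing library as "NVML not available" rather than as a crash. `nvml_device_count` is a hypothetical helper and is not how DIGITS itself performs the check.

```python
import ctypes


def nvml_device_count(lib_name="libnvidia-ml.so.1"):
    """Return the number of GPUs NVML reports, or None if NVML is unusable."""
    try:
        nvml = ctypes.CDLL(lib_name)
    except OSError:
        return None  # library not installed, as reported on the Jetson

    if nvml.nvmlInit_v2() != 0:
        return None  # library present, but no working driver behind it
    try:
        count = ctypes.c_uint(0)
        if nvml.nvmlDeviceGetCount_v2(ctypes.byref(count)) != 0:
            return None
        return count.value
    finally:
        nvml.nvmlShutdown()
```

A return value of `None` on the Jetson would be consistent with the missing library described above.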

Skypuppy2 · July 30, 2017, 3:24am

You can get by with a 1050 Ti for about $130. I’m not sure if it is faster than the TX2, but it certainly has more memory available to it.

Peter3D · July 30, 2017, 5:57am

Cost for board isn’t the problem, we only have laptops and netbooks. Justifying a desktop just for this is more of a problem.

Skypuppy2 · July 30, 2017, 2:52pm

There is a method for using video cards on laptops but I don’t remember how. Suggest a web search.

Peter3D · July 30, 2017, 10:54pm

Ok, I guess I have to get a lot more specific for my case. I’m doing an embedded hardware application that requires both training and deployment on the same stand-alone platform (non-PC). So I need to train the model and deploy it on the Jetson. Is this possible?

Skypuppy2 · July 31, 2017, 2:11am

Peter, it is EXTREMELY time consuming during the learning process. If you can use a high-powered card, like the 1080 Ti, you could possibly save months, literally, over the TX2. Ask your bosses whether the tradeoff is practical for your environment. One could even use the high-powered GPU to learn at least the basic patterns and save a great deal of time in the deployed TX2 application. Just my thoughts.

AastaLLL · July 31, 2017, 2:42am

Hi,

nvidia-smi does not support Tegra.

Training:
To clarify first:

DIGITS is just the UI for tracking status.

NVCaffe is the backend framework that does the training job.

There are two approaches:

Use NVCaffe directly, with command-line based training.
If you want the DIGITS interface, please modify the DIGITS source (Python) so that it does not depend on nvidia-smi.
GitHub - NVIDIA/DIGITS: Deep Learning GPU Training System
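
Editor's sketch of the first approach: Caffe's command-line tool takes a solver definition and a GPU index, so training without DIGITS comes down to one command. The helper below only builds that command; the solver path and the `caffe` binary location are assumptions, and `caffe_train_command` is a hypothetical name.

```python
import shlex


def caffe_train_command(solver_path, gpu=0, caffe_bin="caffe"):
    """Build the argument list for command-line Caffe training."""
    return [
        caffe_bin,
        "train",
        "--solver=" + str(solver_path),  # solver .prototxt describing the training run
        "--gpu=" + str(gpu),             # index of the local GPU to train on
    ]


def as_shell_line(command):
    """Render the argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in command)


# Usage (the solver path is an assumption):
# import subprocess
# subprocess.run(caffe_train_command("models/solver.prototxt"), check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems; `as_shell_line` is only there for logging or for pasting into a terminal.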

Deploy:
It should be good.

Thanks.