Jetson TK1 MPI Cluster

I am looking into building a rather large cluster of Jetson TK1’s and I was wondering if anyone had any suggestions on how I can properly utilize the GPU/CPU in an MPI cluster?

I am pretty new to MPI in general but I do plan to learn and read as much as possible in order to reach my end goals.

Thank You

I’d check this out – [url]http://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/[/url]

Both MVAPICH2 and OpenMPI support CUDA. With MVAPICH2 you enable it at run time with the MV2_USE_CUDA environment variable:
http://mvapich.cse.ohio-state.edu/static/media/mvapich/MV2-GDR-README.txt
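
As a quick illustration of what CUDA-aware MPI buys you (a minimal sketch, assuming your MPI library was actually built with CUDA support, e.g. MVAPICH2 run with MV2_USE_CUDA=1 or OpenMPI configured --with-cuda), you can hand device pointers straight to the MPI calls and let the library do the staging:

```c
// Minimal CUDA-aware MPI sketch: rank 0 sends a device buffer directly to
// rank 1. Assumes a CUDA-enabled MPI build; with a plain build you would
// need to cudaMemcpy to a host buffer first.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    float *d_buf;                                // device memory on the Tegra GPU
    cudaMalloc(&d_buf, N * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   // device pointer
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into device memory\n", N);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```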

Also, for efficient spacing, I have a couple of Jetsons stacked using these – Pololu - Aluminum Standoff: 1" Length, 4-40 Thread, M-F (4-Pack)

Hi Pyrex

Why do you want to use the TK1 for a cluster? You should be able to get much better performance from larger GPUs at a lower $/core, with fewer constraints on synchronization between nodes, etc.

If this is a learning/student project then you should go right ahead.

I mostly see the TK1 as a low-power tool rather than a cluster building block.

The Jetson should work like any other Linux box with NVIDIA hardware when it comes to writing MPI or CUDA code.
So whatever MPI or CUDA code you have that works on desktop machines should run the same on the TK1.
The built-in build-essential toolchain with ARM GCC supports both MPI and OpenMP without any issue. Just use the same make commands to get it going as you would on a desktop Linux machine.
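
For example, a trivial hybrid hello world like this (assuming you have an MPI implementation such as OpenMPI installed from the usual packages) builds with exactly the same commands on the TK1 as on a desktop:

```c
// MPI + OpenMP sanity check; build and run the same way as on x86, e.g.
//   mpicc -fopenmp hello.c -o hello
//   mpirun -np 2 ./hello
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                         // one thread per A15 core
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```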

However at this time OpenCL is not fully released or supported on the TK1 yet.

I appreciate the feedback guys! I am mostly doing this for learning purposes and my goal is to stack 6 TK1’s in a cluster.

I am trying to get some kind of software compiled that I can use to benchmark them. I tried compiling John the Ripper with MPI and CUDA support but that is proving to be quite difficult…

@dustin_franklin - Are you running them in an MPI cluster?

I’m not using MPI yet, currently using sockets API and gigabit ethernet network to communicate between TK1’s.
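
If anyone wants the same kind of setup, the node-to-node part is just ordinary BSD sockets over the gigabit port; a bare-bones sketch (a hypothetical example rather than my actual code, with port 5000 chosen arbitrarily) looks like this:

```c
// Run "./node server" on one Jetson and "./node client <server-ip>" on another.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT 5000

int main(int argc, char **argv)
{
    if (argc >= 2 && strcmp(argv[1], "server") == 0) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(PORT);
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        listen(s, 1);
        int c = accept(s, NULL, NULL);           // wait for the other Jetson
        char buf[64];
        ssize_t n = recv(c, buf, sizeof(buf) - 1, 0);
        buf[n > 0 ? n : 0] = '\0';
        printf("received: %s\n", buf);
        close(c);
        close(s);
    } else if (argc >= 3 && strcmp(argv[1], "client") == 0) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);
        inet_pton(AF_INET, argv[2], &addr.sin_addr);
        connect(s, (struct sockaddr *)&addr, sizeof(addr));
        send(s, "hello from tk1", 14, 0);        // payload would be real data in practice
        close(s);
    }
    return 0;
}
```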

Check out the Tiled Tegra section near the bottom of this: http://devblogs.nvidia.com/parallelforall/low-power-sensing-autonomy-nvidia-jetson-tk1/

If one compares the hex-Tegra cluster to a discrete GPU of similar compute performance, the 1152-core GeForce GTX 760, there are clear power savings (170W for the discrete card vs. 50W for the Tegra cluster).

I managed to get my cluster up and running some CUDA MPI code and I am benchmarking with John the Ripper (JtR) against my own WPA2 key. Now I only need to expand my cluster to my planned size of 25.

Part of why I am designing this cluster is exactly as you stated, power efficiency and size!

I was reading the article you linked, I suddenly feel small…

I have managed to get my cluster running! I am just waiting on another order of TK1’s to come in so that I can add to the cluster.

ooo very nice :D

That’s a lot of power bricks hehe.

Do you configure each board manually? Or clone the eMMC somehow…

After I created a working set of steps I just plugged each board into the serial cable on my laptop and ran a config script. I plan to make an image later, but for now this worked. I have another 6 nodes coming in next week. I also use NFS to sync the work folder and the user bin folder.

Nice work Pyrex! I agree with Dustin that a cluster of Jetson TK1’s is perfect if you run algorithms that are suitable for cluster computing (i.e., are OK to split up between separate GPUs with separate memory). Comparing a top-of-the-line desktop GPU versus a large cluster of Raspberry Pi’s versus a small cluster of Jetson TK1’s, it looks like Jetson TK1’s are significantly better in terms of performance per dollar or per watt of electricity than the other options!

Pyrex, when you have your new cluster of Jetson TK1’s, post some info here and I can add it to the Jetson TK1 Wiki such as in the Projects section and possibly in a performance section.

Very cool (even literally).
With 4 main CPU cores each, this is a nice little workstation on its own (even if the code cannot utilize the raw GPU power).

Still moving along. I am going to be digging into the OS to find anything I can disable or remove in order to boost performance. As it is, I just managed to get OpenMP working without a hitch by disabling the CPU power-saving features and manually enabling all the processor cores. I then managed to utilize OpenMP + MPI + CUDA. More to come as I continue to improve on the code.
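
In case it helps anyone trying the same thing, the overall shape of the hybrid code is roughly this (a simplified sketch rather than my actual benchmark; the kernel and sizes are just placeholders):

```c
// One MPI rank per Jetson, OpenMP across the four A15 cores for host work,
// and a CUDA kernel for the GPU portion. Rough build idea:
//   nvcc -Xcompiler -fopenmp -I$MPI_INCLUDE -L$MPI_LIB -lmpi hybrid.cu -o hybrid
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;                       // this node's share of the work
    float *h = (float *)malloc(N * sizeof(float));

    #pragma omp parallel for                     // CPU part: use all four cores
    for (int i = 0; i < N; i++)
        h[i] = (float)(rank + i);

    float *d;                                    // GPU part: offload to the Tegra GPU
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(N + 255) / 256, 256>>>(d, N, 2.0f);
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);

    double local = h[0], total = 0.0;            // MPI part: combine across the cluster
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("combined result from %d nodes: %f\n", size, total);

    cudaFree(d);
    free(h);
    MPI_Finalize();
    return 0;
}
```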

Also, if someone has some ideas on building a PSU I’d appreciate any feedback! I am running out of plugs…

I believe if you aren’t plugging in SATA drives or PCIe cards then the Jetson needs about 1.5A max of 12V power.
So if you have an ATX PSU with 15A+ 12V available you could just build a plug adapter.
Or just go order a 20A or higher 12V PSU online similar to this.

I would get more capacity than you think you need, so 20A or more is what I would go for.

Also as a benefit, a 20A 12V PSU could help you run SLI with a cheaper PSU once you are done working with the Jetsons.
You should check the power measurements that were posted on this forum before ordering, to make sure my estimate of the maximum draw is right, since you are planning on running at maximum draw here. The Jetson ships with a 5A supply, so you might need to go higher for 10 of them.
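
To put rough numbers on it: 10 boards at ~1.5A each is about 15A on the 12V rail, i.e. roughly 180W, so a 20A (240W) supply leaves decent headroom. Sizing from the 5A bricks instead would suggest 50A / 600W, which is why it is worth checking the measured draw before buying.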

I’d use Bill’s PSU with these barrel connectors (Pololu - DC Barrel Plug to 2-Pin Terminal Block Adapter); this should help keep the wiring pretty straightforward. Also, now that you’ll only have one AC plug, perhaps one of these (or similar) would be useful: [url]http://www.p3international.com/products/p4400.html[/url]

Why not power it through the Molex connector? I have seen references stating you can power the board using the onboard 4-pin PC IDE (Molex) connector. I only see one on the board, and I presume the 12V rail doesn’t care where it gets power from.

This way you use a commercial off-the-shelf PC power supply. If you buy a 1300W supply you should be able to avoid the barrel connectors altogether and just run a bunch of Molex connectors. That supply would support 26 TK1’s. Nice, compact solution that is more energy efficient than 26 wall warts.
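
Rough math: 1300W spread across 26 boards is about 50W each, which is well above the ~18W (12V × 1.5A) worst case quoted above, so the supply itself wouldn’t be the limiting factor.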

Doc

Answered in the other thread… don’t do it. There are MOSFET and diode active components between the Molex connector and the main 12V input.

Thanks… I need to change my settings here because I was just going to do that.

Appreciate the reply.

bill

Not quite as awesome as Pyrex’s cluster - but it’s a start.

I have jumbo frames enabled and have the cluster on a 2nd subnet, using a Linux box as a Debian apt repo with any packages I’ve had to compile for the Jetson.

I’m attempting to get a kernel compiled that will support LXC and Docker (though ARM support is pretty bad right now).

I’ll put up some info on steps I’ve done so far shortly. It is using a modified Grinch Kernel with some additional modules / things enabled and recompiled.

Have attempted to get PyNN running but am having issues with NEURON + iv compilation, plus other weird issues where folks assume everything is x86 with SSE support coded into the source.

If possible I would like to set up a PCIe-based network - as mentioned in the NVIDIA OpenCV article, a 32Gb connection between each Jetson - but it’s still really hard to find cheap PCIe networking hardware.

The x86 Linux host is also running a Puppet master. One thing that would additionally help when adding new nodes is the ability to create or alter hostnames when flashing new boards. That way Puppet could just configure each new board as it is added and give a unique but similar config to all newly added boards.

Lastly - if I could create a memcache or shared DDR3/4-based service that was also available to these hosts via the PCIe bus, I believe I could do a bunch more cool things. As it stands this is a great platform for learning, automating, and understanding parallel programming.

Forgot to add I am putting the cluster into an Ikea Helmer next month when I get some more free time :)