At my workplace we have just built a DevBox with 4 Titan X GPUs. Several of us will be using this machine, and we are wondering about the best way to share access to the GPUs and schedule jobs.
Since we are running Mesos+Marathon on the cluster where we will deploy the machine, I guess we could also share access to the DevBox’s GPUs via Mesos. From [1] and [2] I understand that the newest version of Mesos now has built-in support for Nvidia GPUs. However, I cannot find anything concrete about this in the Mesos documentation, and [2] seems to suggest it applies only to Tesla GPUs. So… has anyone successfully used Mesos as a job scheduler with a DevBox? Would you recommend it?
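For what it’s worth, my current understanding (from [1] and the Mesos 1.0 release notes, not from having tried it) is that GPU support is enabled per agent via the Nvidia GPU isolator, and that frameworks must explicitly opt in with the GPU_RESOURCES capability before they receive offers containing `gpus`. A rough sketch of what the agent invocation might look like on the DevBox — `<master-ip>` is a placeholder, and the exact flag set is an assumption on my part:

```
# Sketch only: assumes Mesos >= 1.0 built with Nvidia GPU support.
# The gpu/nvidia isolator requires cgroups/devices to be listed as well.
mesos-agent \
  --master=<master-ip>:5050 \
  --isolation="filesystem/linux,cgroups/devices,gpu/nvidia" \
  --resources="gpus:4;cpus:8;mem:65536"
```

If that is right, the catch for us would be that the scheduler sitting on top (Marathon, in our case) also has to advertise the GPU_RESOURCES capability, otherwise it will simply never see the GPUs.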
We were also thinking of running our jobs inside Docker containers. I have seen that NVIDIA provides utilities to build and run NVIDIA Docker images [3]. Would I need to change anything in Marathon to launch these Docker images?
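From what I have read (and this may be wrong or version-dependent), newer Marathon releases accept a `gpus` field in the app definition, but only when the task runs under the Mesos containerizer (the "universal container runtime") rather than the Docker containerizer. Something along these lines is what I would expect to POST to Marathon — the app id and image are just examples:

```json
{
  "id": "/gpu-smoke-test",
  "cmd": "nvidia-smi",
  "cpus": 1,
  "mem": 1024,
  "gpus": 1,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "nvidia/cuda"
    }
  }
}
```

For quick manual testing outside Marathon, `nvidia-docker run --rm nvidia/cuda nvidia-smi` from [3] should confirm that containers can see the GPUs at all. Can anyone confirm whether the Marathon side actually works like this?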
I have also seen that Nvidia recommends several cluster management tools. Of those, Slurm looks quite good and is open source, so I wonder whether it would be easier and/or better to use Slurm instead of Mesos+Marathon (e.g., in terms of scheduling options). Any experience?
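The appeal of Slurm for a single shared box, as far as I can tell, is that GPU sharing comes down to declaring the four cards as a generic resource (gres) and then having each user request GPUs per job. A minimal batch script sketch, assuming the GPUs have been registered in `gres.conf` and `slurm.conf` (and with `train.py` standing in for whatever a user actually runs):

```
#!/bin/bash
# Sketch of a Slurm batch script requesting 2 of the DevBox's 4 GPUs.
# Assumes gres.conf declares the Titan X devices, e.g. lines like:
#   Name=gpu File=/dev/nvidia0
#SBATCH --job-name=train
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G

# Slurm restricts the job to its allocated GPUs via CUDA_VISIBLE_DEVICES.
srun python train.py
```

Users would submit with `sbatch train.sh` and Slurm would queue jobs until GPUs free up, which sounds like exactly the "painless sharing" we want — but I have no first-hand experience with it, hence the question.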
Any other options that would enable a small team to share the DevBox’s GPUs in an effective and painless way?
Cheers,
Humberto
[1] We're working with NVIDIA to bring GPUs and deep learning to the DCOS | D2iQ
[2] http://www.nvidia.com/object/apache-mesos
[3] GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs