At my workplace we have just built a DevBox with 4 Titan X GPUs. Several of us will be using this machine, and we are wondering about the best way to share access to the GPUs and schedule jobs.
Since we are running Mesos+Marathon on the cluster where we will deploy the machine, I guess we could also share access to the DevBox’s GPUs via Mesos. From [1] and [2] I understand that the newest version of Mesos now has built-in support for Nvidia GPUs. However, I cannot find anything concrete about this in the Mesos documentation, and [2] seems to suggest it applies only to Tesla GPUs. So… has anyone successfully used Mesos as a job scheduler with a DevBox? Would you recommend it?
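For what it’s worth, my current understanding (from [1] and the Mesos 1.0 release notes, not from having tried it) is that GPU support is enabled per agent via the Nvidia GPU isolator, and that frameworks must explicitly opt in with the GPU_RESOURCES capability before they receive offers containing `gpus`. A rough sketch of what the agent invocation might look like on the DevBox — `<master-ip>` is a placeholder, and the exact flag set is an assumption on my part:

```
# Sketch only: assumes Mesos >= 1.0 built with Nvidia GPU support.
# The gpu/nvidia isolator requires cgroups/devices to be listed as well.
mesos-agent \
  --master=<master-ip>:5050 \
  --isolation="filesystem/linux,cgroups/devices,gpu/nvidia" \
  --resources="gpus:4;cpus:8;mem:65536"
```

If that is right, the catch for us would be that the scheduler sitting on top (Marathon, in our case) also has to advertise the GPU_RESOURCES capability, otherwise it will simply never see the GPUs.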
We were also thinking of running our jobs inside Docker containers. I have seen that NVIDIA provides utilities to build and run NVIDIA Docker images [3]. Would I need to change anything in Marathon to launch these Docker images?
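From what I have read (and this may be wrong or version-dependent), newer Marathon releases accept a `gpus` field in the app definition, but only when the task runs under the Mesos containerizer (the "universal container runtime") rather than the Docker containerizer. Something along these lines is what I would expect to POST to Marathon — the app id and image are just examples:

```json
{
  "id": "/gpu-smoke-test",
  "cmd": "nvidia-smi",
  "cpus": 1,
  "mem": 1024,
  "gpus": 1,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "nvidia/cuda"
    }
  }
}
```

For quick manual testing outside Marathon, `nvidia-docker run --rm nvidia/cuda nvidia-smi` from [3] should confirm that containers can see the GPUs at all. Can anyone confirm whether the Marathon side actually works like this?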
I have also seen that Nvidia recommends several cluster management tools. Of those, Slurm looks quite good and is open source, so I wonder whether it would be easier and/or better to use Slurm instead of Mesos+Marathon (e.g., in terms of scheduling options). Any experience?
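The appeal of Slurm for a single shared box, as far as I can tell, is that GPU sharing comes down to declaring the four cards as a generic resource (gres) and then having each user request GPUs per job. A minimal batch script sketch, assuming the GPUs have been registered in `gres.conf` and `slurm.conf` (and with `train.py` standing in for whatever a user actually runs):

```
#!/bin/bash
# Sketch of a Slurm batch script requesting 2 of the DevBox's 4 GPUs.
# Assumes gres.conf declares the Titan X devices, e.g. lines like:
#   Name=gpu File=/dev/nvidia0
#SBATCH --job-name=train
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G

# Slurm restricts the job to its allocated GPUs via CUDA_VISIBLE_DEVICES.
srun python train.py
```

Users would submit with `sbatch train.sh` and Slurm would queue jobs until GPUs free up, which sounds like exactly the "painless sharing" we want — but I have no first-hand experience with it, hence the question.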
Any other options that would enable a small team to share the DevBox’s GPUs in an effective and painless way?
Cheers,
Humberto
[1] We're working with NVIDIA to bring GPUs and deep learning to the DCOS | D2iQ
[2] http://www.nvidia.com/object/apache-mesos
[3] GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs