What is the Least Awful Linux Distro for CUDA development? Is it possible that Windows, or anything, could be worse than Ubuntu?

Since November, I have probably spent 500+ hours (call it 80 hours at a stretch, six times over) getting Ubuntu 16.04 running on three machines, each with multiple NVIDIA GPUs. This is perfectly understandable to go through once… well, no it isn’t. But let’s pretend it is. Especially if two of those three machines have different mobos.

But this is ridiculous. Many of the problems I have pursued were caused by, for example, shutting down the machine and then … turning it back on.

Sometimes the problem is caused by … wait for it … NVIDIA changing the device driver. By a point release. Which they pretty much have to do every time a AAA title comes out. According to the Linux community this is because NVIDIA doesn’t publish their source code on GitHub. Except no software falls as flat as open-source community drivers for graphics and sound cards. And for some reason, all these point releases that make my Ubuntu “stable” release boot to a black screen don’t seem to have such a traumatic effect on my Windows 10 machine.

Back in late January I tried booting up a new machine with a couple of 1080Ti’s … after about 18 hours of pure futility it turned out that Ubuntu had pushed a new Linux kernel that didn’t work with anything, and NVIDIA had to push a beta driver that could deal with it.

Tonight, NVIDIA drivers aside, I’ve been up all night trying to make my cursor be not-invisible. It was not-invisible when I booted two days ago, but it’s not not-invisible now. I don’t recall anything this bad when Windows Vista came out, and that set a new standard for bad.

So, I can see that at some time it might have made sense to make CUDA development Linux-Mandatory. But now it’s just tormenting people who want to do Deep Learning or HPC. Use Containers you say? You mean, like Docker-CE containers? Like the ones that stopped working in January and still basically don’t?

So, aside from the obvious question of whether CUDA development for DL and HPC can be migrated to Windows - which seems to be improving rather than deteriorating, unlike Linux - is there a Linux distro that doesn’t suffer from the capricious instability of Ubuntu?

Seriously, I really want a powerful gang of SIMD GPUs to work on. But truth be told, the capital spent acquiring four 1080Tis, four Titan Xs, and a Titan V should have bought me a LOT of GFLOPS of research computing. The reality is that the same money, spent on high-end Intel CPUs running MKL, would have delivered 10x - if not 100x - more effective GFLOPS, simply because the time it takes to write code that actually soaks up the GPU FLOPS is enormously higher. Instead, those GPUs have sat idle most of the time while I futz with an OS whose developers seem dead set on preventing GPUs from being used unless their drivers are written by unpaid volunteers.

So, any recommendations? From here it looks like I’ve wasted about $15,000 on hardware that is effectively not supported for HPC on a working OS. Worse, I have wasted many times that in the opportunity cost of other things I could have been doing. I have wasted time and money relative to what I could have made working as a manager at McDonald’s. And I say that as someone with many years’ experience in software (Windows, Unix, Mac), having built my own computers for 15+ years, and holding advanced degrees in the computing field.

Help. Or kill me. Just make the torture stop.

  • B. Student

My observation is that 99.9% of reported issues with CUDA on Linux appear to occur on … Ubuntu. In my perception, there are two root causes for this: the distro (1) relentlessly chases the latest bleeding-edge components and may even upgrade them automatically, and (2) often does things somewhat differently from other Linux distros (the “Think Differently” faction of the Linux world, so to speak :-).

For a stable, robust Linux environment, I would suggest taking a look at RHEL (or CentOS, which is supposedly equivalent, but I haven’t used it). The potential downside is that you probably won’t get to play with the latest, bleeding-edge features. So if you absolutely need the latest C++17 features, RHEL probably is not the right platform for you.
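
If you do try it, the broad strokes of a CUDA setup on RHEL/CentOS 7 look roughly like the sketch below. Treat the package and file names as illustrative only and follow NVIDIA’s Linux installation guide for the exact steps on your release:

    # Prerequisites for building the NVIDIA kernel module (package names are examples)
    sudo yum install gcc kernel-devel-$(uname -r) kernel-headers-$(uname -r)
    # Runfile installer downloaded from developer.nvidia.com (file name is a placeholder)
    sudo sh cuda_<version>_linux.run
    # Make the toolkit visible afterwards (default install prefix shown)
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH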

Windows as a CUDA development platform has its own set of challenges (see the “why can I only use 80% of my GPU memory on Windows 10” thread in these forums as one example), and as someone who is generally OS agnostic and has about equal mileage developing with CUDA on Linux and Windows, I consider Linux generally to be the more productive environment. I sometimes refer to the “Windows Tax” because of that.

Note that some third-party GPU-accelerated software may not be available or supported on Windows, so proceed with caution if you plan to go in that direction.

Turn off all auto-updating by Ubuntu.
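
On 16.04 that boils down to something like this (package and driver names are examples; adjust to what dpkg shows on your box):

    # Stop and disable unattended upgrades
    sudo systemctl stop unattended-upgrades
    sudo systemctl disable unattended-upgrades
    # Or set both periodic options to "0" in /etc/apt/apt.conf.d/20auto-upgrades:
    #   APT::Periodic::Update-Package-Lists "0";
    #   APT::Periodic::Unattended-Upgrade "0";
    # And pin the NVIDIA driver so a point release can't swap it out underneath you
    dpkg -l | grep nvidia             # find the installed driver package
    sudo apt-mark hold nvidia-384     # package name is an example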


NJ and Txbob, you bring up a good point; I am NOT big on upgrades that lack a solid motive:

My personal motto is “never use version 1.0 of anything,” which I’m willing to break for a new piece of HW like the Titan V, but I generally opt for “the previous stable release” of any dev tool or OS.

My Use Case is that I need systems that do math quickly and correctly:

I’m basically an applied researcher in a specific application domain - I am building systems that run live with real money at stake. My job is to make algorithms with unique capabilities and get them to run really, really fast. Usually, anything bleeding-edge introduces unacceptable risk. I am not huge on Deep Learning frameworks for the most part, as many of them sacrifice performance for ease-of-use or force the user into similar tradeoffs. The main thing (some) have to offer is automatic differentiation, and even then many of the associated AD algorithms are tilted toward braindead optimization methods such as steepest descent rather than Newton/Quasi-Newton search directions.

I’ve done a lot of dev on Windows in recent years because it’s the standard in my industry, but from the late ’80s to the mid ’90s in grad school it was mostly Unix on Sun / PDP / VAX (and early SIMD systems such as the Connection Machine CM-1 and CM-2 - so I don’t EVER complain about NVIDIA dev tools …).

But over the past year I’ve made the dive into Linux specifically to get the most out of NVIDIA HW, so…

Glad to see you have a better grasp on The Queen’s English than Apple :-) …

So, NJ, do you use RHEL regularly?
Definitely sounds like I should give it a shot.

First thing I always do, sir. The problem is when a new card comes out (e.g. when I put a Titan V into one of my boxes) or when I build a new box with a significantly different config from the others. I also run into problems like moving the computer from one area to another and then re-seating cards and cables to make sure nothing came loose. That’s usually good for a black screen. Or like last night, when my cursor stopped being not-invisible with no apparent cause. I never discovered the cause, but after trying 10-20 other things, I fixed it by changing the DM from lightdm to gdm (which did not do the trick by itself) and THEN installing gnome-tweaks (note that I did not have to actually run gnome-tweaks; the act of installing it restored the cursor).
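
For the record, the sequence that brought the cursor back went roughly like this (Ubuntu 16.04; the tweak tool’s package name differs between releases):

    # Switch the display manager from lightdm to gdm
    sudo apt-get install gdm3
    sudo dpkg-reconfigure gdm3              # select gdm3 at the prompt
    # That alone didn't fix it; installing the tweak tool (and whatever it pulled in) did
    sudo apt-get install gnome-tweak-tool   # packaged as gnome-tweaks on newer releases
    sudo reboot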

Txbob, while you probably cannot make recommendations, do you have any off-the-cuff empirical statistics regarding the ratio:
(Rate of reported CUDA issues w/distro Z)/(# of CUDA installs on distro Z)

??

  • B. Student

I did CUDA development on RHEL for about a decade but am in a “Windows phase” right now. I never had a single issue with RHEL, with or without CUDA. As I said, the one major drawback of using a conservative distro like that is missing out on the latest C++ language features.

I am a conservative upgrader and continue to use Windows 7 Professional. Windows 10 has some of the same automated-update issues that you are experiencing with Ubuntu, or so I read. I often have long-running optimization jobs that run for hundreds of CPU hours, and I do not want that work to be interrupted by forced upgrades. I apply OS patches manually about once a month, and only the indispensable security-relevant ones.

I continue to use CUDA 8 which is fully sufficient for my Pascal-family GPU, and my host compiler is MSVS 2010. It may be deprecated and unsupported on the CUDA side by now, but it still seems to work just fine.
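
For what it’s worth, pinning nvcc to a particular host compiler is just a matter of pointing it there explicitly; the flags below are examples for a Pascal-class card (sm_61), and the path placeholder stands for whatever your cl.exe or gcc happens to be:

    # Build with an explicitly chosen host compiler, targeting consumer Pascal
    nvcc -ccbin <path-to-host-compiler> -arch=sm_61 -o saxpy saxpy.cu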

BTW, you wouldn’t happen to work for Guinness, like this gentleman: [url]https://en.wikipedia.org/wiki/William_Sealy_Gosset[/url]? :-)

  1. Conservative? I’m still a bit leery of this “++” thing they’re trying to push on the C language.
  2. Good for you, sad for everyone else: you’re the first person I’ve encountered in a few years to get the ref, or at least to say so. I develop statistical methods (which, when done in a way that requires computing, may be referred to as Machine Learning), and though I haven’t ever worked for Guinness, I’ve worked for employers where publishing was almost-forbidden, and any public commentary on social media or forums by a person openly affiliated with the firm was forbidden unless subject to extensive review. The publishing part is just bureaucracy in action; the social media part makes perfect sense in context. Hence the moniker. I like to say the B stands for Bourbakis. And I belong to the Cat Fanciers’ Association.

[nitpick] I think it is Bourbaki, without the trailing ‘s’. [/nitpick] A pseudonym used by a group of French mathematicians, as I recall. I assume those unnamed employers of yours mostly had names consisting of three letters :-)

I have been programming since bits were still carved by hand (OK, a tad later: 1981), so I am a bit wary of that new-fangled “++” addition to C myself. It would probably be fair to say that I have remained a C++-light programmer to this day. In the past I also worked with embedded systems, back when C was still king in that space due to its low overhead and more predictable performance characteristics.

I found your comment about re-seating cards after moving your equipment a bit intriguing. As long as your GPUs are secured at the bracket (screw, clamp-down bar, etc.) and the little plastic tab at the PCIe slot is engaged, there should be nothing to worry about, unless the machines are in an environment where constant vibrations shake the components apart. The PCIe power connectors should have tiny hooks that engage little tabs on the GPU side with an audible click. I have learned from experience that all these connectors are apparently designed for a surprisingly small number of plugging cycles, so once all components are firmly plugged in during initial installation it is best to leave them alone.

The most frequently reported hardware stability issues in these forums seem to originate from insufficient power supplied to the GPUs.

Yes, Bourbakis is the possessive, with the apostrophe omitted.
Mostly I re-seat cards due to fat-finger incidents, 98% of which involve the card in the bottom-most PCI-E slot, as a result of accessing a switch or header along the bottom row of the mobo.
As for power, I’m running Corsair AX1500i units with eight individual PSU PCIe sockets feeding the eight individual GPU power sockets (1080Tis and Titan Xps), respectively. But those machines aren’t really the ones giving me trouble.
Now that you mention it, the machine that I am having the most trouble with has a 1600W EVGA psu which is likewise hooked firmly to the “big” GPUs, but I am trying to run the graphics off of a stubby half-depth EVGA 1050Ti which doesn’t have external power connectors. And it’s the one machine running on a gaming mobo. A Gigabyte gaming mobo (that is to say, the manufacturer is that one computer component company who never got the memo about Moore’s law). And the 1050Ti has shown enough instability that I tried swapping it and got different results. It does seem plausible that the stubby card could be power-starved…

The PCIe slot is specified to deliver up to 75W. Most NVIDIA GPUs are engineered conservatively and limit power consumption through the PCIe slot to around 45W or so. But the GTX 1050 (like the Quadro P2000, which I currently use) is designed for a power consumption of up to 75W, with no additional power connector.

That’s not what I would call a conservative design, and something I’d normally stay away from. But the part looked too appealing to me so I risked it. I haven’t encountered any issues so far with actual power consumption (as reported by nvidia-smi) up to 70W, but it may work out slightly differently with different motherboards. I don’t use any “gaming” components.
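
In case it is useful, this is the sort of query I keep running while a job executes, to see how close the card gets to its board power limit (field names per current nvidia-smi versions):

    # Sample power draw against the board power limit once per second
    nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv -l 1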

I assume you are running dual Corsair AX1500i units for the eight GPUs? The rule of thumb for 100% stable operation is that the total nominal wattage of all system components should not exceed 60% of the nominal PSU wattage. GPUs in particular (but also CPUs) are prone to short-duration power spikes under rapidly changing loads, so the PSU needs adequate reserves to prevent local brown-outs. Also, PSU efficiency is typically best in the 25% to 60% load range, assuming 115V operation.
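
To put a number on it: for a single 1500 W unit that rule of thumb works out to a budget of roughly

    # 60% of the PSU's nominal wattage, as an illustration
    echo $(( 1500 * 60 / 100 ))   # -> 900 (watts)

for the combined nominal wattage of GPUs, CPU, drives, and fans.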

80 PLUS Platinum rated PSUs are what I recommend for high-end workstations, and 80 PLUS Titanium for servers, due to the higher efficiency (we pay around $0.20/kWh in California for residential rates, so with 24/7 operation higher efficiency becomes noticeable in the electricity bill) and the use of higher-quality components. PSUs are the system components that most frequently fail in my experience, followed by sticks of DRAM. I recognize Corsair and EVGA as quality brands.

I haven’t had trouble with the little 1050Ti before, but I’ll believe anything at this point.
Anyhow, I’m writing this from RHEL Workstation 7.4, on the same machine that was getting a black screen before. I agree that drawing 75W on a 75W bus is not what I’d call a good design margin.

LoL, I’m running dual AX1500i units for 8 GPUs in the sense that they are two separate computers, each with one AX1500i and four GPUs :-)

As for gaming components, guess which of my computers came from a vendor?

Actually, to be fair to the vendor - whom I find to be knowledgeable and reputable - this is an X299 machine and they put me on the mobo that was the least bad at the time.

My other boxes (4x GPU) are totally self-built on ASUS X99-E-10G WS boards, which AFAIK are the only boards of that generation that offer four PCIe 3.0 slots running at full x16.
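
(For what it’s worth, a quick way to confirm each card is actually training at x16 rather than dropping back to something narrower - field names per recent nvidia-smi versions:)

    # Report current vs. maximum PCIe link width per GPU
    nvidia-smi --query-gpu=index,name,pcie.link.width.current,pcie.link.width.max --format=csv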

So far there is no X299 board outside of vaporware-land that supports four GPUs running at full x16. The ASUS X299 WS SAGE does so in theory, but it was announced in early November, and there are a few douchebags out there who will take preorders and charge your card immediately - yet no actual boards have been seen in the wild. All the reviews I’ve seen appear to be written from spec sheets.

ASUS does have another X299 WS board that would theoretically be better than the board I have now, but I’m hoping some actual appropriate mobos emerge from the aether soon.

If you know a better alternative I’d love to hear about it!

WAT

I stopped building my personal machines from components 15 years ago and started buying from Dell, and I have stuck with them ever since. I use only Xeon processors and Quadro GPUs in these systems. I simply want my systems to work out of the box and keep working. I doubt anything would compel me to go back to assembling my own systems from best-of-breed components.

I used a lot of ASUS components for my custom builds, as that seemed like the high-quality brand at the time and gave me the fewest problems. But past performance is not always indicative of future results, and 15 years is a very long time in the computer world.

I have acquired gaming GPUs for work before and am partial to EVGA when it comes to those, as I have never had an issue with their products. But I would generally recommend staying away from insanely overclocked parts for compute tasks. Instead, stick with GPUs running at NVIDIA-designated stock frequencies or, at most, mildly overclocked ones.

While the GPU vendors may have thoroughly validated their factory-overclocked parts for the gaming market, I am not at all sure they apply the same level of diligence to compute applications. And from my CPU overclocking days I know that overclocking processors can result in subtly incorrect computations that may go unnoticed for quite a while and are difficult to track down when they happen. Too much unnecessary hassle for me.

Yeah, with the exception of the stubby 1050Ti, I only buy NVIDIA GPUs.
My story was the same as yours, except I used a vendor other than Dell, until last November.
I became convinced (empirically) that with new-generation multi-GPU systems I’d need to go with specialty builders who charged big markups.
Now I have learned why they charge big markups.
Interesting that you use Dell. I assume their server-class / corporate offerings must be higher quality than their consumer offerings in that case…

Yeah, I don’t buy their consumer offerings.

If I wanted an innovative high-end GPU-accelerated platform and had the money to spend, I would love to take a Power9 system with a Volta GPU for a spin. I have always considered PowerPC to be an interesting architecture, but only once had a chance to use it, way back around the year 2000: first an IBM PowerPC 750, then a Motorola MPC 7400 (?) and no, this wasn’t on Apple platforms :-)

LoL, I’ve had enough learning for now (RHEL 7 working great btw).
I like to say that the POWER9 is very interesting, but I’ll probably wait until Intel and AMD leak plans for prototype chips …