GPU hard reset
Hi,

We're having trouble getting our TYAN S7015 based 8 x GTX 580 GPU box to run reliably.

Box specs:

http://www.tyan.com/product_SKU_spec.aspx?ProductType=BB&pid=412&SKU=600000188


After a fresh reboot, our CUDA regression tests usually run fine for quite a while. However, after a random period of time (a day or so), some of the cards start getting NVRM Xid errors. From dmesg:

NVRM: Xid (0084:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000

After this, the 0084:00 GPU goes into a quite dodgy state: kernels randomly fail with "launch failed" errors and generally produce incorrect results. It seems like the Xid error caused some GPU memory corruption that the card is unable to recover from. Reloading the nvidia kernel module doesn't help. Even after a soft reboot, the failing card often still behaves badly. Only power cycling the whole server seems to restore the GPU to a good state.

My first question is: is there any way to trigger a GPU "hard reset" from software, without power cycling the server, in order to work around this kind of error? Something like nvidia-smi --hard-reset -g 0 would be really useful. Is it even possible with current hardware?
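One thing I've been considering trying (completely untested on this box, and the bus address below is just an example) is detaching the failing card from the PCI bus and re-probing it via sysfs, in case that forces a re-initialization:

```shell
#!/bin/sh
# Untested sketch: detach a GPU from the PCI bus and re-probe it via sysfs.
# The BDF address is an example; substitute the failing card's (from lspci).
# Would need to run as root, with the nvidia module unloaded first.
reset_gpu() {
    sysfs_root="$1"   # normally /sys/bus/pci
    bdf="$2"          # e.g. 0000:84:00.0
    dev="$sysfs_root/devices/$bdf"
    [ -e "$dev" ] || { echo "no such device: $bdf" >&2; return 1; }
    echo 1 > "$dev/remove"         # detach the device from the bus
    sleep 1
    echo 1 > "$sysfs_root/rescan"  # re-enumerate; the card should reappear
}
```

No idea whether this actually clears whatever state the card is stuck in, or whether the driver copes with the device disappearing, so treat it as an experiment rather than a fix.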

My second question is: does anyone else out there have experiences to share from trying to run CUDA under Linux on an 8 x GTX 5x0 GPU server similar to ours? Does it work for you, or are you running into similar problems? Any suggestions would be greatly appreciated.

nvidia-bug-report.log.gz is attached. Btw, we're loading the kernel module with NVreg_EnableMSI=1, which seems to improve system stability.
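In case it helps anyone reproducing our setup, persisting that module option looks like this (the exact file path varies by distro):

```shell
# /etc/modprobe.d/nvidia.conf  (path varies by distro)
# Enable MSI interrupts for the nvidia kernel module at load time.
options nvidia NVreg_EnableMSI=1
```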

Thanks,

Lars

#1
Posted 03/03/2011 03:41 AM   
Hi Lars,

sorry, but this is just a "me too".

We have the same board, equipped with 8 GTX 460 GPUs (these use less power than the "official" Tesla configuration).

Some of the devices seem to work quite reliably, others show the exact same error messages as yours do.

In addition to your messages, I get this:
NVRM: Xid (0011:00): 31, Ch 00000001, engmask 000000081, intr 10000000

I tested with the pre-release of CUDA 4.0 as well, but I see the same problem there, too.

Does the SDK's simpleMultiGPU test program work correctly on your box?

Other test programs like bandwidthTest seem to work OK on all 8 devices (command line switch -device=[0-7] used).
matrixMul runs fine on devices 0, 6, and 7, but not on the others.
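In case it's useful to anyone, this is the kind of loop I use to cycle a sample over all eight devices and summarize pass/fail (it assumes the binary accepts the SDK's -device=N switch and exits non-zero on failure):

```shell
#!/bin/sh
# Run an SDK sample once per device and report pass/fail.
# $1 is the test binary (e.g. ./matrixMul); assumed to accept a
# -device=N flag and to exit non-zero when the run fails.
run_on_all() {
    bin="$1"
    for d in 0 1 2 3 4 5 6 7; do
        if "$bin" -device=$d > /dev/null 2>&1; then
            echo "device $d: PASS"
        else
            echo "device $d: FAIL"
        fi
    done
}
```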

This is really strange. The box runs openSUSE 11.2 (64-bit).

If it would help the folks at NVIDIA, I could temporarily provide an ssh account on the box.

Regards
Martin

#2
Posted 03/07/2011 10:58 AM   
Maybe there are some power instability problems with the server power supply?
Also, try exchanging the GPUs' slots to see if the error is really bound to particular GPUs.

Mikhail

#3
Posted 03/07/2011 09:15 PM   
Hi Martin,

The multi-GPU SDK samples seem to work fine on our box, while simple SDK samples such as scan and reduction fail on several cards...

I also get the same Xid messages as you:

NVRM: Xid (0011:00): 31, Ch 00000001, engmask 000000081, intr 10000000

I'm currently trying to gather more information in order to file a better bug report. I've tried to enable more detailed logging using the nvidia ResmanDebugLevel parameter, but except for the Xid error message, the output from a successful run on card 0 is almost identical to a failing run on card 1. Any suggestions on how to gather more useful information would be very welcome.
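For correlating failures across cards, the most useful thing I've found so far is just tallying Xid lines per device from the kernel log. A small helper (reads log text on stdin, e.g. piped from dmesg):

```shell
#!/bin/sh
# Summarize NVRM Xid errors per device from kernel log text on stdin:
# extracts "Xid (<bus-id>): <code>" and counts occurrences of each.
xid_summary() {
    grep -o 'Xid ([0-9a-f:]*): [0-9]*' | sort | uniq -c
}
```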

As mentioned before, all the cards seem to work fine for a while after power cycling the server. After a while, though, cards start failing and won't work properly again until a power cycle is performed (a simple reboot doesn't seem to be enough).

I would also be happy to provide NVIDIA with ssh access to this machine if that would be helpful.

/Lars

#4
Posted 03/10/2011 03:44 AM   
[quote name='MikhailK' date='07 March 2011 - 01:15 PM' timestamp='1299532510' post='1203829']
May be there are some power instability problems w/server power supply ?
Also try to exchange the GPUs places to see if the error is really "bonded" w/particular GPUs.

Mikhail
[/quote]

Mikhail,

The box is using 2 redundant 2400W power supplies, so the amount of power they can provide should be enough. We've measured the all-time maximum power consumption of the box under full CPU+GPU+HD load to be about 2100W.

The _stability_ of the power is maybe another question... not sure how to debug that? Also, if it were a power stability issue, I would assume the GPU errors would be evenly and randomly distributed between cards. As it stands, cards 0, 3 and 5 seem to be working OK while the others usually fail after a while.

We'll experiment more with exchanging GPUs once we get local access to a similar server.

/L

#5
Posted 03/10/2011 03:52 AM   
[quote name='lars' date='10 March 2011 - 03:52 AM' timestamp='1299729150' post='1205171']
Mikhail,

The box is using 2 redundant 2400W power supplies, so the amount of power they can provide should be enough... We've measured the all-time maximum power consumption of the box under full CPU+GPU+HD load to be about 2100W.

The _stability_ of power is maybe another question... Not sure how to debug that? Also, if it would be a power stability issue, I assume that GPU errors would be likely to be evenly and randomly distributed between cards? As it seems now, cards 0, 3 and 5 seems to be working ok while the others usually fail after a while.

Will try to experiment more with exchanging GPU's once we get local access to a similar server.

/L
[/quote]
This is a "me too" post as well. I've found the only way to fix this problem is to UNPLUG both power supplies, wait a few seconds, then boot back up. I *know* this sounds crazy, but it's the only "fix" I've found. This Tyan system is pretty unstable, or, frankly, "a piece of crap." We've been waiting for Tyan to RMA our motherboard for a MONTH and they seem to keep stalling.

Other quirks I've found with this Tyan system:
1) The set of visible memory modules changes from boot to boot. Anywhere from one to all three memory modules on Node 1 disappear. (This is the problem Tyan is covering under warranty.)
2) The BIOS doesn't always come up on power-on. This happens nearly every time after a kernel panic (I do kernel development, so I panic all the time). I have to attempt to power on the system many times over.

#6
Posted 03/15/2011 11:11 PM   
[quote name='Arakageeta' date='16 March 2011 - 12:11 AM' timestamp='1300230664' post='1208170']
We've been waiting for Tyan to RMA our motherboard for a MONTH and they seem to keep stalling.
[/quote]
I thought I would post an update. Tyan refuses to send us a new motherboard to replace this flawed one. Instead, they want us to ship our machine out to them (at our expense), where they "promise" to fix it within two weeks and ship the machine back to us at "no cost." The machine is still partially functional and actively being used in research, so two weeks of downtime is unacceptable (who's to say it wouldn't take longer?). We continue to go back and forth with Tyan, waiting to hear their response to our request for a new system. I know their 8-GPU system is very appealing, but I recommend people think twice before buying this particular Tyan system. Perhaps we just got a lemon, but this product appears to have skipped QA testing.

#7
Posted 03/17/2011 04:28 PM   