Issues with multiple contexts on M60s in Azure

Hi,

We are currently working on a migration of our platform from AWS to Azure and have run into an issue with multiple contexts on Azure M60s using CUDA 7.5. All CUDA code we have tried, even a simple ‘Hello World’, hangs when the 10th or greater context is opened.

The test case is a hello world which opens a context and then pauses. The first 8 processes initialise OK, show as using 73 MB on the card, and run fine. The 10th and higher block and show as using 1 MB. If the first 8 processes exit, the blocked processes do not recover. I can find no resource limitation that could be causing this. All suggestions welcome!
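For reference, here is a minimal sketch of the kind of test case described: it forces context creation with cudaFree(0) and then pauses so the process stays visible in nvidia-smi. This is an illustrative reconstruction, not the exact program used; on the affected VMs the context-creating call is where the 10th and later processes hang.

#include <cstdio>
#include <unistd.h>
#include <cuda_runtime.h>

int main(void)
{
    // Force creation of a CUDA context on the default device.
    cudaError_t err = cudaFree(0);
    printf("context init: %s\n", cudaGetErrorString(err));

    // Hold the context open so the process shows up in nvidia-smi.
    pause();
    return 0;
}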

Sun Mar 26 14:57:33 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 89AE:00:00.0     Off |                  Off |
| N/A   43C    P8    15W / 150W |      2MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 8C45:00:00.0     Off |                  Off |
| N/A   43C    P0    39W / 150W |    601MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1     12387    C   ./a.out                                         73MiB |
|    1     12396    C   ./a.out                                         73MiB |
|    1     12403    C   ./a.out                                         73MiB |
|    1     12412    C   ./a.out                                         73MiB |
|    1     12419    C   ./a.out                                         73MiB |
|    1     12424    C   ./a.out                                         73MiB |
|    1     12431    C   ./a.out                                         73MiB |
|    1     12438    C   ./a.out                                         73MiB |
|    1     12445    C   ./a.out                                          1MiB |
|    1     12480    C   ./a.out                                          1MiB |
|    1     12652    C   a.out                                            1MiB |
|    1     12659    C   a.out                                            1MiB |
+-----------------------------------------------------------------------------+

I cannot offer any insights here other than that 73 MB would appear to be in the expected range of CUDA context sizes. When estimating GPU memory available to user programs, I usually make a more conservative assumption of 80 MB to 100 MB, depending on hardware and CUDA version. As a corollary, 1 MB seems way too small to represent a valid CUDA context, so it seems CUDA context creation never completed successfully.

Does the app use proper CUDA error checking?
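By proper error checking I mean checking the status returned by every CUDA API call, along the lines of the sketch below (the CHECK_CUDA macro name is just for illustration):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported immediately.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main(void)
{
    int n = 0;
    CHECK_CUDA(cudaGetDeviceCount(&n));
    printf("devices: %d\n", n);
    return 0;
}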

I’ve tried this two ways.

  1. We default to Python. pycuda.autoinit hangs, so there is no chance to error check: no exception is raised, it just hangs.

  2. The first CUDA call in C also just hangs. So again there is no chance to error check, as the call never returns - it just hangs.

I think this is an Azure virtualisation problem, as I see the same behaviour across both M60 and K80 in Azure, whereas both AWS and our own hardware behave as expected with the same OS, CUDA driver, CUDA toolkits, and code.

I’ve reproduced this on both M60 and K80 in Azure, on both CentOS and Ubuntu.

Dom

That seems to be a reasonable indication that you should raise the issue with the Azure vendor (Microsoft, I think?), and proceed based on whatever the vendor’s response is.