CUDA shared memory CNN (convolutional neural network)

Hi,

I have a question regarding a CNN model that is loaded into VRAM.
The model is loaded by a process when it first starts (the process is a Linux application written in C++, compiled with gcc and linked against the CUDA libraries).

Is it possible to share the model loaded in VRAM among different processes? I want to launch another process that uses the same model without loading it into VRAM again (otherwise there would be two identical copies of the model, occupying twice the VRAM). I would like something like shared-memory IPC on Linux: one process creates a shared memory segment that somehow maps the memory already loaded in VRAM, and another process attaches to that segment and thereby gets access to the CNN model.

Regards,
Radu

Yes, CUDA has an IPC facility that allows a device pointer, and the allocation it represents, to be shared with CUDA code running in another (64-bit Linux) process.

There is a CUDA IPC API, and a CUDA IPC sample code demonstrating the necessary concepts.
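Roughly, the handle export/import looks like this (a minimal sketch; error checking is omitted, and how you move the opaque 64-byte handle between processes, e.g. via a file, pipe, or socket, is up to you):

#include <cuda_runtime.h>

/* Producer process: allocate device memory and export an IPC handle for it.
 * The handle can then be sent to another process over any ordinary host IPC. */
cudaIpcMemHandle_t export_allocation(float **d_buf, size_t nbytes)
{
    cudaIpcMemHandle_t handle;
    cudaMalloc((void **)d_buf, nbytes);      /* allocation stays owned by this process */
    cudaIpcGetMemHandle(&handle, *d_buf);
    return handle;
}

/* Consumer process: map the producer's allocation into this address space. */
float *import_allocation(cudaIpcMemHandle_t handle)
{
    float *d_buf = NULL;
    cudaIpcOpenMemHandle((void **)&d_buf, handle,
                         cudaIpcMemLazyEnablePeerAccess);
    return d_buf;                            /* release with cudaIpcCloseMemHandle() */
}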

Thanks for the response.

I’ve updated the code, and now I have a main process that creates the CNN.
For some layers of the CNN (the ones that are cudaMalloc-ed) I created IPC handles and saved them to a file from the main process; then I launched another process that read that file and used the model already loaded. So the IPC apparently worked.

I say apparently because if I launch another process that reads those handles and attaches to the model in VRAM, and then yet another one, at some point I get an error like "misaligned address: Resource temporarily unavailable" in one of the processes (not the main one) and it crashes. After a while the same error is thrown by another of the launched processes, and it crashes too.

What could be the reason for this kind of misalignment “after a while”?

Could it be that the handles eventually point to invalid data? (Maybe I made some wrong assumptions and something does change the CNN model in memory. As I said at the beginning, I am saving only some layers from the model, e.g. I save the convolution layers but not the max-pool layers. I assumed that everything that was cudaMalloc-ed would never be freed, because I didn't notice any place in the code where the model might be changed.)

"Resource temporarily unavailable" may arise from an inability to fork() or otherwise create a new process. I can't immediately suggest ideas for "misaligned address", but this may be a cascade of errors: the process-creation issue may give rise to a data interpretation problem.

This is really just speculation. I would investigate limits on creating new processes (hard limits, resource limits, running out of a resource like swap space, etc.) and carefully check the errors reported by either CUDA (are you doing careful CUDA error checking?) or any system calls you may be making.
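As a sketch of what I mean by careful error checking, wrapping every CUDA runtime call (including the IPC calls) with something like this surfaces the first failing call instead of a later cascade:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Report and abort on the first failing CUDA runtime call. */
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err_));        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

/* Usage, for example:
 *   CHECK_CUDA(cudaIpcGetMemHandle(&handle, net.workspace));
 *   CHECK_CUDA(cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess));
 */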

If that turns up nothing useful, I'm out of ideas, and I would suggest that a minimal reproducer might be in order. Such a problem may depend on the exact OS and OS settings (e.g. resource limits) and maybe even other things like the amount of system memory.

If all of this fails, consider having a master process that fields work from the other processes via ordinary Linux IPC and then issues that work to the GPU from a single process, along the lines of the sketch below.
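A very rough sketch of that arrangement, just to illustrate the shape (the socket path, load_network() and predict() are hypothetical placeholders for whatever your application actually does; error handling is mostly omitted):

/* gpu_master.c: a single process owns the model in VRAM; other processes
 * send work to it over a Unix domain socket instead of touching the GPU. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define SOCK_PATH "/tmp/cnn_master.sock"   /* hypothetical socket path */

/* Hypothetical stand-ins for the real model code. */
static void *load_network(void)  { return (void *)1; }
static float predict(void *net, const float *frame, size_t n)
{ (void)net; (void)frame; (void)n; return 0.f; }

int main(void)
{
    void *net = load_network();            /* model loaded into VRAM exactly once */

    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
    unlink(SOCK_PATH);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 8);

    for (;;) {
        int cli = accept(srv, NULL, NULL);
        if (cli < 0)
            continue;

        static float frame[640 * 480];     /* fixed-size request, for the sketch only */
        ssize_t got = read(cli, frame, sizeof(frame));   /* a real server would loop */
        if (got == (ssize_t)sizeof(frame)) {
            float score = predict(net, frame, sizeof(frame) / sizeof(float));
            write(cli, &score, sizeof(score));           /* reply to the client */
        }
        close(cli);
    }
}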

Regarding the resources: without this 'hack' I could have up to 8 processes running without any problems. The only issue was that each process loaded the same convolutional net, which resulted in a VRAM usage of 7 GB out of 8 GB.
So I thought it would be a good idea to share the model among the processes.
Now, with this partial sharing (as I said, I am not mapping the whole net; I wanted to check incrementally whether it works), I would save at least 3 GB of VRAM for 8 processes, but as I discovered, this sharing might not be trivial.

I am trying to find a starting point to debug this or at least to understand the crash.
As an observation, when the crash occurs I see the following in the system log:

(GPC 3, TPC 3): Physical Multiple Warp Errors
[234355.055134] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x51de48=0x3000f 0x51de50=0x24 0x51de44=0xd3eff2 0x51de4c=0x17f
[234355.055173] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 4): Misaligned Address
[234355.055177] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 3, TPC 4): Physical Multiple Warp Errors
[234355.055180] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x51e648=0xf 0x51e650=0x24 0x51e644=0xd3eff2 0x51e64c=0x17f
[234355.056566] NVRM: Xid (PCI:0000:01:00): 43, Ch 00000030, engmask 00000101

If I start the main process and then, after 30 minutes, start a 'slave' that reads the handles saved by main, everything looks fine: the model works perfectly in the slave, and the process occupies less VRAM than before sharing. (I waited 30 minutes because I suspected the handles used in the slave might be inconsistent, i.e. that during those 30 minutes the main process might have changed the model in VRAM so that the saved handles would no longer be valid for the slave.)

Of course, CUDA will allow you to do that.

I saw something interesting when I used cuda-memcheck.
I said in my previous post that if I launched only one process after the main one, I didn’t see any crash.
When I run the second process (which reads the handles and basically does not reload the whole net into VRAM) under cuda-memcheck, it crashes within a couple of seconds with a 'misaligned address' error. The trace is like this:

======== Misaligned Shared or Local Address
========= at 0x00000570 in maxwell_scudnn_128x32_relu_small_nn
========= by thread (32,0,0) in block (40,0,0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x213a85]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x9b6241]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x9d5053]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x6f0a4e]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x3d35b7]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x3d541b]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x360f3d]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x5ca21]
========= Host Frame:/usr/local/lib/libcudnn.so.6 (cudnnConvolutionForward + 0x69) [0x5d2d9]
========= Host Frame:/home/rududoo/app_no_gpu [0x2b15b5]
========= Host Frame:/home/rududoo/app_no_gpu [0x2b4213]
========= Host Frame:/home/rududoo/app_no_gpu [0x2b595d]
========= Host Frame:/home/rududoo/app_no_gpu [0x18ef4d]
========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x1073dd]

CUDA Error: unspecified launch failure
CUDA Error: unspecified launch failure: Resource temporarily unavailable
OpenCV Error: Gpu API call (NCV Assertion Failed: cudaError_t=29, file=/home/rududoo/opencv2.4.13/modules/gpu/src/nvidia/core/NCV.cu, line=487
) in NCVDebugOutputHandler, file /home/rududoo/opencv2.4.13/modules/gpu/src/cascadeclassifier.cpp, line 173
OpenCV Error: Gpu API call (NCV Assertion Failed: cudaError_t=29, file=/home/rududoo/opencv2.4.13/modules/gpu/src/nvidia/core/NCV.cu, line=487
) in NCVDebugOutputHandler, file /home/rududoo/opencv2.4.13/modules/gpu/src/cascadeclassifier.cpp, line 173
terminate called after throwing an instance of ‘cv::Exception’
what(): /home/rududoo/opencv2.4.13/modules/gpu/src/cascadeclassifier.cpp:173: error: (-217) NCV Assertion Failed: cudaError_t=29, file=/home/rududoo/opencv2.4.13/modules/gpu/src/nvidia/core/NCV.cu, line=487
in function NCVDebugOutputHandler

========= Error: process didn’t terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found

So there is an issue with this IPC implementation, and it shows up immediately under the cuda-memcheck tool.

Right now I am stuck.

Basically the IPC is like this:

if (!mattach) {
    /* Main process: allocate the workspace and export an IPC handle for it. */
    net.workspace = cuda_make_array(0, (workspace_size-1)/sizeof(float)+1);
    cudaIpcMemHandle_t handle;
    memset(&handle, 0, sizeof(handle));
    cudaIpcGetMemHandle(&handle, net.workspace);

    /* Write the raw handle bytes to the file, one byte at a time. */
    for (size_t i = 0; i < sizeof(handle); i++) {
        int ret = fprintf(fp, "%c", handle.reserved[i]);
        if (ret != 1)
            printf("ret = %d\n", ret);
    }
} else {
    /* Secondary process: read the handle back and map the existing allocation. */
    cudaIpcMemHandle_t handle;
    memset(&handle, 0, sizeof(handle));
    printf("sizeof handle %zu\n", sizeof(handle));
    for (size_t i = 0; i < sizeof(handle); i++) {
        int ret = fscanf(fp, "%c", handle.reserved + i);
        if (ret == EOF)
            printf("received EOF\n");
        else if (ret != 1)
            printf("fscanf returned %d\n", ret);
    }
    cudaIpcOpenMemHandle((void **)&net.workspace, handle, cudaIpcMemLazyEnablePeerAccess);
}
fclose(fp);

So what happens here is that the main process takes the 'if' branch, and every other launched process takes the 'else' branch (attaching to the existing VRAM allocation using the handle).

Before this change, there was only this line: net.workspace = cuda_make_array(0, (workspace_size-1)/sizeof(float)+1); so each process allocated new space in VRAM.

I do not understand how and why the address becomes misaligned.

The crash comes later, after the network is loaded, during the call to a 'prediction function' that passes a frame through the net for prediction (see the call stack in the previous post).