Generated PTX Code is not returned completely

Hello Developers

I am generating PTX code with the NVRTC runtime compiler, hoping to improve the performance of my application. As a reference tutorial I am using this PDF.

I have written a function that produces, for example, the following code to compile (this part works):

__device__ void GetSumOfBranches(double* branches, double* outSum)
{
	double sum = (branches[38])+(-branches[334])+(-branches[398])+(-branches[411]);
	*outSum = sum;
}

I then create the PTX code with the following function (this is where the first problem occurs, which leads to problem 2, see below):

char* FunctionGenerator::CreatePtx(const char* programText)
{

	nvrtcProgram prog;
	nvrtcCreateProgram(&prog, programText, "GetSumOfBranches.cu", 0, NULL, NULL);

	const char *opts[] = {"--gpu-architecture=compute_52", "--fmad=false"};
	nvrtcCompileProgram(prog, 2, opts);

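	// Obtain compilation log from the program.
	size_t logSize;
	nvrtcGetProgramLogSize(prog, &logSize);
	char *log = new char[logSize];
	nvrtcGetProgramLog(prog, log);
	// Obtain PTX from the program.
	size_t ptxSize; // size in bytes, including the trailing '\0'
	nvrtcGetPTXSize(prog, &ptxSize);
	char *ptx = new char[ptxSize];
	nvrtcGetPTX(prog, ptx);
	// Release what is no longer needed; the caller owns ptx and must delete[] it.
	delete[] log;
	nvrtcDestroyProgram(&prog);
	return ptx;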
}
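None of the NVRTC calls above have their return values checked, so a failed compilation would only show up later when the module is loaded. A minimal sketch of a checked variant follows. The names CheckNvrtc and CreatePtxChecked, and the use of std::string and exceptions, are assumptions for illustration and not part of the original code; for brevity the sketch does not destroy the program on every error path.

```cpp
#include <nvrtc.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws if an NVRTC call fails.
static void CheckNvrtc(nvrtcResult result, const char* what)
{
	if (result != NVRTC_SUCCESS)
		throw std::runtime_error(std::string(what) + ": " + nvrtcGetErrorString(result));
}

// Hypothetical checked variant of CreatePtx; returns the PTX as a std::string.
std::string CreatePtxChecked(const char* programText)
{
	nvrtcProgram prog;
	CheckNvrtc(nvrtcCreateProgram(&prog, programText, "GetSumOfBranches.cu", 0, NULL, NULL),
	           "nvrtcCreateProgram");

	const char* opts[] = {"--gpu-architecture=compute_52", "--fmad=false"};
	nvrtcResult compileResult = nvrtcCompileProgram(prog, 2, opts);

	// Fetch the log even on failure; it holds the compiler diagnostics.
	size_t logSize = 0;
	nvrtcGetProgramLogSize(prog, &logSize);
	std::string log(logSize, '\0');
	if (logSize > 0) nvrtcGetProgramLog(prog, &log[0]);
	if (compileResult != NVRTC_SUCCESS)
	{
		nvrtcDestroyProgram(&prog);
		throw std::runtime_error("nvrtcCompileProgram failed:\n" + log);
	}

	size_t ptxSize = 0; // includes the trailing '\0'
	CheckNvrtc(nvrtcGetPTXSize(prog, &ptxSize), "nvrtcGetPTXSize");
	std::string ptx(ptxSize, '\0');
	CheckNvrtc(nvrtcGetPTX(prog, &ptx[0]), "nvrtcGetPTX");
	nvrtcDestroyProgram(&prog);
	return ptx;
}
```

With this variant, ptx.c_str() can be passed to cuModuleLoadDataEx, and a compile error surfaces together with the compiler log instead of as an empty or partial PTX string.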

The problem in the code above is that ptxSize is around 770, but when I inspect the ptx char* with the Nsight EE debugger, it contains fewer than 700 characters. This causes a problem in the function that should return the CUfunction:

CUfunction* FunctionGenerator::CreateFunction(const char* programText)
{
	auto ptx = FunctionGenerator::CreatePtx(programText);

	CUdevice cuDevice;
	CUcontext context;
	CUmodule module;
	CUfunction* kernel;
	kernel = (CUfunction*)malloc(sizeof(CUfunction));
	cuInit(0);
	cuDeviceGet(&cuDevice, 0);
	cuCtxCreate(&context, 0, cuDevice);
	cuModuleLoadDataEx(&module, ptx, 0, 0, 0);
	auto result = cuModuleGetFunction(kernel, module, "GetSumOfBranches");
	return kernel;
}

Because ptx does not contain the full PTX code, the module is not loaded correctly, and cuModuleGetFunction returns the NOT_FOUND error because GetSumOfBranches (obviously) cannot be found.
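To tell whether cuModuleLoadDataEx or cuModuleGetFunction is the call that actually fails, each CUresult can be checked separately. The following is a sketch under assumptions: CheckCu and LoadFunction are hypothetical names, and cuCtxCreate is called with the three-argument form used in this post (CUDA 7.5), which may differ in newer toolkits.

```cpp
#include <cuda.h>
#include <cstdio>

// Hypothetical helper: prints the symbolic name of a failed driver API call.
static bool CheckCu(CUresult result, const char* what)
{
	if (result == CUDA_SUCCESS) return true;
	const char* name = NULL;
	cuGetErrorName(result, &name);
	std::fprintf(stderr, "%s failed: %s\n", what, name ? name : "unknown error");
	return false;
}

// Hypothetical checked variant: stops at the first failing call and reports it.
bool LoadFunction(const char* ptx, const char* functionName, CUfunction* outKernel)
{
	CUdevice device;
	CUcontext context;
	CUmodule module;
	return CheckCu(cuInit(0), "cuInit")
	    && CheckCu(cuDeviceGet(&device, 0), "cuDeviceGet")
	    && CheckCu(cuCtxCreate(&context, 0, device), "cuCtxCreate")
	    && CheckCu(cuModuleLoadDataEx(&module, ptx, 0, 0, 0), "cuModuleLoadDataEx")
	    && CheckCu(cuModuleGetFunction(outKernel, module, functionName), "cuModuleGetFunction");
}
```

If the module load succeeds and only the lookup fails, the PTX text itself was accepted by the driver, which points away from a truncated buffer and towards the function name or function kind.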

The generated "PTX" that I read out with the Nsight EE debugger is this (I have not seen any PTX code before, but I do not think it is just this "plain text"):

// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//

.version 4.3
.target sm_52
.address_size 64

It would be great if you could explain why my ptx does not contain all of the expected characters. It seems that ptx is terminated too early by a \0 terminator.

Thanks for your time and help. I appreciate it!

Edit 1
It seems that the debugger is not able to display the full value of ptx. However, I was able to print the value to the console, and this is the output. It looks pretty good to me.

// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//

.version 4.3
.target sm_52
.address_size 64

	// .globl	GetSumOfBranches

.visible .func GetSumOfBranches(
	.param .b64 GetSumOfBranches_param_0,
	.param .b64 GetSumOfBranches_param_1
)
{
	.reg .f64 	%fd<8>;
	.reg .b64 	%rd<3>;

	ld.param.u64 	%rd1, [GetSumOfBranches_param_0];
	ld.param.u64 	%rd2, [GetSumOfBranches_param_1];
	ld.f64 	%fd1, [%rd1+304];
	ld.f64 	%fd2, [%rd1+2672];
	sub.rn.f64 	%fd3, %fd1, %fd2;
	ld.f64 	%fd4, [%rd1+3184];
	sub.rn.f64 	%fd5, %fd3, %fd4;
	ld.f64 	%fd6, [%rd1+3288];
	sub.rn.f64 	%fd7, %fd5, %fd6;
	st.f64 	[%rd2], %fd7;
	ret;
}
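Since nvrtcGetPTXSize reports the size including the trailing '\0', one way to confirm that the buffer is complete without relying on the debugger view is to check where the first terminator sits. A small sketch; IsPtxComplete is a hypothetical helper name, not part of NVRTC:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if the buffer holds exactly one NUL-terminated
// string of the size reported by nvrtcGetPTXSize (which counts the '\0').
bool IsPtxComplete(const char* ptx, std::size_t ptxSize)
{
	if (ptx == NULL || ptxSize == 0) return false;
	// memchr stops at the first '\0' and never reads past ptxSize bytes.
	const void* firstNul = std::memchr(ptx, '\0', ptxSize);
	return firstNul == ptx + (ptxSize - 1);
}
```

Calling IsPtxComplete(ptx, ptxSize) right after nvrtcGetPTX would return false if the string really ended early, and true if only the debugger display was cut off.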

Edit 2
I added extern "C" so that the function name in the PTX output is not mangled. However, cuModuleGetFunction still returns the NOT_FOUND error.

extern "C" __device__ void GetSumOfBranches(double* branches, double* outSum)
{
	double sum = (branches[38])+(-branches[334])+(-branches[398])+(-branches[411]);
	*outSum = sum;
}

Cross-posted on Stack Overflow: "c++ - cuModuleGetFunction returns not found"

The answer to my own question is the following:

  1. To look up a function by name in the PTX module, always add extern "C" to the function being compiled, so that its name is not mangled in the PTX code.
  2. I was trying to compile a __device__ function, which does not work: it is emitted as a .func (as in the PTX above), and cuModuleGetFunction only finds kernel entry points. After changing __device__ to __global__, everything works like a charm.
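Both points can be applied at the place where the source text is generated. A minimal sketch, assuming the generator builds the program as a std::string; MakeKernelSource and its parameters are hypothetical names, not the original FunctionGenerator API:

```cpp
#include <string>

// Hypothetical helper: wraps an expression over `branches` in a kernel with
// an unmangled name (extern "C") that is a kernel entry point (__global__).
std::string MakeKernelSource(const std::string& kernelName, const std::string& sumExpression)
{
	return "extern \"C\" __global__ void " + kernelName +
	       "(double* branches, double* outSum)\n"
	       "{\n"
	       "\t*outSum = " + sumExpression + ";\n"
	       "}\n";
}
```

The returned string can be passed as programText to CreatePtx, and the same kernelName is then the name to pass to cuModuleGetFunction.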

I hope this will help someone.