Raygen SBT data access results to illegal memory access

Hi everyone,

I am trying to make a relatively simple application using OptiX 7. The application is casting rays from viewpoint positions (origin) to a mesh’s vertices (direction) - i.e. with this mesh not being an actual GAS. Its vertex positions just serve to calculate the ray directions.

I have a raygen sbt record, which contains the vertex data and the viewpoint positions needed to create the rays.

struct raygen_record_data
{
	float3* vertex_buffer;
	float3* viewpoint_buffer;
};

template <typename DataType>
struct sbt_record
{
	__align__(OPTIX_SBT_RECORD_ALIGNMENT) char header[OPTIX_SBT_RECORD_HEADER_SIZE];
	DataType data;
};

using raygen_sbt_record = sbt_record<raygen_record_data>;

Specifically I have a problem with the ray generation program’s sbt record (for the moment). When I use optixGetSbtDataPointer() and access the data saved I get an illegal memory access error.
The host code for uploading data on the device is the following (I have a global tracer_state object that holds the data, i.e. m_impl):

/// m_impl->m_viewpoints is an std::vector<float3>
int num_viewpoints = m_impl->m_viewpoints.size()
int num_verts = m_impl->m_mesh->number_of_vertices()
int total_rays_to_cast = num_viewpoints * num_verts;

float3* verts = (float3*)malloc(total_rays_to_cast * sizeof(float3));
for (int i = 0; i < num_viewpoints; i++)
{
    memcpy(verts + i * num_vertices, m_impl->mesh->vertex_ptr(), num_vertices * sizeof(float3));
}
// device buffer (just like CUDABuffer in the SDK examples)
m_impl->d_mesh_vertices.alloc(total_rays_to_cast * sizeof(float3));
m_impl->d_mesh_vertices.upload(verts, tota_rays_to_cast);
m_impl->d_viewpoint_positions.alloc_and_upload(m_impl->m_viewpoints);

/// raygen sbt record device ptr
m_impl->m_raygen_sbt_buffer.alloc(sizeof(raygen_sbt_record));
raygen_sbt_record raygen_rec;
raygen_rec = {};
raygen_rec.data.vertex_buffer = m_impl->d_mesh_vertices.device_ptr();
raygen_rec.data.viewpoint_buffer = m_impl->d_viewpoint_positions.device_ptr();

/// m_impl->m_raygen_program is the ray generation program group
OPTIX_CHECK(optixSbtRecordPackHeader(m_impl->m_raygen_program, &raygen_rec));
m_impl->m_raygen_sbt_buffer.upload(&raygen_rec, 1);

/// m_impl->m_sbt is the SBT associated with context launch
m_impl->m_sbt.raygenRecord = m_impl->m_raygen_sbt_buffer.device_ptr();
free(verts);

The device code does not do anything at the moment, I just printf things to see

///	\brief Raycasting function
extern "C" __global__ void __raygen__rigid_ray_generation()
{
uint3 launch_dims = optixGetLaunchDimensions();
uint3 launch_idx = optixGetLaunchIndex();
unsigned int idx = launch_idx.x;

unsigned int num_viewpoints = rigid_generation_launch_params.num_viewpoints;
unsigned int num_vertices = rigid_generation_launch_params.num_vertices;

if (idx == 0)
{
	printf("Launch dimensions: %u x %u x %u\n", launch_dims.x, launch_dims.y, launch_dims.z);
	printf("number of viewpoints: %u\n", num_viewpoints);
	printf("number of vertices  : %u\n", num_vertices);
}
/// access raygen shader's sbt record
const ray::raygen_record_data* raygen_data = (ray::raygen_record_data*)optixGetSbtDataPointer();

if (idx == 0)
{
	printf("raygen sbt: %p\n", raygen_data);
	printf("%p", raygen_data->vertex_buffer); // here there is an illegal memory access
}
 }
}

I really hope these code snippets are enough. (However, if you need more info I can provide more code). Does anyone see anything obvious that I cannot see?

Thank you in advance,
Perukas

Hi Perukas,

I don’t see anything wrong with your code at first glance, the code I can see looks correct. Does the printf in there verify that the raygen_data pointer is the one you expect, same as the value you have on the host? Since m_impl is hidden, maybe worth asking whether it’s new or well-worn and tested? The upload() function’s 2nd argument, for example is definitely in units of struct size, and not bytes or words?

Is the problem definitely on the first/only raygen_data dereference using ->vertex_buffer? It’s a long shot, but you could try switching to float4 pointers just to verify it’s not some weird alignment issue with float3. If none of that helps, it’s possible you might find something fishy by catching this in the debugger, and seeing if it shows you the raygen_data memory. I don’t really expect either of those to be likely to work, I’m just throwing out some ideas since I can’t see a problem.


David.

Hi David,

Thank you very much for your prompt answer and your time.

  • The printf() call actually does not verify that the pointers are the same
  • "m_impl" is a (new) struct which is basically very similar to "RendererState" as in the SDK examples.
  • the buffer struct is just the same with "CUDAbuffer" in Ingo Wald's SIGGRAPH course examples, in which the upload() function accepts an element count and calculates the actual byte size itself.
  • I tried to add a single float as the raygen sbt data and I also tried to pad the struct, and still the same. Maybe there is something wrong with my allocations (however I can access the launch parameters using same allocation scheme).

    I use msvc140 with CUDA9.2. I am trying to figure out how to debug with Nsight (Do I need VS2019 and the latest Nsight release?)

    Edit: Just installed latest CUDA release and Nsight works just like that :).
    I still haven’t figured it out, but while the device pointer that is assigned to the sbt differs from time to time, the one that I get with otixSbtGetDataPointer() seems to be always the same.

    For now I worked around it, by moving the raygen-related data to the launch parameters.

    Best,
    Perukas

    I think step 1 is to make sure the pointer you get from optixGetSbtDataPointer() is correct. I implied that you have the value on the host, but it’s not there explicitly in your code sample above. The pointer returned should be equal to m_impl->m_sbt.raygenRecord plus the size of the record header which is currently 32 bytes. So for example, if your m_impl->m_sbt.raygenRecord has a value of 0x7f2e1ae00000, then optixGetSbtDataPointer() should return 0x7f2e1ae00020.


    David.

    Hi again David,

    For some reason when I print the device pointer on host-side I get for example: 0xb0575aa00 and on device-side I get: 0x575aa20.
    I checked that this is really consistent in the application (i.e. the device pointer on host side has an offset of 0xb00000000 from the pointer I get with optixGetSbtDataPointer()). I cannot say the same for the other program kinds because I didn’t get there yet. I really hacked my way for now, but does this generate any ideas?

    Thank you again.

    That looks to me like your pointer is correct and just printing out as 32 bit rather than 64 bit. If for some reason you really are only getting the lower 32 bits of your pointer, that would completely explain the illegal memory access error. But first try using “%ull” in your device printf() to see if the difference is caused by printf rather than the data.


    David.

    Hi David,

    Well now when printing sbt.raygenRecord on host I get: 47336237568 and printing using:

    ray::raygen_sbt_data* raygen_data = (ray::raygen_sbt_data*)optixGetSbtDataPointer();
    
    if (idx == 0)
    {
    	printf("raygen device pointer: %ull\n", raygen_data);
    }
    

    I get: 91597344ll

    Thanks again,
    Perukas

    Those still match the low 32 bits, but not the high. Hmmm.

    (Easier to see they match if you use “%llx” instead of my earlier “%ull” suggestion.)

    Just to make sure there isn’t an OptiX bug, I tried modifying the OptiX 7 SDK sample optixHello. In my case the pointer prints out correctly, and it is used to pass the color values to raygen that get displayed in the window. Just to rule out some kind of problem with printf and/or your toolchain, maybe you could try modifying the same sample and printing out the host & device pointers, and also make sure it works.

    Aside from that, maybe there’s some reason your pointer is being masked to the lower 32 bits. I don’t see it, but if you want to package the sample up and send it to me, I could try to reproduce the issue.


    David.

    Well, printing with “%llx” has the same pattern. On host: b0575aa00 and on device: 575aa20.

    When I get to the office, I will modify the “hello world” example and get back to you. If this won’t work I’ll try to wrap most of the code and send it to you (thank you very much for this).

    PS: I tried to build the SDK examples with nvrtc enabled and if I understood correctly there are a lot of standard header dependencies and you basically cannot use nvrtc (for running the examples) yet, right? But the other thing is that I couldn’t make them run without nvrtc enabled also (that’s on Visual Studio 2015 vc140). I think (and that’s coming from managing to build Ingo Wald’s examples on VS2017) that the custom build step that compiles the cuda source files to ptx is not “passed” from cmake to the Visual Studio projects, at least on Visual Studio 2015. I will check again on that tomorrow.

    Thanks again, I am really grateful,
    Perukas

    To completely rule out printf, you could for example do something like this in raygen:

    if (idx == 0) {
        int* x = (int*)0x0B57AC1EDE7EC7ED;
        printf("%llx\n", x);
    }
    

    If you can see all the bits when it runs (a 15 character word, not just the last 8 chars), that means your problem really is that the sbt data pointer is losing the high 32 bits for some reason. In that case, we need to figure out if it’s a compilation problem of some sort, or a bug in OptiX base on your pipeline settings or something.

    It will be helpful to know if optixHello runs for you, and if so, what the build differences are between the two projects.


    David.

    Hi David
    I tried running the “optixHello” example and everything was fine.
    Actually, I had the following problem while building the examples:
    When I build the SDK with cmake for VS2017 and CUDA 10.2 (and 9.2) everything is fine. But when I build the SDK for VS2015 the “run_nvcc.cmake” script does not run when compiling the project, but copy-pasting the compiled ptx files (from the VS2017 projects) at the correct path does the job and “optixHello” runs on VS2015 with the expected results (i.e. the host and device pointers are the correct).
    This led me to believe that there is something wrong with the nvrtc compiler options (or the compiler itself?) I used to compile my cuda source file to .ptx. :)

    So I did the following:
    Building the cuda source file with the shaders code using nvrtc and with the following flags:

    static constexpr char* k_nvrtc_compiler_options[] =
    {
    	"-arch",
    	"compute_30",
    	"-use_fast_math",
    	"-lineinfo",
    	"-default-device",
    	"-rdc",
    	"true",
    	"-D__x86_64",
    	"--std=c++11",
    	"--device-debug",
    	"-D_USE_MATH_DEFINES",
    	"-DNOMINMAX",
    	"-fmad=false",
    };
    

    Results in masking the first 32 bits of the pointer. But when building offline using nvcc with the same flags (except the ones that nvcc didn’t like) the pointer address is not masked. Is there something I didn’t notice about using nvrtc to compile cuda source files at runtime?

    Thanks again,
    Perukas

    Thanks for the update!

    I don’t see any compiler options there that would cause any pointer bit masking, and I also tried compiling optixHello using the exact options you have shown, though I’m on Linux today. It worked fine, and I don’t expect this is a Windows specific issue, so I’m guessing there is some other issue with compilation somewhere, either in your new project or in your system or environment setup. But maybe I’ll try on Windows next week if you’d like and don’t make progress before then.

    At least you have a working version, but this doesn’t seem resolved yet, correct? If you have multiple CUDA installs, maybe it’s worth trying to check exactly which nvrtc dll is being used? Is it possible your new project is setup with a 32 bit toolchain, rather than 64 bit?


    David.

    Hi David,

    Everything is set up for x64 (both the Visual Studio project and in the CUDA C/C++ properties of the project the Target Machine Platform is 64-bit). I also use the 64bit nvrtc and nvrtc_builtins dlls.

    So now, while I progressed with the raygen record, I don’t get any closest_hit program invocation. I checked to see if my traversable is valid with the following codse inside the raygen_program.

    if (idx == 0)
    {
    	printf("device traversable %llx\n", rigid_generation_launch_params.traversable_handle);
    
    	for (int i = 0; i < 10; i++)
    	{
    		float3 tr_vertex[3];	
    		optixGetTriangleVertexData(rigid_generation_launch_params.traversable_handle, i, 1, 0, tr_vertex);
    		printf("traversable vertex 0 at prim idx %u: (%f, %f, %f)\n", i, tr_vertex[0].x, tr_vertex[0].y, tr_vertex[0].z);
    		printf("traversable vertex 1 at prim idx %u: (%f, %f, %f)\n", i, tr_vertex[1].x, tr_vertex[1].y, tr_vertex[1].z);
    		printf("traversable vertex 2 at prim idx %u: (%f, %f, %f)\n", i, tr_vertex[2].x, tr_vertex[2].y, tr_vertex[2].z);
    	}	
    }
    

    The printf usually prints nan, 0.0, and arbitrary values. I read in the programming guide that OptiX may remove degenerate vertices from the Acceleration Structure during construction. I can’t say what’s going on because I really followed Ingo Wald’s examples on building the acceleration srtucture.

    // tvm_index_buffer -> std::vector<uint3> with the geometry's face indices (triangles)
    // tvm_normal_buffer -> std::vector<float3> with the vertex normals
    // tvm_vertex_buffer -> std::vector<float3> with the vertex positions
    if (m_tvm_index_buffer_host.empty() || m_tvm_normal_buffer_host.empty() || m_tvm_vertex_buffer_host.empty())
    	throw std::runtime_error("Tried to build an acceleration structure while there is not any geometry data set.");
    		
    m_tvm_vertex_buffer_device.alloc_and_upload(m_tvm_vertex_buffer_host);
    m_tvm_normal_buffer_device.alloc_and_upload(m_tvm_normal_buffer_host);
    m_tvm_index_buffer_device.alloc_and_upload(m_tvm_index_buffer_host);
    
    OptixTraversableHandle as_handle{ 0 };
    /// triangle input
    OptixBuildInput	triangle_input;
    memset(&triangle_input, 0, sizeof(OptixBuildInput));
    /// local vars, are gonna be needed
    CUdeviceptr d_tvm_vertices = m_tvm_vertex_buffer_device.device_ptr();
    CUdeviceptr d_tvm_indices = m_tvm_index_buffer_device.device_ptr();
    
    triangle_input.type = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
    triangle_input.triangleArray.vertexFormat = OPTIX_VERTEX_FORMAT_FLOAT3;
    triangle_input.triangleArray.vertexStrideInBytes = sizeof(float3);
    triangle_input.triangleArray.numVertices = (int)m_tvm_vertex_buffer_host.size();
    triangle_input.triangleArray.vertexBuffers = &d_tvm_vertices;
    	
    triangle_input.triangleArray.indexFormat = OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
    triangle_input.triangleArray.indexStrideInBytes	= sizeof(uint3);
    triangle_input.triangleArray.numIndexTriplets = (int)m_tvm_index_buffer_host.size();
    triangle_input.triangleArray.indexBuffer = d_tvm_indices;
    
    uint32_t triangle_input_flags[1] = { 0 };
    triangle_input.triangleArray.flags = triangle_input_flags;
    triangle_input.triangleArray.numSbtRecords = 1;
    triangle_input.triangleArray.sbtIndexOffsetBuffer = 0;
    triangle_input.triangleArray.sbtIndexOffsetSizeInBytes = 0;
    triangle_input.triangleArray.sbtIndexOffsetStrideInBytes = 0;
    /// blas setup
    OptixAccelBuildOptions	accel_options = {};
    accel_options.buildFlags = OPTIX_BUILD_FLAG_NONE | OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
    accel_options.motionOptions.numKeys = 0;
    accel_options.operation	= OPTIX_BUILD_OPERATION_BUILD;
    
    OptixAccelBufferSizes blas_buffer_sizes;
    OPTIX_CHECK(optixAccelComputeMemoryUsage(
    	m_optix_ctx, 
    	&accel_options, 
    	&triangle_input, 1, 
    	&blas_buffer_sizes));
    	/// prepare compaction
    cuda_buffer compacted_size_buffer;
    compacted_size_buffer.alloc(sizeof(uint64_t));
    
    OptixAccelEmitDesc emit_desc;
    emit_desc.type = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE /*OPTIX_PROPERTY_TYPE_AABBS*/;
    emit_desc.result = compacted_size_buffer.device_ptr();
    
    /// execute build
    cuda_buffer temp_buffer;
    temp_buffer.alloc(blas_buffer_sizes.tempSizeInBytes);
    
    cuda_buffer output_buffer;
    output_buffer.alloc(blas_buffer_sizes.outputSizeInBytes);
    
    OPTIX_CHECK(optixAccelBuild(
    	m_optix_ctx, 
    	m_cuda_stream, 
    	&accel_options,
    	&triangle_input, 
    	1, 
    	temp_buffer.device_ptr(), temp_buffer.byte_size, 
    	output_buffer.device_ptr(), output_buffer.byte_size, 
    	&as_handle, 
    	&emit_desc, 
    	1));
    	CUDA_SYNC_CHECK();
    
    /// perform compaction
    uint64_t compacted_size;
    compacted_size_buffer.download(&compacted_size, 1);
    
    m_as_buffer.alloc(compacted_size);
    OPTIX_CHECK(optixAccelCompact(
    	m_optix_ctx, 
    	m_cuda_stream, 
    	as_handle, 
    	m_as_buffer.device_ptr(), 
    	m_as_buffer.byte_size, 
    	&as_handle));
    	CUDA_SYNC_CHECK();
    
    	/// clean up
    	output_buffer.free();
    	temp_buffer.free();
    	compacted_size_buffer.free();
    	/// assign the trversable handle
    	m_tvm_traversable = as_handle;
    

    Printing the traversable handle yields the same result on device and on host.
    I tried putting the traversable’s vertex and index device pointers in the launch_params to see if they are uploaded OK on the device but they seem to be fine.

    Edit: I used the same code to print the vertex positions in one of Ingo Wald’s examples and I get the same things (i.e. nans, zeros and arbitrary values)so I believe that this is not the way to debug this.

    I am pretty sure that this setup should generate hits because I implemented the same thing in Optix 5.1 and everything was running fine.

    Thank you for your time,
    Perukas

    I have a similar problem. I have uploaded the hitgroupRecordBase and got the Data Pointer to this from the kernel with optixSbtGetDataPointer, but it was also 32 Bit masked. I checked after the kernel launch both addresses which had a difference in the upper 32 Bit (0x0005200420 and 0x2305200400) but only on Windows, on Linux it runs without problems.

    Found the solution, most probably you have an empty stddef.h, because the stddef.h of the cuda toolkit isn’t working. You have to define in the stddef.h LP64, which sets CUdeviceptr to long long instead of int.

    Edit: If you’re on Windows, use _WIN64, less confusing and less bugs later

    Thank you very much DigitalDragon64. I’ll have to check it out tomorrow and get back to you. :)