OptiX 6.0.0 performance loss?

Hi,

I started developing an application with OptiX 5.1.0 and CUDA 9.1 that currently runs at 30 fps. Recently, I updated to OptiX 6.0.0 and now it runs at 20 fps. To test whether this was a problem with my application, I built the “OptiX Advanced Samples” (https://github.com/nvpro-samples/optix_advanced_samples) with different OptiX versions to see how they affect performance. Here are my results:

OptiX 5.1.0 and CUDA 9.1

  • optixGlass: 9.3 fps.
  • optixOcean: 27 fps.
  • optixParticleVolumes: ~30 fps.

OptiX 6.0.0 and CUDA 9.1

  • optixGlass: 8 fps.
  • optixOcean: 8.6 fps.
  • optixParticleVolumes: ~11 fps.

OptiX 6.0.0 and CUDA 10.1

  • optixGlass: 8 fps.
  • optixOcean: 0.4 fps (less than a frame per second!!).
  • optixParticleVolumes: ~11 fps.

I must be missing something and I can’t figure out what. I noticed that, in the Windows Task Manager under the GPU performance tab, the “Copy” graph is near 100% with OptiX 5.1.0 and at 0% when using OptiX 6.0.0 (not sure if this is relevant). I wonder if someone else can reproduce these results; any help would be appreciated!

My environment:
Windows 10, x64
Nvidia GTX 750Ti
Driver version: 430.64
(CUDA and OptiX versions mentioned above)

I filed a bug report to reproduce and analyze this.

It is strange that OptiX 6.0.0 performance is affected by the CUDA version; all of them should compile the PTX to the SM 3.0 target.
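
If you want to double-check which target your PTX was built for, the generated .ptx files state it in their header; the first lines typically look like this (the exact .version number depends on the CUDA toolkit used):

	// head of a generated .ptx file
	.version 6.4
	.target sm_30
	.address_size 64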

About the 30 fps and 20 fps numbers in your own application: if these numbers are really exactly that, please check that you have vsync disabled inside the NVIDIA Control Panel when benchmarking. Otherwise you might be limited by the monitor refresh rate or divisors of it.

Other than that, the recommended CUDA Toolkits to use with the different OptiX versions can be found in the OptiX Release Notes next to the respective download buttons on the developer.nvidia.com site.

For OptiX 5 the recommended CUDA toolkit is 9.0. (Some issues had been reported on this forum when using CUDA 9.1.)
For OptiX 6 the recommended version is CUDA 10.0.

Which CUDA toolkit can be used depends on the supported host compiler versions. That list can be found at the beginning of the CUDA_Installation_Guide_.pdf documents of the individual CUDA Toolkit installations.

It might also be worth monitoring your GPU clocks to make sure they’re the same for each run. If they’re not, you can consider locking your GPU clocks when benchmarking to rule out dynamic clock adjustments. It’s not normally a big issue, but if your GPU heats up too much, it will automatically slow down. You can monitor and control clock behavior using nvidia-smi. For example, try nvidia-smi -q -d CLOCK, or nvidia-smi -h and look for the -lgc option. If you lock clocks, don’t forget to reset them afterwards, and don’t lock them for very long.
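
A typical lock/benchmark/reset sequence might look roughly like this (the clock value here is just an example; -lgc requires admin rights and is only supported on some GPU/driver combinations, so check nvidia-smi -h on your setup):

	nvidia-smi -q -d CLOCK       (inspect current and supported clocks)
	nvidia-smi -lgc 1189,1189    (lock the graphics clock to a fixed value)
	(run the benchmark)
	nvidia-smi -rgc              (reset the graphics clock afterwards)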


David.

Thank you both for your answers! I really appreciate your support.
I followed your advice and I would like to share the results with you:

To be exact, I should have said that those values were approximate (~30 fps and ~20 fps respectively). Sorry for the misleading information. Anyway, I disabled vsync and tested again, but the results didn’t change.

I downloaded CUDA 9.0 to try again with OptiX 5.1.0 and also CUDA 10.0 to try with OptiX 6.0.0. You can see the updated results below:

OptiX 5.1.0 and CUDA 9.0:

  • optixGlass: 9.2 fps.
  • optixOcean: 27 fps.
  • optixParticleVolumes: ~30 fps.

OptiX 6.0.0 and CUDA 10.0:

  • optixGlass: 8 fps.
  • optixOcean: 8.5 fps.
  • optixParticleVolumes: ~11 fps.

It seems that CUDA 9.0 and CUDA 9.1 behave similarly with OptiX 5.1.0. However, OptiX 6.0.0 seems to perform better with CUDA 10.0 than with CUDA 10.1, at least with the optixOcean sample.

I checked that reference, and it seems the host compiler I use (Visual Studio 2015) is supported by both CUDA 9.0 and CUDA 10.0.

I tried the “nvidia-smi -q -d CLOCK” command you suggested (I didn’t know about that) and here are my results:

“Idle” GPU

  • Graphics: 141 MHz
  • SM: 141 MHz
  • Memory: 405 MHz
  • Video: 405 MHz

OptiX 5.1.0 and CUDA 9.0 (optixOcean sample)

  • Graphics: 1189 MHz
  • SM: 1189 MHz
  • Memory: 2700 MHz
  • Video: 1070 MHz

OptiX 6.0.0 and CUDA 10.0 (optixOcean sample)

  • Graphics: 1189 MHz
  • SM: 1189 MHz
  • Memory: 2700 MHz
  • Video: 1070 MHz

I’m not sure how to interpret this, but they seem quite similar to me. I also tested with GPU-Z (https://www.techpowerup.com/gpuz/) just to make sure performance was not capped by temperature, as you suggested.

If you have any other suggestions you want me to try, or if you need extra information to identify the issue, please let me know. I will be delighted to help.

Thanks!

Hi,
I was going to start a separate thread for this, but it seems to be the same issue, so I’ll post it here.
I can see similar behaviour with the optixOcean sample on my desktop PC with an RTX 2060 card, CUDA 10.1, OptiX 6.0, and driver version 425.25:

optixParticleVolumes: ~220 fps
optixOcean: ~9 fps (!)

A note here: I measured the timing of all 4 context launches in this sample, and the limiting factor is the one that does the actual rendering (height-field ray tracing) - this launch alone takes ~100 ms.
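
For reference, I measured each launch with simple wall-clock timing, roughly like this sketch (the entry-point index and width/height stand in for the sample’s actual values; rtContextLaunch is synchronous from the host’s perspective, so this covers the whole launch):

	#include <optixu/optixpp_namespace.h>
	#include <chrono>
	#include <cstdio>

	// Sketch: time one OptiX launch, given an initialized optix::Context.
	void timedLaunch( optix::Context context, unsigned entry, RTsize w, RTsize h )
	{
		auto t0 = std::chrono::high_resolution_clock::now();
		context->launch( entry, w, h );
		auto t1 = std::chrono::high_resolution_clock::now();
		printf( "launch %u took %.2f ms\n", entry,
			std::chrono::duration<double, std::milli>( t1 - t0 ).count() );
	}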

I would expect performance on this card to be several hundred fps, since on this exact same setup a simple height-map rendering in pure CUDA runs at around 1000 fps! As I want to use a similar rendering approach with OptiX too, it would be nice to find the cause and a solution for this.

What is also interesting is that on my notebook with an MX130 card (likewise CUDA 10.1 and OptiX 6.0, driver 419.17) this sample will not even launch, but stops with this error message:
“OptiX Error: ‘Unknown error (Details: Function “_rtContextLaunch2D” caught exception: Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (719): Launch failed)’”

All other advanced and basic samples run fine on both machines.

Thanks and kind regard!

Hi @mahonyyy,

Are you able to try optixOcean on your machines with more recent drivers?

I just ran optixOcean on an RTX GPU with the most recent driver and I’m getting above 100 fps.


David.

Hi @dhart,
unfortunately not on those two machines, but I just checked my home machine with an older GTX 960 and driver version 430.64.

I initially compiled the OptiX advanced samples with CUDA 8.0 and OptiX 5.1, but in the meantime I also installed CUDA 10.1 and OptiX 6.0. So I just compiled the optixOcean sample with the new toolkits:

CUDA 8.0 / OptiX 5.1 gives ~42 fps

CUDA 10.1 / OptiX 6.0 gives 0.48 fps (sometimes 0.92 fps)

This is an almost 100x slowdown compared to the older toolkits, even on the more recent driver version. I might upgrade this machine to the latest driver at some point; the most recent seems to be 430.86, though I’m not sure whether that would give any improvement.

Regards
Toni

Small addition:
I also tested some of the other samples:

optixGlass: 8.0/5.1 - 18 fps vs 10.1/6.0 - 12 fps
optixVox: 8.0/5.1 - 19 fps vs 10.1/6.0 - 17 fps
optixParticleVolumes: 8.0/5.1 - 46 fps vs 10.1/6.0 - crash
optixProgressivePhotonMap: 8.0/5.1 - 46 fps vs 10.1/6.0 - crash

The crash in optixParticleVolumes shows the same error as my work machine with the optixOcean sample:
“OptiX Error: ‘Unknown error (Details: Function “_rtContextLaunch2D” caught exception: Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (719): Launch failed)’”

On optixProgressivePhotonMap it is a different error:
“OptiX Error: ‘Unknown error (Details: Function “_rtContextLaunch2D” caught exception: Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (700): Illegal address)’”

Hmmm… we’ve had a few other unsolved cases of things going super crazy slow like this on certain machines. I’m sure something is going wrong; those framerates are abnormal. I promise OptiX didn’t really slow down 100x. ;) One thing to try is clearing your OptiX shader cache by deleting the cache file and re-running. On Windows, it’s usually a file like this: C:\Users\dhart\AppData\Local\NVIDIA\OptixCache\cache7.db, and on Linux it’s in /var/tmp/OptixCache.


David.

Interesting, I had never read/heard about the cache - but unfortunately this doesn’t change anything, at least on the machine I am currently on. I cleaned the folder and re-ran the sample, but still only 0.48 fps.

Regards
Toni

Small update: I managed to improve the performance of the code by roughly 20x through some simple changes.
First, I merged the two calls of rtPotentialIntersection and rtReportIntersection in the while loop into a single occurrence. This lowered the render time from 100 ms to roughly 50 ms.

In the second step, I completely removed both calls from the loop and got another improvement, from 50 ms down to ~5 ms, which is on the acceptable side but still ~10x slower than a similar rendering in pure CUDA (0.1 - 0.5 ms at the same image resolution and similar data size!). All timings are from the RTX 2060 card.

So obviously OptiX does not like rtPotentialIntersection and/or rtReportIntersection calls in hot loops - at least for the combination of CUDA 10.1 and OptiX 6.0.
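
For contrast, the unmodified sample reports each candidate hit inside the traversal loop, roughly like this (paraphrased, not verbatim):

	// inside the while loop, per cell - the pattern I removed:
	if (intersect_triangle(ray, p00, p11, p10, n, t, beta, gamma)) {
		if (rtPotentialIntersection(t)) {
			// ... set attributes ...
			if (rtReportIntersection(0)) return;
		}
	}
	if (intersect_triangle(ray, p00, p01, p11, n, t, beta, gamma)) {
		if (rtPotentialIntersection(t)) {
			// ... set attributes ...
			if (rtReportIntersection(0)) return;
		}
	}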

Modified intersection routine:

// Note: ray, boxmin, boxmax, cellsize, inv_cellsize, heights and the hit
// attributes (geometric_normal, shading_normal, ...) are rtDeclareVariable /
// rtBuffer declarations elsewhere in the sample.
RT_PROGRAM void intersect(int primIdx)
{
	// Step 1 is setup (handled in CPU code)

	// Step 2 - transform ray into grid space and compute ray-box intersection
	float3 t0 = (boxmin - ray.origin) / ray.direction;
	float3 t1 = (boxmax - ray.origin) / ray.direction;
	float3 near = fminf(t0, t1);
	float3 far = fmaxf(t0, t1);
	float tnear = fmaxf(near);
	float tfar = fminf(far);

	if (tnear >= tfar)
		return;
	if (tfar < 1.e-6f)
		return;
	tnear = max(tnear, 0.f);
	tfar = min(tfar, ray.tmax);

	// Step 3
	uint2 nnodes;
	nnodes.x = heights.size().x;
	nnodes.y = heights.size().y;
	float3 L = (ray.origin + tnear * ray.direction - boxmin) * inv_cellsize;
	int Lu = min(__float2int_rz(L.x), nnodes.x - 2);
	int Lv = min(__float2int_rz(L.z), nnodes.y - 2);

	// Step 4
	float3 D = ray.direction * inv_cellsize;
	int diu = D.x > 0 ? 1 : -1;
	int div = D.z > 0 ? 1 : -1;
	int stopu = D.x > 0 ? (int)(nnodes.x) - 1 : -1;
	int stopv = D.z > 0 ? (int)(nnodes.y) - 1 : -1;

	// Step 5
	float dtdu = abs(cellsize.x / ray.direction.x);
	float dtdv = abs(cellsize.z / ray.direction.z);

	// Step 6
	float far_u = (D.x > 0.0f ? Lu + 1 : Lu) * cellsize.x + boxmin.x;
	float far_v = (D.z > 0.0f ? Lv + 1 : Lv) * cellsize.z + boxmin.z;

	// Step 7
	float tnext_u = (far_u - ray.origin.x) / ray.direction.x;
	float tnext_v = (far_v - ray.origin.z) / ray.direction.z;

	// Step 8
	float yenter = ray.origin.y + tnear * ray.direction.y;
	float3 n, n2, p00;
	bool hit = false;
	float  t, beta, gamma;
	while (tnear < tfar){
		float texit = min(tnext_u, tnext_v);
		float yexit = ray.origin.y + texit * ray.direction.y;

		// Step 9
		float d00 = heights[make_uint2(Lu, Lv)];
		float d01 = heights[make_uint2(Lu, Lv + 1)];
		float d10 = heights[make_uint2(Lu + 1, Lv)];
		float d11 = heights[make_uint2(Lu + 1, Lv + 1)];
		float datamin = min(min(d00, d01), min(d10, d11));
		float datamax = max(max(d00, d01), max(d10, d11));
		float ymin = min(yenter, yexit);
		float ymax = max(yenter, yexit);

		if (ymin <= datamax && ymax >= datamin) {
			//float3
			p00 = make_float3(boxmin.x + Lu*cellsize.x, d00, boxmin.z + Lv*cellsize.z);
			float3 p11 = make_float3(p00.x + cellsize.x, d11, p00.z + cellsize.z);
			float3 p01 = make_float3(p00.x, d01, p11.z);
			float3 p10 = make_float3(p11.x, d10, p00.z);

			//MOD:
			float t2, beta2, gamma2;
			bool ta = intersect_triangle(ray, p00, p11, p10, n, t, beta, gamma);
			bool tb = intersect_triangle(ray, p00, p01, p11, n2, t2, beta2, gamma2);

			if (ta && tb){
				hit = true;
				if (t < t2){
					break; //keep t, beta, gamma
				}
				else
				{ //copy close t, beta, gamma
					t = t2;
					beta = beta2;
					gamma = gamma2;
					n = n2;
					break;
				}				
			}
			if (tb){
				hit = true;
				//copy close t, beta, gamma
				t = t2;
				beta = beta2;
				gamma = gamma2;
				n = n2;
				break;
			}
			if (ta){
				hit = true; //just keep t, beta, gamma and quit loop
				break;
			}
		}

		// Step 11
		yenter = yexit;
		if (tnext_u < tnext_v){
			Lu += diu;
			if (Lu == stopu)
				break;
			tnear = tnext_u;
			tnext_u += dtdu;
		}
		else {
			Lv += div;
			if (Lv == stopv)
				break;
			tnear = tnext_v;
			tnext_v += dtdv;
		}
	}

	if (hit){
		if (rtPotentialIntersection(t)) {
			geometric_normal = normalize(n);
			shading_normal = computeNormal(Lu, Lv, ray.origin + t*ray.direction);
			refine_and_offset_hitpoint(ray.origin + t*ray.direction, ray.direction,
				geometric_normal, p00,
				back_hit_point, front_hit_point);
			if (rtReportIntersection(0)) {
				return;
			}
		}
	}
}

Edit: I also compiled this on the notebook with the MX130 card (where the sample crashed previously) and it now works too!

Hey Toni,

That makes some sense, because rtPotentialIntersection() and rtReportIntersection() are real function calls that don’t get inlined, so using them inside your loop could be very heavy. Probably what is going on here is high register pressure. When you’re out of registers and need to call functions in the middle of a loop, the compiler has to write the data in registers out to memory before the call and read it back from memory after the call, so you end up completely bottlenecked by the bandwidth of the memory system. It is possible your code in OptiX 5 is using fewer registers than the same code in OptiX 6, but this still doesn’t explain the odd behavior you see with the optixOcean sample. So with this change, the difference between the two OptiX versions is more like 5x rather than 100x? Comparing against pure CUDA might be a bit trickier and/or less informative.

It might be easy to test my theory by analysing the memory reads & writes in your intersection program. I’d maybe start by looking at the reads from the heights array; you could fake them with constants and see what that does to perf. Without seeing the intersect_triangle() function implementation, I might suggest looking there for potential savings. If you’re compiling with nvcc, make sure you’re using --use_fast_math.
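
For instance, a quick (hypothetical) version of that experiment in your intersection program could look like this; pick constants that still produce hits, and keep in mind the compiler may optimize the loop differently once the reads are gone:

	// Step 9, with the four buffer reads faked for the bandwidth test:
	float d00 = 0.25f;  // was: heights[make_uint2(Lu,     Lv    )];
	float d01 = 0.50f;  // was: heights[make_uint2(Lu,     Lv + 1)];
	float d10 = 0.50f;  // was: heights[make_uint2(Lu + 1, Lv    )];
	float d11 = 0.75f;  // was: heights[make_uint2(Lu + 1, Lv + 1)];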


David.

Hi David,
the code posted above is from the actual optixOcean sample, just with some minor modifications - so it’s not really my code. E.g., intersect_triangle is a ‘built-in’ OptiX function (in optixu_math_namespace.h …).

I don’t get why OptiX 6 has such serious trouble with the non-modified version of this advanced sample while OptiX 5.1 seems to handle it flawlessly. I just installed CUDA 9.0 and OptiX 5.1 on the RTX machine to check this, and both versions of optixOcean - the original one and the one using my modified intersect routine - perform almost exactly the same.

So for me there is a serious problem in OptiX 6.0 compared to 5.1.