Is it worth reducing the number of parameters in GLSL function calls?

The GLSL 4.5 specification states that all in parameters of a GLSL function are passed by value, i.e., they are copied into the function at call time. This does not carry quite the same implications as it does for host code, but in general, unnecessary copies are a performance drag.
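
For illustration, here is a minimal, hypothetical example of what the specification describes; the function and variable names are invented:

// 'v' is an in parameter (the default), so the caller's argument is
// copied into it at call time. The modification below affects only
// the local copy and is never visible to the caller.
float scaledLength(vec3 v)
{
	v *= 2.0;               // modifies the copy, not the caller's vector
	return length(v) * 0.5; // same result as the length of the original
}

void caller()
{
	vec3 p = vec3(1.0, 2.0, 3.0);
	float l = scaledLength(p); // p is unchanged afterwards
}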

There are dozens of discussions about this topic, which most often end with the argument that “the compiler will probably optimize this anyway”. However, this claim or assumption is never backed by any hard data or inside knowledge. Could you please provide some insight into the extent to which this is actually the case?

I want to highlight my current use case. In our visualization software, we implemented a new heightfield sampler which is quite complex. The actual sample function calls a number of other functions for position clamping, index conversion, validity checks, data fetches, interpolation, … All of these functions depend in some way on a specific heightfield, so the code looks like this:

struct AdaptiveGrid
{
	bool mIsCompact;
	bool mHasValidArray;
	vec2 mOrigin;
	vec2 mUpper;
	vec2 mLevel0CellSize;
	ivec2 mLevel0Dimensions;
	index_t mDataSize;
	index_t mNumNodes;
	index_t mNumCells;
	index_t mNumPositions;
	vec2 mMaxExtrapolationDistances;

	restrict readonly vec8Idx* mInterpolationPatchIndices;
	restrict readonly vec4Idx* mInterpolationPatches;
	restrict readonly vec4Idx* mChildIndices;
	restrict readonly index_t* mNodeToCellIndices;
	restrict readonly index_t* mCellToDataIndices;
	restrict readonly vec4* mNodePositions;
	restrict readonly uint8_t* mValidArray;
};

index_t convertCellToDataIndex(AdaptiveGrid grid, index_t cellIndex)
{
	...
}

index_t convertNodeToCellIndex(AdaptiveGrid grid, index_t nodeIndex)
{
	...
}

vec4Idx getInterpolationPatch(AdaptiveGrid grid, index_t nodeIndex, vec2 samplePosition)
{
	index_t cellIndex = convertNodeToCellIndex(grid, nodeIndex);

	...
}

bool fetchValueOfCell(AdaptiveGrid grid, restrict readonly float* field, index_t cellIndex, out float value)
{
	index_t dataIndex = convertCellToDataIndex(grid, cellIndex);

	...
}

bool sample(AdaptiveGrid grid, restrict readonly float* field, vec2 position, out float value)
{
	vec2 clampedPosition = clampPosition(grid, position);
	index_t nodeIndex = getNodeIndex(grid, clampedPosition);
	vec4Idx interpolationPatch = getInterpolationPatch(grid, nodeIndex, clampedPosition);

	if (...)
	{
		return fetchValueOfCell(grid, field, convertNodeToCellIndex(grid, nodeIndex), value);
	}

	for (int vertexIndex = 0; vertexIndex < 4; vertexIndex++)
	{
		// currentNodeIndex and currentValue come from code elided here
		index_t currentCellIndex = convertNodeToCellIndex(grid, currentNodeIndex);

		if (fetchValueOfCell(grid, field, currentCellIndex, currentValue))
		{
			...
		}
	}

	...
}

As you can see, there are a lot of helper functions that depend on the members of an AdaptiveGrid struct, of which several instances exist at runtime. As the code is written, and as the calling convention is specified, the grid struct would be passed down into each of these helper calls and copied in vain. I estimate around one hundred copies of this struct per sample call.

It would, of course, be possible to move all of the helper functions’ code into the sample function to avoid any parameter copies, but this would result in code duplication and a huge main function that is very hard to maintain.
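
A less drastic way of reducing the parameter count, which we have also considered, would be to keep the current grid in a shader-global variable so that the helpers no longer take it as a parameter. A rough, hypothetical sketch (the global name is invented; it assumes only one grid is “current” at a time within a sample call):

// Hypothetical sketch: the grid is copied once into a per-invocation global
// variable instead of being copied into every helper call.
AdaptiveGrid gCurrentGrid;

index_t convertNodeToCellIndex(index_t nodeIndex)
{
	// reads gCurrentGrid.mNodeToCellIndices etc. directly
	...
}

bool sample(AdaptiveGrid grid, restrict readonly float* field, vec2 position, out float value)
{
	gCurrentGrid = grid; // one copy here instead of one per helper call

	vec2 clampedPosition = clampPosition(clampedPosition);
	...
}

Whether this would actually avoid anything, or whether the compiler already reduces the current version to the same thing after inlining, is exactly what we are unsure about.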

My question is: would this pay off, or would I only be duplicating by hand something that Nvidia’s GLSL compiler already does? Since the shaders run in a stackless environment, I assume there is considerable potential for inlining and that it is already exploited extensively. But could manually written “sausage code” still improve performance? We optimize our software for Nvidia hardware, so a statement about Nvidia’s GLSL compiler would suffice for our needs.

We’d appreciate a response, as writing and profiling the sausage code version would take several days.