Vulkan driver -- uniform buffer bug

Hello! I am currently implementing a Vulkan renderer and I just ran into some trouble. Simple shaders using a small number of uniform buffers work fine, but at some point adding more causes problems.

I have a shader which uses a push_constant block to address a texture within an array. If I add uniform buffers to the same shader, even ones that are never used, the texture sample gets corrupted. The first image shows what happens with the unused uniform buffers, and the lower one is without them.

The following file is the one that produces the lower image:

The following file causes the problem seen in the first image:

This looks like you ignored maxPerStageDescriptorUniformBuffers, which is 12 on current NVIDIA hardware. It means that a single shader stage (vertex, fragment, etc.) cannot be given access to more than 12 uniform buffers in total across all descriptor set layouts in a pipeline layout. What counts is stage access as declared in the descriptor set layouts, i.e. VkDescriptorSetLayoutBinding::stageFlags. It is irrelevant whether the shader that uses the layout actually declares or uses them; the limit counts against the pipeline layout.

This seems to be the most likely limit you ignored, but there are four additional limits for uniform buffers; vulkan.gpuinfo.org is your friend here. Also note that 12 is the minimum value that must be supported for maxPerStageDescriptorUniformBuffers.
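For reference, a minimal sketch of checking these limits up front (the function name and the counts you pass in are placeholders, not taken from your code):

[code]
#include <vulkan/vulkan.h>
#include <cstdio>

// Query the uniform-buffer related limits before building descriptor set layouts.
// The function name and the "needed" counts are illustrative only.
void checkUniformBufferLimits(VkPhysicalDevice gpu,
                              uint32_t uboCountPerStage,
                              uint32_t uboCountPerPipeline)
{
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(gpu, &props);
    const VkPhysicalDeviceLimits& limits = props.limits;

    printf("maxPerStageDescriptorUniformBuffers:   %u\n", limits.maxPerStageDescriptorUniformBuffers);
    printf("maxDescriptorSetUniformBuffers:        %u\n", limits.maxDescriptorSetUniformBuffers);
    printf("maxDescriptorSetUniformBuffersDynamic: %u\n", limits.maxDescriptorSetUniformBuffersDynamic);
    printf("maxUniformBufferRange:                 %u\n", limits.maxUniformBufferRange);
    printf("minUniformBufferOffsetAlignment:       %llu\n",
           (unsigned long long)limits.minUniformBufferOffsetAlignment);

    // The per-stage limit counts every uniform-buffer binding whose stageFlags
    // include that stage, across all descriptor set layouts in the pipeline layout,
    // regardless of whether the SPIR-V actually references the binding.
    if (uboCountPerStage > limits.maxPerStageDescriptorUniformBuffers ||
        uboCountPerPipeline > limits.maxDescriptorSetUniformBuffers)
    {
        printf("warning: pipeline layout exceeds the device's uniform buffer limits\n");
    }
}
[/code]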

Regards

I did implement a check for the number of uniform buffers, but I might have only checked maxDescriptorSetUniformBuffers. But what happens if this limit is ignored? Does the pipeline layout become invalid? That would seem to be the case, since the shader I am using relies on push constants to update the texture and position. However, the return value from vkCreateDescriptorSetLayout is VK_SUCCESS (and the validation layers accept the code too), which must be why I missed it.

Also, is there a reason why maxDescriptorSetUniformBuffers is 72 while the closest GL counterpart, GL_MAX_UNIFORM_BUFFER_BINDINGS, is 84 on the same driver? And why does the NVIDIA driver seemingly implement only the absolute minimum? This shows up not only in this case but also with pixel formats, image blits and copies, and the different tiling flags.

That being said, to be safe, any application using Vulkan should be written against the specification's minimum limits.

Thanks for your help!

Well, if you use Vulkan the wrong way, the spec says "undefined behavior", so returning VK_SUCCESS and then corrupting memory is actually valid. The validation layers should catch this, but they are still young. You could open an issue at LunarXchange about this; maybe they'll fix the validation layers.

The minimum limits are essentially "ask all the hardware vendors what they can support and then take the minimum". NVIDIA uses, as far as I know, a special cache for constant/uniform blocks that is optimized for uniform access, so that may be where the limit comes from. OpenGL lacks push constants and descriptor sets, and uses only one binding for arrays, so an OpenGL binding isn't the same as a Vulkan binding.

Linear tiling is basically useless and has no "linear tiling is awesome for this" feature; in fact I fully ignore it in my programs. For pixel formats, NVIDIA actually supports a lot more than the spec minimum - mostly because the spec requires very few formats. But most pixel formats are supported by either almost all vendors or almost none, and the most visible split is between dedicated/integrated and desktop/mobile implementations.

Of course, the elephant in the room is NVIDIA's bindless memory, which you have in OpenGL and which is far more powerful than anything the binding model offers, even in Vulkan. With that there's no limit on how many buffers you can access in a shader, even divergently. Same for textures. You can get something similar with sparse memory, but divergent texture access requires GL_NV_gpu_shader5.

Regards

Most Vulkan functions return VK_SUCCESS even if they are called incorrectly, that is true. However, in my experience, vkCreateDescriptorSetLayout and vkAllocateDescriptorSets have been consistent about either segfaulting or returning VK_ERROR_INITIALIZATION_FAILED when misused.

Linear textures are useful for reading a texture back to host memory, e.g. to take screenshots, debug framebuffers, etc. Without linear textures you need to copy to a buffer first and then map that buffer. If you're reading from an optimal-tiling texture, you first need to copy (or blit) to a linear image, then copy again to a buffer, delete the image you copied to, and map the buffer; this is because creating an optimal-tiling texture in HOST_VISIBLE memory is seldom allowed, for obvious reasons. With linear tiling we can map the image directly and read from it. It's not an awesome feature, but it's quite convenient sometimes.
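For illustration, a minimal sketch of the direct-mapping path (assuming the image was created with VK_IMAGE_TILING_LINEAR, bound to HOST_VISIBLE | HOST_COHERENT memory, already transitioned to VK_IMAGE_LAYOUT_GENERAL, and the copy into it has completed; the function name is made up):

[code]
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: map a linear image directly and return a pointer to its first pixel row.
const void* mapLinearImage(VkDevice device, VkImage image, VkDeviceMemory memory,
                           VkSubresourceLayout* outLayout)
{
    VkImageSubresource sub = {};
    sub.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    sub.mipLevel   = 0;
    sub.arrayLayer = 0;

    // Row pitch and offset of a linear image are implementation defined, so query them.
    vkGetImageSubresourceLayout(device, image, &sub, outLayout);

    void* data = nullptr;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &data);

    // Pixel rows start at outLayout->offset and are outLayout->rowPitch bytes apart.
    return static_cast<const uint8_t*>(data) + outLayout->offset;
}
[/code]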

At least that was the only way I could get reading a BC-compressed texture to work. That said, I wouldn't use linear tiling in performance code; it was only used for debugging, and I find I end up excluding it simply because finding a format that matches my requirements with linear tiling on the NVIDIA driver is hopeless. Then again, the Vulkan spec doesn't require anything to be supported for linear tiling, so assuming it exists might not be the best way to go.

When it comes to uniform buffers, it's mostly annoying to manage an entire shader library with only 12 uniform buffers, when the GL implementation worked just fine (with 84 as the upper limit) and the AMD driver can handle a seemingly endless count. With descriptor sets it also becomes important to keep the descriptor set layout of every shader as similar as possible, so one can actually use the incremental binding model without having to rebind every descriptor set from the first mismatching set and up each time a pipeline is switched.

I'm going to see what I can do to get the uniform buffer count down when I get back to an NVIDIA machine, and check whether that's the last thing to fix. You might notice the black piece of geometry in the background - let's just say it shouldn't be black ;)

You can use vkCmdCopyImageToBuffer to copy from an optimal-tiling image in GPU memory directly into a host-readable buffer; no temporary linear image required. There is no real difference between a linear image and a buffer in host memory, except that the linear image is far less likely to be supported, plus details such as row stride. Also, there is no need to map every time: persistent mapping is the norm in Vulkan, you just map once at application start. For readback you only need to add a barrier to HOST_READ after the copy and then wait on the fence.
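A rough sketch of that readback path (the function name and dimensions are placeholders; it assumes the image is already in TRANSFER_SRC_OPTIMAL layout and the buffer was created with TRANSFER_DST usage in HOST_VISIBLE memory):

[code]
#include <vulkan/vulkan.h>

// Sketch: copy one mip of an optimal-tiling image into a host-readable buffer
// and make the transfer write visible to the host.
void recordReadback(VkCommandBuffer cmd, VkImage src, VkBuffer dst,
                    uint32_t width, uint32_t height)
{
    VkBufferImageCopy region = {};
    region.bufferOffset      = 0;
    region.bufferRowLength   = 0;   // 0 = tightly packed
    region.bufferImageHeight = 0;
    region.imageSubresource.aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT;
    region.imageSubresource.mipLevel       = 0;
    region.imageSubresource.baseArrayLayer = 0;
    region.imageSubresource.layerCount     = 1;
    region.imageOffset = { 0, 0, 0 };
    region.imageExtent = { width, height, 1 };

    vkCmdCopyImageToBuffer(cmd, src, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, dst, 1, &region);

    // Transfer write -> host read. After queue submit, wait on the fence and then
    // read the (persistently) mapped buffer.
    VkBufferMemoryBarrier barrier = {};
    barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_HOST_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer              = dst;
    barrier.offset              = 0;
    barrier.size                = VK_WHOLE_SIZE;

    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_HOST_BIT,
                         0, 0, nullptr, 1, &barrier, 0, nullptr);
}
[/code]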

It's not 12 in total, it's 12 per stage per pipeline. So you can have 12 in the vertex shader and another 12 in the pixel shader; that's the 72 total limit per pipeline, 12 per stage (vert, tesc, tese, geom, frag and comp).

Yeah, AMD's GCN has basically just global memory, with no real difference between index, vertex, uniform or storage buffers; GCN is more of a compute card that can do a bit of graphics on the side. The binding model is kind of funny: AMD advertises basically using one giant descriptor set for everything, obviously because they don't have limits on this stuff, while NVIDIA has not only much smaller limits but also supported just 4(!!) descriptor sets at the start (which again is the minimum required for maxBoundDescriptorSets).

The "reuse lower descriptor sets" idea is an optimization for shaders that are close to each other. For example, the camera state like the view and projection matrices will normally stay the same for the entire frame, but the bound textures may vary per object. So when you render the objects, swapping only the subset of descriptors that actually changes is faster. It isn't an "all shaders need to share sets" rule; having totally different sets, for example per render pass/subpass or stage (gbuffer, shadows, lighting, postprocess, UI, etc.), is fine - you just don't want to swap everything per drawcall.

In your shader, you have GlobalBlock, WindParams, CameraBlock, RenderTargetBlock and GlobalLightBlock. I don't know your pipeline, but all of these sound like they are fixed for the frame, and LightForwardBlock and CSMParamBlock look fixed too. So unless you actually bind/update these independently of each other within a frame, there's no reason to have 7 different buffers. Just put all that stuff into one buffer (you only need to stay below 64 KB). Of course you normally have something like triple buffering, so you actually have 1 descriptor set layout, 1 buffer with enough space to fit 3 of your uniform blocks (properly aligned!) and 3 descriptor sets (1 per frame) that bind the corresponding subrange of that one buffer. The entire memory of the buffer is mapped at application start, and you just update the subrange.
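As a sketch of what that could look like (the names and the alignment helper are mine; the per-frame size is assumed to be the size of your merged uniform block, and the three sets are assumed to be allocated from the same layout already):

[code]
#include <vulkan/vulkan.h>

// Round up to the device's minUniformBufferOffsetAlignment (always a power of two).
VkDeviceSize alignUp(VkDeviceSize value, VkDeviceSize alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Bind three aligned subranges of one uniform buffer to three per-frame
// descriptor sets (binding 0 = the merged per-frame uniform block).
void bindPerFrameRanges(VkDevice device, VkBuffer buffer, VkDescriptorSet sets[3],
                        VkDeviceSize perFrameSize, VkDeviceSize minUboAlignment)
{
    const VkDeviceSize stride = alignUp(perFrameSize, minUboAlignment);

    VkDescriptorBufferInfo infos[3];
    VkWriteDescriptorSet   writes[3];
    for (uint32_t i = 0; i < 3; ++i)
    {
        infos[i].buffer = buffer;
        infos[i].offset = stride * i;   // frame i's subrange
        infos[i].range  = perFrameSize;

        writes[i] = {};
        writes[i].sType           = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writes[i].dstSet          = sets[i];
        writes[i].dstBinding      = 0;
        writes[i].descriptorCount = 1;
        writes[i].descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
        writes[i].pBufferInfo     = &infos[i];
    }
    vkUpdateDescriptorSets(device, 3, writes, 0, nullptr);
}
[/code]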

Regards

The reason I always use a staging image is to perform a blit operation to decompress potentially compressed formats; the code does this regardless of whether the texture is compressed or not. The mapping method is not only for reading, it's for reading and writing, otherwise having a pre-allocated buffer for reads would be viable. However, the code is supposed to perform an actual memory map, and may have more than one resource mapped at a time, which makes it hard to predict how big a scratch buffer would need to be. I know about persistently mapped buffers - I used them frequently in OpenGL - and I use them wherever mapping and unmapping is unnecessary, for example when updating uniforms or vertex/index buffers for dynamic geometry.

Trying to figure out which buffers are used in which shader stage is not so simple either. The only way I can see it working in a pipeline is to have some kind of annotation language and a preprocessor that detects which buffers are used in which shaders, or to hardcode the usages in the program. As far as I have seen, the Khronos GLSL reference compiler cannot be used to retrieve which shader stages a uniform buffer is used in either, otherwise it would've been trivial. Right now I can't easily distinguish where they are used, and even if I could, I would want my descriptor set layouts to be as identical as possible between shaders, which they won't be if one shader uses one set of stage flags and another shader uses a different one.

When it comes to descriptor set binding, it's mostly just ugly to have a system which incrementally saves every applied descriptor set in a list and then applies that entire list every time the descriptor sets need to be bound, instead of having each stage of the rendering code update the state. Although I find that when using secondary command buffers within render passes, one must apply all descriptor sets anyway, since each secondary command buffer has its own local state. However, in that case we can apply all 'shared' descriptor sets when we start recording the secondary buffer, and then incrementally apply only the sets that change, which is what I am doing - but it also means it is important for the code to be able to assume which descriptor sets are compatible and which are not. In my case, I let all descriptor sets up to and including number 5 be shared, so I know sets 0-5 can be applied once and used by all. Then only the descriptor sets unique to that shader change per object, with offsets into each uniform buffer of that set describing the slice used by the draw. This method lets me avoid changing descriptor sets and instead just supply them with offsets, which I vaguely recall NVIDIA claiming is more efficient, since the driver can detect that only the offsets change. I would assume it's somewhat similar to how glBindBufferRange worked when only offsets changed.
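For clarity, the per-draw pattern I mean is roughly this (a sketch assuming a single VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC binding in the per-object set; names are placeholders):

[code]
#include <vulkan/vulkan.h>

// Rebind the same per-object descriptor set every draw, changing only the dynamic
// offset into the shared uniform buffer. The dynamic offset count must match the
// number of dynamic bindings in the sets being bound (one here).
void drawObject(VkCommandBuffer cmd, VkPipelineLayout layout, uint32_t setIndex,
                VkDescriptorSet perObjectSet, uint32_t objectUboOffset,
                uint32_t indexCount)
{
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            setIndex, 1, &perObjectSet,
                            1, &objectUboOffset);   // only the offset changes per draw
    vkCmdDrawIndexed(cmd, indexCount, 1, 0, 0, 0);
}
[/code]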

The reason for the many uniform buffers is that many subsystems are responsible for updating their share of the shader state. The descriptor sets are fixed from startup, meaning I have only one descriptor set per unique set layout, and a single buffer bound to every binding slot. The buffer is expanded when a new instance is requested, and each subsystem gets a slice of that buffer to work with. This keeps the descriptor set count down to a minimum, which is necessary because I found myself running out of memory on my AMD driver when using a descriptor set per material, so I figured that is not the proper use of descriptor sets. At the same time, updating a descriptor set has to wait until the work using that descriptor set is done, so updating descriptor sets to emulate 'binding' as in OpenGL or DirectX is not viable either. The idea comes from keeping the total number of allocations and buffers to a minimum in order to improve memory access, as explained here: [url]https://developer.nvidia.com/vulkan-memory-management[/url].

I also implement texturing using the method proposed by AMD, where all textures are bound in a single descriptor set and textures are addressed using integers stored in uniform buffers. While the AMD driver allows 2^32 textures to be bound at the same time, the NVIDIA driver supports 49k, which is sufficient for most applications. To be honest, I can't quite figure out how to have different textures per object without either having a descriptor set per material (which ran me out of memory on the AMD driver, no clue about the NVIDIA one) or updating the same descriptor between every draw (which stomps data and cannot work). It could work if there were a Cmd-type command for updating descriptor sets, but there isn't. I guess putting textures in their own descriptor set, to keep the descriptor set memory footprint as small as possible, might be a solution, but I really prefer the AMD idea of just binding them all at once.
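The layout for that "bind everything once" texture set boils down to something like this (a sketch; the array size is arbitrary and has to stay within the device's sampled-image limits):

[code]
#include <vulkan/vulkan.h>

// One descriptor set containing a large array of combined image samplers;
// shaders pick a texture with an integer index coming from a uniform buffer
// or push constant. textureCount must respect maxPerStageDescriptorSampledImages
// and maxDescriptorSetSampledImages.
VkDescriptorSetLayout createTextureTableLayout(VkDevice device, uint32_t textureCount)
{
    VkDescriptorSetLayoutBinding binding = {};
    binding.binding         = 0;
    binding.descriptorType  = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
    binding.descriptorCount = textureCount;              // e.g. 4096
    binding.stageFlags      = VK_SHADER_STAGE_FRAGMENT_BIT;

    VkDescriptorSetLayoutCreateInfo info = {};
    info.sType        = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
    info.bindingCount = 1;
    info.pBindings    = &binding;

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &info, nullptr, &layout);  // error handling omitted
    return layout;
}
[/code]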

So the problem isn't really with Vulkan itself, the problem is putting it into practice. I know I have some redundancies - GlobalLightBlock, CSMParamBlock and LightForwardBlock can be merged into one, for example - so I think that's fine; I was just not prepared to be limited by the count of uniform buffer declarations. I would've assumed something slightly less obvious was causing the issue.

Interesting. I never thought about decompression on readback, mostly because render target readback strikes me as the most obvious use case, and that isn't really BC-compressed. The buffer readback obviously doesn't perform conversion. But I can't come up with an actual example where I would want to decompress BC7 or the like from the GPU to the CPU.

I know that there is a SPIR-V reflection API in the making, but I'm not sure how mature it is. Also, I wouldn't count on the current GLSL->SPIR-V compiler to remove unreferenced globals. On the other hand, it's all about the layout anyway, not the actual shader. Achieving maximum shader flexibility and maximum runtime performance don't really go hand in hand with Vulkan. For full performance I would argue that the layouts (descriptor layouts, vertex attribute layouts, render targets, etc.) need to be carefully crafted by hand so that multiple pipelines fit together. Automatically building high-performance layouts seems to be a very complicated optimization problem in the general case.

I'm using a system without descriptor set updates; even though it's still being built, it's already working really well. As I said, I use 3 "frames" on the CPU side: one that is written by the CPU and will be submitted next frame (so I lag one frame), one that is being executed on the GPU, and a third so there is no GPU-CPU sync point directly after the present that would stall before I could record again. I also use dynamic offsets for per-object data; I don't have one descriptor set per object. The difference is that since all buffers, descriptor sets and command buffers are triple-buffered on the CPU, I never have any sync problems. So, let's say each subsystem changes only its part of the per-frame buffer - that's fine, they can still share one buffer. Since the data isn't actually read by the GPU until I submit, a drawcall I record into a command buffer can access data that I write later, as long as it's there when I submit. Of course, again, this means I need to design the layouts manually; a full subsystem-plugin framework looks very complex in this setting.
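Roughly sketched (the names are mine, and a real frame obviously carries more state than this):

[code]
#include <vulkan/vulkan.h>

// Triple-buffered per-frame resources: the CPU writes frame N while the GPU
// consumes frame N-1; the fence of frame N-2 guarantees its resources are free
// to reuse before recording into them again.
struct FrameResources
{
    VkCommandBuffer cmd;          // re-recorded every time this slot comes around
    VkFence         inFlight;     // signaled when the GPU finished this frame
    VkDescriptorSet perFrameSet;  // binds this frame's subrange of the shared UBO
    void*           uboMapped;    // persistently mapped CPU pointer for that subrange
};

void beginFrame(VkDevice device, FrameResources frames[3], uint32_t frameIndex)
{
    FrameResources& frame = frames[frameIndex % 3];

    // Wait until the GPU is done with this slot (two submits ago), then recycle it.
    vkWaitForFences(device, 1, &frame.inFlight, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frame.inFlight);

    // frame.uboMapped can now be written and frame.cmd re-recorded safely.
}
[/code]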

Yeah, I have an "all textures bound" system too. The 48K textures is again for all stages together; per stage it's 8K textures per shader stage. Not enough for me. So I use an array of texture arrays, where each texture can have 2K layers. (This is for NVIDIA; the minimum requirement is just 16!! textures per stage and 256 layers, although the lowest on desktop is Intel with 64 textures and 1K layers.) This means I can have at most 16 million textures, with some restrictions on format and aspect ratio. The "all layers need to have the same size" restriction I circumvent by using sparse textures, so the textures only need the same aspect ratio instead of the same size. And I can actually stream the content, including memory, in and out of the texture without changing the textures themselves, their views, or the descriptor sets that contain them; I just need to update a small GPU buffer which says which mipmap is loaded for a frame.

I do the same with index/vertex/skin/etc. data, where I track the binding on the CPU and change the drawcalls. But since all sizes and offsets in the single sparse buffer (yes, I have a single buffer for all geometry/skin/etc. data) stay the same for the entire application, this optimizes a lot of things. The CPU-written buffers are a bit more complicated, and I haven't fully fleshed that part out, but there are multiple variants with mixtures of dynamic-offset uniform buffers and/or storage buffers and/or push constants. I will probably end up allocating per-object data in large batches, with each batch getting a descriptor set, and then reuse slots in the batches. Variable-size per-object CPU data like skeleton matrices is still a problem here, since it fragments the batches and prevents universal reuse.

Now this uses sparse memory, which AMD doesn't expose in their current drivers, but their GCN hardware supports it in DX12 and OpenGL, so this is just a driver limitation rather than a hardware one. Intel is a bit tougher: Broadwell and up support it on DX12, but Intel exposes just a single queue, which doesn't work in my current system (I currently need 3 queues on NVIDIA and would need 2 on AMD).

Also, secondary command buffers: I don't know about AMD, but on NVIDIA I found that while recording secondary command buffers into primary ones is very fast, it affects submit time substantially, to the point where the combined time is actually slower than recording everything directly into the primary one. In some tests, recording 10k drawcalls directly into a primary buffer took just 1 ms and submitting took about 40 µs; recording 10k secondary drawcalls was about 10 µs, but submitting them took around 4 ms! This is CPU time only; GPU time was not really affected. And this was while switching between a full-blown tess+geom 5-stage pipeline and a simple 2-stage pipeline between every drawcall. At least at the moment, directly recording tens of thousands of drawcalls per frame seems easily doable even on a single thread.

Regards

Actually, I find it rather simple to dynamically create pipelines. In my implementation I only use four high-level objects to determine the shader settings - render pass, shader program, vertex layout and input layout - along with the subpass index. The shader program knows all about the blend settings, which are declared in an FX-style language, meaning they are static at the resource level. With this information I simply build a DAG which, at the leaf nodes, either contains a VkPipeline or has to create and store one. Whenever I change any of the pointers in my DAG, I move an iterator to an already existing slot on that level of the DAG (each object is a level in the DAG), as well as all iterators below it, and then I can easily retrieve the VkPipeline. It sounds a bit cumbersome and slow, but all in all the total number of combinations in your program is likely to be very small indeed, so doing a maximum of 5 binary searches is fine. I only implemented this because the VkPipelineCache seems to have broken on my AMD card, but it also has to be slower, because the VkPipelineCache has to serialize 11 pointers, 2 handles and 2 integers. When I implemented this I also got a pretty big performance increase compared to when only the pipeline cache was used (and working).
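A much simplified sketch of that lookup, collapsing the per-level iterators into a single map over the whole key tuple (the engine-side types are stand-ins, not the actual classes):

[code]
#include <vulkan/vulkan.h>
#include <map>
#include <tuple>

// Opaque engine-side types, assumed for illustration only.
struct ShaderProgram;
struct VertexLayout;
struct InputLayout;

// The key combines render pass, subpass index, program, vertex layout and input layout.
using PipelineKey = std::tuple<VkRenderPass, uint32_t /*subpass*/,
                               const ShaderProgram*, const VertexLayout*, const InputLayout*>;

class PipelineMap
{
public:
    VkPipeline getOrCreate(const PipelineKey& key)
    {
        auto it = pipelines.find(key);   // O(log n), n = distinct state combinations
        if (it != pipelines.end())
            return it->second;

        VkPipeline pipeline = createPipeline(key);
        pipelines.emplace(key, pipeline);
        return pipeline;
    }

private:
    // Engine-specific creation (fills VkGraphicsPipelineCreateInfo); stubbed here.
    VkPipeline createPipeline(const PipelineKey&) { return VK_NULL_HANDLE; }

    std::map<PipelineKey, VkPipeline> pipelines;
};
[/code]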

I currently have an intermediate language for shaders, which defines a 'wrapping' around GLSL (keeping the GLSL code intact) but allows for annotations and shader meta-data like blending, rasterizer and multisample settings. I could add support there for declaring which uniform buffers a shader uses, so that the information can be retrieved later. I already use the reflection data from my own implementation to determine uniform offsets (by applying std430 or std140), shader entry points, sets and bindings, vertex shader inputs, pixel shader outputs, etc. Especially now with Vulkan, where there is no built-in shader reflection, it is more relevant than ever for a half-decent content pipeline to be able to retrieve shader information.

I haven't really implemented the same idea for uniform buffers as for geometry data yet. However, a single uniform buffer works like a memory pool, allocating and freeing offsets, meaning I can expand the uniform buffer, deallocate instances, and simply hand the freed offsets back out whenever I need that memory again.
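As a sketch, with fixed-size slices and the growth step left out (names are placeholders):

[code]
#include <vulkan/vulkan.h>
#include <vector>

// One big uniform buffer used as a pool of fixed-size slices: allocating hands
// out an offset, freeing pushes it onto a free list for reuse. Growing the pool
// (recreating the VkBuffer with a larger size) is not shown.
class UniformSlicePool
{
public:
    UniformSlicePool(VkDeviceSize sliceSize, uint32_t sliceCount)
        : sliceSize(sliceSize), sliceCount(sliceCount), nextFresh(0) {}

    VkDeviceSize allocate()
    {
        if (!freeOffsets.empty())
        {
            VkDeviceSize offset = freeOffsets.back();
            freeOffsets.pop_back();
            return offset;
        }
        // VK_WHOLE_SIZE doubles as an "exhausted, grow the buffer" sentinel here.
        return (nextFresh < sliceCount) ? sliceSize * nextFresh++ : VK_WHOLE_SIZE;
    }

    void free(VkDeviceSize offset) { freeOffsets.push_back(offset); }

private:
    VkDeviceSize sliceSize;
    uint32_t     sliceCount;
    uint32_t     nextFresh;
    std::vector<VkDeviceSize> freeOffsets;
};
[/code]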

When it comes to secondary command buffers, I only really use them for threading reasons. The only issue I have seen with them is that recording into a secondary command buffer is so fast that using several threads gains little. It's still under development and being optimized, but currently the way I use drawing threads is to launch a new thread job whenever I switch pipelines; all subsequent Cmd-calls are put on that thread. Whenever I switch subpasses or end the render pass, I sync all threads and execute their commands on the main command buffer. I haven't really gotten to profiling the actual execution times, but I do find that having many draw threads makes little to no difference for performance.
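The secondary command buffer usage boils down to something like this (a sketch; it assumes the primary render pass was begun with VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS, and the function names are made up):

[code]
#include <vulkan/vulkan.h>

// A worker thread records draws into a secondary command buffer tied to the
// current render pass/subpass via inheritance info; the main thread later
// replays it inside the primary command buffer.
void recordSecondary(VkCommandBuffer secondary, VkRenderPass renderPass,
                     uint32_t subpass, VkFramebuffer framebuffer)
{
    VkCommandBufferInheritanceInfo inheritance = {};
    inheritance.sType       = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
    inheritance.renderPass  = renderPass;
    inheritance.subpass     = subpass;
    inheritance.framebuffer = framebuffer;   // optional, but can help the driver

    VkCommandBufferBeginInfo begin = {};
    begin.sType            = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    begin.flags            = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
    begin.pInheritanceInfo = &inheritance;

    vkBeginCommandBuffer(secondary, &begin);
    // ... bind pipeline, descriptor sets and issue draws on this thread ...
    vkEndCommandBuffer(secondary);
}

void executeOnPrimary(VkCommandBuffer primary, uint32_t count,
                      const VkCommandBuffer* secondaries)
{
    // Valid only inside a render pass begun with
    // VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS.
    vkCmdExecuteCommands(primary, count, secondaries);
}
[/code]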

It sounds like I have to investigate the many uses of sparse memory; I never really researched that subject when doing OpenGL. Immediate geometry should be very simple to implement using sparse memory, giving us memory guaranteed to be big enough to fit all immediate geometry while only making the required size resident. I can also use sparse memory to avoid having to actually recreate my uniform buffers, and just make a new slice resident when needed. Thanks!