Bas Nieuwenhuizen

A driver on the GPU

2022-04-25T00:00:00+02:00

The title might be a bit hyperbolic here, but we’re indeed exploring a first step in that direction with radv. The impetus here is the ExecuteIndirect command in Direct3D 12 and some games that are using it in non-trivial ways. (e.g. Halo Infinite)

ExecuteIndirect can be seen as an extension of what we have in Vulkan with vkCmdDrawIndirectCount. It adds extra capabilities. To support that with vkd3d-proton we need the following indirect Vulkan capabilities:

Binding vertex buffers.
Binding index buffers.
Updating push constants.

This functionality happens to be a subset of VK_NV_device_generated_commands and hence I’ve been working on implementing a subset of that extension on radv. Unfortunately, we can’t really give the firmware an “extended indirect draw call” and execute stuff, so we’re stuck generating command buffers on the GPU.

The way the extension works, the application specifies a command “signature” on the CPU, which specifies that for each draw call the application is going to update A, B and C. Then, at runtime, the application provides a buffer containing the data for A, B and C for each draw call. The driver then processes that into a command buffer and executes it as a secondary command buffer.

The workflow is then as follows:

The application (or vkd3d-proton) provides the command signature to the driver which creates an object out of it.
The application queries how big a command buffer (“preprocess buffer”) of $n$ draws with that signature would be.
The application allocates the preprocess buffer.
The application does its stuff to generate some commands.
The application calls vkCmdPreprocessGeneratedCommandsNV which converts the application buffer into a command buffer (in the preprocess buffer)
The application calls vkCmdExecuteGeneratedCommandsNV to execute the generated command buffer.

What goes into a draw in radv

When the application triggers a draw command in Vulkan, the driver generates GPU commands to do the following:

Flush caches if needed
Set some registers.
Trigger the draw.

Of course we skip any of these steps (or parts of them) when they’re redundant. The majority of the complexity is in the register state we have to set. There are multiple parts here:

Fixed function state:
1. subpass attachments
2. static/dynamic state (viewports, scissors, etc.)
3. index buffers
4. some derived state from the shaders (some tessellation stuff, fragment shader export types, varyings, etc.)
shaders (start address, number of registers, builtins used)
user SGPRs (i.e. registers that are available at the start of a shader invocation)

Overall, most of the pipeline state is fairly easy to emit: we just precompute it on pipeline creation and memcpy it over if we switch shaders. The most difficult part is probably the user SGPRs, because they are derived from a lot of the remaining API state. Note that the list above doesn’t include push constants, descriptor sets or vertex buffers. The driver computes all of these, and generates the user SGPR data from that.

Descriptor sets in radv are just a piece of GPU memory, and radv binds a descriptor set by providing the shader with a pointer to that GPU memory in a user SGPR. Similarly, we have no hardware support for vertex buffers, so radv generates a push descriptor set containing internal texel buffers and then provides a user SGPR with a pointer to that descriptor set.

For push constants, radv has two modes: a portion of the data can be passed in user SGPRs directly, but sometimes a chunk of memory gets allocated and then a pointer to that memory is provided in a user SGPR. This fallback exists because the hardware doesn’t always have enough user SGPRs to fit all the data.

On Vega and later there are 32 user SGPRs, and on earlier GCN GPUs there are 16. This needs to fit pointers to all the referenced descriptor sets (including internal ones like the one for vertex buffers), push constants, builtins like the start vertex and start instance etc. To get the best performance here, radv determines a mapping of API object to user SGPR at shader compile time and then at draw time radv uses that mapping to write user SGPRs.
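The budgeting problem described above can be sketched in plain C. This is only an illustration under simplifying assumptions, not radv's actual allocation logic: the struct, the function name and the policy of "inline as many push constant dwords as fit, spill the rest behind one pointer" are all hypothetical, and every pointer is assumed to take exactly one user SGPR.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result of the compile-time mapping, counted in user SGPRs. */
struct user_sgpr_layout {
   unsigned descriptor_set_ptrs; /* one 32-bit pointer per referenced set */
   unsigned vertex_buffer_ptr;   /* 1 if the internal vertex buffer set is used */
   unsigned draw_builtins;       /* e.g. start vertex and start instance */
   unsigned inline_push_dwords;  /* push constants passed directly in SGPRs */
   unsigned push_constant_ptr;   /* 1 if some push constants live in memory */
};

/* Assumed policy: fixed users first, then push constants in what is left. */
static struct user_sgpr_layout
assign_user_sgprs(unsigned total_sgprs, unsigned descriptor_sets,
                  bool uses_vertex_buffers, unsigned draw_builtins,
                  unsigned push_constant_dwords)
{
   struct user_sgpr_layout layout = {0};

   layout.descriptor_set_ptrs = descriptor_sets;
   layout.vertex_buffer_ptr = uses_vertex_buffers ? 1 : 0;
   layout.draw_builtins = draw_builtins;

   unsigned used = descriptor_sets + layout.vertex_buffer_ptr + draw_builtins;
   assert(used <= total_sgprs);
   unsigned free_sgprs = total_sgprs - used;

   if (push_constant_dwords <= free_sgprs) {
      /* Everything fits: no memory fallback needed. */
      layout.inline_push_dwords = push_constant_dwords;
   } else {
      /* Fallback: one SGPR holds a pointer, the rest is used inline. */
      assert(free_sgprs >= 1);
      layout.push_constant_ptr = 1;
      layout.inline_push_dwords = free_sgprs - 1;
   }
   return layout;
}
```

With 16 user SGPRs, 4 descriptor sets, vertex buffers, 2 builtins and 32 dwords of push constants, this sketch ends up with a push constant pointer plus 8 inline dwords; with 32 user SGPRs and only 8 dwords of push constants everything stays inline.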

This results in some interesting behavior, like switching pipelines does cause the driver to update all the user SGPRs because the mapping might have changed.

Furthermore, as an interesting performance hack radv allocates all upload buffers (for the push constant and push descriptor sets), shaders and descriptor pools in a single 4 GiB region of memory so that we can pass only the bottom 32 bits of all the pointers in a user SGPR, getting us farther with the limited number of user SGPRs. We will see later how that makes things difficult for us.
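A minimal sketch of the pointer trick, assuming the upper 32 bits are a constant that is known for the whole region. The helper names and the `region_high` parameter are hypothetical and only illustrate the idea:

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the bottom 32 bits of an address inside the shared region.
 * region_high is the (assumed) common upper half of every such address. */
static uint32_t compress_va(uint64_t va, uint32_t region_high)
{
   assert((uint32_t)(va >> 32) == region_high);
   return (uint32_t)va;
}

/* Rebuild the full 64-bit address from the 32 bits that were passed. */
static uint64_t expand_va(uint32_t low, uint32_t region_high)
{
   return ((uint64_t)region_high << 32) | low;
}
```

The catch is visible in the assertion: any buffer whose address does not share those upper bits cannot be referenced this way.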

Generating a command buffer on the GPU

As shown above radv has a bunch of complexity around state for draw calls and if we start generating command buffers on the GPU that risks copying a significant part of that complexity to a shader. Luckily ExecuteIndirect and VK_NV_device_generated_commands have some limitations that make this easier. The app can only change

vertex buffers
index buffers
push constants

VK_NV_device_generated_commands also allows changing shaders and the rotation winding of what is considered a primitive backface but we’ve chosen to ignore that for now since it isn’t needed for ExecuteIndirect (though especially the shader switching could be useful for an application).

The second curveball is that the buffer the application provides needs to provide the same set of data for every draw call. This avoids having to do a lot of serial processing to figure out what the previous state was, which allows processing every draw command in a separate shader invocation. Unfortunately we’re still a bit dependent on the old state that is bound before the indirect command buffer execution:

The previously bound index buffer
Previously bound vertex buffers.
Previously bound push constants.

Remember that for vertex buffers and push constants we may put them in a piece of memory. That piece of memory needs to contain all the vertex buffers/push constants for that draw call, so even if we modify only one of them, we have to copy the rest over. The index buffer is different: in the draw packets for the GPU there is a field that is derived from the index buffer size.

So in vkCmdPreprocessGeneratedCommandsNV radv partitions the preprocess buffer into a command buffer and an upload buffer (for the vertex buffers & push constants), both with a fixed stride based on the command signature. Then it launches a shader which processes a draw call in each invocation:

   if (shader used vertex buffers && we change a vertex buffer) {
      copy all vertex buffers 
      update the changed vertex buffers
      emit a new vertex descriptor set pointer
   }
   if (we change a push constant) {
      if (we change a push constant in memory) {
         copy all push constants
         update changed push constants
         emit a new push constant pointer
      }
      emit all changed inline push constants into user SGPRs
   }
   if (we change the index buffer) {
      emit new index buffers
   }
   emit a draw command
   insert NOPs up to the stride
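Because both parts use a fixed stride, each invocation can find its own slice of the preprocess buffer from the draw index alone. The following sketch shows that arithmetic; the struct and function names are hypothetical, it assumes the command part comes first and the upload part directly after it, and it ignores any alignment requirements a real implementation would have.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: [cmd 0][cmd 1]...[cmd n-1][upload 0]...[upload n-1] */
struct preprocess_layout {
   uint64_t cmd_stride;    /* bytes of generated commands per draw */
   uint64_t upload_stride; /* bytes of vertex buffer/push constant data per draw */
   uint64_t upload_base;   /* byte offset where the upload part starts */
   uint64_t total_size;    /* size the application has to allocate */
   uint32_t max_draws;
};

static struct preprocess_layout
layout_preprocess_buffer(uint32_t max_draws, uint32_t cmd_stride,
                         uint32_t upload_stride)
{
   struct preprocess_layout l;
   l.cmd_stride = cmd_stride;
   l.upload_stride = upload_stride;
   l.max_draws = max_draws;
   l.upload_base = (uint64_t)max_draws * cmd_stride;
   l.total_size = l.upload_base + (uint64_t)max_draws * upload_stride;
   return l;
}

/* Where invocation `draw` writes its packets. */
static uint64_t draw_cmd_offset(const struct preprocess_layout *l, uint32_t draw)
{
   assert(draw < l->max_draws);
   return (uint64_t)draw * l->cmd_stride;
}

/* Where invocation `draw` copies its vertex buffers and push constants. */
static uint64_t draw_upload_offset(const struct preprocess_layout *l, uint32_t draw)
{
   assert(draw < l->max_draws);
   return l->upload_base + (uint64_t)draw * l->upload_stride;
}
```

The same `total_size` computation is what the size query has to return, which is one reason the logic ends up duplicated between the CPU and the shader.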

In vkCmdExecuteGeneratedCommandsNV radv uses the internal equivalent of vkCmdExecuteCommands to execute as if the generated command buffer is a secondary command buffer.

Challenges

Of course one does not simply move part of the driver to GPU shaders without any challenges. In fact we have a whole bunch of them. Some of them just need a bunch of work to solve, some need some extension specification tweaking and some are hard to solve without significant tradeoffs.

Code maintainability

A big problem is that the code needed for the limited subset of state that is supported is now in 3 places:

The traditional CPU path
For determining how large the preprocess buffer needs to be
For the shader called in vkCmdPreprocessGeneratedCommandsNV to build the preprocess buffer.

Having the same functionality in multiple places is a recipe for things going out of sync. This makes it harder to change this code and much easier for bugs to sneak in. This can be mitigated with a lot of testing, but a bunch of GPU work gets complicated quickly. (e.g. the preprocess buffer being larger than needed still results in correct results, getting a second opinion from the shader to check adds significant complexity).

`nir_builder` gets old quickly

In the driver at the moment we have no good high level shader compiler. As a result a lot of the internal helper shaders are written using the nir_builder helper to generate NIR, the intermediate representation of the shader compiler. Example fragment:

   nir_push_loop(b);
   {
      nir_ssa_def *curr_offset = nir_load_var(b, offset);

      nir_push_if(b, nir_ieq(b, curr_offset, cmd_buf_size));
      {
         nir_jump(b, nir_jump_break);
      }
      nir_pop_if(b, NULL);

      nir_ssa_def *packet_size = nir_isub(b, cmd_buf_size, curr_offset);
      packet_size = nir_umin(b, packet_size, nir_imm_int(b, 0x3ffc * 4));

      nir_ssa_def *len = nir_ushr_imm(b, packet_size, 2);
      len = nir_iadd_imm(b, len, -2);
      nir_ssa_def *packet = nir_pkt3(b, PKT3_NOP, len);

      nir_store_ssbo(b, packet, dst_buf, curr_offset, .write_mask = 0x1,
                     .access = ACCESS_NON_READABLE, .align_mul = 4);
      nir_store_var(b, offset, nir_iadd(b, curr_offset, packet_size), 0x1);
   }
   nir_pop_loop(b, NULL);

It is clear that this all gets very verbose very quickly. This is somewhat fine as long as all the internal shaders are tiny. However, between this and raytracing our internal shaders are getting significantly bigger and the verbosity really becomes a problem.
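For comparison, here is roughly what the fragment above computes when written as ordinary C running on the CPU. It is a sketch rather than driver code: it works in dwords instead of bytes, the header encoding and the NOP opcode value are written from memory of Mesa's sid.h and should be treated as assumptions, and like the fragment it does not treat a single leftover dword specially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed PM4 type-3 packet encoding (from memory of Mesa's sid.h). */
#define SKETCH_PKT3_NOP 0x10u
/* Largest NOP packet the loop emits, in dwords (0x3ffc in the fragment). */
#define SKETCH_MAX_NOP_DWORDS 0x3ffcu

static uint32_t sketch_pkt3(uint32_t opcode, uint32_t count)
{
   return (3u << 30) | ((count & 0x3fffu) << 16) | ((opcode & 0xffu) << 8);
}

/* Fill dwords [offset, size) of buf with NOP packets.
 * Returns the number of packets written. */
static unsigned pad_with_nops(uint32_t *buf, size_t offset, size_t size)
{
   unsigned packets = 0;

   assert(offset <= size);
   while (offset != size) {
      size_t packet_dwords = size - offset;
      if (packet_dwords > SKETCH_MAX_NOP_DWORDS)
         packet_dwords = SKETCH_MAX_NOP_DWORDS;

      /* The count field is the packet length in dwords minus two. */
      buf[offset] = sketch_pkt3(SKETCH_PKT3_NOP, (uint32_t)(packet_dwords - 2));
      offset += packet_dwords;
      packets++;
   }
   return packets;
}
```

The C version is a dozen lines of logic; the nir_builder version needs an explicit call for every arithmetic operation, branch and store.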

Interesting things to explore here are to use glslang, or even to try writing our shaders in OpenCL C and then compiling it to SPIR-V at build time. The challenge there is that radv is built on a diverse set of platforms (including Windows, Android and desktop Linux) which can make significant dependencies a struggle.

Preprocessing

Ideally your GPU work is very suitable for pipelining to avoid synchronization cost on the GPU. If we generate the command buffer and then execute it we need to have a full GPU sync point in between, which can get very expensive as it waits until the GPU is idle. To avoid this VK_NV_device_generated_commands has added the separate vkCmdPreprocessGeneratedCommandsNV command, so that the application can batch up a bunch of work before incurring the cost of a sync point.

However, in radv we have to do the command buffer generation in vkCmdExecuteGeneratedCommandsNV as our command buffer generation depends on some of the other state that is bound, but might not be bound yet when the application calls vkCmdPreprocessGeneratedCommandsNV.

Which brings up a slight spec problem: The extension specification doesn’t specify whether the application is allowed to execute vkCmdExecuteGeneratedCommandsNV on multiple queues concurrently with the same preprocess buffer. If all the writing of that happens in vkCmdPreprocessGeneratedCommandsNV that would result in correct behavior, but if the writing happens in vkCmdExecuteGeneratedCommandsNV this results in a race condition.

The 32-bit pointers

Remember that radv only passes the bottom 32 bits of some pointers around. As a result the application needs to allocate the preprocess buffer in that 4 GiB range. This in itself is easy: just add a new memory type and require it for this usage. However, the devil is in the details.

For example, what should we do for memory budget queries? That is per memory heap, not memory type. However, a new memory heap does not make sense, as the memory is also still subject to physical availability of VRAM, not only address space.

Furthermore, this 4-GiB region is more constrained than other memory, so it would be a shame if applications start allocating random stuff in it. If we look at the existing usage for a pretty heavy game (HZD) we get about

40 MiB of command buffers + upload buffers
200 MiB of descriptor pools
400 MiB of shaders

So typically we have a lot of room available. Ideally the ordering of memory types would get an application to prefer another memory type when we do not need this special region. However, memory object caching poses a big risk here: Would you choose a memory object in the cache that you can reuse/suballocate (potentially in that limited region), or allocate new for a “better” memory type?

Luckily we have not seen that risk play out, but the only real tested user at this point has been vkd3d-proton.

Secondary command buffers

When executing the generated command buffer radv does that the same way as calling a secondary command buffer. This has a significant limitation: A secondary command buffer cannot call a secondary command buffer on the hardware. As a result the current implementation has a problem if vkCmdExecuteGeneratedCommandsNV gets called on a secondary command buffer.

It is possible to work around this. An example would be to split the secondary command buffer into 3 parts: pre, generated, post. However, that needs a bunch of refactoring to allow multiple internal command buffers per API command buffer.

Where to go next

Don’t expect this upstream very quickly. The main reason for exploring this in radv is ExecuteIndirect support for Halo Infinite, and after some recent updates we’re back into GPU hang limbo with radv/vkd3d-proton there. So while we’re solving that I’m holding off on upstreaming in case the hangs are caused by the implementation of this extension.

Furthermore, this is only a partial implementation of the extension anyways, with a fair number of limitations that we’d ideally eliminate before fully exposing this extension.

Raytracing Starting to Come Together

2021-09-17T00:00:00+02:00

I am back with another status update on raytracing in RADV. And the good news is that things are finally starting to come together. After ~9 months of on and off work we now have games working with raytracing. Working on first try after getting all the required functionality was Control:

After poking for a long time at CTS and demos it is really nice to see the fruits of one’s work.

The piece that I added recently was copy/compaction and serialization of acceleration structures which was a bunch of shader writing, handling another type of queries and dealing with indirect dispatches. (Since of course the API doesn’t give the input size on the CPU. No idea how this API should be secured …)

What games?

I did try 5 games:

Quake 2 RTX (Vulkan): works. This was working already on my previous update.
Control (D3D): works. Pretty much just works. Runs at maybe 30-50% of RT performance on Windows.
Metro Exodus (Vulkan): works. Needs one workaround and is very finicky in WSI but otherwise works fine. Runs at 20-25% of RT performance on Windows.
Ghostrunner (D3D): Does not work. This really needs per shadergroup compilation instead of just mashing all the shaders together as I get shaders now with 1 million NIR instructions, which is a pain to debug.
Doom Eternal (Vulkan): Does not work. The raytracing option in the menu stays grayed out and at this point I’m at a loss what is required to make the game allow enabling RT.

If anybody could tell me how to get Doom Eternal to allow RT I’d appreciate it.

What is next?

Of course the support is far from done. Some things to still make progress on:

Upstreaming what I have. Samuel has been busy reviewing my MRs and I think there is a good chance that what I have now will make it into 21.3.
Improve the pipeline compilation model to hopefully make Ghostrunner work.
Improved BVH building. The current BVH is really naive, which is likely one of the big performance factors.
Improve traversal.
Move on to stuff needed for DXR 1.1 like VK_KHR_ray_query.

P.S. If you haven’t seen it yet, Jason Ekstrand from Intel recently gave a talk about how Intel implements raytracing. Nice showcase of how you can provide some more involved hardware implementation than RDNA2 does.

World’s Slowest Raytracer

2021-07-27T00:00:00+02:00

I have not talked about raytracing in RADV for a while, but after ~~some procrastination~~ being focused on some other things I recently got back to it and achieved my next milestone.

In particular I have been hacking away at CTS and got to a point where CTS on dEQP-VK.ray_tracing.* runs to completion without crashes or hangs. Furthermore, I got the passrate to 90% of non-skipped tests. So we’re finally getting somewhere close to usable.

As a further demonstration that it is usable, my fixes for CTS also fixed the corruption issues in Quake 2 RTX (Github version), delivering this image:

Of course not everything is perfect yet. Besides the not 100% CTS passrate it has like half the Windows performance at 4k right now and we still have some feature gaps to make it really usable for most games.

Why is it slow?

TL;DR Because I haven’t optimized it yet and took every shortcut imaginable.

AMD raytracing primer

Raytracing with Vulkan works with two steps:

You build a giant acceleration structure that contains all your geometry. This usually ends up being some kind of tree, typically a Bounding Volume Hierarchy (BVH).
Then you trace rays using some traversal shader through the acceleration structure you just built.

With RDNA2 AMD started accelerating this by adding an instruction that allowed doing intersection tests between a ray and a single BVH node, where the BVH node can either be

A triangle
A box node specifying 4 AABB boxes

Of course this isn’t quite enough to deal with all geometry types in Vulkan so we also add two more:

an AABB box
an instance of another BVH combined with a transformation matrix

Building the BVH

With a search tree like a BVH it is very possible to make trees that are very useless. As an example consider a binary search tree that is very unbalanced. We can have similarly bad things with a BVH including making it unbalanced or having overlapping bounding volumes.

And my implementation is the simplest thing possible: the input geometry becomes the leaves in exactly the same order and then internal nodes are created just as you’d draw them. That is probably decently fast in building the BVH but surely results in a terrible BVH to actually use.
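To get a feel for the shape of such a tree, the sketch below counts the levels and internal nodes of an in-order build. It assumes (this is my reading, not something the text spells out) that each level simply groups up to 4 consecutive nodes of the level below under one box node; the function name is hypothetical and no bounding boxes are computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the internal levels of a naive 4-wide BVH built bottom-up by
 * grouping consecutive nodes. Optionally returns the number of box nodes. */
static unsigned naive_bvh_levels(uint64_t leaf_count, uint64_t *internal_nodes)
{
   unsigned levels = 0;
   uint64_t nodes = leaf_count;
   uint64_t total = 0;

   assert(leaf_count > 0);
   do {
      nodes = (nodes + 3) / 4; /* one parent per group of up to 4 nodes */
      total += nodes;
      levels++;
   } while (nodes > 1);

   if (internal_nodes != NULL)
      *internal_nodes = total;
   return levels;
}
```

For 16 leaves this gives 2 levels and 5 box nodes, and for 2^24 - 1 leaves it gives 12 levels. The tree is balanced in node count, but since nothing looks at where the geometry actually is, the bounding boxes can overlap heavily.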

BVH traversal

After we built a BVH we can start tracing some rays. In rough pseudocode the current implementation is

stack = empty
insert root node into stack
while stack is not empty:

   node = pop a node from the stack

   if we left the bottom level BVH:
      reset ray origin/direction to initial origin/direction

   result = amd_intersect(ray, node)
   switch node type:
      triangle:
         if result is a hit:
            load some node data
            process hit
      box node:
         for each box hit:
            push child node on stack
      custom node 1 (instance):
         load node data
         push the root node of the bottom BVH on the stack
         apply transformation matrix to ray origin/direction
      custom node 2 (AABB geometry):
         load node data
         process hit

We already knew there were inherently going to be some difficulties:

We have a poor BVH so we’re going to do way more iterations than needed.
Calling shaders as a result of hits is going to result in some divergence.

Furthermore this also clearly shows some difficulties with how we approached the intersection instruction. Some advantages of the intersection instruction are that it avoids divergence in computing collisions if we have different node types in a subgroup and that it is cheaper when there are only a few lanes active. (A single CU can process one ray/node intersection per cycle, modulo memory latency, while it can process an ALU instruction on 64 lanes per cycle).

However even if it avoids the divergence in the collision computation we still introduce a ton of divergence in the processing of the results of the intersection. So we are still doing pretty badly here.

A fast GPU traversal stack needs some work too

Another thing to be noted is our traversal stack size. According to the Vulkan specification a bottom level acceleration structure should support 2^24 - 1 triangles and a top level acceleration structure should support 2^24 - 1 bottom level structures. Combined with a tree with 4 children in each internal node we can end up with a tree depth of about 24 levels.

In each internal node iteration of our loop we pop one element and push up to 4 elements, so at the deepest level of traversal we could end up with a 72 entry stack. Assuming these are 32-bit node identifiers, that ends up with 288 bytes of stack per lane, or ~18 KiB per 64 lane workgroup (the minimum which could possibly keep a CU busy with an ALU only workload). Given that we have 64 KiB of LDS (yes I am using LDS since there is no divergent dynamic register addressing) per CU that leaves only 3 workgroups per CU, leaving very few options for parallelism between different hardware execution units (e.g. the ALU and the texture unit that executes the ray intersections) or latency hiding of memory operations.
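The numbers above follow from a few multiplications, spelled out here as a small C sketch. The function names are made up for illustration; the inputs (24 levels, 4 children, 64 lanes, 64 KiB of LDS, 32-bit entries) are the ones quoted in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Each level pops one entry and pushes up to `children`, so the stack
 * grows by at most (children - 1) entries per level. */
static unsigned max_stack_entries(unsigned tree_depth, unsigned children)
{
   assert(children >= 1);
   return tree_depth * (children - 1);
}

/* LDS needed for one workgroup when every lane has its own stack of
 * 32-bit node identifiers. */
static uint32_t stack_bytes_per_workgroup(unsigned entries, unsigned lanes)
{
   return entries * (uint32_t)sizeof(uint32_t) * lanes;
}

/* How many such workgroups fit in the LDS of one CU. */
static unsigned workgroups_per_cu(uint32_t lds_bytes, uint32_t workgroup_bytes)
{
   assert(workgroup_bytes > 0);
   return lds_bytes / workgroup_bytes;
}
```

Plugging in the values gives 24 * 3 = 72 entries, 72 * 4 * 64 = 18432 bytes (18 KiB) per workgroup, and 65536 / 18432 rounded down to 3 workgroups per CU.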

So ideally we get this stack size down significantly.

Where do we go next?

First step is to get CTS passing and getting an initial merge request into upstream Mesa. As a follow on to that I’d like to get a minimal prototype going for some DXR 1.0 games with vkd3d-proton just to make sure we have the right feature coverage.

After that we’ll have to do all the traversal optimizations. I’ll probably implement a bunch of instrumentation so I actually have a clue on what to optimize. This is where having some runnable games really helps get the right idea about performance bottlenecks.

Finally, with some luck better shaders to build a BVH will materialize as well.

Making Reading from VRAM less Catastrophic

2021-06-14T00:00:00+02:00

In an earlier article I showed how reading from VRAM with the CPU can be very slow. It however turns out that there are ways to make it less slow.

The key to this are instructions with non-temporal hints, in particular VMOVNTDQA. The Intel Instruction Manual says the following about this instruction:

“MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory type, the nontemporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available. “ (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2)

This sounds perfect for our VRAM and WC System Memory buffers as we typically only read 16 bytes per instruction and this allows us to read entire cachelines at a time.

It turns out that Mesa already implemented a streaming memcpy using these instructions so all we had to do was throw that into our benchmark and write a corresponding memcpy that does non-temporal stores to benchmark writing to these memory regions.
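To make the idea concrete, here is a minimal sketch of a streaming-load copy using the SSE4.1 intrinsic `_mm_stream_load_si128`, which maps to MOVNTDQA. It is not Mesa's implementation and not the benchmark code: the function name is made up, it processes one 16-byte vector per iteration rather than unrolling to a cacheline, and on non-x86 targets it simply falls back to memcpy. The non-temporal hint only has an effect when the source is write-combined memory; on ordinary cached memory it behaves like a normal aligned load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_STREAMING_LOAD 1
#endif

#ifdef SKETCH_HAVE_STREAMING_LOAD
__attribute__((target("sse4.1")))
#endif
static void streaming_load_memcpy(void *dst, const void *src, size_t len)
{
   uint8_t *d = dst;
   const uint8_t *s = src;

   assert(len == 0 || (dst != NULL && src != NULL));

#ifdef SKETCH_HAVE_STREAMING_LOAD
   /* MOVNTDQA needs a 16-byte aligned source: copy the unaligned head first. */
   size_t head = (16 - ((uintptr_t)s & 15)) & 15;
   if (head > len)
      head = len;
   memcpy(d, s, head);
   d += head;
   s += head;
   len -= head;

   /* Main loop: non-temporal 16-byte loads, ordinary unaligned stores. */
   while (len >= 16) {
      __m128i v = _mm_stream_load_si128((__m128i *)(uintptr_t)s);
      _mm_storeu_si128((__m128i *)d, v);
      d += 16;
      s += 16;
      len -= 16;
   }
#endif

   /* Tail (or everything, when streaming loads are unavailable). */
   memcpy(d, s, len);
}
```

A readback would call it with the mapped VRAM or USWC pointer as `src` and a normal heap buffer as `dst`.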

As a reminder, we look into three allocation types that are exposed by the amdgpu Linux kernel driver:

VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.
Cacheable system memory. This is system memory that has caching enabled on the CPU and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up till the top level caches. The GPU caches do not participate in the coherence).
USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

Furthermore this still uses a RX 6800 XT + a 2990WX with 4 channel 3200 MT/s RAM.

method (MiB/s)	VRAM	Cacheable System Memory	USWC System Memory
read via memcpy	15	11488	137
write via memcpy	10028	18249	11480
read via streaming memcpy	756	6719	4409
write via streaming memcpy	10550	14737	11652

Using this memcpy implementation we get significantly better performance in uncached memory situations: going by the table above, roughly 50x for VRAM (756 vs 15 MiB/s) and roughly 32x for USWC system memory (4409 vs 137 MiB/s). If this is a significant bottleneck in your workload this can be a gamechanger. Or if you were using SDMA to avoid this hit, you might be able to do things at significantly lower latency. That said it is not at a level where it does not matter. For big copies using DMA can still be a significant win.

Note that I initially gave an explanation on why the non-temporal loads should be faster, but the increases in performance are significantly above what something that just fiddles with loading entire cachelines would achieve. I have not dug into the why of the performance increase.

DMA performance

I have been claiming DMA is faster for CPU readbacks of VRAM in both this article and the previous article on the topic. One might ask how fast DMA is then. To demonstrate this I benchmarked VRAM<->Cacheable System Memory copies using the SDMA hardware block on Radeon GPUs.

Note that there is a significant overhead per copy here due to submitting work to the GPU, so I will show results vs copy size. The rate is measured while doing a wait after each individual copy and taking the wall clock time as these usecases tend to be latency sensitive and hence batching is not too interesting.

copy size	copy from VRAM (MiB/s)	copy to VRAM (MiB/s)
4 KiB	62	63
16 KiB	245	240
64 KiB	953	1015
256 KiB	3106	3082
1 MiB	6715	7281
4 MiB	9737	11636
16 MiB	12129	12158
64 MiB	13041	12975
256 MiB	13429	13387

This shows that for reads DMA is faster than a normal memcpy at 4 KiB and faster than a streaming memcpy at 64 KiB. Of course one still needs to do their CPU access at that point, but at both these thresholds even with an additional CPU memcpy the total process should still be faster with DMA.

First Rays

2021-04-16T00:00:00+02:00

Given that the new RDNA2 GPUs provide some support for hardware accelerated raytracing and there is even a new shiny Vulkan extension for it, it may not be a surprise that we’re working on implementing raytracing support in RADV.

Already some time ago I wrote documentation for the hardware raytracing support. As these GPUs contain quite minimal hardware to implement things there is a large software and shader side to implementing this.

And that is what I’ve been up to for the last couple of weeks. And I now have achieved my first personal milestones for the implementation:

A fully recursive Fibonacci shader
And a raytraced cube:

This involves writing initial versions for a lot of the software infrastructure needed, so really shows that the basis is getting there.

At the same time we’re quite a ways off from really testing using CTS or running our first real demos. In particular we are missing things like

GPU-side BVH building
any-hit and intersection shaders
Supporting BVH instances, geometry transforms etc.
pipeline libraries

and much more, in addition to some of these initial implementations likely not really being performant.

A First Foray into Rendering Less

2021-04-09T00:00:00+02:00

In RADV we just added an option to speed up rendering by rendering fewer pixels.

These kinds of techniques have become more common over the past decade with techniques such as checkerboarding, TAA based upscaling and recently DLSS. Fundamentally all they do is trade off rendering quality for rendering cost and many of them include some amount of postprocessing to try to change the curve of that tradeoff. Most notably DLSS has been wildly successful at that to the point many people claim it is barely a quality regression.

Of course increasing GPU performance by up to 50% or so with barely any quality regression seems like a must-have and I think it would be pretty cool if we could have the same improvements on Linux. I think it has the potential to be a game changer, making games playable on APUs or playing with really high resolution or framerates on desktops.

And today we took our first baby steps in RADV by allowing users to force Variable Rate Shading (VRS) with an experimental environment variable:

RADV_FORCE_VRS=2x2

VRS is a hardware capability that allows us to reduce the number of fragment shader invocations per pixel rendered. So you could say configure the hardware to use one fragment shader invocation per 2x2 pixels. The hardware still renders the edges of geometry exactly, but the inner area of each triangle is rendered with a reduced number of fragment shader invocations.

There are a couple of ways this capability can be configured:

On a per-draw level
On a per-primitive level (e.g. per triangle)
Using an image to configure on a per-region level

This is a new feature for AMD on RDNA2 hardware.

With RADV_FORCE_VRS we use this to improve performance at the cost of visual quality. Since we did not implement any postprocessing the quality loss can be pretty bad, so we do not reduce the shading rate when we detect one of the following:

Something is rendered in 2D, as that is likely some UI where you’d really want some crispness
When the shader can discard pixels, as this implicitly introduces geometry edges that the hardware doesn’t see but that significantly impact the visual quality.

As a result there are some games where this has barely any effect but you also don’t notice the quality regression and there are games where it really improves performance by 30%+ but you really notice the quality regression.

VRS is by far the easiest thing to make work in almost all games. Most alternatives like checkerboarding, TAA and DLSS need modified render target size, significant shader fixups, or even a proprietary integration with games. Making changes that deeply is getting more complicated the more advanced the game is.

If we want to reduce render resolution (which would be a key thing in e.g. checkerboarding or DLSS) it is very hard to confidently tie all resolution dependent things together. For example a big cost for some modern games is raytracing, but the information flow to the main render targets can be very hard to track automatically and hence such a thing would require a lot of investigation or a bunch of per game customizations.

And hence we decided to introduce this first baby step. Enjoy!

The Catastrophe of Reading from VRAM

2021-04-04T00:00:00+02:00

In this article I show how reading from VRAM can be a catastrophe for game performance and why.

To illustrate I will go back to fall 2015. AMDGPU was just released, it didn’t even have re-clocking yet and I was just a young student trying to play Skyrim on my new AMD R9 285.

Except it ran slowly. 10-15 FPS slowly. Now one might think that is no surprise as due to lack of re-clocking the GPU ran with a shader clock of 300 MHz. However the real surprise was that the game was not at all GPU bound.

As usual with games of that era there was a single thread doing a lot of the work and that thread was very busy doing something inside the game binary. After a bunch of digging with profilers and gdb, it turned out that the majority of time was spent in a single function that accessed less than 1 MiB from a GPU buffer each frame.

At the time DXVK was not a thing yet and I ran the game with wined3d on top of OpenGL. In OpenGL an application does not specify the location of GPU buffers directly, but specifies some properties about how it is going to be used and the driver decides. Poorly in this case.

There was a clear tweak to the driver heuristics that choose the memory location and the frame rate of the game more than doubled and was now properly GPU bound.

Some Data

After the anecdote above you might be wondering how slow reading from VRAM can really be. 1 MiB is not a lot of data, so even if it is slow it cannot be that bad, right?

To show you how bad it can be I ran some benchmarks on my system (Threadripper 2990WX, 4 channel DDR4-3200 and an RX 6800 XT). I checked read/write performance using a 16 MiB buffer (512 MiB for system memory to avoid the test being contained in L3 cache).

We look into three allocation types that are exposed by the amdgpu Linux kernel driver:

VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.
Cacheable system memory. This is system memory that has caching enabled on the CPU and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up till the top level caches. The GPU caches do not participate in the coherence).
USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

For context, in Vulkan this would roughly correspond to the following memory types:

Hardware	Vulkan
VRAM	VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
Cacheable system memory	VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT
USWC system memory	VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
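As a rough illustration of how an application could use this mapping, the sketch below picks the first memory type that is both host visible and host cached, i.e. the kind that is fast to read from the CPU. The enum values are stand-ins that mirror the bit values of the Vulkan property flags so the snippet compiles without the Vulkan headers, and `find_cpu_readable_type` is a hypothetical helper name, not an API of Vulkan or radv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins mirroring the bit values of the Vulkan memory property flags,
 * so that this sketch builds without the Vulkan headers. */
enum {
    MEM_DEVICE_LOCAL  = 0x1, /* VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT */
    MEM_HOST_VISIBLE  = 0x2, /* VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT */
    MEM_HOST_COHERENT = 0x4, /* VK_MEMORY_PROPERTY_HOST_COHERENT_BIT */
    MEM_HOST_CACHED   = 0x8, /* VK_MEMORY_PROPERTY_HOST_CACHED_BIT */
};

/* Hypothetical helper: returns the index of the first memory type that the
 * CPU can map and read through its caches, or -1 if there is none. */
static int find_cpu_readable_type(const uint32_t *type_flags, size_t count)
{
    const uint32_t wanted = MEM_HOST_VISIBLE | MEM_HOST_CACHED;

    assert(type_flags != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if ((type_flags[i] & wanted) == wanted)
            return (int)i;
    }
    return -1;
}
```

In a real application the flags would come from the `propertyFlags` of each entry in `VkPhysicalDeviceMemoryProperties::memoryTypes`, and the choice would also have to respect the `memoryTypeBits` of the resource.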

The benchmark resulted in the following throughput numbers:

method (MiB/s)	VRAM	Cacheable System Memory	USWC System Memory
read via memcpy	15	11488	137
write via memcpy	10028	18249	11480

I furthermore tested handwritten for-loops accessing 8, 16, 32 and 64-bit elements at a time and those got similar performance.
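For readers who want to try something similar, here is a minimal sketch of a memcpy read benchmark. It is not the code used for the numbers above; the function name and the choice of `clock_gettime` are my assumptions for illustration. The source pointer can be any CPU-visible memory, so passing a mapping of a VRAM or USWC allocation would measure those cases.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `size` bytes out of `src` `iterations` times
 * and returns the read throughput in MiB/s, or -1.0 if allocation fails. */
static double memcpy_read_mib_s(const void *src, size_t size, int iterations)
{
    assert(src != NULL && size > 0 && iterations > 0);

    unsigned char *dst = malloc(size);
    if (dst == NULL)
        return -1.0;

    /* Reading back one byte per iteration discourages the compiler from
     * dropping the copies as dead code. */
    volatile unsigned char sink = 0;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        memcpy(dst, src, size);
        sink ^= dst[(size_t)i % size];
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    free(dst);

    double seconds = (double)(end.tv_sec - start.tv_sec) +
                     (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    double mib = (double)size * (double)iterations / (1024.0 * 1024.0);
    return mib / seconds;
}
```

For system memory the buffer should be larger than the last level cache, as noted above, otherwise the result mostly reflects cache bandwidth.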

This clearly shows that reads from VRAM using memcpy are ~766x slower than memcpy reads from cacheable system memory, and even reads from non-cacheable system memory are ~84x slower than reads from cacheable system memory. Reading even small amounts from these can cause severe performance degradations.
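To put the throughput numbers in terms of frame time, the small sketch below converts a read size and a throughput into milliseconds. The helper name is hypothetical, and the numbers come from the table above, which was measured on different hardware than the anecdote.

```c
#include <assert.h>

/* Hypothetical helper: milliseconds the CPU spends reading `size_mib` MiB
 * at a throughput of `throughput_mib_s` MiB/s. */
static double read_time_ms(double size_mib, double throughput_mib_s)
{
    assert(throughput_mib_s > 0.0);
    return size_mib / throughput_mib_s * 1000.0;
}
```

Reading 1 MiB per frame at 15 MiB/s costs `read_time_ms(1.0, 15.0)`, roughly 66.7 ms, which on its own limits a single-threaded game to about 15 FPS. The same read from cacheable system memory at 11488 MiB/s takes under 0.1 ms.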

Writes show a difference as well, but it is not nearly as significant. So if an application does not select the best memory location for data that the CPU only writes, it is still likely to result in a reasonable experience.

APUs Are Affected Too

Even though APUs do not have VRAM they still are affected by the same issue. Typically the GPU gets a certain amount of memory pre-allocated at boot time as a carveout. There are some differences in how this is accessed from the GPU so from the perspective of the GPU this memory can be faster.

At the same time the Linux kernel only gives uncached access to that region from the CPU, so one could expect similar performance issues to crop up.

I did the same test as above on a laptop with a Ryzen 5 2500U (Raven Ridge) APU, and got results that are not dissimilar from my workstation.

method (MiB/s)	Carveout	Snooped System Memory	USWC System Memory
read via memcpy	108	10426	108
write via memcpy	11797	20743	11821

The carveout performance is virtually identical to the uncached system memory now, which is still ~97x slower than cacheable system memory. So even though it is all system memory on an APU, care still has to be taken in how the memory is allocated.

What To Do Instead

Since the performance cliff is so large it is recommended to avoid this issue if at all possible. The following three methods are good ways to avoid the issue:

If the data is only written from the CPU, it is advisable to use a shadow buffer in cacheable system memory (can even be outside of the graphics API, e.g. malloc) and read from that instead.
If the data is written by the GPU but not frequently, one could consider putting the buffer in snooped system memory. This makes the GPU traffic go over the PCIe bus though, so it has a trade-off.
Let the GPU copy the data to a buffer in snooped system memory. This is basically an extension of the previous item by making sure that the GPU accesses the data exactly once in system memory. The GPU roundtrip can take a non-trivial wall-time though (up to ~0.5 ms measured on some low end APUs), some of which is size-independent, such as command submission. Additionally this may need to wait till the hardware unit used for the copy is available, which may depend on other GPU work. The SDMA unit (Vulkan transfer queue) is a good option to avoid that.
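The first option, a shadow buffer, could look roughly like the sketch below. The struct and function names are hypothetical, and `mapped` is only a stand-in for a CPU mapping of a buffer in VRAM: every CPU write goes to both copies, while CPU reads are only ever served from the cacheable shadow copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type for a CPU-written buffer with a shadow copy. */
struct shadowed_buffer {
    unsigned char *mapped; /* mapping of the GPU buffer, never read by the CPU */
    unsigned char *shadow; /* cacheable system memory, e.g. from malloc */
    size_t size;
};

/* Writes go to the shadow copy and to the (write-combined) GPU mapping. */
static void shadowed_write(struct shadowed_buffer *buf, size_t offset,
                           const void *data, size_t len)
{
    assert(offset <= buf->size && len <= buf->size - offset);
    memcpy(buf->shadow + offset, data, len);
    memcpy(buf->mapped + offset, data, len);
}

/* Reads only touch the shadow copy, avoiding uncached reads from VRAM. */
static void shadowed_read(const struct shadowed_buffer *buf, size_t offset,
                          void *out, size_t len)
{
    assert(offset <= buf->size && len <= buf->size - offset);
    memcpy(out, buf->shadow + offset, len);
}
```

This only works when the CPU is the sole writer. As soon as the GPU writes to the buffer, the shadow copy goes stale and one of the other two options is needed.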

Other Limitations

Another problem with CPU access to VRAM is the BAR size. Typically only the first 256 MiB of VRAM is configured to be accessible from the CPU and for anything else one needs to use DMA.

If the working set of what is allocated in VRAM and accessed from the CPU is large enough the kernel driver may end up moving buffers frequently in the page fault handler. System memory would be an obvious target, but due to the GPU performance trade-off that is not always the decision that gets made.

Luckily, due to the recent push from AMD for Smart Access Memory, large BARs that encompass the entire VRAM are now much more common on consumer platforms.

A New Blog, Now What?

2021-04-03T00:00:00+02:00

This is the first post of this blog and with it being past midnight I couldn’t be bothered making one about a technical topic. So instead here is an explanation of my plans with the blog.

I got inspired by the prolific blogging of Mike Blumenkrantz and some discussion on the VKx discord that some actually written updates can be very useful, and that I don’t need to make a paper out of each one.

At the same time I have been involved in some longer running things on the driver side which I think could really use some updates as progress is made. Consider for example raytracing, DRM format modifiers, RGP support and more.

I have no plans at all to be as prolific as Mike by a long shot, but I think the style of articles is probably a good template of what to expect from this blog.