The Catastrophe of Reading from VRAM
In this article I show how reading from VRAM can be a catastrophe for game performance and why.
To illustrate, I will go back to fall 2015. AMDGPU had just been released, it didn't even have re-clocking yet, and I was just a young student trying to play Skyrim on my new AMD R9 285.
Except it ran slowly. 10-15 FPS slowly. Now one might think that is no surprise, since due to the lack of re-clocking the GPU ran at a shader clock of 300 MHz. However, the real surprise was that the game was not at all GPU bound.
As usual with games of that era there was a single thread doing a lot of the work and that thread was very busy doing something inside the game binary. After a bunch of digging with profilers and gdb, it turned out that the majority of time was spent in a single function that accessed less than 1 MiB from a GPU buffer each frame.
At the time DXVK was not a thing yet and I ran the game with wined3d on top of OpenGL. In OpenGL an application does not specify the location of GPU buffers directly; it only specifies some properties about how a buffer is going to be used, and the driver decides where to put it. Poorly, in this case.
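As a rough illustration (hypothetical code, not taken from wined3d or the game), an OpenGL buffer allocation only carries a usage hint, and nothing tells the driver whether the CPU will later read the buffer back:

```c
/* Minimal sketch of OpenGL buffer usage hints; names and the GL_DYNAMIC_DRAW
 * choice are illustrative only. */
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

GLuint make_buffer(GLsizeiptr size, const void *data)
{
    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    /* Only a usage hint is given. The driver's heuristics decide the placement,
     * and may well put this buffer in uncached VRAM. */
    glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW);
    return buf;
}
```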
A small tweak to the driver heuristics that choose the memory location more than doubled the frame rate of the game, which was now properly GPU bound.
Some Data
After the anecdote above you might be wondering how slow reading from VRAM can really be. 1 MiB is not a lot of data, so even if it is slow it cannot be that bad, right?
To show how bad it can be, I ran some benchmarks on my system (Threadripper 2990WX, 4-channel DDR4-3200 and an RX 6800 XT). I checked read/write performance using a 16 MiB buffer (512 MiB for system memory, to avoid the test being contained in the L3 cache).
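To give an idea of the setup, here is a minimal sketch of that kind of measurement. It assumes the GPU buffer is already mapped into the CPU's address space (e.g. via vkMapMemory); the actual benchmark I ran differs in details:

```c
/* Minimal sketch of the read/write measurement. `mapped` points to a mapped GPU
 * buffer of `size` bytes, `sysmem` is a malloc'd staging buffer of the same size.
 * Not the exact benchmark that produced the numbers below. */
#include <stdio.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

void measure(void *mapped, void *sysmem, size_t size, int iterations)
{
    double t0 = now_sec();
    for (int i = 0; i < iterations; ++i)
        memcpy(sysmem, mapped, size);          /* read from the GPU buffer */
    double read_s = now_sec() - t0;

    t0 = now_sec();
    for (int i = 0; i < iterations; ++i)
        memcpy(mapped, sysmem, size);          /* write to the GPU buffer */
    double write_s = now_sec() - t0;

    double mib = (double)size * iterations / (1024.0 * 1024.0);
    printf("read:  %8.0f MiB/s\n", mib / read_s);
    printf("write: %8.0f MiB/s\n", mib / write_s);
}
```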
We look into three allocation types that are exposed by the amdgpu Linux kernel driver:
- VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.
- Cacheable system memory. This is system memory that has caching enabled on the CPU, with cache snooping to ensure the memory stays coherent between the CPU and GPU (up to the top-level caches; the GPU caches do not participate in the coherency).
- USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can give slight performance benefits compared to cacheable system memory due to the lack of cache snooping.
For context, in Vulkan this would roughly correspond to the following memory types:
| Hardware | Vulkan |
|---|---|
| VRAM | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT \| VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT |
| Cacheable system memory | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT \| VK_MEMORY_PROPERTY_HOST_CACHED_BIT |
| USWC system memory | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT |
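For completeness, here is a hedged sketch of how an application would pick a memory type with the properties from the table; this is standard Vulkan memory-type selection, the function name is my own:

```c
#include <vulkan/vulkan.h>

/* Find a memory type index whose propertyFlags contain `required` and that is
 * allowed by `type_bits` (from VkMemoryRequirements::memoryTypeBits).
 * Returns UINT32_MAX if no such type exists. */
uint32_t find_memory_type(VkPhysicalDevice phys, uint32_t type_bits,
                          VkMemoryPropertyFlags required)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if (!(type_bits & (1u << i)))
            continue;
        if ((props.memoryTypes[i].propertyFlags & required) == required)
            return i;
    }
    return UINT32_MAX;
}

/* For a buffer the CPU will read back, prefer cacheable system memory:
 *   find_memory_type(phys, reqs.memoryTypeBits,
 *                    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
 *                    VK_MEMORY_PROPERTY_HOST_CACHED_BIT);
 */
```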
The benchmark resulted in the following throughput numbers:
| method (MiB/s) | VRAM | Cacheable System Memory | USWC System Memory |
|---|---|---|---|
| read via memcpy | 15 | 11488 | 137 |
| write via memcpy | 10028 | 18249 | 11480 |
I furthermore tested handwritten for-loops accessing 8-, 16-, 32- and 64-bit elements at a time, and those achieved similar performance.
This clearly shows that memcpy reads from VRAM are ~766x slower than memcpy reads from cacheable system memory, and that even reads from USWC system memory are ~84x slower than reads from cacheable system memory. Reading even small amounts of data this way can cause severe performance degradation.
Writes show a difference as well, but it is not nearly as significant. So an application that does not select the best memory location for data the CPU only writes is still likely to deliver a reasonable experience.
APUs Are Affected Too
Even though APUs do not have VRAM, they are still affected by the same issue. Typically the GPU gets a certain amount of memory pre-allocated at boot time as a carveout. There are some differences in how this carveout is accessed from the GPU, so from the GPU's perspective this memory can be faster.
At the same time the Linux kernel only gives uncached access to that region from the CPU, so one could expect similar performance issues to crop up.
I did the same test as above on a laptop with a Ryzen 5 2500U (Raven Ridge) APU, and got results that are not dissimilar from my workstation.
| method (MiB/s) | Carveout | Snooped System Memory | USWC System Memory |
|---|---|---|---|
| read via memcpy | 108 | 10426 | 108 |
| write via memcpy | 11797 | 20743 | 11821 |
The carveout performance is virtually identical to the uncached (USWC) system memory here, which is still ~97x slower than cacheable system memory. So even though it is all system memory on an APU, care still has to be taken in how the memory is allocated.
What To Do Instead
Since the performance cliff is so large, it is best to avoid the problem entirely if at all possible. The following three methods are good ways to do that:
- If the data is only written from the CPU, it is advisable to use a shadow buffer in cacheable system memory (this can even live outside of the graphics API, e.g. malloc) and read from that instead (see the sketch after this list).
- If the data is written by the GPU, but not frequently, one could consider putting the buffer in snooped system memory. This makes the GPU traffic go over the PCIe bus though, so it has a trade-off.
- Let the GPU copy the data to a buffer in snooped system memory. This is basically an extension of the previous item, making sure that the GPU accesses the data in system memory exactly once. The GPU round trip can take non-trivial wall time though (up to ~0.5 ms measured on some low-end APUs), some of which is size-independent, such as command submission. Additionally, this may need to wait until the hardware unit used for the copy is available, which may depend on other GPU work. The SDMA unit (Vulkan transfer queue) is a good option to avoid that.
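To make the first option concrete, here is a minimal sketch of such a shadow buffer in plain malloc'd memory; the structure and names are hypothetical and not tied to any particular engine or API:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shadow-buffer wrapper: all CPU reads go to the cacheable malloc'd
 * copy, and every CPU write is mirrored into the (possibly VRAM-backed) mapping. */
struct shadowed_buffer {
    void  *gpu_mapping;  /* e.g. pointer returned by vkMapMemory, may be USWC VRAM */
    void  *shadow;       /* cacheable system memory copy */
    size_t size;
};

void shadow_init(struct shadowed_buffer *b, void *gpu_mapping, size_t size)
{
    b->gpu_mapping = gpu_mapping;
    b->shadow = malloc(size);
    b->size = size;
}

void shadow_write(struct shadowed_buffer *b, size_t offset,
                  const void *data, size_t n)
{
    memcpy((char *)b->shadow + offset, data, n);       /* fast, cached */
    memcpy((char *)b->gpu_mapping + offset, data, n);  /* write-combined, still fast */
}

/* CPU reads never touch the uncached GPU mapping. */
const void *shadow_read(const struct shadowed_buffer *b, size_t offset)
{
    return (const char *)b->shadow + offset;
}
```

The cost is the extra system-memory copy and the doubled write traffic, which the write numbers above suggest is a much smaller penalty than reading from the uncached mapping.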
Other Limitations
Another problem with CPU access to VRAM is the BAR size. Typically only the first 256 MiB of VRAM is configured to be accessible from the CPU, and anything beyond that needs to be reached via DMA.
If the working set of buffers that are allocated in VRAM and accessed from the CPU is large enough, the kernel driver may end up moving buffers around frequently in the page fault handler. System memory would be an obvious target, but due to the GPU-side performance trade-off that is not always the decision that gets made.
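As a side note, an application can get an idea of how much VRAM is CPU-visible by looking at the heap behind the DEVICE_LOCAL and HOST_VISIBLE memory types. A hedged sketch (the interpretation assumes an amdgpu-style heap layout):

```c
#include <stdio.h>
#include <vulkan/vulkan.h>

/* Print the size of each heap that backs a DEVICE_LOCAL + HOST_VISIBLE memory
 * type. On a small-BAR system this is typically ~256 MiB; with a resizable BAR
 * it covers (almost) all of VRAM. */
void print_host_visible_vram(VkPhysicalDevice phys)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        VkMemoryPropertyFlags f = props.memoryTypes[i].propertyFlags;
        if ((f & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) &&
            (f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT)) {
            uint32_t heap = props.memoryTypes[i].heapIndex;
            printf("CPU-visible VRAM heap %u: %llu MiB\n", heap,
                   (unsigned long long)(props.memoryHeaps[heap].size >> 20));
        }
    }
}
```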
Luckily, due to the recent push from AMD for Smart Access Memory, large BARs that encompass the entire VRAM are now much more common on consumer platforms.