PerfTest

A simple GPU shader memory operation performance test tool. The current implementation is DirectX 11.0 based.

The purpose of this application is not to benchmark different brands of GPUs against each other. Its purpose is to help rendering programmers choose the right types of resources when optimizing their compute shader performance.

This application is designed to measure peak data load performance from L1 caches. I tried to avoid known hardware bottlenecks. If you notice something wrong or suspicious in the shader workload, please inform me immediately and I will fix it. If my shaders are affected by some hardware bottleneck, I am glad to hear about it and will write more test cases to show the best performance. The goal is that developers gain a better understanding of the various GPU hardware on the market and gain insight into optimizing code for it.

Features

Designed to measure the performance of various types of buffer and image loads. This application is not a GPU memory bandwidth measurement tool. All tests operate inside the GPU's L1 caches (working sets no larger than 16 KB). The tested load types are listed below; a rough sketch of the corresponding HLSL resource declarations follows the list.

  • Coalesced loads (100% L1 cache hit)
  • Random loads (100% L1 cache hit)
  • Uniform address loads (same address for all threads)
  • Typed Buffer SRVs: 1/2/4 channels, 8/16/32 bits per channel
  • ByteAddressBuffer SRVs: load, load2, load3, load4 - aligned and unaligned
  • Structured Buffer SRVs: float/float2/float4
  • Constant Buffer float4 array indexed loads
  • Texture2D loads: 1/2/4 channels, 8/16/32 bits per channel
  • Texture2D nearest sampling: 1/2/4 channels, 8/16/32 bits per channel
  • Texture2D bilinear sampling: 1/2/4 channels, 8/16/32 bits per channel
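
For orientation, the categories above map to HLSL declarations roughly like the following. This is a hypothetical, simplified sketch with invented names, not the repository's actual shader code:

  // Hypothetical, simplified declarations; names are invented for illustration.
  Buffer<float4> typedSrv;                // Typed buffer SRV; channels/bit depth come from the view format
  ByteAddressBuffer rawSrv;               // Raw buffer SRV for Load/Load2/Load3/Load4
  StructuredBuffer<float4> structuredSrv; // Structured buffer SRV (float/float2/float4 variants)
  Texture2D<float4> textureSrv;           // Texture2D for Load, nearest and bilinear sampling
  SamplerState linearSampler;             // Sampler used by the sampling tests
  RWBuffer<float> outputUav;              // Output UAV receiving the (masked) result

  cbuffer Constants
  {
      float4 cbufferData[1024];           // Constant buffer float4 array for indexed loads
  };

  [numthreads(256, 1, 1)]
  void main(uint3 tid : SV_DispatchThreadID)
  {
      // Touch each resource once so the declarations are exercised.
      float v = typedSrv[tid.x].x
              + asfloat(rawSrv.Load(tid.x * 4))
              + structuredSrv[tid.x].x
              + textureSrv.SampleLevel(linearSampler, float2(0.0, 0.0), 0.0).x
              + cbufferData[tid.x & 1023].x;
      outputUav[tid.x] = v;
  }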

Explanations

Coalesced loads: GPUs optimize linear address patterns. Coalescing occurs when all threads in a warp/wave (32/64 threads) load from contiguous addresses. In my "linear" test case, memory loads access contiguous addresses in the whole thread group (256 threads). This should coalesce perfectly on all GPUs, independent of warp/wave width.
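
Roughly, the linear pattern can be sketched like this (hypothetical, simplified HLSL with invented names, not the repository's actual shader):

  Buffer<float> sourceBuffer;   // typed SRV; the actual tests cover many formats and channel counts
  RWBuffer<float> outputBuffer;

  [numthreads(256, 1, 1)]
  void linearLoads(uint gi : SV_GroupIndex)
  {
      float sum = 0.0f;
      [loop]
      for (uint i = 0; i < 256; ++i)
      {
          // Iteration i: thread gi reads element gi of a contiguous 256-element
          // block, so neighboring threads always touch neighboring addresses and
          // the loads coalesce regardless of warp/wave width. The address mask
          // keeps the working set small enough to stay in the L1 cache.
          sum += sourceBuffer[(i * 256 + gi) & 0xFFF];
      }
      outputBuffer[gi] = sum;   // simplified; the real tests mask this write (see Notes)
  }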

Random loads: I add a random start offset of 0-15 elements for each thread (still aligned). This prevents GPU coalescing and gives a more realistic view of performance for the common (non-linear) memory access case. This benchmark is as cache efficient as the previous one. All data still comes from the L1 cache.
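
Under the same assumptions as the previous sketch, the random variant differs only by a per-thread start offset (here read from a hypothetical offset buffer filled by the CPU):

  Buffer<float> sourceBuffer;
  Buffer<uint> randomOffsets;   // hypothetical per-thread start offsets in [0, 15], filled by the CPU
  RWBuffer<float> outputBuffer;

  [numthreads(256, 1, 1)]
  void randomLoads(uint gi : SV_GroupIndex)
  {
      uint startOffset = randomOffsets[gi];   // 0-15 elements, still element-aligned
      float sum = 0.0f;
      [loop]
      for (uint i = 0; i < 256; ++i)
      {
          // The per-thread start offset breaks address contiguity inside the
          // warp/wave, defeating coalescing, while all data stays L1 resident.
          sum += sourceBuffer[(i * 256 + gi + startOffset) & 0xFFF];
      }
      outputBuffer[gi] = sum;   // simplified; see the Notes sketch for the masked write
  }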

Uniform loads: All threads in a group simultaneously load from the same address. This triggers the coalesced path on some GPUs and additional optimizations on others, such as scalar loads (SGPR storage) on AMD GCN. I have noticed that recent Intel and Nvidia drivers also implement a software optimization for the uniform load loop case (which is employed by this benchmark).
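
A sketch of the uniform variant under the same assumptions; every thread reads the same address each iteration, with the base address coming from a cbuffer so it is not a compile-time constant:

  cbuffer Constants
  {
      uint baseAddress;   // runtime value, so the addresses are not compile-time constants
  };

  Buffer<float> sourceBuffer;
  RWBuffer<float> outputBuffer;

  [numthreads(256, 1, 1)]
  void uniformLoads(uint gi : SV_GroupIndex)
  {
      float sum = 0.0f;
      [loop]
      for (uint i = 0; i < 256; ++i)
      {
          // Every thread in the group reads the same element each iteration.
          // AMD GCN can turn this into scalar (SGPR) loads for raw buffers;
          // Intel and recent Nvidia drivers reportedly use wave shuffle tricks.
          sum += sourceBuffer[baseAddress + i];
      }
      outputBuffer[gi] = sum;
  }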

Notes: Compiler optimizations can ruin the results. We want to measure only load (read) performance, but a write (store) is also needed, otherwise the compiler would just optimize the whole shader away. To avoid this, each thread first does 256 loads, followed by a single linear groupshared memory write (no bank conflicts). The cbuffer contains a write mask (not known at compile time). It controls which elements are written from groupshared memory to the output buffer. The mask is always zero at runtime. Compilers can also combine multiple narrow raw buffer loads together (as bigger 4d loads) if it can be proven at compile time that loads from the same thread access contiguous offsets. This is prevented by applying an address mask from the cbuffer (not known at compile time).
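
Putting those countermeasures together, a skeleton of a test shader could look roughly like this (again a hypothetical sketch with invented names, not the actual shader code):

  cbuffer Constants
  {
      uint elementMask;   // applied to load addresses; unknown at compile time
      uint writeMask;     // always zero at runtime; unknown at compile time
  };

  Buffer<float> sourceBuffer;
  RWBuffer<float> outputBuffer;

  groupshared float sharedMem[256];

  [numthreads(256, 1, 1)]
  void loadTest(uint gi : SV_GroupIndex)
  {
      float sum = 0.0f;
      [loop]
      for (uint i = 0; i < 256; ++i)
      {
          // The cbuffer-provided mask stops the compiler from proving that
          // narrow loads are contiguous and merging them into wider loads.
          sum += sourceBuffer[(i * 256 + gi) & elementMask];
      }

      // A single linear groupshared write (no bank conflicts) consumes the sum,
      // so the 256 loads cannot be optimized away.
      sharedMem[gi] = sum;
      GroupMemoryBarrierWithGroupSync();

      // The write mask is zero at runtime, so nothing is actually written to the
      // output buffer, but the compiler has to assume it might be.
      if (gi < writeMask)
      {
          outputBuffer[gi] = sharedMem[gi];
      }
  }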

Uniform Load Investigation

When I first implemented this benchmark, I noticed that Intel uniform address loads were surprisingly fast. Intel ISA documents don't mention anything about a scalar unit or another hardware feature that would make uniform address loads fast. This optimization affected every single resource type, unlike AMD's hardware scalar unit (which only works for raw data loads). However, I didn't investigate this further at that point. When Nvidia released Volta GPUs, they shipped a new driver that implemented a similar compiler optimization. Later drivers introduced the same optimization to Maxwell and Pascal too. And now Turing also has it. It's certainly not hardware based, since 20x+ gains apply to all their existing GPUs too.

On the Nov 10-11 weekend (2018) I was toying around with Vulkan/DX12 wave intrinsics, and came up with a crazy idea: use a single wave-wide load and then use wave intrinsics to broadcast the scalar result (a single lane) to each loop iteration. This reduces the number of loads by up to a factor of the wave width.
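
The trick can be sketched in Shader Model 6.0 HLSL roughly as follows (a simplified illustration with invented names; see the gist linked below for the actual code):

  Buffer<float> sourceBuffer;
  RWBuffer<float> outputBuffer;

  [numthreads(256, 1, 1)]
  void uniformLoadsWaveOptimized(uint gi : SV_GroupIndex)
  {
      uint laneCount = WaveGetLaneCount();   // e.g. 32 on Nvidia, 64 on AMD GCN
      float sum = 0.0f;
      [loop]
      for (uint i = 0; i < 256; i += laneCount)
      {
          // One wave-wide vector load covers laneCount uniform-address iterations.
          float value = sourceBuffer[i + WaveGetLaneIndex()];
          [loop]
          for (uint lane = 0; lane < laneCount; ++lane)
          {
              // Broadcast lane 'lane' of the wave-wide load to all lanes.
              sum += WaveReadLaneAt(value, lane);
          }
      }
      outputBuffer[gi] = sum;
  }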

See the gist and Shader Playground links here: https://gist.github.com/sebbbi/ba4415339b535d22fb18e2d824564ec4

In Nvidia's uniform load optimization case, their wave width is 32, and their uniform load optimization gives a performance boost of up to 28x. This finding really made me curious. Could Nvidia have implemented a similar warp shuffle based optimization for this use case? The funny thing is that my tweets escalated the situation and made Intel reveal their hand:

https://twitter.com/JoshuaBarczak/status/1062060067334189056

Intel has now officially revealed that their driver does a wave shuffle optimization for uniform address loads. They have been doing it for years already. This explains Intel GPU benchmark results perfectly. Now that we have confirmation of Intel's (original) optimization, I suspect that Nvidia's shader compiler employs a highly similar optimization in this case. Both optimizations are great, because Nvidia/Intel do not have a dedicated scalar unit. They need to lean more on vector loads, and this trick allows sharing one vector load with multiple uniform address load loop iterations.

Results

All results are compared to Buffer<RGBA8>.Load random result (=1.0x) on the same GPU.

AMD GCN2 (R9 390X)

Buffer<R8>.Load uniform: 11.302ms 3.907x
Buffer<R8>.Load linear: 11.327ms 3.899x
Buffer<R8>.Load random: 44.150ms 1.000x
Buffer<RG8>.Load uniform: 49.611ms 0.890x
Buffer<RG8>.Load linear: 49.835ms 0.886x
Buffer<RG8>.Load random: 49.615ms 0.890x
Buffer<RGBA8>.Load uniform: 44.149ms 1.000x
Buffer<RGBA8>.Load linear: 44.806ms 0.986x
Buffer<RGBA8>.Load random: 44.164ms 1.000x
Buffer<R16f>.Load uniform: 11.131ms 3.968x
Buffer<R16f>.Load linear: 11.139ms 3.965x
Buffer<R16f>.Load random: 44.076ms 1.002x
Buffer<RG16f>.Load uniform: 49.552ms 0.891x
Buffer<RG16f>.Load linear: 49.560ms 0.891x
Buffer<RG16f>.Load random: 49.559ms 0.891x
Buffer<RGBA16f>.Load uniform: 44.066ms 1.002x
Buffer<RGBA16f>.Load linear: 44.687ms 0.988x
Buffer<RGBA16f>.Load random: 44.066ms 1.002x
Buffer<R32f>.Load uniform: 11.132ms 3.967x
Buffer<R32f>.Load linear: 11.139ms 3.965x
Buffer<R32f>.Load random: 44.071ms 1.002x
Buffer<RG32f>.Load uniform: 49.558ms 0.891x
Buffer<RG32f>.Load linear: 49.560ms 0.891x
Buffer<RG32f>.Load random: 49.559ms 0.891x
Buffer<RGBA32f>.Load uniform: 44.061ms 1.002x
Buffer<RGBA32f>.Load linear: 44.613ms 0.990x
Buffer<RGBA32f>.Load random: 49.583ms 0.891x
ByteAddressBuffer.Load uniform: 10.322ms 4.278x
ByteAddressBuffer.Load linear: 11.546ms 3.825x
ByteAddressBuffer.Load random: 44.153ms 1.000x
ByteAddressBuffer.Load2 uniform: 11.499ms 3.841x
ByteAddressBuffer.Load2 linear: 49.628ms 0.890x
ByteAddressBuffer.Load2 random: 49.651ms 0.889x
ByteAddressBuffer.Load3 uniform: 16.985ms 2.600x
ByteAddressBuffer.Load3 linear: 44.142ms 1.000x
ByteAddressBuffer.Load3 random: 88.176ms 0.501x
ByteAddressBuffer.Load4 uniform: 22.472ms 1.965x
ByteAddressBuffer.Load4 linear: 44.212ms 0.999x
ByteAddressBuffer.Load4 random: 49.346ms 0.895x
ByteAddressBuffer.Load2 unaligned uniform: 11.422ms 3.867x
ByteAddressBuffer.Load2 unaligned linear: 49.552ms 0.891x
ByteAddressBuffer.Load2 unaligned random: 49.561ms 0.891x
ByteAddressBuffer.Load4 unaligned uniform: 22.373ms 1.974x
ByteAddressBuffer.Load4 unaligned linear: 44.095ms 1.002x
ByteAddressBuffer.Load4 unaligned random: 54.464ms 0.811x
StructuredBuffer<float>.Load uniform: 12.585ms 3.509x
StructuredBuffer<float>.Load linear: 11.770ms 3.752x
StructuredBuffer<float>.Load random: 44.176ms 1.000x
StructuredBuffer<float2>.Load uniform: 13.210ms 3.343x
StructuredBuffer<float2>.Load linear: 50.217ms 0.879x
StructuredBuffer<float2>.Load random: 49.645ms 0.890x
StructuredBuffer<float4>.Load uniform: 13.818ms 3.196x
StructuredBuffer<float4>.Load random: 49.666ms 0.889x
StructuredBuffer<float4>.Load linear: 44.721ms 0.988x
cbuffer{float4} load uniform: 16.702ms 2.644x
cbuffer{float4} load linear: 44.447ms 0.994x
cbuffer{float4} load random: 49.656ms 0.889x
Texture2D<R8>.Load uniform: 44.214ms 0.999x
Texture2D<R8>.Load linear: 44.795ms 0.986x
Texture2D<R8>.Load random: 44.808ms 0.986x
Texture2D<RG8>.Load uniform: 49.706ms 0.888x
Texture2D<RG8>.Load linear: 50.231ms 0.879x
Texture2D<RG8>.Load random: 50.200ms 0.880x
Texture2D<RGBA8>.Load uniform: 44.760ms 0.987x
Texture2D<RGBA8>.Load linear: 45.339ms 0.974x
Texture2D<RGBA8>.Load random: 45.405ms 0.973x
Texture2D<R16F>.Load uniform: 44.175ms 1.000x
Texture2D<R16F>.Load linear: 44.157ms 1.000x
Texture2D<R16F>.Load random: 44.096ms 1.002x
Texture2D<RG16F>.Load uniform: 49.739ms 0.888x
Texture2D<RG16F>.Load linear: 49.661ms 0.889x
Texture2D<RG16F>.Load random: 49.622ms 0.890x
Texture2D<RGBA16F>.Load uniform: 44.257ms 0.998x
Texture2D<RGBA16F>.Load linear: 44.267ms 0.998x
Texture2D<RGBA16F>.Load random: 88.126ms 0.501x
Texture2D<R32F>.Load uniform: 44.259ms 0.998x
Texture2D<R32F>.Load linear: 44.193ms 0.999x
Texture2D<R32F>.Load random: 44.099ms 1.001x
Texture2D<RG32F>.Load uniform: 49.739ms 0.888x
Texture2D<RG32F>.Load linear: 49.667ms 0.889x
Texture2D<RG32F>.Load random: 88.110ms 0.501x
Texture2D<RGBA32F>.Load uniform: 44.288ms 0.997x
Texture2D<RGBA32F>.Load linear: 66.145ms 0.668x
Texture2D<RGBA32F>.Load random: 88.124ms 0.501x

AMD GCN2 was a very popular architecture. The first card using this architecture was the Radeon 7790. Many Radeon 200 and 300 series cards also use this architecture. Both the Xbox One and PS4 (base model) GPUs are based on the GCN2 architecture, making it an important optimization target.
