AMD RDNA4 Architecture: Complete Overview and Analysis
AMD’s next-generation GPUs are nearly here, built on the RDNA4 architecture. You’ve probably seen our previous video about what AMD needs to do to compete with NVIDIA, but today we’re focusing specifically on what RDNA4 brings to the table technically. We’ll be covering everything from architectural improvements to performance numbers and software features, and we’ll even make some educated price predictions.
Just the other day, we were invited to a briefing session with AMD covering everything we’re allowed to share with you today. Looking at AMD’s slides, we can see a clear progression from RDNA2 through RDNA3 and now to RDNA4. Each generation has brought significant improvements, but RDNA4 represents perhaps the most substantial architectural overhaul since the original RDNA lineup.

The most fundamental change appears to be in how the architecture handles resource allocation. RDNA3 used a static register allocation system, where registers were reserved for specific workloads even when idle. RDNA4 introduces dynamic register allocation, meaning those previously idle registers can now be repurposed on the fly. This allows much more efficient utilisation of the silicon and helps explain some of the performance gains we’re seeing.

Another major advancement is on the ray tracing hardware side.
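To see why dynamic register allocation matters, consider how many waves a SIMD can keep in flight. The sketch below is purely illustrative: the register-file size and per-wave figures are hypothetical, not AMD’s real numbers, but it shows the occupancy logic at work.

```python
# Hypothetical sketch of why dynamic register allocation raises occupancy.
# The register-file size and per-wave register counts below are illustrative
# placeholders, not real RDNA3/RDNA4 figures.

REGISTER_FILE = 1536  # vector registers per SIMD (illustrative)

def static_occupancy(peak_regs_per_wave):
    # Static allocation: every wave reserves its worst-case register count
    # for its whole lifetime, even while most of those registers sit idle.
    return REGISTER_FILE // peak_regs_per_wave

def dynamic_occupancy(avg_regs_per_wave):
    # Dynamic allocation: registers are handed out as each phase of the
    # shader needs them, so sizing tracks average demand rather than peak.
    return REGISTER_FILE // avg_regs_per_wave

# A shader that briefly spikes to 128 registers but averages 64:
print(static_occupancy(128))   # -> 12 waves in flight
print(dynamic_occupancy(64))   # -> 24 waves in flight
```

More waves in flight means more latency hiding per compute unit, which is where the “more efficient utilisation of the silicon” claim comes from.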
AMD has a sophisticated new implementation of oriented bounding box optimisation, or OBB for short. For those unfamiliar, this technique dramatically reduces the number of ray intersection tests needed when tracing complex scenes. The traversal heatmaps in the slides show significantly fewer ray tests being performed for the same scene, which translates to much better ray tracing performance.
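The core trick behind an OBB test is that a box rotated to hug its geometry rejects far more rays than a loose axis-aligned box. AMD hasn’t published its hardware implementation, but the standard software formulation is a slab test performed in the box’s local frame, sketched here for illustration:

```python
import math

def ray_hits_obb(origin, direction, center, half_extents, axes):
    """Classic ray/OBB slab test (an illustrative sketch, not AMD's hardware).

    `axes` holds the box's three orthonormal axis vectors; projecting the
    ray onto them reduces the OBB test to an ordinary AABB slab test in
    the box's own coordinate frame.
    """
    t_min, t_max = 0.0, math.inf
    for axis, h in zip(axes, half_extents):
        # Project the ray origin (relative to the box centre) and the ray
        # direction onto this box axis.
        o = sum(a * (p - c) for a, p, c in zip(axis, origin, center))
        d = sum(a * v for a, v in zip(axis, direction))
        if abs(d) < 1e-9:
            if abs(o) > h:        # ray parallel to this slab and outside it
                return False
            continue
        t0, t1 = (-h - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_min, t_max = max(t_min, t0), min(t_max, t1)
        if t_min > t_max:         # slab intervals no longer overlap
            return False
    return True

# A unit box rotated 45 degrees about Z: a tightly fitted OBB like this
# culls rays that a loose axis-aligned box around the same geometry would
# pass on to expensive triangle tests.
c = math.sqrt(2) / 2
rotated_axes = ((c, c, 0.0), (-c, c, 0.0), (0.0, 0.0, 1.0))
print(ray_hits_obb((-5, 0, 0), (1, 0, 0), (0, 0, 0), (1, 1, 1), rotated_axes))
print(ray_hits_obb((-5, 5, 0), (1, 0, 0), (0, 0, 0), (1, 1, 1), rotated_axes))
```

Each ray that fails this cheap test skips every intersection test beneath that node of the BVH, which is exactly the reduction the traversal heatmaps visualise.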
The compute architecture itself has seen substantial improvements. The biggest gains appear to be in specialized compute operations, with half-precision operations seeing a 2x improvement over RDNA3, while INT8, FP8, and the newly supported BF8 operations show an impressive 4x improvement.
This is particularly significant for AI workloads, which often rely heavily on these lower-precision operations. Digging deeper into the compute enhancements, AMD has maintained the same FP32 (standard single-precision) throughput at 256 operations per compute unit and the same FP64 (double-precision) throughput at 4 operations per CU. For the AI-relevant formats, however, the improvements are substantial: FP16 and BF16 have increased from 512 to between 1024 and 2048 operations per CU, depending on the workload. Perhaps even more impressive is the new support for the FP8 and BF8 formats, which achieve 2048 to 4096 operations per CU. INT8 throughput has improved dramatically from 512 to between 2048 and 4096 operations per CU, while INT4 has jumped from 1024 to between 4096 and 8192 operations per CU.
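The generation-on-generation gains above are easier to see side by side. This sketch tabulates the per-CU figures quoted in the briefing; we use the lower end of each quoted range for the like-for-like comparison, on the assumption that the upper figures involve structured sparsity:

```python
# Per-CU operation rates quoted for RDNA3 -> RDNA4. Where a range was
# quoted, the lower (assumed dense) figure is used; the assumption that
# the upper end reflects sparsity acceleration is ours, not AMD's.
DENSE_OPS_PER_CU = {
    # format: (RDNA3, RDNA4)
    "FP32": (256, 256),
    "FP64": (4, 4),
    "FP16": (512, 1024),
    "BF16": (512, 1024),
    "INT8": (512, 2048),
    "INT4": (1024, 4096),
    # FP8/BF8 (2048-4096 ops/CU) are new in RDNA4, so there is no
    # RDNA3 baseline to compare against.
}

def gen_on_gen_speedup(fmt):
    old, new = DENSE_OPS_PER_CU[fmt]
    return new / old

for fmt in DENSE_OPS_PER_CU:
    print(f"{fmt}: {gen_on_gen_speedup(fmt):.0f}x")
```

Run this and the 2x half-precision and 4x INT8 uplifts quoted earlier fall straight out of the dense figures, while FP32 and FP64 stay at 1x.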